-
Notifications
You must be signed in to change notification settings - Fork 106
Submission Loader
This page explains how the "submission loader" ingests DABS data from Broker on a daily basis.
Bulk of the code is in usaspending_api.etl.management.commands.load_submission
The submission loader runs nightly, extracting published (used to be "certified") DABS from Broker using SQL
and loading that data into USAspending tables by saving Django ORM objects.
Submission data are for the most part copied over verbatim to USAspending database tables from Broker database tables, which store valid File A/B/C data. Submissions are loaded into USAspending nightly, after dependent data changes are loaded, including the Broker award data as well as reference data that the incoming data may refer to. Submissions that are found to already be loaded are deleted in USAspending before they are loaded.
ETL Source -> Destination
Data | Broker Source | USAspending Destination |
---|---|---|
Submission Metadata | certify_history |
submission_attributes |
Submission | submission |
submission_attributes |
File A | certified_appropriation |
appropriation_account_balances (FYTD TAS balances for budget/obligations/outlays) |
File B | certified_object_class_program_activity |
financial_accounts_by_program_activity_object_class (FYTD TAS+OC+PA balances for obligations/outlays) |
File C | certified_award_financial |
financial_accounts_by_awards (TAS+PA+OC+Award obligations/outlays) |
The submission loader uses Python code, structured in a custom Django management command using manage.py
that effectively makes it an executable python script. The loader is run as a step in the USAspending "DATA Act Production Nightly Load" data pipeline, executed by Jenkins. Its step is executed after all other data from Broker is loaded, such as award data, subaward data, and various other reference data sets are updated.
Jenkins directly calls the load_multiple_submissions
Django management command with HARDCODED fiscal years and quarters that must be updated annually. load_multiple_submissions
is then responsible for calling load_submission
once per new/updated submission.
The major steps to extract data from Broker are performed in SQL, using the Python psycopg2
library and a connection to the Broker Postgres databases. This data is turned into python dictionaries which are loaded row-by-row into the USAspending database by instantiating Django model objects from the Broker data dicts, and performing save operations using the Django ORM.
The major step to load the data from Broker to USAspending is run under a @transaction.atomic
decorator, which should allow rollbacks of the entire submission load if any errors occur.
File A, File B, and File C make reference to reference data which should exist, and in some cases must exist (for DB foreign keys), in USAspending before their records can be saved. This helps in referential integrity and data consistency across Broker and USAspending.
-
Award data:
- Referenced By: File C
- During the nightly pipeline execution, before submissions are loaded, transactions are loaded from Broker's
detached_award_procurement
table andpublished_award_financial_assistance
table. From them a unique award identifier is derived and then an award is created if one does not yet exist for this key. Creation of theaward
record allows for establishing a linkage of File C data to anaward
record.
-
TAS
- Referenced By: File A, File B, File C
- records should be loaded into USAspending to allow FK linkages from File A, File B, and File C records. At this moment those records are skipped if such a linkage cannot be made. These are refreshed using
load_tas
from Broker in a nightly pipeline step that runs before the submission loader.
-
Object Class
- Referenced By: File B, File C
- records must be loaded into USAspending so FK linkages from File B and File C records to them can be made, otherwise the submission load will error-out There are only 100-200 of them and change rarely. See Data Exchanges docs for more detail on how they're loaded.
-
Program Activity
- Referenced By: File B, File C
- records are loaded into USAspending on-the-fly as they are seen in File B or File C.
-
DEF Codes
- Referenced By: File B, File C
- records are new, and their load is TBD.
This section covers any data in USAspending that is (re)calculated from the raw Broker data loaded, or otherwise summarized (rolled-up sum, average, etc.).
NOTE: No numeric values are pre-aggregated during the load. This is left to query-time of the API.
This is a boolean
field in the File A (appropriation_account_balances
) and File B (financial_accounts_by_program_activity_object_class
) loaded USAspending tables that exists on each row with a default value of false
.
- It is set to
true
inappropriation_account_balances
when the record is the last seen reporting of a TAS value for a Fiscal Year. - It is set to
true
infinancial_accounts_by_program_activity_object_class
when the record is the last seen reporting of a TAS+OC+PA (and soon to be +DEFC) value for a Fiscal Year.
This value is reset on all records per table per loaded submission.
- It is reset on all File A records after all File A records in the single submission are loaded
- It is reset on all File B records after all File B records in the single submission are loaded
This helps with queries that need to aggregate totals across dimensions (TAS / TAS+OC+PA) for a full fiscal year, or multiple fiscal years, but need to account for some of those records not being reported each quarter of the FY.
final_of_fy
is queried, and it is being investigated
During the ingestion of File C, the submission loader attempts to link DABS File C records to an Award record in the USAspending award data, according to the C-to-D Linkage. This allows displays or queries of information about account data or award data to benefit from further information provided in its linked domain area.
load_submission.py
Python code of the submission loader by setting the created financial_accounts_by_awards.award_id
field in the load_file_c(...)
function to be the result of the find_matching_award(...)
function.
Anything “disconnected” from the process as a post-step that must be remembered to be called separately, or occasionally (every month/quarter … every time X data load runs or Y file is ingested… etc.)?
Reference Data Loads
- Object Classes should be loaded at least once before running this loader, and ideally every time there is a major change to them - see Dependent Data above
- TAS records should be loaded from Broker every time before running this loader - see Dependent Data above and note that these are loaded as a prior step in the Jenkins pipeline
Command Parameterization
- FY+Period parameterized runs need to be added to the Jenkins execution script each new FY. The calls to
load_multiple_submission
are generated by a loop to auto-generate command line invocations.- The range of this loop needs to be expanded periodically
Data Changes
- If transactions or
award
records are modified as part of a production "data change", the C-to-D linkage management command and/or SQL code may need to be re-run.
- General Python errors (
KeyError
,AttributeError
onNone
objects, etc.) -
ObjectClass.DoesNotExist
(subclass ofdjango.core.exceptions.ObjectDoesNotExist
) if a referenced Object Class is not found -
django.core.management.base.CommandError
if submission with given ID not found, or multiple found in Broker
See docs on the following management commands that do the work of the submission loader to read through more detailed steps and execution options/args
- Home
- The DATA Act
-
Data
- Account Data (A/B/C)
-
Award Data (D1/D2)
- Contract Data (D1)
- Assistance Data (D2)
- Recipient Data (E)
- Subaward Data (F)
- File C to D Linkage
- Agency Data (Federal Hierarchy)
- Location Data
- Management Commands
- Release Notes
- Resources