[Hubble-446] Implement mechanism to only fail transient dbt task errors #419
Conversation
Changes look good. One small piece of feedback and some questions for my own understanding, but none of it blocks the release.
retry_delay = 30
log_contents = []

for attempt in range(max_retries):
During testing, did you find that the log file sometimes didn't exist?
I did, but that was mostly my fault. I first developed this script to run as an on_execute_callback, so I had to look for the log file related to ti.try_number. After migrating to on_retry_callback, I failed to realize that the context was already updated by the time the callback was called, so the try_number was one step ahead. After fixing the reference I kept it mostly as a fail-safe, but it doesn't seem like it's necessary.
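For illustration, a minimal sketch of the log lookup being discussed. The helper name and the log path layout are assumptions, not the repository's actual code; real deployments would derive the path from the task log handler configuration.

import logging
import time


def read_failed_try_log(context, max_retries=3, retry_delay=30):
    """Read the log of the attempt that just failed.

    In on_retry_callback the context is already updated, so
    ti.try_number points one step past the failed attempt.
    """
    ti = context["ti"]
    failed_try = ti.try_number - 1
    # Assumed path layout; adjust to the configured base log folder.
    log_path = f"logs/{ti.dag_id}/{ti.task_id}/{failed_try}.log"
    log_contents = []
    for attempt in range(max_retries):
        try:
            with open(log_path) as log_file:
                log_contents = log_file.readlines()
            break
        except FileNotFoundError:
            # Fail-safe in case the log file has not been flushed yet.
            logging.info("Log file not found, retrying in %s seconds", retry_delay)
            time.sleep(retry_delay)
    return log_contents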
It's fine to keep as a fail-safe. I was just curious whether Airflow was flaky with its logging.
ti = context["ti"]
logging.info(
    f"Set task instance {ti} state to \
    {TaskInstanceState.SKIPPED} to skip retrying"
)
To confirm, elementary will still surface this task failure as an alert since dbt will log a FAILURE, correct?
Yes, this logic only runs after the first failure of the task in Airflow, whether due to dbt errors or anything else. The elementary on-run-end hook has already finished running by the time this is called, so the alerts DAG should catch the warning/failure results from the elementary dataset.
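For reference, a sketch of how a callback can stop further retries, consistent with the snippet quoted above; the function name is hypothetical:

import logging

from airflow.utils.state import TaskInstanceState


def mark_task_skipped(context):
    """Move the task instance to SKIPPED so Airflow stops retrying it."""
    ti = context["ti"]
    ti.set_state(TaskInstanceState.SKIPPED)
    logging.info(
        f"Set task instance {ti} state to {TaskInstanceState.SKIPPED} to skip retrying"
    )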
dags/stellar_etl_airflow/utils.py
)
return False
# Check for transient errors in case dbt pipeline didn't finish
for line in log_contents:
nit: This function loops through the log_contents twice. Do you think it's possible to move this logic up into the first for loop over log_contents? Not a big deal if not; these files are small.
It makes sense as you suggested. I just committed a change for this.
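A rough sketch of the single-pass version (the pattern list here is a placeholder; the real patterns live in the dbt_transient_errors_patterns Airflow Variable, and the dbt summary-line check is an assumption):

import re

# Placeholder patterns; the real list comes from the
# dbt_transient_errors_patterns Airflow Variable.
TRANSIENT_PATTERNS = ["connection reset", "operation timed out"]


def dbt_error_is_transient(log_contents):
    """Scan the log once, checking completion and transient patterns together."""
    for line in log_contents:
        if "Done. PASS=" in line:
            # dbt finished its run, so the failure came from a test or bad
            # data rather than a transient infrastructure error.
            return False
        if any(re.search(pattern, line) for pattern in TRANSIENT_PATTERNS):
            return True
    return False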
…rs (#419) (#426)

* Added skip_retry_callback function and related logic
* Rename ci.yml to ci-cd-dev.yml
* Update and rename release.yml to ci-cd-prod.yml
* Update ci-cd-dev.yml
* Improved logic for transient errors search
* Added dbt transient errors patterns variable
* Added skip_retry_dbt_errors to on_retry_callback for dbt tasks
* Update README
* Expanded scope of error search
* Improved logic for more straightforward approach

---------

Co-authored-by: Eduardo Alves <[email protected]>
Co-authored-by: Laysa Bitencourt <[email protected]>
What
This PR develops a mechanism that controls the flow of Airflow task instances in order to avoid unnecessary task retries.
Key changes made to the code:

* Added the skip_retry_dbt_errors function to avoid retrying non-transient dbt errors
* Set the skip_retry_dbt_errors function as on_retry_callback for dbt tasks
* Added the dbt_transient_errors_patterns variable to include the string patterns for known dbt transient errors

The approach followed for this development can be visualized in the flowchart included in the PR description.
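As an illustration of the wiring described above (the import path and defaults are assumptions based on the file touched in this PR):

from airflow.models import Variable

from stellar_etl_airflow.utils import skip_retry_dbt_errors  # path assumed

# String patterns for known transient dbt errors, kept in an Airflow
# Variable so they can be updated without redeploying DAGs.
transient_patterns = Variable.get(
    "dbt_transient_errors_patterns", deserialize_json=True, default_var=[]
)

# Attach the callback to dbt tasks: on each retry it inspects the failed
# attempt's log and skips further retries for non-transient errors.
dbt_task_args = {
    "on_retry_callback": skip_retry_dbt_errors,
    "retries": 3,
}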
Why
With the current dbt task setup, if a dbt test fails, Airflow will still retry the task many times. Without manual intervention, retries will not change the outcome of dbt tests, so we should not retry the task multiple times. Instead, we would want the dbt Airflow task to exhibit the following behavior:
Transient errors: in the case of errors caused by Airflow infrastructure or transient issues, the task should continue to retry as originally designed, because we expect a different outcome for the task status with the retry.
Data quality errors: if an error is triggered by bad data or a failed dbt test, the task should fail immediately if configured to fail. If the test is configured to warn, the task should continue successfully.
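Combining the two cases, the callback's decision reduces to something like the sketch below, reusing the helpers sketched earlier in this thread; the names are assumptions, not the repo's exact code.

from airflow.utils.state import TaskInstanceState


def skip_retry_dbt_errors(context):
    """on_retry_callback: let transient errors retry, skip everything else."""
    log_contents = read_failed_try_log(context)  # sketched above
    if not dbt_error_is_transient(log_contents):  # sketched above
        # Data quality or test failure: retrying will not change the outcome.
        context["ti"].set_state(TaskInstanceState.SKIPPED)
    # Otherwise do nothing and let Airflow retry as configured.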
Known limitations