
[Hubble-446] Implement mechanism to only fail transient dbt task errors #419

Merged
edualvess merged 12 commits into master from feature/add_check_for_dbt_transient_errors on Jul 11, 2024

Conversation

@edualvess (Contributor) commented Jul 8, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the Jira ticket associated with the PR.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated the README with the added features, breaking changes, and new instructions on how to use the repository.

What

This PR addresses the development of a mechanism to control the flow of Airflow task instances and avoid unnecessary task retries.
Key changes made to the code:

  • Developed skip_retry_dbt_errors function to avoid retrying non-transient dbt errors
  • Added skip_retry_dbt_errors function as on_retry_callback for dbt tasks
  • Added dbt_transient_errors_patterns variable to include the string patterns for known dbt transient errors

The approach followed for this development can be visualized in the flowchart:
[Flowchart: transient error handling flow]
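A minimal sketch of what a callback following this flow could look like, assuming the dbt_transient_errors_patterns Variable is stored as a JSON list, a hypothetical _read_failed_attempt_log helper, and ti.set_state as one plausible way of marking the task instance as skipped; this illustrates the described flow, not the exact implementation merged in this PR:

```python
import logging

from airflow.models import Variable
from airflow.utils.state import TaskInstanceState


def _read_failed_attempt_log(ti) -> list[str]:
    # Hypothetical stand-in: the merged code reads the failed attempt's log
    # file from disk (with a small fail-safe retry loop, discussed in the review below).
    return []


def skip_retry_dbt_errors(context) -> None:
    """Illustrative on_retry_callback: stop retrying non-transient dbt errors."""
    # Airflow Variable listing substrings of known transient errors (assumed JSON).
    transient_patterns = Variable.get(
        "dbt_transient_errors_patterns", deserialize_json=True, default_var=[]
    )

    ti = context["ti"]
    log_contents = _read_failed_attempt_log(ti)

    # If a known transient pattern appears in the log, let Airflow retry as configured.
    for line in log_contents:
        if any(pattern in line for pattern in transient_patterns):
            logging.info("Transient dbt error detected, task %s will be retried", ti)
            return

    # Otherwise the failure is non-transient (e.g. a failed dbt test): mark the
    # task instance as skipped so no further retries are attempted.
    logging.info(
        "Set task instance %s state to %s to skip retrying", ti, TaskInstanceState.SKIPPED
    )
    ti.set_state(TaskInstanceState.SKIPPED)
```

Such a function would then be wired into the dbt tasks' on_retry_callback, as listed in the key changes above.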

Why

With the current dbt task setup, if a dbt test fails, Airflow will still retry the task multiple times. Without manual intervention, the retries will not produce a different outcome for dbt tests, so we should not retry the task. Instead, we would like the dbt Airflow task to exhibit the following behavior:

Transient errors: for errors caused by Airflow infrastructure or other transient issues, the task should continue to retry as originally designed, because we expect a retry to produce a different task outcome.

Data quality errors: if an error is triggered by bad data or a failed dbt test, the task should fail immediately if the test is configured to fail. If the test is configured to warn, the task should continue successfully.
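For illustration, the dbt_transient_errors_patterns Variable could hold a JSON list of substrings like the following; these example patterns are hypothetical and not the project's actual list:

```python
# Hypothetical contents of the dbt_transient_errors_patterns Airflow Variable.
# Any log line containing one of these substrings is treated as a transient
# error, so the task is left to retry as originally configured.
dbt_transient_errors_patterns = [
    "Operation timed out",           # e.g. warehouse or network timeouts
    "could not connect to server",   # e.g. intermittent connectivity issues
]
```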

Known limitations

@edualvess edualvess marked this pull request as ready for review July 9, 2024 21:38
@edualvess edualvess requested a review from a team as a code owner July 9, 2024 21:38
@sydneynotthecity (Contributor) left a comment

Changes look good. One small piece of feedback and some questions for my own understanding, but none of it blocks the release.

retry_delay = 30
log_contents = []

for attempt in range(max_retries):
Contributor

During testing, did you find that the log file sometimes didn't exist?

Contributor Author

I did, but that was mostly my fault. I initially developed this script to run as on_execute_callback, so I had to look for the log file related to ti.try_number. After migrating to on_retry_callback I failed to realize that the context had already been updated by the time the callback was called, so try_number was one step ahead. After fixing the reference I kept the loop mostly as a fail-safe, but it doesn't seem to be necessary.

@sydneynotthecity (Contributor) commented Jul 12, 2024

It's fine to keep as a fail-safe; I was just curious whether Airflow was flaky with its logging.
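For context, a minimal sketch of the kind of fail-safe loop discussed in this thread, assuming a hypothetical log-path layout and helper name; the try_number offset reflects the point above about on_retry_callback running with an already-incremented try_number:

```python
import os
import time


def _log_file_path(ti, try_number: int) -> str:
    # Hypothetical path layout; the real location depends on the Airflow
    # logging configuration of the deployment.
    return f"/opt/airflow/logs/{ti.dag_id}/{ti.task_id}/{try_number}.log"


def read_failed_attempt_log(ti, max_retries: int = 3, retry_delay: int = 30) -> list[str]:
    """Wait for the failed attempt's log file to appear, then return its lines."""
    # on_retry_callback fires after try_number has been incremented, so the
    # attempt that just failed corresponds to try_number - 1.
    log_path = _log_file_path(ti, ti.try_number - 1)
    for _ in range(max_retries):
        if os.path.exists(log_path):
            with open(log_path) as log_file:
                return log_file.readlines()
        # Fail-safe: the file may not have been flushed yet; wait and retry.
        time.sleep(retry_delay)
    return []
```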

ti = context["ti"]
logging.info(
f"Set task instance {ti} state to \
{TaskInstanceState.SKIPPED} to skip retrying"
Contributor

To confirm, elementary will still surface this task failure as an alert since dbt will log a FAILURE, correct?

Contributor Author

Yes, this logic only runs after the first failure of the task in Airflow, whether it is due to dbt errors or anything else. The elementary on-run-end hook has already finished running by the time this is called, so the alerts DAG should catch the warning/failure results from the elementary dataset.

)
return False
# Check for transient errors in case dbt pipeline didn't finish
for line in log_contents:
Contributor

nit: This function loops through the log_contents twice. Do you think it's possible to move this logic up into the first for loop for log_contents? Not a big deal if not, these files are small.

Contributor Author

That makes a lot of sense, as you suggested. I just committed a change for this.
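A rough sketch of the single-pass approach suggested here, using "Done. PASS=" as an example marker that dbt completed its run; it only illustrates the shape of the merged loop, not its exact conditions:

```python
def analyze_dbt_task_log(log_contents: list[str], transient_patterns: list[str]) -> tuple[bool, bool]:
    """Scan the task log once, as suggested in the review.

    Returns (dbt_run_finished, transient_error_found) so the callback can decide
    whether to skip further retries.
    """
    dbt_run_finished = False
    transient_error_found = False
    for line in log_contents:
        # Example completion marker: if dbt finished its run, the failure came
        # from dbt itself (e.g. a failed test) rather than from infrastructure.
        if "Done. PASS=" in line:
            dbt_run_finished = True
        # Check for known transient error patterns in the same pass.
        if any(pattern in line for pattern in transient_patterns):
            transient_error_found = True
    return dbt_run_finished, transient_error_found
```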

edualvess merged commit cb5a1a3 into master on Jul 11, 2024
4 checks passed
sydneynotthecity pushed a commit that referenced this pull request Jul 12, 2024
…rs (#419) (#426)

* Added skip_retry_callback function and related logic

* Rename ci.yml to ci-cd-dev.yml

* Update and rename release.yml to ci-cd-prod.yml

* Update ci-cd-dev.yml

* Improved logic for transient errors search

* Added dbt transient errors patterns variable

* Added skip_retry_dbt_errors to on_retry_callback for dbt tasks

* Update README

* Expanded scope of error search

* Improved logic for more straightforward approach

---------

Co-authored-by: Eduardo Alves <[email protected]>
Co-authored-by: Laysa Bitencourt <[email protected]>
edualvess deleted the feature/add_check_for_dbt_transient_errors branch on July 12, 2024 16:56