
[Hubble-446] Implement mechanism to only fail transient dbt task errors #419

Merged
edualvess merged 12 commits into master from feature/add_check_for_dbt_transient_errors on Jul 11, 2024

Conversation

@edualvess (Contributor) commented Jul 8, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the Jira ticket associated with the PR.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated the README with the added features, breaking changes, and new instructions on how to use the repository.

What

This PR addresses the development of a mechanism to control the flow of Airflow task instances and avoid unnecessary task retries.
Key changes made to the code:

  • Developed skip_retry_dbt_errors function to avoid retrying non-transient dbt errors
  • Added skip_retry_dbt_errors function as on_retry_callback for dbt tasks
  • Added dbt_transient_errors_patterns variable to include the string patterns for known dbt transient errors

The approach followed for this development can be visualized in the flowchart:
[Flowchart: transient error handling flow]
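A minimal sketch of what a callback following this flow could look like, assuming the dbt_transient_errors_patterns Variable is stored as a JSON list, a hypothetical _read_failed_attempt_log helper, and ti.set_state as one plausible way of marking the task instance as skipped; this illustrates the described flow, not the exact implementation merged in this PR:

```python
import logging

from airflow.models import Variable
from airflow.utils.state import TaskInstanceState


def _read_failed_attempt_log(ti) -> list[str]:
    # Hypothetical stand-in: the merged code reads the failed attempt's log
    # file from disk (with a small fail-safe retry loop, discussed in the review below).
    return []


def skip_retry_dbt_errors(context) -> None:
    """Illustrative on_retry_callback: stop retrying non-transient dbt errors."""
    # Airflow Variable listing substrings of known transient errors (assumed JSON).
    transient_patterns = Variable.get(
        "dbt_transient_errors_patterns", deserialize_json=True, default_var=[]
    )

    ti = context["ti"]
    log_contents = _read_failed_attempt_log(ti)

    # If a known transient pattern appears in the log, let Airflow retry as configured.
    for line in log_contents:
        if any(pattern in line for pattern in transient_patterns):
            logging.info("Transient dbt error detected, task %s will be retried", ti)
            return

    # Otherwise the failure is non-transient (e.g. a failed dbt test): mark the
    # task instance as skipped so no further retries are attempted.
    logging.info(
        "Set task instance %s state to %s to skip retrying", ti, TaskInstanceState.SKIPPED
    )
    ti.set_state(TaskInstanceState.SKIPPED)
```

Such a function would then be wired into the dbt tasks' on_retry_callback, as listed in the key changes above.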

Why

With the current dbt task setup, if a dbt test fails, Airflow will still retry the task multiple times. Without manual intervention, the retries will not produce a different outcome for dbt tests, so we should not retry the task. Instead, we would like the dbt Airflow task to exhibit the following behavior:

Transient errors: for errors caused by Airflow infrastructure or other transient issues, the task should continue to retry as originally designed, because we expect a retry to produce a different task outcome.

Data quality errors: if an error is triggered by bad data or a failed dbt test, the task should fail immediately if the test is configured to fail. If the test is configured to warn, the task should continue successfully.
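For illustration, the dbt_transient_errors_patterns Variable could hold a JSON list of substrings like the following; these example patterns are hypothetical and not the project's actual list:

```python
# Hypothetical contents of the dbt_transient_errors_patterns Airflow Variable.
# Any log line containing one of these substrings is treated as a transient
# error, so the task is left to retry as originally configured.
dbt_transient_errors_patterns = [
    "Operation timed out",           # e.g. warehouse or network timeouts
    "could not connect to server",   # e.g. intermittent connectivity issues
]
```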

Known limitations

@edualvess edualvess marked this pull request as ready for review July 9, 2024 21:38
@edualvess edualvess requested a review from a team as a code owner July 9, 2024 21:38
@sydneynotthecity (Contributor) left a comment

Changes look good. One small piece of feedback and some questions for my own understanding, but none of it blocks the release.

retry_delay = 30
log_contents = []

for attempt in range(max_retries):
Contributor

During testing, did you find that the log file sometimes didn't exist?

Contributor Author

I did, but that was mostly my fault. I initially developed this script to run as on_execute_callback, so I had to look for the log file related to ti.try_number. After migrating to on_retry_callback I failed to realize that the context had already been updated by the time the callback was called, so try_number was one step ahead. After fixing the reference I kept the loop mostly as a fail-safe, but it doesn't seem to be necessary.

@sydneynotthecity (Contributor) commented Jul 12, 2024

It's fine to keep as a fail-safe; I was just curious whether Airflow was flaky with its logging.
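For context, a minimal sketch of the kind of fail-safe loop discussed in this thread, assuming a hypothetical log-path layout and helper name; the try_number offset reflects the point above about on_retry_callback running with an already-incremented try_number:

```python
import os
import time


def _log_file_path(ti, try_number: int) -> str:
    # Hypothetical path layout; the real location depends on the Airflow
    # logging configuration of the deployment.
    return f"/opt/airflow/logs/{ti.dag_id}/{ti.task_id}/{try_number}.log"


def read_failed_attempt_log(ti, max_retries: int = 3, retry_delay: int = 30) -> list[str]:
    """Wait for the failed attempt's log file to appear, then return its lines."""
    # on_retry_callback fires after try_number has been incremented, so the
    # attempt that just failed corresponds to try_number - 1.
    log_path = _log_file_path(ti, ti.try_number - 1)
    for _ in range(max_retries):
        if os.path.exists(log_path):
            with open(log_path) as log_file:
                return log_file.readlines()
        # Fail-safe: the file may not have been flushed yet; wait and retry.
        time.sleep(retry_delay)
    return []
```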

ti = context["ti"]
logging.info(
f"Set task instance {ti} state to \
{TaskInstanceState.SKIPPED} to skip retrying"
Contributor

To confirm, elementary will still surface this task failure as an alert since dbt will log a FAILURE, correct?

Contributor Author

Yes, this logic only runs after the first failure of the task in Airflow, whether it is due to dbt errors or anything else. The elementary on-run-end hook has already finished running by the time this is called, so the alerts DAG should catch the warning/failure results from the elementary dataset.

)
return False
# Check for transient errors in case dbt pipeline didn't finish
for line in log_contents:
Contributor

nit: This function loops through the log_contents twice. Do you think it's possible to move this logic up into the first for loop for log_contents? Not a big deal if not, these files are small.

Contributor Author

That makes a lot of sense, as you suggested. I just committed a change for this.
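A rough sketch of the single-pass approach suggested here, using "Done. PASS=" as an example marker that dbt completed its run; it only illustrates the shape of the merged loop, not its exact conditions:

```python
def analyze_dbt_task_log(log_contents: list[str], transient_patterns: list[str]) -> tuple[bool, bool]:
    """Scan the task log once, as suggested in the review.

    Returns (dbt_run_finished, transient_error_found) so the callback can decide
    whether to skip further retries.
    """
    dbt_run_finished = False
    transient_error_found = False
    for line in log_contents:
        # Example completion marker: if dbt finished its run, the failure came
        # from dbt itself (e.g. a failed test) rather than from infrastructure.
        if "Done. PASS=" in line:
            dbt_run_finished = True
        # Check for known transient error patterns in the same pass.
        if any(pattern in line for pattern in transient_patterns):
            transient_error_found = True
    return dbt_run_finished, transient_error_found
```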

edualvess merged commit cb5a1a3 into master on Jul 11, 2024
4 checks passed
sydneynotthecity pushed a commit that referenced this pull request Jul 12, 2024
…rs (#419) (#426)

* Added skip_retry_callback function and related logic

* Rename ci.yml to ci-cd-dev.yml

* Update and rename release.yml to ci-cd-prod.yml

* Update ci-cd-dev.yml

* Improved logic for transient errors search

* Added dbt transient errors patterns variable

* Added skip_retry_dbt_errors to on_retry_callback for dbt tasks

* Update README

* Expanded scope of error search

* Improved logic for more straightforward approach

---------

Co-authored-by: Eduardo Alves <[email protected]>
Co-authored-by: Laysa Bitencourt <[email protected]>
edualvess deleted the feature/add_check_for_dbt_transient_errors branch on July 12, 2024 16:56