Describe the feature
To monitor data processing with dbt, we need access to information about the processed data. For each model in the dbt project, it would be useful to know, for example, the number of updated rows (for upserts) or the total number of rows (for inserts). Some information about the size of the referenced source models should also be available. Since we plan to process a large amount of data daily with dbt-glue, we would like a way to track the record count that goes into each model and comes out of it.
Describe alternatives you've considered
Other dbt adapters can fetch this via the result object and access the adapter_response field for additional information such as rows_affected. This seems to be impossible for all Spark-based implementations, as mentioned [here](https://github.com/dbt-labs/dbt-spark/issues/812), right?
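For context, here is a minimal sketch (plain Python, nothing dbt-glue specific) of where that information surfaces for adapters that do report it: dbt writes each node's adapter_response into the target/run_results.json artifact, and rows_affected appears there only when the adapter populates it.

```python
import json

# after `dbt run`, dbt writes per-node results into target/run_results.json
with open("target/run_results.json") as f:
    run_results = json.load(f)

for result in run_results["results"]:
    adapter_response = result.get("adapter_response") or {}
    # rows_affected only appears when the adapter actually reports it
    print(result["unique_id"], adapter_response.get("rows_affected"))
```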
I found the relevant section of the code in impl.py. Here, something like an accumulator could be added that tracks the number of rows in the input and output DataFrames, which would at least provide a little bit of monitoring for dbt-glue. I thought about something like the following code adaptation:

```scala
// sketch: count the records flowing into a model with a Spark accumulator
val inputCounter = spark.sparkContext.longAccumulator("modelInputRecords")
// countWithAccumulator is a helper (to be written) that increments the
// accumulator once per row while the DataFrame is processed
val inputDf = spark.sql(request).transform(countWithAccumulator(inputCounter))
// integrate this somehow; the accumulator only holds a value after an action has run
println(s"number of model input records: ${inputCounter.value}")
```
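Since the adapter itself is Python, the same idea would look roughly like this in PySpark. This is only a sketch under assumptions: count_with_accumulator is a hypothetical helper (not part of dbt-glue), request stands in for the model's compiled SQL, and the AdapterResponse wiring (dbt-core 1.x keeps that class in dbt.contracts.connection) is exactly the part that needs coordination.

```python
from pyspark.sql import SparkSession
from dbt.contracts.connection import AdapterResponse  # dbt-core 1.x location

# in a Glue interactive session `spark` already exists; getOrCreate returns it
spark = SparkSession.builder.getOrCreate()
request = "select * from some_source_table"  # placeholder for the model's compiled SQL

def count_with_accumulator(df, counter):
    """Hypothetical helper: bump `counter` once per row as the DataFrame is
    materialized. Note: accumulator updates inside transformations can be
    re-applied on task retries, so treat the count as approximate."""
    counted = df.rdd.map(lambda row: (counter.add(1), row)[1])
    return spark.createDataFrame(counted, df.schema)

input_counter = spark.sparkContext.accumulator(0)
input_df = count_with_accumulator(spark.sql(request), input_counter)

# ... write / insert input_df here: the action that actually fills the counter ...

print(f"number of model input records: {input_counter.value}")
# the value could then be surfaced to dbt through the adapter response:
response = AdapterResponse(_message="OK", rows_affected=input_counter.value)
```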
Who will this benefit?
Everybody who wants some monitoring of processed records. Right now this information cannot be accessed at all. The only workaround that occurs to me is writing custom SQL statements that track the counts in tables, but we do not want to couple this dev monitoring with the actual business logic.
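Purely for illustration, that workaround would amount to something like the following (audit.row_counts and my_model are made-up names, and spark is the session from the sketches above); it shows why the tracking statement ends up coupled to the model logic:

```python
# hypothetical illustration of the workaround: the count-tracking SQL has to
# live right next to the model's business logic, which is exactly the
# coupling we want to avoid
spark.sql("""
    INSERT INTO audit.row_counts
    SELECT 'my_model' AS model_name,
           COUNT(*) AS row_count,
           current_timestamp() AS counted_at
    FROM my_model
""")
```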
Are you interested in contributing this feature?
Yes, but further coordination is necessary on how to integrate this.