

Duplicate records from incremental model with composite unique_key with hudi and merge strategy #225

Open
zhangyuan opened this issue Aug 27, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@zhangyuan
Contributor

zhangyuan commented Aug 27, 2023

Describe the bug

I'd like to build SCD type 2 with hudi. When using the hudi file format and the merge strategy for an incremental model whose unique_key is a composite key, duplicate records are created. I'm not sure whether this is a bug in dbt-glue, in hudi from the Glue interactive session, or in hudi itself, but the `_hoodie_record_key` appears to be generated incorrectly when the model is built for the first time.

Steps To Reproduce

1. Create a model:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id,expiry_date',
    file_format='hudi',
    hudi_options={
        'hoodie.datasource.write.precombine.field': 'ingestion_date',
    }
) }}

WITH source_data AS (
    SELECT id, title, ingestion_date
        FROM VALUES
            ('10000', 'Bread 123', '2023-08-01')
        AS data(id, title, ingestion_date)
)

select id, title, ingestion_date, '9999-12-31' as expiry_date from source_data
```
2. Build the model. The table, queried from Athena, is:

| # | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | title | ingestion_date | expiry_date | update_hudi_ts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 20230827103749521 | 20230827103749521_0_2 | 10000,9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-60-0_20230827103749521.parquet | 10000 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693132671694000 |

Note that the `_hoodie_record_key` is `10000,9999-12-31`.

3. Build the model again; the updated table is:

| # | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | title | ingestion_date | expiry_date | update_hudi_ts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 20230827103749521 | 20230827103749521_0_2 | 10000,9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-123-130_20230827104059553.parquet | 10000 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693132671694000 |
| 2 | 20230827104059553 | 20230827104059553_0_1 | id:10000,expiry_date:9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-123-130_20230827104059553.parquet | 10000 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693132859880000 |

Note that the `_hoodie_record_key` of the 2nd record is `id:10000,expiry_date:9999-12-31`, not `10000,9999-12-31` as in the 1st.

4. Rebuild the model; the result is:

| # | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | title | ingestion_date | expiry_date | update_hudi_ts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 20230827103749521 | 20230827103749521_0_2 | 10000,9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-177-186_20230827104208371.parquet | 10000 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693132671694000 |
| 2 | 20230827104208371 | 20230827104208371_0_1 | id:10000,expiry_date:9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-177-186_20230827104208371.parquet | 10000 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693132928702000 |

There is no difference from the previous step.

5. Add a new record to the data, and build the model:

```sql
WITH source_data AS (
    SELECT id, title, ingestion_date
        FROM VALUES
            ('10010', 'Bread 123', '2023-08-01')
        AS data(id, title, ingestion_date)
)

select id, title, ingestion_date, '9999-12-31' as expiry_date from source_data
```

The result is as below:

| # | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | title | ingestion_date | expiry_date | update_hudi_ts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 20230827103749521 | 20230827103749521_0_2 | 10000,9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-27-32_20230827110130912.parquet | 10000 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693132671694000 |
| 2 | 20230827105222582 | 20230827105222582_0_1 | id:10000,expiry_date:9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-27-32_20230827110130912.parquet | 10000 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693133543225000 |
| 3 | 20230827110130912 | 20230827110130912_0_2 | id:10010,expiry_date:9999-12-31 | | 57a17070-d2c8-4dc5-b59b-a9155aaf5123-0_0-27-32_20230827110130912.parquet | 10010 | Bread 123 | 2023-08-01 | 9999-12-31 | 1693134091557000 |

The new record's `_hoodie_record_key` is `id:10010,expiry_date:9999-12-31`. If the model is rebuilt again, no further duplicates are created.

So I assume the `_hoodie_record_key` is generated incorrectly when the table is first created.

Expected behavior

There should be a single record, as the composite primary key (id, expiry_date) is the same.

Screenshots and log output

N/A

System information

The output of dbt --version:

```
dbt --version
Core:
  - installed: 1.6.1
  - latest:    1.6.1 - Up to date!

Plugins:
  - spark: 1.6.0 - Up to date!
  - glue:  1.6.2 - Up to date!
```

The operating system you're using:

Ubuntu

The output of python --version:

Python 3.8.16

Additional context

Config in `profiles.yml`:

```yaml
      glue_version: "4.0"
      datalake_formats: hudi
      conf: spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false
```
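For context, these lines sit inside a dbt-glue target in `profiles.yml`. A minimal surrounding profile might look like the sketch below; the profile name and the `role_arn`, `region`, `schema`, and `location` values are placeholders, not taken from this issue:

```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::123456789012:role/GlueInteractiveSessionRole  # placeholder
      region: us-east-1          # placeholder
      workers: 2
      worker_type: G.1X
      schema: my_schema          # placeholder
      location: s3://my-bucket/my_schema/  # placeholder
      glue_version: "4.0"
      datalake_formats: hudi
      conf: spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false
```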
@zhangyuan zhangyuan added the bug Something isn't working label Aug 27, 2023
@zhangyuan
Contributor Author

Just figured out a workaround: create a dummy record when building the model for the first time:

```sql
{% if is_incremental() %}
 --...
{% else %}
select '_default' as id, '_default' as title, date('1900-01-01') as ingestion_date
{% endif %}
```

However, this will leave a dummy record in the table (perhaps it can be deleted with a post_hook).
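A sketch of that post_hook cleanup, assuming the `_default` dummy id from the snippet above and that the target hudi table supports Spark SQL `DELETE` (dbt's `post_hook` config and `{{ this }}` relation are standard; the rest is untested):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id,expiry_date',
    file_format='hudi',
    -- hypothetical cleanup: remove the first-build dummy record after each run
    post_hook="delete from {{ this }} where id = '_default'"
) }}
```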

@johnnyC9000

Adding these to `hudi_options` solved the issue for me:

```sql
 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
 'hoodie.datasource.write.partitionpath.field': '',
```

The second option was necessary to prevent this error from occurring:

```
: java.io.IOException: Could not load key generator class org.apache.hudi.keygen.ComplexKeyGenerator
```
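Combined with the model config from the original report, the full working config would then read as follows (a sketch; only the two keygenerator/partitionpath options are new relative to the issue):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id,expiry_date',
    file_format='hudi',
    hudi_options={
        'hoodie.datasource.write.precombine.field': 'ingestion_date',
        -- use the complex key generator so composite keys are encoded
        -- consistently (id:...,expiry_date:...) on first build and merges alike
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
        'hoodie.datasource.write.partitionpath.field': '',
    }
) }}
```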
