Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behavior of a table created using 'partition by' with iceberg #413

Open
hsh8220 opened this issue Jul 31, 2024 · 0 comments
Open
Labels
bug Something isn't working

Comments

@hsh8220
Copy link

hsh8220 commented Jul 31, 2024

Describe the bug

I created an iceberg table with the 'partition_by' option, and when I use it to run unit tests or create docs, it misbehaves because it reads the wrong value like 'part 0' into the schema.

Steps To Reproduce

make source_table config for Iceberg table:

{{ config(
    materialized='incremental',
    incremental_strategy='append',
    file_format='iceberg',
    iceberg_expire_snapshots='False',
    table_properties={'format-version': '2'},
    partition_by=['timestamp']
) }}
...
}}

use source profile example

test:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: comment
      role_arn: arn:aws:iam::123456789101:role/role-name
      region: ap-northeast-2
      workers: 2
      worker_type: G.1X
      glue_version: "4.0"
      datalake_formats: iceberg
      schema: "schema_name"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://bucket/location"
      default_arguments: >
        --enable-metrics=true,
        --enable-continuous-cloudwatch-log=true,
        --enable-continuous-log-filter=true,
        --enable-spark-ui=true,
        --spark-event-logs-path=s3://bucket/glue/spark-ui-log/dbt/,
        --hive.optimize.sort.dynamic.partition=true,
        --spark.sql.iceberg.handle-timestamp-without-timezone=true,
        --spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables=true

test model (hello_world.sql)


{{ config(
    materialized='incremental',
    file_format='iceberg',
    iceberg_expire_snapshots='False',
    table_properties={'format-version': '2'},
) }}

WITH data as (
    SELECT app_name,
        device_id,
        row_number() OVER (
            PARTITION BY device_id
            ORDER BY app_name, device_id DESC
        ) as rn
    FROM {{ source('schema_name', 'source_table') }}
    GROUP BY
        app_name,
        device_id
    ORDER BY rn DESC
)

SELECT device_id as filter_ids 
FROM data
WHERE rn > 1
GROUP BY device_id

unit test profile

unit_tests:
  - name: test_hello_world

    # Always only one transformation to test
    model: hello_world

    # No inputs needed this time!
    # Most unit tests will have inputs -- see the "real world example" section below
    given:
      - input : source('schema_name', 'source_table')
        rows:
          - {device_id : test123, app_name : test1}
          - {device_id : test123, app_name : test1}
          - {device_id : test456, app_name : test2}

    # Expected output can have zero to many rows
    expect:
      rows:
        - {filter_ids: test123}
        - {filter_ids: test456}

Expected behavior

when 'dbt test --select hello_world --target dev
run 'ok'

Screenshots and log output

error log

-- Fixture for source_table
select cast('test1' as string)
 as app_name, cast(null as string) as status, cast('test123' as string)
.... (other columns)

cast(null as timestamp) as part 0 
-----------------------------------------^^^
), data as (
    SELECT app_name,
        device_id,
        row_number() OVER (
            PARTITION BY device_id
            ORDER BY app_name, device_id DESC
        ) as rn
    FROM __dbt__cte__source_table
    GROUP BY
        app_name,
        device_id
    ORDER BY rn DESC
)

SELECT device_id as filter_ids
FROM data
WHERE rn > 1
GROUP BY device_id
    ) as __dbt_sbq
    where false
    limit 0

System information

The output of dbt --version:

Core:
  - installed: 1.8.4
  - latest:    1.8.4 - Up to date!

Plugins:
  - spark: 1.8.0 - Up to date!
  - glue:  1.8.1 - Up to date!

The operating system you're using:

The output of python --version:

Additional context

The problem occurs when doing unit tests, but also when importing schema from glue metadata via dbt docs generate, 'part 0' is generated unnecessarily.
image

@hsh8220 hsh8220 added the bug Something isn't working label Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant