Strange behavior of a table created using 'partition by' with iceberg #413

hsh8220 · 2024-07-31T08:00:35Z

Describe the bug

I created an iceberg table with the 'partition_by' option, and when I use it to run unit tests or create docs, it misbehaves because it reads the wrong value like 'part 0' into the schema.

Steps To Reproduce

make source_table config for Iceberg table:

{{ config(
    materialized='incremental',
    incremental_strategy='append',
    file_format='iceberg',
    iceberg_expire_snapshots='False',
    table_properties={'format-version': '2'},
    partition_by=['timestamp']
) }}
...
}}

use source profile example

test:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: comment
      role_arn: arn:aws:iam::123456789101:role/role-name
      region: ap-northeast-2
      workers: 2
      worker_type: G.1X
      glue_version: "4.0"
      datalake_formats: iceberg
      schema: "schema_name"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://bucket/location"
      default_arguments: >
        --enable-metrics=true,
        --enable-continuous-cloudwatch-log=true,
        --enable-continuous-log-filter=true,
        --enable-spark-ui=true,
        --spark-event-logs-path=s3://bucket/glue/spark-ui-log/dbt/,
        --hive.optimize.sort.dynamic.partition=true,
        --spark.sql.iceberg.handle-timestamp-without-timezone=true,
        --spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables=true

test model (hello_world.sql)


{{ config(
    materialized='incremental',
    file_format='iceberg',
    iceberg_expire_snapshots='False',
    table_properties={'format-version': '2'},
) }}

WITH data as (
    SELECT app_name,
        device_id,
        row_number() OVER (
            PARTITION BY device_id
            ORDER BY app_name, device_id DESC
        ) as rn
    FROM {{ source('schema_name', 'source_table') }}
    GROUP BY
        app_name,
        device_id
    ORDER BY rn DESC
)

SELECT device_id as filter_ids 
FROM data
WHERE rn > 1
GROUP BY device_id

unit test profile

unit_tests:
  - name: test_hello_world

    # Always only one transformation to test
    model: hello_world

    # No inputs needed this time!
    # Most unit tests will have inputs -- see the "real world example" section below
    given:
      - input : source('schema_name', 'source_table')
        rows:
          - {device_id : test123, app_name : test1}
          - {device_id : test123, app_name : test1}
          - {device_id : test456, app_name : test2}

    # Expected output can have zero to many rows
    expect:
      rows:
        - {filter_ids: test123}
        - {filter_ids: test456}

Expected behavior

when 'dbt test --select hello_world --target dev
run 'ok'

Screenshots and log output

error log

-- Fixture for source_table
select cast('test1' as string)
 as app_name, cast(null as string) as status, cast('test123' as string)
.... (other columns)

cast(null as timestamp) as part 0 
-----------------------------------------^^^
), data as (
    SELECT app_name,
        device_id,
        row_number() OVER (
            PARTITION BY device_id
            ORDER BY app_name, device_id DESC
        ) as rn
    FROM __dbt__cte__source_table
    GROUP BY
        app_name,
        device_id
    ORDER BY rn DESC
)

SELECT device_id as filter_ids
FROM data
WHERE rn > 1
GROUP BY device_id
    ) as __dbt_sbq
    where false
    limit 0

System information

The output of dbt --version:

Core:
  - installed: 1.8.4
  - latest:    1.8.4 - Up to date!

Plugins:
  - spark: 1.8.0 - Up to date!
  - glue:  1.8.1 - Up to date!

The operating system you're using:

The output of python --version:

Additional context

The problem occurs when doing unit tests, but also when importing schema from glue metadata via dbt docs generate, 'part 0' is generated unnecessarily.

The text was updated successfully, but these errors were encountered:

hsh8220 added the bug Something isn't working label Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strange behavior of a table created using 'partition by' with iceberg #413

Strange behavior of a table created using 'partition by' with iceberg #413

hsh8220 commented Jul 31, 2024

Strange behavior of a table created using 'partition by' with iceberg #413

Strange behavior of a table created using 'partition by' with iceberg #413

Comments

hsh8220 commented Jul 31, 2024

Describe the bug

Steps To Reproduce

Expected behavior

Screenshots and log output

System information

Additional context