DBT Tests not working with Glue Iceberg Tables #411

Open
Wellington-Costa91 opened this issue Jul 21, 2024 · 0 comments
Labels
bug Something isn't working


Describe the bug

I need help running dbt tests against AWS Glue Iceberg tables. The test module does not appear to support the glue_catalog prefix required for Iceberg tables. I have tried several workarounds without success.

Versions:
Running with dbt=1.8.4
Registered adapter: glue=1.8.1

Steps To Reproduce

Create a dbt profile for Iceberg tables

Sample config for Iceberg table:

{{
  config(
    unique_key=["my_key"],
    partition_by=["date_ingestion"],
    materialized="incremental",
    incremental_strategy='merge',
    file_format='iceberg',
    table_properties={'format-version': '2'},
    iceberg_expire_snapshots='False',
    tags=["my_tag_name"],
    pre_hook="SET hive.default.fileformat=parquet",
  )
}}

Sample of the Glue Profile

glue_profile:
  outputs:
    silver_light:
      type: glue
      query-comment: Profile for  Silver Layer Dev
      role_arn: arn:aws:iam::123456789101:role/role-name
      region: sa-east-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "schema_name"
      session_provisioning_timeout_in_seconds: 600
      idle_timeout: 10
      location: "s3://bucket-name/silver"
      datalake_formats: iceberg
      default_arguments: "--enable-auto-scaling=true, --enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://bucket-name-logs/dbt-spark-logs/"
      conf: --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.warehouse=s3://bucket-name-tmp/ --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager --conf spark.sql.catalog.glue_catalog.lock.table=tbl_glue_dbt_lock_table  --conf spark.sql.legacy.allowNonEmptyLocationInCTAS=true --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED --conf  spark.kryoserializer.buffer.max=1GB --conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED
      threads: 2
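
The `conf` value above is long enough that it is hard to read by eye. As a minimal sketch (the helper names `parse_spark_conf`, `registered_catalogs`, and `duplicated_keys` are hypothetical and not part of dbt-glue), the string can be split into key/value pairs to list which Spark catalogs it registers and which keys are repeated. The excerpt below is shortened from the profile. Whether the repeated key or the second catalog has anything to do with the test failure is not established here.

```python
from collections import Counter


def parse_spark_conf(conf):
    """Split a '--conf key=value --conf key=value' string into (key, value) pairs."""
    pairs = []
    for chunk in conf.split("--conf"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


def registered_catalogs(pairs):
    """Return catalog names declared as spark.sql.catalog.<name>=<class>, in order."""
    prefix = "spark.sql.catalog."
    names = []
    for key, _ in pairs:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # Keys such as spark.sql.catalog.<name>.catalog-impl are options, not declarations.
            if "." not in rest and rest not in names:
                names.append(rest)
    return names


def duplicated_keys(pairs):
    """Return the keys that appear more than once."""
    counts = Counter(key for key, _ in pairs)
    return sorted(key for key, n in counts.items() if n > 1)


# Shortened excerpt of the conf value from the profile above.
conf = (
    "--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions "
    "--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog "
    "--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog "
    "--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions "
    "--conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog"
)

pairs = parse_spark_conf(conf)
print("catalogs:", registered_catalogs(pairs))
print("repeated keys:", duplicated_keys(pairs))
```

On this excerpt the sketch reports two declared catalogs (`glue_catalog` and `dev`) and one repeated key (`spark.sql.extensions`).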

The model schema sample:

version: 2

models: 
  - name: "my_tag_name"
    description: "Description"
    columns:
      - name: "identifier_column"
        description: "Description_column"
        data_tests:
          - unique
          - not_null
...

Command executed:
dbt test --select tag:my_tag_name --target=silver_light

Expected behavior

The tests should run against the Iceberg table. Instead, the run fails because dbt looks for the table without the glue_catalog prefix:

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table my_tag_name. StorageDescriptor#InputFormat cannot be null for table: my_tag_name (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
18:43:01  2 of 2 ERROR unique_my_tag_name_identifier_column ............ [ERROR in 624.84s]

Screenshots and log output

[Screenshot: LogErrorDbtIceberg]

System information

The output of dbt --version:

Core:
  - installed: 1.8.4
  - latest:    1.8.4 - Up to date!

Plugins:
  - glue:  1.8.1 - Up to date!
  - spark: 1.8.0 - Up to date!

The operating system you're using:
macOS Sonoma Version 14.3.1

The output of python --version:
Python 3.9.6

Additional context

The AWS documentation says that, to access Iceberg tables in Glue with Spark, the prefix glue_catalog. is required before the database/table name:
https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/iceberg-spark.html
Running the query from the dbt logs as-is fails because the table cannot be found, but the same query with the glue_catalog prefix required for Iceberg tables returns the data.
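
To make the difference concrete, here is a small sketch of the two query shapes. The helpers `qualify` and `unique_test_sql` are hypothetical, written only for this illustration, and the SQL is an approximation of what a `unique` test checks, not the exact text dbt-glue compiles. The schema, table, and column names are the placeholders used earlier in this report.

```python
def qualify(schema, table, catalog=None):
    """Join the non-empty parts into a dotted relation name."""
    return ".".join(part for part in (catalog, schema, table) if part)


def unique_test_sql(relation, column):
    """Approximate the query behind a `unique` test: return values that occur more than once."""
    return (
        f"select {column} as unique_field, count(*) as n_records\n"
        f"from {relation}\n"
        f"where {column} is not null\n"
        f"group by {column}\n"
        f"having count(*) > 1"
    )


# Shape reported as failing: no catalog in front of the schema.
failing = unique_test_sql(
    qualify("schema_name", "my_tag_name"), "identifier_column"
)

# Shape reported as working when run by hand: glue_catalog prefix added.
working = unique_test_sql(
    qualify("schema_name", "my_tag_name", catalog="glue_catalog"),
    "identifier_column",
)

print(failing)
print()
print(working)
```

The only difference between the two strings is the `from` clause, which matches the behavior described above: the unprefixed relation cannot be resolved, while the prefixed one can.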

Wellington-Costa91 added the bug label on Jul 21, 2024