Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Iceberg Partition Transforms #333

Open
yaroslav-ost opened this issue Feb 5, 2024 · 5 comments
Open

Support for Iceberg Partition Transforms #333

yaroslav-ost opened this issue Feb 5, 2024 · 5 comments

Comments

@yaroslav-ost
Copy link
Contributor

Hi guys,

It seems we are not able to use Iceberg Partition Transforms inside dbt models.

Table Config:
{{ config(
materialized='incremental',
incremental_strategy='append',
file_format='iceberg',
partition_by=['months(my_date)']
) }}

Output:
CREATE TABLE glue_catalog.test.test_table
USING iceberg
PARTITIONED BY (months(my_date))
LOCATION 's3://sandbox-source/client/my/test/test_table'
AS SELECT * FROM tmp_test_table ORDER BY (months(my_date))

Error: AnalysisException: Undefined function: months. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.months.

I guess ORDER BY is the root cause since iceberg doesn't support such kind of partition expressions in ORDER BY clause.

query = f"""
CREATE TABLE glue_catalog.{target_relation.schema}.{target_relation.name}
USING iceberg
PARTITIONED BY {partition_by}
LOCATION '{location}'
{table_properties}
AS SELECT * FROM tmp_{target_relation.name} ORDER BY {partition_by}
"""
return query

@aiss93
Copy link
Contributor

aiss93 commented Feb 13, 2024

Hi,
I think the function you want to use is "month" not "months".

@yaroslav-ost
Copy link
Contributor Author

@aiss93
Copy link
Contributor

aiss93 commented Feb 16, 2024

These names are used in the older versions of iceberg. Check your iceberg version. The error message you're getting says that the function you're using wasn't found on the catalog.
I tried with "month" and I get the error : Transform is not supported

@eshetben
Copy link

eshetben commented Jun 4, 2024

from my experience, it seems that the problem is with pyspark -

  1. there's the hour() function used for extracting the hour component - that one should be used within sql queries, including the order by clause.
    https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.functions.hour.html
  2. then there's the hours() function - used solely for partition transforms. should only be used in partition by obviously.
    https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.functions.hours.html

dbt-glue does not support this since it uses the same function in both order by and partition by.

@aiss93
Copy link
Contributor

aiss93 commented Sep 28, 2024

Hi guys,

I managed to fix this by including extra jars in the glue session in order to use an advanced version of iceberg.
Basically, glue 4.0 comes with iceberg 1.0.0. The partitioning transform functions are implemented in iceberg 1.2.0 and above (you can check this in the iceberg repository by searching in each tag)

You'll also need to include the aws bundle associated with the iceberg release you want to use.

Finally in order to respect some iceberg expectations, you'll need to cluster your transformation by the field you're using in the partirion_by clause.

So to summarize you need to change :

  1. your_model.sql : add a CLUSTER BY my_date
  2. profile.yaml file : replace datalake-formats by extra_jars: "_s3_path_to_your_iceberg_jar,s3_path_to_your_aws_bundle_"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants