Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow seed type casting #459

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jausanca
Copy link
Contributor

@jausanca jausanca commented Oct 7, 2024

resolves #458

Description

This PR adds the configuration property enable_spark_seed_casting to enable casting the seed to the types specified on column_types on the statement that creates the seed.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-glue next" section.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@moomindani moomindani added the enable-functional-tests This label enable functional tests label Oct 25, 2024
@moomindani
Copy link
Collaborator

Apologizes for delay. I have verified that integration test worked with this change successfully.
Could you please resolve conflicts and let me know?

@jausanca
Copy link
Contributor Author

It's ok @moomindani, I just assumed there was other stuff. I just merged from main.

csv_chunks = [{"test_column": "1.2345"}]
model = {"name": "mock_model", "schema": "mock_schema"}
column_mappings = [ColumnCsvMappingStrategy("test_column", agate.data_types.Text, "double")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add another case to test multiple mapping strategies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended the test case to include different strategies

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

[
"df",
*[
f'withColumn("{mapping.column_name}", df.{mapping.column_name}.cast("{cast_value}"))'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withColumn works well for simple pattern, but if we have many withColumn, it easily causes StackOverflowException.
Do you think if this happens in real world use case? If yes, it will be safer to replace withColumn with select..as.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it depends, how many withColumns would cause a SO? If it's over 100 then I'd say it's not realistic for any of the use cases we're handling. However, I can look into implementing it with "select..as", but I'm unfamiliar on that syntax.
Do you mean something like this?

df = df.select(col("foo").cast("double"), col("bar").cast("string"))

Copy link
Collaborator

@moomindani moomindani Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. df.select() or df.selectExpr() work here.
The condition depends on multiple factors, but sometimes it can occur with less than 100.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beginning-contributor enable-functional-tests This label enable functional tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specifying column_types doesn't work on seeds.
2 participants