-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
get all composite key candidates function added #58
base: main
Are you sure you want to change the base?
Conversation
if exclude_cols is None: | ||
exclude_cols = [] | ||
df_col_excluded = df.drop(*exclude_cols) | ||
col_select_condition = df_col_excluded.distinct().count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@souvik-databricks why does it need to distinct? If .distinct().count() < .count(), there is no unique combination, or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertkossendey .distinct()
is required because if your data has duplicates in them and then the collect_set()
in line 401
will matchup and we have only unique combination of values for the columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@souvik-databricks still struggling to understand, could you add a test with a completely duplicate line? Because that should return no combination at all right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertkossendey So in that scenario if you have a case of duplicate record then that would indicate that from the columns that you have included to search in, you don't have valid composite key combos. You would have to include more columns to act as part of the composite key.
If in the whole dataset you have a full exact duplicate record then it won't make any sense to keep it as that's then redundant information.
And I will add a test with duplicate record with the explanation of the output for that test case in the readme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If in the whole dataset you have a full exact duplicate record then it won't make any sense to keep it as that's then redundant information.
@souvik-databricks but isn't that the information that there is no valid key candidate in the provided dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was OOO for last week. @robertkossendey I am back now, I will resume the work now.
This looks like a great start. Can you add more tests for the edge cases like when the result is empty, when the DataFrame contains null values, etc? |
column for column in df_col_excluded.columns if | ||
df_col_excluded.select(column).collect()[0][0] == col_select_condition | ||
] | ||
df_col_excluded.select("column_combos ->", *columns).show(truncate=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@souvik-databricks we can remove that show, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can remove the show
@souvik-databricks - let me know when you've addressed the feedback from @robertkossendey and this is ready for another review, thank you! |
@souvik-databricks - following up on this one. Are you planning to address this? |
@MrPowers Let's go through this details over our sync. |
@souvik-databricks - anything else we need to push this over the finish line? |
Added the function to get all of the possible composite key candidates for a delta table