-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: change StudyLocusId
hashing method to md5 (and change StudyLocusId
to string type)
#783
Conversation
…n will have removed null ids
StudyLocusId
hashing method to md5 (and change StudyLocusId to string type)StudyLocusId
hashing method to md5 (and change StudyLocusId
to string type)
Given you were done so fast, I would recommend to update the logic to be more abstract allowing generalisation of the identifier generation (we will need this in other datasets, eg l2g). If you notice, you don't really need the arguments here: def assign_study_locus_id(
study_id_col: Column,
variant_id_col: Column,
finemapping_col: Column = None,
) -> Column: Because you have this list: columns = [study_id_col, variant_id_col, finemapping_col] This can be a simple array of column names, which I believe should be a class class attribute for StudyLocus dataset. So the class itself would define what columns needs to be hashed for the identifier and in which order. Also, I think the actual hashing logic: hashable_columns = [f.when(column.cast("string").isNull(), f.lit("None"))
.otherwise(column.cast("string"))
for column in columns]
return f.md5(f.concat(*hashable_columns)) should be in the @staticmethod
def assign_study_locus_id( ) -> Column:
self._generate_identifier(self.uniqueness_defining_columns).alias("studyLocusId") Where:
I have a tendency to over abstract everything, so it would be great to have a second opinion on this from @d0choa . |
@@ -1109,7 +1109,7 @@ def from_source( | |||
""" | |||
return StudyLocusGWASCatalog( | |||
_df=gwas_associations.withColumn( | |||
"studyLocusId", f.monotonically_increasing_id().cast(LongType()) | |||
"studyLocusId", f.monotonically_increasing_id().cast(StringType()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a dealbreaker, and has no impact whatsoever: this column is not a "real" studyLocusId
: this column is temporarily generated to identify original rows of the GWAS Catalog association dataset before explosion. But it is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lot of changes, all looks great, let's hope nothing breaks. 🤞🏻
✨ Context
StudyLocusId is used as an identifier and does not need to be numerical. Changing it to string will make it easier on the backend side. Hashing strategy is changed to md5, which returns strings.
🛠 What does this PR implement
StudyLocusId
is changed to string type in the schema and at relevant locations (mostly in tests).The hashing strategy for generating the
StudyLocusId
is changed to md5.A test (
test_assign_study_locus_id__null_variant_id
) was removed as validation steps elsewhere should have dropped null variant id cases before theassign_study_locus_id
function.🙈 Missing
🚦 Before submitting
dev
branch?make test
)?poetry run pre-commit run --all-files
)?