feat: change `StudyLocusId` hashing method to md5 (and change `StudyLocusId` to string type) #783

vivienho · 2024-09-24T15:02:28Z

✨ Context

StudyLocusId is used as an identifier and does not need to be numerical. Changing it to a string type makes it easier to handle on the backend side. The hashing strategy is changed to md5, which returns a 32-character hexadecimal string.

🛠 What does this PR implement

StudyLocusId is changed to string type in the schema and at relevant locations (mostly in tests).

The hashing strategy for generating the StudyLocusId is changed to md5.
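A minimal pure-Python sketch of what the md5-based identifier looks like, assuming the inputs are plain strings. It uses `hashlib` rather than Spark, and the function name `md5_identifier` and the example values are hypothetical. Missing values are replaced with the literal "None", following the Spark expression quoted in the review further down.

```python
import hashlib
from typing import Optional, Sequence


def md5_identifier(values: Sequence[Optional[str]]) -> str:
    """Return a 32-character hex digest built from the given values."""
    # Replace missing values with the literal "None", as the Spark expression does.
    parts = ["None" if value is None else str(value) for value in values]
    # Concatenate without a separator and hash the UTF-8 bytes.
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


# Hypothetical example values; real study and variant IDs will differ.
with_method = md5_identifier(["STUDY_1", "1_1000_A_T", "SuSiE"])
without_method = md5_identifier(["STUDY_1", "1_1000_A_T", None])

print(with_method)
print(without_method)
```

Both results are strings of the same length, which is why the schema type has to change along with the hashing method.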

A test (test_assign_study_locus_id__null_variant_id) was removed, because validation steps elsewhere should already have dropped rows with a null variant ID before assign_study_locus_id is called.

🙈 Missing

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?


DSuveges · 2024-09-25T08:34:34Z

Given you were done so fast, I would recommend making the logic more abstract, so that identifier generation can be generalised (we will need this in other datasets, e.g. l2g). If you notice, you don't really need the arguments here:

    def assign_study_locus_id(
        study_id_col: Column,
        variant_id_col: Column,
        finemapping_col: Column = None,
    ) -> Column:

Because you have this list:

    columns = [study_id_col, variant_id_col, finemapping_col]

This can be a simple array of column names, which I believe should be a class attribute of the StudyLocus dataset. So the class itself would define which columns need to be hashed for the identifier, and in which order. Also, I think the actual hashing logic:

    hashable_columns = [
        f.when(column.cast("string").isNull(), f.lit("None"))
        .otherwise(column.cast("string"))
        for column in columns
    ]
    return f.md5(f.concat(*hashable_columns))

should be in the Dataset class. And the assign_study_locus_id method should wrap that function:

    def assign_study_locus_id(self) -> Column:
        # Instance method (not static), because it reads the dataset's own attribute.
        return self._generate_identifier(self.uniqueness_defining_columns).alias("studyLocusId")

Where:

_generate_identifier is the hashing function in Dataset class, can be called from any dataset.
uniqueness_defining_columns is a class attribute defined for the given dataset.
This method returns the column under its alias, which is also dataset specific.
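To make the proposed split concrete, here is a rough pure-Python sketch, not the gentropy implementation. It works on plain dictionaries instead of Spark columns, and every name that is not mentioned above (`identifier_name`, `assign_identifier`, `L2GPrediction`, and all field names) is a hypothetical placeholder.

```python
import hashlib
from typing import Any, ClassVar, Dict, Mapping, Sequence, Tuple


class Dataset:
    """Hypothetical base class that owns the generic hashing logic."""

    # Each subclass declares which fields define uniqueness, and in which order.
    uniqueness_defining_columns: ClassVar[Tuple[str, ...]] = ()
    # Each subclass also names its own identifier field.
    identifier_name: ClassVar[str] = "id"

    @staticmethod
    def generate_identifier(row: Mapping[str, Any], columns: Sequence[str]) -> str:
        # Missing values become the literal "None" before concatenation.
        parts = ["None" if row.get(col) is None else str(row[col]) for col in columns]
        return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()

    @classmethod
    def assign_identifier(cls, row: Mapping[str, Any]) -> Dict[str, Any]:
        # Thin wrapper: the dataset-specific parts are only the two class attributes.
        identifier = cls.generate_identifier(row, cls.uniqueness_defining_columns)
        return {**row, cls.identifier_name: identifier}


class StudyLocus(Dataset):
    uniqueness_defining_columns = ("studyId", "variantId", "finemappingMethod")
    identifier_name = "studyLocusId"


class L2GPrediction(Dataset):
    # Placeholder fields, only to show that a second dataset needs no new logic.
    uniqueness_defining_columns = ("studyLocusId", "geneId")
    identifier_name = "predictionId"


locus = StudyLocus.assign_identifier(
    {"studyId": "STUDY_1", "variantId": "1_1000_A_T", "finemappingMethod": None}
)
prediction = L2GPrediction.assign_identifier(
    {"studyLocusId": locus["studyLocusId"], "geneId": "GENE_1"}
)
```

The point of the design is that adding a new dataset only means declaring two attributes, while the hashing itself lives in one place.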

I have a tendency to over-abstract everything, so it would be great to have a second opinion on this from @d0choa.

DSuveges · 2024-09-30T13:13:32Z

src/gentropy/datasource/gwas_catalog/associations.py

@@ -1109,7 +1109,7 @@ def from_source(
        """
        return StudyLocusGWASCatalog(
            _df=gwas_associations.withColumn(
-                "studyLocusId", f.monotonically_increasing_id().cast(LongType())
+                "studyLocusId", f.monotonically_increasing_id().cast(StringType())


This is not a dealbreaker and has no impact whatsoever. This column is not a "real" studyLocusId: it is generated temporarily to identify the original rows of the GWAS Catalog association dataset before explosion. But it is fine.
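For readers unfamiliar with the pattern, a small pure-Python sketch of a temporary row identifier that survives an explosion. The field names and values are made up, and `itertools.count` only stands in for Spark's `monotonically_increasing_id`, which guarantees unique, increasing values but not consecutive ones.

```python
from itertools import count

# Hypothetical association rows, each carrying a list that will be exploded.
associations = [
    {"trait": "height", "variants": ["rs1", "rs2"]},
    {"trait": "bmi", "variants": ["rs3"]},
]

# Tag every original row with a temporary identifier, cast to string.
row_ids = count()
tagged = [{**row, "studyLocusId": str(next(row_ids))} for row in associations]

# Explode: one output row per variant, each keeping its parent's temporary id.
exploded = [
    {"studyLocusId": row["studyLocusId"], "trait": row["trait"], "variant": variant}
    for row in tagged
    for variant in row["variants"]
]

for row in exploded:
    print(row)
```

Rows that share a temporary id came from the same original association, which is all this column is used for.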

DSuveges

A lot of changes, all looking great, let's hope nothing breaks. 🤞🏻

vivienho added 5 commits September 24, 2024 12:10

feat: change studyLocusId to string in schema

4389315

feat: change studyLocusId of example data to string in tests

7dc153a

feat: change hashing method to md5

7e62efd

test: remove test_assign_study_locus_id__null_variant_id as validatio…

dd354b4

…n will have removed null ids

fix: change studyLocusId to string in remaining files

4c7e146

vivienho changed the title ~~feat: change StudyLocusId hashing method to md5 (and change StudyLocusId to string type)~~ feat: change StudyLocusId hashing method to md5 (and change StudyLocusId to string type) Sep 24, 2024

chore: resolve merge conflicts

58ca683

github-actions bot added size-M Dataset Feature Datasource labels Sep 24, 2024

fix: ensure inputs to assign_study_locus_id are columns and not strings

1dec962

github-actions bot added Method Step labels Sep 24, 2024

This was linked to issues Sep 24, 2024

Change hashing strategy for StudyLocusId generation in StudyLocus object opentargets/issues#3448

Closed

Convert StudyLocusId to String opentargets/issues#3535

Closed

vivienho added 7 commits September 24, 2024 16:53

fix: change studyLocusId to string in remaining files

bcae23d

Merge branch 'dev' into vh-3448

d3c122d

chore: update assign_study_locus_id docstring with updated output

8057a55

Merge branch 'vh-3448' of https://github.com/opentargets/gentropy int…

78ab0c0

…o vh-3448

chore: update assign_study_locus_id docstring with updated output (ag…

d8ab719

…ain)

Merge branch 'dev' into vh-3448

29ef81a

fix: change studyLocusId to string in recently merged files

e873353

vivienho marked this pull request as ready for review September 24, 2024 22:11

vivienho requested a review from DSuveges September 25, 2024 08:11

Merge branch 'dev' into vh-3448

ce125f9

vivienho and others added 3 commits September 26, 2024 12:34

feat: move hashing logic to generate_identifier function in Dataset c…

f1b0817

…lass

Merge branch 'dev' into vh-3448

c441b79

Merge branch 'dev' into vh-3448

caea96e

DSuveges mentioned this pull request Sep 27, 2024

feat: testing dataset #794

Closed

Merge branch 'dev' into vh-3448

bd0ed41

DSuveges reviewed Sep 30, 2024

DSuveges approved these changes Sep 30, 2024

DSuveges merged commit 5c58e58 into dev Sep 30, 2024
5 checks passed

DSuveges deleted the vh-3448 branch September 30, 2024 14:47
