
feat: testing dataset #794

Closed
wants to merge 19 commits into from

Conversation

Contributor

@DSuveges DSuveges commented Sep 27, 2024

!! Don't merge, just an experiment

Branching out from #783. In this experiment I was trying to do three things:

  1. Automate ID generation.
  2. Resolve initialising datasets without providing the schema.
  3. Abstract the id column so it can be referred to in joins without typing the column name.

It seems we need a dataset-specific post-init function that does some magic, e.g.:

from pyspark.sql import SparkSession

from gentropy.dataset.test import TestClass


spark = SparkSession.builder.getOrCreate()


dataset = spark.createDataFrame([('a', 1, 1.0, 'b')], ['a', 'b', 'c', 'd'])

print('Dataset before:')
dataset.show()


print('Dataset after:')
TestClass(dataset).df.show(truncate=False)


print('Dataset with id before:')
dataset = spark.createDataFrame([('a', 1, 2.5, 'd', 'asdf')], ['a', 'b', 'c', 'd', 'testId'])
dataset.show()

print('Dataset with id after:')
TestClass(dataset).df.show(truncate=False)

Outputs:

Dataset before:
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  a|  1|1.0|  b|
+---+---+---+---+

Dataset after:
+---+---+---+---+--------------------------------+
|a  |b  |c  |d  |testId                          |
+---+---+---+---+--------------------------------+
|a  |1  |1.0|b  |75ed355ca18d8f03adfe5160dde9bff1|
+---+---+---+---+--------------------------------+

Dataset with id before:
+---+---+---+---+------+
|  a|  b|  c|  d|testId|
+---+---+---+---+------+
|  a|  1|2.5|  d|  asdf|
+---+---+---+---+------+

Dataset with id after:
+---+---+---+---+------+
|a  |b  |c  |d  |testId|
+---+---+---+---+------+
|a  |1  |2.5|d  |asdf  |
+---+---+---+---+------+

There are some problems though:

  1. mypy is not happy about not providing the _unique_fields and _id_column fields in the parent class. I see the point, however things become uglier if these are optional. (As I saw, you need to define these values in the post-init as well.)
  2. Also, the id generation relies on the presence of all unique fields, which is not the case for the result of some operations on study loci (e.g. after window-based clumping of summary statistics we don't have a finemapping method). It can be circumvented by making all these columns mandatory or by having separate study-locus and credible-set dataclasses.
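One possible workaround for problem 2, sketched in plain Python (the names `generate_identifier` and the field names are hypothetical stand-ins, not the gentropy API): coalesce missing unique fields to an empty string before hashing, so a row lacking e.g. the finemapping method column still gets a deterministic identifier.

```python
import hashlib


def generate_identifier(row: dict, unique_fields: list) -> str:
    """Build an md5 identifier from the unique fields, tolerating missing ones.

    Missing fields are coalesced to an empty string, so the hash stays
    deterministic even when an optional column is absent.
    """
    parts = [str(row.get(field, "")) for field in unique_fields]
    return hashlib.md5("|".join(parts).encode()).hexdigest()


# A study locus missing the 'finemappingMethod' field still gets an id:
locus = {"studyId": "GCST001", "variantId": "1_123_A_G"}
identifier = generate_identifier(
    locus, ["studyId", "variantId", "finemappingMethod"]
)
```

The same expression could be built on the Spark side with `md5(concat_ws(...))` over the unique columns; the sketch above only illustrates the coalescing idea.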

What do you think @ireneisdoomed , @project-defiant, @vivienho, @d0choa ?

# Only calculating the identifier if it is not present in the dataframe already:
if "testId" not in self._df.columns:
    self._df = self._df.withColumn(
        self._id_column, self._generate_identifier(self._unique_fields)
    )
Contributor

We can call _generate_identifier in the Dataset post-init with the fields that are inherited from the child class. Then the only things to consider are:

  1. whether the dataset requires the index field
  2. the index field name
  3. the fields that build the index field

Defining the above in the child class makes them available, while the parent class post-init runs the method itself to generate the index.

Contributor Author

But we don't need an identifier in every dataset. We do, however, want a list of columns defining uniqueness.
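That split could look something like the following sketch (all class and field names here are illustrative, not the actual gentropy classes): every dataset declares its uniqueness columns, but only datasets that set an id column trigger identifier generation in the parent post-init.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    # Every dataset declares the columns that define uniqueness...
    _unique_fields: list = field(default_factory=list)
    # ...but only some datasets declare an id column to generate.
    _id_column: Optional[str] = None

    def __post_init__(self):
        # The parent decides whether id generation is needed at all.
        self.needs_id = self._id_column is not None


@dataclass
class StudyLocus(Dataset):
    _unique_fields: list = field(default_factory=lambda: ["studyId", "variantId"])
    _id_column: Optional[str] = "studyLocusId"


@dataclass
class SummaryStatistics(Dataset):
    # Uniqueness is declared, but no identifier is generated.
    _unique_fields: list = field(default_factory=lambda: ["studyId", "variantId"])
```

This keeps `_unique_fields` mandatory in spirit while making the identifier opt-in per dataset.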

@project-defiant
Contributor

@DSuveges Here is an example of how this could be implemented:

from dataclasses import dataclass, field
import hashlib

@dataclass
class A:
    create_idx: bool = False
    idx_field_name: str = ""
    fields_defining_idx: list = field(default_factory=list)

    def __post_init__(self):
        if self.create_idx:
            self.build_hash()

    def build_hash(self):
        fields_str = ''.join(self.fields_defining_idx)
        self.idx = hashlib.md5(fields_str.encode()).hexdigest()


@dataclass
class B(A):
    create_idx: bool = True
    idx_field_name: str = "idx"
    fields_defining_idx: list = field(default_factory=lambda: ["a", "b"])

print(A())
print(B())
print(B().idx)
print(A().idx)

yields

A(create_idx=False, idx_field_name='', fields_defining_idx=[])
B(create_idx=True, idx_field_name='idx', fields_defining_idx=['a', 'b'])
187ef4436122d1cc2f40dc2b92f0eba0
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 28
     26 print(B())
     27 print(B().idx)
---> 28 print(A().idx)

AttributeError: 'A' object has no attribute 'idx'
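A minimal way to avoid that AttributeError, assuming we keep the dataclass approach from the example above, is to give `A` a default `idx` field so the attribute always exists even when no hash is built:

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class A:
    create_idx: bool = False
    idx_field_name: str = ""
    fields_defining_idx: list = field(default_factory=list)
    # Default so the attribute always exists, even without build_hash():
    idx: Optional[str] = None

    def __post_init__(self):
        if self.create_idx:
            self.build_hash()

    def build_hash(self):
        fields_str = ''.join(self.fields_defining_idx)
        self.idx = hashlib.md5(fields_str.encode()).hexdigest()


@dataclass
class B(A):
    create_idx: bool = True
    idx_field_name: str = "idx"
    fields_defining_idx: list = field(default_factory=lambda: ["a", "b"])


print(A().idx)  # None instead of AttributeError
print(B().idx)  # same md5 as in the output above
```

The trade-off is that `idx` now also shows up in the dataclass repr, but accessing it on a plain `A` no longer raises.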

Base automatically changed from vh-3448 to dev September 30, 2024 14:47
@DSuveges
Contributor Author

DSuveges commented Oct 3, 2024

As discussed with @d0choa, there's a more sensible implementation that drops dataclasses, which in this case make the initialisation quite a pain.

@DSuveges DSuveges closed this Oct 3, 2024
@DSuveges DSuveges deleted the ds_test_id branch October 10, 2024 22:13