
feat: testing dataset #794

Closed
wants to merge 19 commits into from

Conversation

Contributor

@DSuveges DSuveges commented Sep 27, 2024

!! Don't merge, just an experiment

Branching out from #783. In this experiment I was trying to do three things:

  1. Automate ID generation.
  2. Resolve initialising datasets without providing the schema.
  3. Abstract the id column so it can be referred to in joins without typing the column name.

It seems we need a dataset-specific post-init function that does some magic, e.g.:

from pyspark.sql import SparkSession

from gentropy.dataset.test import TestClass


spark = SparkSession.builder.getOrCreate()


dataset = spark.createDataFrame([('a', 1, 1.0, 'b')], ['a', 'b', 'c', 'd'])

print('Dataset before:')
dataset.show()


print('Dataset after:')
TestClass(dataset).df.show(truncate=False)


print('Dataset with id before:')
dataset = spark.createDataFrame([('a', 1, 2.5, 'd', 'asdf')], ['a', 'b', 'c', 'd', 'testId'])
dataset.show()

print('Dataset with id after:')
TestClass(dataset).df.show(truncate=False)

Outputs:

Dataset before:
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  a|  1|1.0|  b|
+---+---+---+---+

Dataset after:
+---+---+---+---+--------------------------------+
|a  |b  |c  |d  |testId                          |
+---+---+---+---+--------------------------------+
|a  |1  |1.0|b  |75ed355ca18d8f03adfe5160dde9bff1|
+---+---+---+---+--------------------------------+

Dataset with id before:
+---+---+---+---+------+
|  a|  b|  c|  d|testId|
+---+---+---+---+------+
|  a|  1|2.5|  d|  asdf|
+---+---+---+---+------+

Dataset with id after:
+---+---+---+---+------+
|a  |b  |c  |d  |testId|
+---+---+---+---+------+
|a  |1  |2.5|d  |asdf  |
+---+---+---+---+------+

There are some problems though:

  1. mypy is not happy about not providing the _unique_fields and _id_column fields in the parent class. I see the point, however things become uglier if these are optional. (As I saw, you need to define these values in the post-init as well.)
  2. Also, the id generation relies on the presence of all unique fields, which is not the case for the result of some operations on study loci (e.g. after window-based clumping of summary statistics we don't have a finemapping method). It can be circumvented by making all these columns mandatory or by having separate study-locus and credible-set dataclasses.
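One possible workaround for problem 2, sketched in plain Python (the names `generate_identifier` and the field names are hypothetical stand-ins, not the gentropy API): coalesce missing unique fields to an empty string before hashing, so a row lacking e.g. the finemapping method column still gets a deterministic identifier.

```python
import hashlib


def generate_identifier(row: dict, unique_fields: list) -> str:
    """Build an md5 identifier from the unique fields, tolerating missing ones.

    Missing fields are coalesced to an empty string, so the hash stays
    deterministic even when an optional column is absent.
    """
    parts = [str(row.get(field, "")) for field in unique_fields]
    return hashlib.md5("|".join(parts).encode()).hexdigest()


# A study locus missing the 'finemappingMethod' field still gets an id:
locus = {"studyId": "GCST001", "variantId": "1_123_A_G"}
identifier = generate_identifier(
    locus, ["studyId", "variantId", "finemappingMethod"]
)
```

The same expression could be built on the Spark side with `md5(concat_ws(...))` over the unique columns; the sketch above only illustrates the coalescing idea.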

What do you think @ireneisdoomed , @project-defiant, @vivienho, @d0choa ?

# Only calculating the identifier if it is not present in the dataframe already:
if "testId" not in self._df.columns:
    self._df = self._df.withColumn(
        self._id_column, self._generate_identifier(self._unique_fields)
    )
Contributor

We can call _generate_identifier in the Dataset post-init with the fields that are inherited from the child class. Then the only things to consider are:

  1. whether the dataset requires the index field
  2. the index field name
  3. the fields that build the index field

Defining the above in the child class makes them available, while the parent class post-init runs the method itself to generate the index.

Contributor Author

But we don't need an identifier in every dataset. We do, however, want a list of columns defining uniqueness.
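That split could look something like the following sketch (all class and field names here are illustrative, not the actual gentropy classes): every dataset declares its uniqueness columns, but only datasets that set an id column trigger identifier generation in the parent post-init.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    # Every dataset declares the columns that define uniqueness...
    _unique_fields: list = field(default_factory=list)
    # ...but only some datasets declare an id column to generate.
    _id_column: Optional[str] = None

    def __post_init__(self):
        # The parent decides whether id generation is needed at all.
        self.needs_id = self._id_column is not None


@dataclass
class StudyLocus(Dataset):
    _unique_fields: list = field(default_factory=lambda: ["studyId", "variantId"])
    _id_column: Optional[str] = "studyLocusId"


@dataclass
class SummaryStatistics(Dataset):
    # Uniqueness is declared, but no identifier is generated.
    _unique_fields: list = field(default_factory=lambda: ["studyId", "variantId"])
```

This keeps `_unique_fields` mandatory in spirit while making the identifier opt-in per dataset.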

@project-defiant
Contributor

@DSuveges Here is an example of how this could be implemented:

from dataclasses import dataclass, field
import hashlib

@dataclass
class A:
    create_idx: bool = False
    idx_field_name: str = ""
    fields_defining_idx: list = field(default_factory=list)

    def __post_init__(self):
        if self.create_idx:
            self.build_hash()

    def build_hash(self):
        fields_str = ''.join(self.fields_defining_idx)
        self.idx = hashlib.md5(fields_str.encode()).hexdigest()


@dataclass
class B(A):
    create_idx: bool = True
    idx_field_name: str = "idx"
    fields_defining_idx: list = field(default_factory=lambda: ["a", "b"])

print(A())
print(B())
print(B().idx)
print(A().idx)

yields

A(create_idx=False, idx_field_name='', fields_defining_idx=[])
B(create_idx=True, idx_field_name='idx', fields_defining_idx=['a', 'b'])
187ef4436122d1cc2f40dc2b92f0eba0
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 28
     26 print(B())
     27 print(B().idx)
---> 28 print(A().idx)

AttributeError: 'A' object has no attribute 'idx'
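A minimal way to avoid that AttributeError, assuming we keep the dataclass approach from the example above, is to give `A` a default `idx` field so the attribute always exists even when no hash is built:

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class A:
    create_idx: bool = False
    idx_field_name: str = ""
    fields_defining_idx: list = field(default_factory=list)
    # Default so the attribute always exists, even without build_hash():
    idx: Optional[str] = None

    def __post_init__(self):
        if self.create_idx:
            self.build_hash()

    def build_hash(self):
        fields_str = ''.join(self.fields_defining_idx)
        self.idx = hashlib.md5(fields_str.encode()).hexdigest()


@dataclass
class B(A):
    create_idx: bool = True
    idx_field_name: str = "idx"
    fields_defining_idx: list = field(default_factory=lambda: ["a", "b"])


print(A().idx)  # None instead of AttributeError
print(B().idx)  # same md5 as in the output above
```

The trade-off is that `idx` now also shows up in the dataclass repr, but accessing it on a plain `A` no longer raises.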

Base automatically changed from vh-3448 to dev September 30, 2024 14:47
@DSuveges
Contributor Author

DSuveges commented Oct 3, 2024

As discussed with @d0choa, there's a more sensible implementation that drops dataclasses, which in this case make the initialisation quite a pain.

@DSuveges DSuveges closed this Oct 3, 2024
@DSuveges DSuveges deleted the ds_test_id branch October 10, 2024 22:13