feat: testing dataset #794
…n will have removed null ids
# Only calculating the identifier if it is not present in the dataframe already:
if "testId" not in self._df.columns:
    self._df = self._df.withColumn(
        self._id_column, self._generate_identifier(self._unique_fields)
We can call _generate_identifier in Dataset.__post_init__ with the fields inherited from the child class. Then the only things to consider are:
- whether the dataset requires the index field
- the index field name
- the fields that build the index field
Defining these in the child class means they get used there, while the parent class's __post_init__ still runs the method that generates the index.
But we don't need an identifier in every dataset. We do, however, want a list of columns defining uniqueness.
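A minimal sketch of that split, under loud assumptions: rows are plain dicts rather than a Spark dataframe, and the names `Dataset`, `Plain`, `WithId`, `unique_fields`, `id_column` and `is_unique` are hypothetical stand-ins, not the project's actual API. Every dataset declares its uniqueness columns, but only datasets that name an identifier column get one generated.

```python
import hashlib
from dataclasses import dataclass, field
from typing import ClassVar, Optional


@dataclass
class Dataset:
    """Toy stand-in for a dataset: rows are plain dicts, not a Spark dataframe."""

    rows: list = field(default_factory=list)
    # Every dataset declares which columns define uniqueness...
    unique_fields: ClassVar[tuple] = ()
    # ...but only some datasets ask for a derived identifier column.
    id_column: ClassVar[Optional[str]] = None

    def __post_init__(self):
        if self.id_column is None:
            return
        for row in self.rows:
            # Keep identifiers that are already present.
            if self.id_column not in row:
                row[self.id_column] = self._generate_identifier(row)

    def _generate_identifier(self, row):
        key = "|".join(str(row[name]) for name in self.unique_fields)
        return hashlib.md5(key.encode()).hexdigest()

    def is_unique(self):
        keys = [tuple(row[name] for name in self.unique_fields) for row in self.rows]
        return len(keys) == len(set(keys))


@dataclass
class Plain(Dataset):
    # Uniqueness columns only, no identifier.
    unique_fields: ClassVar[tuple] = ("a", "b")


@dataclass
class WithId(Dataset):
    # Uniqueness columns plus a generated identifier column.
    unique_fields: ClassVar[tuple] = ("a", "b")
    id_column: ClassVar[Optional[str]] = "testId"
```

With this layout, `Plain` can still check uniqueness through `is_unique()` without ever carrying an identifier column, while `WithId` gets `testId` filled in by the parent's `__post_init__`.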
@DSuveges Here is an example of how this could be implemented:

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class A:
    create_idx: bool = False
    idx_field_name: str = ""
    fields_defining_idx: list = field(default_factory=list)

    def __post_init__(self):
        # The parent decides whether to build the index, based on child-defined values.
        if self.create_idx:
            self.build_hash()

    def build_hash(self):
        fields_str = ''.join(self.fields_defining_idx)
        self.idx = hashlib.md5(fields_str.encode()).hexdigest()


@dataclass
class B(A):
    create_idx: bool = True
    idx_field_name: str = "idx"
    fields_defining_idx: list = field(default_factory=lambda: ["a", "b"])


print(A())
print(B())
print(B().idx)  # the MD5 hex digest of "ab"
print(A().idx)  # raises AttributeError: A never calls build_hash, so idx is not set
```
Ideas discussed with @d0choa: there's a more sensible implementation that drops dataclasses, which in this case make the initialisation quite a pain.
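What dropping dataclasses might look like is sketched below. This is only an illustration, not the implementation that was discussed: `Dataset`, `ExampleDataset`, `unique_fields` and `id_column` are hypothetical names, and rows are plain dicts instead of a Spark dataframe. The configuration lives in ordinary class attributes and a regular `__init__` does the work, so no `field(default_factory=...)` or `__post_init__` override is needed in the child.

```python
import hashlib


class Dataset:
    # Subclasses override these class attributes; no dataclass machinery needed.
    unique_fields = ()
    id_column = None  # None means "no identifier for this dataset"

    def __init__(self, rows):
        # Copy each row so the caller's input is left untouched.
        self.rows = [dict(row) for row in rows]
        if self.id_column is not None:
            self._add_identifier()

    def _add_identifier(self):
        for row in self.rows:
            key = "|".join(str(row[name]) for name in self.unique_fields)
            # setdefault keeps an identifier that is already present.
            row.setdefault(self.id_column, hashlib.md5(key.encode()).hexdigest())


class ExampleDataset(Dataset):
    unique_fields = ("a", "b")
    id_column = "testId"
```

Here `ExampleDataset([{"a": 1, "b": 2}])` gains a `testId` value on each row, while a bare `Dataset(...)` leaves the rows without one.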
!! Don't merge, just an experiment
Branching out from #783. In this experiment I was trying to do two things:
It seems we need to have a dataset-specific __post_init__ function that does some magic. E.g.:
Outputs:
There are some problems though:
_unique_fields and _id_column fields in the parent class. I see the point; however, things become uglier if these are optional (as I saw, you need to define these values in the post init as well).
What do you think @ireneisdoomed, @project-defiant, @vivienho, @d0choa?