Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter based on additional metadata fields #948

Merged
merged 9 commits into from
Dec 19, 2024
Merged

Filter based on additional metadata fields #948

merged 9 commits into from
Dec 19, 2024

Conversation

misko
Copy link
Collaborator

@misko misko commented Dec 18, 2024

Extend metadata with additional fields present in the NPZ metadata file. Instead of just taking 'natoms' take everything that is present.
Allow user to specify a filter on the dataset in the format of

Take all indices that have absolute value of 'some_quantity' less then or equal to 5

subset_to:
  - op: abs_le
     metadata_key: some_quantity
     rhv: 5

Take all indices that a value of 'some_quantity' in [x,y,z]

subset_to:
  - op: in
     metadata_key: some_quantity
     rhv: [x,y,z]

@misko misko requested review from rayg1234 and wood-b December 18, 2024 23:31
@misko misko changed the title Add info from ase Additional metadata field from ASE , filter based on additional fields Dec 18, 2024
@misko misko added enhancement New feature or request patch Patch version release labels Dec 19, 2024
@@ -34,8 +33,11 @@
T_co = TypeVar("T_co", covariant=True)


class DatasetMetadata(NamedTuple):
natoms: ArrayLike | None = None
class DatasetMetadata:
Copy link
Collaborator

@rayg1234 rayg1234 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why even have this class if we can just dump random things into it? might as well just make it a dict? not sure if this will break alot of things lol

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its better to keep it as a NamedTuple or a dataclass and fields that we decide are needed. This is a much more disciplined interface, so that we are not just dumping random things.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this offline and agreed to follow up after this PR and figure out a better way to include Metadata inside of dataset object and make it fast enough to filter on.

@zulissimeta
Copy link
Collaborator

Is there a way to keep this info self-contained within the dataset files themselves? Every time we have another file that needs to be kept consistent.

This functionality is pretty close to what the ASE datasets already allow (filtering by natoms, or other fields in the object).

Copy link

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/fairchem/core/datasets/base_dataset.py 81.25% 3 Missing ⚠️
Files with missing lines Coverage Δ
src/fairchem/core/datasets/base_dataset.py 90.90% <81.25%> (-2.32%) ⬇️

wood-b
wood-b previously approved these changes Dec 19, 2024
Copy link
Collaborator

@wood-b wood-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works for me

@misko misko changed the title Additional metadata field from ASE , filter based on additional fields Filter based on additional metadata fields Dec 19, 2024
@misko misko enabled auto-merge December 19, 2024 19:29
@misko misko added this pull request to the merge queue Dec 19, 2024
Merged via the queue into main with commit 3a9732a Dec 19, 2024
9 checks passed
@misko misko deleted the add_info_from_ase branch December 19, 2024 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request patch Patch version release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants