
Potential Duplicate Data in OMat24 Dataset #942

Open
mjrs33 opened this issue Dec 12, 2024 · 5 comments
Assignees

Comments

@mjrs33
Copy link

mjrs33 commented Dec 12, 2024

What would you like to report?

I suspect that the OMat24 dataset may contain duplicate data. Specifically, while working with the validation dataset rattled-300-subsampled, I found identical Atoms entries using the following code:

import numpy as np
from fairchem.core.datasets import AseDBDataset

dataset = AseDBDataset(config={"src": "rattled-300-subsampled/val.aselmdb"})

# Collect the structure id ("sid") of every entry in the dataset.
sids = [dataset.get_atoms(i).info["sid"] for i in range(len(dataset))]

# Count how often each sid occurs.
unique_ids, counts = np.unique(sids, return_counts=True)

# Print the indices of one duplicated sid as an example.
target = unique_ids[counts > 1][0]
for i, sid in enumerate(sids):
    if sid == target:
        print(i, sid)

# Number of distinct sids that occur more than once.
print("n_duplicates:", (counts > 1).sum())

output:

31084 agm001012784_ABC_150_spg44_0_0_rattled-300-subsampled_62rpeg
34616 agm001012784_ABC_150_spg44_0_0_rattled-300-subsampled_62rpeg
n_duplicates: 1750

The script identifies duplicate entries by their sid values; I also verified that the corresponding Atoms objects themselves are identical.
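For anyone wanting to inspect all duplicate groups rather than just the first one, a minimal sketch (standalone, using only the standard library; the `duplicate_groups` helper and the toy sid values are illustrative, not part of the dataset API):

```python
from collections import defaultdict

def duplicate_groups(sids):
    """Map each sid that occurs more than once to the list of its dataset indices."""
    groups = defaultdict(list)
    for i, sid in enumerate(sids):
        groups[sid].append(i)
    return {sid: idxs for sid, idxs in groups.items() if len(idxs) > 1}

# Toy example with made-up sids:
sids = ["a", "b", "a", "c", "b", "a"]
print(duplicate_groups(sids))  # {'a': [0, 2, 5], 'b': [1, 4]}
```

In practice `sids` would be the list built from `dataset.get_atoms(i).info["sid"]` as in the snippet above.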

Could you please verify and address this issue?

Thank you!

@lbluque
Copy link
Collaborator

lbluque commented Dec 12, 2024

Thanks for reporting this @mjrs33! I will double check this on my end!

@lbluque
Copy link
Collaborator

lbluque commented Dec 12, 2024

@mjrs33 I can confirm the issue with the validation set available on Hugging Face (the training set does not have this issue). I will fix the validation set and upload a clean version as soon as I can!

@mjrs33
Copy link
Author

mjrs33 commented Dec 13, 2024

Thank you for confirming and addressing this issue so quickly! I really appreciate your effort to fix the dataset. Please let me know once the clean version is available, and if there's anything I can do to assist.

@lbluque lbluque self-assigned this Dec 16, 2024
@lbluque
Copy link
Collaborator

lbluque commented Dec 20, 2024

Hi @mjrs33,

I have confirmed that there were indeed duplicates in the 1M validation set on HF, and I have uploaded corrected versions without duplicates. If you get a chance to re-download the latest files from HF and verify that you no longer see any duplicates, that would be much appreciated!
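A quick way to verify the re-downloaded files is to turn the duplicate count from the original snippet into an assertion. A sketch (the `assert_no_duplicates` helper is illustrative; pass it the sid list collected from the dataset):

```python
import numpy as np

def assert_no_duplicates(sids):
    """Raise AssertionError if any sid appears more than once."""
    unique_ids, counts = np.unique(sids, return_counts=True)
    dup = unique_ids[counts > 1]
    assert dup.size == 0, f"{dup.size} duplicated sids, e.g. {dup[:5]}"

assert_no_duplicates(["a", "b", "c"])  # passes silently on a clean list
```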

@mjrs33
Copy link
Author

mjrs33 commented Dec 23, 2024

Thank you for the quick fix! I've re-downloaded the latest files from HF and confirmed that the duplicate issue has been resolved. Everything looks great now. I really appreciate your prompt response and effort to address this.

Thanks again!
