Potential Duplicate Data in OMat24 Dataset #942
Comments
Thanks for reporting this @mjrs33! I will double-check this on my end!
@mjrs33 I can confirm that I am seeing this issue with the validation set available on Hugging Face (the training set does not have this issue). I will fix the validation set and upload a clean version as soon as I can!
Thank you for confirming and addressing this issue so quickly! I really appreciate your effort to fix the dataset. Please let me know once the clean version is available, and let me know if there’s anything I can do to assist.
Hi @mjrs33, I have confirmed that there were indeed duplicates in the 1M validation set on HF. I have uploaded corrected versions without duplicates. If you get a chance to re-download the latest files from HF and check that you no longer see any duplicates, that would be much appreciated!
Thank you for the quick fix! I’ve re-downloaded the latest files from HF and confirmed that the duplicates issue has been resolved. Everything looks great now. I really appreciate your prompt response and effort to address this. Thanks again!
What would you like to report?
I suspect that the OMat24 dataset may contain duplicate data. Specifically, while working with the validation dataset `rattled-300-subsampled`, I found identical `Atoms` entries using the following code:
output:
The script identifies duplicate `Atoms` objects based on their `sid` values. After running it, I confirmed that identical entries exist in the dataset.
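For anyone who wants to run a similar check, here is a minimal sketch of an `sid`-based duplicate search. It is not the script from this report. It assumes each entry exposes its identifier as `entry.info["sid"]`, which may differ from how your loader stores it; `find_duplicate_sids` and `FakeAtoms` are hypothetical names used only for illustration, and the sample data is made up.

```python
from collections import defaultdict


def find_duplicate_sids(records, get_sid=lambda r: r.info["sid"]):
    """Return {sid: [indices]} for every sid that occurs more than once.

    `get_sid` is an assumption about where the sid lives; pass a different
    accessor if your entries store it elsewhere.
    """
    positions = defaultdict(list)
    for index, record in enumerate(records):
        positions[get_sid(record)].append(index)
    # Keep only the sids seen at two or more positions.
    return {sid: idxs for sid, idxs in positions.items() if len(idxs) > 1}


class FakeAtoms:
    """Hypothetical stand-in for a dataset entry carrying an `info` dict."""

    def __init__(self, sid):
        self.info = {"sid": sid}


# Made-up sample: the first and third entries share an sid.
sample = [FakeAtoms("a"), FakeAtoms("b"), FakeAtoms("a")]
duplicates = find_duplicate_sids(sample)
print(duplicates)
```

A shared `sid` only flags candidate duplicates; to confirm that two entries are truly identical, you would still need to compare their positions, cell, and atomic numbers.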
Could you please verify and address this issue?
Thank you!