
Issue for preprocessing of OMAT24 dataset #945

Open · Seunghyo-Noh opened this issue Dec 17, 2024 · 6 comments

Seunghyo-Noh commented Dec 17, 2024

What would you like to report?

  1. I would like to inquire about the preprocessing of the training database (OMAT24).
    The paper states the following:

'Next, we reduced the size of the dataset by removing all structures with energies > 0 eV, forces norm > 50 eV/Å, and stress > 80 GPa.'

I am curious whether the 'forces norm' refers to the L1 norm or the L2 norm. From the context of the paper it seems to be the L2 norm, but I would like to confirm this since it is not stated explicitly. Similarly, for the 'stress' case (a 3x3 symmetric matrix), I would like to know whether '> 80 GPa' refers to a norm or to the maximum value.

  2. In the OMat24 paper, the sAlex and MPTrj datasets are used for further fine-tuning. While sAlex is available for download through Hugging Face, the MPTrj dataset is not publicly available. Is there a way to download it? (The OMat24 paper states that some settings differ from those used for the existing MPTrj dataset, so I am asking for the sake of data consistency.)
lbluque (Collaborator) commented Dec 17, 2024

Hi @Seunghyo-Noh 👋

  1. For the forces norm we used the L2 norm. For the stress value we used the maximum absolute value in the 3x3 matrix (see the sketch below).
  2. Have a look at this comment to download the MPTrj dataset, or you can download it from its original source here.
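
For reference, here is a minimal filtering sketch along those lines. It assumes ASE Atoms objects with the DFT energy, forces, and stress attached, and it is only an illustration of the criteria, not the actual OMat24 preprocessing code:

```python
# Illustrative filter only -- not the official OMat24 preprocessing script.
# Assumes ASE Atoms objects carrying energy (eV), forces (eV/Å), and stress (eV/Å^3).
import numpy as np
from ase import units

def keep_structure(atoms) -> bool:
    energy = atoms.get_potential_energy()            # total energy, eV
    forces = atoms.get_forces()                      # (N, 3) array, eV/Å
    stress = atoms.get_stress(voigt=False)           # (3, 3) array, eV/Å^3

    if energy > 0.0:                                 # drop positive-energy structures
        return False
    if np.linalg.norm(forces, axis=1).max() > 50.0:  # largest per-atom L2 force norm, eV/Å
        return False
    if np.abs(stress / units.GPa).max() > 80.0:      # largest |stress component|, GPa
        return False
    return True
```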

lbluque self-assigned this Dec 17, 2024
Seunghyo-Noh (Author) commented:

Thank you for your comment.
I understood the answer to the first question.

Regarding the answer to the second question: can I use the MPTrj files without recalculating them with VASP?
I am worried about the difference in input settings between the MPTrj and OMat24 datasets
(the OMat24 paper states that the DFT calculations generally followed the Materials Project default settings, with some "important exceptions").

lbluque (Collaborator) commented Dec 20, 2024

Both MPTrj and Alexandria (including our downsampled version, sAlex) are fully compatible with each other.

That is correct, the OMat24 calculations have some important differences from those in MPTrj (for example, some of the pseudopotentials were updated by the Materials Project after the snapshot for MPTrj was taken).

What are you trying to obtain from the different datasets?

Seunghyo-Noh (Author) commented:

Thank you for your comment.

Fairchem has demonstrated very successful predictions of energy, forces, and stress, but it seems to have a relatively high computational cost even for the small (S) model. While there may be some trade-off in performance, I aim to develop a lightweight model by utilizing various datasets. If possible, I would like to include a version of the MPTrj dataset that is compatible with OMat24 in the training set, which is why I reached out about this.

lbluque (Collaborator) commented Dec 20, 2024

In that case I recommend following an approach similar to the one described in the OMat24 manuscript:

  1. Pretrain with the OMat24 dataset
  2. Finetune with MPTrj and/or Alexandria

If you want to train everything in a single step in a fully compatible manner, you will need to recalculate the DFT data or adapt your model architecture to handle different DFT settings, e.g. see this work or similar work on multi-fidelity models (a toy sketch of that idea is below).
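
Purely as an illustration of the multi-fidelity idea (a hypothetical sketch, not fairchem code), one common approach is to give the model a learned per-dataset embedding so it can absorb systematic offsets between DFT settings:

```python
# Toy multi-fidelity sketch (hypothetical; not part of fairchem).
# A learned per-dataset embedding is added to the node features so one model
# can be trained on data from different DFT settings (e.g. OMat24 vs. MPTrj).
import torch
import torch.nn as nn

class FidelityAwareEnergyHead(nn.Module):
    def __init__(self, hidden_dim: int, n_datasets: int):
        super().__init__()
        self.fidelity_emb = nn.Embedding(n_datasets, hidden_dim)
        self.energy_head = nn.Linear(hidden_dim, 1)

    def forward(self, node_feats: torch.Tensor, dataset_id: int) -> torch.Tensor:
        # node_feats: (n_atoms, hidden_dim) features from any GNN backbone
        # dataset_id: integer label of the structure's source dataset / DFT setting
        fid = self.fidelity_emb(torch.tensor(dataset_id, device=node_feats.device))
        per_atom_energy = self.energy_head(node_feats + fid).squeeze(-1)
        return per_atom_energy.sum()  # predicted total energy of the structure
```

During training one would, for example, pass dataset_id = 0 for OMat24 structures and dataset_id = 1 for MPTrj structures, so both can be used in a single step without recomputing DFT.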

Seunghyo-Noh (Author) commented:

@lbluque Thank you for providing the paper link. I will refer to it and proceed accordingly! Your comments so far have been very helpful.
