
Issue for preprocessing of OMAT24 dataset #945

Open · Seunghyo-Noh opened this issue Dec 17, 2024 · 6 comments

Seunghyo-Noh commented Dec 17, 2024

What would you like to report?

  1. I would like to inquire about the preprocessing of the training database (OMAT24).
    The paper states the following:

'Next, we reduced the size of the dataset by removing all structures with energies > 0 eV, forces norm > 50 eV/Å, and stress > 80 GPa.'

I am curious whether the 'forces norm' refers to the L1 norm or the L2 norm. From the context of the paper it seems to be the L2 norm, but I would like to confirm this since it is not stated explicitly. Similarly, for the 'stress' case (a 3x3 symmetric matrix), I would like to know whether '> 80 GPa' refers to a norm or to the maximum value.

  2. In the OMat24 paper, the sAlex and MPTrj datasets are used for further fine-tuning. While sAlex is available for download through Hugging Face, the MPTrj dataset is not publicly available. Is there a way to download it? (The OMat24 paper states that some settings differ from those used for the existing MPTrj dataset, so I am asking for the sake of data consistency.)
lbluque (Collaborator) commented Dec 17, 2024

Hi @Seunghyo-Noh 👋

  1. For the forces norm we used the L2 norm. For the stress value we used the maximum absolute value in the 3x3 matrix (see the sketch below).
  2. Have a look at this comment to download the MPTrj dataset, or you can download it from its original source here.
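
For reference, here is a minimal filtering sketch along those lines. It assumes ASE Atoms objects with the DFT energy, forces, and stress attached, and it is only an illustration of the criteria, not the actual OMat24 preprocessing code:

```python
# Illustrative filter only -- not the official OMat24 preprocessing script.
# Assumes ASE Atoms objects carrying energy (eV), forces (eV/Å), and stress (eV/Å^3).
import numpy as np
from ase import units

def keep_structure(atoms) -> bool:
    energy = atoms.get_potential_energy()            # total energy, eV
    forces = atoms.get_forces()                      # (N, 3) array, eV/Å
    stress = atoms.get_stress(voigt=False)           # (3, 3) array, eV/Å^3

    if energy > 0.0:                                 # drop positive-energy structures
        return False
    if np.linalg.norm(forces, axis=1).max() > 50.0:  # largest per-atom L2 force norm, eV/Å
        return False
    if np.abs(stress / units.GPa).max() > 80.0:      # largest |stress component|, GPa
        return False
    return True
```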

lbluque self-assigned this Dec 17, 2024
Seunghyo-Noh (Author) commented:

Thank you for your comment.
I understood the answer to the first question.

Regarding the answer to the second question: can I use the MPTrj files without recalculating them with VASP?
I am worried about the difference in input settings between the MPTrj and OMat24 datasets
(the OMat24 paper states that the DFT calculations generally followed the Materials Project default settings, with some "important exceptions").

lbluque (Collaborator) commented Dec 20, 2024

Both MPTrj and Alexandria (including our downsampled version, sAlex) are fully compatible with each other.

That is correct, the OMat24 calculations have some important differences from those in MPTrj (for example, some of the pseudopotentials were updated by the Materials Project after the snapshot for MPTrj was taken).

What are you trying to obtain from the different datasets?

Seunghyo-Noh (Author) commented:

Thank you for your comment.

Fairchem has demonstrated very successful predictions of energy, forces, and stress, but it seems to have a relatively high computational cost even for the small (S) model. While there may be some trade-off in performance, I aim to develop a lightweight model by utilizing various datasets. If possible, I would like to include a version of the MPTrj dataset that is compatible with OMat24 in the training set, which is why I reached out about this.

lbluque (Collaborator) commented Dec 20, 2024

In that case I recommend following an approach similar to the one described in the OMat24 manuscript:

  1. Pretrain with the OMat24 dataset
  2. Finetune with MPTrj and/or Alexandria

If you want to train everything in a single step in a fully compatible manner, you will need to recalculate the DFT data or adapt your model architecture to handle different DFT settings, e.g. see this work or similar work on multi-fidelity models (a toy sketch of that idea is below).
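
Purely as an illustration of the multi-fidelity idea (a hypothetical sketch, not fairchem code), one common approach is to give the model a learned per-dataset embedding so it can absorb systematic offsets between DFT settings:

```python
# Toy multi-fidelity sketch (hypothetical; not part of fairchem).
# A learned per-dataset embedding is added to the node features so one model
# can be trained on data from different DFT settings (e.g. OMat24 vs. MPTrj).
import torch
import torch.nn as nn

class FidelityAwareEnergyHead(nn.Module):
    def __init__(self, hidden_dim: int, n_datasets: int):
        super().__init__()
        self.fidelity_emb = nn.Embedding(n_datasets, hidden_dim)
        self.energy_head = nn.Linear(hidden_dim, 1)

    def forward(self, node_feats: torch.Tensor, dataset_id: int) -> torch.Tensor:
        # node_feats: (n_atoms, hidden_dim) features from any GNN backbone
        # dataset_id: integer label of the structure's source dataset / DFT setting
        fid = self.fidelity_emb(torch.tensor(dataset_id, device=node_feats.device))
        per_atom_energy = self.energy_head(node_feats + fid).squeeze(-1)
        return per_atom_energy.sum()  # predicted total energy of the structure
```

During training one would, for example, pass dataset_id = 0 for OMat24 structures and dataset_id = 1 for MPTrj structures, so both can be used in a single step without recomputing DFT.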

Seunghyo-Noh (Author) commented:

@lbluque Thank you for providing the paper link. I will refer to it and proceed accordingly! Your comments so far have been very helpful.
