How to train the mptraj dataset using the fair chem framework？ #939

lavenderwfy · 2024-12-11T07:16:58Z

Hi guys,

I want to train a model using the MPtraj dataset but I'm having trouble fitting this dataset into the framework. Can you guide me on how to proceed?

Regarding the dataset split for training and validation: Is the split random, or is there an official recommended method for splitting? What is the ratio between training and validation sets?
Regarding the MPTraj dataset format: The MPTraj dataset is in JSON format. To use it with the fair-chem framework, it needs to be converted to ASE or LMDB format. Could you provide the related conversion code?

Thank you very much.

CompRhys · 2024-12-12T14:48:46Z

https://github.com/janosh/matbench-discovery/blob/main/data/mp/eda_mp_trj.py - this EDA script in the Matbench Discovery (MBD) repo downloads MPtrj and gives you the frames as a list of ASE Atoms objects. There is no standard validation split for the dataset, afaik.
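
Since there is no standard split, one option is a seeded random split. Below is a minimal sketch that splits by group rather than by frame, on the assumption that you can get a material ID for each frame and want all frames of one material to stay in the same split. The function name `split_by_group` and the default 5% validation fraction are illustrative choices, not an official recommendation.

```python
import random
from collections import defaultdict


def split_by_group(group_ids, val_fraction=0.05, seed=0):
    """Return (train_idx, val_idx) so that frames sharing a group id stay together.

    group_ids: one entry per frame, e.g. the material ID the frame belongs to.
    val_fraction: fraction of *groups* (not frames) sent to validation.
    """
    # Collect the frame indices belonging to each group.
    groups = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        groups[gid].append(idx)

    # Shuffle the sorted group ids with a fixed seed so the split is reproducible.
    unique = sorted(groups)
    random.Random(seed).shuffle(unique)

    # Always put at least one group into validation.
    n_val = max(1, round(len(unique) * val_fraction))
    val_idx = sorted(i for g in unique[:n_val] for i in groups[g])
    train_idx = sorted(i for g in unique[n_val:] for i in groups[g])
    return train_idx, val_idx
```

The two index lists can then be used to write separate train and validation databases. A plain per-frame random split is simpler, but frames from the same relaxation trajectory would then appear on both sides.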

wood-b · 2024-12-13T02:30:17Z

Hi @lavenderwfy, thanks for your question. We are hoping to add more examples of how to write ASE LMDBs in the near future. For now, I'll include some pseudocode that will hopefully be useful. It sounds like you could use the code @CompRhys mentioned to generate the list of atoms objects.

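from fairchem.core.datasets import LMDBDatabase

# convert JSON into a list of ase atoms objects
# (get_atoms_list_from_json is a placeholder you need to write yourself;
# json_file is the path to the MPtrj JSON file)
atoms_list = get_atoms_list_from_json(json_file)
# write each atoms object to the lmdb, storing atoms.info as extra data
output_file = "your_database.lmdb"
with LMDBDatabase(output_file) as db:
    for atoms in atoms_list:
        db.write(atoms, data=atoms.info)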
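
The `get_atoms_list_from_json` placeholder above could look roughly like the sketch below. It assumes the MPtrj JSON is nested as `{mp_id: {frame_id: {...}}}`, with each frame holding a pymatgen-style `structure` dict plus `corrected_total_energy` and `force` keys. Please check those key names against your download, since they are assumptions here. The helper names `iter_frames` and `record_to_atoms` are made up for illustration, and stress and magnetic moments are left out.

```python
import json


def iter_frames(mptrj):
    """Yield one flat record per frame of the (assumed) nested MPtrj dictionary."""
    for mp_id, frames in mptrj.items():
        for frame_id, frame in frames.items():
            structure = frame["structure"]  # assumed: pymatgen Structure.as_dict() layout
            sites = structure["sites"]
            yield {
                "mp_id": mp_id,
                "frame_id": frame_id,
                # assumes ordered structures, i.e. one species per site
                "symbols": [site["species"][0]["element"] for site in sites],
                "positions": [site["xyz"] for site in sites],  # Cartesian coordinates
                "cell": structure["lattice"]["matrix"],
                "energy": frame["corrected_total_energy"],  # assumed key name
                "forces": frame["force"],  # assumed key name
            }


def record_to_atoms(record):
    """Build an ase.Atoms with energy and forces attached (requires ase)."""
    from ase import Atoms
    from ase.calculators.singlepoint import SinglePointCalculator

    atoms = Atoms(
        symbols=record["symbols"],
        positions=record["positions"],
        cell=record["cell"],
        pbc=True,
    )
    # Store the reference labels on a single-point calculator.
    atoms.calc = SinglePointCalculator(
        atoms, energy=record["energy"], forces=record["forces"]
    )
    # Keep the identifiers so they end up in atoms.info.
    atoms.info["mp_id"] = record["mp_id"]
    atoms.info["frame_id"] = record["frame_id"]
    return atoms


def get_atoms_list_from_json(json_file):
    """Load the whole JSON into memory and convert every frame."""
    with open(json_file) as f:
        mptrj = json.load(f)
    return [record_to_atoms(record) for record in iter_frames(mptrj)]
```

Whether to train on the corrected or the uncorrected total energy is a modelling choice, so swap the energy key if you need the other one.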

LMDBs written in this way can be used for training/validation/testing in our repo, e.g. you would replace this line in the config with the path to the file/folder of the train LMDBs you write. If you want to sanity check your LMDB, you can easily read it using the code below.

from fairchem.core.datasets import AseDBDataset

dataset = AseDBDataset({"src": "path_to_your_database.lmdb"}) # path can also point to a folder with multiple lmdb files
dataset.get_atoms(0) # returns the first atoms object in the database
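
To go one step further than reading a single entry, a small check like the sketch below can scan the first few entries and report which ones are missing labels. It assumes the object passed in supports `len()` and `get_atoms(i)`, as `dataset` does above, and that energies and forces were stored through an ASE calculator so they show up in `atoms.calc.results`. The function name `find_unlabeled` is hypothetical.

```python
def find_unlabeled(dataset, max_items=100):
    """Return indices (among the first max_items) whose Atoms lack energy or forces."""
    missing = []
    for i in range(min(len(dataset), max_items)):
        atoms = dataset.get_atoms(i)
        calc = atoms.calc
        # Atoms read back without stored results have no calculator attached.
        results = getattr(calc, "results", {}) if calc is not None else {}
        if "energy" not in results or "forces" not in results:
            missing.append(i)
    return missing
```

An empty list means every checked entry carries both an energy and forces.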

lbluque mentioned this issue Dec 17, 2024

Issue for preprocessing of OMAT24 dataset #945

Open
