How to train the mptraj dataset using the fair chem framework? #939

Open
lavenderwfy opened this issue Dec 11, 2024 · 2 comments

lavenderwfy commented Dec 11, 2024

Hi guys,

I want to train a model using the MPtraj dataset but I'm having trouble fitting this dataset into the framework. Can you guide me on how to proceed?

  1. Regarding the dataset split for training and validation: Is the split random, or is there an official recommended method for splitting? What is the ratio between training and validation sets?

  2. Regarding the MPTraj dataset format: The MPTraj dataset is in JSON format. To use it with the fair-chem framework, it needs to be converted to ASE or LMDB format. Could you provide conversion code for this?

Thank you very much.

@CompRhys

https://github.com/janosh/matbench-discovery/blob/main/data/mp/eda_mp_trj.py - this EDA script in the Matbench Discovery (MBD) repo downloads MPtrj and gives you the frames as a list of ASE Atoms objects. There is no standard validation split for the dataset afaik.
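Since there is no standard split, one option is to make your own. Below is a minimal sketch, not an official recipe: it holds out whole materials rather than individual frames, so frames from the same relaxation trajectory do not end up in both sets. The function name `split_by_material`, the 5% default fraction, and the idea that you have one material ID per frame are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_material(material_ids, val_fraction=0.05, seed=0):
    """Return (train_idx, val_idx) so that no material appears in both sets.

    material_ids[i] is the material ID (e.g. an mp-id) of frame i.
    The 5% default is an arbitrary illustrative choice, not a recommendation.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")

    # Group frame indices by the material they belong to.
    frames_by_material = defaultdict(list)
    for idx, mat_id in enumerate(material_ids):
        frames_by_material[mat_id].append(idx)

    # Shuffle the unique materials reproducibly, then hold out a fraction of them.
    materials = sorted(frames_by_material)
    random.Random(seed).shuffle(materials)
    n_val = max(1, round(len(materials) * val_fraction))
    val_materials = set(materials[:n_val])

    train_idx, val_idx = [], []
    for mat_id, indices in frames_by_material.items():
        (val_idx if mat_id in val_materials else train_idx).extend(indices)
    return sorted(train_idx), sorted(val_idx)
```

The returned indices can then be used to pick frames from the list of Atoms objects, e.g. `[atoms_list[i] for i in train_idx]`, with the train and validation frames written to separate files.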

Collaborator

wood-b commented Dec 13, 2024

Hi @lavenderwfy, thanks for your question. We are hoping to add more examples of how to write ASE LMDBs in the near future. For now, I'll include some pseudocode that will hopefully be useful. It sounds like you could use the code @CompRhys mentioned to generate the list of atoms objects.

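from fairchem.core.datasets import LMDBDatabase

# Convert the JSON into a list of ase.Atoms objects.
# get_atoms_list_from_json is a placeholder, not a fairchem function: write it yourself
# or build the list with the script @CompRhys mentioned.
atoms_list = get_atoms_list_from_json(json_file)

# Write each Atoms object to the LMDB; atoms.info is stored alongside it as extra data.
output_file = "your_database.lmdb"
with LMDBDatabase(output_file) as db:
    for atoms in atoms_list:
        db.write(atoms, data=atoms.info)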

LMDBs written this way can be used for training/validation/testing in our repo, e.g. you would replace this line in the config with the path to the file or folder containing the train LMDBs you wrote. If you want to sanity check your LMDB, you can easily read it back using the code below.

from fairchem.core.datasets import AseDBDataset

dataset = AseDBDataset({"src": "path_to_your_database.lmdb"}) # path can also point to a folder with multiple lmdb files
dataset.get_atoms(0) # returns the first atoms object in the database
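For the `get_atoms_list_from_json` placeholder in the pseudocode above, a first step could look like the sketch below. It assumes the JSON is a two-level mapping of material ID to frame ID to a dict of per-frame fields, which is how I understand the MPtrj file to be laid out; please check this against the file you downloaded. The names `iter_frames` and `load_frames` are hypothetical helpers, and turning each record into an ASE Atoms object (for example with pymatgen's `AseAtomsAdaptor`) is deliberately left out.

```python
import json


def iter_frames(nested):
    """Yield (material_id, frame_id, frame) from a two-level nested mapping.

    Assumed layout: {material_id: {frame_id: {...per-frame fields...}}}.
    """
    for material_id, frames in nested.items():
        for frame_id, frame in frames.items():
            yield material_id, frame_id, frame


def load_frames(json_path):
    """Load the JSON file and return a flat list of frame records.

    Each record keeps the original per-frame fields and adds the IDs it was
    stored under. The whole file is read into memory at once.
    """
    with open(json_path) as fh:
        nested = json.load(fh)
    return [
        {"material_id": material_id, "frame_id": frame_id, **frame}
        for material_id, frame_id, frame in iter_frames(nested)
    ]
```

Keeping `material_id` on every record makes it straightforward to group frames by material later, e.g. when building a train/validation split.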
