Ase dataset updates #622

lbluque · 2024-01-26T21:40:27Z

This PR includes updates in #630

Updates and fixes to AtomsToGraphs, the AseAtomsDataset objects, and LMDBDatabase that make it easier to load Atoms with arbitrary properties from ASE-style DBs and use them as datasets with the unified OCP trainer.

AtomsToGraphs

Add r_stress as config option to include stress in data
Add r_data_keys as config option to pass a list of target properties saved in Atoms.info to be included in data (see example below)
Edit unit tests to cover items 1 and 2 (r_stress and r_data_keys)
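
As a rough illustration of what r_data_keys asks for, the sketch below pulls a list of requested keys out of an Atoms.info-style dict and raises a KeyError when one is missing (the commit list mentions both checking that keys exist and raising a KeyError). This is not the AtomsToGraphs implementation: the helper name `collect_data_keys` and the sample values are hypothetical, and a plain dict stands in for Atoms.info.

```python
def collect_data_keys(info, data_keys):
    """Return the requested target properties from an Atoms.info-style dict.

    Hypothetical helper for illustration only; not part of ocpmodels.
    """
    missing = [key for key in data_keys if key not in info]
    if missing:
        # Fail loudly instead of silently skipping a requested target.
        raise KeyError(f"keys {missing} not found in atoms.info")
    return {key: info[key] for key in data_keys}


# A plain dict standing in for Atoms.info (values are made up).
info = {"magmoms": [0.0, 2.1, -2.1], "bandgap": 1.2, "note": "not a target"}

targets = collect_data_keys(info, ["magmoms", "bandgap"])
print(targets)
```

Only the requested keys are returned, so unrelated entries in the info dict do not end up in the data object.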

AseAtomsDataset

Add key_mapping and transforms attributes according to changes in Unified OCP Trainer #520
Allow using .db, .lmdb, .aselmdb extensions for datasets
Allow passing a list of paths for src in config
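
The combination of accepted extensions and a list-valued src can be pictured roughly as follows. This is only a sketch under assumptions: `resolve_db_paths` and `DB_SUFFIXES` are hypothetical names, and the real dataset class may resolve paths differently. It accepts one path or a list of paths, expands directories to the database files inside them, and raises a ValueError when nothing is found (the commit list mentions raising an error on empty datasets).

```python
import tempfile
from pathlib import Path

# Extensions named in this PR description.
DB_SUFFIXES = {".db", ".lmdb", ".aselmdb"}


def resolve_db_paths(src):
    """Expand one path or a list of paths into database file paths.

    Hypothetical helper for illustration only; not part of ocpmodels.
    """
    sources = [src] if isinstance(src, (str, Path)) else list(src)
    filepaths = []
    for source in map(Path, sources):
        if source.is_dir():
            # Keep only files with a recognised database extension.
            filepaths.extend(
                sorted(p for p in source.iterdir() if p.suffix in DB_SUFFIXES)
            )
        elif source.suffix in DB_SUFFIXES:
            # Taken as-is; existence is not checked in this sketch.
            filepaths.append(source)
    if not filepaths:
        raise ValueError(f"No database files found in {sources}")
    return filepaths


# Small demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train_dir_1").mkdir()
    for name in ("a.db", "b.aselmdb", "notes.txt"):
        (root / "train_dir_1" / name).touch()
    (root / "extra.lmdb").touch()
    found = resolve_db_paths([root / "train_dir_1", root / "extra.lmdb"])
    found_names = [path.name for path in found]

print(found_names)
```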

LMDBDatabase

minor bug fixes for handling updating and deleting rows
cleanup unit-tests

OCPTrainer

cast sid/fid data attributes to list using list() to allow string ids
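
To see why a list cast helps with string ids: numeric ids can live in an array-like container, but string ids arrive as plain Python sequences, so code that consumes them needs one common form. The sketch below is an assumption-laden illustration, not the trainer code; `ids_to_list` is a hypothetical helper, the sample ids are made up, and `array.array` from the standard library stands in for a tensor because it also exposes `tolist()`.

```python
from array import array


def ids_to_list(ids):
    """Return ids as a plain Python list, whatever container they came in.

    Hypothetical helper for illustration only; not part of ocpmodels.
    """
    if hasattr(ids, "tolist"):
        # Array-like containers (a stand-in for tensors) convert themselves.
        return ids.tolist()
    # Plain sequences, including sequences of strings, go through list().
    return list(ids)


numeric_sids = array("i", [101, 102, 103])  # stand-in for a tensor of int ids
string_sids = ("sys-a", "sys-b")            # made-up string ids

print(ids_to_list(numeric_sids))
print(ids_to_list(string_sids))
```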

Here's a minimal example of using AseDBDataset with these options:

from pathlib import Path
from ocpmodels.datasets import AseDBDataset

root_dir = Path("/root/path/to/databases")

# paths with ase dbs (ie LMDBDatabase, or dbs from ase: https://wiki.fysik.dtu.dk/ase/ase/db/db.html)
path1 = root_dir / "train_dir_1"
path2 = root_dir / "train_dir_2"
path3 = root_dir / "train_dir_3"

# To load the dataset for S2EFS we request energy, forces, and stress, plus magmoms as an example of an additional property.
# Note that magmoms (or any additional property) should be saved in the 'data' attribute of each AtomsRow in the db.

dataset = AseDBDataset(
    config={
        "src": [str(path1), str(path2), str(path3)],
        "a2g_args": {
            "r_energy": True,
            "r_forces": True,
            "r_stress": True,
            "r_data_keys": ["magmoms"]
        },
        "key_mapping": {"magmoms": "magnetic_moments"}
    }
)

# the requested properties should now all be part of the data objects
print(dataset[0].energy, dataset[0].forces.shape, dataset[0].stress.shape, dataset[0].magnetic_moments.shape)

# you can also directly inspect the corresponding `Atoms` objects
atoms = dataset.get_atoms(0)

# the additional data is saved in the atoms info
print(atoms.info.keys())  # this includes "magmoms"
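
The effect of key_mapping in the example above (magmoms exposed as magnetic_moments) amounts to renaming keys. A minimal sketch, assuming a plain dict in place of the data object and a hypothetical helper name `apply_key_mapping`; the actual dataset code may differ:

```python
def apply_key_mapping(data, key_mapping):
    """Return a copy of data with keys renamed according to key_mapping.

    Hypothetical helper for illustration only; not part of ocpmodels.
    """
    renamed = dict(data)
    for old_key, new_key in key_mapping.items():
        if old_key in renamed:
            # Move the value to its new name and drop the old one.
            renamed[new_key] = renamed.pop(old_key)
    return renamed


# Made-up values standing in for one data object.
sample = {"energy": -1.5, "magmoms": [0.0, 2.1]}
mapped = apply_key_mapping(sample, {"magmoms": "magnetic_moments"})
print(mapped)
```

Keys absent from the mapping pass through unchanged, which is why energy keeps its name here.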

mshuaibii

Looks good! Minor changes.

codecov · 2024-01-29T20:18:09Z

Codecov Report

Attention: Patch coverage is 79.11392%, with 33 lines in your changes missing coverage. Please review.

Project coverage is 57.23%. Comparing base (fa39a8f) to head (bdbba48).

Files	Patch %	Lines
ocpmodels/datasets/ase_datasets.py	82.81%	11 Missing ⚠️
ocpmodels/datasets/lmdb_database.py	82.97%	8 Missing ⚠️
ocpmodels/trainers/ocp_trainer.py	14.28%	6 Missing ⚠️
ocpmodels/preprocessing/atoms_to_graphs.py	82.35%	3 Missing ⚠️
ocpmodels/common/utils.py	33.33%	2 Missing ⚠️
ocpmodels/datasets/_utils.py	91.66%	1 Missing ⚠️
ocpmodels/datasets/lmdb_dataset.py	66.66%	1 Missing ⚠️
ocpmodels/datasets/oc22_lmdb_dataset.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #622      +/-   ##
==========================================
+ Coverage   56.98%   57.23%   +0.25%     
==========================================
  Files         108      109       +1     
  Lines       10262    10287      +25     
==========================================
+ Hits         5848     5888      +40     
+ Misses       4414     4399      -15
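
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below recomputes the project coverage from the hits and misses in the diff and sums the per-file missing lines. Two points are assumptions rather than facts from the report: the displayed percentages appear to be truncated (not rounded) to two decimals for these numbers, and the patch size of 158 lines is inferred from the 79.11392% figure, not stated anywhere above.

```python
def truncated_percent(part, whole):
    """Percentage truncated (not rounded) to two decimals, via integer math."""
    return (part * 10_000 // whole) / 100


# Hits and misses from the coverage diff (base = main, head = #622).
base_hits, base_misses = 5848, 4414
head_hits, head_misses = 5888, 4399

base_lines = base_hits + base_misses  # matches the "Lines" row: 10262
head_lines = head_hits + head_misses  # matches the "Lines" row: 10287

base_cov = truncated_percent(base_hits, base_lines)
head_cov = truncated_percent(head_hits, head_lines)
delta = round(head_cov - base_cov, 2)

# Missing lines per file, in the order of the table above.
missing_per_file = [11, 8, 6, 3, 2, 1, 1, 1]
patch_missing = sum(missing_per_file)

# ASSUMPTION: 158 changed lines is inferred from the patch percentage.
patch_lines = 158
patch_cov = round(100 * (patch_lines - patch_missing) / patch_lines, 5)

print(base_cov, head_cov, delta, patch_missing, patch_cov)
```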

mshuaibii

Minor comments

… w use_train_settings

* minor cleanup of lmbddatabase
* ase dataset compat for unified trainer and cleanup
* typo in docstring
* key_mapping docstring
* add stress to atoms_to_graphs.py and test
* allow adding target properties in atoms.info
* test using generic tensor property in ase_datasets
* minor docstring/comments
* handle stress in voigt notation in metadata guesser
* handle scalar generic values in a2g
* clean up ase dataset unit tests
* allow .aselmdb extensions
* fix minor bugs in lmdb database and update tests
* make connect_db staticmethod
* remove redundant methods and make some private
* allow a list of paths in AseDBdataset
* remove sprinkled print statement
* remove deprecated transform kwarg
* fix doctring typo
* rename keys function
* fix missing comma in tests
* set default r_edges in a2g in AseDatasets to false
* simple unit-test for good measure
* call _get_row directly
* [wip] allow string sids
* raise a helpful error if AseAtomsAdaptor not available
* remove db extension in filepaths
* set logger to info level when trying to read non db files, remove print
* set logging.debug to avoid saturating logs
* Update documentation for dataset config changes: This PR is intended to address #629
* Update atoms_to_graphs.py
* Update test_ase_datasets.py
* Update test_ase_datasets.py
* Update test_atoms_to_graphs.py
* Update test_atoms_to_graphs.py
* case for explicit a2g_args None values
* Update update_config()
* Update utils.py
* Update utils.py
* Update ocp_trainer.py: More helpful warning for debug mode
* Update ocp_trainer.py
* Update ocp_trainer.py
* Update TRAIN.md
* fix concatenating predictions
* check if keys exist in atoms.info
* Update test_ase_datasets.py
* use list() to cast all batch.sid/fid
* correctly stack predictions
* raise error on empty datasets
* raise ValueError instead of exception
* code cleanup
* rename get_atoms object -> get_atoms for brevity
* revert to raise keyerror when data_keys are missing
* cast tensors to list using tolist and vstack relaxation pos
* remove r_energy, r_forces, r_stress and r_data_keys from test_dataset w use_train_settings
* fix test_dataset key
* fix test_dataset key!
* revert to not setting a2g_args dataset keys
* fix debug predict logic
* support numpy 1.26
* fix numpy version
* revert write_pos
* no list casting on batch lists
* pretty logging

---------

Co-authored-by: Ethan Sunshine
Co-authored-by: Muhammed Shuaibi

lbluque added 17 commits January 17, 2024 15:52

minor cleanup of lmbddatabase

826598f

ase dataset compat for unified trainer and cleanup

324a645

typo in docstring

6bb3b81

key_mapping docstring

b4614c4

add stress to atoms_to_graphs.py and test

d736b00

allow adding target properties in atoms.info

0a17008

test using generic tensor property in ase_datasets

3a7f810

minor docstring/comments

f47a0b8

handle stress in voigt notation in metadata guesser

c2a789e

handle scalar generic values in a2g

47f4578

clean up ase dataset unit tests

48dc7d0

allow .aselmdb extensions

8549411

fix minor bugs in lmdb database and update tests

3371cae

make connect_db staticmethod

a0a2b2e

remove redundant methods and make some private

237f000

allow a list of paths in AseDBdataset

cae0765

remove sprinkled print statement

dd0b5fc

lbluque requested review from zulissimeta and mshuaibii January 26, 2024 21:40