Ase dataset updates #622

Merged
merged 71 commits into main from ase_data_updates
Apr 1, 2024

@lbluque lbluque commented Jan 26, 2024

This PR includes updates in #630

Updates and fixes to AtomsToGraphs, AseAtomsDataset objects, and LMDBDatabase that make it easier to load Atoms with arbitrary properties, stored in ASE-style DBs, and use them as datasets with the unified OCP trainer.

AtomsToGraphs

  1. Add r_stress as config option to include stress in data
  2. Add r_data_keys as config option to pass a list of target properties saved in Atoms.info to be included in data (see example below)
  3. Edit unit tests for 1 and 2
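
As a rough illustration of item 2, picking extra targets can be thought of as a lookup over Atoms.info. The sketch below uses a plain dict in place of Atoms.info and a hypothetical helper, collect_data_keys; it is not the AtomsToGraphs implementation. Raising KeyError for a missing key follows the commit log of this PR ("revert to raise keyerror when data_keys are missing").

```python
from typing import Any, Dict, Mapping, Sequence


def collect_data_keys(info: Mapping[str, Any], data_keys: Sequence[str]) -> Dict[str, Any]:
    """Pick the requested target properties out of an Atoms.info-like mapping.

    Hypothetical helper for illustration only; `info` stands in for Atoms.info.
    """
    missing = [key for key in data_keys if key not in info]
    if missing:
        # Fail loudly instead of silently skipping a requested target.
        raise KeyError(f"keys {missing} not found in info")
    return {key: info[key] for key in data_keys}


# Made-up values: only the keys listed in data_keys are kept.
info = {"magmoms": [0.1, -0.1, 2.3], "note": "not requested"}
targets = collect_data_keys(info, ["magmoms"])
print(targets)
```

Keys that are present in the mapping but not requested (here "note") are simply left out of the result.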

AseAtomsDataset

  1. Add key_mapping and transforms attributes according to changes in Unified OCP Trainer #520
  2. Allow using .db, .lmdb, .aselmdb extensions for datasets
  3. Allow passing a list of paths for src in config
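
The three items above can be sketched with the standard library alone. The helpers rename_keys and find_db_files are hypothetical names, not the ocpmodels API. The first mimics what a key_mapping such as {"magmoms": "magnetic_moments"} does to a data record. The second shows one way a list of src entries (files or directories) could be expanded into database files with the accepted extensions.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, List

# Extensions named in this PR description.
DB_EXTENSIONS = {".db", ".lmdb", ".aselmdb"}


def rename_keys(record: Dict[str, Any], key_mapping: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of `record` with keys renamed according to `key_mapping`."""
    return {key_mapping.get(key, key): value for key, value in record.items()}


def find_db_files(src: Iterable[str]) -> List[Path]:
    """Expand files and directories in `src` into a flat list of database files."""
    files: List[Path] = []
    for entry in src:
        path = Path(entry)
        # A file is taken as-is; a directory is scanned one level deep.
        candidates = [path] if path.is_file() else sorted(path.glob("*"))
        files.extend(p for p in candidates if p.suffix in DB_EXTENSIONS)
    return files


record = rename_keys({"energy": -1.0, "magmoms": [0.5]}, {"magmoms": "magnetic_moments"})
print(sorted(record))

# Temporary directory with two database-like files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.db", "b.aselmdb", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_db_files([tmp])]
print(found)
```

Filtering on the suffix keeps unrelated files in a source directory from being opened as databases.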

LMDBDatabase

  1. Minor bug fixes for updating and deleting rows
  2. Clean up unit tests

OCPTrainer

  1. Cast sid/fid data attributes to list using list() to allow string ids
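
The motivation is that string identifiers cannot be stored in a numeric tensor, so ids are easier to handle as plain Python lists. The sketch below is a stand-in built on assumptions: ids_to_list is a hypothetical helper, FakeTensor only imitates the tolist() method that tensor-like objects expose, and the id strings are made up. The trainer's actual code is not shown in this PR description.

```python
from typing import Any, List


class FakeTensor:
    """Minimal stand-in for a tensor-like object exposing tolist()."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


def ids_to_list(ids: Any) -> List[Any]:
    """Normalize a batch of ids (tensor-like or plain sequence) to a list."""
    if hasattr(ids, "tolist"):
        # Tensor-like input: convert its elements to Python scalars.
        return ids.tolist()
    # Plain sequences, including sequences of strings, are copied into a list.
    return list(ids)


int_ids = ids_to_list(FakeTensor([3, 7]))
str_ids = ids_to_list(("sys_a", "sys_b"))
print(int_ids, str_ids)
```

Either way the caller receives a list, so downstream code that writes out predictions does not need to care whether the ids were integers or strings.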

Here's a minimal example of using AseDBDataset with these options:

from pathlib import Path
from ocpmodels.datasets import AseDBDataset

root_dir = Path("/root/path/to/databases")

# paths with ASE DBs (i.e. LMDBDatabase, or DBs from ASE: https://wiki.fysik.dtu.dk/ase/ase/db/db.html)
path1 = root_dir / "train_dir_1"
path2 = root_dir / "train_dir_2"
path3 = root_dir / "train_dir_3"

# To load the dataset for S2EFS, we request energy, forces and stress, plus magmoms as an example of an additional property.
# Note that magmoms (or any additional property) should be saved in the 'data' attribute of each AtomsRow in the db.

dataset = AseDBDataset(
    config={
        "src": [str(path1), str(path2), str(path3)],
        "a2g_args": {
            "r_energy": True,
            "r_forces": True,
            "r_stress": True,
            "r_data_keys": ["magmoms"]
        },
        "key_mapping": {"magmoms": "magnetic_moments"}
    }
)

# the requested properties should now all be part of the data objects (magmoms under its mapped name)
print(dataset[0].energy, dataset[0].forces.shape, dataset[0].stress.shape, dataset[0].magnetic_moments.shape)

# you can also directly inspect the corresponding `Atoms` objects
atoms = dataset.get_atoms(0)

# the additional data is saved in the atoms info
print(atoms.info.keys())  # this includes "magmoms"

@mshuaibii mshuaibii left a comment

Looks good! Minor changes.

codecov bot commented Jan 29, 2024

Codecov Report

Attention: Patch coverage is 79.11392%, with 33 lines in your changes missing coverage. Please review.

Project coverage is 57.23%. Comparing base (fa39a8f) to head (bdbba48).

Files                                        Patch %   Lines
ocpmodels/datasets/ase_datasets.py           82.81%    11 Missing ⚠️
ocpmodels/datasets/lmdb_database.py          82.97%    8 Missing ⚠️
ocpmodels/trainers/ocp_trainer.py            14.28%    6 Missing ⚠️
ocpmodels/preprocessing/atoms_to_graphs.py   82.35%    3 Missing ⚠️
ocpmodels/common/utils.py                    33.33%    2 Missing ⚠️
ocpmodels/datasets/_utils.py                 91.66%    1 Missing ⚠️
ocpmodels/datasets/lmdb_dataset.py           66.66%    1 Missing ⚠️
ocpmodels/datasets/oc22_lmdb_dataset.py      50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #622      +/-   ##
==========================================
+ Coverage   56.98%   57.23%   +0.25%     
==========================================
  Files         108      109       +1     
  Lines       10262    10287      +25     
==========================================
+ Hits         5848     5888      +40     
+ Misses       4414     4399      -15     


@mshuaibii mshuaibii left a comment

Minor comments

@mshuaibii mshuaibii added this pull request to the merge queue Apr 1, 2024
Merged via the queue into main with commit f6e46b1 Apr 1, 2024
5 checks passed
@lbluque lbluque deleted the ase_data_updates branch April 5, 2024 19:57
levineds pushed a commit that referenced this pull request Jul 11, 2024
* minor cleanup of lmbddatabase

* ase dataset compat for unified trainer and cleanup

* typo in docstring

* key_mapping docstring

* add stress to atoms_to_graphs.py and test

* allow adding target properties in atoms.info

* test using generic tensor property in ase_datasets

* minor docstring/comments

* handle stress in voigt notation in metadata guesser

* handle scalar generic values in a2g

* clean up ase dataset unit tests

* allow .aselmdb extensions

* fix minor bugs in lmdb database and update tests

* make connect_db staticmethod

* remove redundant methods and make some private

* allow a list of paths in AseDBdataset

* remove sprinkled print statement

* remove deprecated transform kwarg

* fix doctring typo

* rename keys function

* fix missing comma in tests

* set default r_edges in a2g in AseDatasets to false

* simple unit-test for good measure

* call _get_row directly

* [wip] allow string sids

* raise a helpful error if AseAtomsAdaptor not available

* remove db extension in filepaths

* set logger to info level when trying to read non db files, remove print

* set logging.debug to avoid saturating logs

* Update documentation for dataset config changes

This PR is intended to address #629

* Update atoms_to_graphs.py

* Update test_ase_datasets.py

* Update test_ase_datasets.py

* Update test_atoms_to_graphs.py

* Update test_atoms_to_graphs.py

* case for explicit a2g_args None values

* Update update_config()

* Update utils.py

* Update utils.py

* Update ocp_trainer.py

More helpful warning for debug mode

* Update ocp_trainer.py

* Update ocp_trainer.py

* Update TRAIN.md

* fix concatenating predictions

* check if keys exist in atoms.info

* Update test_ase_datasets.py

* use list() to cast all batch.sid/fid

* correctly stack predictions

* raise error on empty datasets

* raise ValueError instead of exception

* code cleanup

* rename get_atoms object -> get_atoms for brevity

* revert to raise keyerror when data_keys are missing

* cast tensors to list using tolist and vstack relaxation pos

* remove r_energy, r_forces, r_stress and r_data_keys from test_dataset w use_train_settings

* fix test_dataset key

* fix test_dataset key!

* revert to not setting a2g_args dataset keys

* fix debug predict logic

* support numpy 1.26

* fix numpy version

* revert write_pos

* no list casting on batch lists

* pretty logging

---------

Co-authored-by: Ethan Sunshine
Co-authored-by: Muhammed Shuaibi