
Trouble loading pretrained Equiformer V2 models using latest fairchem-core 1.3.0 on Linux #936

Open
samueldyoung29ctr opened this issue Dec 9, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@samueldyoung29ctr

Python version

Python 3.12.8

fairchem-core version

1.3.0

pytorch version

2.4.0

cuda version

False None

Operating system version

RHEL 8.8 (kernel 4.18.0-477.10.1.el8_8.x86_64)

Minimal example

from fairchem.core.models.model_registry import model_name_to_local_file

checkpoint_path = model_name_to_local_file(
    "EquiformerV2-31M-S2EF-OC20-All+MD", local_cache="/tmp/fairchem_checkpoints/"
)

from fairchem.core.common.relaxation.ase_utils import OCPCalculator

# Load the pre-trained checkpoint!
calc = OCPCalculator(checkpoint_path=checkpoint_path, cpu=True)
slab.set_calculator(calc)  # `slab` is an ASE Atoms object defined earlier (not shown)

Current behavior

I get the following traceback:

/conda_dir/envs/fairchem-ocp-cpu/lib/python3.12/site-packages/fairchem/core/common/relaxation/ase_utils.py:190: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu"))
WARNING:root:Detected old config, converting to new format. Consider updating to avoid potential incompatibilities.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[90], line 2
      1 # Load the pre-trained checkpoint!
----> 2 calc = OCPCalculator(checkpoint_path=checkpoint_path, cpu=True)
      3 slab.set_calculator(calc)

File /conda_dir/envs/fairchem-ocp-cpu/lib/python3.12/site-packages/fairchem/core/common/relaxation/ase_utils.py:212, in OCPCalculator.__init__(self, config_yml, checkpoint_path, model_name, local_cache, trainer, cpu, seed)
    209 self.config["checkpoint"] = str(checkpoint_path)
    210 del config["dataset"]["src"]
--> 212 self.trainer = registry.get_trainer_class(config["trainer"])(
    213     task=config.get("task", {}),
    214     model=config["model"],
    215     dataset=[config["dataset"]],
    216     outputs=config["outputs"],
    217     loss_functions=config["loss_functions"],
    218     evaluation_metrics=config["evaluation_metrics"],
    219     optimizer=config["optim"],
    220     identifier="",
    221     slurm=config.get("slurm", {}),
    222     local_rank=config.get("local_rank", 0),
    223     is_debug=config.get("is_debug", True),
    224     cpu=cpu,
    225     amp=config.get("amp", False),
    226     inference_only=True,
    227 )
    229 if checkpoint_path is not None:
    230     self.load_checkpoint(checkpoint_path=checkpoint_path, checkpoint=checkpoint)

File /conda_dir/envs/fairchem-ocp-cpu/lib/python3.12/site-packages/fairchem/core/common/registry.py:302, in Registry.get_trainer_class(cls, name)
    300 @classmethod
    301 def get_trainer_class(cls, name: str):
--> 302     return cls.get_class(name, "trainer_name_mapping")

File /conda_dir/envs/fairchem-ocp-cpu/lib/python3.12/site-packages/fairchem/core/common/registry.py:273, in Registry.get_class(cls, name, mapping_name)
    271 # mapping be class path of type `{module_name}.{class_name}` (e.g., `fairchem.core.trainers.ForcesTrainer`)
    272 if name.count(".") < 1:
--> 273     raise cls.__import_error(name, mapping_name)
    275 try:
    276     return _get_absolute_mapping(name)

RuntimeError: Failed to find the trainer 'equiformerv2_forces'. You may either use a trainer from the registry (one of 'base', 'forces', 'energy' or 'ocp') or provide the full import path to the trainer (e.g., 'fairchem.core.trainers.ocp_trainer.OCPTrainer').
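For context, the lookup that fails here can be sketched in plain Python: fairchem's registry maps short names to classes that register themselves via decorators when their defining module is imported, so if that import fails (e.g. because pyg-lib can't load), the name never lands in the mapping. A simplified, hypothetical sketch, not fairchem's actual implementation:

```python
# Minimal sketch of a name -> class registry, illustrating why a failed
# module import leaves a trainer name unregistered. Hypothetical code,
# not fairchem's actual Registry.
class Registry:
    trainer_name_mapping = {}

    @classmethod
    def register_trainer(cls, name):
        def wrap(trainer_cls):
            cls.trainer_name_mapping[name] = trainer_cls
            return trainer_cls
        return wrap

    @classmethod
    def get_trainer_class(cls, name):
        try:
            return cls.trainer_name_mapping[name]
        except KeyError:
            raise RuntimeError(f"Failed to find the trainer '{name}'.") from None

# Registration happens as a side effect of importing the defining module.
# If that import fails (e.g. a shared library built against a newer glibc),
# this decorator never runs and the name is absent from the mapping.
@Registry.register_trainer("ocp")
class OCPTrainer:
    pass

print(Registry.get_trainer_class("ocp").__name__)  # OCPTrainer
try:
    Registry.get_trainer_class("equiformerv2_forces")
except RuntimeError as e:
    print(e)  # Failed to find the trainer 'equiformerv2_forces'.
```

Under this model, the error message does not mean the checkpoint is broken; it means the code path that would have registered the name was never executed.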

Expected Behavior

I expected EquiformerV2 models to be importable into the OCPCalculator and usable for inference (e.g., during BFGS optimization in ASE).

I am doing this on a compute cluster where I don't control the CUDA or glibc versions. The cluster has only CUDA 11.6, and the required PyTorch version (2.4.0) isn't available for CUDA 11.6, so I installed the CPU version of the Conda environment.

The cluster's glibc is only 2.28, and I was running into import errors saying the system libm.so required at least glibc 2.29, so I installed OpenLibm 0.8.1 from the conda-forge channel and symlinked $CONDA_PREFIX/lib/libm.so to $CONDA_PREFIX/lib/libopenlibm.so.4.0. That resolved the libm.so/glibc 2.29 errors, but now I run into the errors above when trying to instantiate an OCPCalculator with one of the publicly available EquiformerV2 checkpoint files. I get the same error for every one of the public EquiformerV2 names:

['EquiformerV2-83M-S2EF-OC20-2M',
 'EquiformerV2-31M-S2EF-OC20-All+MD',
 'EquiformerV2-153M-S2EF-OC20-All+MD',
 'EquiformerV2-lE4-lF100-S2EFS-OC22',
 'EquiformerV2-S2EF-ODAC',
 'EquiformerV2-Large-S2EF-ODAC',
 'EquiformerV2-IS2RE-ODAC']

Specifying trainer="equiformerv2_forces" in the OCPCalculator constructor leads to the same error above. Specifying trainer="forces" leads to this error:

RuntimeError: Failed to find the trainer 'equiformerv2_forces'. You may either use a trainer from the registry (one of 'base', 'forces', 'energy' or 'ocp') or provide the full import path to the trainer (e.g., 'fairchem.core.trainers.ocp_trainer.OCPTrainer').

If I specify trainer="equiformerv2_forces", model_name="equiformerv2", I get:

RuntimeError: model_name and checkpoint_path were both specified, please use only one at a time
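That last error comes from a simple mutual-exclusion check in the constructor; conceptually it behaves like the sketch below (hypothetical helper name and logic, not fairchem's actual code):

```python
def resolve_checkpoint(model_name=None, checkpoint_path=None, local_cache=None):
    # Sketch of the argument validation behind the error above
    # (hypothetical, not fairchem's actual implementation).
    if model_name is not None and checkpoint_path is not None:
        raise RuntimeError(
            "model_name and checkpoint_path were both specified, "
            "please use only one at a time"
        )
    if model_name is not None:
        # Would resolve the short name to a downloaded checkpoint file.
        return f"{local_cache}/{model_name}.pt"
    return checkpoint_path

print(resolve_checkpoint(checkpoint_path="/tmp/ckpt.pt"))  # /tmp/ckpt.pt
```

So `model_name` is only an alternative way to point at a checkpoint; passing it alongside `checkpoint_path` can never work around the registry failure.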

I can verify in the local copy of fairchem/core/models/pretrained_models.yml that all the EquiformerV2* names appear. This is the latest 1.3.0 release from last week. I'm not sure what I'm doing wrong.

Relevant files to reproduce this bug

No response

@samueldyoung29ctr added the bug label Dec 9, 2024
@DMPoolM

DMPoolM commented Dec 10, 2024

Have you installed fairchem-data-oc? I've seen this error before, and I remember fixing it by running pip install fairchem-data-oc.

@samueldyoung29ctr
Author

I did add fairchem-data-oc to the Conda environment via pip. I'm now again seeing the earlier errors about pyg-lib not finding a libm.so compiled against glibc 2.29, and I suspect the outdated glibc is responsible for the problem. I recreated the same Conda environment on the NERSC Perlmutter login nodes (SLES 15, glibc 2.31), and EquiformerV2 models seem to work there without needing fairchem-data-oc. Possibly, on my primary cluster, the presence of the old libm.so/glibc, which prevents pyg-lib and torch_sparse from loading, ultimately makes the newer EquiformerV2 models unavailable in the model registry.
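Since the suspected root cause is a system glibc too old for the prebuilt wheels, it may help to confirm the glibc version from Python before debugging further; a quick stdlib-only check:

```python
import platform

# Report the C library the interpreter is linked against.
# On glibc-based Linux this returns e.g. ("glibc", "2.28"); extension
# wheels built against glibc >= 2.29 will fail to load on older systems.
name, version = platform.libc_ver()
print(name, version)
```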

@DMPoolM

DMPoolM commented Dec 11, 2024

Thank you. I also encountered this before on my primary cluster, which has only glibc 2.27 and can't be updated. However, I deleted pyg-lib and got fairchem working well, and I haven't seen any problems since. It may be a hacky workaround, though. Is it OK? @misko
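Removing pyg-lib can work because torch_geometric treats these compiled extensions as optional accelerators. A quick stdlib-only way to see which of them import cleanly on a given machine (module names here are the usual ones for the PyG extension packages; verify against your install):

```python
import importlib

# Optional PyG extension modules that typically fail to load when the
# system glibc is older than what their wheels were built against.
for mod in ("pyg_lib", "torch_sparse", "torch_scatter", "torch_cluster"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as e:
        # Covers both "not installed" and shared-library load failures
        # for extension modules.
        print(f"{mod}: not importable ({e.__class__.__name__})")
    except OSError as e:
        print(f"{mod}: failed to load ({e})")
```

If a module shows up as installed but failing to load, uninstalling it (rather than leaving a broken copy on the path) is what lets the rest of the stack fall back gracefully.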
