Training the model on a single CPU or GPU #944

Open
csu-wjc opened this issue Dec 13, 2024 · 6 comments

csu-wjc commented Dec 13, 2024

What would you like to report?

When I run the model on a single CPU with the command 'python main.py --mode train --config-yml configs/oc22/is2re/painn/painn.yml', I get the following error:

(WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
(WARNING): Disabled BalancedBatchSampler because num_replicas=1.
[rank0]: Traceback (most recent call last):
[rank0]: File "D:\Desktop\fairchem\main.py", line 10, in <module>
[rank0]: main()
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\_cli.py", line 135, in main
[rank0]: runner_wrapper(config)
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\_cli.py", line 58, in runner_wrapper
[rank0]: Runner()(config)
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\_cli.py", line 37, in __call__
[rank0]: with new_trainer_context(config=config) as ctx:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\contextlib.py", line 137, in __enter__
[rank0]: return next(self.gen)
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\utils.py", line 1102, in new_trainer_context
[rank0]: trainer = trainer_cls(**trainer_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\ocp_trainer.py", line 109, in __init__
[rank0]: super().__init__(
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 220, in __init__
[rank0]: self.load(inference_only)
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 246, in load
[rank0]: self.load_datasets()
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 365, in load_datasets
[rank0]: self.train_sampler = self.get_sampler(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 313, in get_sampler
[rank0]: return BalancedBatchSampler(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\data_parallel.py", line 171, in __init__
[rank0]: raise error
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\data_parallel.py", line 168, in __init__
[rank0]: dataset = _ensure_supported(dataset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\data_parallel.py", line 113, in _ensure_supported
[rank0]: raise UnsupportedDatasetError(
[rank0]: fairchem.core.datasets.base_dataset.UnsupportedDatasetError: BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.

lbluque commented Dec 13, 2024

Hi @csu-wjc,

Since you are using a single GPU and do not need the load balancing, you should remove the optim.load_balancing entry from your config, i.e. remove this line altogether, and then training can run without the metadata.npz file.
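
For reference, the entry in question sits under the optim section of the YAML config and looks roughly like this (a sketch for orientation only; the surrounding keys in your file may differ):

optim:
  batch_size: 8
  # ... other optimizer settings ...
  load_balancing: atoms  # <- delete this line for single-device training

With that line removed, the trainer falls back to a plain sampler and no longer needs the per-sample atom counts from metadata.npz.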

csu-wjc commented Dec 14, 2024

Hello @lbluque.
Thank you for your generous response. I am confident that I have removed 'load_balancing: atoms', but when I run 'python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml', the following output is generated:
wandb: You chose "Don't visualize my results"
wandb: WARNING resume will be ignored since W&B syncing is set to offline. Starting a new run with run id 2024-12-14-12-50-08.
wandb: Tracking run with wandb version 0.18.7
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
2024-12-14 12:50:46 (INFO): Loading model: gemnet_t
2024-12-14 12:50:49 (INFO): Loaded GemNetT with 23176469 parameters.
2024-12-14 12:50:49 (INFO): Loading dataset: oc22_lmdb
2024-12-14 12:50:49 (WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
2024-12-14 12:50:49 (WARNING): Disabled BalancedBatchSampler because num_replicas=1.
2024-12-14 12:50:49 (WARNING): Failed to get data sizes, falling back to uniform partitioning. BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.
2024-12-14 12:50:49 (INFO): rank: 0: Sampler created...
2024-12-14 12:50:49 (INFO): Created BalancedBatchSampler with sampler=<fairchem.core.common.data_parallel.StatefulDistributedSampler object at 0x000001E11893EC90>, batch_size=8, drop_last=False

Also, I would like to know how to run the model on a single CPU. The result of the command 'python -u main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu' is the same as above. Thank you again for your reply.

lbluque commented Dec 16, 2024

@csu-wjc, are you getting an error when running this? If so, can you paste the traceback or more information so we can understand the issue?

csu-wjc commented Dec 17, 2024

Hello @lbluque.
Thank you very much for your reply. When I run the command, the result is shown below:

(fair-chem) PS D:\Desktop\fairchem> python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu
W1217 16:32:02.571000 4820 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
2024-12-17 16:32:02 (INFO): Running in local mode without elastic launch (single gpu only)
2024-12-17 16:32:02 (INFO): Setting env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
2024-12-17 16:32:02 (INFO): Project root: D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\scn\spherical_harmonics.py:23: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\equiformer_v2\wigner.py:10: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\escn\so3.py:23: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
2024-12-17 16:32:06 (INFO): amp: false
cmd:
  checkpoint_dir: D:\Desktop\fairchem\checkpoints\2024-12-17-16-32-00
  commit: null
  identifier: ''
  logs_dir: D:\Desktop\fairchem\logs\wandb\2024-12-17-16-32-00
  print_every: 10
  results_dir: D:\Desktop\fairchem\results\2024-12-17-16-32-00
  seed: 0
  timestamp_id: 2024-12-17-16-32-00
  version: 1.3.0
dataset:
  format: oc22_lmdb
  key_mapping:
    y_relaxed: energy
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train
evaluation_metrics:
  metrics:
    energy:
    - mae
    - mse
    - energy_within_threshold
  primary_metric: energy_mae
gp_gpus: null
gpus: 0
logger: wandb
loss_functions:
- energy:
    coefficient: 1
    fn: mae
model:
  activation: silu
  cbf:
    name: spherical_harmonics
  cutoff: 12.0
  emb_size_atom: 256
  emb_size_bil_trip: 64
  emb_size_cbf: 16
  emb_size_edge: 512
  emb_size_rbf: 64
  emb_size_trip: 64
  envelope:
    exponent: 5
    name: polynomial
  extensive: true
  max_neighbors: 50
  name: gemnet_t
  num_after_skip: 2
  num_atom: 3
  num_before_skip: 1
  num_blocks: 5
  num_concat: 1
  num_radial: 64
  num_spherical: 7
  otf_graph: true
  output_init: HeOrthogonal
  rbf:
    name: gaussian
  regress_forces: false
  scale_file: configs/oc22/scaling_factors/gemnet-dT_c12.json
optim:
  batch_size: 8
  clip_grad_norm: 10
  ema_decay: 0.999
  eval_batch_size: 8
  factor: 0.8
  lr_initial: 0.0001
  max_epochs: 100
  mode: min
  num_workers: 0
  optimizer: AdamW
  optimizer_params:
    amsgrad: true
  patience: 3
  scheduler: ReduceLROnPlateau
outputs:
  energy:
    level: system
    shape: 1
relax_dataset: {}
slurm: {}
task: {}
test_dataset: {}
trainer: ocp
val_dataset:
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/val_id

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: WARNING resume will be ignored since W&B syncing is set to offline. Starting a new run with run id 2024-12-17-16-32-00.
wandb: Tracking run with wandb version 0.18.7
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
2024-12-17 16:32:13 (INFO): Loading model: gemnet_t
2024-12-17 16:32:16 (INFO): Loaded GemNetT with 23176469 parameters.
2024-12-17 16:32:16 (INFO): Loading dataset: oc22_lmdb
2024-12-17 16:32:16 (WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
2024-12-17 16:32:16 (WARNING): Disabled BalancedBatchSampler because num_replicas=1.
2024-12-17 16:32:16 (WARNING): Failed to get data sizes, falling back to uniform partitioning. BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.
2024-12-17 16:32:16 (INFO): rank: 0: Sampler created...
2024-12-17 16:32:16 (INFO): Created BalancedBatchSampler with sampler=<fairchem.core.common.data_parallel.StatefulDistributedSampler object at 0x000001F6628E7950>, batch_size=8, drop_last=False

This is all the feedback on how the code is running; the problem is that I can't see how it's training, or whether it's training at all.
I look forward to your reply. Best wishes to you.

lbluque commented Dec 17, 2024

Training errors should be printed to the console during training.

Can you try running it with the --debug flag?
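
For example, appended to your existing command:

python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu --debug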

You can also set logging to WandB to see training and validation curves by setting the following in the config file (you will need to set up and log in to WandB):

logger:
  name: wandb
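
For the login step itself, a minimal sketch (run once from your shell; wandb will prompt for the API key from your account page):

wandb login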

csu-wjc commented Dec 18, 2024

Hello @lbluque.
Thank you for your generous reply. When I run the command with '--debug', the result is shown below:

(fair-chem) PS D:\Desktop\fairchem> python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu --debug
W1218 19:42:09.659000 2932 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
2024-12-18 19:42:09 (INFO): Running in local mode without elastic launch (single gpu only)
2024-12-18 19:42:09 (INFO): Setting env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
2024-12-18 19:42:09 (INFO): Project root: D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\escn\so3.py:23: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
2024-12-18 19:42:12 (INFO): amp: false
cmd:
  checkpoint_dir: D:\Desktop\fairchem\checkpoints\2024-12-18-19-41-52
  commit: null
  identifier: ''
  logs_dir: D:\Desktop\fairchem\logs\wandb\2024-12-18-19-41-52
  print_every: 10
  results_dir: D:\Desktop\fairchem\results\2024-12-18-19-41-52
  seed: 0
  timestamp_id: 2024-12-18-19-41-52
  version: 1.3.0
dataset:
  format: oc22_lmdb
  key_mapping:
    y_relaxed: energy
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train
evaluation_metrics:
  metrics:
    energy:
    - mae
    - mse
    - energy_within_threshold
  primary_metric: energy_mae
gp_gpus: null
gpus: 0
logger: wandb
loss_functions:
- energy:
    coefficient: 1
    fn: mae
model:
  activation: silu
  cbf:
    name: spherical_harmonics
  cutoff: 12.0
  emb_size_atom: 256
  emb_size_bil_trip: 64
  emb_size_cbf: 16
  emb_size_edge: 512
  emb_size_rbf: 64
  emb_size_trip: 64
  envelope:
    exponent: 5
    name: polynomial
  extensive: true
  max_neighbors: 50
  name: gemnet_t
  num_after_skip: 2
  num_atom: 3
  num_before_skip: 1
  num_blocks: 5
  num_concat: 1
  num_radial: 64
  num_spherical: 7
  otf_graph: true
  output_init: HeOrthogonal
  rbf:
    name: gaussian
  regress_forces: false
  scale_file: configs/oc22/scaling_factors/gemnet-dT_c12.json
optim:
  batch_size: 8
  clip_grad_norm: 10
  ema_decay: 0.999
  eval_batch_size: 8
  factor: 0.8
  lr_initial: 0.0001
  max_epochs: 100
  mode: min
  num_workers: 0
  optimizer: AdamW
  optimizer_params:
    amsgrad: true
  patience: 3
  scheduler: ReduceLROnPlateau
outputs:
  energy:
    level: system
    shape: 1
relax_dataset: {}
slurm: {}
task: {}
test_dataset: {}
trainer: ocp
val_dataset:
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/val_id

2024-12-18 19:42:12 (INFO): Loading model: gemnet_t
2024-12-18 19:42:13 (INFO): Loaded GemNetT with 23176469 parameters.
2024-12-18 19:42:13 (INFO): Loading dataset: oc22_lmdb
2024-12-18 19:42:13 (WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
2024-12-18 19:42:13 (WARNING): Disabled BalancedBatchSampler because num_replicas=1.
2024-12-18 19:42:13 (WARNING): Failed to get data sizes, falling back to uniform partitioning. BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.
2024-12-18 19:42:13 (INFO): rank: 0: Sampler created...
2024-12-18 19:42:13 (INFO): Created BalancedBatchSampler with sampler=<fairchem.core.common.data_parallel.StatefulDistributedSampler object at 0x0000021922D50F20>, batch_size=8, drop_last=False

Also, may I ask if there is any way for me to see the training curve while training on a single CPU (without logging in to WandB)?
I look forward to your reply. Best wishes to you.
