Training the model on a single CPU or GPU #944

Open
csu-wjc opened this issue Dec 13, 2024 · 6 comments

csu-wjc commented Dec 13, 2024

What would you like to report?

When I run the model on a single CPU with the command 'python main.py --mode train --config-yml configs/oc22/is2re/painn/painn.yml', I get the following error:

(WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
(WARNING): Disabled BalancedBatchSampler because num_replicas=1.
[rank0]: Traceback (most recent call last):
[rank0]: File "D:\Desktop\fairchem\main.py", line 10, in <module>
[rank0]: main()
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\_cli.py", line 135, in main
[rank0]: runner_wrapper(config)
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\_cli.py", line 58, in runner_wrapper
[rank0]: Runner()(config)
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\_cli.py", line 37, in __call__
[rank0]: with new_trainer_context(config=config) as ctx:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\contextlib.py", line 137, in __enter__
[rank0]: return next(self.gen)
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\utils.py", line 1102, in new_trainer_context
[rank0]: trainer = trainer_cls(**trainer_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\ocp_trainer.py", line 109, in __init__
[rank0]: super().__init__(
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 220, in __init__
[rank0]: self.load(inference_only)
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 246, in load
[rank0]: self.load_datasets()
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 365, in load_datasets
[rank0]: self.train_sampler = self.get_sampler(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\trainers\base_trainer.py", line 313, in get_sampler
[rank0]: return BalancedBatchSampler(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\data_parallel.py", line 171, in __init__
[rank0]: raise error
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\data_parallel.py", line 168, in __init__
[rank0]: dataset = _ensure_supported(dataset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\common\data_parallel.py", line 113, in _ensure_supported
[rank0]: raise UnsupportedDatasetError(
[rank0]: fairchem.core.datasets.base_dataset.UnsupportedDatasetError: BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.

lbluque commented Dec 13, 2024

Hi @csu-wjc,

Since you are using a single GPU and do not need the load balancing, you should remove the optim.load_balancing entry from your config, i.e. remove this line altogether, and then training can run without the metadata.npz file.
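
For reference, the entry in question sits under the optim section of the YAML config and looks roughly like this (a sketch for orientation only; the surrounding keys in your file may differ):

optim:
  batch_size: 8
  # ... other optimizer settings ...
  load_balancing: atoms  # <- delete this line for single-device training

With that line removed, the trainer falls back to a plain sampler and no longer needs the per-sample atom counts from metadata.npz.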

csu-wjc commented Dec 14, 2024

Hello @lbluque.
Thank you for your generous response. I am confident that I have removed 'load_balancing: atoms', but when I run 'python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml', the following output is generated:
wandb: You chose "Don't visualize my results"
wandb: WARNING resume will be ignored since W&B syncing is set to offline. Starting a new run with run id 2024-12-14-12-50-08.
wandb: Tracking run with wandb version 0.18.7
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
2024-12-14 12:50:46 (INFO): Loading model: gemnet_t
2024-12-14 12:50:49 (INFO): Loaded GemNetT with 23176469 parameters.
2024-12-14 12:50:49 (INFO): Loading dataset: oc22_lmdb
2024-12-14 12:50:49 (WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
2024-12-14 12:50:49 (WARNING): Disabled BalancedBatchSampler because num_replicas=1.
2024-12-14 12:50:49 (WARNING): Failed to get data sizes, falling back to uniform partitioning. BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.
2024-12-14 12:50:49 (INFO): rank: 0: Sampler created...
2024-12-14 12:50:49 (INFO): Created BalancedBatchSampler with sampler=<fairchem.core.common.data_parallel.StatefulDistributedSampler object at 0x000001E11893EC90>, batch_size=8, drop_last=False

Also, I would like to know how to run the model on a single CPU. The result of the command 'python -u main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu' is the same as above. Thank you again for your reply.

lbluque commented Dec 16, 2024

@csu-wjc, are you getting an error when running this? If so, can you paste the traceback or more information so we can understand the issue?

csu-wjc commented Dec 17, 2024

Hello @lbluque.
Thank you very much for your reply. When I run the command, the result is shown below:

(fair-chem) PS D:\Desktop\fairchem> python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu
W1217 16:32:02.571000 4820 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
2024-12-17 16:32:02 (INFO): Running in local mode without elastic launch (single gpu only)
2024-12-17 16:32:02 (INFO): Setting env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
2024-12-17 16:32:02 (INFO): Project root: D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\scn\spherical_harmonics.py:23: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\equiformer_v2\wigner.py:10: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\escn\so3.py:23: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
2024-12-17 16:32:06 (INFO): amp: false
cmd:
  checkpoint_dir: D:\Desktop\fairchem\checkpoints\2024-12-17-16-32-00
  commit: null
  identifier: ''
  logs_dir: D:\Desktop\fairchem\logs\wandb\2024-12-17-16-32-00
  print_every: 10
  results_dir: D:\Desktop\fairchem\results\2024-12-17-16-32-00
  seed: 0
  timestamp_id: 2024-12-17-16-32-00
  version: 1.3.0
dataset:
  format: oc22_lmdb
  key_mapping:
    y_relaxed: energy
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train
evaluation_metrics:
  metrics:
    energy:
    - mae
    - mse
    - energy_within_threshold
  primary_metric: energy_mae
gp_gpus: null
gpus: 0
logger: wandb
loss_functions:
- energy:
    coefficient: 1
    fn: mae
model:
  activation: silu
  cbf:
    name: spherical_harmonics
  cutoff: 12.0
  emb_size_atom: 256
  emb_size_bil_trip: 64
  emb_size_cbf: 16
  emb_size_edge: 512
  emb_size_rbf: 64
  emb_size_trip: 64
  envelope:
    exponent: 5
    name: polynomial
  extensive: true
  max_neighbors: 50
  name: gemnet_t
  num_after_skip: 2
  num_atom: 3
  num_before_skip: 1
  num_blocks: 5
  num_concat: 1
  num_radial: 64
  num_spherical: 7
  otf_graph: true
  output_init: HeOrthogonal
  rbf:
    name: gaussian
  regress_forces: false
  scale_file: configs/oc22/scaling_factors/gemnet-dT_c12.json
optim:
  batch_size: 8
  clip_grad_norm: 10
  ema_decay: 0.999
  eval_batch_size: 8
  factor: 0.8
  lr_initial: 0.0001
  max_epochs: 100
  mode: min
  num_workers: 0
  optimizer: AdamW
  optimizer_params:
    amsgrad: true
  patience: 3
  scheduler: ReduceLROnPlateau
outputs:
  energy:
    level: system
    shape: 1
relax_dataset: {}
slurm: {}
task: {}
test_dataset: {}
trainer: ocp
val_dataset:
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/val_id

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: WARNING resume will be ignored since W&B syncing is set to offline. Starting a new run with run id 2024-12-17-16-32-00.
wandb: Tracking run with wandb version 0.18.7
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
2024-12-17 16:32:13 (INFO): Loading model: gemnet_t
2024-12-17 16:32:16 (INFO): Loaded GemNetT with 23176469 parameters.
2024-12-17 16:32:16 (INFO): Loading dataset: oc22_lmdb
2024-12-17 16:32:16 (WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
2024-12-17 16:32:16 (WARNING): Disabled BalancedBatchSampler because num_replicas=1.
2024-12-17 16:32:16 (WARNING): Failed to get data sizes, falling back to uniform partitioning. BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.
2024-12-17 16:32:16 (INFO): rank: 0: Sampler created...
2024-12-17 16:32:16 (INFO): Created BalancedBatchSampler with sampler=<fairchem.core.common.data_parallel.StatefulDistributedSampler object at 0x000001F6628E7950>, batch_size=8, drop_last=False

This is all the feedback on how the code is running; the problem is that I can't see how it's training, or whether it's training at all.
I look forward to your reply. Best wishes to you.

lbluque commented Dec 17, 2024

Training errors should be printed to the console during training.

Can you try running it with the --debug flag?
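
For example, appended to your existing command:

python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu --debug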

You can also set logging to WandB to see training and validation curves by setting the following in the config file (you will need to set up and log in to WandB):

logger:
  name: wandb
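
For the login step itself, a minimal sketch (run once from your shell; wandb will prompt for the API key from your account page):

wandb login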

csu-wjc commented Dec 18, 2024

Hello @lbluque.
Thank you for your generous reply. When I run the command with '--debug', the result is shown below:

(fair-chem) PS D:\Desktop\fairchem> python main.py --mode train --config-yml configs/oc22/is2re/gemnet-dT/gemnet-dT.yml --cpu --debug
W1218 19:42:09.659000 2932 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
2024-12-18 19:42:09 (INFO): Running in local mode without elastic launch (single gpu only)
2024-12-18 19:42:09 (INFO): Setting env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
2024-12-18 19:42:09 (INFO): Project root: D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem
D:\Miniconda\envs\fair-chem\Lib\site-packages\fairchem\core\models\escn\so3.py:23: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
_Jd = torch.load(os.path.join(os.path.dirname(__file__), "Jd.pt"))
2024-12-18 19:42:12 (INFO): amp: false
cmd:
  checkpoint_dir: D:\Desktop\fairchem\checkpoints\2024-12-18-19-41-52
  commit: null
  identifier: ''
  logs_dir: D:\Desktop\fairchem\logs\wandb\2024-12-18-19-41-52
  print_every: 10
  results_dir: D:\Desktop\fairchem\results\2024-12-18-19-41-52
  seed: 0
  timestamp_id: 2024-12-18-19-41-52
  version: 1.3.0
dataset:
  format: oc22_lmdb
  key_mapping:
    y_relaxed: energy
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train
evaluation_metrics:
  metrics:
    energy:
    - mae
    - mse
    - energy_within_threshold
  primary_metric: energy_mae
gp_gpus: null
gpus: 0
logger: wandb
loss_functions:
- energy:
    coefficient: 1
    fn: mae
model:
  activation: silu
  cbf:
    name: spherical_harmonics
  cutoff: 12.0
  emb_size_atom: 256
  emb_size_bil_trip: 64
  emb_size_cbf: 16
  emb_size_edge: 512
  emb_size_rbf: 64
  emb_size_trip: 64
  envelope:
    exponent: 5
    name: polynomial
  extensive: true
  max_neighbors: 50
  name: gemnet_t
  num_after_skip: 2
  num_atom: 3
  num_before_skip: 1
  num_blocks: 5
  num_concat: 1
  num_radial: 64
  num_spherical: 7
  otf_graph: true
  output_init: HeOrthogonal
  rbf:
    name: gaussian
  regress_forces: false
  scale_file: configs/oc22/scaling_factors/gemnet-dT_c12.json
optim:
  batch_size: 8
  clip_grad_norm: 10
  ema_decay: 0.999
  eval_batch_size: 8
  factor: 0.8
  lr_initial: 0.0001
  max_epochs: 100
  mode: min
  num_workers: 0
  optimizer: AdamW
  optimizer_params:
    amsgrad: true
  patience: 3
  scheduler: ReduceLROnPlateau
outputs:
  energy:
    level: system
    shape: 1
relax_dataset: {}
slurm: {}
task: {}
test_dataset: {}
trainer: ocp
val_dataset:
  src: D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/val_id

2024-12-18 19:42:12 (INFO): Loading model: gemnet_t
2024-12-18 19:42:13 (INFO): Loaded GemNetT with 23176469 parameters.
2024-12-18 19:42:13 (INFO): Loading dataset: oc22_lmdb
2024-12-18 19:42:13 (WARNING): Could not find dataset metadata.npz files in '[WindowsPath('D:/Miniconda/envs/fair-chem/Lib/site-packages/fairchem/data/oc22/is2re-total/train')]'
2024-12-18 19:42:13 (WARNING): Disabled BalancedBatchSampler because num_replicas=1.
2024-12-18 19:42:13 (WARNING): Failed to get data sizes, falling back to uniform partitioning. BalancedBatchSampler requires a dataset that has a metadata attributed with number of atoms.
2024-12-18 19:42:13 (INFO): rank: 0: Sampler created...
2024-12-18 19:42:13 (INFO): Created BalancedBatchSampler with sampler=<fairchem.core.common.data_parallel.StatefulDistributedSampler object at 0x0000021922D50F20>, batch_size=8, drop_last=False

Also, may I ask if there is any way for me to see the training curve while training on a single CPU (without logging in to WandB)?
I look forward to your reply. Best wishes to you.
