Traceback (most recent call last):
  File "/home/nanotron/run_train.py", line 233, in <module>
    trainer = DistributedTrainer(config_file)
  File "/home/nanotron/src/nanotron/trainer.py", line 147, in __init__
    self.parallel_context = ParallelContext(
  File "/home/nanotron/src/nanotron/parallel/context.py", line 33, in __init__
    raise ValueError(
ValueError: ('The number of process requires to run all replicas (8)', 'must be equal to the world size (4).')
(the same traceback is printed by the other local ranks)
E1127 09:16:53.081000 23331279684096 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 366) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_train.py FAILED
Change dp, pp, and tp according to your needs in the config_tiny_llama.yaml file under parallelism. The product dp * pp * tp needs to equal the number of GPUs (the world size that torchrun launches), which is exactly what the ValueError above is checking. I was testing the script on a single GPU, so I was able to run training with dp=1, pp=1, and tp=1 as well. A possible 4-GPU layout is sketched below.
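For illustration only, a minimal sketch of the parallelism section of examples/config_tiny_llama.yaml for a 4-GPU run. Only the dp/pp/tp keys mentioned above are shown; whatever defaults ship in your copy currently multiply out to the 8 processes reported in the error, and any other keys in that section stay as they are.

parallelism:
  dp: 2  # data-parallel replicas
  pp: 1  # pipeline-parallel stages
  tp: 2  # tensor-parallel ranks
# dp * pp * tp = 4, which must match torchrun --nproc_per_node (the world size)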
run:
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 run_train.py --config-file examples/config_tiny_llama.yaml
How do I run with nproc_per_node=4, 2, or 1? (Some matching dp/pp/tp combinations are sketched below.)
errors: the same ValueError traceback as at the top of this issue (8 required processes vs. a world size of 4).
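Not an official answer, but assuming the only constraint is the one the ValueError states (dp * pp * tp must equal the world size, i.e. --nproc_per_node), these pairings would be consistent:

parallelism:  # example that works with --nproc_per_node=2
  dp: 2
  pp: 1
  tp: 1
# --nproc_per_node=4: any combination whose product is 4, e.g. dp: 4, pp: 1, tp: 1 or dp: 2, pp: 1, tp: 2
# --nproc_per_node=1: dp: 1, pp: 1, tp: 1 (confirmed working in the comment above)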