ValueError: ('The number of process requires to run all replicas (8)', 'must be equal to the world size (4).') #250

Open
sankexin opened this issue Nov 27, 2024 · 2 comments

Comments

@sankexin

sankexin commented Nov 27, 2024

run:
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 run_train.py --config-file examples/config_tiny_llama.yaml

How can I run it with --nproc_per_node=4, 2, or 1?

errors:

Traceback (most recent call last):
  File "/home/nanotron/run_train.py", line 233, in <module>
    trainer = DistributedTrainer(config_file)
  File "/home/nanotron/src/nanotron/trainer.py", line 147, in __init__
    self.parallel_context = ParallelContext(
  File "/home/nanotron/src/nanotron/parallel/context.py", line 33, in __init__
    raise ValueError(
ValueError: ('The number of process requires to run all replicas (8)', 'must be equal to the world size (4).')
(the same traceback and ValueError are repeated by the other three ranks; their output is interleaved in the original log)
E1127 09:16:53.081000 23331279684096 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 366) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_train.py FAILED
@hz-nm

hz-nm commented Nov 27, 2024

Change dp, pp, and tp under the parallelism section of the config_tiny_llama.yaml file according to your needs. I think dp * pp * tp needs to equal the number of GPUs you launch.
I was testing the script on my single GPU, so I was able to run the training with dp=1, pp=1, and tp=1.
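
For reference, a minimal sketch of that parallelism block for a single-GPU run (the dp/pp/tp key names are the ones mentioned above; the exact layout of config_tiny_llama.yaml may differ in your copy):

parallelism:
  dp: 1  # data-parallel replicas
  pp: 1  # pipeline-parallel stages
  tp: 1  # tensor-parallel shards

The check that raised the ValueError compares the number of processes implied by the config against the world size, so dp * pp * tp has to equal the value passed to --nproc_per_node (1 in this case).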

@sankexin


Thank you, you are right.

when --nproc_per_node=4:
dp=1
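
For completeness, one parallelism block consistent with dp=1 on 4 GPUs would look like the sketch below; how the remaining factor of 4 is split between pp and tp is an assumption, not something stated in the thread:

parallelism:
  dp: 1  # as reported above
  pp: 1  # assumed
  tp: 4  # assumed; dp * pp * tp = 4 matches --nproc_per_node=4

Any other factorization with dp * pp * tp = 4 (for example pp=2, tp=2) should also pass the world-size check.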
