RuntimeError: NCCL Error 2: unhandled system error #114

Open
jovijovi opened this issue Dec 11, 2024 · 0 comments
When running parallel inference on multiple GPUs with xDiT, the following error is encountered.
Did I miss something?

  • Host OS: Ubuntu 22.04
  • GPU: 8 * NVIDIA L20 (48GB)
  • MEMORY: 1024 GB
  • Docker: Docker version 27.4.0, build bde2b89
  • NVIDIA packages on the Host:
ii  libnvidia-container-tools              1.17.3-1                                amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.17.3-1                                amd64        NVIDIA container runtime library
ii  nvidia-container-runtime               3.14.0-1                                all          NVIDIA Container Toolkit meta-package
ii  nvidia-container-toolkit               1.17.3-1                                amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base          1.17.3-1                                amd64        NVIDIA Container Toolkit Base
  • Docker /etc/docker/daemon.json:
{
  "runtimes": {
      "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
      }
  }
}
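As a side note, the NVIDIA runtime configured above can be sanity-checked on the host before involving Compose at all. This is an illustrative command (the CUDA image tag is an assumption; any CUDA base image should do):
# If this fails, the problem is in the container toolkit setup, not in Compose.
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi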
  • Docker Image: thufeifeibear/hunyuanvideo:latest (9e45c4f03d71)
  • Docker Compose file test.yaml:
services:
  hunyuan-video-1:
    container_name: hunyuan_video_1
    image: thufeifeibear/hunyuanvideo:latest
    restart: always
    volumes:
      - /data/hunyuan/models:/data/hunyuan/models

    entrypoint: ["/bin/bash"]
    command: ["-c", "tail -f /dev/null"]

    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0', '1', '2', '3', '4', '5', '6', '7' ]
              capabilities: [ gpu ]
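One possible culprit worth flagging (an assumption on my part, not something I have verified for this image): the compose file above sets no IPC or shared-memory options, and NCCL inside a container commonly fails with "unhandled system error" when Docker's default 64 MB /dev/shm is too small for its buffers. A minimal sketch of the service-level keys that could address this:
    # Hypothetical additions under the hunyuan-video-1 service:
    ipc: host            # share the host IPC namespace with the container
    # or, alternatively, enlarge /dev/shm instead:
    # shm_size: '16gb'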
  • Run a Docker container:
docker-compose -f test.yaml up -d
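Once the container is up, GPU visibility inside it can be confirmed before launching the job (assuming nvidia-smi is on the image's PATH):
docker exec hunyuan_video_1 nvidia-smi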
  • Run the following command inside the Docker container:
torchrun --nproc_per_node=2 sample_video.py --video-size 640 360 --video-length 49 --infer-steps 50 --prompt 'A cat walks on the grass, realistic style.' --flow-reverse --seed 42 --ulysses-degree 2 --ring-degree 1 --save-path ./results
  • Error message:
(myenv) root@fb5f3d12661e:/data/hunyuan/models/HunyuanVideo# ./run.sh
+ torchrun --nproc_per_node=2 sample_video.py --video-size 640 360 --video-length 49 --infer-steps 50 --prompt 'A cat walks on the grass, realistic style.' --flow-reverse --seed 42 --ulysses-degree 2 --ring-degree 1 --save-path ./results
W1211 08:25:27.222000 139773546530624 torch/distributed/run.py:779] 
W1211 08:25:27.222000 139773546530624 torch/distributed/run.py:779] *****************************************
W1211 08:25:27.222000 139773546530624 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1211 08:25:27.222000 139773546530624 torch/distributed/run.py:779] *****************************************
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[640, 360], video_length=49, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=2, ring_degree=1)
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[640, 360], video_length=49, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=2, ring_degree=1)
2024-12-11 08:25:29.215 | INFO     | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
2024-12-11 08:25:29.215 | INFO     | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
DEBUG 12-11 08:25:29 [parallel_state.py:179] world_size=2 rank=1 local_rank=-1 distributed_init_method=env:// backend=nccl
DEBUG 12-11 08:25:29 [parallel_state.py:179] world_size=2 rank=0 local_rank=-1 distributed_init_method=env:// backend=nccl
2024-12-11 08:25:29.223 | INFO     | hyvideo.inference:from_pretrained:188 - Building model...
2024-12-11 08:25:29.223 | INFO     | hyvideo.inference:from_pretrained:188 - Building model...
2024-12-11 08:25:29.442 | INFO     | hyvideo.inference:load_state_dict:337 - Loading torch model ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt...
/data/hunyuan/models/HunyuanVideo/hyvideo/inference.py:338: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
2024-12-11 08:25:29.458 | INFO     | hyvideo.inference:load_state_dict:337 - Loading torch model ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt...
/data/hunyuan/models/HunyuanVideo/hyvideo/inference.py:338: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
2024-12-11 08:25:42.487 | INFO     | hyvideo.vae:load_vae:29 - Loading 3D VAE model (884-16c-hy) from: ./ckpts/hunyuan-video-t2v-720p/vae
2024-12-11 08:25:42.803 | INFO     | hyvideo.vae:load_vae:29 - Loading 3D VAE model (884-16c-hy) from: ./ckpts/hunyuan-video-t2v-720p/vae
/data/hunyuan/models/HunyuanVideo/hyvideo/vae/__init__.py:39: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ckpt = torch.load(vae_ckpt, map_location=vae.device)
/data/hunyuan/models/HunyuanVideo/hyvideo/vae/__init__.py:39: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ckpt = torch.load(vae_ckpt, map_location=vae.device)
2024-12-11 08:25:44.264 | INFO     | hyvideo.vae:load_vae:55 - VAE to dtype: torch.float16
2024-12-11 08:25:44.558 | INFO     | hyvideo.text_encoder:load_text_encoder:28 - Loading text encoder model (llm) from: ./ckpts/text_encoder
2024-12-11 08:25:44.596 | INFO     | hyvideo.vae:load_vae:55 - VAE to dtype: torch.float16
2024-12-11 08:25:44.705 | INFO     | hyvideo.text_encoder:load_text_encoder:28 - Loading text encoder model (llm) from: ./ckpts/text_encoder
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.44s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:09<00:00,  2.40s/it]
2024-12-11 08:26:00.917 | INFO     | hyvideo.text_encoder:load_text_encoder:50 - Text encoder to dtype: torch.float16
2024-12-11 08:26:00.926 | INFO     | hyvideo.text_encoder:load_text_encoder:50 - Text encoder to dtype: torch.float16
2024-12-11 08:26:03.925 | INFO     | hyvideo.text_encoder:load_tokenizer:64 - Loading tokenizer (llm) from: ./ckpts/text_encoder
2024-12-11 08:26:03.977 | INFO     | hyvideo.text_encoder:load_tokenizer:64 - Loading tokenizer (llm) from: ./ckpts/text_encoder
2024-12-11 08:26:04.277 | INFO     | hyvideo.text_encoder:load_text_encoder:28 - Loading text encoder model (clipL) from: ./ckpts/text_encoder_2
2024-12-11 08:26:04.317 | INFO     | hyvideo.text_encoder:load_text_encoder:28 - Loading text encoder model (clipL) from: ./ckpts/text_encoder_2
2024-12-11 08:26:04.384 | INFO     | hyvideo.text_encoder:load_text_encoder:50 - Text encoder to dtype: torch.float16
2024-12-11 08:26:04.419 | INFO     | hyvideo.text_encoder:load_tokenizer:64 - Loading tokenizer (clipL) from: ./ckpts/text_encoder_2
2024-12-11 08:26:04.424 | INFO     | hyvideo.text_encoder:load_text_encoder:50 - Text encoder to dtype: torch.float16
2024-12-11 08:26:04.458 | INFO     | hyvideo.text_encoder:load_tokenizer:64 - Loading tokenizer (clipL) from: ./ckpts/text_encoder_2
2024-12-11 08:26:04.484 | INFO     | hyvideo.inference:predict:581 - Input (height, width, video_length) = (640, 360, 49)
2024-12-11 08:26:04.491 | DEBUG    | hyvideo.inference:predict:641 - 
                        height: 640
                         width: 368
                  video_length: 49
                        prompt: ['A cat walks on the grass, realistic style.']
                    neg_prompt: ['Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion']
                          seed: 42
                   infer_steps: 50
         num_videos_per_prompt: 1
                guidance_scale: 1.0
                      n_tokens: 11960
                    flow_shift: 7.0
       embedded_guidance_scale: 6.0
2024-12-11 08:26:04.523 | INFO     | hyvideo.inference:predict:581 - Input (height, width, video_length) = (640, 360, 49)
2024-12-11 08:26:04.531 | DEBUG    | hyvideo.inference:predict:641 - 
                        height: 640
                         width: 368
                  video_length: 49
                        prompt: ['A cat walks on the grass, realistic style.']
                    neg_prompt: ['Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion']
                          seed: 42
                   infer_steps: 50
         num_videos_per_prompt: 1
                guidance_scale: 1.0
                      n_tokens: 11960
                    flow_shift: 7.0
       embedded_guidance_scale: 6.0
  0%|          | 0/50 [00:00<?, ?it/s]
  0%|          | 0/50 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/sample_video.py", line 58, in <module>
[rank1]:     main()
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/sample_video.py", line 32, in main
[rank1]:     outputs = hunyuan_video_sampler.predict(
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/inference.py", line 647, in predict
[rank1]:     samples = self.pipeline(
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py", line 991, in __call__
[rank1]:     noise_pred = self.transformer(  # For an input image (129, 192, 336) (1, 256, 256)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/inference.py", line 84, in new_forward
[rank1]:     output = original_forward(
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/modules/models.py", line 667, in forward
[rank1]:     img, txt = block(*double_block_args)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/modules/models.py", line 215, in forward
[rank1]:     attn = parallel_attention(
[rank1]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/modules/attenion.py", line 169, in parallel_attention
[rank1]:     attn1 = hybrid_seq_parallel_attn(
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/xfuser/core/long_ctx_attention/hybrid/attn_layer.py", line 146, in forward
[rank1]:     query_layer = SeqAllToAll4D.apply(
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
[rank1]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/yunchang/comm/all_to_all.py", line 122, in forward
[rank1]:     return all_to_all_4D(input, scatter_idx, gather_idx, group=group, use_sync=use_sync)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/yunchang/comm/all_to_all.py", line 56, in all_to_all_4D
[rank1]:     dist.all_to_all_single(output, input_t, group=group)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3770, in all_to_all_single
[rank1]:     work = group.alltoall_base(
[rank1]: RuntimeError: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/sample_video.py", line 58, in <module>
[rank0]:     main()
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/sample_video.py", line 32, in main
[rank0]:     outputs = hunyuan_video_sampler.predict(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/inference.py", line 647, in predict
[rank0]:     samples = self.pipeline(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/diffusion/pipelines/pipeline_hunyuan_video.py", line 991, in __call__
[rank0]:     noise_pred = self.transformer(  # For an input image (129, 192, 336) (1, 256, 256)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/inference.py", line 84, in new_forward
[rank0]:     output = original_forward(
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/modules/models.py", line 667, in forward
[rank0]:     img, txt = block(*double_block_args)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/modules/models.py", line 215, in forward
[rank0]:     attn = parallel_attention(
[rank0]:   File "/data/hunyuan/models/HunyuanVideo/hyvideo/modules/attenion.py", line 169, in parallel_attention
[rank0]:     attn1 = hybrid_seq_parallel_attn(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/xfuser/core/long_ctx_attention/hybrid/attn_layer.py", line 146, in forward
[rank0]:     query_layer = SeqAllToAll4D.apply(
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/yunchang/comm/all_to_all.py", line 122, in forward
[rank0]:     return all_to_all_4D(input, scatter_idx, gather_idx, group=group, use_sync=use_sync)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/yunchang/comm/all_to_all.py", line 56, in all_to_all_4D
[rank0]:     dist.all_to_all_single(output, input_t, group=group)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3770, in all_to_all_single
[rank0]:     work = group.alltoall_base(
[rank0]: RuntimeError: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
W1211 08:26:06.780000 139773546530624 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 6832 closing signal SIGTERM
E1211 08:26:06.844000 139773546530624 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 6833) of binary: /opt/conda/envs/myenv/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sample_video.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-11_08:26:06
  host      : fb5f3d12661e
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 6833)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(myenv) root@fb5f3d12661e:/data/hunyuan/models/HunyuanVideo# 
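As the NCCL error itself suggests, re-running with debug logging enabled should reveal which system call fails; this is the same command as above with only the environment variable added:
NCCL_DEBUG=INFO torchrun --nproc_per_node=2 sample_video.py --video-size 640 360 --video-length 49 --infer-steps 50 --prompt 'A cat walks on the grass, realistic style.' --flow-reverse --seed 42 --ulysses-degree 2 --ring-degree 1 --save-path ./results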