
multi-node pp hang when enable gradient accumulation #209

Open
yuuxiaooqingg opened this issue Jul 24, 2024 · 4 comments

@yuuxiaooqingg

I tested Llama 3 continued training across multiple nodes with tp=4, pp=2, dp=2. When gradient accumulation is enabled, the training hangs. Environment: 16x H800, torch 2.1.2+cu121. Config below:

checkpoints:
  checkpoint_interval: 200
  checkpoints_path: exps/llama3_ct/ckpts
  checkpoints_path_is_shared_file_system: true
  resume_checkpoint_path: pretrained_models/Meta-Llama-3-8B
  save_initial_state: false
  no_load_optim: true
data_stages:
- name: Stable Training Stage
  start_training_step: 1
  data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 4
      hf_dataset_config_name: null
      hf_dataset_or_datasets: data/cosmopedia_8193/train
      domain_names:
      - auto_math_text
      - khanacademy
    num_loading_workers: 1
    seed: 42
general:
  benchmark_csv_path: null
  consumed_train_samples: 256000
  ignore_sanity_checks: true
  project: llama_ct
  run: train_cosmopedia
  seed: 42
  step: 500
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 120
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 4096
    initializer_range: 0.02
    intermediate_size: 14336
    is_llama_config: true
    max_position_embeddings: 8192
    num_attention_heads: 32
    num_hidden_layers: 32
    num_key_value_heads: 8
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_scaling: null
    tie_word_embeddings: false
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    torch_adam_is_fused: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 6.0e-05
    lr_decay_steps: 400
    lr_decay_style: cosine
    lr_warmup_steps: 100
    lr_warmup_style: linear
    min_decay_lr: 6.0e-06
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  pp: 2
  pp_engine: 1f1b
  tp: 4
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Meta-Llama-3-8B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 2
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 1
  sequence_length: 8192
  train_steps: 500
  val_check_interval: -1
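
For context on what the config above asks for: with micro_batch_size: 1, batch_accumulation_per_replica: 2 and dp: 2, each data-parallel replica runs two micro-batches per optimizer step, and the gradient all-reduce across dp ranks should only fire on the last one. The sketch below is a generic PyTorch DDP accumulation loop, not nanotron's 1f1b pipeline engine; it only illustrates which collective every rank has to reach, since a rank that skips or adds an accumulation step leaves the others blocked in that all-reduce. The model, micro_batches and loss_fn here are placeholders.

    # Generic gradient-accumulation step with DDP (illustrative only; nanotron's
    # pipeline engine schedules this differently, but the sync requirement is the same).
    import contextlib
    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train_step(model: DDP, optimizer, micro_batches, loss_fn, grad_accum=2):
        optimizer.zero_grad(set_to_none=True)
        for i, (x, y) in enumerate(micro_batches):
            is_last = (i == grad_accum - 1)
            # no_sync() suppresses the gradient all-reduce for non-final micro-batches;
            # the all-reduce only runs during the last backward. Every dp rank must
            # reach that same collective, otherwise the remaining ranks wait forever.
            ctx = contextlib.nullcontext() if is_last else model.no_sync()
            with ctx:
                loss = loss_fn(model(x), y) / grad_accum  # average over micro-batches
                loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip_grad: 1.0
        optimizer.step()

With pipeline parallelism on top, the same agreement also has to hold for the p2p activation/gradient sends and receives of the 1f1b schedule, which is why a schedule mismatch usually shows up as a silent hang rather than an error.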
yuuxiaooqingg changed the title from "multi-machine pp hang" to "multi-node pp hang when enable gradient accumulation" on Jul 24, 2024
@yuuxiaooqingg
Author

@xrsrke Do you know what might be causing this?

@xrsrke
Member

xrsrke commented Aug 2, 2024

@yuuxiaooqingg Hello. At which step did it hang?

@Pclanglais

Pclanglais commented Aug 24, 2024

Same issue here. It hangs just before starting. Seems like gradient communication is not going well…

(tested on 48x4 H100s)
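
One low-effort way to see where the ranks are stuck (and answer the "at which step did it hang" question above) is to have every rank periodically dump its stack traces; the hung ranks will be sitting inside a send/recv or all-reduce, which narrows down which communication is mismatched. A minimal sketch, standard library only, added near the top of the training entry point (the 600-second interval is an arbitrary choice):

    # Hypothetical debugging hook, not part of nanotron: if the process is still
    # running after each 600 s interval, dump all threads' stack traces to stderr,
    # so a hang shows up in the per-rank logs with file/line locations.
    import faulthandler
    import sys

    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

Running py-spy dump --pid <rank pid> on each node gives similar information without touching the code, and setting NCCL_DEBUG=INFO in the environment helps confirm which communicator the stuck ranks are waiting on.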

@Pclanglais

Not sure if it applies in this case, but I've found a fix: disable ZeRO entirely (zero stage 0). ZeRO stage 1 works somewhat when the replicas span a small number of nodes (though not 2 or 3), and it stops working as well on large distributed training setups.
