
multi-node pp hang when enable gradient accumulation #209

Open
yuuxiaooqingg opened this issue Jul 24, 2024 · 4 comments

@yuuxiaooqingg

I tested Llama 3 continued training across multiple nodes with tp=4, pp=2, dp=2. When gradient accumulation is enabled, the training hangs. Environment: 16x H800, torch 2.1.2+cu121. Config below:

checkpoints:
  checkpoint_interval: 200
  checkpoints_path: exps/llama3_ct/ckpts
  checkpoints_path_is_shared_file_system: true
  resume_checkpoint_path: pretrained_models/Meta-Llama-3-8B
  save_initial_state: false
  no_load_optim: true
data_stages:
- name: Stable Training Stage
  start_training_step: 1
  data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 4
      hf_dataset_config_name: null
      hf_dataset_or_datasets: data/cosmopedia_8193/train
      domain_names:
      - auto_math_text
      - khanacademy
    num_loading_workers: 1
    seed: 42
general:
  benchmark_csv_path: null
  consumed_train_samples: 256000
  ignore_sanity_checks: true
  project: llama_ct
  run: train_cosmopedia
  seed: 42
  step: 500
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 120
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 4096
    initializer_range: 0.02
    intermediate_size: 14336
    is_llama_config: true
    max_position_embeddings: 8192
    num_attention_heads: 32
    num_hidden_layers: 32
    num_key_value_heads: 8
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_scaling: null
    tie_word_embeddings: false
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    torch_adam_is_fused: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 6.0e-05
    lr_decay_steps: 400
    lr_decay_style: cosine
    lr_warmup_steps: 100
    lr_warmup_style: linear
    min_decay_lr: 6.0e-06
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  pp: 2
  pp_engine: 1f1b
  tp: 4
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Meta-Llama-3-8B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 2
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 1
  sequence_length: 8192
  train_steps: 500
  val_check_interval: -1
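
For context on what the config above asks for: with micro_batch_size: 1, batch_accumulation_per_replica: 2 and dp: 2, each data-parallel replica runs two micro-batches per optimizer step, and the gradient all-reduce across dp ranks should only fire on the last one. The sketch below is a generic PyTorch DDP accumulation loop, not nanotron's 1f1b pipeline engine; it only illustrates which collective every rank has to reach, since a rank that skips or adds an accumulation step leaves the others blocked in that all-reduce. The model, micro_batches and loss_fn here are placeholders.

    # Generic gradient-accumulation step with DDP (illustrative only; nanotron's
    # pipeline engine schedules this differently, but the sync requirement is the same).
    import contextlib
    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train_step(model: DDP, optimizer, micro_batches, loss_fn, grad_accum=2):
        optimizer.zero_grad(set_to_none=True)
        for i, (x, y) in enumerate(micro_batches):
            is_last = (i == grad_accum - 1)
            # no_sync() suppresses the gradient all-reduce for non-final micro-batches;
            # the all-reduce only runs during the last backward. Every dp rank must
            # reach that same collective, otherwise the remaining ranks wait forever.
            ctx = contextlib.nullcontext() if is_last else model.no_sync()
            with ctx:
                loss = loss_fn(model(x), y) / grad_accum  # average over micro-batches
                loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip_grad: 1.0
        optimizer.step()

With pipeline parallelism on top, the same agreement also has to hold for the p2p activation/gradient sends and receives of the 1f1b schedule, which is why a schedule mismatch usually shows up as a silent hang rather than an error.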
yuuxiaooqingg changed the title from "multi-machine pp hang" to "multi-node pp hang when enable gradient accumulation" on Jul 24, 2024
@yuuxiaooqingg
Author

@xrsrke Do you know what might be causing this?

@xrsrke
Member

xrsrke commented Aug 2, 2024

@yuuxiaooqingg Hello. At which step did it hang?

@Pclanglais

Pclanglais commented Aug 24, 2024

Same issue here. It hangs just before starting. Seems like gradient communication is not going well…

(tested on 48x4 H100s)
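
One low-effort way to see where the ranks are stuck (and answer the "at which step did it hang" question above) is to have every rank periodically dump its stack traces; the hung ranks will be sitting inside a send/recv or all-reduce, which narrows down which communication is mismatched. A minimal sketch, standard library only, added near the top of the training entry point (the 600-second interval is an arbitrary choice):

    # Hypothetical debugging hook, not part of nanotron: if the process is still
    # running after each 600 s interval, dump all threads' stack traces to stderr,
    # so a hang shows up in the per-rank logs with file/line locations.
    import faulthandler
    import sys

    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

Running py-spy dump --pid <rank pid> on each node gives similar information without touching the code, and setting NCCL_DEBUG=INFO in the environment helps confirm which communicator the stuck ranks are waiting on.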

@Pclanglais

Not sure if it applies in this case, but I've found a fix: disable ZeRO entirely (zero stage 0). ZeRO stage 1 works somewhat when the replicas span a small number of nodes (though not 2 or 3), and it stops working as well on large distributed training setups.
