I tested Llama 3 continued training with multi-machine tp4 pp2 dp2. When gradient accumulation is enabled, the training hangs. Experimental environment: 16×H800, torch 2.1.2+cu121.
Not sure if it applies in this case, but I've found a fix: disable the ZeRO stage entirely. ZeRO stage 1 works somewhat with replicas on a small number of nodes (not 2 or 3), but it stops working as well in larger distributed training setups.
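For reference, here is a rough sketch of what a ZeRO stage 1 optimizer step generally adds on top of plain data parallelism (a toy illustration using torch.distributed, not nanotron's actual implementation; zero1_step and the flat-tensor layout are assumptions for the example): two extra collectives that every DP rank has to reach in the same order, which is presumably why removing them with zero_stage: 0 also removes potential hang points.

import torch
import torch.distributed as dist

def zero1_step(flat_params: torch.Tensor, flat_grads: torch.Tensor, lr: float) -> None:
    # Toy ZeRO-1-style step: each DP rank keeps optimizer state (plain SGD here,
    # for brevity) only for its own shard of the parameters. Assumes the flat
    # 1-D tensors are padded so they divide evenly across the DP world size.
    world = dist.get_world_size()
    rank = dist.get_rank()
    grad_shards = list(flat_grads.chunk(world))    # views into flat_grads
    param_shards = list(flat_params.chunk(world))  # views into flat_params

    # Extra collective 1: every rank receives the summed gradients for its shard.
    my_grad = torch.empty_like(grad_shards[rank])
    dist.reduce_scatter(my_grad, grad_shards, op=dist.ReduceOp.SUM)
    my_grad /= world

    # Local update of the owned shard only.
    with torch.no_grad():
        param_shards[rank].add_(my_grad, alpha=-lr)

    # Extra collective 2: gather the updated shards so all ranks see full parameters.
    dist.all_gather(param_shards, param_shards[rank].clone())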
Config used for the run:
checkpoints:
  checkpoint_interval: 200
  checkpoints_path: exps/llama3_ct/ckpts
  checkpoints_path_is_shared_file_system: true
  resume_checkpoint_path: pretrained_models/Meta-Llama-3-8B
  save_initial_state: false
  no_load_optim: true
data_stages:
- start_training_step: 1
  data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 4
      hf_dataset_config_name: null
      hf_dataset_or_datasets: data/cosmopedia_8193/train
      domain_names:
      - auto_math_text
      - khanacademy
    num_loading_workers: 1
    seed: 42
general:
  benchmark_csv_path: null
  consumed_train_samples: 256000
  ignore_sanity_checks: true
  project: llama_ct
  run: train_cosmopedia
  seed: 42
  step: 500
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 120
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 4096
    initializer_range: 0.02
    intermediate_size: 14336
    is_llama_config: true
    max_position_embeddings: 8192
    num_attention_heads: 32
    num_hidden_layers: 32
    num_key_value_heads: 8
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_scaling: null
    tie_word_embeddings: false
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    torch_adam_is_fused: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 6.0e-05
    lr_decay_steps: 400
    lr_decay_style: cosine
    lr_warmup_steps: 100
    lr_warmup_style: linear
    min_decay_lr: 6.0e-06
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  pp: 2
  pp_engine: 1f1b
  tp: 4
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Meta-Llama-3-8B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 2
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 1
  sequence_length: 8192
  train_steps: 500
  val_check_interval: -1
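For context on where a hang like this usually comes from, below is a minimal sketch of the standard gradient-accumulation pattern under data parallelism (plain PyTorch DDP, not nanotron's pipeline engine; train_step, micro_batches and loss_fn are hypothetical names for the example). Gradients are only all-reduced on the last micro-batch of each accumulation window, so if the DP ranks ever disagree about where that boundary falls, the ranks that enter the all-reduce wait forever for the ones that don't.

import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model: DDP, optimizer, micro_batches, loss_fn):
    # micro_batches is a list of (inputs, targets) pairs, e.g. 2 of them
    # when batch_accumulation_per_replica is 2.
    accum = len(micro_batches)
    optimizer.zero_grad(set_to_none=True)
    for i, (inputs, targets) in enumerate(micro_batches):
        is_last = (i == accum - 1)
        # no_sync() skips the gradient all-reduce for all but the last
        # micro-batch; every DP rank must agree on where this boundary is.
        sync_ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with sync_ctx:
            loss = loss_fn(model(inputs), targets) / accum
            loss.backward()  # the all-reduce fires only on the last backward
    optimizer.step()

With batch_accumulation_per_replica: 2 in the config above, each replica would run two micro-batches of sequence length 8192 before a single synchronized optimizer step.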