out of memory for continuing pretraining llama3-8B #161
Comments
Hi, thanks for your question. Here is my solution:

You can also try tp=8, dp=1. Then you should set theta = 500000.0 for RotaryEmbedding and interleaved = False. Please let me know if you have any other questions.
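For reference, here is a minimal, self-contained sketch of what theta = 500000.0 and interleaved = False mean for the rotary embedding (Llama 3 uses a RoPE base of 500000.0 and the HF-style half-rotation layout); the function names below are illustrative, not nanotron's actual API:

```python
# Minimal, framework-agnostic sketch of the suggested RoPE settings.
# `build_rope_cache` and `apply_rope` are illustrative names, not nanotron's API.
import torch

def build_rope_cache(seq_len: int, head_dim: int, theta: float = 500000.0):
    # Llama 3 uses a RoPE base (theta) of 500000.0 instead of Llama 2's 10000.0.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)  # [seq_len, head_dim // 2]
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # interleaved=False ("half rotation", as in HF checkpoints): rotate the first
    # half of the head dim against the second half, instead of even/odd pairs.
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]  # broadcast over batch/heads
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

# Example: batch=1, seq=16, heads=32, head_dim=128 (Llama-3-8B head size)
q = torch.randn(1, 16, 32, 128)
cos, sin = build_rope_cache(seq_len=16, head_dim=128)
q_rot = apply_rope(q, cos, sin)
```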
Thanks for the reply.
If my understanding is correct, ZeRO-1 is not pure DP; that's why you got the error. Some links:
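As a rough illustration of why ZeRO-1 is not pure DP, here is a toy single-process sketch (no real communication; the layout is illustrative, not nanotron's internals): with ZeRO-1 each data-parallel rank only keeps the Adam states for its own shard of the flat parameter buffer, whereas pure DP replicates them on every rank.

```python
# Toy single-process illustration of ZeRO stage 1 vs. pure DP.
import torch

dp_size = 8
params = [torch.randn(1000) for _ in range(16)]  # pretend parameter tensors
flat = torch.cat(params)                          # ZeRO-1 typically works on a flat buffer

# Pure DP: every rank keeps Adam's exp_avg and exp_avg_sq for ALL parameters.
pure_dp_optim_floats_per_rank = 2 * flat.numel()

# ZeRO-1: the flat buffer is split into dp_size contiguous shards; each rank
# only allocates optimizer states for its own shard, then the updated shard
# is all-gathered after the step.
shard = flat.chunk(dp_size)[0]                    # rank 0's slice
zero1_optim_floats_per_rank = 2 * shard.numel()

print(pure_dp_optim_floats_per_rank, zero1_optim_floats_per_rank)  # 32000 vs 4000
```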
Sorry for the confusion. In the docs, I found "activation checkpointing" with the decorator @checkpoint_method(attr_name="do_checkpoint"). I applied it to LLaMAModel and got an error, and I have the following questions:
Hello. Regarding 1 and 2: yes, @checkpoint_method is the same as gradient checkpointing. Thanks a lot for your questions.
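To make the equivalence concrete, here is a hedged sketch of how such a decorator can be built on top of torch.utils.checkpoint; it is illustrative only, not nanotron's actual @checkpoint_method implementation:

```python
# Illustrative sketch of a checkpointing decorator in the spirit of
# @checkpoint_method(attr_name="do_checkpoint"); not nanotron's actual code.
import functools
import torch
from torch.utils.checkpoint import checkpoint

def checkpoint_method(attr_name: str):
    def decorator(forward_fn):
        @functools.wraps(forward_fn)
        def wrapper(self, *args, **kwargs):
            if getattr(self, attr_name, False):
                # Recompute this method's activations during backward instead of
                # storing them, trading extra compute for lower memory.
                return checkpoint(functools.partial(forward_fn, self), *args,
                                  use_reentrant=False, **kwargs)
            return forward_fn(self, *args, **kwargs)
        return wrapper
    return decorator

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)
        self.do_checkpoint = True  # toggled per module

    @checkpoint_method(attr_name="do_checkpoint")
    def forward(self, x):
        return torch.relu(self.linear(x))

out = Block()(torch.randn(4, 64, requires_grad=True))
out.sum().backward()
```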
I am trying to use the framework to continue pretraining llama3-8B. I have converted the HF checkpoint into the nanotron format, and the generated tokens look reasonable.

I use the settings below to train the model, but I get OOM with 8 GPUs. The same setup previously worked with HF accelerate, DeepSpeed ZeRO-1, and flash-attention:
```yaml
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 128000
    eos_token_id: 128001
    hidden_act: silu
    hidden_size: 4096
    initializer_range: 0.02
    intermediate_size: 14336
    is_llama_config: true
    max_position_embeddings: 8192
    num_attention_heads: 32
    num_hidden_layers: 32
    num_key_value_heads: 8
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_scaling: null
    tie_word_embeddings: false
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 8.0e-07
    lr_decay_starting_step: null
    lr_decay_steps: 100
    lr_decay_style: cosine
    lr_warmup_steps: 100
    lr_warmup_style: linear
    min_decay_lr: 1.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.999
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.1
  zero_stage: 1
parallelism:
  dp: 8
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Meta-Llama-3-8B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 1
  sequence_length: 8192
  train_steps: 200
  val_check_interval: -1
```
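For context, a rough back-of-envelope estimate of the static memory this config implies per GPU, assuming ~8B bf16 parameters, an fp32 gradient-accumulation buffer (accumulate_grad_in_fp32: true), and fp32 Adam states sharded over dp=8 by ZeRO-1; actual usage also depends on master weights, allocator overhead, and activation memory at sequence_length 8192:

```python
# Rough, assumption-laden memory estimate for the posted config; the exact
# buffers a framework keeps (e.g. fp32 master weights) may differ.
GB = 1024 ** 3
n_params = 8.0e9           # ~8B parameters for Llama-3-8B
dp = 8                     # zero_stage: 1 shards optimizer state over dp ranks

weights_bf16 = 2 * n_params                 # bf16 parameters
grads_fp32   = 4 * n_params                 # assuming accumulate_grad_in_fp32 keeps an fp32 grad buffer
adam_states  = (4 + 4) * n_params / dp      # fp32 exp_avg + exp_avg_sq, ZeRO-1 sharded

total = weights_bf16 + grads_fp32 + adam_states
print(f"params + grads + optimizer ~ {total / GB:.0f} GB per GPU, before activations")
# ~ 15 + 30 + 7.5 = ~52 GB, leaving little headroom for 8192-token activations on an 80 GB GPU
```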