
While using the Megatron distributed flash checkpoint to recover, an error occurs in load_checkpoint #1233

Open
deepcoldfish opened this issue Aug 13, 2024 · 3 comments

@deepcoldfish

Env: 16 GPUs + Llama-2 pretraining + Megatron-LM
Strategy: TP 8 + PP 1 + DP 2
Case: after killing a training process to retrigger fault tolerance with the Megatron distributed flash checkpoint, load_checkpoint fails in the DP 1 group with the following log:

WARNING: on rank 11 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 10 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 14 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.

The reason is that the DP 1 group has no model state in memory, so it loads the checkpoint from storage and issues an allreduce inside read_metadata, while the DP 0 group only loads from memory and never enters that code path.
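To make the failure mode concrete, here is a minimal, hypothetical sketch (my own illustration, not the DLRover or Megatron-LM code) of a collective issued inside a branch that only part of the world takes; the function and parameter names are invented, and `read_iteration_from_metadata` merely stands in for Megatron's read_metadata:

```python
import torch
import torch.distributed as dist


def read_iteration_from_metadata(tracker_path: str) -> int:
    # Stand-in for Megatron's read_metadata(): parse the iteration tracker file.
    with open(tracker_path) as f:
        return int(f.read().strip())


def load_iteration_buggy(dp_rank: int, tracker_path: str, in_memory_iteration: int) -> int:
    if dp_rank != 0:
        # Only the dp_rank != 0 ranks reach this all_reduce; the dp_rank == 0
        # ranks never issue a matching call, so this either hangs or pairs
        # with an unrelated collective and returns garbage (e.g. 4160813071).
        iteration = read_iteration_from_metadata(tracker_path)
        iters = torch.tensor([iteration], dtype=torch.long, device="cuda")
        dist.all_reduce(iters, op=dist.ReduceOp.MAX)
        return int(iters.item())
    # dp_rank == 0 ranks restore from the in-memory checkpoint and skip the sync entirely.
    return in_memory_iteration
```

Because the branch condition differs across ranks, the all_reduce on the DP 1 ranks has no partner on the DP 0 ranks, which is how a bogus max iteration such as 4160813071 shows up.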

@BalaBalaYi
Collaborator

Can you provide more information? The more detailed, the better.
E.g.: details of the kill (at which checkpoint step did it fail? which checkpoint step is loaded after failover?)

@deepcoldfish
Author

deepcoldfish commented Nov 12, 2024

Can you provide more information? The more detailed, the better. E.g.: details of the kill (at which checkpoint step did it fail? which checkpoint step is loaded after failover?)

During training, after a checkpoint has been saved to memory or storage, I kill a training process (on node 1) to retrigger a restart of the training cluster.

After the restart, all nodes recover from memory.

When dp_rank != 0, model_state_dict is empty, so execution goes into this branch and reaches read_metadata. Nodes with dp_rank = 0 have model_state_dict in memory and do not enter this branch.

read_metadata triggers a global sync across the whole process group, and this is what causes the step to fail.
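One possible workaround (a sketch under my own assumptions, not a maintainer-provided fix) is to make the load-source decision itself collective, so that either every rank calls read_metadata or none does. `agree_on_load_source` and `load_iteration` are hypothetical names, and `read_iteration_from_metadata` is the stand-in helper from the sketch above:

```python
from typing import Optional

import torch
import torch.distributed as dist


def agree_on_load_source(has_memory_ckpt: bool) -> bool:
    # Every rank learns whether *all* ranks still hold the checkpoint in memory:
    # MIN over 0/1 flags is a logical AND across the whole world.
    flag = torch.tensor([1 if has_memory_ckpt else 0], dtype=torch.long, device="cuda")
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())


def load_iteration(tracker_path: str, in_memory_iteration: Optional[int]) -> int:
    if agree_on_load_source(in_memory_iteration is not None):
        # All ranks still have the memory checkpoint, so nobody enters the
        # storage path and no read_metadata collective is issued at all.
        return in_memory_iteration
    # At least one rank lost its memory checkpoint: every rank takes the
    # storage path, so the all_reduce inside read_metadata is matched.
    iteration = read_iteration_from_metadata(tracker_path)  # helper from the sketch above
    iters = torch.tensor([iteration], dtype=torch.long, device="cuda")
    dist.all_reduce(iters, op=dist.ReduceOp.MAX)
    return int(iters.item())
```

With the branch decision agreed on up front, the DP 0 and DP 1 groups can no longer diverge on whether the metadata sync happens.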

@BalaBalaYi added this to the Backlog milestone Nov 18, 2024
@BalaBalaYi added the investigating and todo labels Nov 18, 2024
@BalaBalaYi
Collaborator

What is your code version (commit id)?

@BalaBalaYi removed the todo label Nov 18, 2024