How to load a checkpoint with the original Megatron-LM after saving it with DLRover's save_checkpoint? #1398

Open
LDH007 opened this issue Dec 19, 2024 · 1 comment
Labels
question Further information is requested

Comments


LDH007 commented Dec 19, 2024

I saved a checkpoint with DLRover's save_checkpoint function. The resulting directory layout looks like this:
iter_0000030/mp_rank_00_000/
iter_0000030/mp_rank_00_001/
iter_0000030/mp_rank_01_000/
...

When I then try to load it with Megatron-LM's load_checkpoint, it fails with:
[Errno 2] No such file or directory: 'iter_0000030/mp_rank_00/model_optim_rng.pt'
could not load the checkpoint

How can I solve this?
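
For anyone debugging the same mismatch, a quick check is to list which rank directories exist under the iteration folder and whether each one contains the file the loader asks for. The sketch below is only a diagnostic under stated assumptions: the file name `model_optim_rng.pt` and the `mp_rank_*` prefix are taken from the error message and listing above, and the helper name `summarize_iteration_dir` is hypothetical, not part of DLRover or Megatron-LM.

```python
from pathlib import Path

# File name taken from the error message in this issue.
CKPT_FILE = "model_optim_rng.pt"


def summarize_iteration_dir(iter_dir):
    """Map each mp_rank_* subdirectory to True if it holds model_optim_rng.pt.

    Hypothetical helper for inspection only; it does not read or modify
    any checkpoint contents.
    """
    root = Path(iter_dir)
    return {
        entry.name: (entry / CKPT_FILE).is_file()
        for entry in sorted(root.glob("mp_rank_*"))
        if entry.is_dir()
    }
```

Calling `summarize_iteration_dir("iter_0000030")` shows at a glance whether a directory named `mp_rank_00` exists at all, and whether any rank directory holds the expected file.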

LDH007 added the question label Dec 19, 2024
@workingloong
Collaborator

You can convert the checkpoint format by loading it with DLRover's load_checkpoint and then saving it again with Megatron-LM's save_checkpoint.
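
The order of operations can be sketched as below. This is a minimal outline, not a definitive implementation: `convert_checkpoint`, `load_fn`, and `save_fn` are hypothetical names, and the two callables stand in for DLRover's load_checkpoint and Megatron-LM's save_checkpoint, because the exact arguments of both depend on the installed versions. The caller is assumed to wrap each function with its own model, optimizer, and scheduler objects.

```python
def convert_checkpoint(load_fn, save_fn):
    """Restore state with `load_fn`, then re-save it with `save_fn`.

    load_fn: zero-argument callable wrapping DLRover's load_checkpoint
             (assumed to fill the live model/optimizer and return the
             iteration, or a tuple whose first item is the iteration).
    save_fn: one-argument callable wrapping Megatron-LM's save_checkpoint;
             it receives the iteration number.
    """
    loaded = load_fn()
    # Depending on the version, the loader may return a bare iteration
    # or a tuple that starts with it; handle both cases.
    iteration = loaded[0] if isinstance(loaded, tuple) else loaded
    save_fn(iteration)
    return iteration
```

In practice the wrappers would be small lambdas or `functools.partial` objects built inside an initialized Megatron-LM job that uses the same parallel configuration as the run that wrote the checkpoint. Pointing the save path at a different directory than the load path keeps the original DLRover checkpoint untouched. The DLRover functions are expected to live in `dlrover.trainer.torch.flash_checkpoint.megatron`, and the Megatron-LM one in its `checkpointing` module, but treat both locations as assumptions to verify against your installed versions.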
