client.connect(path) error when saving checkpoint #1337
Comments
In addition, there are always warnings like this during the saving process. How can I eliminate them?
What's your version? Maybe you can try the master branch, where this is fixed: #1261
Thank you for your reply. I used
If you are using 0.3.8, this does not seem to be the same issue as the one I just linked.
1. I am not using 'dlrover-run'; I launch with 'torchrun' and only use dlrover when saving checkpoints.
So you are expecting a subprocess (the saver) to be created to handle checkpointing during training? We need more logging info about your setup.
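For context on the `client.connect(path)` error in the title, here is a minimal sketch of one way such a call can fail. It assumes the trainer talks to the saver process over a Unix domain socket, which this thread does not confirm because the traceback is not shown. The names `try_connect`, `demo`, and `saver.sock` are hypothetical and are not part of dlrover. The sketch only shows that connecting to a socket path fails when no process has bound it, and succeeds once a listener exists.

```python
import os
import socket
import tempfile


def try_connect(path: str) -> str:
    """Try to connect to a Unix domain socket and report what happened."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return "connected"
    except (FileNotFoundError, ConnectionRefusedError) as exc:
        # No process has created (or is listening on) the socket path.
        return type(exc).__name__
    finally:
        client.close()


def demo() -> tuple:
    sock_dir = tempfile.mkdtemp()
    path = os.path.join(sock_dir, "saver.sock")  # hypothetical socket path

    # Nothing is listening yet, so the connection attempt fails.
    before = try_connect(path)

    # Stand-in for a saver process: bind the path and start listening.
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    after = try_connect(path)

    # Clean up the listener, the socket file, and the temp directory.
    server.close()
    os.unlink(path)
    os.rmdir(sock_dir)
    return before, after


if __name__ == "__main__":
    print(demo())
```

If this is what is happening here, the thing to check is whether any process started the saver before the first save call when training is launched with 'torchrun'.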
Probably the same issue as #1361
When using dlrover to save checkpoints, the following error will always occur:
The code used is as follows:
How to solve this problem? I really hope to receive a reply.