client.connect(path) error when saving checkpoint #1337

Open
atomrun39 opened this issue Nov 15, 2024 · 7 comments


@atomrun39

atomrun39 commented Nov 15, 2024

When using dlrover to save checkpoints, the following error always occurs:

[2024-11-15 12:30:37,876] [INFO] [engine.py:131:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-11-15 12:30:37,879] [INFO] [engine.py:299:_notify_agent_to_create_saver] Notify agent to create a checkpoint saver using: {'module_path': 'dlrover.python.elastic_agent.torch.ckpt_saver', 'class_name': 'DeepSpeedCheckpointSaver', 'kwargs': {'checkpoint_dir': '/work/share/chenyd/finetune/ChatGLM2-6B/model/checkpoints_out/ALL/original/chatglm2-6b/checkpoint-15', 'storage_meta': ClassMeta(module_path='dlrover.python.common.storage', class_name='PosixDiskStorage', kwargs={}), 'local_shard_num': 8, 'global_shard_num': 16, 'save_timeout': 600}}.
[2024-11-15 12:30:37,879] [WARNING] [multi_process.py:91:_create_socket_client] Unexpected error when creating socket client by path: /tmp/ckpt_sock/1857279191730585602/sharedqueue_factory.sock, error: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py", line 89, in _create_socket_client
    client.connect(path)
FileNotFoundError: [Errno 2] No such file or directory
[2024-11-15 12:30:37,895] [INFO] [ckpt_saver.py:451:_factory] Start the checkpoint saver factory.
/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:48: ResourceWarning: unclosed <socket.socket fd=116, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
  time.sleep(1)
ResourceWarning: Enable tracemalloc to get the object allocation traceback

The code used is as follows:

    checkpointer = DeepSpeedCheckpointer(model, output_dir)
    result = checkpointer.save_checkpoint(
        output_dir,
        tag=self.state.global_step,
        storage_type=StorageType.DISK,
    )

How can I solve this problem? I would really appreciate a reply.
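
For context: a FileNotFoundError from client.connect(path) on an AF_UNIX socket simply means the socket file does not exist at the moment of the connect, which suggests the client tried to connect before the saver factory process had created it. As a standalone illustration of that race (not dlrover code; the path and job id below are placeholders), a retry loop around the connect would look like this:

    import socket
    import time

    def connect_with_retry(path, retries=10, delay=0.5):
        """Connect to a Unix domain socket, retrying while the socket file
        has not been created yet (FileNotFoundError) or the server side is
        not accepting connections yet (ConnectionRefusedError)."""
        for attempt in range(retries):
            client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            try:
                client.connect(path)
                return client
            except (FileNotFoundError, ConnectionRefusedError):
                client.close()  # close the failed socket to avoid a ResourceWarning
                if attempt == retries - 1:
                    raise
                time.sleep(delay)

    # Hypothetical usage, mirroring the path from the log above:
    # sock = connect_with_retry("/tmp/ckpt_sock/<job_id>/sharedqueue_factory.sock")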

@atomrun39
Author

In addition, warnings like the following always appear during the saving process. How can I eliminate them?

/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:271: ResourceWarning: unclosed <socket.socket fd=127, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=/tmp/ckpt_sock/1857345181448912897/sharedlock_shm_lock_1.sock>
  connection, _ = self._server.accept()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
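
If the unclosed sockets cannot be fixed at the source, the messages can at least be silenced with Python's standard warnings filter. This is only a workaround for the noise, not a fix for the leaked sockets (a minimal sketch, assuming you control the training entry point):

    import warnings

    # Workaround only: this hides the "unclosed socket" ResourceWarning messages
    # for the current process; it does not close the leaked sockets themselves.
    warnings.simplefilter("ignore", ResourceWarning)

Equivalently, setting PYTHONWARNINGS=ignore::ResourceWarning in the environment of the torchrun workers hides the messages without any code change.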

@BalaBalaYi
Collaborator

What's your version? Maybe you can try the master branch, which has this fix: #1261

@atomrun39
Author

Thank you for your reply. I installed dlrover with pip, and the installed version is 0.3.8. Is there any difference between it and the master branch?

@BalaBalaYi
Collaborator

If you are using 0.3.8, it doesn't seem to be the same issue as the one I linked.

  1. Are you using 'dlrover-run' to launch your training?
  2. What is the value of the 'ROLE_NAME' environment variable (in the process env) in your case?

@atomrun39
Author

1. I am not using 'dlrover-run'; I use 'torchrun' and only use dlrover when saving checkpoints.
2. The 'ROLE_NAME' environment variable has not been set. What should it be set to?
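
To answer the question above precisely, it may help to print the variable from inside a worker process, since a launcher can inject environment variables that the interactive shell does not show. A minimal check using only the standard library (nothing dlrover-specific is assumed):

    import os

    # Print the launcher-related environment variables as seen by the worker
    # process itself. ROLE_NAME is the variable asked about above; RANK,
    # LOCAL_RANK, and WORLD_SIZE are standard torchrun variables that help
    # describe the launch context.
    for name in ("ROLE_NAME", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
        print(f"{name}={os.environ.get(name)}")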

@BalaBalaYi
Collaborator

BalaBalaYi commented Nov 27, 2024

So you are expecting to create a sub-process (the saver) to handle checkpointing during training?

We need more logging info about your context.

@BalaBalaYi
Collaborator

Probably the same issue as #1361.
