Fixed Issue for torchrun command for train_cifar10_ddp.py #149

Open. Xiaoming-Zhao wants to merge 3 commits into main.

Xiaoming-Zhao

What does this PR do?

This PR adds the --standalone command-line argument to the torchrun command; without it, I cannot get the script train_cifar10_ddp.py to run locally. This also follows the official documentation.
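
For illustration, here is a minimal sketch of how the adjusted launch command could be assembled. The helper name `build_torchrun_command` and the process count of 2 are assumptions made for this example, and the script's own arguments are deliberately left out because they are not given in this PR description.

```python
import shlex


def build_torchrun_command(script, nproc_per_node, standalone=True):
    """Assemble a single-node torchrun invocation as an argument list.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    cmd = ["torchrun"]
    if standalone:
        # The flag this PR adds to the launch command.
        cmd.append("--standalone")
    cmd.append(f"--nproc_per_node={nproc_per_node}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Assumption: two local GPUs, hence two worker processes.
    print(shlex.join(build_torchrun_command("train_cifar10_ddp.py", 2)))
```

The returned list can be passed to `subprocess.run` as is, after appending whatever arguments the training script itself expects.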

Before submitting

  • Did you make sure the title is self-explanatory and the description concisely explains the PR?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with the pytest command?
  • Did you run the pre-commit hooks with the pre-commit run -a command?

Did you have fun?

Make sure you had fun coding 🙃

@ImahnShekhzadeh
Contributor

Thanks for the PR! I'm curious: what error message did you get without the --standalone flag?

Also, did you install your environment recently? I had tested train_cifar10_ddp.py with PyTorch 2.0 or 2.1 a few months ago, so if you have a newer version with which you are running into a problem, I'd be happy to give a newer version a try to see whether I can reproduce the error.

@Xiaoming-Zhao
Author

I used PyTorch 2.4.0.

There were no errors but the process just hung forever.

This could be due to my server's setup. Happy to close this PR if that is the case.

@ImahnShekhzadeh
Contributor

ImahnShekhzadeh commented Nov 18, 2024

Hmm, difficult to say... So you tried it locally with two GPUs?
I tried it myself on runpod.io with two GPUs, and there, after specifying the correct master address and port, the script worked without the --standalone flag. In my experience, when a DDP process "hung forever", it was usually because of a wrong master port/address.
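
One quick way to rule out a port problem before launching is to check whether the intended master port can be bound at all on the node that hosts rank 0. The snippet below is only a rough diagnostic sketch: the helper name `port_is_free` is made up for this example, and the host and port in the usage section are placeholder values to replace with your own settings. It covers just one failure mode (a port that is already taken or an address that cannot be bound locally), not an address that the other workers cannot reach.

```python
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or an address not local to this machine.
            return False
    return True


if __name__ == "__main__":
    # Placeholder values: substitute the master address and port you pass to the script.
    print(port_is_free("127.0.0.1", 29500))
```

If this prints `False` for the address and port you planned to use, picking a different port is an easy first thing to try.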
