Benchmarks: model benchmarks - change torch.distributed.launch to torchrun #556

Merged
merged 4 commits into from
Aug 8, 2023

pnunna93
Contributor

This PR makes the following changes:

  • torch.distributed.launch is replaced with torchrun. torch.distributed.launch is deprecated in recent PyTorch releases, and moving to torchrun is recommended: https://pytorch.org/docs/stable/elastic/run.html
  • Changes to the AMD GPU detection logic. The existing logic emits a warning when a container has only renderD nodes in /dev/dri; this change resolves those warnings.
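
The diff for the detection change is not reproduced in this conversation, so the following is only a minimal sketch of one way a check could accept containers that expose render nodes without card nodes. The helper name `has_amd_gpu_nodes` and its `dri_dir` parameter are hypothetical and are not taken from superbench/common/devices/gpu.py.

```python
from pathlib import Path


def has_amd_gpu_nodes(dri_dir="/dev/dri"):
    """Return True if dri_dir holds any card* or renderD* entry.

    Hypothetical helper for illustration; not the code merged in this PR.
    """
    root = Path(dri_dir)
    if not root.is_dir():
        return False
    # Accept either primary nodes (card0, ...) or render nodes (renderD128, ...),
    # so a container that exposes only render nodes is still treated as having a GPU.
    return any(root.glob("card*")) or any(root.glob("renderD*"))


if __name__ == "__main__":
    print(has_amd_gpu_nodes())
```

Taking the directory as a parameter keeps the check testable against a temporary directory instead of the real /dev/dri.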

@pnunna93 pnunna93 requested a review from a team as a code owner July 24, 2023 22:24
@pnunna93
Contributor Author

@microsoft-github-policy-service agree company="AMD"

Member

@abuccts abuccts left a comment

Thanks for the PR. Please also replace all python3 -m torch.distributed.launch --use_env with torchrun in tests/ so that the unit tests pass.
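
As an illustration of the requested substitution, here is a minimal sketch that rewrites a legacy launcher command line into its torchrun form. The helper name `to_torchrun` is hypothetical and is not part of SuperBench or PyTorch. It relies on the PyTorch documentation's statement that torchrun accepts the same arguments as torch.distributed.launch except `--use_env`, because torchrun always passes rank information through environment variables such as `LOCAL_RANK`.

```python
import shlex

LEGACY_PREFIX = ["python3", "-m", "torch.distributed.launch"]


def to_torchrun(command):
    """Rewrite a torch.distributed.launch command line to use torchrun.

    Hypothetical helper for illustration only.
    """
    tokens = shlex.split(command)
    if tokens[:3] != LEGACY_PREFIX:
        # Not a legacy launcher invocation: return it unchanged.
        return command
    # Drop --use_env everywhere. This sketch assumes the training script
    # does not itself define an argument with that name.
    rest = [token for token in tokens[3:] if token != "--use_env"]
    return " ".join(shlex.quote(token) for token in ["torchrun"] + rest)


if __name__ == "__main__":
    old = "python3 -m torch.distributed.launch --use_env --nproc_per_node=8 train.py"
    print(to_torchrun(old))
```

The training script in such a command is expected to read its local rank from the `LOCAL_RANK` environment variable rather than from a `--local_rank` argument.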

@abuccts
Member

abuccts commented Jul 25, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@codecov

codecov bot commented Jul 25, 2023

Codecov Report

Merging #556 (248600a) into main (e1df877) will decrease coverage by 0.64%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #556      +/-   ##
==========================================
- Coverage   86.96%   86.32%   -0.64%     
==========================================
  Files          93       93              
  Lines        6268     6268              
==========================================
- Hits         5451     5411      -40     
- Misses        817      857      +40     
| Flag | Coverage Δ |
| --- | --- |
| cpu-python3.6-unit-test | 71.86% <0.00%> (ø) |
| cpu-python3.7-unit-test | 71.86% <0.00%> (ø) |
| cpu-python3.8-unit-test | 72.34% <0.00%> (ø) |
| cuda-unit-test | 84.91% <0.00%> (ø) |

Flags with carried forward coverage won't be shown.

| Files Changed | Coverage Δ |
| --- | --- |
| superbench/common/devices/gpu.py | 82.60% <0.00%> (ø) |
| superbench/runner/runner.py | 82.46% <ø> (ø) |

... and 1 file with indirect coverage changes

@cp5555
Contributor

cp5555 commented Jul 25, 2023

Thanks for your PR. We will test it soon to check whether it will impact the final performance or not. If not, we will merge it.

@cp5555 cp5555 self-requested a review July 25, 2023 23:11
@cp5555 cp5555 mentioned this pull request Jul 27, 2023
30 tasks
Contributor

@yukirora yukirora left a comment

Regarding functionality: it works for both single-node and multi-node runs on MI200.
Regarding performance: no significant difference was observed on a single A100 node.

| Model | Precision | Previous Throughput | New Throughput |
| --- | --- | --- | --- |
| bert/pytorch-bert-base | fp32 | 380.13 | 379.61 |
| bert/pytorch-bert-base | fp16 | 614.74 | 614.43 |
| bert/pytorch-bert-large | fp32 | 130.85 | 130.77 |
| bert/pytorch-bert-large | fp16 | 224.03 | 223.17 |
| densenet/pytorch-densenet169 | fp32 | 268.70 | 264.50 |
| densenet/pytorch-densenet169 | fp16 | 274.66 | 266.47 |
| densenet/pytorch-densenet201 | fp32 | 219.88 | 219.15 |
| densenet/pytorch-densenet201 | fp16 | 219.61 | 218.44 |
| gpt/pytorch-gpt2-small | fp32 | 179.65 | 179.04 |
| gpt/pytorch-gpt2-small | fp16 | 188.58 | 189.43 |
| gpt/pytorch-gpt2-large | fp32 | 35.37 | 35.48 |
| gpt/pytorch-gpt2-large | fp16 | 59.36 | 59.25 |
| lstm/pytorch-lstm | fp32 | 4975.33 | 5026.24 |
| lstm/pytorch-lstm | fp16 | 7895.35 | 7981.03 |
| resnet/pytorch-resnet50 | fp32 | 945.86 | 945.61 |
| resnet/pytorch-resnet50 | fp16 | 1273.37 | 1317.63 |
| resnet/pytorch-resnet101 | fp32 | 607.28 | 611.11 |
| resnet/pytorch-resnet101 | fp16 | 887.07 | 913.76 |
| resnet/pytorch-resnet152 | fp32 | 436.23 | 435.34 |
| resnet/pytorch-resnet152 | fp16 | 652.38 | 660.70 |
| vgg/pytorch-vgg11 | fp32 | 760.03 | 757.03 |
| vgg/pytorch-vgg11 | fp16 | 1130.59 | 1139.74 |
| vgg/pytorch-vgg13 | fp32 | 554.00 | 552.60 |
| vgg/pytorch-vgg13 | fp16 | 858.53 | 885.61 |
| vgg/pytorch-vgg16 | fp32 | 482.54 | 481.02 |
| vgg/pytorch-vgg16 | fp16 | 777.03 | 785.29 |
| vgg/pytorch-vgg19 | fp32 | 422.60 | 422.29 |
| vgg/pytorch-vgg19 | fp16 | 693.83 | 696.07 |
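
To make the comparison easier to read, the relative change for each row can be computed as (new - previous) / previous. The sketch below does this for four rows copied from the table above; the function name `relative_change_pct` and the choice of rows are illustrative assumptions, not part of the original comment.

```python
# (model, precision, previous throughput, new throughput), copied from the table above.
ROWS = [
    ("bert/pytorch-bert-base", "fp32", 380.13, 379.61),
    ("densenet/pytorch-densenet169", "fp16", 274.66, 266.47),
    ("lstm/pytorch-lstm", "fp32", 4975.33, 5026.24),
    ("resnet/pytorch-resnet50", "fp16", 1273.37, 1317.63),
]


def relative_change_pct(previous, new):
    """Relative change from previous to new, in percent."""
    return (new - previous) / previous * 100.0


if __name__ == "__main__":
    for model, precision, previous, new in ROWS:
        change = relative_change_pct(previous, new)
        print(f"{model} {precision}: {change:+.2f}%")
```

A positive value means throughput went up with torchrun, and a negative value means it went down.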

@yukirora yukirora changed the title Change torch.distributed.launch to torchrun Benchmarks: model benchmarks - change torch.distributed.launch to torchrun Aug 4, 2023
@yukirora
Contributor

yukirora commented Aug 4, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@yukirora yukirora merged commit 67f2aa7 into microsoft:main Aug 8, 2023
22 of 23 checks passed
@pnunna93 pnunna93 deleted the torchrun branch August 8, 2023 15:37
@yukirora yukirora mentioned this pull request Dec 6, 2023
29 tasks
@cp5555 cp5555 added benchmarks SuperBench Benchmarks model-benchmarks Model Benchmark Test for SuperBench Benchmarks labels Dec 6, 2023