
[Q] Resolving CUDA OOM Errors When Running run_tests.sh on GPUs #857

Open
apivovarov opened this issue Nov 25, 2024 · 0 comments

Hello,

I’m running the run_tests.sh script found in the root of the project on a system with 8x A100 GPUs (40GB each).
Internally, the script runs two pytest commands simultaneously:

  • One for fast tests.
  • One for all tests.

I noticed that the script uses pytest-xdist with the following options (a sketch of the invocation as I understand it follows this list):

  • -n auto: Spawns one worker per available CPU core.
  • --dist worksteal: Distributes tests evenly across workers and lets idle workers "steal" pending tests from busy ones.
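
For context, here is roughly the shape of the current invocation as I understand it; the test paths and the "-m fast" selector below are placeholders, not the actual contents of run_tests.sh:

```bash
# Presumed structure of the current run_tests.sh (illustrative only):
# both suites start at once via "&", and each spawns one xdist worker per
# CPU core, so far more processes than GPUs end up competing for the
# 40 GB of memory on each A100.
pytest -n auto --dist worksteal -m fast tests/ &   # fast tests ("-m fast" is a placeholder marker)
pytest -n auto --dist worksteal tests/ &           # full test suite
wait                                               # block until both background runs finish
```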

However, when I execute the script as-is, I frequently encounter CUDA Out-Of-Memory (OOM) errors. My understanding is that running multiple pytest processes in parallel might be causing resource contention across the GPUs.

Questions:

  1. What is the recommended number of workers (-n) when running tests on a GPU instance?
    Should it be based on the number of GPUs, memory per GPU, or another factor?
  2. Should the two pytest commands in run_tests.sh be executed sequentially?
    They currently run simultaneously via background execution (&). Would running them sequentially help mitigate the CUDA OOM issues? (A sketch of the sequential variant I have in mind follows this list.)
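
Concretely, the kind of change I have in mind is something like the following; the paths and the "-m fast" marker are again placeholders, and I am not sure that capping -n at the GPU count is the right sizing rule, hence the question:

```bash
# Hypothetical sequential variant of run_tests.sh (placeholder paths/markers):
# run the two suites back to back and cap the xdist worker count at the
# number of visible GPUs instead of the number of CPU cores.
set -euo pipefail

NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)   # 8 on this machine

pytest -n "$NUM_GPUS" --dist worksteal -m fast tests/   # fast tests first
pytest -n "$NUM_GPUS" --dist worksteal tests/           # then the full suite
```

I realize this still does not pin each xdist worker to a specific GPU, so several workers could land on the same device; whether that needs explicit handling (e.g. a per-worker CUDA_VISIBLE_DEVICES assignment) is part of what I am asking.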

Any guidance on optimizing the script for a multi-GPU setup would be greatly appreciated!

Thank you!

apivovarov changed the title from "[Q] How to properly test this project on GPU?" to "[Q] Resolving CUDA OOM Errors When Running run_tests.sh on GPUs" on Nov 25, 2024