Hello,

I’m running the run_tests.sh script found in the root of the project on a system with 8x A100 GPUs (40 GB each).
Internally, the script runs two pytest commands simultaneously:
- One for fast tests.
- One for all tests.
I noticed that the script uses pytest-xdist with the following options:
- -n auto: spawns one worker per available CPU core.
- --dist worksteal: distributes tests evenly across workers up front and lets idle workers "steal" remaining tests from busy ones.
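For reference, this is roughly what I understand the script to be doing. The test path and the marker for the fast suite below are my guesses for illustration, not the actual script contents:

```bash
#!/usr/bin/env bash
# Paraphrased sketch of run_tests.sh -- NOT the actual script.
# The tests/ path and the "fast" marker are assumptions.
pytest -n auto --dist worksteal -m fast tests/ &   # fast tests, backgrounded
pytest -n auto --dist worksteal tests/ &           # all tests, backgrounded
wait                                               # both suites run concurrently until here
```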
However, when I execute the script as-is, I frequently hit CUDA out-of-memory (OOM) errors. My understanding is that the two concurrent pytest invocations, each spawning one worker per CPU core, create far more CUDA processes than there are GPUs, and the resulting contention for GPU memory triggers the OOMs.
Questions:
What is the recommended number of workers (-n) when running tests on a GPU instance? Should it be based on the number of GPUs, the memory per GPU, or some other factor?
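For example, is something along these lines the right direction? (Counting GPUs with nvidia-smi -L is just one option here, and the test path is again a placeholder.)

```bash
# Hypothetical change: cap the xdist worker count at the number of GPUs
# instead of the CPU-core count that -n auto picks.
NUM_GPUS=$(nvidia-smi -L | wc -l)
pytest -n "$NUM_GPUS" --dist worksteal tests/
```

Even with a cap like this, I am not sure whether each worker also needs to be pinned to its own GPU (e.g., via CUDA_VISIBLE_DEVICES); otherwise all workers might still allocate on the same device.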
Should the two pytest commands in run_tests.sh be executed sequentially? They currently run simultaneously using background execution (&). Would running them one after the other help mitigate the CUDA OOM issues?
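Concretely, I am asking whether dropping the & (and the final wait) so the two suites run back to back would help, for example:

```bash
# Hypothetical sequential variant: the first suite finishes and frees
# its GPU memory before the second one starts.
pytest -n auto --dist worksteal -m fast tests/
pytest -n auto --dist worksteal tests/
```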
Any guidance on optimizing the script for a multi-GPU setup would be greatly appreciated!
Thank you!