Hello,

I’m running the run_tests.sh script found in the root of the project on a system with 8x A100 GPUs (40 GB each).
Internally, the script runs two pytest commands simultaneously:
- One for fast tests.
- One for all tests.
I noticed that the script uses pytest-xdist with the following options:
- -n auto: spawns one worker per available CPU core.
- --dist worksteal: distributes tests evenly across workers up front and lets idle workers "steal" remaining tests from busy ones.
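For reference, this is roughly what I understand the script to be doing. The test path and the marker for the fast suite below are my guesses for illustration, not the actual script contents:

```bash
#!/usr/bin/env bash
# Paraphrased sketch of run_tests.sh -- NOT the actual script.
# The tests/ path and the "fast" marker are assumptions.
pytest -n auto --dist worksteal -m fast tests/ &   # fast tests, backgrounded
pytest -n auto --dist worksteal tests/ &           # all tests, backgrounded
wait                                               # both suites run concurrently until here
```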
However, when I execute the script as-is, I frequently hit CUDA out-of-memory (OOM) errors. My understanding is that the two concurrent pytest invocations, each spawning one worker per CPU core, create far more CUDA processes than there are GPUs, and the resulting contention for GPU memory triggers the OOMs.
Questions:
What is the recommended number of workers (-n) when running tests on a GPU instance? Should it be based on the number of GPUs, the memory per GPU, or some other factor?
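For example, is something along these lines the right direction? (Counting GPUs with nvidia-smi -L is just one option here, and the test path is again a placeholder.)

```bash
# Hypothetical change: cap the xdist worker count at the number of GPUs
# instead of the CPU-core count that -n auto picks.
NUM_GPUS=$(nvidia-smi -L | wc -l)
pytest -n "$NUM_GPUS" --dist worksteal tests/
```

Even with a cap like this, I am not sure whether each worker also needs to be pinned to its own GPU (e.g., via CUDA_VISIBLE_DEVICES); otherwise all workers might still allocate on the same device.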
Should the two pytest commands in run_tests.sh be executed sequentially? They currently run simultaneously using background execution (&). Would running them one after the other help mitigate the CUDA OOM issues?
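Concretely, I am asking whether dropping the & (and the final wait) so the two suites run back to back would help, for example:

```bash
# Hypothetical sequential variant: the first suite finishes and frees
# its GPU memory before the second one starts.
pytest -n auto --dist worksteal -m fast tests/
pytest -n auto --dist worksteal tests/
```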
Any guidance on optimizing the script for a multi-GPU setup would be greatly appreciated!
Thank you!