
Flux: slower and slower computations of a neural network - a VRAM problem? #654

Open
Vintarel opened this issue Jul 24, 2024 · 3 comments


@Vintarel

This small code:

input = rand(Float32, 6, 6, 2, 1) |> gpu  
output = nn(input) |> cpu  

gets slower and slower to compute on my computer. Specifically, the full code:

using Flux, AMDGPU

# Small conv net followed by a dense head, moved to the GPU.
nn = Chain(
    Conv((3, 3), 2 => 128, relu, pad=0),
    Conv((3, 3), 128 => 128, relu, pad=1),
    Conv((1, 1), 128 => 1, relu),
    Flux.flatten,
    Dense(16 => 256, relu),
    Dense(256 => 1, sigmoid)
) |> gpu

# 20 timed rounds of 2000 single-sample forward passes each.
for i in 1:20
    @time for j in 1:2000
        input = rand(Float32, 6, 6, 2, 1) |> gpu
        output = nn(input) |> cpu
    end
end

prints the following computation times:

  9.989851 seconds (13.81 M allocations: 876.953 MiB, 2.33% gc time, 80.64% compilation time)
  1.678728 seconds (880.00 k allocations: 36.392 MiB)
  3.435393 seconds (1.03 M allocations: 40.158 MiB, 0.55% gc time)
  3.878582 seconds (1.04 M allocations: 40.348 MiB, 0.48% gc time)
  4.033464 seconds (1.04 M allocations: 40.417 MiB, 0.58% gc time)
  4.567139 seconds (880.00 k allocations: 36.392 MiB)
  7.569016 seconds (1.04 M allocations: 40.452 MiB, 0.26% gc time)
  6.075572 seconds (1.04 M allocations: 40.448 MiB, 0.30% gc time)
  6.280088 seconds (1.04 M allocations: 40.448 MiB, 0.30% gc time)
  7.004467 seconds (1.04 M allocations: 40.448 MiB, 0.27% gc time)
  6.084493 seconds (880.01 k allocations: 36.392 MiB)
  8.216904 seconds (1.04 M allocations: 40.449 MiB, 0.23% gc time)
  9.743433 seconds (1.04 M allocations: 40.449 MiB, 0.20% gc time)
 10.631787 seconds (1.04 M allocations: 40.449 MiB, 0.18% gc time)
  9.975057 seconds (880.01 k allocations: 36.392 MiB)
 11.235186 seconds (1.04 M allocations: 40.449 MiB, 0.17% gc time)
 20.558719 seconds (1.04 M allocations: 40.449 MiB, 0.09% gc time)
 25.954910 seconds (1.04 M allocations: 40.449 MiB, 0.07% gc time)
 19.961299 seconds (880.01 k allocations: 36.392 MiB)
 17.356931 seconds (1.04 M allocations: 40.449 MiB, 0.11% gc time)

so it goes from about 3 s to more than 20 s for the same piece of code! I checked the VRAM of my GPU (with cat /sys/class/drm/card1/device/mem_info_vram_used) and it strictly increases while the code runs. Maybe this is the source of the problem? But I'm unable to empty it.

I tried several trivial things such as finalize(input), but I was not able to solve the problem. The CPU version of the code works fine. Please help!
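For reference, a minimal sketch of what eager freeing inside the loop can look like (this assumes AMDGPU exposes unsafe_free! for ROCArray, as GPUArrays-based back ends typically do; it is an illustration, not a confirmed fix):

for j in 1:2000
    input = rand(Float32, 6, 6, 2, 1) |> gpu
    output = nn(input) |> cpu
    # Release the input buffer immediately instead of waiting for the
    # GC to finalize it (unsafe: `input` must not be used afterwards).
    AMDGPU.unsafe_free!(input)
end

Note that the intermediate activations allocated inside nn are not covered by this, so freeing only input cannot release all of the VRAM.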

The GPU is an AMD Radeon RX 6700 XT; I am on Manjaro, kernel 6.9.9-1.

@pxl-th
Member

pxl-th commented Jul 24, 2024

Without profiling, I suspect this is because the VRAM is not freed in time (the GC does not know about the GPU memory space).

This creates memory pressure, and when it gets too high we manually trigger the GC.
But that is slow, and I think the pressure just continues to build up.

This is a known problem, and sadly there is no bulletproof solution for it at the moment.
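A user-side stopgap is to trigger collection periodically from the hot loop, trading some throughput for bounded VRAM (a sketch; the interval of 100 is arbitrary):

for j in 1:2000
    input = rand(Float32, 6, 6, 2, 1) |> gpu
    output = nn(input) |> cpu
    # Run an incremental collection now and then so dead ROCArrays
    # are finalized before VRAM pressure builds up.
    j % 100 == 0 && GC.gc(false)
end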

One possible thing to try is to create a LocalPreferences.toml file in your project directory with the following:

[AMDGPU]
soft_memory_limit = "80 %"
hard_memory_limit = "80 %"
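The same keys can also be written from the REPL with Preferences.jl, which is the mechanism that reads LocalPreferences.toml (a sketch; a Julia restart is needed for the preferences to take effect):

using Preferences, AMDGPU

# Writes the [AMDGPU] section into the active project's
# LocalPreferences.toml; restart Julia afterwards.
set_preferences!(AMDGPU,
    "soft_memory_limit" => "80 %",
    "hard_memory_limit" => "80 %")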

@Vintarel
Author

Thanks for the answer!

I've added this and it changed nothing; I then went down to "1%" and it still changed nothing; then I lowered it further to "3 MiB", at which point it got even slower, apparently because the GC is triggered every time.

Now I see that the problem has been discussed many times before, e.g. https://github.com/JuliaGPU/CUDA.jl/issues/137
So I guess I have no choice at the moment but to find a way to use large batches!
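For completeness, a batched variant of the original loop (a sketch; the batch size of 200 is arbitrary), which replaces 2000 single-sample transfers and forward passes per round with 10 batched ones:

batchsize = 200  # arbitrary; still 2000 samples per timed round

for i in 1:20
    @time for j in 1:(2000 ÷ batchsize)
        # One host-to-device transfer and one forward pass per batch
        # instead of per sample.
        input = rand(Float32, 6, 6, 2, batchsize) |> gpu
        output = nn(input) |> cpu
    end
end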

@pxl-th
Member

pxl-th commented Jul 25, 2024

These memory-limit parameters only control how soon the GC is manually triggered under the hood, so they won't help you avoid GC calls.
And here you can see that those GC calls are very wasteful.

Probably the best solution is to introduce reference counting as a memory-management mechanism in Julia (alongside the current GC), but that is not trivial to do (although some work has been done in that direction).
