Device-to-host copies with GPU code #1513
Does it work if you return the result into a 0-d array? E.g. something like `sum!(ROCArray{Float32}(undef, ()), 1f0 .- x)`? If so, it would be pretty easy to define some generic function like `sum0d(X) = sum!(similar(X, ()), X)`.
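A minimal sketch of that idea, assuming CUDA.jl for concreteness (the same pattern applies to `ROCArray`); `sum0d` is the hypothetical helper from the comment above, not an existing API:

```julia
# Sketch of the 0-d reduction idea: the result stays on the GPU, so no
# device-to-host copy (and thus no synchronization) is triggered.
# `sum0d` is a hypothetical helper, not part of CUDA.jl / GPUArrays.jl.
using CUDA

sum0d(X) = sum!(similar(X, ()), X)   # reduce into a 0-dimensional GPU array

x = CUDA.rand(Float32, 1024, 1024)
s = sum0d(1f0 .- x)                  # 0-d CuArray{Float32}; value is still on-device
```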
@pxl-th does that work?
The problem is not just the `sum` call itself: you can't eliminate the synchronization as easily in the chain rules for broadcasting. To eliminate it here, you'd have to either:
Interesting discussion. Some partial thoughts:
@jpsamaroo has also suggested
Got
@maleadt is there a reason that
There's no other way to implement a reduction to a scalar: the value needs to be available on the host, so we have to wait for the GPU to finish computing it, i.e. synchronize the device, as @pxl-th mentioned.
@pxl-th, do you think
Testing this hypothesis here:

```julia
using CUDA, BenchmarkTools

function f()
    x = CUDA.rand(1024, 1024, 100)
    a1 = rand(Float32)
    a2 = CUDA.rand(Float32, 1, 1)
    @btime CUDA.@sync $a1 * $x   # 30.536 μs (70 allocations: 1.70 KiB)
    @btime CUDA.@sync $a1 .* $x  # 30.779 μs (70 allocations: 1.70 KiB)
    @btime CUDA.@sync $a2 .* $x  # 33.149 μs (105 allocations: 2.25 KiB)
    nothing
end

f()
```

Looks like there's no speedup to be gained there.
Yeah, because you're synchronizing. The whole point of GPUNumber (or whatever it will be named) is that pending values can be used as inputs without having to synchronize the GPU. It also only matters when acquiring such an unmaterialized value from one GPU kernel and feeding it to another; when feeding simple random numbers to a kernel, it does not matter whether you use the raw value or a 1-element array. In fact, the latter will only slow things down, because the value can't be loaded from the parameter address space anymore.
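A rough sketch of the case GPUNumber targets, assuming CUDA.jl (GPUNumber itself does not exist yet; a small on-device array stands in for it here):

```julia
# Sketch: feeding an unmaterialized reduction result into the next kernel.
# `GPUNumber` is hypothetical; a 1x1 CuArray approximates the idea.
using CUDA

x = CUDA.rand(Float32, 1024, 1024)

# Scalar path: `sum` copies one element to the host, which waits for the GPU,
# and the scalar is then re-uploaded as an argument to the next kernel.
s_host = sum(x)
y_host = x ./ s_host

# On-device path: the reduction result stays on the GPU and is consumed by the
# next broadcast kernel; nothing forces the host to wait in between.
s_dev = sum(x; dims=(1, 2))   # 1×1 CuArray{Float32}
y_dev = x ./ s_dev
```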
Currently any code that reduces GPU arrays to a single scalar value (like `sum`) does a device-to-host copy of a single element at the end, to return a `::Number` instead of a `::GPUArray`. But each such transfer causes GPU synchronization, which kills performance because we are not able to feed the GPU fast enough.

Below is an MWE for `Zygote.unbroadcast`, but in more complex code, like with Flux models, I have up to 40 such transfers in a single forward+backward pass.

One way to fix this (which is what Python GPU frameworks do) is to return a `::GPUArray` instead of a `::Number` when we reduce to a single value. But this breaks things like projection and other stuff, although I'd be in favor of it.

Another is to go through every use case and update it to return a `::GPUArray` with `sum(x; dims=1:ndims(x))` or similar things...

For bigger Flux.jl models, such synchronizations may take up to a second, and all of them are under the hood, so the user cannot easily mitigate them.
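A minimal sketch of the kind of broadcast whose pullback goes through `Zygote.unbroadcast`, assuming CUDA.jl and Zygote (an illustrative reconstruction, not the original MWE):

```julia
# Illustrative sketch (not the original MWE): broadcasting a host scalar over a
# GPU array makes Zygote's pullback reduce the gradient back to a scalar via
# `unbroadcast` -> `sum`, which copies one element to the host and synchronizes.
using CUDA, Zygote

x = CUDA.rand(Float32, 1024, 1024)
a = 2f0

loss(a, x) = sum(abs2, a .* x)

# `da` is a plain Float32 on the host: producing it required a D2H copy.
da, dx = Zygote.gradient(loss, a, x)
```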