
Splat gpu launch #539

Merged: 4 commits merged on Jun 25, 2024

Conversation

@wsmoses (Contributor) commented May 31, 2024

Attempt fix for #535

I think this should be cost-free (in fact, it should ideally slightly improve alias analysis, etc.).

@maleadt @vchuravy

@maleadt (Member) commented Jun 4, 2024

I don't get why this is needed. We construct Broadcasted objects in other places, so why is this one problematic?

@wsmoses (Contributor, Author) commented Jun 8, 2024

The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object]. As a result, it's impossible to mark some arguments as differentiable and others as non-differentiable [leading to errors if differentiating x .+ y, where x is constant and y is differentiable].

Preserving the separate arguments remedies this.
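
(For illustration, a minimal sketch of the mixed-activity case described above, assuming Enzyme.jl's Const/Duplicated annotations; the function f and the shadow dy are invented for the example and are not part of this PR:)

using CUDA, Enzyme

# Minimal sketch of the mixed-activity case: x is constant, y is differentiable.
x  = CUDA.ones(4)       # non-differentiable input
y  = CUDA.rand(4)       # differentiable input
dy = CUDA.zeros(4)      # shadow (gradient accumulator) for y

f(x, y) = sum(x .+ y)   # the broadcast under discussion, reduced to a scalar

# Enzyme is told the activity of each argument separately. If the broadcast
# kernel then bundles x and y into a single Broadcasted object, that one
# object has mixed activity and the kernel-launch rule can no longer tell
# the two apart.
Enzyme.autodiff(Reverse, f, Active, Const(x), Duplicated(y, dy))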

@maleadt (Member) commented Jun 10, 2024

The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object].

Right, and we do that for all broadcasts. So why is this only a problem for map!, as you said in #535:

The implementation of map! (https://github.com/JuliaGPU/GPUArrays.jl/blob/ec9fe5b6f7522902e444c95a0c9248a4bc55d602/src/host/broadcast.jl#L120C46-L120C59) creates a broadcasted object which captures all of the arguments to map!. This is then passed to the kernel.

FWIW, if this kind of pattern is an issue for Enzyme, you'll run into it a lot, because passing arguments through structures is how closures work in Julia.
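
(For context, a simplified sketch of the map! pattern being referenced; this is not the exact GPUArrays implementation linked above:)

using Base.Broadcast: broadcasted, instantiate

# Simplified sketch: map! bundles the function and every input array into a
# single Broadcasted object, and that one object is what ends up being passed
# to the GPU kernel.
function sketch_map!(f, dest, xs...)
    bc = instantiate(broadcasted(f, xs...))  # captures f and all inputs together
    return copyto!(dest, bc)                 # the GPU copyto! launches the kernel with bc
end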

@wsmoses (Contributor, Author) commented Jun 10, 2024

It may not be a problem only for map, but map is certainly at a critical juncture.

In particular, Enzyme often doesn't care whether things are put into closures, as long as the closure gets optimized out (and potentially separated, etc.).

One significant issue here, however, is that the custom rule for the kernel call forces us to see the entire closure at the Julia level [and thus we can't separate the args out even if we wanted to]. Since this is what creates the kernel that is called, I'm not sure of a way to resolve the issue without this change.

The issue also doesn't arise if all the variables are differentiable [but in the context of a broadcast it is much more common to write x .+ y where one operand isn't differentiable].

Doing something like this might also have other downstream benefits. For example, arrays being passed separately may allow alias information to be propagated, whereas indirection via a closure would likely drop that information.

@wsmoses (Contributor, Author) commented Jun 18, 2024

bump @maleadt for thoughts?

@maleadt (Member) commented Jun 18, 2024

CI failures are related, so I figured you were still working on this.

Enzyme often doesn't care if things are put into closures

I don't get your subsequent reasoning. Enzyme obviously seems to care here, and I don't understand what the difference is between the broadcast kernel and the following pattern we use all over CUDA.jl:

using CUDA

function outer(x::CuArray)
    function kernel()
        # do something with x, captured through the kernel closure
        i = threadIdx().x
        @inbounds x[i] += 1f0
        return
    end
    @cuda threads=length(x) kernel()
end

The kernel closure, capturing x, is pretty similar to a Broadcasted object capturing arguments and being passed to a function.

@wsmoses (Contributor, Author) commented Jun 18, 2024

Oh yeah, sorry, I wanted to finish discussing before continuing work, to make sure the strategy was fine with you.

Ah, I was assuming the closure was within the CUDA kernel. However, your case doesn't present an issue.

The problem Enzyme has is with a data structure where one field is a differentiable CuArray and a second field is a non-differentiable CuArray, and with that data structure being passed into a custom rule (specifically the custom rule for the CUDA kernel launch).

Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray; it will be only one or the other.
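
(For illustration, a hypothetical sketch of the problematic shape just described: a closure capturing two CuArrays whose intended activities differ. The names are invented and this is not code from the PR:)

using CUDA

x = CUDA.rand(4)   # suppose this array is meant to be differentiable
y = CUDA.ones(4)   # while this one is meant to be constant

function combine!(x, y)
    # The closure below captures both arrays in a single object. The custom
    # rule for the kernel launch only sees that single captured object, so it
    # cannot assign different activities to x and y inside it.
    function kernel()
        i = threadIdx().x
        @inbounds x[i] += y[i]
        return
    end
    @cuda threads=length(x) kernel()
end

combine!(x, y)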

@maleadt (Member) commented Jun 18, 2024

Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray.

Yeah, that was just for simplicity of the example; we often capture more than one input. So while I'm not particularly opposed to this workaround, even though I do dislike it, you'll be running into other situations like this. I would assume that the situation will also arise when, e.g., passing tuples of CuArrays?

@wsmoses (Contributor, Author) commented Jun 18, 2024

Partially, it depends on context. It's really common for one to differentiate active x .+ inactive y [e.g. one is constant, the other is not], but doing a tuple kernel where these are user-specified different activities feels sufficiently uncommon that I'm willing to figure that out when we get there.

@wsmoses (Contributor, Author) commented Jun 18, 2024

@maleadt mind giving permission to run CI for the PR?

maleadt requested a review from vchuravy on June 20, 2024
@vchuravy (Member) left a comment

I double-checked the style handling, and indeed before 1.10 we need to do the ugly typeof/typeof dance.

maleadt merged commit 8c5d550 into JuliaGPU:master on Jun 25, 2024
13 of 14 checks passed
maleadt added a commit that referenced this pull request Jun 28, 2024
@maleadt (Member) commented Jun 28, 2024

Had to revert:

julia> using ForwardDiff, CUDA

julia> x = CUDA.ones(4,4)
4×4 CuArray{Float32, 2, CUDA.DeviceMemory}:
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0

julia> y = ForwardDiff.Dual.(x, true)
ERROR: GPU compilation of MethodInstance for (::GPUArrays.var"#35#37")(::CUDA.CuKernelContext, ::CuDeviceMatrix{…}, ::Int64, ::CUDA.CuArrayStyle{…}, ::Type{…}, ::Tuple{…}, ::Base.Broadcast.Extruded{…}, ::Bool) failed
KernelError: passing and using non-bitstype argument

Argument 6 to your kernel function is of type Type{ForwardDiff.Dual}, which is not isbits
