
Splat gpu launch #539

Merged: 4 commits merged on Jun 25, 2024

Conversation

@wsmoses (Contributor) commented May 31, 2024

Attempt fix for #535

I think this should be cost-free (in fact, it should ideally slightly improve alias analysis, etc.).

@maleadt @vchuravy

@maleadt (Member) commented Jun 4, 2024

I don't get why this is needed. We construct Broadcasted objects in other places, so why is this one problematic?

@wsmoses (Contributor, Author) commented Jun 8, 2024

The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object]. As a result, it's impossible to mark some arguments as differentiable and others as non-differentiable [leading to errors if differentiating x .+ y, where x is constant and y is differentiable].

Preserving the separate arguments remedies this.
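
(For illustration, a minimal sketch of the mixed-activity case described above, assuming Enzyme.jl's Const/Duplicated annotations; the function f and the shadow dy are invented for the example and are not part of this PR:)

using CUDA, Enzyme

# Minimal sketch of the mixed-activity case: x is constant, y is differentiable.
x  = CUDA.ones(4)       # non-differentiable input
y  = CUDA.rand(4)       # differentiable input
dy = CUDA.zeros(4)      # shadow (gradient accumulator) for y

f(x, y) = sum(x .+ y)   # the broadcast under discussion, reduced to a scalar

# Enzyme is told the activity of each argument separately. If the broadcast
# kernel then bundles x and y into a single Broadcasted object, that one
# object has mixed activity and the kernel-launch rule can no longer tell
# the two apart.
Enzyme.autodiff(Reverse, f, Active, Const(x), Duplicated(y, dy))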

@maleadt (Member) commented Jun 10, 2024

The problem here is that all of the arguments to the kernel are passed together [as the broadcasted object].

Right, and we do that for all broadcasts. So why is this only a problem for map!, as you said in #535:

The implementation of map! (https://github.com/JuliaGPU/GPUArrays.jl/blob/ec9fe5b6f7522902e444c95a0c9248a4bc55d602/src/host/broadcast.jl#L120C46-L120C59) creates a broadcasted object which captures all of the arguments to map!. This is then passed to the kernel.

FWIW, if this kind of pattern is an issue for Enzyme, you'll run into it a lot, because passing arguments through structures is how closures work in Julia.
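
(For context, a simplified sketch of the map! pattern being referenced; this is not the exact GPUArrays implementation linked above:)

using Base.Broadcast: broadcasted, instantiate

# Simplified sketch: map! bundles the function and every input array into a
# single Broadcasted object, and that one object is what ends up being passed
# to the GPU kernel.
function sketch_map!(f, dest, xs...)
    bc = instantiate(broadcasted(f, xs...))  # captures f and all inputs together
    return copyto!(dest, bc)                 # the GPU copyto! launches the kernel with bc
end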

@wsmoses (Contributor, Author) commented Jun 10, 2024

It may not be a problem only for map, but map is certainly at a critical juncture.

In particular, Enzyme often doesn't care whether things are put into closures, as long as the closure gets optimized out (and potentially separated, etc.).

One significant issue here, however, is that the custom rule for the kernel call forces us to see the entire closure at the Julia level [and thus we can't separate the args out even if we wanted to]. Since this is what creates the kernel that is called, I'm not sure of a way to resolve the issue without this change.

The issue also doesn't arise if all the variables are differentiable [but in the context of a broadcast it is much more common to write x .+ y where one operand isn't differentiable].

Doing something like this might also have other downstream benefits. For example, arrays being passed separately may allow alias information to be propagated, whereas indirection via a closure would likely drop that information.

@wsmoses (Contributor, Author) commented Jun 18, 2024

bump @maleadt for thoughts?

@maleadt (Member) commented Jun 18, 2024

CI failures are related, so I figured you were still working on this.

Enzyme often doesn't care if things are put into closures

I don't get your subsequent reasoning. Enzyme obviously seems to care here, and I don't understand what the difference is between the broadcast kernel and the following pattern we use all over CUDA.jl:

using CUDA

function outer(x::CuArray)
    function kernel()
        # do something with x, captured through the kernel closure
        i = threadIdx().x
        @inbounds x[i] += 1f0
        return
    end
    @cuda threads=length(x) kernel()
end

The kernel closure, capturing x, is pretty similar to a Broadcasted object capturing arguments and being passed to a function.

@wsmoses (Contributor, Author) commented Jun 18, 2024

Oh yeah, sorry, I wanted to finish discussing before continuing work, to make sure the strategy was fine with you.

Ah, I was assuming the closure was within the CUDA kernel. However, your case doesn't present an issue.

The problem Enzyme has is with a data structure where one field is a differentiable CuArray and a second field is a non-differentiable CuArray, and with that data structure being passed into a custom rule (specifically the custom rule for the CUDA kernel launch).

Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray; it will be only one or the other.
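
(For illustration, a hypothetical sketch of the problematic shape just described: a closure capturing two CuArrays whose intended activities differ. The names are invented and this is not code from the PR:)

using CUDA

x = CUDA.rand(4)   # suppose this array is meant to be differentiable
y = CUDA.ones(4)   # while this one is meant to be constant

function combine!(x, y)
    # The closure below captures both arrays in a single object. The custom
    # rule for the kernel launch only sees that single captured object, so it
    # cannot assign different activities to x and y inside it.
    function kernel()
        i = threadIdx().x
        @inbounds x[i] += y[i]
        return
    end
    @cuda threads=length(x) kernel()
end

combine!(x, y)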

@maleadt (Member) commented Jun 18, 2024

Your case above is fine since the closure only has one CuArray and thus can't contain both a differentiable and a non-differentiable CuArray.

Yeah, that was just for simplicity of the example; we often capture more than one input. So while I'm not particularly opposed to this workaround, even though I do dislike it, you'll be running into other situations like this. I would assume that the situation will also arise when, e.g., passing tuples of CuArrays?

@wsmoses (Contributor, Author) commented Jun 18, 2024

Partially, it depends on context. It's really common for one to differentiate active x .+ inactive y [e.g. one is constant, the other is not], but doing a tuple kernel where these are user-specified different activities feels sufficiently uncommon that I'm willing to figure that out when we get there.

@wsmoses (Contributor, Author) commented Jun 18, 2024

@maleadt mind giving permission to run CI for the PR?

maleadt requested a review from vchuravy on June 20, 2024
@vchuravy (Member) left a comment

I double-checked the style handling, and indeed before 1.10 we need to do the ugly typeof/typeof dance.

maleadt merged commit 8c5d550 into JuliaGPU:master on Jun 25, 2024
13 of 14 checks passed
maleadt added a commit that referenced this pull request Jun 28, 2024
@maleadt (Member) commented Jun 28, 2024

Had to revert:

julia> using ForwardDiff, CUDA

julia> x = CUDA.ones(4,4)
4×4 CuArray{Float32, 2, CUDA.DeviceMemory}:
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0

julia> y = ForwardDiff.Dual.(x, true)
ERROR: GPU compilation of MethodInstance for (::GPUArrays.var"#35#37")(::CUDA.CuKernelContext, ::CuDeviceMatrix{…}, ::Int64, ::CUDA.CuArrayStyle{…}, ::Type{…}, ::Tuple{…}, ::Base.Broadcast.Extruded{…}, ::Bool) failed
KernelError: passing and using non-bitstype argument

Argument 6 to your kernel function is of type Type{ForwardDiff.Dual}, which is not isbits
