KernelAbstractions version of GPUArrays #525

leios · 2024-03-29T12:32:06Z

Right, tests are passing locally, but still a bunch of small things to do:

This PR supersedes #451 and should be ready for review next week. Just giving everyone a sneak peek now.

Also seems to fix #530

Project.toml

leios · 2024-03-30T14:35:06Z

This PR breaks the interface to CUDA, so the buildkite tests will fail unless I point to a specific PR (working on that now). How do I do that with buildkite?

Also: if this is merged, we might want to create a new release

maleadt · 2024-04-05T06:20:43Z

Maybe temporarily change the Pkg invocation during CI to pick up the accompanying back-end PRs:

GPUArrays.jl/.buildkite/pipeline.yml

Line 13 in 4623226

Pkg.develop(; name="CUDA")

leios · 2024-05-22T14:09:08Z

Ok, can't quite figure out the oneAPI test failures. It is breaking here: https://github.com/leios/GPUArrays.jl/blob/yoyoyo_rebase_time/test/testsuite/statistics.jl#L66 (and at a similar line above).

Here is the compare function:

function compare(f, AT::Type{<:AbstractGPUArray}, xs...; kwargs...)
    # copy on the CPU, adapt on the GPU, but keep Ref's
    cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), xs)
    gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x)
, xs)

    cpu_out = f(cpu_in...)
    gpu_out = f(gpu_in...)

    test_result(cpu_out, gpu_out; kwargs...)
end

and here are the relevant test_result functions:

function test_result(a::AbstractArray{T}, b::AbstractArray{T};
                     kwargs...) where {T<:NTuple{N,<:Number} where {N}}
    ET = eltype(T)
    ≈(reinterpret(ET, collect(a)), reinterpret(ET, collect(b)); kwargs...)
end
function test_result(as::NTuple{N,Any}, bs::NTuple{N,Any}; kwargs...) where {N}
    all(zip(as, bs)) do (a, b)
        test_result(a, b; kwargs...)
    end
end

So I took the bodies from these and just pasted them into the test:

        @testset "cov" begin
            s = 100
            @test compare(cov, AT, rand(ET, s))
            @test compare(cov, AT, rand(ET, s, 2))

            # copy of `compare` contents
            f = A->cov(A; dims=2)
            rand_array = rand(ET, s, 2)
            cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), rand_array)
            gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x), rand_array)
            cpu_out = f(cpu_in)
            gpu_out = f(gpu_in)
            
            # test_result prints `true` here
            println("cov:", '\t', ET, '\t', AT, '\t', cpu_out == gpu_out, '\t',
                    test_result(cpu_out, gpu_out))
            @test compare(A->cov(A; dims=2), AT, rand(ET, s, 2))
            if ET <: Real
                @test compare(cov, AT, rand(ET(1):ET(100), s))
            end

This returns true.

So now I don't understand why the compare(...) function isn't passing, but the same lines are when just put into the function. Also interesting note: the tests all pass when the dummy array size is smaller (like 10).

Also: I am just testing this by throwing it against CI because I couldn't find an Intel GPU...

leios · 2024-06-06T09:29:55Z

oneAPI passes the tests now. I don't know why. The best I can figure is that we are somehow no longer triggering JuliaGPU/oneAPI.jl#442 (which seems to be hardware dependent)

leios · 2024-06-11T12:49:58Z

Quick note for performance regressions. I ran all the tests on this branch (blue) and plotted them here against master (orange / red). In general, they are the same speed. I also ran the cases where master was faster separately and found that these tests are still generally the same speed.

I can look into this in more detail by automating the process, but I think it might be better to come up with a specific test case where KA is almost certainly slower than the current GPUArrays DSL. It would be a good idea to list all the reasons why KA could be slow in an issue or something so we can tackle them.

src/host/broadcast.jl

leios · 2024-06-27T18:12:01Z

Ok, it seems like this branch works and is ready for review. There is some overall cleanup left, but I'll do that after.

maleadt · 2024-07-22T07:29:26Z

Oh, I didn't realize this was ready for review. We should get this merged!

About the CUDA.jl branch, leios/CUDA.jl@b085472, what is the reason this requires a separate CuArrayBackend?

leios · 2024-07-22T08:31:30Z

You are right. All the XArrayBackends can be removed. Let me mess around and test locally.

The main thing that stalled the PR is that I couldn't figure out the CI on the CUDA side and got swamped with other things.

maleadt · 2024-07-22T08:33:17Z

Now that we're past JuliaCon I should have the time to help out, so feel free to just list issues here.

In parallel, I'll be looking at packaging POCL so that we can hopefully move forwards on an improved CPU back-end for KA.jl too.

leios · 2024-07-22T08:45:36Z

I was literally just about to create an issue in KA about that. I'll go ahead and rebase everything up for this (these) PRs

leios · 2024-07-22T11:09:52Z

Just rebased up (also had to revert the enzyme stuff). All tests pass locally on AMDGPU. Could we rerun the CI to make sure the errors are consistent on each backend's master / main?

leios · 2024-07-22T14:30:45Z

So the main problem is with launch_heuristic and launch_configuration as well as KernelAbstractions.launch_config. These are kinda conflicting and I'm not actually sure how to write the launch_heuristic function in CUDA/gpuarrays.jl.

This one doesn't work because we need to read in an ndrange when doing the config.

@inline function GPUArrays.launch_heuristic(::CUDABackend, f::F, args::Vararg{Any,N};
                                            elements::Int, elements_per_thread::Int) where {F,N}

    obj = f(CUDABackend())
    ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(obj, nothing,
                                                                  nothing)

    # this might not be the final context, since we may tune the workgroupsize
    ctx = KA.mkcontext(obj, ndrange, iterspace)
    kernel = @cuda launch=false obj.f(ctx, args...)

    # launching many large blocks) lowers performance, as observed with broadcast, so cap
    # the block size if we don't have a grid-stride kernel (which would keep the grid small)
    if elements_per_thread > 1
        launch_configuration(kernel.fun)
    else
        launch_configuration(kernel.fun; max_threads=256)
    end
end

The tests passed earlier because we weren't calling the right launch_heuristic from CUDA.

maleadt · 2024-07-25T07:48:44Z

Something is really wrong with Metal.jl's map! on this PR:

julia> map!(-, Metal.zeros(Float32, 1), Metal.ones(Float32, 1))
1-element MtlVector{Float32, Private}:
 -1.0

julia> map!(-, Metal.zeros(Float32, 2), Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 0.0
 0.0

julia> Metal.zeros(Float32, 2) .= .-(Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 -1.0
 -1.0

It doesn't seem launch configuration related because I can see 2 threads being launched here, as expected.

leios · 2024-07-25T07:57:41Z

Map goes through launch_heuristic on Metal's side. I might have really messed up the logic somehow? AMDGPU doesn't have launch_heuristic and instead falls back to the default one defined here (in GPUArrays). It seems to be fine:

julia> map!(-, AMDGPU.zeros(Float32, 1), AMDGPU.ones(Float32, 1))
1-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0

julia> map!(-, AMDGPU.zeros(Float32, 2), AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0

julia> AMDGPU.zeros(Float32, 2) .= .-(AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0

I might be tired, but those look right to me. I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? JuliaGPU/Metal.jl#328

maleadt · 2024-07-25T08:52:30Z

I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? JuliaGPU/Metal.jl#328

That doesn't help, as the launch configuration seems correct: config = (threads = 2, blocks = 1, elements_per_thread = 1), so the launch heuristic isn't to blame (and to be sure, I made it fall back to the GPUArrays definition, which didn't change a thing).

maleadt · 2024-07-25T09:09:15Z

This looks like a miscompilation in Metal.jl. I'll investigate. For a workaround: removing the for i in 1:nelem seems to help, so maybe we should get rid of the grid-stride processing here already. elements_per_thread is going to be always 1 anyway, and it would simplify the "scary" indexing in the kernel ((J-1)*nelem + j). It would probably be best to add grid-stride indexing support to KA.jl before revisiting this in the future.

maleadt · 2024-07-25T09:12:35Z

Actually, found another workaround: add @inbounds to the CartesianIndices constructor.

leios · 2024-07-25T10:21:12Z

Just to be clear, there are 2 options:

@inbounds J_c = CartesianIndices(axes(bc))[(J-1)*nelem + j]
Remove the launch_heuristic / elements_per_thread approach.

You are in favor of 2 with the hopes of tackling it in KA down the road?

maleadt · 2024-07-25T12:36:05Z

Yeah. As I mentioned on Slack, I haven't noticed much performance improvements from grid stride loops here, and seeing how it uglifies the indexing I'd prefer we just get rid of it right now until KA.jl properly supports them.

In the case we need to do manual CartesianIndex construction though, it should be annotated @inbounds, not to avoid the Metal.jl compiler bug but to avoid the unnecessary trap branch.

leios · 2024-07-26T11:04:04Z

Running out of time to keep debugging this today, so I'll just write everything down.

After removing the launch_heuristic and elements_per_thread approach, I am getting errors with map for certain array sizes and also in minimum!, maximum! and extrema! for different array sizes. The issue with broadcast is weird:

BoundsError: attempt to access Base.Broadcast.Broadcasted{JLArrays.JLArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{JLArrays.JLDeviceArray{ComplexF64, 2}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}} at index [1]

So it's failing to access a Broadcasted of size (2,2) at index 1. Note that the same lines in the REPL do work:

map!(+, AMDGPU.zeros(Float32, 10,10), AMDGPU.ones(Float32, 10,1))

So the errors only appear when running ] test for broadcasting.

The other issue is with minimum!, maximum! and extrema! for different array sizes. Again, I am struggling to exactly pinpoint this issue, but it happens when sz = (10, 10, 10) and red = (1, 10, 10). It doesn't break on other array sizes. To be clear sz = (10, 10, 10) and red = (1, 1, 1) seems to work.

Long story short. There's something wrong with map. I am sure it's just a stupid typo that I can't quite pick out right now:

function Base.map!(f, dest::AnyGPUArray, xs::AbstractArray...)
    # custom broadcast, ignoring the container size mismatches
    # (avoids the reshape + view that our mapreduce impl has to do)
    indices = LinearIndices.((dest, xs...))
    common_length = minimum(length.(indices))
    common_length==0 && return

    bc = Broadcast.instantiate(Broadcast.broadcasted(f, xs...))
    if bc isa Broadcast.Broadcasted
        bc = Broadcast.preprocess(dest, bc)
    end

    # grid-stride kernel
    @kernel function map_kernel(dest, bc)
        j = @index(Global, Linear)
        @inbounds dest[j] = bc[j]
    end

    kernel = map_kernel(get_backend(dest))
    config = KernelAbstractions.launch_config(kernel, common_length, nothing)
    kernel(dest, bc; ndrange = config[1], workgroupsize = config[2])

    if eltype(dest) <: BrokenBroadcast
        throw(ArgumentError("Map operation resulting in $(eltype(eltype(dest))) is not GPU compatible"))
    end

    return dest
end

maleadt · 2024-08-29T09:22:03Z

FYI, although I haven't had the time to look into this further, the Metal miscompilation has been fixed.

leios · 2024-08-29T11:57:38Z

Oh, great! Tbh, I had to put this on hold for August because our daycare is closed this month and I'm juggling childcare duties. I should be able to pick it back up in September

This reverts commit 0c7e26b.

leios · 2024-09-16T13:35:04Z

I couldn't quite get 00c8dd4 to work, so I reverted it to check to see if Metal would build.

It seems like now CUDA and AMDGPU (locally) both pass, but I'm not sure what's going on with Metal and oneAPI

maleadt · 2024-09-19T12:55:06Z

The launch_heuristic/launch_configuration is not something that can stay, though. I have some time tomorrow and next week, so I can have a look if you want.

maleadt · 2024-09-20T16:51:25Z

#559 is looking good (all CI green), so let's close this branch on favor of it. Huge thanks to @leios for doing the grunt work!

vchuravy · 2024-09-20T20:13:12Z

Congratulations @leios !

leios mentioned this pull request Mar 29, 2024

Attempt to fix up KA transition #517

Closed

4 tasks

vchuravy reviewed Mar 29, 2024

View reviewed changes

Project.toml Outdated Show resolved Hide resolved

leios force-pushed the yoyoyo_rebase_time branch from d05e5cd to 0800eee Compare March 30, 2024 14:33

lcw mentioned this pull request Apr 3, 2024

Kernel using StaticArray compiles in julia v1.9.4 but not in v1.10.2 JuliaGPU/CUDA.jl#2313

Closed

leios mentioned this pull request Apr 4, 2024

new GPUArrays interface for KA transition JuliaGPU/CUDA.jl#2315

Draft

1 task

This was referenced Apr 10, 2024

adding necessary changes for KA transition for gpuarrays JuliaGPU/Metal.jl#328

Closed

adding necessary changes for KA transition for gpuarrays JuliaGPU/oneAPI.jl#414

Closed

adding necessary changes for KA transition with gpuarrays JuliaGPU/AMDGPU.jl#619

Closed

leios mentioned this pull request Jun 5, 2024

Multiplication of StridedMaybeAdjOrTransMat broken for certain matrix sizes JuliaGPU/oneAPI.jl#442

Open

leios force-pushed the yoyoyo_rebase_time branch 2 times, most recently from cc5662a to 1fd096b Compare June 6, 2024 09:09

leios force-pushed the yoyoyo_rebase_time branch 2 times, most recently from dcd7468 to 38f4302 Compare June 27, 2024 12:37

leios commented Jun 27, 2024

View reviewed changes

src/host/broadcast.jl Outdated Show resolved Hide resolved

leios force-pushed the yoyoyo_rebase_time branch from 38f4302 to c370523 Compare June 27, 2024 16:31

leios marked this pull request as ready for review June 27, 2024 18:11

maleadt mentioned this pull request Jul 22, 2024

Equivalent of cuSparse - start with sparse matvec JuliaGPU/Metal.jl#208

Open

leios force-pushed the yoyoyo_rebase_time branch from c370523 to c1f6283 Compare July 22, 2024 09:19

leios force-pushed the yoyoyo_rebase_time branch from c1f6283 to 93e212b Compare July 22, 2024 12:16

leios force-pushed the yoyoyo_rebase_time branch 2 times, most recently from 7936cf7 to 3560049 Compare July 23, 2024 09:05

maleadt mentioned this pull request Jul 25, 2024

Control flow-related miscompilation: JuliaGPU/Metal.jl#401

Closed

cncastillo mentioned this pull request Jul 31, 2024

Add GPU support to MRIReco.jl MagneticResonanceImaging/MRIReco.jl#182

Merged

leios added 3 commits September 16, 2024 13:49

Transition GPUArrays to KernelAbstractions

d87ca55

remocing heuristic

00c8dd4

Revert "remocing heuristic"

52db290

This reverts commit 0c7e26b.

leios force-pushed the yoyoyo_rebase_time branch from 6aae0a8 to 56fbd8f Compare September 16, 2024 12:15

recommenting some stuff

c2bd9f4

leios force-pushed the yoyoyo_rebase_time branch from c7b99a6 to c2bd9f4 Compare September 16, 2024 12:56

maleadt mentioned this pull request Sep 19, 2024

Switch to KernelAbstractions.jl #559

Merged

maleadt closed this Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KernelAbstractions version of GPUArrays #525

KernelAbstractions version of GPUArrays #525

leios commented Mar 29, 2024 •

edited by maleadt

Loading

leios commented Mar 30, 2024

maleadt commented Apr 5, 2024

leios commented May 22, 2024 •

edited

Loading

leios commented Jun 6, 2024

leios commented Jun 11, 2024

leios commented Jun 27, 2024

maleadt commented Jul 22, 2024

leios commented Jul 22, 2024

maleadt commented Jul 22, 2024

leios commented Jul 22, 2024

leios commented Jul 22, 2024 •

edited

Loading

leios commented Jul 22, 2024

maleadt commented Jul 25, 2024 •

edited

Loading

leios commented Jul 25, 2024 •

edited

Loading

maleadt commented Jul 25, 2024 •

edited

Loading

maleadt commented Jul 25, 2024

maleadt commented Jul 25, 2024

leios commented Jul 25, 2024

maleadt commented Jul 25, 2024

leios commented Jul 26, 2024 •

edited

Loading

maleadt commented Aug 29, 2024

leios commented Aug 29, 2024

leios commented Sep 16, 2024

maleadt commented Sep 19, 2024

maleadt commented Sep 20, 2024 •

edited

Loading

vchuravy commented Sep 20, 2024

KernelAbstractions version of GPUArrays #525

KernelAbstractions version of GPUArrays #525

Conversation

leios commented Mar 29, 2024 • edited by maleadt Loading

leios commented Mar 30, 2024

maleadt commented Apr 5, 2024

leios commented May 22, 2024 • edited Loading

leios commented Jun 6, 2024

leios commented Jun 11, 2024

leios commented Jun 27, 2024

maleadt commented Jul 22, 2024

leios commented Jul 22, 2024

maleadt commented Jul 22, 2024

leios commented Jul 22, 2024

leios commented Jul 22, 2024 • edited Loading

leios commented Jul 22, 2024

maleadt commented Jul 25, 2024 • edited Loading

leios commented Jul 25, 2024 • edited Loading

maleadt commented Jul 25, 2024 • edited Loading

maleadt commented Jul 25, 2024

maleadt commented Jul 25, 2024

leios commented Jul 25, 2024

maleadt commented Jul 25, 2024

leios commented Jul 26, 2024 • edited Loading

maleadt commented Aug 29, 2024

leios commented Aug 29, 2024

leios commented Sep 16, 2024

maleadt commented Sep 19, 2024

maleadt commented Sep 20, 2024 • edited Loading

vchuravy commented Sep 20, 2024

leios commented Mar 29, 2024 •

edited by maleadt

Loading

leios commented May 22, 2024 •

edited

Loading

leios commented Jul 22, 2024 •

edited

Loading

maleadt commented Jul 25, 2024 •

edited

Loading

leios commented Jul 25, 2024 •

edited

Loading

maleadt commented Jul 25, 2024 •

edited

Loading

leios commented Jul 26, 2024 •

edited

Loading

maleadt commented Sep 20, 2024 •

edited

Loading