Using a single array of pointers for multi-GPU AMDGPU computation #663
Comments
@maleadt how does CUDA make array (…)?
You can, but at the moment it is not pretty:

```julia
bytesize = prod(dims) * sizeof(T)
buf = AMDGPU.Runtime.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
amdgpu_pointer_ret = ROCArray{T, N}(AMDGPU.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
# Copy from CPU array.
copyto!(amdgpu_pointer_ret, pointer_ret)
```

But this is different from CUDA. What CUDA devices do you use? Maybe they have unified memory?
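For reference, a minimal sketch of the unified-memory route on the CUDA side, assuming CUDA.jl ≥ 5 where `cu` accepts a `unified` keyword; the array here is illustrative, not code from this thread:

```julia
using CUDA

x = rand(Float64, 16)

# `unified = true` requests unified (managed) memory, which every device
# and the host can access without explicit copies.
xu = cu(x; unified = true)
```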
Thank you @pxl-th!! Sorry @pxl-th, but I do not understand your code well. Can you use the variable names that I used in my code, to help me see where I have to make the modifications? I think I must be using an old version of AMDGPU, because I cannot find AMDGPU.pool_free and AMDGPU.Managed.
Yes, you should use AMDGPU 1.0; it has important multi-GPU fixes. Here's the code. I don't have access to a multi-GPU system at the moment, but at least on one GPU it works:

```julia
using AMDGPU

"""
Create a ROCArray that is accessible from different GPUs (a.k.a. portable).
"""
function get_portable_rocarray(x::Array{T, N}) where {T, N}
    dims = size(x)
    bytesize = sizeof(T) * prod(dims)
    buf = AMDGPU.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
    ROCArray{T, N}(AMDGPU.GPUArrays.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
end

function main()
    ndev = 2
    pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64, AMDGPU.Device.AS.Global}}(undef, ndev)
    # Fill `pointer_ret` with pointers here.
    amdgpu_pointer_ret = get_portable_rocarray(pointer_ret)
    @show amdgpu_pointer_ret
    return
end
```
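As for filling `pointer_ret`, here is a hypothetical sketch of one way to do it, assuming `AMDGPU.rocconvert` (the same conversion `@roc` applies to kernel arguments); the helper name and array sizes are illustrative:

```julia
# Hypothetical helper: create one array per visible device and store its
# device-side handle (a ROCDeviceVector) in `pointer_ret`.
function fill_pointers!(pointer_ret)
    devs = AMDGPU.devices()
    arrays = Vector{ROCVector{Float64}}(undef, length(pointer_ret))
    for i in 1:length(pointer_ret)
        AMDGPU.device!(devs[i])
        arrays[i] = AMDGPU.zeros(Float64, 16)
        # rocconvert yields the ROCDeviceVector that kernels receive.
        pointer_ret[i] = AMDGPU.rocconvert(arrays[i])
    end
    # Keep `arrays` referenced so the device buffers are not freed.
    return arrays
end
```

The caller must hold on to the returned arrays for as long as the device-side handles in `pointer_ret` are in use.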
Assuming the buffer type used here is device memory (which is the default), CUDA.jl enables P2P access between devices when doing the conversion […]. Note that this isn't guaranteed to always work; the devices need to be compatible, or P2P isn't supported. In that case the user is responsible for staging through the CPU (by explicit […]).
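Where P2P is unavailable, a minimal sketch of such staging through the CPU, assuming two visible devices; this is illustrative, not CUDA.jl's internal fallback path:

```julia
using CUDA

devs = collect(CUDA.devices())
CUDA.device!(devs[1])
a = CUDA.rand(Float64, 16)

host = Array(a)        # device 1 -> CPU
CUDA.device!(devs[2])
b = CuArray(host)      # CPU -> device 2
```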
Thanks @pxl-th and @maleadt for your comments!
[…] Do you know why?
Ah... That's a bug in AMDGPU.jl with setting features of the compilation target. I'll fix it.
Hi folks!
I am working on the multi-GPU support of JACC: https://github.com/JuliaORNL/JACC.jl/
For that, I need to be able to use a single array of pointers that can store pointers to different GPUs.
I opened another issue a few days ago: #662
Although that helped me understand the problem better, I still cannot run the test code below.
I can run that code on CUDA (I include the CUDA code too, in case it is useful).
@pxl-th mentioned the CU_MEMHOSTALLOC_PORTABLE CUDA flag. Can we use that in AMDGPU?
Here are the codes:
AMDGPU: […]
CUDA: […]
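For reference, a minimal CUDA.jl sketch of the pattern in question (a single host array holding per-device `CuDeviceVector` handles); the names and sizes are illustrative, not the original test code:

```julia
using CUDA

function main_cuda()
    devs = collect(CUDA.devices())
    ndev = length(devs)
    # Host vector of device-side handles, one per GPU.
    pointer_ret = Vector{CUDA.CuDeviceVector{Float64, CUDA.AS.Global}}(undef, ndev)
    arrays = Vector{CuVector{Float64}}(undef, ndev)
    for i in 1:ndev
        CUDA.device!(devs[i])
        arrays[i] = CUDA.zeros(Float64, 16)
        # cudaconvert yields the CuDeviceVector that kernels receive.
        pointer_ret[i] = CUDA.cudaconvert(arrays[i])
    end
    return pointer_ret, arrays
end
```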