Using a single array of pointers for multi-GPU AMDGPU computation #663

Open
pedrovalerolara opened this issue Aug 15, 2024 · 7 comments

pedrovalerolara commented Aug 15, 2024

Hi folks!

I am working on multi-GPU support for JACC: https://github.com/JuliaORNL/JACC.jl/
For that, I need a single array that can store pointers to memory on different GPUs.

I opened another issue a few days ago: #662
Although that helped me understand the problem better, I still cannot run the test code below.
The equivalent code runs fine on CUDA (I have included the CUDA code too, in case it is useful).

@pxl-th mentioned the CU_MEMHOSTALLOC_PORTABLE CUDA flag. Can we use that in AMDGPU?

Here are the codes:
AMDGPU

function multi_scal(N, dev_id, alpha, x)
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  if i <= N
    @inbounds x[dev_id][i] *= alpha
  end
  return nothing
end

x = ones(600)
alpha = 2.0
ndev = length(AMDGPU.devices())
ret = Vector{Any}(undef, 2)

AMDGPU.device!(AMDGPU.device(1))
s_array = length(x)
s_arrays = ceil(Int, s_array/ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = AMDGPU.ROCArray(x[((i-1)*s_arrays)+1:i*s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
end

AMDGPU.device!(AMDGPU.device(1))
amdgpu_pointer_ret = ROCArray(pointer_ret)
ret[1] = amdgpu_pointer_ret
ret[2] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)


# This works
AMDGPU.device!(AMDGPU.device(1))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 1, alpha, ret[1])


# This does not work
AMDGPU.device!(AMDGPU.device(2))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 2, alpha, ret[1])

CUDA

function multi_scal(N, dev_id, alpha, x)
  i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  if i <= N
    @inbounds x[dev_id][i] *= alpha
  end
  return nothing
end

x = ones(600)
alpha = 2.0
ndev = length(devices())
ret = Vector{Any}(undef, 2)

device!(0)
s_array = length(x)
s_arrays = ceil(Int, s_array/ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{CuDeviceVector{Float64,CUDA.AS.Global}}(undef, ndev)

for i in 1:ndev
  device!(i-1)
  array_ret[i] = CuArray(x[((i-1)*s_arrays)+1:i*s_arrays])
  pointer_ret[i] = cudaconvert(array_ret[i])
end

device!(0)
cuda_pointer_ret = CuArray(pointer_ret)
ret[1] = cuda_pointer_ret
ret[2] = array_ret

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)


# This works
device!(0)
@cuda threads=threads blocks=blocks multi_scal(s_arrays, 1, alpha, ret[1])


# This works too
device!(1)
@cuda threads=threads blocks=blocks multi_scal(s_arrays, 2, alpha, ret[1])

pxl-th commented Aug 15, 2024

@maleadt how does CUDA make array (cuda_pointer_ret = CuArray(pointer_ret)) accessible on multiple GPUs in this case?

pxl-th commented Aug 16, 2024

@pxl-th mentioned the CU_MEMHOSTALLOC_PORTABLE CUDA flag. Can we use that in AMDGPU?

You can, but at the moment it is not pretty:

# Assuming `T`, `dims` and `N` describe the array you want to allocate,
# e.g. `T = eltype(pointer_ret)`, `dims = size(pointer_ret)`, `N = length(dims)`:
bytesize = prod(dims) * sizeof(T)
buf = AMDGPU.Runtime.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
amdgpu_pointer_ret = ROCArray{T, N}(AMDGPU.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)

# Copy from the CPU array.
copyto!(amdgpu_pointer_ret, pointer_ret)

But this is different from what CUDA does. Which CUDA devices do you use? Maybe they have unified memory?
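If you want to check, something like the following should report whether your devices support unified (managed) memory. This is a quick sketch; I'm assuming the attribute constant follows CUDA.jl's driver bindings:

using CUDA

# Query each device for managed (unified) memory support; the constant
# mirrors the driver's CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY (name assumed).
for dev in CUDA.devices()
    managed = CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MANAGED_MEMORY) == 1
    println(dev, ": managed memory = ", managed)
end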

@pedrovalerolara

Thank you @pxl-th!!
Regarding NVIDIA systems, I am using two different systems, both with two GPUs: one with A100s and the other with H100s.
I am not doing anything special for unified memory. I have no idea whether CUDA.jl is doing something in that regard.

Sorry @pxl-th, but I do not fully understand your code. Could you use the variable names from my code, to help me see where I have to make the modifications?

I think I must be using an old version of AMDGPU, because I cannot find AMDGPU.pool_free and AMDGPU.Managed.
I am on AMDGPU 0.8. Do I need to use a more recent version?

pxl-th commented Aug 16, 2024

Yes, you should use AMDGPU 1.0; it has important multi-GPU fixes.

Here's the code. I don't have access to a multi-GPU system at the moment, but at least on one GPU it works:

using AMDGPU

"""
Create a ROCArray that is accessible from different GPUs (a.k.a. portable).
"""
function get_portable_rocarray(x::Array{T, N}) where {T, N}
    dims = size(x)
    bytesize = sizeof(T) * prod(dims)
    buf = AMDGPU.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
    ROCArray{T, N}(AMDGPU.GPUArrays.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
end

function main()
    ndev = 2
    pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

    # Fill `pointer_ret` with pointers here.

    amdgpu_pointer_ret = get_portable_rocarray(pointer_ret)
    @show amdgpu_pointer_ret
    return
end

maleadt commented Aug 17, 2024

how does CUDA make array (cuda_pointer_ret = CuArray(pointer_ret)) accessible on multiple GPUs in this case?

Assuming the buffer type used here is device memory (which is the default), CUDA.jl enables P2P access between devices when converting CUDA.Managed (a struct wrapping buffers that keeps track of the owning device and the stream that last accessed the memory) to a pointer: https://github.com/JuliaGPU/CUDA.jl/blob/69043ee42f4c6e08a12662da4d0537b721eeee84/src/memory.jl#L530-L573

Note that this isn't guaranteed to always work: the devices need to be compatible, or P2P isn't supported. In that case the user is responsible for staging through the CPU (with an explicit copyto!), or for using unified or host memory, which is automatically available on all devices.
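As a minimal sketch of the unified-memory fallback (assuming the `unified` keyword of `cu` and the `CUDA.UnifiedMemory` buffer type from recent CUDA.jl versions; older versions spell it `CUDA.Mem.Unified`):

using CUDA

x = ones(600)

# Allocate in unified memory so every device can dereference the buffer,
# even when P2P access between the devices is unavailable.
x_unified = cu(x; unified=true)

# Equivalent explicit form:
x_unified2 = CuArray{Float64,1,CUDA.UnifiedMemory}(undef, length(x))
copyto!(x_unified2, x)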

@pedrovalerolara

Thanks @pxl-th and @maleadt for your comments!!!
I am using the code shown below, which incorporates @pxl-th's suggestions, on one node of Frontier (8 AMD GPUs per node).
The code works well now. Thank you!!
However, when I run the kernels I see this:

julia> @roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 3, alpha, ret[1])
'+sramecc-wavefrontsize32' is not a recognized feature for this target (ignoring feature)
[the same warning is printed 12 times]
AMDGPU.Runtime.HIPKernel{typeof(multi_scal), Tuple{Int64, Int64, Float64, AMDGPU.Device.ROCDeviceVector{AMDGPU.Device.ROCDeviceVector{Float64, 1}, 1}}}(multi_scal, AMDGPU.HIP.HIPFunction(Ptr{Nothing} @0x0000000007d7d6a0, AMDGPU.HIP.HIPModule(Ptr{Nothing} @0x00000000082566b0), Symbol[]))

Do you know why?
I am using a local Julia installation (1.10.4) and AMDGPU 1.0.

function get_portable_rocarray(x::Array{T, N}) where {T, N}
    dims = size(x)
    bytesize = sizeof(T) * prod(dims)
    buf = AMDGPU.Mem.HostBuffer(bytesize, AMDGPU.HIP.hipHostAllocPortable)
    ROCArray{T, N}(AMDGPU.GPUArrays.DataRef(AMDGPU.pool_free, AMDGPU.Managed(buf)), dims)
end


function multi_scal(N, dev_id, alpha, x)
  i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
  if i <= N
    @inbounds x[dev_id][i] *= alpha
  end
  return nothing
end

x = ones(800)
alpha = 2.0
ndev = length(AMDGPU.devices())
ret = Vector{Any}(undef, 2)

AMDGPU.device!(AMDGPU.device(1))
s_array = length(x)
s_arrays = ceil(Int, s_array/ndev)
array_ret = Vector{Any}(undef, ndev)
pointer_ret = Vector{AMDGPU.Device.ROCDeviceVector{Float64,AMDGPU.Device.AS.Global}}(undef, ndev)

numThreads = 256
threads = min(s_arrays, numThreads)
blocks = ceil(Int, s_arrays / threads)


for i in 1:ndev
  AMDGPU.device!(AMDGPU.device(i))
  array_ret[i] = AMDGPU.ROCArray(x[((i-1)*s_arrays)+1:i*s_arrays])
  pointer_ret[i] = AMDGPU.rocconvert(array_ret[i])
end

AMDGPU.device!(AMDGPU.device(1))

amdgpu_pointer_ret = get_portable_rocarray(pointer_ret)
copyto!(amdgpu_pointer_ret, pointer_ret)

ret[1] = amdgpu_pointer_ret
ret[2] = array_ret


AMDGPU.device!(AMDGPU.device(1))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 1, alpha, ret[1])


AMDGPU.device!(AMDGPU.device(2))
@roc groupsize=threads gridsize=blocks multi_scal(s_arrays, 2, alpha, ret[1])
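To sanity-check the results on the host (a small sketch; only devices 1 and 2 run kernels in the code above):

# Each of the first two chunks should now be scaled by `alpha`.
for i in 1:2
  AMDGPU.device!(AMDGPU.device(i))
  AMDGPU.synchronize()
  @assert Array(array_ret[i]) ≈ alpha .* ones(s_arrays)
end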

pxl-th commented Aug 19, 2024

Ah, that's a bug in AMDGPU.jl when setting features for the compilation target. I'll fix it.
