Inconsistent results across platforms and datatypes on CPU backend #468

DhruvDh · 2024-03-08T14:21:47Z

I am trying to write a simple reduction kernel, which seems to only work for certain datatypes.
The datatypes it works for is inconsistent across platforms.

Here is the kernel, which computes the sum of squares for a vector of 6000 ones -

using KernelAbstractions
using Atomix: @atomic

@kernel function self_dot_kernel!(result, @Const(vec))
  BLOCK_SIZE = @uniform @groupsize()[1]
  I = @index(Global, Linear)
  li = @index(Local, Linear)

  intermediateresults = @localmem eltype(vec) (BLOCK_SIZE)

  if I <= length(vec)
    intermediateresults[li] = vec[I] * vec[I]
  else
    intermediateresults[li] = 0
  end
  @synchronize()

  for i in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
    if li + i <= BLOCK_SIZE
      intermediateresults[li] += intermediateresults[li+i]
    end
    @synchronize()
  end


  @synchronize()
  if li == 1
    @atomic result[1] += intermediateresults[1]
  end
end

backend = CPU()
types = [Float16, Float32, Float64, Int16, Int32, Int64, UInt16, UInt32, UInt64]
kernel! = self_dot_kernel!(backend, 32)
platform = Sys.MACHINE

for T in types
  X = KernelAbstractions.ones(backend, T, 6000)
  println("\n=== Type: $T ($platform) ==\n")

  for i in 1:6
    dotresult = KernelAbstractions.zeros(backend, T, 1)
    kernel!(dotresult, X, ndrange=length(X))
    KernelAbstractions.synchronize(backend)

    if i % 3 == 0
      println("  $(dotresult[1])")
    else
      print("  $(dotresult[1]),")
    end
  end
end

And here are some outputs, ran using julia 1.10.2 -

=== Type: Float16 (arm64-apple-darwin23.2.0) ==

  6.0e3,  6.0e3,  5.996e3
  6.0e3,  7.424e3,  6.0e3

=== Type: Float32 (arm64-apple-darwin23.2.0) ==

  6000.0,  6000.0,  6000.0
  6000.0,  6000.0,  6000.0

=== Type: Float64 (arm64-apple-darwin23.2.0) ==

  6000.0,  6000.0,  6000.0
  6000.0,  6000.0,  6000.0

=== Type: Int16 (arm64-apple-darwin23.2.0) ==

  6001,  6004,  6003
  6000,  6003,  6001

=== Type: Int32 (arm64-apple-darwin23.2.0) ==

  476212432,  49993592,  6001
  6004,  6004,  49993592

=== Type: Int64 (arm64-apple-darwin23.2.0) ==

  6000,  6004,  6000
  4547881696,  4546395904,  6000

=== Type: UInt16 (arm64-apple-darwin23.2.0) ==

  6003,  6003,  6003
  6003,  6003,  6003

=== Type: UInt32 (arm64-apple-darwin23.2.0) ==

  254459536,  230300864,  468297712
  6000,  6000,  50026560

=== Type: UInt64 (arm64-apple-darwin23.2.0) ==

  6038,  6001,  4551972480
  4554654336,  6001,  6000

Float32, and Float64 always work on apple-aarch64, making me think that there's nothing wrong with the algorithm. And on windows-x64 -

=== Type: Float16 (x86_64-w64-mingw32) ==

  6.0e3,  6.0e3,  6.0e3
  6.0e3,  6.0e3,  6.0e3

=== Type: Float32 (x86_64-w64-mingw32) ==

  6000.0,  6000.0,  6000.0
  -2.7825533e35,  6000.0,  6000.0

=== Type: Float64 (x86_64-w64-mingw32) ==

  6000.0,  6000.0,  6000.0
  6000.0,  6000.0,  6000.0

=== Type: Int16 (x86_64-w64-mingw32) ==

  6004,  6000,  6001
  6002,  6000,  6003

=== Type: Int32 (x86_64-w64-mingw32) ==

  -556641600,  -150726792,  -150726792
  -108881600,  -91191952,  6041

=== Type: Int64 (x86_64-w64-mingw32) ==

  42949678970,  1846738422272,  1846692115360
  1846740674704,  6000,  1846740275056

=== Type: UInt16 (x86_64-w64-mingw32) ==

  6002,  6027,  6003
  6004,  6003,  6004

=== Type: UInt32 (x86_64-w64-mingw32) ==

  6011,  4144240504,  4165309344
  4170999664,  569204,  6000

=== Type: UInt64 (x86_64-w64-mingw32) ==

  140724418903328,  1846746033440,  1846275974656
  6000,  6009,  1846685210488

On windows-x64, only Float64 consistently works. What is going on?

The text was updated successfully, but these errors were encountered:

DhruvDh · 2024-03-08T14:42:38Z

I tried to run this on AMDGPU and Metal, on both backends I get something very similar to the following -

RROR: LoadError: InvalidIRError: compiling MethodInstance for gpu_self_dot_kernel!(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(32,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, ::AMDGPU.Device.ROCDeviceVector{Float16, 1}, ::AMDGPU.Device.ROCDeviceVector{Float16, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to ijl_alloc_array_1d)
Stacktrace:
 [1] Array
   @ ./boot.jl:477
 [2] vect
   @ ./array.jl:163
 [3] macro expansion
   @ /media/dhruvdh/External/Dropbox/documents/repositories/Llama2.jl/test.jl:19
 [4] gpu_self_dot_kernel!
   @ ~/.julia/packages/KernelAbstractions/zPAn3/src/macros.jl:95
 [5] gpu_self_dot_kernel!
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.HIPCompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/validation.jl:147
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:445 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:444 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/utils.jl:92
  [6] emit_llvm
    @ ~/.julia/packages/GPUCompiler/C55Eh/src/utils.jl:86 [inlined]
  [7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:134
  [8] codegen
    @ ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:115 [inlined]
  [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:111
 [10] compile
    @ ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:103 [inlined]
 [11] #40
    @ ~/.julia/packages/AMDGPU/gtxsf/src/compiler/codegen.jl:172 [inlined]
 [12] JuliaContext(f::AMDGPU.Compiler.var"#40#41"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.HIPCompilerParams}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:52
 [13] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/driver.jl:42
 [14] hipcompile(job::GPUCompiler.CompilerJob)
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/gtxsf/src/compiler/codegen.jl:171
 [15] actual_compilation(cache::Dict{Any, AMDGPU.HIP.HIPFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.HIPCompilerParams}, compiler::typeof(AMDGPU.Compiler.hipcompile), linker::typeof(AMDGPU.Compiler.hiplink))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/execution.jl:125
 [16] cached_compilation(cache::Dict{Any, AMDGPU.HIP.HIPFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.HIPCompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/C55Eh/src/execution.jl:103
 [17] macro expansion
    @ ~/.julia/packages/AMDGPU/gtxsf/src/compiler/codegen.jl:139 [inlined]
 [18] macro expansion
    @ ./lock.jl:267 [inlined]
 [19] hipfunction(f::typeof(gpu_self_dot_kernel!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(32,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float16, 1}, AMDGPU.Device.ROCDeviceVector{Float16, 1}}}; kwargs::@Kwargs{})
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/gtxsf/src/compiler/codegen.jl:133
 [20] hipfunction(f::typeof(gpu_self_dot_kernel!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(32,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, AMDGPU.Device.ROCDeviceVector{Float16, 1}, AMDGPU.Device.ROCDeviceVector{Float16, 1}}})
    @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/gtxsf/src/compiler/codegen.jl:132
 [21] macro expansion
    @ ~/.julia/packages/AMDGPU/gtxsf/src/highlevel.jl:172 [inlined]
 [22] (::KernelAbstractions.Kernel{ROCBackend, KernelAbstractions.NDIteration.StaticSize{(32,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_self_dot_kernel!)})(::ROCArray{Float16, 1, AMDGPU.Runtime.Mem.HIPBuffer}, ::Vararg{ROCArray{Float16, 1, AMDGPU.Runtime.Mem.HIPBuffer}}; ndrange::Int64, workgroupsize::Nothing)
    @ AMDGPU.ROCKernels ~/.julia/packages/AMDGPU/gtxsf/src/ROCKernels.jl:86
 [23] top-level scope
    @ /media/dhruvdh/External/Dropbox/documents/repositories/Llama2.jl/test.jl:44
in expression starting at /media/dhruvdh/External/Dropbox/documents/repositories/Llama2.jl/test.jl:38

The only change I made was using AMDGPU and changed backend = ROCBackend() or MetalBackend

vchuravy · 2024-03-08T15:28:00Z

  for i in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
    if li + i <= BLOCK_SIZE
      intermediateresults[li] += intermediateresults[li+i]
    end
    @synchronize()
  end

This is #330 I think. On the CPU this currently won't work.

vchuravy · 2024-03-08T15:28:54Z

(call to ijl_alloc_array_1d)

for i in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

Use a tuple () instead

DhruvDh · 2024-03-08T21:02:00Z

Thanks! That works but I am still getting inconsistent results on GPU. I'll close this issue in favor of #330 .

vchuravy · 2024-03-08T22:49:04Z

Hm on the GPU you shouldn't get inconsistent results, on the CPU yes.

DhruvDh · 2024-03-08T23:09:14Z

So, on AMDGPU, using the following snippet, here is what I get -

using Pkg;
Pkg.add("AMDGPU");
Pkg.add("KernelAbstractions");
Pkg.add("Atomix");

using AMDGPU
using KernelAbstractions
using Atomix: @atomic

@kernel function self_dot_kernel!(result, @Const(vec))
  BLOCK_SIZE = @groupsize()[1]

  I = @index(Global, Linear)
  li = @index(Local, Linear)

  intermediateresults = @localmem eltype(vec) BLOCK_SIZE

  if I <= length(vec)
    intermediateresults[li] = vec[I] * vec[I]
  else
    intermediateresults[li] = 0
  end

  @synchronize()
  s = 1
  while (li + s) <= BLOCK_SIZE
    intermediateresults[li] += intermediateresults[li+s]
    @synchronize()
    s *= 2
  end
  @synchronize()
  if li == 1
    @atomic result[1] += intermediateresults[1]
  end
end

backend = ROCBackend()

for T in [Float32, Float64, Int32, Int64]
  println("===== Testing $T =====")
  for dims in [100, 24354, 12323, 13454, 32512, 314, 235, 51234, 5312]
    x = KernelAbstractions.ones(backend, T, dims)
    dotresult = KernelAbstractions.zeros(backend, T, 1)

    for i in 1:6
      kernel! = self_dot_kernel!(backend, 32)

      kernel!(dotresult, x; ndrange=length(x))
      KernelAbstractions.synchronize(backend)
      println(dotresult)
      dotresult .= 0
    end
  end
end

output -

===== Testing Float32 =====
Float32[100.0]
Float32[98.20209]
Float32[100.00625]
Float32[98.097916]
Float32[99.902084]
Float32[100.0]
Float32[24354.0]
Float32[24452.0]
Float32[24452.0]
Float32[24452.0]
Float32[24452.0]
Float32[24452.0]
Float32[12420.0]
Float32[12420.0]
Float32[12420.0]
Float32[12420.0]
Float32[12420.0]
Float32[12420.0]
Float32[13488.0]
Float32[13488.0]
Float32[13488.0]
Float32[13488.0]
Float32[13488.0]
Float32[13488.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[324.0]
Float32[324.0]
Float32[324.0]
Float32[324.0]
Float32[324.0]
Float32[324.0]
Float32[292.0]
Float32[292.0]
Float32[292.0]
Float32[292.0]
Float32[292.0]
Float32[292.0]
Float32[51232.902]
Float32[51236.902]
Float32[51332.0]
Float32[51332.0]
Float32[51332.0]
Float32[51332.0]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
===== Testing Float64 =====
[4.1943056984375e8]
[100.0]
[99.9991455078125]
[7.909956653047252e22]
[100.0]
[100.00803324091248]
[24353.999145507812]
[24452.0]
[24452.0]
[24452.0]
[24452.0]
[24452.0]
[12420.0]
[12420.0]
[12420.0]
[12420.0]
[12420.0]
[12420.0]
[13488.0]
[13488.0]
[13488.0]
[13488.0]
[13488.0]
[13488.0]
[32512.0]
[32512.0]
[32512.0]
[32512.0]
[32512.0]
[32512.0]
[324.0]
[324.0]
[324.0]
[324.0]
[324.0]
[324.0]
[292.0]
[292.0]
[292.0]
[292.0]
[292.0]
[292.0]
[51234.0]
[51234.0]
[51332.0]
[51332.0]
[51332.0]
[51332.0]
[5312.0]
[5312.0]
[5312.0]
[5312.0]
[5312.0]
[5312.0]
===== Testing Int32 =====
Int32[100]
Int32[100]
Int32[100]
Int32[100]
Int32[100]
Int32[-18419886]
Int32[24354]
Int32[24452]
Int32[24452]
Int32[24452]
Int32[24452]
Int32[24452]
Int32[12420]
Int32[12420]
Int32[12420]
Int32[12420]
Int32[12420]
Int32[12420]
Int32[13488]
Int32[13488]
Int32[13488]
Int32[13488]
Int32[13488]
Int32[13488]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[324]
Int32[324]
Int32[324]
Int32[324]
Int32[324]
Int32[324]
Int32[292]
Int32[292]
Int32[292]
Int32[292]
Int32[292]
Int32[292]
Int32[51234]
Int32[2128985074]
Int32[51332]
Int32[51332]
Int32[51332]
Int32[51332]
Int32[5312]
Int32[5312]
Int32[5312]
Int32[5312]
Int32[5312]
Int32[5312]
===== Testing Int64 =====
[163208757388]
[4625196980518256780]
[4949667255626498188]
[4948752461952188556]
[163208757388]
[4625196980518256780]
[4811252051793174374]
[24452]
[24452]
[24452]
[24452]
[24452]
[12420]
[12420]
[12420]
[12420]
[12420]
[12420]
[13488]
[13488]
[13488]
[13488]
[13488]
[13488]
[32512]
[32512]
[32512]
[32512]
[32512]
[32512]
[324]
[324]
[324]
[324]
[324]
[324]
[292]
[292]
[292]
[292]
[292]
[292]
[279172925542]
[4625197096482424934]
[51332]
[51332]
[51332]
[51332]
[5312]
[5312]
[5312]
[5312]
[5312]
[5312]

On Metal, I get -

===== Testing Float32 =====
Float32[5.4712238f17]
Float32[2.7071126f17]
Float32[100.0]
Float32[100.0]
Float32[100.0]
Float32[100.0]
Float32[24354.0]
Float32[24354.0]
Float32[24354.0]
Float32[24428.0]
Float32[24444.0]
Float32[24420.0]
Float32[12323.0]
Float32[12323.0]
Float32[12323.0]
Float32[12323.0]
Float32[12398.0]
Float32[12400.0]
Float32[-Inf]
Float32[13454.0]
Float32[13454.0]
Float32[-Inf]
Float32[-Inf]
Float32[13488.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[32512.0]
Float32[314.0]
Float32[-Inf]
Float32[-Inf]
Float32[-Inf]
Float32[-Inf]
Float32[314.0]
Float32[235.0]
Float32[235.0]
Float32[235.0]
Float32[235.0]
Float32[235.0]
Float32[-Inf]
Float32[51234.0]
Float32[51234.0]
Float32[-Inf]
Float32[51234.016]
Float32[-Inf]
Float32[-Inf]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
Float32[5312.0]
===== Testing Int32 =====
Int32[-1677848871]
Int32[-1979950382]
Int32[1387148965]
Int32[-1337915697]
Int32[100]
Int32[100]
Int32[24354]
Int32[24428]
Int32[24424]
Int32[24424]
Int32[24432]
Int32[24424]
Int32[12400]
Int32[12418]
Int32[12400]
Int32[12323]
Int32[12323]
Int32[-1324293837]
Int32[13488]
Int32[-29856570]
Int32[-29856570]
Int32[13488]
Int32[13488]
Int32[13488]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[32512]
Int32[1342722370]
Int32[-29869710]
Int32[-29869710]
Int32[-29869710]
Int32[-29869710]
Int32[-29869710]
Int32[-44804801]
Int32[235]
Int32[235]
Int32[235]
Int32[235]
Int32[-44804801]
Int32[-59688814]
Int32[51312]
Int32[51332]
Int32[-59688814]
Int32[51324]
Int32[51332]
Int32[5312]
Int32[5312]
Int32[5312]
Int32[5312]
Int32[5312]
Int32[5312]

vchuravy · 2024-03-09T14:51:06Z

So the only thing that seems fishy to me is.

while (li + s) <= BLOCK_SIZE
    intermediateresults[li] += intermediateresults[li+s]
    @synchronize()
    s *= 2
  end

Do you not have a read-write race here?

E.g. li=n+1 may or may not have updated it's value when li=n accesses the value?

vchuravy · 2024-03-09T14:56:37Z

You are trying to do something similar to https://github.com/JuliaGPU/CUDA.jl/blob/cb14a637e0b7b7be9ae01005ea9bdcf79b320189/src/mapreduce.jl#L62 IIUC?

My long term goal is to provide primitives for these kinds of block reduces #234

So the difference I see that in your first iteration all Workitems in the block are active and are accessing their neighborhood data (and being racy about it).

In the CUDA code half the Workitems are active in the first iteration

DhruvDh · 2024-03-09T16:12:12Z

yes, thanks! There is overlap in how I was indexing which does not occur with the CUDA way. I can confirm it works -

  d = 1
  while d < BLOCK_SIZE
    @synchronize()
    if 2 * d * (li - 1) + li <= BLOCK_SIZE
      intermediateresults[li] += intermediateresults[2*d*(li-1)+li]
    end
    d *= 2
  end

DhruvDh · 2024-03-09T16:50:48Z

Some things still don't make sense to me -

This works -

  d = 1
  while d < BLOCK_SIZE
    @synchronize()
    if 2 * d * (li - 1) + li <= BLOCK_SIZE
      intermediateresults[li] += intermediateresults[2*d*(li-1)+li]
    end
    d *= 2
  end

This outputs only zeros -

  d = 1
  while d < BLOCK_SIZE
    @synchronize()
    lhsIndex = 2 * d * (li - 1) # or @private lhsIndex = 2 * d * (li - 1)
    if lhsIndex + li <= BLOCK_SIZE
      intermediateresults[lhsIndex] += intermediateresults[lhsIndex+li]
    end
    d *= 2
  end

Assigning variables outside the loop is fine but inside the loop is troublesome.

vchuravy · 2024-03-09T18:19:48Z

Are you back to testing on the CPU? I should add a pass that errors when synchronize is used inside control flow, KA current design simply can't support that.

DhruvDh · 2024-03-09T18:53:17Z

Sorry, I forgot to mention. My previous message was based on the Metal backend.
I tried it again on the AMD backend, and the full snippet follows, where it crashes.

# using Pkg;
# Pkg.add("AMDGPU");
# Pkg.add("KernelAbstractions");
# Pkg.add("Atomix");

using AMDGPU
using KernelAbstractions
using Atomix: @atomic

@kernel function self_dot_kernel!(result, @Const(vec))
  BLOCK_SIZE = @groupsize()[1]

  I = @index(Global, Linear)
  li = @index(Local, Linear)

  intermediateresults = @localmem eltype(vec) BLOCK_SIZE

  if I <= length(vec)
    intermediateresults[li] = vec[I] * vec[I]
  else
    intermediateresults[li] = 0
  end

  d = 1
  while d < BLOCK_SIZE
    @synchronize()
    lhsIndex = 2 * d * (li - 1) # or @private lhsIndex = 2 * d * (li - 1)
    if lhsIndex + li <= BLOCK_SIZE
      intermediateresults[lhsIndex] += intermediateresults[lhsIndex+li]
    end
    d *= 2
  end

  @synchronize()
  if li == 1
    @atomic result[1] += intermediateresults[1]
  end
end

backend = ROCBackend()

for T in [Float32, Float64, Int32, Int64]
  println("===== Testing $T =====")
  for dims in [100, 24354, 12323, 13454, 32512, 314, 235, 51234, 5312]
    x = KernelAbstractions.ones(backend, T, dims)
    dotresult = KernelAbstractions.zeros(backend, T, 1)

    for i in 1:6
      kernel! = self_dot_kernel!(backend, 32)

      kernel!(dotresult, x; ndrange=length(x))
      KernelAbstractions.synchronize(backend)
      println(dotresult)
      dotresult .= 0
    end
  end
end

===== Testing Float32 =====
ERROR: LoadError: GPU Kernel Exception
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] throw_if_exception(dev::HIPDevice)
   @ AMDGPU ~/.julia/packages/AMDGPU/gtxsf/src/exception_handler.jl:123
 [3] synchronize(stm::HIPStream; blocking::Bool, stop_hostcalls::Bool)
   @ AMDGPU ~/.julia/packages/AMDGPU/gtxsf/src/highlevel.jl:53
 [4] synchronize (repeats 2 times)
   @ ~/.julia/packages/AMDGPU/gtxsf/src/highlevel.jl:49 [inlined]
 [5] synchronize
   @ ~/.julia/packages/AMDGPU/gtxsf/src/ROCKernels.jl:29 [inlined]
 [6] top-level scope
   @ /media/dhruvdh/External/Dropbox/documents/repositories/Llama2.jl/amdgpu-test.jl:52
in expression starting at /media/dhruvdh/External/Dropbox/documents/repositories/Llama2.jl/amdgpu-test.jl:42

On Metal, the same kernel outputs only 0.0s

vchuravy · 2024-03-09T19:57:46Z

Are you not missing a +1 in lhsIndex = 2 * d * (li - 1)

For li=1 lhsindex will be 0 which is OOB

DhruvDh · 2024-03-09T20:17:48Z

No, actually, it was supposed to be intermediateresults[li] += intermediateresults[lhsIndex+li]. Sorry! and thanks for all your responses!

A pass that errors on when synchronize is used inside control flow would be great!

DhruvDh closed this as completed Mar 8, 2024

vchuravy reopened this Mar 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent results across platforms and datatypes on CPU backend #468

Inconsistent results across platforms and datatypes on CPU backend #468

DhruvDh commented Mar 8, 2024

DhruvDh commented Mar 8, 2024 •

edited

Loading

vchuravy commented Mar 8, 2024

vchuravy commented Mar 8, 2024

DhruvDh commented Mar 8, 2024

vchuravy commented Mar 8, 2024

DhruvDh commented Mar 8, 2024

vchuravy commented Mar 9, 2024

vchuravy commented Mar 9, 2024 •

edited

Loading

DhruvDh commented Mar 9, 2024

DhruvDh commented Mar 9, 2024

vchuravy commented Mar 9, 2024

DhruvDh commented Mar 9, 2024

vchuravy commented Mar 9, 2024

DhruvDh commented Mar 9, 2024

Inconsistent results across platforms and datatypes on CPU backend #468

Inconsistent results across platforms and datatypes on CPU backend #468

Comments

DhruvDh commented Mar 8, 2024

DhruvDh commented Mar 8, 2024 • edited Loading

vchuravy commented Mar 8, 2024

vchuravy commented Mar 8, 2024

DhruvDh commented Mar 8, 2024

vchuravy commented Mar 8, 2024

DhruvDh commented Mar 8, 2024

vchuravy commented Mar 9, 2024

vchuravy commented Mar 9, 2024 • edited Loading

DhruvDh commented Mar 9, 2024

DhruvDh commented Mar 9, 2024

vchuravy commented Mar 9, 2024

DhruvDh commented Mar 9, 2024

vchuravy commented Mar 9, 2024

DhruvDh commented Mar 9, 2024

DhruvDh commented Mar 8, 2024 •

edited

Loading

vchuravy commented Mar 9, 2024 •

edited

Loading