
Debug New Motion #472

Closed · pvillacorta wants to merge 187 commits

Conversation

pvillacorta (Collaborator)

This PR is intended to debug the ArbitraryAction motion error that shows up when testing on oneAPI.
I don't want to fill #442 with (more) unnecessary commits, so I'll push them here and delete this PR once I find the error.

maleadt commented Sep 4, 2024

The CI jobs here have killed our oneAPI CI server multiple times, each time consuming over 20GB of shared memory, so I'm temporarily disabling the JuliaGPU Buildkite integration for KomaMRI.jl.

cncastillo (Member) commented Sep 4, 2024

Hi @maleadt, I just saw this. I'm very sorry about what happened. I have tried to tell our contributors to be mindful with the Buildkite CI, but given the number of commits, this is effectively a DDoS attack.

@pvillacorta please stop this immediately. You need to change how you approach pushing commits, and if you're testing something, you need to be a lot more strategic about what to try, as the CI runs on computers that the Julia team has kindly provided to us (and that Julia itself also uses).

I will set up the Buildkite CI so it only runs after my approval, to avoid this happening again in the future.

My sincere apologies.

maleadt commented Sep 4, 2024

Thank you. The concrete issue was that this PR somehow created a huge amount of shared memory outside of the Julia process -- possibly because of an excessive amount of printing from within kernels -- bypassing our normal OOM-prevention heuristics. In any case, we don't have the CI resources to support CI-driven development, so I would recommend debugging the issue locally first.

Let me know when I can re-enable the pipeline, or feel free to do it yourself on the Buildkite Web UI if you have the permissions to do so.

cncastillo (Member) commented Sep 4, 2024

I think I am able to re-enable it myself. Thank you!

Talking with @pvillacorta, it seems that debugging this locally would be hard, as it appears that we encountered an issue with oneAPI.jl. This seems to be a max/min problem like the one we had on Metal.jl (JuliaGPU/Metal.jl#389, https://github.com/JuliaGPU/GPUCompiler.jl/pull/602/files#diff-52d57b551d62b4dc18df904b53dfbbf149404dd5f9e748b0f417364f5d32966f). On oneAPI.jl the symptoms are different: instead of being wrong every time, as on Metal.jl, the operation sometimes gives incorrect results (approximately 2 out of 10 runs).

function _unit_time(t::AbstractArray{T}, ts::TimeRange{T}) where {T<:Real}
    if ts.t_start == ts.t_end
        return (t .>= ts.t_start) .* oneunit(T)
    else
        # Metal.jl: Separating the min/max statements fixes the problem for now
        # until https://github.com/JuliaGPU/GPUCompiler.jl/pull/602/files#diff-52d57b551d62b4dc18df904b53dfbbf149404dd5f9e748b0f417364f5d32966f
        tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
        # oneAPI.jl: The problem was fixed by adding an additional instruction here
        # _ = sum(t)
        # Could this be a GPUCompiler.jl problem?
        return min.(tmp, oneunit(T))
    end
end

Pablo hasn't completely identified/isolated the problem yet; he tells me it happens somewhat randomly, so he hasn't filed an issue.

@maleadt Do you think using ifelse instead of min/max could be better? Would you happen to have any recommendations for what to do?

Thank you very much for all your help!

pvillacorta (Collaborator, Author) commented Sep 4, 2024

@maleadt, I want to apologize for the inconvenience I've caused with my recent commits. It was never my intention to overload the server or create issues in the CI environment. I’m working on debugging this issue as best as I can, but I’ve encountered some challenges, as @cncastillo mentioned.

I understand the importance of being more mindful of shared resources and I’m committed to changing my approach to ensure that I don’t cause further problems. I would appreciate any advice or guidance on how I could handle this situation more effectively. Once again, I’m very sorry for what happened and I'd like to collaborate on resolving this in the best possible way.

Thank you for your understanding and help.

maleadt commented Sep 5, 2024

debugging this locally would be hard, as it appears that we encountered an issue with oneAPI.jl.

Is the issue that you don't have the hardware? Many Intel CPUs come with a oneAPI-compatible IGP.

On oneAPI.jl the symptoms are different: instead of being wrong every time, as on Metal.jl, the operation sometimes gives incorrect results (approximately 2 out of 10 runs).

The function you quote there doesn't look like a kernel issue, so it's probably different. If this ends up dispatching to MKL, this may be JuliaGPU/oneAPI.jl#445; try setting the environment variable mentioned in that issue.

Do you think using ifelse instead of min/max could be better?

If this is a kernel function (which it doesn't look like), using ifelse reduces branching which may help.
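
For what it's worth, a minimal sketch of what an ifelse-based variant of the function quoted above could look like follows; this is purely illustrative and not the fix adopted in this PR, and the plain scalar t_start/t_end arguments are a simplification of the KomaMRIBase.TimeRange struct:

# Illustrative sketch only (not the fix adopted in this PR): an ifelse-based
# variant of the normalization in _unit_time, with the TimeRange argument
# simplified to plain scalars.
function _unit_time_ifelse(t::AbstractArray{T}, t_start::T, t_end::T) where {T<:Real}
    if t_start == t_end
        return (t .>= t_start) .* oneunit(T)
    else
        x = (t .- t_start) ./ (t_end - t_start)
        # Clamp each element to [0, 1] without separate min./max. broadcasts
        return ifelse.(x .< zero(T), zero(T), ifelse.(x .> oneunit(T), oneunit(T), x))
    end
end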

pvillacorta (Collaborator, Author)

Starting from c71fee8, we can see that the tests fail even when calling KernelAbstractions.synchronize after computing the time array:

# Imports needed to run this snippet on its own: LinearAlgebra (Adjoint),
# oneAPI (oneArray), KomaMRIBase (unit_time, TimeRange) and KernelAbstractions (KA)
using LinearAlgebra, oneAPI, KomaMRIBase
import KernelAbstractions as KA

const AdjointOneArray{T, N, M} = LinearAlgebra.Adjoint{T, oneArray{T, N, M}} where {T, N, M}
## Extend KomaMRIBase.unit_time (until bug with oneAPI is solved)
function KomaMRIBase.unit_time(t::AdjointOneArray{T, N, M}, ts::KomaMRIBase.TimeRange{T}) where {T, N, M}
    if ts.t_start == ts.t_end
        return (t .>= ts.t_start) .* oneunit(T)
    else
        tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
        t = min.(tmp, oneunit(T))
        KA.synchronize(KA.get_backend(t))
        return t
    end
end

What I thought would solve the problem every time was to use a somewhat heavier operation that makes use of the computed variable t, such as a sum (see 97048f0). With that change the tests pass 8/8 times, but it may be a coincidence:

## Extend KomaMRIBase.unit_time (until bug with oneAPI is solved)
function KomaMRIBase.unit_time(t::AdjointOneArray{T, N, M}, ts::KomaMRIBase.TimeRange{T}) where {T<:Real, N, M}
    if ts.t_start == ts.t_end
        return (t .>= ts.t_start) .* oneunit(T)
    else
        tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
        t = min.(tmp, oneunit(T))
        _ = sum(t) # Dummy sum
        return t
    end
end

Since I thought this sum could be a good temporary workaround (until I found the source of the problem), I included it in #442 (commit 51c71cc), but the tests failed again on oneAPI.

So I started to think that this unit_time function was not the source of the problem, and I tried to isolate and reproduce it (commit 3e5b845 onwards), also including Interpolations.jl, to see whether it was related to that package. I did not manage to reproduce it, though, as all tests were passing at that point.
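
For reference, a minimal, self-contained sketch of how this min./max. broadcast could be exercised in isolation (assuming oneAPI.jl and a oneAPI-compatible Intel GPU are available; the array size, time range, and number of repetitions are arbitrary choices for illustration) might look like this:

using oneAPI  # assumes a oneAPI-compatible Intel (i)GPU is available

# Same normalization as in unit_time, stripped of the KomaMRI types
gpu_norm(t, t0, t1) = min.(max.((t .- t0) ./ (t1 - t0), 0f0), 1f0)

t_cpu    = collect(range(-0.5f0, 1.5f0; length = 1024))
t_gpu    = oneArray(t_cpu)
expected = clamp.((t_cpu .- 0f0) ./ (1f0 - 0f0), 0f0, 1f0)

# The failure was reported as intermittent (~2/10 runs), so repeat several times
for i in 1:10
    got = Array(gpu_norm(t_gpu, 0f0, 1f0))
    println("run $i: ", got ≈ expected ? "ok" : "MISMATCH")
end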

pvillacorta (Collaborator, Author) commented Sep 5, 2024

Just saw this message:

Is the issue that you don't have the hardware? Many Intel CPUs come with a oneAPI-compatible IGP.

I will try on my laptop, as it has an embedded Intel GPU.

The function you quote there doesn't look like a kernel issue, so it's probably different. If this ends up dispatching to MKL, this may be JuliaGPU/oneAPI.jl#445; try setting the environment variable mentioned in that issue.

I will try it and let you know.

Thank you!

pvillacorta deleted the new-motion-debug branch on September 11, 2024.