
Debug New Motion #472

Closed · pvillacorta wants to merge 187 commits

Conversation

pvillacorta (Collaborator)

This PR is intended to debug the ArbitraryAction motion error that shows up when testing on oneAPI.
I don't want to fill #442 with (more) unnecessary commits, so I'll push them here and delete this PR once I find the error.

maleadt commented Sep 4, 2024

The CI jobs here have killed our oneAPI CI server multiple times, each time consuming over 20GB of shared memory, so I'm temporarily disabling the JuliaGPU Buildkite integration for KomaMRI.jl.

cncastillo (Member) commented Sep 4, 2024

Hi @maleadt, I just saw this. I'm very sorry about what happened. I have tried to tell our contributors to be mindful with the Buildkite CI, but given the number of commits, this is effectively a DDoS attack.

@pvillacorta please stop this immediately. You need to change how you approach pushing commits, and if you're testing something, you need to be a lot more strategic about what to try, as the CI runs on computers that the Julia team has kindly provided to us (and that Julia itself also uses).

I will set up the Buildkite CI so it only runs after my approval, to avoid this happening again in the future.

My sincere apologies.

maleadt commented Sep 4, 2024

Thank you. The concrete issue was that this PR somehow created a huge amount of shared memory outside of the Julia process -- possibly because of an excessive amount of printing from within kernels -- bypassing our normal OOM-prevention heuristics. In any case, we don't have the CI resources to support CI-driven development, so I would recommend debugging the issue locally first.

Let me know when I can re-enable the pipeline, or feel free to do it yourself on the Buildkite Web UI if you have the permissions to do so.

cncastillo (Member) commented Sep 4, 2024

I think I am able to re-enable it myself. Thank you!

Talking with @pvillacorta, it seems that debugging this locally would be hard, as it appears that we encountered an issue with oneAPI.jl. This seems to be a max/min problem like the one we had on Metal.jl (JuliaGPU/Metal.jl#389, https://github.com/JuliaGPU/GPUCompiler.jl/pull/602/files#diff-52d57b551d62b4dc18df904b53dfbbf149404dd5f9e748b0f417364f5d32966f). On oneAPI.jl the symptoms are different: instead of being wrong every time, as on Metal.jl, the operation sometimes gives incorrect results (approximately 2 out of 10 runs).

function _unit_time(t::AbstractArray{T}, ts::TimeRange{T}) where {T<:Real}
    if ts.t_start == ts.t_end
        return (t .>= ts.t_start) .* oneunit(T)
    else
        # Metal.jl: Separating the min/max statements fixes the problem for now
        # until https://github.com/JuliaGPU/GPUCompiler.jl/pull/602/files#diff-52d57b551d62b4dc18df904b53dfbbf149404dd5f9e748b0f417364f5d32966f
        tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
        # oneAPI.jl: The problem was fixed by adding an additional instruction here
        # _ = sum(t)
        # Could this be a GPUCompiler.jl problem?
        return min.(tmp, oneunit(T))
    end
end

Pablo hasn't completely identified/isolated the problem yet; he tells me it happens somewhat randomly, so he hasn't filed an issue.

@maleadt Do you think using ifelse instead of min/max could be better? Would you happen to have any recommendations for what to do?

Thank you very much for all your help!

pvillacorta (Collaborator, Author) commented Sep 4, 2024

@maleadt, I want to apologize for the inconvenience I've caused with my recent commits. It was never my intention to overload the server or create issues in the CI environment. I’m working on debugging this issue as best as I can, but I’ve encountered some challenges, as @cncastillo mentioned.

I understand the importance of being more mindful of shared resources and I’m committed to changing my approach to ensure that I don’t cause further problems. I would appreciate any advice or guidance on how I could handle this situation more effectively. Once again, I’m very sorry for what happened and I'd like to collaborate on resolving this in the best possible way.

Thank you for your understanding and help.

maleadt commented Sep 5, 2024

debugging this locally would be hard, as it appears that we encountered an issue with oneAPI.jl.

Is the issue that you don't have the hardware? Many Intel CPUs come with a oneAPI-compatible IGP.

On oneAPI.jl the symptoms are different: instead of being wrong every time, as on Metal.jl, the operation sometimes gives incorrect results (approximately 2 out of 10 runs).

The function you quote there doesn't look like a kernel issue, so it's probably different. If this ends up dispatching to MKL, this may be JuliaGPU/oneAPI.jl#445; try setting the environment variable mentioned in that issue.

Do you think using ifelse instead of min/max could be better?

If this is a kernel function (which it doesn't look like), using ifelse reduces branching which may help.
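
For what it's worth, a minimal sketch of what an ifelse-based variant of the function quoted above could look like follows; this is purely illustrative and not the fix adopted in this PR, and the plain scalar t_start/t_end arguments are a simplification of the KomaMRIBase.TimeRange struct:

# Illustrative sketch only (not the fix adopted in this PR): an ifelse-based
# variant of the normalization in _unit_time, with the TimeRange argument
# simplified to plain scalars.
function _unit_time_ifelse(t::AbstractArray{T}, t_start::T, t_end::T) where {T<:Real}
    if t_start == t_end
        return (t .>= t_start) .* oneunit(T)
    else
        x = (t .- t_start) ./ (t_end - t_start)
        # Clamp each element to [0, 1] without separate min./max. broadcasts
        return ifelse.(x .< zero(T), zero(T), ifelse.(x .> oneunit(T), oneunit(T), x))
    end
end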

pvillacorta (Collaborator, Author)

Starting from c71fee8, we can see that the tests fail even when calling KernelAbstractions.synchronize after computing the time array:

# Imports needed to run this snippet on its own: LinearAlgebra (Adjoint),
# oneAPI (oneArray), KomaMRIBase (unit_time, TimeRange) and KernelAbstractions (KA)
using LinearAlgebra, oneAPI, KomaMRIBase
import KernelAbstractions as KA

const AdjointOneArray{T, N, M} = LinearAlgebra.Adjoint{T, oneArray{T, N, M}} where {T, N, M}
## Extend KomaMRIBase.unit_time (until bug with oneAPI is solved)
function KomaMRIBase.unit_time(t::AdjointOneArray{T, N, M}, ts::KomaMRIBase.TimeRange{T}) where {T, N, M}
    if ts.t_start == ts.t_end
        return (t .>= ts.t_start) .* oneunit(T)
    else
        tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
        t = min.(tmp, oneunit(T))
        KA.synchronize(KA.get_backend(t))
        return t
    end
end

What I thought would solve the problem every time was to use a somewhat heavier operation that makes use of the computed variable t, such as a sum (see 97048f0). With that change the tests pass 8/8 times, but it may be a coincidence:

## Extend KomaMRIBase.unit_time (until bug with oneAPI is solved)
function KomaMRIBase.unit_time(t::AdjointOneArray{T, N, M}, ts::KomaMRIBase.TimeRange{T}) where {T<:Real, N, M}
    if ts.t_start == ts.t_end
        return (t .>= ts.t_start) .* oneunit(T)
    else
        tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
        t = min.(tmp, oneunit(T))
        _ = sum(t) # Dummy sum
        return t
    end
end

Since I thought this sum could be a good temporary workaround (until I found the source of the problem), I included it in #442 (commit 51c71cc), but the tests failed again on oneAPI.

So I started to think that this unit_time function was not the source of the problem, and I tried to isolate and reproduce it (commit 3e5b845 onwards), also including Interpolations.jl, to see whether it was related to that package. I did not manage to reproduce it, though, as all tests were passing at that point.
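
For reference, a minimal, self-contained sketch of how this min./max. broadcast could be exercised in isolation (assuming oneAPI.jl and a oneAPI-compatible Intel GPU are available; the array size, time range, and number of repetitions are arbitrary choices for illustration) might look like this:

using oneAPI  # assumes a oneAPI-compatible Intel (i)GPU is available

# Same normalization as in unit_time, stripped of the KomaMRI types
gpu_norm(t, t0, t1) = min.(max.((t .- t0) ./ (t1 - t0), 0f0), 1f0)

t_cpu    = collect(range(-0.5f0, 1.5f0; length = 1024))
t_gpu    = oneArray(t_cpu)
expected = clamp.((t_cpu .- 0f0) ./ (1f0 - 0f0), 0f0, 1f0)

# The failure was reported as intermittent (~2/10 runs), so repeat several times
for i in 1:10
    got = Array(gpu_norm(t_gpu, 0f0, 1f0))
    println("run $i: ", got ≈ expected ? "ok" : "MISMATCH")
end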

pvillacorta (Collaborator, Author) commented Sep 5, 2024

Just saw this message:

Is the issue that you don't have the hardware? Many Intel CPUs come with a oneAPI-compatible IGP.

I will try on my laptop, as it has an embedded Intel GPU.

The function you quote there doesn't look like a kernel issue, so it's probably different. If this ends up dispatching to MKL, this may be JuliaGPU/oneAPI.jl#445; try setting the environment variable mentioned in that issue.

I will try it and let you know.

Thank you!

pvillacorta deleted the new-motion-debug branch on September 11, 2024.