-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Debug New Motion #472
Debug New Motion #472
Conversation
The CI jobs here have killed our oneAPI CI server multiple times, each time consuming over 20GB of shared memory, so I'm temporarily disabling the JuliaGPU Buildkite integration for KomaMRI.jl. |
Hi @maleadt I just saw this. I'm very sorry about what happened. I have tried to tell our contributors to be mindful with the BuildKite CI, but by the amount of commits this is technically a DDoS atack. @pvillacorta please stop this immediately. You need to change how you approach pushing commits, and if you're testing something, you need to be a lot more strategic on what to try, as it is running on computers that the Julia team has kindly provided us (and Julia also uses). I will set up the BuilKite CI so it only runs if I approve it to avoid this happening again in the future. My sincere apologies. |
Thank you. The concrete issue was that this PR somehow create a huge amount of shared memory outside of the Julia process -- possible because of an excessive amount of printing happening from within kernels -- bypassing our normal OOM-prevention heuristics. In any case, we don't have the CI resources to support CI-driven development, so I would recommend debugging the issue locally first. Let me know when I can re-enable the pipeline, or feel free to do it yourself on the Buildkite Web UI if you have the permissions to do so. |
I think I am able to re-enable it myself. Thank you! Talking with @pvillacorta, it seems that debugging this locally would be hard, as it appears that we encountered an issue with oneAPI.jl. This seems to be a problem with function _unit_time(t::AbstractArray{T}, ts::TimeRange{T}) where {T<:Real}
if ts.t_start == ts.t_end
return (t .>= ts.t_start) .* oneunit(T)
else
# Metal.jl: Separating the min/max statements fixes the problem for now
# until https://github.com/JuliaGPU/GPUCompiler.jl/pull/602/files#diff-52d57b551d62b4dc18df904b53dfbbf149404dd5f9e748b0f417364f5d32966f
tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
# oneAPI.jl: The problem was fixed by adding an additional instruction here
# _ = sum(t)
# Could this be a GPUCompiler.jl problem?
return min.(tmp, oneunit(T))
end
end Pablo hasn't completely identified/isolated the problem, as he tells me that it is a little bit random when it happens, so he hasn't filed an issue. @maleadt Do you think using Thank you very much for all your help! |
@maleadt, I want to apologize for the inconvenience I've caused with my recent commits. It was never my intention to overload the server or create issues in the CI environment. I’m working on debugging this issue as best as I can, but I’ve encountered some challenges, as @cncastillo mentioned. I understand the importance of being more mindful of shared resources and I’m committed to changing my approach to ensure that I don’t cause further problems. I would appreciate any advice or guidance on how I could handle this situation more effectively. Once again, I’m very sorry for what happened and I'd like to collaborate on resolving this in the best possible way. Thank you for your understanding and help. |
Is the issue that you don't have the hardware? Many Intel CPUs come with an oneAPI-compatible IGP.
The function you quote there doesn't look like a kernel issue, so it's probably different. If this ends up dispatching to MKL, this may be JuliaGPU/oneAPI.jl#445; try setting the environment variable mentioned in that issue.
If this is a kernel function (which it doesn't look like), using |
From c71fee8, we can notice how tests fail even using const AdjointOneArray{T, N, M} = LinearAlgebra.Adjoint{T, oneArray{T, N, M}} where {T, N, M}
## Extend KomaMRIBase.unit_time (until bug with oneAPI is solved)
function KomaMRIBase.unit_time(t::AdjointOneArray{T, N, M}, ts::KomaMRIBase.TimeRange{T}) where {T, N, M}
if ts.t_start == ts.t_end
return (t .>= ts.t_start) .* oneunit(T)
else
tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
t = min.(tmp, oneunit(T))
KA.synchronize(KA.get_backend(t))
return t
end
end What I thought would solve the problem every time, was to use a somewhat heavier operation that would make use of the calculated variable t, such as a ## Extend KomaMRIBase.unit_time (until bug with oneAPI is solved)
function KomaMRIBase.unit_time(t::AdjointOneArray{T, N, M}, ts::KomaMRIBase.TimeRange{T}) where {T<:Real, N, M}
if ts.t_start == ts.t_end
return (t .>= ts.t_start) .* oneunit(T)
else
tmp = max.((t .- ts.t_start) ./ (ts.t_end - ts.t_start), zero(T))
t = min.(tmp, oneunit(T))
_ = sum(t) # Dummy sum
end
end As I thought that this sum could be a good temporary solution (until I could find the source of the problem), I included it in branch #442 (commit 51c71cc), but the test failed again for oneAPI. So I started to think that this function |
Just saw this message:
I will try on my laptop, as it has an embedded Intel GPU.
I will try it and let you know. Thank you! |
This PR is intended to debug the error of ArbitraryAction motion when testing in oneAPI.
I don't want to fill #442 with (more) unnecessary commits, so I'll upload them here and delete the PR when I find the error.