Randomly occurring segmentation fault #29

Open
RomeoV opened this issue Feb 8, 2023 · 2 comments

RomeoV commented Feb 8, 2023

Hello, thanks for the great work!
I'm currently trying to wrap a PyTorch model in a Flux-based training setup.
Training runs fine for a few epochs, but then a segmentation fault occurs, seemingly at random (see below).
I don't have a good MWE right now (I'll still try to make one), but perhaps we can already draw some conclusions from the stack trace, which in this case appeared after about seven epochs:

[56770] signal (11.1): Segmentation fault
in expression starting at /home/romeo/Documents/Stanford/google_ood/DisentanglingVAE.jl/scripts/vae_CUB.jl:213
PyErr_Occurred at /usr/lib/libpython3.10.so.1.0 (unknown line)
pyerr_occurred at /home/romeo/.julia/packages/PyCall/twYvK/src/exception.jl:69 [inlined]
pyerr_check at /home/romeo/.julia/packages/PyCall/twYvK/src/exception.jl:75 [inlined]
############# LOOK HERE vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
share at /home/romeo/.julia/packages/DLPack/SUhao/src/pycall.jl:109
#13 at /home/romeo/.julia/packages/PyCallChainRules/YR5iR/src/pytorch.jl:59
#########################^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
unknown function (ip: 0x7ff3e1725d52)
map at ./tuple.jl:292
unknown function (ip: 0x7ff3e1723e23)
_jl_invoke at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2681 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2863
#rrule#12 at /home/romeo/.julia/packages/PyCallChainRules/YR5iR/src/pytorch.jl:59
rrule at /home/romeo/.julia/packages/PyCallChainRules/YR5iR/src/pytorch.jl:56 [inlined]
rrule at /home/romeo/.julia/packages/ChainRulesCore/a4mIA/src/rules.jl:134 [inlined]
chain_rrule at /home/romeo/.julia/packages/Zygote/xGkZ5/src/compiler/chainrules.jl:218 [inlined]
macro expansion at /home/romeo/.julia/packages/Zygote/xGkZ5/src/compiler/interface2.jl:0 [inlined]
_pullback at /home/romeo/.julia/packages/Zygote/xGkZ5/src/compiler/interface2.jl:9
unknown function (ip: 0x7ff3e1723a4d)
_jl_invoke at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2681 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2863
_pullback at /home/romeo/Documents/Stanford/google_ood/DisentanglingVAE.jl/scripts/vae_CUB.jl:166 [inlined]

Here are the referenced code snippets in the stacktrace:

function ChainRulesCore.rrule(wrap::TorchModuleWrapper, args...; kwargs...)
    T = typeof(first(wrap.params))
    params = wrap.params
    pyparams = Tuple(map(x -> DLPack.share(x, PyObject, pyfrom_dlpack).requires_grad_(true), params))
    pyargs = fmap(x -> DLPack.share(x, PyObject, pyfrom_dlpack).requires_grad_(true), args)
    torch_primal, torch_vjpfun = functorch.vjp(py"buffer_implicit"(wrap.torch_stateless_module, wrap.buffers), pyparams, pyargs...; kwargs...)
    project = ProjectTo(args)
    function TorchModuleWrapper_pullback(Δ)

and
https://github.com/pabloferz/DLPack.jl/blob/61f48ee6b5e4f56d9b8525fa6ef9b613242160b8/src/pycall.jl#L98-L116

RomeoV commented Feb 8, 2023

Here is a GitHub gist that reproduces the error:
https://gist.github.com/RomeoV/ca397a6b883c1cf567f2503d135084d8

The setup is loosely based on the VAE tutorial in the FastAI docs.

rejuvyesh (Owner) commented

Given that DLPack and garbage collection are involved, this could very well be related to #24 (the interaction between the Julia GC and the Python GC). Which versions of PyTorch/functorch are you using? Would it be possible to check how your setup behaves with the PyNNTraining implementation: https://github.com/lorenzoh/PyNNTraining.jl/blob/e02bf899ce7228090a60286b8373fb87bfa5b6b1/src/topytorch.jl#L34
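
One way to probe the GC hypothesis might be to force a Julia collection after every step, so that a lifetime problem shows up within a few iterations instead of after several epochs. A minimal sketch, assuming a hypothetical `step` callback that stands in for one training iteration (`stress_gc` is an illustrative name, not part of any package mentioned here):

```julia
# Hypothetical helper: call `step` repeatedly and force a full Julia GC pass
# after every call, so that objects freed too early are hit sooner.
function stress_gc(step; iterations::Integer = 100)
    for i in 1:iterations
        step(i)
        GC.gc(true)   # request a full collection
    end
    return iterations
end

# Stand-in for a real training step; replace the closure with the actual
# gradient computation that goes through the wrapped module.
calls = Ref(0)
stress_gc(i -> (calls[] += 1); iterations = 5)
```

If the segfault then appears much earlier, that would be consistent with an object being collected while the other runtime still holds a reference to its memory. It would not prove it.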
