Memory error solved by emptying CUDA cache #37
Hi @theophilec! Thanks for the bug report! In trying to reproduce the OOM I set up an environment similar to yours (the main differences are the GPU model and CUDA 11.1 instead of 11.0), but I cannot seem to trigger the out-of-memory error. If possible, could you try to run the same code with a few more debugging statements added in? Something like this would be helpful:

import torch
from falkon.kernels import LinearKernel
from falkon import Falkon, FalkonOptions

n = 50000
d = 51000
l = 10
X = torch.randn((n, d))
y = torch.nn.functional.one_hot(torch.randint(0, l, (n,))).float()

opt = FalkonOptions(debug=True)
sigma = 1
penalties = [1e-4, 1e-5]

for i in range(2):
    print(f"Fitting {i}")
    kernel = LinearKernel(sigma=sigma, opt=opt)
    model = Falkon(
        kernel=kernel,
        penalty=penalties[i],
        M=40000,
        maxiter=10,
        seed=0,
        options=opt,
    )
    model.fit(X, y)
    predictions = model.predict(X)
    print(f"Memory used after fit {i}")
    for device in range(torch.cuda.device_count()):
        print(torch.cuda.memory_summary(device))
    # torch.cuda.empty_cache()
    # If the line above is commented, FALKON induces a CUDA Out of Memory error.
Sounds good @Giodiro! Here is the result of the execution:

PS: if I empty the cache and then show the summaries again, the ~12 GB of reserved memory is freed.
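A minimal sketch of how this can be checked: torch.cuda.memory_allocated() tracks the memory held by live tensors, while torch.cuda.memory_reserved() tracks what the caching allocator has claimed from the driver, and empty_cache() can only return the reserved-but-unused part. The report helper below is purely for illustration:

import torch

def report(tag):
    # memory_allocated: bytes currently backing live tensors
    # memory_reserved: bytes held by PyTorch's caching allocator
    alloc_gib = torch.cuda.memory_allocated() / 2**30
    reserved_gib = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

report("before empty_cache")
torch.cuda.empty_cache()  # frees cached blocks that are not backing any live tensor
report("after empty_cache")

Run right after a fit, this should show reserved memory dropping (the ~12 GB mentioned above) while allocated memory stays the same.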
Okay, so it seems that in reality things are getting freed correctly (note that allocated memory is 0 at the end of fit). I think the error message gives a clue in the amount of free memory it reports on the GPU.
So it could be possible that the large allocation performed on the first call to fit leaves the cached memory fragmented, so it cannot be reused by the second call. There is a PyTorch issue which is relevant: pytorch/pytorch#35901 and a related pull-request pytorch/pytorch#44742, but I'm not sure whether it will be merged in the next release or not! To debug this further and make sure that the interpretation I'm giving is correct, I checked the non-releasable memory just before the allocation which fails (falkon/falkon/mmv_ops/fmm_cuda.py, line 144 in 4bae112). I'm also trying to play a bit with the data sizes to see if I can manage to cause a crash on my GPU as well, but haven't managed yet. Not sure if you have any ideas on how to reduce fragmentation, but I think the cleanest / easiest course of action until the PyTorch oversize-blocks patch lands is to empty the PyTorch allocator after each iteration (which is what you've been doing!). I'm sorry this is a bit of a disappointing fix!
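A short sketch of that workaround, reusing the names from the snippet above and assuming the "non-releasable" figure in memory_summary corresponds to the allocator's inactive_split_bytes statistic:

for i in range(2):
    kernel = LinearKernel(sigma=sigma, opt=opt)
    model = Falkon(kernel=kernel, penalty=penalties[i], M=40000,
                   maxiter=10, seed=0, options=opt)
    model.fit(X, y)
    predictions = model.predict(X)
    # Inspect fragmentation: "non-releasable" memory is tracked by the caching
    # allocator as inactive split blocks.
    stats = torch.cuda.memory_stats()
    print("non-releasable bytes:", stats["inactive_split_bytes.all.current"])
    # Workaround until the PyTorch allocator patch lands: hand all cached blocks
    # back to the driver so the next fit starts from an empty cache.
    torch.cuda.empty_cache()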
This makes sense, thanks. Indeed the PyTorch issue seems relevant. Do you know why I wasn't observing non-releasable memory in my memory summaries? No problem in any case: the current state of affairs works for me for now. Hopefully they will merge the PR. :)
Hi FALKON team!
While using Falkon, I stumbled on what looks like a memory bug in the library.
Code to reproduce
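A sketch of the reproduction code, reconstructed from the annotated snippet in the first comment above (minus the debugging prints):

import torch
from falkon.kernels import LinearKernel
from falkon import Falkon, FalkonOptions

n = 50000
d = 51000  # reducing this to 30000 makes the error go away
l = 10
X = torch.randn((n, d))
y = torch.nn.functional.one_hot(torch.randint(0, l, (n,))).float()

opt = FalkonOptions(debug=True)
sigma = 1
penalties = [1e-4, 1e-5]

for i in range(2):
    kernel = LinearKernel(sigma=sigma, opt=opt)
    model = Falkon(kernel=kernel, penalty=penalties[i], M=40000,
                   maxiter=10, seed=0, options=opt)
    model.fit(X, y)
    predictions = model.predict(X)
    # torch.cuda.empty_cache()
    # If the line above is commented out, the second fit raises a CUDA
    # out-of-memory error.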
Expected behavior
Fit different models, no errors.
Actual behavior
In the above code, a CUDA out-of-memory error occurs during the second fit. Uncommenting the torch.cuda.empty_cache() line eliminates the issue. Changing d = 51000 to d = 30000 also eliminates the issue.
Environment
pytorch 1.9 (installed with pip)
Let me know if I can provide any further information or assistance in fixing the issue! Thanks!