cuda runtime error (304) : OS call failed or operation not supported on this OS #31

nehap25 · 2021-04-30T00:16:52Z

When running the following example with Falkon, I run into a cuda runtime error.

Example:
```python
from sklearn import datasets, model_selection
import numpy as np
import torch
import falkon
from falkon.models import Falkon
from falkon.kernels import GaussianKernel
from falkon.options import FalkonOptions

Xtrain = np.random.randn(80000, 1536)
Xtest = np.random.randn(10000, 1536)

Ytrain = np.random.randn(80000, 20)
Ytest = np.random.randn(10000, 20)

Xtrain = torch.from_numpy(Xtrain)
Xtest = torch.from_numpy(Xtest)
Ytrain = torch.from_numpy(Ytrain)
Ytest = torch.from_numpy(Ytest)

print("X TRAIN SHAPE: ", Xtrain.shape, Ytrain.shape, "TEST SHAPES: ", Xtest.shape, Ytest.shape)

kernel = GaussianKernel(sigma=5)
flk = Falkon(kernel=kernel, penalty=1e-5, M=Xtrain.shape[0])

flk.fit(Xtrain, Ytrain)
```

Error:
```
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1616554827596/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=304 : OS call failed or operation not supported on this OS
Traceback (most recent call last):
  File "falkon_test.py", line 26, in <module>
    flk.fit(Xtrain, Ytrain)
  File "/home/nehap/anaconda3/envs/falkon/lib/python3.7/site-packages/falkon/models/falkon.py", line 197, in fit
    ny_points = ny_points.pin_memory()
RuntimeError: cuda runtime error (304) : OS call failed or operation not supported on this OS at /opt/conda/conda-bld/pytorch_1616554827596/work/aten/src/THC/THCCachingHostAllocator.cpp:278
```

Here is my .yml file:

```yaml
name: falkon
channels:
  - conda-forge
  - pytorch
  - anaconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - blas=1.0=mkl
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2020.10.14=0
  - certifi=2020.6.20=py37_0
  - cmake=3.18.2=ha30ef3c_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - expat=2.2.10=he6710b0_2
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.10.4=h5ab3b9f_0
  - gmp=6.2.1=h2531618_2
  - gnutls=3.6.15=he1e5248_0
  - intel-openmp=2020.2=254
  - joblib=1.0.1=pyhd8ed1ab_0
  - jpeg=9b=h024ee3a_2
  - krb5=1.18.2=h173b8e3_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.11=h396b838_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libblas=3.9.0=1_h6e990d7_netlib
  - libcblas=3.9.0=3_h893e4fe_netlib
  - libcurl=7.71.1=h20c2e04_1
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.5.0=h14aa051_19
  - libgfortran4=7.5.0=h14aa051_19
  - libiconv=1.15=h63c8f33_5
  - libidn2=2.3.0=h27cfd23_0
  - liblapack=3.9.0=3_h893e4fe_netlib
  - libpng=1.6.37=hbc83047_0
  - libssh2=1.9.0=h1ba5d50_1
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.1.0=h2733197_1
  - libunistring=0.9.10=h27cfd23_0
  - libuv=1.40.0=h7b6447c_0
  - lz4-c=1.9.2=heb0550a_3
  - mkl=2020.2=256
  - mkl-service=2.3.0=py37he8ac12f_0
  - mkl_fft=1.2.0=py37h23d657b_0
  - mkl_random=1.1.1=py37h0573a6f_0
  - ncurses=6.2=he6710b0_1
  - nettle=3.7.2=hbbd107a_1
  - ninja=1.10.2=py37hff7bd54_0
  - numpy=1.19.2=py37h54aff64_0
  - numpy-base=1.19.2=py37hfa32c7d_0
  - olefile=0.46=py_0
  - openh264=2.1.0=hd408876_0
  - openssl=1.1.1h=h7b6447c_0
  - pillow=8.0.1=py37he98fc37_0
  - pip=20.3.3=py37h06a4308_0
  - python=3.7.9=h7579374_0
  - python_abi=3.7=1_cp37m
  - pytorch=1.8.1=py3.7_cuda10.1_cudnn7.6.3_0
  - readline=8.0=h7b6447c_0
  - rhash=1.4.0=h1ba5d50_0
  - scikit-learn=0.23.2=py37hddcf8d6_3
  - scipy=1.5.3=py37h8911b10_0
  - setuptools=51.0.0=py37h06a4308_2
  - six=1.15.0=py37h06a4308_0
  - sqlite=3.33.0=h62c20be_0
  - threadpoolctl=2.1.0=pyh5ca1d4c_0
  - tk=8.6.10=hbc83047_0
  - torchaudio=0.8.1=py37
  - torchvision=0.9.1=py37_cu101
  - typing_extensions=3.7.4.3=py_0
  - wheel=0.36.2=pyhd3eb1b0_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.5=h9ceee32_0
  - pip:
    - falkon==0.6.3
    - psutil==5.8.0
    - pykeops==1.4.2
prefix: /home/nehap/anaconda3/envs/falkon
```

I'm currently using a single TITAN RTX GPU with 24 GB of memory, and the machine has 128 GB of RAM. The example works if we reduce the number of dimensions from 1536 to 20, but with higher-dimensional data it runs into this error. We would appreciate any help with this issue. Thank you!
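A rough size estimate helps explain why the dimensionality matters. The traceback shows the failure at `ny_points.pin_memory()`; assuming `ny_points` holds the M Nyström centers as a dense M × d matrix (an assumption about Falkon's internals, not confirmed in this thread), the buffer being pinned grows linearly with d. The helper `dense_bytes` is hypothetical and only does the arithmetic:

```python
def dense_bytes(rows: int, cols: int, bytes_per_elem: int) -> int:
    """Size in bytes of a dense rows x cols matrix."""
    return rows * cols * bytes_per_elem


BYTES = {"float64": 8, "float32": 4}  # numpy arrays default to float64
M = 80_000  # the script above uses M = Xtrain.shape[0]

for d in (20, 1536):
    for name, nbytes in BYTES.items():
        size_mb = dense_bytes(M, d, nbytes) / 1e6
        print(f"M={M}, d={d:4d}, {name}: {size_mb:7.1f} MB")
```

Under that assumption, the centers alone take about 983 MB in float64 (about 492 MB in float32) for d = 1536, against 12.8 MB for d = 20. Falkon may pin other buffers too, so treat this as a lower bound.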


Giodiro · 2021-05-03T08:08:45Z

Hi!
This seems to be a problem with not having enough pinnable memory. I'm not an expert on how exactly the OS determines the amount of pinnable memory, but from what I have observed it seems related to the amount of free RAM on your machine.
What OS are you on, and how much free RAM do you have when running the example?

I see a couple of other issues in your script, though:

- The number of centers (M) should be much lower than the number of points. Falkon's cost grows cubically with the number of centers, so it makes sense to start with a low M and gradually increase it until performance plateaus.
- If you generate your data with numpy it will be in float64 precision, and float64 data is processed very slowly by your GPU. An easy way to speed things up is to reduce the precision of your data (e.g. call `Xtrain = torch.from_numpy(Xtrain).to(dtype=torch.float32)`).
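To see why M = 80 000 is so costly, the sketch below estimates the memory of one dense M × M matrix and the relative work of an O(M³) step. It assumes the preconditioner needs at least one dense M × M matrix, which is consistent with the cubic scaling mentioned above; how many such buffers Falkon actually allocates is not stated in this thread. Both function names are illustrative only:

```python
def square_matrix_gb(m: int, bytes_per_elem: int) -> float:
    """Memory in GB (10**9 bytes) of a dense m x m matrix."""
    return m * m * bytes_per_elem / 1e9


def relative_cubic_cost(m_big: int, m_small: int) -> float:
    """How many times more work an O(M^3) step does for m_big than for m_small."""
    return (m_big / m_small) ** 3


for m in (400, 10_000, 80_000):
    print(f"M={m:6d}: {square_matrix_gb(m, 8):7.2f} GB float64, "
          f"{square_matrix_gb(m, 4):7.2f} GB float32")

print(f"80 000 vs 10 000 centers: {relative_cubic_cost(80_000, 10_000):.0f}x the cubic work")
```

At M = 80 000 a single float64 matrix would need about 51.2 GB, roughly twice the TITAN RTX's 24 GB, and even in float32 (25.6 GB) it would take about as much as the whole card.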

nehap25 · 2021-05-04T00:51:04Z

Thank you so much for your response! For the same example, after changing the precision to float32 I was only able to use up to 400 centers; anything more resulted in the same error as above. I then tried setting `pin_memory` to False in `falkon/preconditioner/flk_preconditioner.py`, `falkon/models/falkon.py`, and `falkon/mmv_ops/fmm_cuda.py`, which helped quite a bit: I was able to use 50K centers for the same example without running into the error. Here is also the output of `free` while the example was running:

```
              total        used        free      shared  buff/cache   available
Mem:      131944080    33383436    65672288      954712    32888356    96054764
Swap:        999420      998744         676
```
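The `free` numbers above are in kibibytes (the default unit of `free`), which makes them hard to read at a glance. The small parser below, a hypothetical helper for illustration, converts the `Mem:` and `Swap:` rows; with this output roughly 91.6 GiB of RAM was still available while swap was almost completely used.

```python
FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:      131944080    33383436    65672288      954712    32888356    96054764
Swap:        999420      998744         676
"""

KIB_PER_GIB = 1024 ** 2


def parse_free(text: str) -> dict:
    """Map each row label ('Mem', 'Swap') to {column name: value in KiB}."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = {}
    for line in lines[1:]:
        label, *values = line.split()
        # zip() stops early for the Swap row, which has only three columns
        rows[label.rstrip(":")] = dict(zip(header, map(int, values)))
    return rows


stats = parse_free(FREE_OUTPUT)
print(f"RAM total:     {stats['Mem']['total'] / KIB_PER_GIB:6.1f} GiB")
print(f"RAM available: {stats['Mem']['available'] / KIB_PER_GIB:6.1f} GiB")
swap = stats["Swap"]
print(f"Swap used:     {100 * swap['used'] / swap['total']:6.1f} %")
```

On a live Linux machine the same parser can be fed `subprocess.run(["free"], capture_output=True, text=True).stdout`.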

I was wondering if you had any other suggestions on how to deal with this issue.

Giodiro · 2021-05-05T05:32:48Z

Hi again, and sorry for the slow replies.

I cannot explain the fact that changing precision changes behaviour so drastically.
May I ask what operating system you are using?

In the short term, the fix you applied (disabling memory pinning) is fine! Just repeat the process of setting `pin_memory` to False in other places if you encounter the error again.
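Finding the remaining places by hand is tedious. A minimal sketch, assuming Falkon is installed as a regular Python package with its `.py` sources on disk, is to scan the package directory for the string `pin_memory`; the helper name `find_pin_memory_calls` is hypothetical:

```python
from pathlib import Path


def find_pin_memory_calls(package_dir):
    """Yield (path, line number, stripped line) for each .py line mentioning pin_memory."""
    for path in sorted(Path(package_dir).rglob("*.py")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if "pin_memory" in line:
                    yield path, lineno, line.strip()
```

For example, `find_pin_memory_calls(os.path.dirname(falkon.__file__))` lists the candidates in the installed package. Each hit still needs to be checked by hand, since not every mention is a call that pins memory, and edits to an installed package are lost on reinstall.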

In the long term, if it turns out that certain hardware or software configurations don't support pinning more than a small amount of RAM, I can wrap the calls in a try/catch so that the whole thing doesn't crash and instead falls back to unpinned memory.
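The try/catch fallback described above could look roughly like the sketch below. It is not Falkon's code: `pin_or_fallback` is a hypothetical helper that relies only on `torch.Tensor.pin_memory()` raising a `RuntimeError` on failure, as in the traceback in this issue. It uses duck typing, so it does not import torch itself.

```python
import warnings


def pin_or_fallback(tensor):
    """Try to pin a host tensor; return it unpinned if pinning fails.

    Expects an object with a .pin_memory() method, e.g. a CPU torch.Tensor.
    """
    try:
        return tensor.pin_memory()
    except RuntimeError as err:
        # e.g. "cuda runtime error (304) : OS call failed or operation not supported on this OS"
        warnings.warn(f"pin_memory() failed ({err}); continuing with pageable memory")
        return tensor
```

In the code path from the traceback, `ny_points = ny_points.pin_memory()` would become `ny_points = pin_or_fallback(ny_points)`. Host-to-GPU copies from pageable memory are generally slower and not truly asynchronous, which is why the sketch warns instead of falling back silently.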

Thanks for the bug report :)

P.S. While using float32 might not help with the pinning issue, you should find that it improves Falkon's running time considerably (once you get past the pinning problem).
