
Running with 3 or more GPUs #53

Open
ahabedsoltan opened this issue Nov 7, 2022 · 1 comment

Hi,
I tried to run FALKON with 3 GPUs but I got the following error:

```
Traceback (most recent call last):
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/utils/threading.py", line 15, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/"user"//.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 138, in mmv_run_starter
    return mmv_run_thread(X1, X2, v, out, kernel, blk_n, blk_m, mem_needed, dev, tid=proc_idx)
  File "/home/"user"//.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 251, in mmv_run_thread
    flat_gpu = torch.empty(size=(mem_needed,), dtype=m1.dtype, device=dev)
RuntimeError: CUDA out of memory. Tried to allocate 21.00 GiB (GPU 0; 31.75 GiB total capacity; 5.57 GiB already allocated; 20.88 GiB free; 9.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/"user"/research/knotty/run/main.py", line 38, in <module>
    alpha, acc_valid_ep3,nystrom_samples,knots_x,acc_ep2_test= run(**args,wandb_run=wandb_run)
  File "/home/"user"/research/knotty/run/run.py", line 225, in run
    Falkon_loss, accu_falkon = falkon_run(dataset, kernel_fn, options, p=num_knots, epochs=20,
  File "/home/"user"/research/knotty/run/run.py", line 34, in falkon_run
    flk.fit(x_train, y_train)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/models/falkon.py", line 264, in fit
    beta = optim.solve(
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/optim/conjgrad.py", line 310, in solve
    B = self.kernel.mmv(M, X, y_over_n, opt=self.params)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/kernels/kernel.py", line 266, in mmv
    return mmv_impl(X1, X2, v, self, out, params)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 734, in fmmv
    return KernelMmvFnFull.apply(kernel, opt, out, X1, X2, v, *kernel.diff_params.values())
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 695, in forward
    KernelMmvFnFull.run_cpu_gpu(X1, X2, v, out, kernel, opt, False)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 641, in run_cpu_gpu
    outputs = _start_wait_processes(mmv_run_starter, args)
  File "/home/"user"/conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/utils.py", line 59, in _start_wait_processes
    outputs.append(p.join())
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/utils/threading.py", line 22, in join
    raise RuntimeError('Exception in thread %s' % (self.name)) from self.exc
RuntimeError: Exception in thread GPU-0
```
It works fine with 1 or 2 GPUs. I was also wondering whether using 3 or more GPUs could make FALKON even faster?

Thank you for your help.

@Giodiro
Contributor

Giodiro commented Dec 27, 2022

Hi @ahabedsoltan! Unfortunately the code that splits the dataset into blocks can behave in unexpected ways, and this sometimes depends on factors such as the number of GPUs.

I just introduced an option, memory_slack, which changes the behavior of the heuristic that splits the data into blocks. By default it is set to 0.9, which means the block size is computed assuming 90% of the available GPU RAM can be used.
You could try reducing it to e.g. 0.7, and the out-of-memory errors should go away.
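
For anyone hitting the same error, here is a minimal sketch of what the workaround could look like, assuming `memory_slack` is exposed as a `FalkonOptions` field in a falkon version that includes this change (older releases will not accept the keyword); the kernel choice, `penalty`, `M` and the random data below are illustrative placeholders, not part of the original report:

```python
# Sketch only: `memory_slack` is assumed to be a FalkonOptions field in a
# falkon version that includes the change described above. Kernel, penalty,
# M and the random data are illustrative placeholders.
import torch
import falkon

options = falkon.FalkonOptions(
    memory_slack=0.7,  # split blocks assuming ~70% of free GPU RAM instead of the 0.9 default
)

kernel = falkon.kernels.GaussianKernel(sigma=5.0, opt=options)
flk = falkon.Falkon(kernel=kernel, penalty=1e-6, M=10_000, options=options)

# Dummy (n, d) inputs and (n, 1) targets, just to make the sketch self-contained.
x_train, y_train = torch.randn(50_000, 10), torch.randn(50_000, 1)
flk.fit(x_train, y_train)
```

If `FalkonOptions` rejects the keyword, the installed version likely predates the option, so upgrading falkon first would be needed.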
