Hi,
I tried to run FALKON with 3 GPUs but I got the following error:
```
Traceback (most recent call last):
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/utils/threading.py", line 15, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/"user"//.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 138, in mmv_run_starter
    return mmv_run_thread(X1, X2, v, out, kernel, blk_n, blk_m, mem_needed, dev, tid=proc_idx)
  File "/home/"user"//.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 251, in mmv_run_thread
    flat_gpu = torch.empty(size=(mem_needed,), dtype=m1.dtype, device=dev)
RuntimeError: CUDA out of memory. Tried to allocate 21.00 GiB (GPU 0; 31.75 GiB total capacity; 5.57 GiB already allocated; 20.88 GiB free; 9.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/"user"/research/knotty/run/main.py", line 38, in <module>
    alpha, acc_valid_ep3, nystrom_samples, knots_x, acc_ep2_test = run(**args, wandb_run=wandb_run)
  File "/home/"user"/research/knotty/run/run.py", line 225, in run
    Falkon_loss, accu_falkon = falkon_run(dataset, kernel_fn, options, p=num_knots, epochs=20,
  File "/home/"user"/research/knotty/run/run.py", line 34, in falkon_run
    flk.fit(x_train, y_train)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/models/falkon.py", line 264, in fit
    beta = optim.solve(
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/optim/conjgrad.py", line 310, in solve
    B = self.kernel.mmv(M, X, y_over_n, opt=self.params)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/kernels/kernel.py", line 266, in mmv
    return mmv_impl(X1, X2, v, self, out, params)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 734, in fmmv
    return KernelMmvFnFull.apply(kernel, opt, out, X1, X2, v, *kernel.diff_params.values())
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 695, in forward
    KernelMmvFnFull.run_cpu_gpu(X1, X2, v, out, kernel, opt, False)
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/fmmv.py", line 641, in run_cpu_gpu
    outputs = _start_wait_processes(mmv_run_starter, args)
  File "/home/"user"/conda/envs/flk4/lib/python3.10/site-packages/falkon/mmv_ops/utils.py", line 59, in _start_wait_processes
    outputs.append(p.join())
  File "/home/"user"/.conda/envs/flk4/lib/python3.10/site-packages/falkon/utils/threading.py", line 22, in join
    raise RuntimeError('Exception in thread %s' % (self.name)) from self.exc
RuntimeError: Exception in thread GPU-0
```
It works fine with 1 or 2 GPUs. I was also wondering whether using 3 or more GPUs can make FALKON even faster?
Thank you for your help.
Hi @ahabedsoltan! Unfortunately the code that splits a dataset into blocks can produce odd behavior, and this sometimes depends on factors such as the number of GPUs.
I just introduced an option, `memory_slack`, that changes the heuristic used to split the data into blocks. By default it is set to 0.9, which means the split size is calculated assuming 90% of the available GPU RAM is usable.
You could try reducing it to e.g. 0.7, and the out-of-memory errors should go away.
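To illustrate how such a slack factor interacts with block sizing, here is a minimal, self-contained sketch. It is not Falkon's actual implementation: the function name `block_rows` and the exact formula are hypothetical, and the only assumption taken from the reply above is that `memory_slack` scales the free GPU memory considered usable before the block size is computed.

```python
def block_rows(free_bytes: int, bytes_per_row: int, memory_slack: float = 0.9) -> int:
    """Hypothetical sketch of a block-split heuristic.

    Only a `memory_slack` fraction of the free GPU memory is treated as
    usable; the block size (in rows) is derived from that budget.
    """
    usable = int(free_bytes * memory_slack)
    return max(1, usable // bytes_per_row)


# With ~20.88 GiB free (as in the traceback above), lowering memory_slack
# from 0.9 to 0.7 shrinks each block's allocation request.
free = int(20.88 * 2**30)
per_row = 8 * 10_000  # hypothetical: float64 rows with 10k columns

print(block_rows(free, per_row, memory_slack=0.9))
print(block_rows(free, per_row, memory_slack=0.7))
```

In Falkon itself the knob would presumably be passed through the options object (e.g. something like `FalkonOptions(memory_slack=0.7)`), but check the library's documentation for the exact parameter location, since the option was only just introduced.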