Segmentation fault #68

Open
ahabedsoltan opened this issue Dec 24, 2023 · 6 comments

@ahabedsoltan

Hello,

I've been using FALKON, and it functions well on a GPU with a small number of centers. However, when I increase the number of centers to around 64,000, I encounter a "Segmentation fault" error, causing the program to terminate.

Here's the sequence of events during the run:

falkon starts...
MainProcess.MainThread::[Calcuating Preconditioner of size 128000]
Preconditioner will run on 1 GPUs
--MainProcess.MainThread::[Kernel]
--MainProcess.MainThread::[Kernel] complete in 80.324s
--MainProcess.MainThread::[Cholesky 1]
Using parallel POTRF
--MainProcess.MainThread::[Cholesky 1] complete in 47.152s
--MainProcess.MainThread::[Copy triangular]
--MainProcess.MainThread::[Copy triangular] complete in 17.284s
--MainProcess.MainThread::[LAUUM(CUDA)]
--MainProcess.MainThread::[LAUUM(CUDA)] complete in 56.486s
--MainProcess.MainThread::[Cholesky 2]
Segmentation fault

I'm curious why this issue arises with a large number of centers. Previously, I successfully used FALKON with up to 256,000 centers, so it seems the current version has issues at this scale. Your assistance in resolving this would be greatly appreciated.
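
For a sense of scale at this number of centers, here is a rough back-of-the-envelope sketch; it is not taken from FALKON itself. It assumes the preconditioner works on a dense M x M matrix and computes that matrix's element count and memory footprint. The helper name `square_matrix_footprint` is made up for illustration. The thread does not say what caused the crash, so the int32 check is only a generic thing worth ruling out at this size.

```python
INT32_MAX = 2**31 - 1


def square_matrix_footprint(m: int, bytes_per_element: int) -> tuple[int, float]:
    """Return (number of elements, size in GiB) of a dense m x m matrix."""
    n_elements = m * m
    size_gib = n_elements * bytes_per_element / 2**30
    return n_elements, size_gib


for m in (16_000, 64_000, 256_000):
    n_elements, gib64 = square_matrix_footprint(m, 8)  # float64
    _, gib32 = square_matrix_footprint(m, 4)           # float32
    print(f"M={m:>7,}: {n_elements:,} elements, "
          f"{gib32:.1f} GiB (float32), {gib64:.1f} GiB (float64), "
          f"element count fits in int32: {n_elements <= INT32_MAX}")
```

For M = 64,000 the matrix has 4,096,000,000 elements, which is about 30.5 GiB in float64 and more than a signed 32-bit integer can count.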

@parthe

parthe commented Dec 28, 2023

We have tried setting the following options, yet the segfault persists:
never_store_kernel=True
chol_force_ooc=True
no_single_kernel=False

@parthe

parthe commented Dec 29, 2023

Here is a minimal working example that reproduces the error reported by @ahabedsoltan:

import falkon
import torch

# n: training samples, N: test samples, M: centers, d: dimensions, bw: kernel bandwidth
n, N, M, d, bw = 200_000, 1000, 64_000, 1, 1.

# Classification accuracy in percent, passed to Falkon as the error function
accufun = lambda yt, yh: 100 * (yt.argmax(dim=1) == yh.argmax(dim=1)).sum() / yh.shape[0]

options = falkon.FalkonOptions(debug=True,
    never_store_kernel=True,
    chol_force_ooc=True,
    no_single_kernel=False)
kernel_fn_flk = falkon.kernels.LaplacianKernel(sigma=bw, opt=options)
model = falkon.Falkon(kernel=kernel_fn_flk, penalty=1e-6, M=M, options=options,
                      error_every=1, error_fn=accufun, maxiter=1)

# Random data is enough to trigger the crash
X = torch.randn(n, d)
Y = torch.randn(n, d)
x = torch.randn(N, d)
y = torch.randn(N, d)
model.fit(X, Y, Xts=x, Yts=y)
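
A bare "Segmentation fault" line gives no hint about which Python call was active when the process died. One general way to get that information is Python's standard `faulthandler` module. This is a debugging suggestion, not something the participants reported doing; the lines below would go at the top of a script like the one above.

```python
import faulthandler
import sys

# On SIGSEGV (and SIGFPE, SIGABRT, SIGBUS, SIGILL), dump the Python
# traceback of every thread to stderr before the process terminates.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... then build the model and call model.fit(...) as in the script above ...
```

The same effect is available without editing the script by running it with `python -X faulthandler`. The traceback only covers Python frames, so it shows which Python-level call led into the crashing native code, not the native stack itself.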

@Giodiro
Contributor

Giodiro commented Jan 1, 2024

Hi! I think it was a bug in a small helper function; it should be fixed on master now. Are you comfortable trying it out from master, or would you prefer that I release a new version?

@ahabedsoltan
Author

Thank you. Could you please create a pre-built wheel for it? Each time I try to install it using the command 'pip install git+https://github.com/falkonml/falkon.git', the installation fails.

@parthe

parthe commented Jan 2, 2024

Reinstalling falkon as follows solved the issue. @Giodiro Thanks for the quick bug-fix!

pip uninstall falkon
pip install --no-build-isolation git+https://github.com/FalkonML/falkon.git
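
After reinstalling, it can help to confirm which distributions the active environment actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the version string a source install reports depends on the package's own metadata, so it may not differ from the last release.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("falkon", "torch"):
    print(f"{name}: {installed_version(name)}")
```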

@ahabedsoltan
Author

Thank you, it resolved the issue.
