lxcfs: handle NULL in lxcfs_read (segfault at 0, code=killed, status=11/SEGV) #635

Open
81981266 opened this issue Apr 24, 2024 · 10 comments · May be fixed by #640
Labels: Incomplete (Waiting on more information from reporter)

Comments

@81981266

81981266 commented Apr 24, 2024

OS: Ubuntu, kernel 5.15.0-52
lxcfs version: 4.0.11

lxcfs was killed by signal 11 (SEGV). The syslog output is below:

Apr 18 22:54:43 10-169-58-129 kernel: [10829437.241075] lxcfs[197879]: segfault at 0 ip 00007f101184cf81 sp 00007f0fe9ffa790 error 6 in libc-2.31.so[7f10117e3000+178000]
Apr 18 22:54:43 10-169-58-129 systemd[1]: lxcfs.service: Main process exited, code=killed, status=11/SEGV
Apr 18 22:54:43 10-169-58-129 systemd[1]: var-lib-lxcfs.mount: Succeeded.
Apr 18 22:54:43 10-169-58-129 systemd[1]: lxcfs.service: Failed with result 'signal'.
Apr 18 22:54:43 10-169-58-129 systemd[1]: lxcfs.service: Consumed 2d 41min 37.091s CPU time.
Apr 18 22:54:43 10-169-58-129 systemd[1]: lxcfs.service: Scheduled restart job, restart counter is at 1.
Apr 18 22:54:43 10-169-58-129 systemd[1]: lxcfs.service: Consumed 2d 41min 37.091s CPU time.
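
The kernel line in the log above can be decoded by hand. The following is an illustrative sketch only and is not part of lxcfs: the helper names `decode_fault` and `mapping_offset` are hypothetical, and the bit meanings assumed are the standard x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of the x86 page-fault error code (the number after "error"). */
struct fault_info {
	bool page_present; /* bit 0: set for a protection fault on a present page */
	bool is_write;     /* bit 1: set when the faulting access was a write */
	bool user_mode;    /* bit 2: set when the fault happened in user mode */
};

/* Hypothetical helper: split the error code into its three low bits. */
struct fault_info decode_fault(unsigned long error_code)
{
	struct fault_info info = {
		.page_present = (error_code & 0x1UL) != 0,
		.is_write = (error_code & 0x2UL) != 0,
		.user_mode = (error_code & 0x4UL) != 0,
	};

	return info;
}

/*
 * Hypothetical helper: offset of the instruction pointer from the start of
 * the mapping printed as name[base+size]. Returns false when ip lies outside
 * that mapping.
 */
bool mapping_offset(uint64_t ip, uint64_t base, uint64_t size, uint64_t *offset)
{
	assert(offset != NULL);

	if (ip < base || ip - base >= size)
		return false;

	*offset = ip - base;
	return true;
}
```

For the values in the log (`ip 00007f101184cf81`, `error 6`, `libc-2.31.so[7f10117e3000+178000]`), this yields an offset of 0x69f81 into the reported mapping and a user-mode write to a non-present page, which is consistent with a write through a NULL pointer (`segfault at 0`) inside libc. The offset is relative to the start of the mapping, which may itself begin at a non-zero file offset, so that file offset would need to be added before looking the address up in the library.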

The core dump from the /var/crash folder, as analyzed with gdb, is shown below:
[Image: SeaTalk_IMG_20240424_171648]

The code of lxcfs.c:778 is here: https://github.com/lxc/lxcfs/blob/lxcfs-4.0.11/src/lxcfs.c#L778

A fix for a similar problem, 'NULL path in lxcfs_releasedir/lxcfs_release': #577

@81981266
Author

@deleriux @mihalicyn @brauner
Could you help take a look? Thank you in advance.

@mihalicyn mihalicyn self-assigned this Apr 25, 2024
@anooprac

anooprac commented May 1, 2024

Hello, my partner and I are UT Austin students and would like to work on this problem for a class project. Could we know more about how this issue can be recreated so we can try debugging it?

@DevonSchwartz

DevonSchwartz commented May 1, 2024

I am working with anooprac
I think that the solution could be to use the macros created for lxcfs_release() instead of the strcmp method.

@81981266
Author

81981266 commented May 2, 2024

Hi @anooprac, it has only happened on 2 nodes in a cluster of thousands of machines. I still cannot reproduce it manually on my side.

@mihalicyn
Member

mihalicyn commented May 3, 2024

Hi @81981266

thanks for your report!

This is a very interesting issue because, as I can see from the call stack, proc_read was called from do_sys_read, which should never happen. We obviously don't have such calls in the LXCFS code.

My theory is that this could be a very tricky bug in dynamic symbol resolution (the dlsym function): it may be racy and return a pointer to the wrong function in some circumstances.

Another good question: even if proc_read was called instead of sys_read for a sysfs file, how did this code path reach proc_loadavg_read? We have fi->type checks everywhere.
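
For illustration, a type check of the kind referred to above might look like the following sketch. It is an assumption-laden example, not the LXCFS source: `sketch_type`, `sketch_file`, `sketch_loadavg_read` and `sketch_proc_read` are hypothetical names, and the only property borrowed from FUSE is that the per-open handle is a 64-bit integer that can carry a pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-file types; the real project defines its own set. */
enum sketch_type {
	SKETCH_PROC_MEMINFO = 1,
	SKETCH_PROC_LOADAVG = 2,
};

/* Hypothetical per-open state whose address is stored in the 64-bit handle. */
struct sketch_file {
	int type;
};

/* Placeholder reader: writes fixed text and returns the number of bytes stored. */
static int sketch_loadavg_read(char *buf, size_t size)
{
	int n = snprintf(buf, size, "0.00 0.00 0.00 1/1 1\n");

	if (n < 0)
		return -EIO;
	if ((size_t)n >= size)
		return size > 0 ? (int)(size - 1) : 0;
	return n;
}

/* Validate the handle and its type before reaching a specific reader. */
int sketch_proc_read(char *buf, size_t size, uint64_t fh)
{
	struct sketch_file *f = (struct sketch_file *)(uintptr_t)fh;

	assert(buf != NULL);

	if (!f)
		return -EINVAL;

	switch (f->type) {
	case SKETCH_PROC_LOADAVG:
		return sketch_loadavg_read(buf, size);
	default:
		/* Unknown type, or one this sketch does not implement: refuse. */
		return -EINVAL;
	}
}
```

With a check like this in place, the loadavg reader is reachable only when the stored type says so, which is why reaching it for a sysfs file would be surprising.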

Can you provide me with your crash dump file and your LXCFS binary so I can go through the crash-dump and analyze it?

@mihalicyn mihalicyn linked a pull request May 3, 2024 that will close this issue
@81981266
Author

81981266 commented May 3, 2024

Hello @mihalicyn , thanks for your comment.
I'm sorry to say that the core dump file was rotated/deleted when I tried to copy it just now. Then I attached the lxcfs binary. I also attached my lxcfs code repo because we did some revamp based on v4.0.11 to meet some internal requirements. Hope this can help you well. Thank you very much.

OS VERSION="20.04.6 LTS (Focal Fossa)"

lxcfs binary.zip
lxcfs_v4.0.11_revamp.zip

@mihalicyn
Member

> I'm sorry to say that the core dump file was rotated/deleted when I tried to copy it just now

It is sad news.

> Then I attached the lxcfs binary. I also attached my lxcfs code repo because we did some revamp based on v4.0.11 to meet some internal requirements. Hope this can help you well. Thank you very much.

In general I can't see any issues with your version of the code.

Let's wait for the next occurrence of the crash and a crash-dump file. Also, I would strongly recommend updating from 4.0.11 to a recent LXCFS version.

I'm marking this issue as "incomplete" because we don't have enough information to debug it right now.

@mihalicyn mihalicyn added the Incomplete Waiting on more information from reporter label May 3, 2024
@DevonSchwartz

Is it still useful to update the repo to use the macros instead of strcmp for the paths?

@mihalicyn
Member

> Is it still useful to update the repo to use the macros instead of strcmp for the paths?

I think it is. It also makes sense to do this for all fuse callbacks (except open/opendir of course), not only the "read" one. But let's start from read.
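
A minimal sketch of the approach discussed here: picking the backend from a type recorded at open time rather than from the path string, so a NULL path is never dereferenced. All names below (`demo_type`, the `DEMO_TYPE_*` macros, `demo_file_info`, `demo_pick_backend`) are hypothetical stand-ins and not the actual LXCFS macros or structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical file types, recorded when the file is opened. */
enum demo_type {
	DEMO_TYPE_CGFILE = 1,
	DEMO_TYPE_PROC_MEMINFO,
	DEMO_TYPE_PROC_LOADAVG,
	DEMO_TYPE_SYS_CPU_ONLINE,
	DEMO_TYPE_MAX,
};

/* Class checks on the stored type; none of them looks at the path. */
#define DEMO_TYPE_CGROUP(t) ((t) == DEMO_TYPE_CGFILE)
#define DEMO_TYPE_PROC(t) \
	((t) >= DEMO_TYPE_PROC_MEMINFO && (t) <= DEMO_TYPE_PROC_LOADAVG)
#define DEMO_TYPE_SYS(t) ((t) == DEMO_TYPE_SYS_CPU_ONLINE)

static_assert(DEMO_TYPE_SYS_CPU_ONLINE < DEMO_TYPE_MAX,
	      "every real type must stay below DEMO_TYPE_MAX");

/* Which backend a read would be forwarded to in this sketch. */
enum demo_backend {
	DEMO_BACKEND_CGROUP = 1,
	DEMO_BACKEND_PROC,
	DEMO_BACKEND_SYS,
};

/* Hypothetical per-open state. */
struct demo_file_info {
	int type;
};

/*
 * Choose a backend from the stored type. The path argument is kept only to
 * mirror a callback signature; it is never read, so NULL is harmless.
 */
int demo_pick_backend(const char *path, const struct demo_file_info *f)
{
	(void)path;

	if (!f)
		return -EINVAL;

	if (DEMO_TYPE_CGROUP(f->type))
		return DEMO_BACKEND_CGROUP;
	if (DEMO_TYPE_PROC(f->type))
		return DEMO_BACKEND_PROC;
	if (DEMO_TYPE_SYS(f->type))
		return DEMO_BACKEND_SYS;

	return -EINVAL;
}
```

A path-prefix comparison such as `strncmp(path, "/proc", 5)` has undefined behavior when `path` is NULL, whereas the type-based variant only needs the per-open state to be valid. That also fits the exception for open/opendir mentioned above, since those are the callbacks that would record the type in the first place.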

@DevonSchwartz

> > Is it still useful to update the repo to use the macros instead of strcmp for the paths?
>
> I think it is. It also makes sense to do this for all fuse callbacks (except open/opendir of course), not only the "read" one. But let's start from read.

Sounds good. I'll add the fixes for the rest of the FUSE callbacks once the lxcfs_read() changes pass.
