Adjust device permissions #61

Open
wants to merge 1 commit into base: main

Conversation

scaronni
Contributor

@scaronni scaronni commented Dec 5, 2024

In relation to #57 and #52.

This pull request makes the following changes:

  • Drop device linking with custom permissions hack for proprietary modules in kmp-post.sh and kmp-trigger.sh.
  • Adjust the udev rule to work for all NVIDIA modules, not just nvidia (see the sketch after this list).
  • Drop custom user/group for device files:
    • For desktops we are using the uaccess ACL on device files.
    • For compute on SSH probably some other mechanism should apply instead of the video group (a compute cluster node does not have any "video").
  • By dropping the nvidia options NVreg_DeviceFileUID and NVreg_DeviceFileGID, the proprietary modules get the uaccess ACL correctly. I don't know the exact reason, but I guess it's a race condition when the device files get created.
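
For reference, a minimal sketch of the resulting setup: the udev rule is the one quoted later in this thread, and the commented-out modprobe.d lines only illustrate which options are being dropped (file names and exact values are not part of this pull request):

# udev: match device nodes of all NVIDIA modules, create missing nodes via
# nvidia-modprobe, and tag them so systemd-logind applies the uaccess ACL
KERNEL=="nvidia*", RUN+="/usr/bin/nvidia-modprobe", TAG+="uaccess"

# modprobe.d: these options are no longer set, so the device files keep
# their default root ownership and the uaccess ACL takes effect
#options nvidia NVreg_DeviceFileUID=...
#options nvidia NVreg_DeviceFileGID=...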

@scaronni
Contributor Author

scaronni commented Dec 5, 2024

Note: for our modules we will, for the time being, stick with neither the 660 permission nor the uaccess ACL; I need to discuss this internally, as some tools expect the default access level.

@sndirsch
Collaborator

sndirsch commented Dec 5, 2024

Wow! You must have had a hard time figuring this out yesterday! I've tested the open and the proprietary driver. Both work fine with that configuration. I noticed that the open driver doesn't add any ACLs for

/dev/nvidia-modeset
/dev/nvidia0
/dev/nvidiactl

but had the following set:

# file: dev/nvidia-caps
# owner: root
# group: root
user::rwx
group::r-x
other::r-x

whereas for the proprietary driver, ACLs for these devices are added, plus these settings for /dev/nvidia-caps:

# file: dev/nvidia-caps
# owner: root
# group: root
user::rwx
group::r-x
other::r-x

# file: dev/nvidia-modeset
# owner: root
# group: root
user::rw-
user:tux:rw-
group::rw-
mask::rw-
other::---

# file: dev/nvidia-uvm
# owner: root
# group: root
user::rw-
group::rw-
other::rw-

# file: dev/nvidia-uvm-tools
# owner: root
# group: root
user::rw-
group::rw-
other::rw-

# file: dev/nvidia0
# owner: root
# group: root
user::rw-
user:tux:rw-
group::rw-
mask::rw-
other::---

# file: dev/nvidiactl
# owner: root
# group: root
user::rw-
user:tux:rw-
group::rw-
mask::rw-
other::---

Output for the open driver:

# file: dev/nvidia-caps
# owner: root
# group: root
user::rwx
group::r-x
other::r-x

# file: dev/nvidia-modeset
# owner: root
# group: root
user::rw-
group::rw-
other::---

# file: dev/nvidia0
# owner: root
# group: root
user::rw-
group::rw-
other::---

# file: dev/nvidiactl
# owner: root
# group: root
user::rw-
group::rw-
other::---

@sndirsch
Collaborator

sndirsch commented Dec 5, 2024

@e4t Will it be an issue that adding users to the video group no longer works to get access to NVIDIA GPUs? I guess you now need to use setfacl instead. The question is how to make this a permanent setting.
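
For illustration, a hedged sketch of what the setfacl approach could look like, reusing the user tux from the getfacl output above (a one-off command; it does not survive the device nodes being recreated, which is exactly the permanence problem):

# grant a single user read/write access to the NVIDIA device nodes
setfacl -m u:tux:rw /dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset
# inspect the resulting ACL
getfacl /dev/nvidia0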

@scaronni
Contributor Author

scaronni commented Dec 5, 2024

Wow! You must have had a hard time figuring this out yesterday!

Eh, I did it this morning, no time yesterday. But what I tested on is a system with a GTX 1600, so there are no open drivers to test; it supports only the proprietary modules.

I've tested the open and the proprietary driver. Both work fine with that configuration. I noticed that the open driver doesn't add any ACLs for
[...]
whereas for the proprietary driver, ACLs for these devices are added, plus these settings for /dev/nvidia-caps:
[...]

I'm not sure I understand from your comment if it works or not...

I guess this is due to the fact that the logic is: if you are trying to access an nvidia character file, create it with nvidia-modprobe and set the appropriate ACL because of the uaccess tag.

As for the other devices without the ACL, it's probably because the user/process logged in on the console was not trying to access them; they are just a result of the nvidia-modprobe action.

The proprietary modules are not staying forever; maybe we can clean that up when they disappear.

Regarding the video group, if a user needs to run something via SSH, you can always add a new file in /etc/modprobe.d/ to change the group to whatever you want on that specific node. This removal of the video group is just a suggestion, btw.
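
As a hedged sketch, such a per-node override could look like the following (the file name is hypothetical, and the GID must be replaced with that of the desired group on the node):

# /etc/modprobe.d/99-nvidia-local.conf (hypothetical file name)
options nvidia NVreg_DeviceFileUID=0
options nvidia NVreg_DeviceFileGID=<GID of the desired group>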

@sndirsch
Collaborator

sndirsch commented Dec 5, 2024

Wow! You must have had a hard time figuring this out yesterday!

Eh, I did it this morning, no time yesterday. But what I tested on is a system with a GTX 1600, so there are no open drivers to test; it supports only the proprietary modules.

Sure, still pre-Turing hardware.

I've tested the open and the proprietary driver. Both work fine with that configuration. I noticed that the open driver doesn't add any ACLs for
[...]
whereas for the proprietary driver, ACLs for these devices are added, plus these settings for /dev/nvidia-caps:
[...]

I'm not sure I understand from your comment if it works or not...

Don't worry. Your solution works with the open AND the proprietary driver.

I guess this is due to the fact that the logic is: if you are trying to access an nvidia character file, create it with nvidia-modprobe and set the appropriate ACL because of the uaccess tag.

As for the other devices without the ACL, it's probably because the user/process logged in on the console was not trying to access them; they are just a result of the nvidia-modprobe action.

Well, a session was running with this user. Same scenario with the open and the proprietary driver, everything working fine, but different getfacl output.

The proprietary modules are not staying forever; maybe we can clean that up when they disappear.

Yes, but that could be in 2027/2028. ;-)

Regarding the video group, if a user needs to run something via SSH, you can always add a new file in /etc/modprobe.d/ to change the group to whatever you want on that specific node.

That's indeed true.

This removal of the video group is just a suggestion, btw.

Of course, but it's an easy solution for both the open and the proprietary driver. The behaviour is just different than before. We may need to document this somehow/somewhere.

@e4t

e4t commented Dec 5, 2024

  * For compute on SSH probably some other mechanism should apply instead of the `video` group (a compute cluster node does not have any "video").

Do you know of any? I'm not aware of such a thing. It could be that Slurm supports ACLs now - I've read in the release notes that they have improved handling of GPUs - but I don't know.
The problem is that for setting up the ACL a service needs to run as root - the display manager does, so it's able to take the necessary steps. There are two scenarios, however, that need to be addressed: remote sessions (ssh, for instance) and batch jobs.
I agree that using the video group is a bit ugly and doesn't play well with unprivileged podman containers. I'm just a bit concerned that if we change it now it will break for existing users. It would be bad if an update suddenly prevented access to GPUs.
I can try to investigate why it doesn't work when using NVreg_DeviceFileUID and NVreg_DeviceFileGID but this is not going to happen this year any more: today is my last working day for the year.

@scaronni
Contributor Author

scaronni commented Dec 5, 2024

I can try to investigate why it doesn't work when using NVreg_DeviceFileUID and NVreg_DeviceFileGID but this is not going to happen this year any more: today is my last working day for the year.

Woah, long holidays, congrats! 🥇

@sndirsch
Collaborator

sndirsch commented Dec 6, 2024

Just gave it a try by adding a second modprobe.d snippet, just for the open driver, to get this video group feature back.

$ cat /etc/modprobe.d/51-nvidia.conf 
options nvidia NVreg_DeviceFileUID=0
options nvidia NVreg_DeviceFileGID=481

But this again breaks the uaccess hack.

KERNEL=="nvidia*", RUN+="/usr/bin/nvidia-modprobe", TAG+="uaccess"

Users NOT in the video group can still log in graphically, but switcheroo -g 1 glxinfo -B crashes and nvidia-smi gives you Failed to initialize NVML: Insufficient Permissions. Users in the video group can log in graphically and run switcheroo and nvidia-smi without any issues. It seems you can't mix this, not even with the open driver.

@sndirsch
Collaborator

sndirsch commented Dec 9, 2024

@scaronni-nvidia Any new ideas about this? Otherwise I would say we keep it as is for now and dig into this again once @e4t is back from vacation in January.

@scaronni-nvidia
Contributor

I would say leave the merge request open and we'll update as soon as we have something to show. Anyway, it needs Egbert's approval, right?

@scaronni-nvidia
Contributor

I might have time tomorrow to look at it.

@sndirsch
Collaborator

sndirsch commented Dec 9, 2024

I would say we don't need Egbert's input if we can keep the possibility of just adding users to the video group.

@scaronni
Contributor Author

scaronni commented Dec 9, 2024

Ok will try.

@sndirsch
Collaborator

@scaronni-nvidia Any outcome already? I'll be on vacation from Tue Dec 17 2024 until Mon Jan 6 2025.

@scaronni
Contributor Author

Had no time yet, sorry. Maybe in the next few days; I'll try to sort it out before the end of the week.
