
GPU does not show up as OpenCL device when logged in over SSH, unless you login locally #701

Open
ProjectPhysX opened this issue Jan 29, 2024 · 13 comments


@ProjectPhysX

On a fresh Ubuntu Server 23.04 installation (kernel 6.5), after installing NEO and rebooting, the GPU (Arc A770) does not show up as an OpenCL device when accessing the machine remotely over SSH. Only when I log in locally at the PC does the GPU immediately show up as an OpenCL device, both locally and in the remote terminal.

@JablonskiMateusz
Contributor

Hi @ProjectPhysX
Could you run the command strace -o strace.log clinfo and share the produced strace.log file?

@ProjectPhysX
Author

Hi @JablonskiMateusz,

here is strace-before-local-login.log, and visible devices are:

| Device ID    0 | NVIDIA TITAN Xp                                            |
| Device ID    1 | 13th Gen Intel(R) Core(TM) i7-13700K                       |
| Device ID    2 | Intel(R) FPGA Emulation Device                             |

After logging in locally on the PC, here is strace-after-local-login.log, and visible devices are:

| Device ID    0 | Intel(R) Arc(TM) A770 Graphics                             |
| Device ID    1 | Intel(R) UHD Graphics 770                                  |
| Device ID    2 | NVIDIA TITAN Xp                                            |
| Device ID    3 | 13th Gen Intel(R) Core(TM) i7-13700K                       |
| Device ID    4 | Intel(R) FPGA Emulation Device                             |

Kind regards,
Moritz

@JablonskiMateusz
Contributor

@ProjectPhysX from the logs, it looks like in the first case you don't have permission on the GPU device file:

openat(AT_FDCWD, "/dev/dri/by-path/pci-0000:00:02.0-render", O_RDWR|O_CLOEXEC) = -1 EACCES (Permission denied)

Please ensure that the user you are using is a member of the render group.
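A quick way to confirm this diagnosis (a sketch; the renderD* device name varies per system):

```shell
# Check which group owns the render nodes; typically "crw-rw---- root render"
ls -l /dev/dri/renderD*

# List the groups the current user belongs to; "render" should appear
id -nG
```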

@ProjectPhysX
Author

Hi @JablonskiMateusz,

thanks a lot for the help! An additional sudo usermod -a -G render $(whoami) fixes the issue.
Please make the installation fix the file permissions or automatically add the user to the render group, and/or include this line in the installation instructions.
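For reference, group membership only takes effect for new login sessions, so the SSH session has to be restarted after the change; a sketch of the full sequence (assuming the render group exists):

```shell
# Add the current user to the render group
sudo usermod -a -G render "$(whoami)"

# The change applies to new login sessions only; either log out and back in,
# or start a subshell with the new group for immediate testing:
newgrp render

# Verify the group is active in the current session
id -nG | grep -w render
```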

Kind regards,
Moritz

@bashbaug
Contributor

@JablonskiMateusz, out of curiosity why does logging in locally "fix" this issue?

@JablonskiMateusz
Contributor

@ProjectPhysX

In our readme we have the following line:

To allow NEO access to GPU device make sure user has permissions to files /dev/dri/renderD*.

btw.

out of curiosity why does logging in locally "fix" this issue?

@ProjectPhysX when you logged in locally, was it the same user as when you logged in over ssh?

@ProjectPhysX
Author

ProjectPhysX commented Jan 31, 2024

@JablonskiMateusz yes, same user. The local login alone triggers the GPU to become visible as OpenCL device.
Why can't the installation set the user access rights? If you miss this detail, devices silently fail to show up, without any error; that's not user-friendly.

@eero-t

eero-t commented Feb 16, 2024

thanks a lot for the help! An additional sudo usermod -a -G render $(whoami) fixes the issue. Please make the installation fix the file permissions or automatically add the user to the render group,

It's (definitely) not the driver package's responsibility to do things like that.

and/or include this line in the installation instructions.

Yes, that's a good idea. In which documents do you think this should be mentioned?

@JablonskiMateusz yes, same user. The local login alone triggers the GPU to become visible as OpenCL device.

As to what happens when you do a graphical login locally: your GUI session manager grants the authenticated user (temporary) access to the display device. Otherwise the user's GUI would not work that well (it would fall back to CPU rendering, or even fail).
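On systemd systems this can be observed directly: logind's "uaccess" mechanism attaches a per-user ACL entry to the device node for the local session. A sketch for inspecting it (getfacl comes from the acl package, and the device name renderD128 is an example):

```shell
# Show the ACL on a render node; after a local graphical login, an extra
# "user:<name>:rw-" entry typically appears for the logged-in user
getfacl /dev/dri/renderD128

# The regular group owner can also be read with stat
stat --format %g /dev/dri/renderD128
```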

@ProjectPhysX
Author

Yes, that's a good idea. In which documents do you think this should be mentioned?

Here in the Readme and in the "Installation procedure" in release notes would be good. Thanks!

@eero-t

eero-t commented Feb 16, 2024

An additional sudo usermod -a -G render $(whoami) fixes the issue.

Older distro versions (e.g. older Ubuntu) do not have a render group, so it's better to use the Intel device's group ID directly.

In case the host also has non-Intel DRM devices (with different group IDs), the Intel GPU device file names can be obtained with the following:
grep -l 0x8086 /sys/class/drm/renderD*/device/vendor | cut -d/ -f 5

And the group ID for the first one with:
stat --format %g /dev/dri/$(grep -l 0x8086 /sys/class/drm/renderD*/device/vendor | cut -d/ -f 5 | head -1)
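Putting the two commands together, a sketch that adds the current user to the owning group of the first Intel render node (assumes at least one Intel GPU is present):

```shell
# First Intel (vendor 0x8086) render node, e.g. "renderD128"
node=$(grep -l 0x8086 /sys/class/drm/renderD*/device/vendor | cut -d/ -f5 | head -1)

# Numeric group ID owning that device file
gid=$(stat --format %g "/dev/dri/$node")

# Add the current user to that group by numeric GID
sudo usermod -a -G "$gid" "$(whoami)"
```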

Yes, that's a good idea. In which documents do you think this should be mentioned?

Here in the Readme and in the "Installation procedure" in release notes would be good. Thanks!

Thanks! @JablonskiMateusz ?

@sumseq

sumseq commented Sep 4, 2024

I am having a similar issue after upgrading from Rocky 9.2 to Rocky 9.4.
I see my Arc 750 in lspci, but not in clinfo, and I cannot run code on it.
My username is part of the "render" group, and I have the Red Hat 9.3 driver installed along with the oneAPI HPC toolkit 2024.2.
Any ideas?

@eero-t

eero-t commented Sep 4, 2024

@sumseq I'm not familiar with Rocky, but maybe your kernel and user-space driver do not match anymore after the update? See #710.

@sumseq

sumseq commented Sep 4, 2024

@sumseq I'm not familiar with Rocky, but maybe your kernel and user-space driver do not match anymore after the update? See #710.

Thanks for the reference! The environment variables they say to set in that post make it work!
For reference:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
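Exports only last for the current shell session; to persist the workaround, the variables can be appended to the shell profile (a sketch assuming bash):

```shell
# Append the NEO debug-key workaround to ~/.bashrc so new shells pick it up
cat >> ~/.bashrc <<'EOF'
export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
EOF
```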
