Include /usr/bin/nvidia-smi for nvidia-kmod extension #385
Comments
The Talos Nvidia driver extensions installs [...] However, you will then find that the device plug-in does not find a core CUDA library as part of its driver detection process. This is because of the aforementioned custom install path for other driver components. Furthermore, Talos applies a patch to the container toolkit to change the ldcache path (which the toolkit uses to find libraries), because Talos needs to maintain separate glibc and musl LD caches and thus stores them in custom locations. To get past that issue, you will need to patch the device plug-in, build and publish a custom image, and use that image. Something like this:

diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
index 2f6de2fe..35f62f45 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
@@ -33,7 +33,7 @@ import (
"github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/symlinks"
)
-const ldcachePath = "/etc/ld.so.cache"
+const ldcachePath = "/usr/local/glibc/etc/ld.so.cache"
const (
magicString1 = "ld.so-1.7.0"
diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
index 7f5cf7c8..85fd1db9 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
@@ -36,6 +36,7 @@ func NewLibraryLocator(opts ...Option) Locator {
// If search paths are already specified, we return a locator for the specified search paths.
if len(b.searchPaths) > 0 {
+ b.logger.Infof("Returning symlink locator with paths: %v", b.searchPaths)
return NewSymlinkLocator(
WithLogger(b.logger),
WithSearchPaths(b.searchPaths...),
@@ -56,6 +57,7 @@ func NewLibraryLocator(opts ...Option) Locator {
"/lib/aarch64-linux-gnu",
"/lib/x86_64-linux-gnu/nvidia/current",
"/lib/aarch64-linux-gnu/nvidia/current",
+ "/usr/local/lib",
}...),
)
// We construct a symlink locator for expected library locations.

With the previously mentioned upcoming support for driver container images in the GPU operator, Talos may want to consider reworking their Nvidia extensions to deliver all the components as container images. That should hopefully provide a better-supported and more stable long-term solution.
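To illustrate why the extra search path in the patch above matters, here is a minimal Go sketch of a first-match library lookup over a list of directories. It is not the container toolkit's actual locator (which also consults the ldcache and resolves symlinks); `findLibrary`, `fileExists`, and `defaultDirs` are hypothetical names, and `libcuda.so.1` is only an assumed example of the "core CUDA library" mentioned above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defaultDirs holds a few of the directories visible in the diff context
// above. This is an illustrative subset, not the toolkit's full list.
var defaultDirs = []string{
	"/lib/aarch64-linux-gnu",
	"/lib/x86_64-linux-gnu/nvidia/current",
	"/lib/aarch64-linux-gnu/nvidia/current",
}

// findLibrary returns the first "dir/name" for which exists reports true.
// The predicate is injected so the lookup can be exercised without a real
// filesystem.
func findLibrary(name string, dirs []string, exists func(string) bool) (string, bool) {
	for _, dir := range dirs {
		candidate := filepath.Join(dir, name)
		if exists(candidate) {
			return candidate, true
		}
	}
	return "", false
}

// fileExists reports whether path can be stat'ed on the local filesystem.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Assumption: the library name is an example only.
	const lib = "libcuda.so.1"

	// Append the directory added by the patch to a copy of the defaults.
	dirs := append(append([]string{}, defaultDirs...), "/usr/local/lib")

	if path, ok := findLibrary(lib, dirs, fileExists); ok {
		fmt.Println("found:", path)
	} else {
		fmt.Println(lib, "not found in", dirs)
	}
}
```

With only the default directories, a library that lives under `/usr/local/lib` is never considered, which matches the detection failure described above; adding that directory to the list is what lets the lookup succeed.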
Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, which is why each release of Talos has a specific corresponding release of each system extension.
Yeah, I like that Talos provides a chain of trust. You would need a per-release driver container just like you have a per-release extension. I work at Nvidia, but I only speak for myself here. It would be inappropriate to engage beyond the occasional comment and bug fix PR on GitHub. I will however reach out to the folks working on our container technologies.
That would be greatly appreciated, and thank you for reaching out in the first instance.
When attempting to run the NVIDIA gpu-operator, it fails to fully initialize. From what I can tell, this is because the nvidia-validator tries to run the nvidia-smi binary from the host in /usr/bin/.
I installed the operator via helm with the following values.yaml
This should skip installing drivers and changing containerd config (already included with the extensions), but it apparently doesn't skip checking them.
The chart was installed with
I tried manually touching the files that the validator creates, and it still attempts to execute the nvidia-smi command.
More information is available in the repo:
https://github.com/NVIDIA/gpu-operator/tree/master
and in the installation docs:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide