ImagePullBackOff caused by redundant information from the operator #647

uhthomas · 2023-12-27T18:40:36Z

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

OS/Version(e.g. RHEL8.6, Ubuntu22.04): Talos v1.6.1
Kernel Version: 6.1.69
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): 1.29.0 - Talos
GPU Operator Version: 23.9.1

2. Issue or feature description

The operator tries to pull image tags that don't exist, seemingly because it appends redundant information such as the kernel version and OS.

❯ k describe po nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  56s               default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd to rhode
  Normal   Pulled     18s               kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5" already present on machine
  Normal   Created    18s               kubelet            Created container k8s-driver-manager
  Normal   Started    18s               kubelet            Started container k8s-driver-manager
  Normal   BackOff    15s               kubelet            Back-off pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     15s               kubelet            Error: ImagePullBackOff
  Normal   Pulling    4s (x2 over 17s)  kubelet            Pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     2s (x2 over 16s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": failed to resolve reference "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1: not found
  Warning  Failed     2s (x2 over 16s)  kubelet            Error: ErrImagePull

❯ k get po
NAME                                                     READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-pgc7c                              0/1     Init:0/1           0             2m47s
nvidia-container-toolkit-daemonset-lw22k                 0/1     Init:0/1           0             2m47s
nvidia-dcgm-exporter-qg6j7                               0/1     Init:0/1           0             2m47s
nvidia-device-plugin-daemonset-m8z55                     0/1     Init:0/1           0             2m47s
nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd   0/1     ImagePullBackOff   0             3m25s
nvidia-gpu-operator-79c7dc6d5-8dhhx                      1/1     Running            7 (13m ago)   2d19h
nvidia-operator-validator-xnbhr                          0/1     Init:0/4           0             2m47s

3. Steps to reproduce the issue

Deploy the GPU operator with the default configuration on a Talos Kubernetes cluster.

4. Information to attach (optional if deemed irrelevant)

kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

uhthomas · 2023-12-27T18:50:48Z

This happens both with and without the usePrecompiled option. The output above is with it enabled, and this is without:

❯ k describe po nvidia-driver-daemonset-pmmz5
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  22s               default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-pmmz5 to rhode
  Normal   Pulled     21s               kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5" already present on machine
  Normal   Created    21s               kubelet            Created container k8s-driver-manager
  Normal   Started    21s               kubelet            Started container k8s-driver-manager
  Normal   Pulling    8s (x2 over 19s)  kubelet            Pulling image "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1"
  Warning  Failed     6s (x2 over 18s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1": failed to resolve reference "nvcr.io/nvidia/driver:535.129.03-talosv1.6.1": nvcr.io/nvidia/driver:535.129.03-talosv1.6.1: not found
  Warning  Failed     6s (x2 over 18s)  kubelet            Error: ErrImagePull

uhthomas · 2023-12-27T19:00:12Z

I tried to set the DRIVER_IMAGE environment variable on the operator pod to nvcr.io/nvidia/driver:535.129.03-ubuntu22.04 but it didn't help.

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags

cdesiniotis · 2024-01-25T21:57:08Z

Hi @uhthomas, Talos is not a supported distribution. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-operating-systems-and-kubernetes-platforms

cdesiniotis · 2024-01-25T22:15:09Z

I am not familiar with Talos at all, but if you wanted to force the operator to pull the driver image for one of our supported distros, like ubuntu22.04, then you would need to use the image digest when configuring driver.version in clusterpolicy. For example, driver.version=sha256:3981d34191e355a8c96a926f4b00254dba41f89def7ed2c853e681a72e3f14eb if you wanted to use the 535.129.03-ubuntu22.04 tag.

Note, it is likely that the ubuntu22.04 image will fail to install the driver successfully on a different distribution. One way to proceed is to install the NVIDIA drivers following Talos's official guide: https://www.talos.dev/v1.6/talos-guides/configuration/nvidia-gpu/ and then install the GPU Operator with driver.enabled=false to bring up the rest of the software components.

jfroy · 2024-11-20T18:09:42Z

I've been working on making GPU Operator and related components work out of the box on Talos. There is some work left to do. See #1007, NVIDIA/nvidia-container-toolkit#700.

For Talos, once some of the issues in NVIDIA components have been resolved, siderolabs/extensions#476 will provide a host driver installation compatible with the GPU Operator. I've also talked with SideroLabs about supporting driver containers, but that would require some changes in Talos. For security, they remove SYS_MODULE from the container runtime and thus from all containers. Either that restriction would need to be lifted, or a new API would have to be added to their PID 1 (machined) to request that a module be loaded.

Additionally, if you use secure boot, then no pre-built driver container will work because SideroLabs throws away the kernel module signing key after they build the kernel and kernel module packages. I've also talked with them about that, and there just isn't a clear solution that works for every customer and use case. As with many things in the Linux world, if you're serious about security, your best option is likely to build some of Talos from source (kernel and kernel modules) and manage secure boot and kernel keys with your own key infrastructure. In such a scenario, either a host driver extension as linked above or driver containers will work if you retain the necessary private keys to sign the kernel modules.

As for the redundant information, all the operator is doing is concatenating node labels:

feature.node.kubernetes.io/kernel-version.full: 6.11.3-talos
feature.node.kubernetes.io/system-os_release.ID: talos
feature.node.kubernetes.io/system-os_release.VERSION_ID: v1.8.3

On Talos, it just happens that both the kernel version and the OS release ID contain "talos".
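
To make the concatenation concrete, here is a minimal sketch that reproduces the two tags from the events earlier in this thread. It is not the operator's actual code: `driver_image_tag` is a hypothetical helper, the tag layout is inferred only from the pull errors above, and the label values assume a Talos v1.6.1 node reporting kernel `6.1.69-talos`.

```shell
# Illustrative only: mirrors the tag layout seen in the pull errors above.
# driver_image_tag is a hypothetical helper, not part of the GPU Operator.
#   $1 = driver version (e.g. 535.129.03)
#   $2 = kernel-version.full label; pass "" when usePrecompiled is off
#   $3 = system-os_release.ID label
#   $4 = system-os_release.VERSION_ID label
driver_image_tag() {
  if [ -n "$2" ]; then
    # Precompiled drivers: the kernel version is part of the tag.
    printf '%s-%s-%s%s\n' "$1" "$2" "$3" "$4"
  else
    # Default: driver version followed by OS ID and version ID.
    printf '%s-%s%s\n' "$1" "$3" "$4"
  fi
}

# With usePrecompiled: "talos" shows up twice, once from each label.
driver_image_tag 535.129.03 6.1.69-talos talos v1.6.1
# Without usePrecompiled.
driver_image_tag 535.129.03 "" talos v1.6.1
```

The same layout with `ubuntu` and `22.04` as the OS labels gives `535.129.03-ubuntu22.04`, a tag that is published. That suggests the failure comes from no image being published for Talos rather than from the repeated word in the tag.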

@elezar

cdesiniotis added the platform label Jan 25, 2024

ArangoGutierrez removed the platform label Feb 22, 2024
