I've managed to set up a Talos cluster with both amd64 and arm64 worker nodes. I have no issues running amd64 GPU jobs using the nonfree / production NVIDIA driver extension. There have been some sharp edges, but all in all I've had a pretty clean experience along the way, even though my Kubernetes knowledge is limited. Thank you!
The arm64 node is a Honeycomb LX2k board based around the LX2160A SoM, which requires a patch to the open-gpu-kernel-modules to function. This patch appears not to have made it into their master branch, so I don't think it is present in either the LTS or the production variant of the Talos published extensions. A related issue is given here showing the OSS modules working with this patch. Before switching to Talos I was running containerized GPU images on this platform in Ubuntu 22.04 on an Ampere card without issues.
I checked out this repo thinking I might be able to apply a patch to the driver build script, but on closer inspection it appears that this repo actually stitches together prebuilt and signed artifacts from the container registry ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules-*. Could you nudge me in the right direction to patch, build and sign my own OSS modules to produce an updated Talos extension? Alternatively, is there an official process whereby Sidero Labs can supply a prebuilt image with the patch applied, so that the drivers support this platform?
asymingt
changed the title
How can I patch and build the open-gpu-kernel-modules extension to support eh arm64 LX2160 platform?
How can I patch and build the open-gpu-kernel-modules extension to support the arm64 LX2160 platform?
Nov 23, 2024
asymingt
changed the title
How can I patch and build the open-gpu-kernel-modules extension to support the arm64 LX2160 platform?
How can I patch and build the open-gpu-kernel-modules extension to support the arm64 LX2160a platform?
Nov 23, 2024