
Question: Support for GPUs from NVIDIA, AMD and Intel #7374

Open
ttsuuubasa opened this issue Oct 9, 2024 · 4 comments

Comments

@ttsuuubasa

Hello everyone,

I have some questions about support for the GPUs that various vendors provide.

Question 1:
Which vendors' GPUs are supported by Cluster Autoscaler?

GPUs are provided by several vendors, such as NVIDIA, AMD, and Intel.
I would like to know whether Cluster Autoscaler correctly recognizes these GPUs and autoscales nodes accordingly.

My understanding is that the GPU resource name is hardcoded to "nvidia.com/gpu", as shown below.
Therefore, the --gpu-total option only works when using NVIDIA GPUs.
However, autoscaling itself may still work if the Kubernetes scheduler plugins in PredicateChecker correctly simulate the packing of pods that request non-NVIDIA GPUs.

Could you please tell me how Cluster Autoscaler behaves when we use GPUs provided by various vendors?

// From cluster-autoscaler's GpuCustomResourcesProcessor; note the hardcoded
// gpu.ResourceNvidiaGPU ("nvidia.com/gpu") used to read node allocatable.
func (p *GpuCustomResourcesProcessor) GetNodeGpuTarget(GPULabel string, node *apiv1.Node, nodeGroup cloudprovider.NodeGroup) (CustomResourceTarget, errors.AutoscalerError) {
	gpuLabel, found := node.Labels[GPULabel]
	if !found {
		return CustomResourceTarget{}, nil
	}
	gpuAllocatable, found := node.Status.Allocatable[gpu.ResourceNvidiaGPU]

Question 2:
Can the following annotations be used for scale-from-zero in cluster-api with a GPU type other than NVIDIA?

capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
capacity.cluster-autoscaler.kubernetes.io/gpu-count: "2"
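
For concreteness, the non-NVIDIA variant I have in mind would look something like the following (a sketch only; whether the cluster-api provider accepts a resource name such as amd.com/gpu here is exactly what I am asking):

# Hypothetical scale-from-zero annotations for an AMD GPU node group
capacity.cluster-autoscaler.kubernetes.io/gpu-type: "amd.com/gpu"
capacity.cluster-autoscaler.kubernetes.io/gpu-count: "2"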

Question 3:
Does Cluster Autoscaler correctly scale in nodes with non-NVIDIA GPUs?

Cluster Autoscaler scales in GPU nodes when GPU usage falls below a threshold, based on observing the scheduled pods.
I would like to know whether this judgement is made correctly when using AMD/Intel GPUs.

Question 4:
Does Cluster Autoscaler rely on device plugins for handling nodes with GPUs?

Recently, Dynamic Resource Allocation (DRA) has been introduced for GPU management.
Is my understanding correct that Cluster Autoscaler support for DRA is still in progress and does not work yet?

@adrianmoisey
Member

/kind cluster-autoscaler

@k8s-ci-robot
Contributor

@adrianmoisey: The label(s) kind/cluster-autoscaler cannot be applied, because the repository doesn't have them.

In response to this:

/kind cluster-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@adrianmoisey
Member

/area cluster-autoscaler

@MaciekPytel
Contributor

Which vendors' GPUs are supported by Cluster Autoscaler?

GPUs are provided by several vendors, such as NVIDIA, AMD, and Intel.
I would like to know whether Cluster Autoscaler correctly recognizes these GPUs and autoscales nodes accordingly.

Currently (pre-DRA) node resources are just a map[string]quantity. Neither the scheduler nor CA really understands what a particular resource is; it's literally just comparing matching keys in two dictionaries (pod requests and node allocatable). So, in principle, CA works with any resource.
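
As a concrete illustration of that key matching (a sketch only; amd.com/gpu is just an example extended-resource name, typically exposed by that vendor's device plugin, and the image is hypothetical), a pod requesting such a resource is matched purely by key against node allocatable:

# Illustrative pod: the scheduler (and CA's simulation) only checks that some node's
# allocatable map contains the key "amd.com/gpu" with enough quantity left.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-example
spec:
  containers:
    - name: app
      image: example.com/gpu-app:latest   # hypothetical image
      resources:
        limits:
          amd.com/gpu: 1                  # extended resources are requested via limits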

There are two tricky parts though:

  • If you have a node in a nodeGroup, CA will copy it and assume every new node will be identical. When scaling from 0, CA needs to know what a new node will look like, including the allocatable of any resource, such as the GPUs you mention. How you pass this in differs per provider (e.g. in AWS you can specify it by setting a tag on the ASG); see the sketch after this list.
  • Resources managed by device plugins generally only show up in node allocatable after a daemonset installs the drivers. This means there is a window when a new node is already Ready as far as Kubernetes status conditions go, but it doesn't advertise a GPU yet. From CA's perspective that node has no GPU and the pods requesting a GPU are still pending, so CA will trigger another scale-up, not understanding that the first node is still initializing.
    • This can be solved by creating the node with a startup taint (see our README) and removing the taint once the GPU is visible in allocatable; see the sketch after this list.
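
A rough sketch of both pieces, under stated assumptions: the ASG tag below follows the AWS provider's node-template convention as I understand it, the taint key and node name are purely illustrative, and amd.com/gpu again stands in for any non-NVIDIA resource name; check the provider docs and the README for the exact spelling.

# (1) Scale-from-0 hint as an AWS ASG tag (assumed tag format from the AWS provider docs);
#     it tells CA what allocatable a brand-new node in this group will advertise:
#     k8s.io/cluster-autoscaler/node-template/resources/amd.com/gpu = "2"

# (2) Illustrative startup taint, registered at node creation (e.g. via kubelet
#     --register-with-taints) and removed once the GPU shows up in allocatable,
#     so CA treats the node as "still initializing" instead of "has no GPU":
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-example
spec:
  taints:
    - key: startup-taint.cluster-autoscaler.kubernetes.io/gpu-not-ready   # assumed startup-taint prefix; see README
      value: "true"
      effect: NoSchedule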

The processor code you linked generally aims at solving the problems above specifically for nvidia.com/gpu, so NVIDIA GPUs generally work out of the box with no extra setup required. Other GPUs should work fine; they just need the bit of extra setup described above.

Therefore, the --gpu-total option only works when using NVIDIA GPUs.

Cluster Autoscaler scales in GPU nodes when GPU usage falls below a threshold, based on observing the scheduled pods.
I would like to know whether this judgement is made correctly when using AMD/Intel GPUs.

Only a resource (in the sense of a key in node allocatable) called "nvidia.com/gpu" is recognized as a GPU for the purposes of resource limits and scale-down thresholds. Whether that key actually represents an NVIDIA GPU is irrelevant to Cluster Autoscaler.
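
As an illustration only (the <gpu_type>:<min>:<max> format is as I recall it from the FAQ, and the numbers are arbitrary), the cluster-wide GPU limit keys off that exact resource name:

# Fragment of a hypothetical Cluster Autoscaler container spec; the limit below only
# takes effect for nodes whose allocatable contains the key "nvidia.com/gpu".
command:
  - ./cluster-autoscaler
  - --gpu-total=nvidia.com/gpu:0:16   # assumed format <gpu_type>:<min>:<max>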

"Extended resources" (ie. any key in allocatable map that is not cpu, memory, nvidia gpu IIRC don't go through utilization threshold check. However, CA will only scale-down if all the pods running on node to be removed will be able to schedule on other nodes in the cluster. That check will take into account all scheduling constraints, including any extended resources.

Does Cluster Autoscaler rely on device plugins for handling nodes with GPUs?

I'm not sure I understand the question. Autoscaling is based on scheduling simulation, which takes into account all key/value pairs in pod resource requests and node allocatable. A device plugin is generally what sets the relevant allocatable value on the node object, so in that narrow sense CA is based on device plugins.
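
For instance, once a vendor's device plugin has registered, a node's status might look roughly like this (values are made up; the resource key comes from the plugin, not from CA), and it is exactly that allocatable key the simulation reads:

# Illustrative node status fragment after the device plugin registers the extended resource;
# CA and the scheduler only care about the key/quantity pairs here.
status:
  allocatable:
    cpu: "16"
    memory: 64Gi
    amd.com/gpu: "2"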

Recently, Dynamic Resource Allocation (DRA) has been introduced for GPU management.
Is my understanding correct that Cluster Autoscaler support for DRA is still in progress and does not work yet?

You're correct.
