Question: Supportability of GPU of NVIDIA, AMD and Intel #7374
Comments
/kind cluster-autoscaler
@adrianmoisey: The label(s) `kind/cluster-autoscaler` cannot be applied, because the repository doesn't have them. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/area cluster-autoscaler
Currently (pre-DRA) node resources are just a map[string]quantity. Neither the scheduler nor CA really understands what a particular resource is; it's literally just comparing matching keys in two dictionaries (pod requests and node allocatable). So, in principle, CA works with any resource. There are two tricky parts though:
The processor code you linked generally aims at solving the problems above specifically for nvidia.com/gpu, so those generally work out of the box with no extra setup required. Other GPUs should work fine; they just need a bit of extra setup as described above.
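To illustrate the "matching keys in two dictionaries" point, here is a minimal sketch (not Cluster Autoscaler code; the resource names, including amd.com/gpu, are just example keys):

```go
package main

import "fmt"

// fits reports whether every resource a pod requests is available on a node.
// The keys are opaque strings ("cpu", "memory", "amd.com/gpu", ...); nothing
// here knows or cares what kind of hardware a key actually refers to.
func fits(podRequests, nodeAllocatable map[string]int64) bool {
	for name, requested := range podRequests {
		if nodeAllocatable[name] < requested {
			return false
		}
	}
	return true
}

func main() {
	podRequests := map[string]int64{"cpu": 2, "amd.com/gpu": 1}     // hypothetical pod
	nodeAllocatable := map[string]int64{"cpu": 8, "amd.com/gpu": 2} // hypothetical node
	fmt.Println(fits(podRequests, nodeAllocatable)) // true
}
```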
Only a resource (in the sense of a key in node allocatable) called "nvidia.com/gpu" is recognized as a GPU for the purposes of resource limits and scale-down thresholds. Whether that key actually represents an Nvidia GPU is irrelevant to Cluster Autoscaler. "Extended resources" (i.e. any key in the allocatable map that is not cpu, memory, or nvidia.com/gpu) IIRC don't go through the utilization threshold check. However, CA will only scale down a node if all the pods running on it will be able to schedule on other nodes in the cluster, and that check takes all scheduling constraints into account, including any extended resources.
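A rough sketch of that distinction, under the assumption stated above that only cpu, memory, and nvidia.com/gpu feed the utilization threshold (illustrative only, not the real scale-down code):

```go
package main

import "fmt"

// nodeUnderutilized returns true when requested/allocatable stays below the
// threshold for every resource this check knows about. Extended resources
// such as amd.com/gpu are deliberately absent from the loop; they only matter
// later, when checking whether the evicted pods can reschedule elsewhere.
func nodeUnderutilized(requested, allocatable map[string]float64, threshold float64) bool {
	for _, name := range []string{"cpu", "memory", "nvidia.com/gpu"} {
		alloc, ok := allocatable[name]
		if !ok || alloc == 0 {
			continue // resource not present on this node
		}
		if requested[name]/alloc >= threshold {
			return false
		}
	}
	return true
}

func main() {
	requested := map[string]float64{"cpu": 1, "amd.com/gpu": 1}
	allocatable := map[string]float64{"cpu": 8, "amd.com/gpu": 1}
	// The fully used amd.com/gpu does not keep the node above the threshold.
	fmt.Println(nodeUnderutilized(requested, allocatable, 0.5)) // true
}
```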
I'm not sure I understand the question. Autoscaling is based on scheduling simulation, which takes into account all key/value pairs in pod resource requests and node allocatable. The device plugin is generally what sets the relevant allocatable value on the node object, so in that narrow sense CA is based on the device plugin.
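As a small illustration of that narrow dependency, a sketch of reading back the allocatable value a device plugin advertises (assuming the k8s.io/api types; "amd.com/gpu" is only an example resource name):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// allocatableCount returns how many units of an extended resource a node
// advertises. The device plugin is what populates Status.Allocatable with
// this key; CA and the scheduler simply read it back.
func allocatableCount(node *corev1.Node, resourceName string) int64 {
	qty, ok := node.Status.Allocatable[corev1.ResourceName(resourceName)]
	if !ok {
		return 0
	}
	return qty.Value()
}
```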
You're correct.
Hello everyone,
I have some questions about the supportability of the GPUs that various vendors provide.
Question 1:
Which vendors' GPUs are supported in Cluster Autoscaler?
GPUs are provided by several vendors, such as NVIDIA, AMD, and Intel.
I would like to know whether Cluster Autoscaler correctly recognizes these GPUs and autoscales nodes.
My understanding is that the GPU resource name is hardcoded to "nvidia.com/gpu", as shown in the code referenced below.
Therefore, the --gpu-total option only works when using NVIDIA GPUs.
However, autoscaling itself may work if the k8s scheduler plugins in PredicateChecker correctly simulate the packing of pods requesting non-NVIDIA GPUs.
Could you please tell me how Cluster Autoscaler works when we use GPUs provided by various vendors?
autoscaler/cluster-autoscaler/processors/customresources/gpu_processor.go
Lines 78 to 84 in e193af0
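For readers without the link handy, the hardcoding I mean boils down to something like the sketch below. This is a paraphrase of my understanding, not the actual lines from gpu_processor.go:

```go
package example

// Paraphrase only, not the real gpu_processor.go code: the GPU-specific
// processing keys off a single hardcoded resource name.
const resourceNvidiaGPU = "nvidia.com/gpu"

func isGPUResource(resourceName string) bool {
	return resourceName == resourceNvidiaGPU
}
```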
Question 2:
Can the following annotation be used to scale from zero nodes in cluster-api with a gpu-type other than NVIDIA?
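For context, this is the kind of annotation I mean, sketched here with the capacity.cluster-autoscaler.kubernetes.io/* keys from the clusterapi provider's scale-from-zero documentation; the values, and amd.com/gpu as the gpu-type, are illustrative assumptions on my part:

```go
package example

// Illustrative only: capacity hints on a MachineDeployment/MachineSet so CA
// can build a node template while the node group sits at zero replicas.
var scaleFromZeroAnnotations = map[string]string{
	"capacity.cluster-autoscaler.kubernetes.io/cpu":       "8",
	"capacity.cluster-autoscaler.kubernetes.io/memory":    "32Gi",
	"capacity.cluster-autoscaler.kubernetes.io/gpu-count": "1",
	"capacity.cluster-autoscaler.kubernetes.io/gpu-type":  "amd.com/gpu",
}
```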
Question 3:
Does Cluster Autoscaler correctly scale in nodes with non-NVIDIA GPUs?
Cluster Autoscaler scales in GPU nodes when GPU usage, observed from the scheduled pods, is lower than a threshold.
I would like to know whether this judgement is made correctly when using AMD/Intel GPUs.
Question 4:
Is Cluster Autoscaler based on the device plugin for handling nodes with GPUs?
Recently, Dynamic Resource Allocation (DRA) has been introduced for GPU management.
Is my understanding correct that Cluster Autoscaler support for DRA is still in progress and does not work yet?