DRA in Kubevirt #331

alaypatel07 · 2024-10-07T02:01:33Z

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Design: A design document was considered and is present (link) or not required
PR: The PR description is expressive enough and will help future contributors
Code: Write code that humans can understand and Keep it simple
Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
Upgrade: Impact of this change on upgrade flows was considered and addressed if required
Testing: New code requires new unit tests. New features and bug fixes require at least on e2e test
Documentation: A user-guide update was considered and is present (link) or not required. You want a user-guide update if it's a user facing feature / API change.
Community: Announcement to kubevirt-dev was considered

Release note:

Signed-off-by: Ryan Hallisey <[email protected]>

Signed-off-by: Alay Patel <[email protected]>

kubevirt-bot · 2024-10-07T02:01:36Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Signed-off-by: Alay Patel <[email protected]>

alicefr · 2024-10-08T06:49:44Z

design-proposals/dra.md

+	// Name of the GPU device as exposed by a device plugin
+	Name string `json:"name"`
+	// DeviceName is the name of the device provisioned by device-plugins
+	DeviceName string `json:"deviceName,omitempty"`


What is the difference between the 2?

@alicefr This is not a new API being added, it is how the existing device-plugins work, see example here. The idea is that if DeviceName is populated, then its a device provisioned by device-plugins, it will be passed to pod.spec.container.requests. Alternatively the Claim field below is not nil then its a DRA device, it will be passed to pod.spec.container.claims.

We have an alternate API suggestion that will make this choice to the user much more obvious.

alicefr · 2024-10-08T06:50:57Z

design-proposals/dra.md

+	Name string `json:"name"`
+	// DeviceName is the name of the device provisioned by device-plugins
+	DeviceName string `json:"deviceName,omitempty"`


Same question here

alicefr · 2024-10-08T07:35:36Z

design-proposals/dra.md

+This design also assumes that the deviceName will be provided in the ClaimParameters, which requires the DRA drivers
+to have a ClaimParameters.spec.deviceName in their spec.


From this description you are distinguishing the name and the deviceName which also refer to my previous questions. Maybe it will be clearer to add the difference between the 2 at the beginning

alicefr · 2024-10-08T08:18:07Z

design-proposals/dra.md

+1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
+   instead of environment variables as in the case of device-plugins
+
+# Alternate Designs


I tend to prefer the first design since it follow the same schema as the volumes/disks and the network/interfaces where we have the section for the pod resource and how they are mapped to the VM

alicefr · 2024-10-08T08:19:50Z

design-proposals/dra.md

+### Virt launcher changes
+
+1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
+   instead of environment variables as in the case of device-plugins


+1 for avoiding the env variables, I like the status!

From my perspective, there is no need to store driver internal data in the public, user facing, VMI status.
This information is only required by the virt-launcher internals, no other kubevirt components benefit from this.
I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.

Same way as we don't keep the PVC status in VMI we should keep DRA specific data out of the VMI status.

From my perspective, there is no need to store driver internal data in the public, user facing, VMI status.
This information is only required by the virt-launcher internals, no other kubevirt components benefit from this.
I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.

This is in part to address an architectural difference between DPs and DRA:

DPs pre-define a static resource which is usually identified by a device model: Eg: nvidia.com/GP102GL_Tesla_P40 . A VMI spec would reference this name directly for its device and it gives easily inferrable information to the user that the VMI will run on a nvidia Tesla P40 GPU.

In DRA, this information is lost as the resource claim that we provide in our spec masks all details about what kind of device that claim is allocating. A resourceclaim is a dynamic object where you would need to look into its spec to know what piece of hardware you're getting. For the user, this means that they can no longer look into the vmi spec to immediately tell what their VMI is running on.

Given this regression in the user expectation (conscious or unconscious), we should provide atleast the bare-minimum identifiable information about a device like the hardware model etc.

Same way as we don't keep the PVC status in VMI we should keep DRA specific data out of the VMI status.

In DRA, once a resource claim is allocated, you'll need to take the following steps to identify what device is allocated:

look at the vmi spec to see the referenced claim (template or name) and request.

look at virt-launcher pod status to match the claim from vmi spec and infer the resource claim k8s object

look at the resource claim k8s object to find out the driver and device names for each request.

look at the resource slice object for the VMI's node published by the corresponding driver and know all the relevant attributes needed

A PVC needs similar hoops to find the volume information but is a lot less complex than DRA. To tackle this problem, PVC information is currently posted onto the VMI status so consumers (users & kubevirt) can easily find the info about a volume. (See https://pkg.go.dev/kubevirt.io/api/core/v1#PersistentVolumeClaimInfo)

Given that DRA ResourceClaims are modeled after PVCs and we have precedence for publishing volume information in the VMI status to make visilibity/access easier, I believe we should follow the same approach for DRA devices.

@vladikr the particular case that @varunrsekar mentioned in the comment was discussed in the meeting on 10/15 (recording) and documented in detail here: https://docs.google.com/document/d/1bQdLoxwSC1ILvyIb4ljSm5RZmsVpvqfWKIbN0KOBKsw/edit#heading=h.wjmanqk7qzd9

Additionally, we came across another use-case of why the information is needed in vmi-status:

Usecase Title: Debuggability of VMI in failed state with DRA device

K8s has a feature of releasing the resource claims once the pod has gone to a terminal state(Failed or Succeeded). This is to allow for expensive resources to be used for other pods

in case where VMI has failed and the virt-launcher has gone to a Failed state, what hardware was attempted to be provisioned for this VMI is lost

It is documented in depth here: https://docs.google.com/document/d/1bQdLoxwSC1ILvyIb4ljSm5RZmsVpvqfWKIbN0KOBKsw/edit#heading=h.3z09ita4dzuz

Please let use know if this answers the question about why extending the VMI status with this information is useful to the user

alicefr · 2024-10-08T08:24:05Z

@alaypatel07 the proposal looks great.
One question. You mention in the goal section to allow external DRA drivers. Do we want to have a mechanism to enable certain dra driver or is this unnecessary? For CSI storge classes we don't have, but for the external device plugin we do.

alicefr · 2024-10-08T08:36:52Z

design-proposals/dra.md

+### Virt launcher changes
+
+1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
+   instead of environment variables as in the case of device-plugins


How do we populate the deviceStatus inormation with an external DRA driver?

@alicefr as far as this proposal concerns, the assumption is that drivers publish information needed to generate the right domxml for the device. For instance in case of a gpu, it could either be pciAddress(pGPU), or mdev uuid (vGPU). The drivers need to publish this information in ResourceSlice as attributes https://pkg.go.dev/k8s.io/[email protected]/resource/v1alpha3#BasicDevice.

Virt-controller will find the device for it and put this attribute in the vmi status using the steps mentioned here: https://github.com/kubevirt/community/blob/30a2005d8ef72cbae0d8fb1d818861b14d8a5d88/design-proposals/dra.md#dra-api-for-reading-device-related-information

Virt-launcher will then read the attribute to generate the correct domxml

~~So there could be two cases here:~~
~~- either the device has standard attributes, like pciAddress and uuid which will be supported in tree or~~
~~- the device needs additional attributes, for which a sidecar will be needed to read that attribute from vmi status and define it in the domxml~~

As I mentioned above, please provide more usecases for the sidecar scenario. I've provided an alternative take.

As I said in one of the above comments, will be updating the proposal to remove sidecar scenario due to lack of usecases, updated my previous comment to reflect it.

iholder101 · 2024-10-08T10:12:20Z

/cc

Please add to the PR description that it's a continuation of #293.

alaypatel07 · 2024-10-08T14:06:39Z

You mention in the goal section to allow external DRA drivers. Do we want to have a mechanism to enable certain dra driver or is this unnecessary? For CSI storge classes we don't have, but for the external device plugin we do.

Answered it inline, the proposal will implement the conversion logic of standard attributes in vmi status (pciAddress and uuid) for other attributes, a extension in form of side car is needed.

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

Signed-off-by: Alay Patel <[email protected]>

vladikr · 2024-10-08T21:03:11Z

design-proposals/dra.md

+The virt-launcher will have the logic of converting a GPU device into its corresponding domxml. For use-cases that are 
+not handled in-tree, a sidecar container could be envisioned which will convert the information available in status to 
+the corresponding domxml.
+


Could you please provide a specific use case for the proposed use of a sidecar? It would also help to understand why this configuration cannot be handled directly in the virt-launcher itself.

Currently, a sidecar can't access the VMI object without changes to the virt-controller. How do you plan to differentiate between DRA-related sidecars and other sidecars?

I assume that the proposed sidecar would rely on KubeVirt's existing hooking mechanism. It's worth noting that this mechanism is provided on a best-effort basis and is not fully supported. It uses gRPC with a defined API, and rather than adding DRA-specific data to the VMI status, a new API call could be integrated into the existing gRPC used by virt-launcher. This way, the DRA-specific data could be handled only between virt-launcher and the sidecar and avoid replication of the DRA data across multiple components...

Building a solution based on the current hooking mechanism, or modifying the libvirt XML outside of virt-launcher, is not recommended. The libvirt XML is not a supported external API—only the KubeVirt API is. Please carefully consider how to maintain API compatibility between DRA, libvirt, and KubeVirt. If supporting a sidecar is necessary, we should develop a stable interface between compute and the DRA sidecar containers.

After last week’s discussion and another offline conversation with @xpivarc, its looks like currently we do not have enough use-cases for allowing users to configure these devices in sidecar. It seems like we would like encourage building the support for these devices in the core, rather than allowing them to be configured by a sidecar, I will modify this proposal to change the language to reflect this

vladikr · 2024-10-08T21:08:23Z

design-proposals/dra.md

+### Virt launcher changes
+
+1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
+   instead of environment variables as in the case of device-plugins


From my perspective, there is no need to store driver internal data in the public, user facing, VMI status.
This information is only required by the virt-launcher internals, no other kubevirt components benefit from this.
I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.

vladikr · 2024-10-08T21:09:46Z

design-proposals/dra.md

+### Virt launcher changes
+
+1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
+   instead of environment variables as in the case of device-plugins


Same way as we don't keep the PVC status in VMI we should keep DRA specific data out of the VMI status.

vladikr · 2024-10-08T21:11:16Z

design-proposals/dra.md

+### Virt launcher changes
+
+1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
+   instead of environment variables as in the case of device-plugins


As I mentioned above, please provide more usecases for the sidecar scenario. I've provided an alternative take.

vladikr · 2024-10-08T21:16:01Z

design-proposals/dra.md

+  deviceStatus:
+    gpuStatuses:
+    - deviceResourceClaimStatus:
+        deviceAttributes:
+          driverVersion:
+            version: 1.0.0
+          index:
+            int: 0
+          model:
+            string: LATEST-GPU-MODEL
+          uuid:
+            string: gpu-8e942949-f10b-d871-09b0-ee0657e28f90
+          pciAddress: 
+            string: 0000:01:00.0
+        deviceName: gpu-0
+        resourceClaimName: virt-launcher-vmi-fedora-9bjwb-gpu-resource-claim-m4k28
+      name: pgpu     


At the moment, I am not convinced that storing this information in the VMI status is needed.
What other Kubevirt components require this information?
Why CDI cannot provide this info directly to the consumer container?

Signed-off-by: Alay Patel <[email protected]>

…lternatives Signed-off-by: Alay Patel <[email protected]>

alaypatel07 · 2024-10-15T22:34:38Z

@alicefr From our discussion in the meeting on 10/15 on lifecycle of resourceclaim and VM/VMI

@varunrsekar and I looked at the nvme implementation and it seems that it was developed on an older version of the DRA apis (v1alpha2). With the newer version (k8s 1.31 and v1alpha3 of the ResourceClaim api), if the pod goes to failed state:

if the resourseclaim is being created by template, the resource claim will be deleted
if the resourceclaim is created by the user, the resource claim goes from allocated to pending state where the allocations are cleared

Here is the running demo from 1.31 cluster
1.

$ k version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.0

create the vmi with manual resource claim

$ k apply -f demo-with-manual-claim.yaml
namespace/gpu-test1 unchanged
resourceclaim.resource.k8s.io/virt-launcher-vmi-fedora created
Warning: metadata.finalizers: "foregroundDeleteVirtualMachine": prefer a domain-qualified finalizer name to avoid         accidental conflicts with other finalizer writers
virtualmachineinstance.kubevirt.io/vmi-fedora created

get the status of virt-launcher pod to check the claim it used:

$ k get pods -n gpu-test1 -oyaml| grep resourceClaimName
      resourceClaimName: virt-launcher-vmi-fedora

check the pod is in running state

$ k get pods -n gpu-test1 -oyaml | grep phase
    phase: Running

check the resource claim status:

$ k get resourceclaim -n gpu-test1 -oyaml | grep -A 5 status
  status:
    allocation:
      devices:
        config:
        - opaque:
            driver: gpu.nvidia.com

wait for pod to go to failed state

$ k get pods -n gpu-test1 -oyaml -w | grep phase
  phase: Running
  phase: Running
  phase: Failed

check the resourceclaim status:

$ k get resourceclaim -n gpu-test1 -oyaml | grep -A 5 status
  status: {}

This is further seen in the KEP: https://github.com/kubernetes/enhancements/pull/4709/files#diff-8fa237d276346416c2aafa209f721e43c9762f59cabec234eafafe01694de3eeR1221

I wonder if the use-cases that require the lifecycle of the ResourceClaim not attached to the lifecycle of the pod, be deferred to the future versions of DRA where it is possible for the allocation to persist after pod is deleted.

xpivarc · 2024-11-18T15:22:26Z

/sig compute

rthallisey · 2024-11-27T17:13:36Z

@alaypatel07 I think this can be taken out of draft/WIP

alaypatel07 · 2024-11-27T17:17:24Z

@rthallisey done, I have couple of updates to be pushed from our last unconference sync up. Will plan on pushing those soon.

alicefr · 2024-11-28T08:24:44Z

I wonder if the use-cases that require the lifecycle of the ResourceClaim not attached to the lifecycle of the pod, be deferred to the future versions of DRA where it is possible for the allocation to persist after pod is deleted.

@alaypatel07 sorry I missed the ping. I'm fine to tackle all the use case gradually. It needs just to be properly documented

update apis based on unconference discussions with doc cleanups Signed-off-by: Varun Ramachandra Sekar <[email protected]> Co-authored-by: Varun Ramachandra Sekar <[email protected]>

kubevirt-bot · 2024-12-04T14:46:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fabiand for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alicefr · 2024-12-05T08:31:26Z

design-proposals/dra.md

+        - name: pci-nvme-request-name
+          deviceClassName: pci-nvme.kubevirt.io
+          allocationMode: ExactCount
+          count: 1


@alaypatel07 why do we want to have a count for the device? Doesn't a claim represent a single device? Or do we want to model a group of devices a well?

@alicefr with the alpha3 API the count field has to be set since the default allocation mode is ExactCount, ref: https://github.com/kubernetes/api/blob/b7783abfc99c11f3b56fee4e5cf99023fcc8120a/resource/v1alpha3/types.go#L446

And it has carried over to the beta APIs too: https://github.com/kubernetes/api/blob/master/resource/v1beta1/types.go#L457 :)

RamLavi · 2024-12-15T08:00:34Z

design-proposals/dra.md

+    kubevirt.io/vm: vm-cirros
+  name: vm-cirros
+spec:
+  running: false


nit: please use the new API: runStrategy: Halted

rthallisey and others added 3 commits May 17, 2024 11:24

KubeVirt DRA design proposal

fc7b8c5

Signed-off-by: Ryan Hallisey <[email protected]>

update DRA api for additional usecases

04cd829

Signed-off-by: Alay Patel <[email protected]>

refactor dra proposal for DRA api in k8s v1.31

4370901

Signed-off-by: Alay Patel <[email protected]>

kubevirt-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Oct 7, 2024

kubevirt-bot requested review from jean-edouard and jobbler October 7, 2024 02:01

kubevirt-bot added size/L size/XL and removed size/L labels Oct 7, 2024

alaypatel07 force-pushed the dra branch 3 times, most recently from 14acb8b to bf19e3b Compare October 7, 2024 17:39

update proposal to add section for vmi.status.deviceStatus

30a2005

Signed-off-by: Alay Patel <[email protected]>

alaypatel07 force-pushed the dra branch from bf19e3b to 30a2005 Compare October 8, 2024 01:07

alicefr reviewed Oct 8, 2024

View reviewed changes

kubevirt-bot requested a review from iholder101 October 8, 2024 10:12

Varun Ramachandra Sekar and others added 2 commits October 8, 2024 10:43

add inline DeviceSource to API

fb59c1f

Signed-off-by: Varun Ramachandra Sekar <[email protected]>

Add assumption about attributes

e9b8d04

Signed-off-by: Alay Patel <[email protected]>

rthallisey mentioned this pull request Oct 8, 2024

Design-proposal: KubeVirt DRA design proposal #293

Closed

vladikr reviewed Oct 8, 2024

View reviewed changes

refactor the language to remove sidecars from conversation

1c24759

Signed-off-by: Alay Patel <[email protected]>

rthallisey mentioned this pull request Oct 14, 2024

Is there a plan to introduce CDI in kubevirt-gpu-device-plugin NVIDIA/kubevirt-gpu-device-plugin#114

Open

Modify user-stories, add device-plugins and DRA comparision, update A…

08725d7

…lternatives Signed-off-by: Alay Patel <[email protected]>

alaypatel07 force-pushed the dra branch from 1a8586a to 08725d7 Compare October 14, 2024 17:55

kubevirt-bot added the sig/compute label Nov 18, 2024

alaypatel07 changed the title ~~[DRAFT]: DRA in Kubevirt~~ DRA in Kubevirt Nov 27, 2024

alaypatel07 marked this pull request as ready for review November 27, 2024 17:16

kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2024

kubevirt-bot requested a review from dhiller November 27, 2024 17:16

Update API based on unconference discussions (#1)

349ea89

update apis based on unconference discussions with doc cleanups Signed-off-by: Varun Ramachandra Sekar <[email protected]> Co-authored-by: Varun Ramachandra Sekar <[email protected]>

alicefr reviewed Dec 5, 2024

View reviewed changes

rthallisey mentioned this pull request Dec 11, 2024

Add a proposal to remove rootful VMs feature #248

Open

RamLavi reviewed Dec 15, 2024

View reviewed changes

		This design also assumes that the deviceName will be provided in the ClaimParameters, which requires the DRA drivers
		to have a ClaimParameters.spec.deviceName in their spec.

DRA in Kubevirt #331

Are you sure you want to change the base?

DRA in Kubevirt #331

Conversation

alaypatel07 commented Oct 7, 2024

kubevirt-bot commented Oct 7, 2024

Choose a reason for hiding this comment

alaypatel07 Oct 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alicefr Oct 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alicefr commented Oct 8, 2024

Choose a reason for hiding this comment

alaypatel07 Oct 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iholder101 commented Oct 8, 2024

alaypatel07 commented Oct 8, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alaypatel07 commented Oct 15, 2024

xpivarc commented Nov 18, 2024

rthallisey commented Nov 27, 2024

alaypatel07 commented Nov 27, 2024 • edited Loading

alicefr commented Nov 28, 2024

kubevirt-bot commented Dec 4, 2024

alicefr Dec 5, 2024 • edited Loading

Choose a reason for hiding this comment

alaypatel07 Dec 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alaypatel07 Oct 8, 2024 •

edited

Loading

alicefr Oct 8, 2024 •

edited

Loading

alaypatel07 Oct 8, 2024 •

edited

Loading

alaypatel07 commented Nov 27, 2024 •

edited

Loading

alicefr Dec 5, 2024 •

edited

Loading

alaypatel07 Dec 5, 2024 •

edited

Loading