Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add validator nvidia driver whether install success #157

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lengrongfu
Copy link
Member

@lengrongfu lengrongfu commented Feb 10, 2024

Fixes: #136

Add a driver-validator Daemonset workload, in the node to valid Nvidia driver when the install is a success. if install success, add node label hami.io/driver-validator="true" else hami.io/driver-validator="false", device-plugin Daemonset add a nodeSelect hami.io/driver-validator="true".

Test Result

  • normal running
image
  • reboot node
  • reboot success
    • step 1: current not NVIDIA driver, so device-plugin pod not exist.
image
  • step 2: wait for some time, device-plugin start success
image

@lengrongfu lengrongfu force-pushed the feat/add-nvidia-driver-validator branch 8 times, most recently from 77f7b36 to f001188 Compare February 10, 2024 07:51
@archlitchi
Copy link
Collaborator

In this version, we need to manually label GPU-node(label:"gpu=on") during installation, so we can safely assume that each node with label "gpu=on" is "gpu-ready".
However, in large cluster, manually label by cluster maintainers will be pain sometimes, so an auto-label strategy is needed.
I think it's better to rename "gpu=on" to "hami.io/nvidia-selector", and use .Values.devicePlugin.validator.enabled to switch between "manual-mode" and "auto-mode"(default manual mode)
Back to this PR, i suggest:

  1. rename label "hami.io/driver-validator" -> "hami.io/nvidia-selector"
  2. modify ".values.devicePlugin.nvidianodeSelector" entry ("gpu=on" -> "hami.io/nvidia-selector")
  3. modify documents related to "gpu=on" ("gpu=on" -> "hami.io/nvidia-selector")
  4. Remove field " {{- if .Values.devicePlugin.validator.enabled }}
    hami.io/driver-validator: "true"
    {{- end }}" in daemonset.yaml, since it is included in ".values.devicePlugin.nvidianodeSelector"

@lengrongfu
Copy link
Member Author

Thanks for your suggestion, I have a different opinion on one point "and use .Values.devicePlugin.validator.enabled to switch between "manual-mode" and "auto-mode"(default manual mode)",

we can add .Values.devicePlugin.autoValidatorDriver this field to imple "manual-mode" and "auto-mode" , when value is true expression "auto-mode" else value is false expression "manual-mode".

@archlitchi
Copy link
Collaborator

Thanks for your suggestion, I have a different opinion on one point "and use .Values.devicePlugin.validator.enabled to switch between "manual-mode" and "auto-mode"(default manual mode)",

we can add .Values.devicePlugin.autoValidatorDriver this field to imple "manual-mode" and "auto-mode" , when value is true expression "auto-mode" else value is false expression "manual-mode".

agreed:), that looks better

@lengrongfu lengrongfu force-pushed the feat/add-nvidia-driver-validator branch 2 times, most recently from e4b52aa to 84f807b Compare February 18, 2024 07:14
@archlitchi
Copy link
Collaborator

please modify "Label your nodes" section in README.md and README_CN.md, introduce your manual and auto label strategy here

@lengrongfu lengrongfu force-pushed the feat/add-nvidia-driver-validator branch from 84f807b to effd591 Compare February 19, 2024 08:31
@chaunceyjiang
Copy link
Contributor

It's just a question, I think this PR only introduces a small feature, but it requires a new Daemonset to implement this feature.

@CoderTH
Copy link
Contributor

CoderTH commented Feb 21, 2024

Can you add some test screenshots?

@lengrongfu
Copy link
Member Author

Can you add some test screenshots?

Added part of the testing info.

@chaunceyjiang
Copy link
Contributor

Additionally, can we start introducing keywords like HAMi and gradually phase out 4pd?

@lengrongfu lengrongfu force-pushed the feat/add-nvidia-driver-validator branch from effd591 to 637f675 Compare February 21, 2024 03:18
@lengrongfu lengrongfu force-pushed the feat/add-nvidia-driver-validator branch from 637f675 to ca232fa Compare February 21, 2024 03:30
@lengrongfu
Copy link
Member Author

Additionally, can we start introducing keywords like HAMi and gradually phase out 4pd?

They need to be modified together, and there are still many.

@chaunceyjiang
Copy link
Contributor

I still think this PR is overly complicated. Please give me some time, I will research whether there are other implementation solutions.

@lengrongfu
Copy link
Member Author

I still think this PR is overly complicated. Please give me some time, I will research whether there are other implementation solutions.

@chaunceyjiang Hi, any progress on this? Or we can merge PR first and then optimize later.

@chaunceyjiang
Copy link
Contributor

/hold

In the next few days, I will continue to explore this issue.

@chaunceyjiang
Copy link
Contributor

image

I believe we can use this initContainer .

@lengrongfu lengrongfu marked this pull request as draft April 15, 2024 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

if use gpu-operator install nvidia driver, when node restart, driver-plugin start CrashLoopBackOff
4 participants