Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(inputs.nvidia-smi): Add probe_on_startup option #15916

Conversation

LandonTClipp
Copy link
Contributor

Summary

There are some cases where the nvidia-smi plugin might be found in PATH and executable, but upon running it might always return a non-zero exit code. For various reasons, in the environment I work in, this might be expected. It's thus disruptive for system logs to be polluted with infinite error messages. It's preferable in this situation to check if nvidia-smi returns a good result on plugin startup, and if not, allow the error to be bubbled up and handled according to startup_error_behavior.

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #15915

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Sep 19, 2024
@LandonTClipp LandonTClipp force-pushed the LandonTClipp/nvidia_smi_test_on_startup branch 2 times, most recently from 4f5b6bf to 1eec743 Compare September 19, 2024 20:02
@srebhan
Copy link
Member

srebhan commented Sep 20, 2024

Why not returning a startup error without adding a new option?

@LandonTClipp
Copy link
Contributor Author

LandonTClipp commented Sep 20, 2024

Why not returning a startup error without adding a new option?

Backwards compatibility reasons, I'm not sure if someone's system might strangely rely on this old behavior. I could envision a situation where nvidia-smi fails when called but if someone did not set startup_error_behavior = "ignore" then this would suddenly make their whole service shutdown (I think). There is also a potential issue with the plugin startup getting hung if nvidia-smi itself hangs. I'm just being conservative by retaining the old behavior.

What do you think?

@srebhan
Copy link
Member

srebhan commented Sep 20, 2024

The question is, would an error in executing the command be repairable at runtime? If that's not the case, I'd say go without the option as people still do have a way to fix the situation using another startup-error-behavior. Otherwise, we should add this option...

@srebhan srebhan self-assigned this Sep 20, 2024
@LandonTClipp
Copy link
Contributor Author

The question is, would an error in executing the command be repairable at runtime?

I think it depends on the issue at hand. In some cases it could be reparable, like if hot-rebooting a GPU fixes device memory corruption. In other situations like mine, it's not.

Personally, I think there is no risk at all to adding this option to at least give people the choice. Not adding the option has some risk, so why not do the zero risk option?

@LandonTClipp
Copy link
Contributor Author

@srebhan any thoughts on the above?

PS thanks for the responsiveness thus far 🙏🏻

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LandonTClipp, adding options that are not necessary complicates things, making it more difficult for the user to find out what to do and sometimes even cause ambiguities. Anyway, I think you do have a point in adding the option so let's go with this.

plugins/inputs/nvidia_smi/sample.conf Outdated Show resolved Hide resolved
plugins/inputs/nvidia_smi/nvidia_smi.go Outdated Show resolved Hide resolved
There are some cases where the nvidia-smi plugin might be found in PATH and executable, but upon running it might always return a non-zero exit code. For various reasons, in the environment I work in, this might be expected. It's thus disruptive for system logs to be polluted with infinite error messages. It's preferable in this situation to check if nvidia-smi returns a good result on plugin startup, and if not, allow the error to be bubbled up and handled according to startup_error_behavior.
@LandonTClipp LandonTClipp force-pushed the LandonTClipp/nvidia_smi_test_on_startup branch from 4ad28d1 to cd27c4a Compare October 2, 2024 15:07
@LandonTClipp
Copy link
Contributor Author

@srebhan all comments addressed.

@LandonTClipp LandonTClipp changed the title feat(inputs.nvidia-smi): Add test_on_startup option feat(inputs.nvidia-smi): Add probe_on_startup option Oct 2, 2024
@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Oct 2, 2024

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @LandonTClipp! Thanks!

@srebhan srebhan added area/nvidia ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. labels Oct 2, 2024
@srebhan srebhan assigned DStrand1 and unassigned srebhan Oct 2, 2024
@DStrand1 DStrand1 merged commit b6e59aa into influxdata:master Oct 3, 2024
26 of 27 checks passed
@github-actions github-actions bot added this to the v1.33.0 milestone Oct 3, 2024
@LandonTClipp LandonTClipp deleted the LandonTClipp/nvidia_smi_test_on_startup branch October 4, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/nvidia feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inputs.nvidia-smi: Add config option to test a single run of nvidia-smi on plugin startup
3 participants