Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fdbmonitor restart-delay unclear documentation #11775

Open
johscheuer opened this issue Nov 13, 2024 · 4 comments
Open

fdbmonitor restart-delay unclear documentation #11775

johscheuer opened this issue Nov 13, 2024 · 4 comments
Labels

Comments

@johscheuer
Copy link
Contributor

johscheuer commented Nov 13, 2024

fdbmonitor has a field restart-delay. Based on the documentation: The maximum number of seconds (subject to jitter) that fdbmonitor will delay before restarting a failed process.. I would assume that failed process means a process that exits with an exit code other than 0. But the actual implementation doesn't look at the exit code: https://github.com/apple/foundationdb/blob/main/fdbmonitor/fdbmonitor.h#L395-L405 + https://github.com/apple/foundationdb/blob/main/fdbmonitor/fdbmonitor.cpp#L646. So we either have to correct the documentation or fix the behaviour in fdbmonitor.

I saw this when I looked into #11764.

@johscheuer johscheuer added the bug label Nov 13, 2024
@johscheuer
Copy link
Contributor Author

CC @spraza

@johscheuer
Copy link
Contributor Author

I don't have a strong opinion on either but I think using the exit code and only delaying processes that have an exit code different than 0 should be delayed. My understanding of this setting is to prevent a process to crash loop directly in a short amount of time.

@johscheuer
Copy link
Contributor Author

For completeness: In the fdb-kubernetes-monitor we are only delaying the restart when the process exit code was not 0: https://github.com/apple/foundationdb/blob/main/fdbkubernetesmonitor/monitor.go#L453-L459

@spraza
Copy link
Collaborator

spraza commented Nov 14, 2024

Looked at the code. You're right, currently the delay calculation doesn't account for exit code. I think it's reasonable to only have delays for failing (non zero) exited processes. Testing the fix would probably take more time than the fix itself. I can send the PR out but it may take a few days before I get to it. Happy to review the PR if one of you is interested to fix this (@johscheuer, @mpatou-openai).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants