healthcheck: fail on supervisorctl errors #317

Open
wants to merge 1 commit into
base: master

@remusb remusb commented Oct 17, 2018

Note: Please remember to review the Datadog Contribution Guidelines
if you have not yet done so.

What does this PR do?

Adds an extra check to probe.sh that first verifies supervisorctl status exits with 0. If status cannot run at all, the probe fails before it tries to parse the output.

Motivation

We hit a scenario where the collector failed, and I expected the health check to fail and recycle the task.

Upon checking, I found that supervisorctl encounters an exception and the egrep check does not handle this case.

Traceback (most recent call last):
  File "/opt/datadog-agent/bin/supervisorctl", line 6, in <module>
    from pkg_resources import load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 36, in <module>
  File "/opt/datadog-agent/embedded/lib/python2.7/email/parser.py", line 12, in <module>
    from email.feedparser import FeedParser
  File "/opt/datadog-agent/embedded/lib/python2.7/email/feedparser.py", line 158, in <module>
    class FeedParser:
  File "/opt/datadog-agent/embedded/lib/python2.7/email/feedparser.py", line 161, in FeedParser
    def __init__(self, _factory=message.Message):
AttributeError: 'module' object has no attribute 'Message'
root@ip-10-71-29-36:/# echo $?
0

Exit code: 0
A scheduler that relies on the probe's exit code will not catch this.
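
The masking described above can be reproduced in isolation. The following is a minimal sketch, not the actual probe.sh: `broken_status` is a hypothetical stand-in for a supervisorctl that crashes on startup, and the `grep -Ev 'RUNNING'` filter is an assumed shape for the egrep check. Because a pipeline returns the exit status of its last command, the crash on the left side goes unnoticed.

```shell
#!/bin/sh
# Hypothetical stand-in for a supervisorctl that crashes on startup:
# it writes a traceback to stderr, nothing to stdout, and exits non-zero.
broken_status() {
    echo "Traceback (most recent call last): ..." >&2
    return 1
}

# Assumed probe logic (not the real probe.sh): report unhealthy only
# if some line of the status output is not RUNNING.
naive_probe() {
    # The pipeline's exit status is grep's, so the failure of
    # broken_status is discarded. With no stdout to read, grep selects
    # nothing and exits 1, so the "unhealthy" branch is never taken.
    if broken_status 2>/dev/null | grep -Ev 'RUNNING' >/dev/null; then
        return 1   # found a process that is not RUNNING
    fi
    return 0       # no offending line seen, so reported as healthy
}

naive_probe
echo "naive probe exit code: $?"   # prints: naive probe exit code: 0
```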

After adding the check for the supervisorctl exit code:

root@ip-10-71-29-36:/# /probe.sh
Traceback (most recent call last):
  File "/opt/datadog-agent/bin/supervisorctl", line 6, in <module>
    from pkg_resources import load_entry_point
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 36, in <module>
  File "/opt/datadog-agent/embedded/lib/python2.7/email/parser.py", line 12, in <module>
    from email.feedparser import FeedParser
  File "/opt/datadog-agent/embedded/lib/python2.7/email/feedparser.py", line 158, in <module>
    class FeedParser:
  File "/opt/datadog-agent/embedded/lib/python2.7/email/feedparser.py", line 161, in FeedParser
    def __init__(self, _factory=message.Message):
AttributeError: 'module' object has no attribute 'Message'
root@ip-10-71-29-36:/# echo $?
1

Exit code: 1

Running supervisorctl status first and requiring it to exit with 0 solves this. Any exception, or any run that cannot even list the status, should mark the container as failed.
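
As a rough illustration of that ordering, here is a sketch under stated assumptions rather than the change in this PR: the `probe` function, the `SUPERVISORCTL` override variable, and the `RUNNING` pattern are all hypothetical names chosen for the example. The only point it shows is checking the exit code of the status command before looking at its output.

```shell
#!/bin/sh
# SUPERVISORCTL is a hypothetical override used for illustration;
# it defaults to the real command name.
SUPERVISORCTL="${SUPERVISORCTL:-supervisorctl}"

probe() {
    # Step 1: capture the output and fail fast if the command itself
    # fails (for example, when it dies with a Python traceback).
    status_output=$($SUPERVISORCTL status) || return 1

    # Step 2: only now inspect the contents. The pattern is an
    # assumption: any line that is not RUNNING marks the container as
    # unhealthy. An empty listing also counts as unhealthy here.
    if printf '%s\n' "$status_output" | grep -Ev 'RUNNING' >/dev/null; then
        return 1
    fi
    return 0
}
```

Calling `probe` as the last command of the script makes its return value the exit code that the scheduler sees.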

Testing Guidelines

N/A. Happy to be guided and add something if the probe is covered anywhere as a test scenario.

Additional Notes

This may have implications for issue #314.
In our environment, the probe completes in under 1s even with the extra check. Naturally, this depends on the resources allocated to the container.
