
Basic node health-monitoring from the BMC #170

Open
srcshelton opened this issue Jan 11, 2024 · 1 comment


@srcshelton

Is your feature request related to a problem? Please describe.

As an end-user of a TP2 system,
I would like basic health-monitoring to be performed automatically by the BMC and reported via the Web UI and any future alerting system (see #153)
so that I can be made aware if a node has experienced a problem.

Describe the solution you'd like

I'm not sure what access the BMC has to the built-in switch, but realising that most monitoring of any level of sophistication will likely require an agent running on each node's host OS, I'm wondering what we can do at a basic level without any further access to the node.

For example:

  • Optionally look for DHCP broadcast packets from each node, and report if none are sent within a configurable time period after the node is powered on;
  • Optionally look for packets with each node as a source or destination, and report if these suddenly stop for more than a given timeout (or fall below a configurable minimum number for a period);
  • Actively ping a node with ICMP echo packets and ensure it responds;
  • Optionally allow the services which are intended to run on each node to be specified, and ensure that the ports for these services can be successfully opened (or run something akin to nmap against each node and report changes);
  • etc. - what other passive monitoring approaches can we think of?
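To make a couple of these ideas concrete, here is a rough Python sketch of the port check and the "traffic suddenly stopped" timeout. It is illustrative only: `check_ports` and `SilenceDetector` are hypothetical names, nothing here reflects existing BMC code, and it assumes the BMC can reach each node over TCP and that something else (e.g. a packet capture) feeds `record()` whenever traffic from a node is seen.

```python
import socket
import time


def check_ports(host, ports, timeout=1.0):
    """Return {port: bool} saying whether a TCP connection could be opened."""
    results = {}
    for port in ports:
        try:
            # create_connection completes the TCP handshake; the context
            # manager closes the socket again straight away.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            results[port] = False
    return results


class SilenceDetector:
    """Flags nodes from which no traffic has been seen for `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> timestamp of last observed packet

    def record(self, node, now=None):
        # `now` can be injected for testing; defaults to a monotonic clock.
        self.last_seen[node] = time.monotonic() if now is None else now

    def silent_nodes(self, now=None):
        now = time.monotonic() if now is None else now
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Something like `check_ports("node1.local", [22, 6443])` (hostname and ports are placeholders) would return a per-port result that the Web UI could display, and `silent_nodes()` could be polled periodically to raise an alert.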

Describe alternatives you've considered

Having a standard/default TP2 monitoring agent which specifically reports to a service running on the BMC would be great too, but that's a separate enhancement request ;)

Additional context

(We might get much of this by including monit or similar in the BMC default packages, but there would then need to be a way to plug this into the BMC Web UI.)

@srcshelton
Author

If a node is detected as failing, there could also be an option to automatically power-cycle it… although there should also be a configurable limit on retries, so that a node can't get stuck in a reboot loop.
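As a rough illustration of the retry limit, a sketch along these lines might work. `PowerCyclePolicy` and `max_retries` are hypothetical names, and the actual power-cycle call is left to the caller since I don't know what interface the BMC would expose for it.

```python
class PowerCyclePolicy:
    """Decides whether a failing node may be power-cycled again."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.attempts = {}  # node name -> consecutive power-cycles so far

    def on_health_result(self, node, healthy):
        """Return True if the caller should power-cycle `node` now."""
        if healthy:
            # A healthy check resets the counter for that node.
            self.attempts.pop(node, None)
            return False
        used = self.attempts.get(node, 0)
        if used >= self.max_retries:
            # Limit reached: stop cycling and leave the node for the operator.
            return False
        self.attempts[node] = used + 1
        return True
```

With `max_retries=2`, a node that keeps failing would be power-cycled twice and then left alone until a health check succeeds, which avoids the reboot loop.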
