Basic node health-monitoring from the BMC #170

srcshelton · 2024-01-11T14:06:15Z

Is your feature request related to a problem? Please describe.

As an end-user of a TP2 system,
I would like basic health-monitoring to be automatically performed by the BMC and reported via the Web UI and any future alerting system (see #153)
so that I can be made aware if a node has experienced a problem.

Describe the solution you'd like

I'm not sure what access the BMC has to the built-in switch, but realising that most monitoring of any level of sophistication will likely require an agent running on each node's host OS, I'm wondering what we can do at a basic level without any further access to the node.

For example:

Optionally look for DHCP broadcast packets from each node, and report if none are sent within a configurable time period after the node is powered on;
Optionally look for packets with each node as a source or destination, and report if these suddenly stop for more than a given timeout (or fall below a configurable minimum number for a period);
Actively ping a node with ICMP echo packets and ensure it responds;
Optionally allow the services which are intended to run on each node to be specified, and ensure that the ports for these services can be successfully opened (or run something akin to nmap against each node and report changes);
etc. - what other passive monitoring approaches can we think of?
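
As a rough illustration of two of the checks above (the traffic-silence timeout and the service port check), here is a minimal sketch. Python is used purely for illustration and is not a claim about how the BMC firmware is written. `port_is_open` and `SilenceWatch` are hypothetical names, and how the BMC would actually observe packets from the switch is left open, since that access is exactly what is unknown.

```python
import socket
from dataclasses import dataclass
from typing import Optional


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


@dataclass
class SilenceWatch:
    """Hypothetical helper: records when traffic was last seen for one node."""
    timeout_s: float
    last_seen: Optional[float] = None

    def packet_seen(self, now: float) -> None:
        # Called by whatever mechanism observes a packet from or to the node.
        self.last_seen = now

    def is_silent(self, now: float) -> bool:
        # A node that has never been seen counts as silent.
        return self.last_seen is None or (now - self.last_seen) > self.timeout_s
```

A periodic task could call `port_is_open` for each service configured for a node and `is_silent(time.monotonic())` for each watch, then pass any failures to the Web UI or alerting layer.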

Describe alternatives you've considered

Having a standard/default TP2 monitoring agent which specifically reports to a service running on the BMC would be great too, but that's a separate enhancement request ;)

Additional context

(We might get much of this by including monit or similar in the BMC default packages, but there would then need to be a way to plug this into the BMC Web UI)


srcshelton · 2024-01-11T14:09:14Z

If a node is detected as failing, there could also be an option to automatically power-cycle it… although there should also be a configurable limit on retries, so that a node can't get stuck in a reboot loop.
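
The retry limit could be sketched roughly as follows. `PowerCyclePolicy`, its field names and the returned action strings are hypothetical. The power-cycle itself is left to the caller, so nothing here assumes a particular BMC API.

```python
from dataclasses import dataclass


@dataclass
class PowerCyclePolicy:
    """Hypothetical policy: power-cycle a failing node at most `max_retries` times in a row."""
    max_retries: int = 3
    attempts: int = 0

    def on_check(self, healthy: bool) -> str:
        """Return the action to take after one health check."""
        if healthy:
            # A successful check resets the counter, so later failures get fresh retries.
            self.attempts = 0
            return "ok"
        if self.attempts < self.max_retries:
            self.attempts += 1
            return "power_cycle"
        # Limit reached: stop cycling and leave the node for the operator to inspect.
        return "give_up"
```

With `max_retries=2`, three consecutive failed checks yield `power_cycle`, `power_cycle`, `give_up`, which keeps a broken node from looping through reboots indefinitely.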

srcshelton mentioned this issue Jan 11, 2024

Add warning/confirmation before powering off a node #169

Closed
