Is your feature request related to a problem? Please describe.
As an end-user of a TP2 system, I would like basic health-monitoring to be automatically performed by the BMC and reported via the Web UI and any future alerting system (see #153) so that I can be made aware if a node has experienced a problem.
Describe the solution you'd like
I'm not sure what access the BMC has to the built-in switch, but realising that most monitoring of any level of sophistication will likely require an agent running on each node's host OS, I'm wondering what we can do at a basic level without any further access to the node.
For example:
Optionally look for DHCP broadcast packets for each node, and report if none are sent within a configurable time period after the node is powered-on;
Optionally look for packets with each node as a source or destination, and report if these suddenly stop for more than a given timeout (or fall below a configurable minimum number for a period);
Actively ping a node with ICMP echo packets and ensure it responds;
Optionally allow the services which are intended to run on each node to be specified, and ensure that the ports for these services can be successfully opened (or run something akin to nmap against each node and report changes);
etc. - what other passive monitoring approaches can we think of?
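To make two of these ideas concrete, here is a rough Python sketch, not a proposal for the actual BMC implementation. It assumes the BMC can open TCP connections to each node and that some other component (not shown) can tell it when a packet from a node has been observed. The names `port_is_open`, `check_services` and `SilenceWatchdog` are hypothetical and chosen only for illustration.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host, ports, timeout=1.0):
    """Map each expected service port to whether it currently accepts connections."""
    return {port: port_is_open(host, port, timeout) for port in ports}


class SilenceWatchdog:
    """Flags nodes whose traffic has not been seen for more than `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen = {}

    def packet_seen(self, node):
        """Call whenever a packet to or from `node` is observed (hypothetical hook)."""
        self.last_seen[node] = self.clock()

    def silent_nodes(self):
        """Return the nodes that have been quiet for longer than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

The port check only shows that something is listening, not that the service behind it is healthy, so it is a coarse signal at best. How the BMC would actually observe per-node traffic to feed `packet_seen` depends on what access it has to the built-in switch, which is the open question above.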
Describe alternatives you've considered
Having a standard/default TP2 monitoring agent which specifically reports to a service running on the BMC would be great too, but that's a separate enhancement request ;)
Additional context
(We might get much of this by including monit or similar in the BMC default packages, but there'd then want to be a way to plug this into the BMC Web UI)
If a node is detected as failing, there could also be an option to automatically power-cycle it… although there should also be a configurable limit on retries, so that a node can't get stuck in a reboot loop.
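The retry limit could look something like the following sketch. It is only an illustration of the bookkeeping: `PowerCyclePolicy` and its methods are hypothetical names, and the actual call that power-cycles a node from the BMC is deliberately left out.

```python
class PowerCyclePolicy:
    """Caps automatic power-cycles per node so a failing node cannot loop forever."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.attempts = {}  # node -> power-cycles used since last recovery

    def should_power_cycle(self, node):
        """Return True (and count the attempt) if another power-cycle is allowed."""
        used = self.attempts.get(node, 0)
        if used >= self.max_retries:
            return False  # limit reached: leave the node alone and just report it
        self.attempts[node] = used + 1
        return True

    def node_recovered(self, node):
        """Reset the counter once the node passes its health checks again."""
        self.attempts.pop(node, None)
```

With `max_retries=2`, a node that keeps failing would be power-cycled twice and then left for the user to investigate. Resetting the counter on recovery is one possible design choice; a time-based window would be another.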