Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding option to force reboot, ignoring active allerts #21

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 10 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,15 +61,16 @@ The following arguments can be passed to kured via the daemonset pod template:

```
Flags:
--alert-filter-regexp value alert names to ignore when checking for active alerts
--ds-name string namespace containing daemonset on which to place lock (default "kube-system")
--ds-namespace string name of daemonset on which to place lock (default "kured")
--lock-annotation string annotation in which to record locking node (default "weave.works/kured-node-lock")
--period duration reboot check period (default 1h0m0s)
--prometheus-url string Prometheus instance to probe for active alerts
--reboot-sentinel string path to file whose existence signals need to reboot (default "/var/run/reboot-required")
--slack-hook-url string slack hook URL for reboot notfications
--slack-username string slack username for reboot notfications (default "kured")
--alert-filter-regexp value alert names to ignore when checking for active alerts
--ds-name string namespace containing daemonset on which to place lock (default "kube-system")
--ds-namespace string name of daemonset on which to place lock (default "kured")
--lock-annotation string annotation in which to record locking node (default "weave.works/kured-node-lock")
--period duration reboot check period (default 1h0m0s)
--prometheus-url string Prometheus instance to probe for active alerts
--reboot-sentinel string path to file whose existence signals need to reboot (default "/var/run/reboot-required")
--force-reboot-sentinel string path to file whose existence signals need to force reboot aka. ignore active prometheus alerts (default "/var/run/force-reboot-required")
--slack-hook-url string slack hook URL for reboot notfications
--slack-username string slack username for reboot notfications (default "kured")
```

### Reboot Sentinel File & Period
Expand Down
40 changes: 29 additions & 11 deletions cmd/kured/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,16 @@ var (
version = "unreleased"

// Command line flags
period time.Duration
dsNamespace string
dsName string
lockAnnotation string
prometheusURL string
alertFilter *regexp.Regexp
rebootSentinel string
slackHookURL string
slackUsername string
period time.Duration
dsNamespace string
dsName string
lockAnnotation string
prometheusURL string
alertFilter *regexp.Regexp
rebootSentinel string
forceRebootSentinel string
slackHookURL string
slackUsername string

// Metrics
rebootRequiredGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Expand Down Expand Up @@ -68,7 +69,8 @@ func main() {
"alert names to ignore when checking for active alerts")
rootCmd.PersistentFlags().StringVar(&rebootSentinel, "reboot-sentinel", "/var/run/reboot-required",
"path to file whose existence signals need to reboot")

rootCmd.PersistentFlags().StringVar(&forceRebootSentinel, "force-reboot-sentinel", "/var/run/force-reboot-required",
"path to file whose existence signals need to force reboot")
rootCmd.PersistentFlags().StringVar(&slackHookURL, "slack-hook-url", "",
"slack hook URL for reboot notfications")
rootCmd.PersistentFlags().StringVar(&slackUsername, "slack-username", "kured",
Expand Down Expand Up @@ -108,9 +110,21 @@ func sentinelExists() bool {
return false // unreachable; prevents compilation error
}
}
func forceRebootsentinelExists() bool {

This comment was marked as abuse.

This comment was marked as abuse.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awh What is your take on this , how does this fit the overall roadmap? do you eventually have other suggestions?

_, err := os.Stat(forceRebootSentinel)
switch {
case err == nil:
return true
case os.IsNotExist(err):
return false
default:
log.Fatalf("Unable to determine existence of force reboot sentinel: %v", err)
return false // unreachable; prevents compilation error
}
}

func rebootRequired() bool {
if sentinelExists() {
if sentinelExists() || forceRebootsentinelExists() {
log.Infof("Reboot required")
return true
} else {
Expand All @@ -120,6 +134,10 @@ func rebootRequired() bool {
}

func rebootBlocked() bool {
if forceRebootsentinelExists() {
log.Infof("Force reboot sentinel %v exists, force rebooting activated", forceRebootSentinel)
return false
}
if prometheusURL != "" {
alertNames, err := alerts.PrometheusActiveAlerts(prometheusURL, alertFilter)
if err != nil {
Expand Down