Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async Healthcheck #301

Merged
merged 19 commits into from
Aug 23, 2019
Merged

Async Healthcheck #301

merged 19 commits into from
Aug 23, 2019

Conversation

DimitarPetrov
Copy link
Member

@DimitarPetrov DimitarPetrov commented Jun 17, 2019

Motivation

Currently SM healthcheck works synchronously (calls all dependant components synchronously every time a call to /v1/health comes). This is not optimal and introduces unnecessary load both on SM and the dependencies.

Using https://github.com/InVisionApp/go-health healthchecks can happen asynchronously from the API call. We could remember the results of the last few executions so that we can return 500 on /v1/health only if the pg health indicator has failed 3-4 times in a row. This way we can avoid micro outages (e.g. if we couldnt reach the db for 5min straight, then we report status 500) otherwise we report 200 OK with body that mentions that db is down. This way hopefully we avoid microoutages reports caused by other components. We can leverage logic from the dependency in 1. that would help implement this

Approach

Each indicator could be configured separately in application.yml. The properties which could be configured are if the indicator is fatal or not, which says if the health of this indicator will affect the overall health. Failures treshold which is how much times in a row an indicator must report down until the component is considered down. And last is the interval time between each check for the component.

Example structure is:

...
health:
  indicators:
    storage:
      fatal: true
      failures_treshold: 3
      interval: 30
...

Positive response sample

{
  "details": {
    "ping": {
      "check_time": "2019-06-25T09:44:23.270103+03:00",
      "fatal": true,
      "first_failure_at": "0001-01-01T00:00:00Z",
      "name": "ping",
      "num_failures": 0,
      "status": "UP"
    },
    "storage": {
      "check_time": "2019-06-25T09:44:23.270114+03:00",
      "fatal": true,
      "first_failure_at": "0001-01-01T00:00:00Z",
      "name": "storage",
      "num_failures": 0,
      "status": "UP"
    }
  },
  "status": "UP"
}

Negative response sample (Treshold not exceeded and overall status is UP)

{
  "details": {
    "ping": {
      "check_time": "2019-06-25T09:46:23.272538+03:00",
      "fatal": true,
      "first_failure_at": "0001-01-01T00:00:00Z",
      "name": "ping",
      "num_failures": 0,
      "status": "UP"
    },
    "storage": {
      "check_time": "2019-06-25T09:46:23.274558+03:00",
      "error": "dial tcp [::1]:5432: connect: connection refused",
      "fatal": true,
      "first_failure_at": "2019-06-25T09:46:23.274626+03:00",
      "name": "storage",
      "num_failures": 1,
      "status": "DOWN"
    }
  },
  "status": "UP"
}

I made a PR in the library to make some of the fields in response optional since some of them are most of the time zero values like first_failure_at and num_failures when the component is UP, but still no response.
InVisionApp/go-health#64

@coveralls
Copy link

coveralls commented Jun 17, 2019

Coverage Status

Coverage decreased (-0.4%) to 89.377% when pulling 06295e6 on async-health into e9af4a3 on master.

@dpanayotov
Copy link
Member

dpanayotov commented Jun 18, 2019

I'm not entirely sure we need a new dependency that will span through all components just for its <100 rows of async collection of healths. Let's discuss this.

@KirilKabakchiev
Copy link
Member

KirilKabakchiev commented Jun 18, 2019

If we use more than the 100 lines of async collection of healths its worth using it - otherwise not. If we decide to use it we should make sure the pkg package does not expose any imports to this library (e.g. its internal implementation detail of the sm framework)

My idea was apart from the async collection of healths to also configure the library logging to use our logger so that we get health logging.

Also the library stores health failure timestamps and consecutive number of failures and other details which we could look at and decide if we want to enrich our health response with these extra details.
and some predefined healthcheckers such as Reachable, HTTP and SQLDB - but those are not a big deal.

pkg/sm/sm.go Outdated
}
smb.RegisterControllers(healthcheck.NewController(healthz, smb.HealthAggregationPolicy, smb.health.FailuresTreshold))

err := healthz.Start()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhere you should listen for ctx.Done() and call healthz.Stop()

pkg/sm/sm.go Outdated
smb.RegisterControllers(healthcheck.NewController(healthz, smb.HealthAggregationPolicy, smb.health.FailuresTreshold))

err := healthz.Start()
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline

@@ -182,7 +175,6 @@ type TransactionalRepositoryDecorator func(TransactionalRepository) (Transaction
//go:generate counterfeiter . Storage
type Storage interface {
OpenCloser
Pinger
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works just because the storage implementation is hardcoded in sm.New to use postgres.Storage. Assuming it is not, this should be reverted.

// Settings type to be loaded from the environment
type Settings struct {
FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"`
Interval int64 `description:"seconds between health checks of components"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this to time.Duration and add mapstructure. if time.Duration adjust the description

if s.FailuresTreshold < 0 {
return fmt.Errorf("validate Settings: FailuresTreshold must be >= 0")
}
if s.Interval < 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this should be something larger. Like 30s/60s at least?

}
return New().WithStatus(overallStatus).WithDetails(details)
}

// ConvertStatus converts go-health status to Status
func ConvertStatus(status string) Status {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't decide if it would be better to change our statuses from UP and Down to OK and Failed to avoid this.

pkg/sm/sm.go Outdated
<-c.Done()
log.C(c).Debug("Context cancelled. Stopping health checks...")
if err := healthz.Stop(); err != nil {
log.D().Error(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.C(c)

FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"`
Interval int64 `description:"seconds between health checks of components"`
FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"`
Interval time.Duration `mapstructure:"interval" description:"seconds between health checks of components"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description should be time between...


// Validate validates the health settings
func (s *Settings) Validate() error {
if s.FailuresTreshold < 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add tests for these

}
return healthz.Up()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that these are not used, we might remove our health.Health altogether because it seems unnecessary conversion from one type to another

if len(healths) == 0 {
return New().WithDetail("error", "no health indicators registered").Unknown()
}
overallStatus := StatusUp
for _, health := range healths {
if health.Status == StatusDown {
if health.Status == "failed" && health.ContiguousFailures > failureTreshold {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health.ContiguousFailures > failureTreshold introduces behavior, which I'm not sure we want for all health indicators.
If the storage is down when healthcheck call comes, you might get a response with body that says failed, but you'll get status 200 OK. This will happen if the time between the detection of storage down and the call is less than failureThreshold * interval.
Assuming we configure interval=60s and /v1/monitor/health is called every 5 minutes - on the 4th minute the storage fails and now on the health call we get 200 OK but with failure response. Assuming there is a storage outage for 4 minutes (1 minute before the next health call) then the next health call will report UP again. However, the service has not been operational for 4 minutes and all storage operations were failing.

Does it make sense to have the health configuration be a map[string]Settings where each health indicator can be separately configured with failureThreshold and interval? Then we might configure the storage healthcheck to report down even after 1 failure.

}

// NewController returns a new healthcheck controller with the given indicators and aggregation policy
func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller {
func NewController(health h.IHealth, aggPolicy health.AggregationPolicy, failuresTreshold int64) web.Controller {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not a good idea (for projects with vendor folders) for pkg methods that will be used in other modules to expect input or return output that requires the caller of this to also import this dependency. this may complicate dependency management for any other project (proxy for example) - ideally creating a controller should require only imports from github.com/Peripli/service-manager

pkg/sm/sm.go Outdated
err := healthz.AddCheck(&h.Config{
Name: indicator.Name(),
Checker: indicator,
Interval: smb.health.Interval * time.Second,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be configurable per HealthIndicator as the library allows - the way you implemented it limits us to having to use the same interval for all indicators which might be ok if the indicator didnt specify its own value

healthIndicator = &storage.HealthIndicator{
Pinger: pinger,
}
healthIndicator, _ = storage.NewStorageHealthIndicator(storage.PingFunc(ping))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a good practice to skip errors - even if its tests - normally you should handle the err by doing Expect(err).ShouldNot(HaveOccured()) - otherwise in case someone changed the logic in NewStorageHealthIndicator and the err occurred and was ignored here troubleshooting the failing test would be harder

// DefaultAggregationPolicy aggregates the healths by constructing a new Health based on the given
// where the overall health status is negative if one of the healths is negative and positive if all are positive
type DefaultAggregationPolicy struct {
}

// Apply aggregates the given healths
func (*DefaultAggregationPolicy) Apply(healths map[string]*Health) *Health {
func (*DefaultAggregationPolicy) Apply(healths map[string]health.State, failureTreshold int64) *Health {
Copy link
Member

@KirilKabakchiev KirilKabakchiev Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The State also includes details such as

    CheckTime Time "json:\"check_time\""
    ContiguousFailures int64 "json:\"num_failures\""
    TimeOfFirstFailure Time "json:\"first_failure_at\""

It would be useful to have these in the json response for each of the indicators

}

healthz := h.New()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library allows registering IStatusListener that gets called back when the health status changes. We could register one so that we can log health status changes - rough example here https://github.com/InVisionApp/go-health/tree/master/examples/status-listener

Since Recovered can also log the times failures happened before recovery this could be useful information in logs

pkg/sm/sm.go Outdated
@@ -167,9 +178,12 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) (

// Build builds the Service Manager
func (smb *ServiceManagerBuilder) Build() *ServiceManager {
// setup server and add relevant global middleware
smb.installHealth()
err := smb.installHealth()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err := ..; err != nil { ... }

pkg/sm/sm.go Outdated
healthz.Logger = l.New(logger)

for _, indicator := range smb.HealthIndicators {
err := healthz.AddCheck(&h.Config{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err := ..; err != nil { ... }

overallStatus = StatusDown
break
}
}
details := make(map[string]interface{})
for k, v := range healths {
details[k] = v
details[k] = ConvertStatus(v.Status)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you include the name in the details ? it would probably make sense to know what status comes from what indicator

@@ -17,27 +17,22 @@
package storage_test

import (
"context"
Copy link
Member

@KirilKabakchiev KirilKabakchiev Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coverage has dropped with 0.4% which seems a bit too much - could you have a look at new missed lines and see if they can be covered with meaningful tests?

pkg/sm/sm.go Outdated
return err
}

// Handles safe termination of sm
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like // Gracefully stop health checks or no comment ?

}

// ConfigureIndicators configures registry's indicators with provided settings
func (r *Registry) ConfigureIndicators() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a hidden extra step that needs to be performed. What do you think about hiding the indicators and add a func (Registry) AddHealthIndicator(Inidicator) that configures it?

pkg/sm/sm.go Outdated
// setup server and add relevant global middleware
smb.installHealth()
if err := smb.installHealth(); err != nil {
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.C(smb.ctx).Panic()

// NewController returns a new healthcheck controller with the given indicators and aggregation policy
func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller {
// NewController returns a new healthcheck controller with the given health and tresholds
func NewController(health h.IHealth, indicators []health.Indicator) web.Controller {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass only the minimum that is required. you don't need the whole indicators slice.

pkg/sm/sm.go Outdated
@@ -183,7 +183,7 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) (
// Build builds the Service Manager
func (smb *ServiceManagerBuilder) Build() *ServiceManager {
if err := smb.installHealth(); err != nil {
panic(err)
log.C(smb.ctx).Panic()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panic(err)

dpanayotov
dpanayotov previously approved these changes Aug 20, 2019
pkg/sm/sm.go Outdated
@@ -132,7 +134,13 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) (
return nil, fmt.Errorf("error creating core api: %s", err)
}

API.HealthIndicators = append(API.HealthIndicators, &storage.HealthIndicator{Pinger: storage.PingFunc(smStorage.Ping)})
storageHealthIndicator, err := storage.NewStorageHealthIndicator(storage.PingFunc(smStorage.PingContext))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage.NewHealthIndicator (do not repeat storage)

pkg/sm/sm.go Outdated
}

API.HealthIndicators = append(API.HealthIndicators, storageHealthIndicator)
API.HealthSettings = cfg.Health.IndicatorsSettings
Copy link
Member

@KirilKabakchiev KirilKabakchiev Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if its necessary to copy the settings over in the registry as the cfg is acessible in most places anyway?

pkg/sm/sm.go Outdated
for _, indicator := range smb.HealthIndicators {
if configurableIndicator, ok := indicator.(health.ConfigurableIndicator); ok {
if settings, ok := smb.HealthSettings[configurableIndicator.Name()]; ok {
configurableIndicator.Configure(settings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think its necessary to let the indicators know about these settings - Indicator can have .State() and .Name() and no other interfaced methods and the first touchpoint between the settings and the actual indicator will be when calling healthz.AddCheck

return i.settings.Fatal
}

func NewStorageHealthIndicator(pingFunc PingFunc) (health.Indicator, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file order - constructors that return concrete types (typically) go right before the struct they return (readability)

FailuresTreshold() int64

// Fatal returns if the health indicator is fatal for the overall status
Fatal() bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does a fatal = false with failureTreshold = 5 indicator mean ? Meaning do we need both or does having both just bring confusion

}
overallStatus := health.StatusUp
for i, v := range state {
if v.Fatal && v.ContiguousFailures >= c.tresholds[i] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be >= or > (if threshold is the maximum allowed failures, the = should be removed)

return health.New().WithDetail("error", "no health indicators registered")
}
overallStatus := health.StatusUp
for i, v := range state {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you dont really use i so put _ instead. Also rename state to overrallState and v to state

return nil, fmt.Errorf("error creating storage health indicator: %s", err)
}

API.HealthIndicators = append(API.HealthIndicators, storageHealthIndicator)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could wrap the access to the HealthIndicators in a register method that also adds defaultsettings for this indicator if they are missing and "completes" any incomplete settings for the indicator that is being registered - this will help get rid of the Configurable interface

Apply(healths map[string]*Health) *Health
// ConfigurableIndicator is an interface to provide configurable health of a component
//go:generate counterfeiter . ConfigurableIndicator
type ConfigurableIndicator interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest tey to get rid of this interface. - thereare some ideas in the other comments

// NewController returns a new healthcheck controller with the given indicators and aggregation policy
func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller {
// NewController returns a new healthcheck controller with the given health and tresholds
func NewController(health h.IHealth, tresholds map[string]int64) web.Controller {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewController is a public api that will be used in other projects- so you have two options - go for a stable api (pass settings and know that when new settings are added the controller public api wont change) or pass minimal and risk having to refactor a public api afterwards. Both approaches are acceptable.

} else {
configurableIndicator.Configure(health.DefaultIndicatorSettings())
}
settings, ok := smb.cfg.Health.Indicators[indicator.Name()]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should probably also handle partcially configured settings that are present in the map - meaning "complete" them with default values

Copy link
Member

@KirilKabakchiev KirilKabakchiev Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed and it seems that now partically configured indicator will result in confiuration validation error - which is also ok

sqlConfig := &checkers.SQLConfig{
Pinger: pingFunc,
}
sqlChecker, err := checkers.NewSQL(sqlConfig)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is implementation specific in the storage package. Move this to postgres package or rename this to SQLHealthIndicator / PostgresHealthIndicator.

@DimitarPetrov DimitarPetrov merged commit df40c87 into master Aug 23, 2019
@DimitarPetrov DimitarPetrov deleted the async-health branch August 23, 2019 05:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants