Async Healthcheck #301

DimitarPetrov · 2019-06-17T12:42:54Z

Motivation

Currently the SM healthcheck works synchronously: every call to /v1/health calls all dependent components in turn. This is not optimal and puts unnecessary load on both SM and its dependencies.

Using https://github.com/InVisionApp/go-health, healthchecks can run asynchronously from the API call. We could remember the results of the last few executions and return 500 on /v1/health only if the pg health indicator has failed 3-4 times in a row (e.g. if we couldn't reach the db for 5 minutes straight, we report status 500); otherwise we report 200 OK with a body that mentions that the db is down. This way we hopefully avoid reports of micro-outages caused by other components. We can leverage logic from this dependency to help implement it.

Approach

Each indicator can be configured separately in application.yml. The configurable properties are: fatal, which says whether the health of this indicator affects the overall health; failures_treshold, which is how many times in a row an indicator must report down before the component is considered down; and interval, which is the time between checks of the component.

Example structure is:

...
health:
  indicators:
    storage:
      fatal: true
      failures_treshold: 3
      interval: 30
...
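As a rough illustration of how one entry under health.indicators might look once loaded into Go, here is a minimal sketch. The type name IndicatorSettings, the Validate method and its lower bounds are assumptions made for this example; they are not taken from the PR's final code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// IndicatorSettings is a hypothetical mirror of one entry under
// health.indicators in application.yml.
type IndicatorSettings struct {
	Fatal            bool          `mapstructure:"fatal"`
	FailuresTreshold int64         `mapstructure:"failures_treshold"`
	Interval         time.Duration `mapstructure:"interval"`
}

// Validate checks that the settings describe a usable periodic check.
// The lower bounds are assumptions chosen for illustration.
func (s IndicatorSettings) Validate() error {
	if s.FailuresTreshold < 1 {
		return errors.New("failures_treshold must be at least 1")
	}
	if s.Interval <= 0 {
		return errors.New("interval must be positive")
	}
	return nil
}

func main() {
	// Equivalent of the "storage" entry in the example structure,
	// reading "interval: 30" as 30 seconds.
	storage := IndicatorSettings{Fatal: true, FailuresTreshold: 3, Interval: 30 * time.Second}
	fmt.Println("storage settings valid:", storage.Validate() == nil)

	// A zero threshold is rejected by this sketch.
	fmt.Println(IndicatorSettings{Interval: time.Second}.Validate())
}
```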

Positive response sample

{
  "details": {
    "ping": {
      "check_time": "2019-06-25T09:44:23.270103+03:00",
      "fatal": true,
      "first_failure_at": "0001-01-01T00:00:00Z",
      "name": "ping",
      "num_failures": 0,
      "status": "UP"
    },
    "storage": {
      "check_time": "2019-06-25T09:44:23.270114+03:00",
      "fatal": true,
      "first_failure_at": "0001-01-01T00:00:00Z",
      "name": "storage",
      "num_failures": 0,
      "status": "UP"
    }
  },
  "status": "UP"
}

Negative response sample (Threshold not exceeded and overall status is UP)

{
  "details": {
    "ping": {
      "check_time": "2019-06-25T09:46:23.272538+03:00",
      "fatal": true,
      "first_failure_at": "0001-01-01T00:00:00Z",
      "name": "ping",
      "num_failures": 0,
      "status": "UP"
    },
    "storage": {
      "check_time": "2019-06-25T09:46:23.274558+03:00",
      "error": "dial tcp [::1]:5432: connect: connection refused",
      "fatal": true,
      "first_failure_at": "2019-06-25T09:46:23.274626+03:00",
      "name": "storage",
      "num_failures": 1,
      "status": "DOWN"
    }
  },
  "status": "UP"
}
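The two samples above can be read as the output of a small aggregation rule: the overall status turns DOWN only when a fatal indicator has accumulated enough consecutive failures. The sketch below illustrates that rule under stated assumptions. IndicatorState and OverallStatus are hypothetical names, and the >= comparison and the fallback threshold of 1 are choices made for the example.

```go
package main

import "fmt"

// IndicatorState is a hypothetical, trimmed-down view of one entry in "details".
type IndicatorState struct {
	Name        string
	Fatal       bool
	NumFailures int64 // consecutive failed checks so far
}

// OverallStatus returns "DOWN" only when a fatal indicator has failed at
// least its configured number of times in a row; otherwise it returns "UP".
// Indicators without a configured threshold fall back to 1 (an assumption).
func OverallStatus(states []IndicatorState, thresholds map[string]int64) string {
	for _, s := range states {
		limit, ok := thresholds[s.Name]
		if !ok {
			limit = 1
		}
		if s.Fatal && s.NumFailures >= limit {
			return "DOWN"
		}
	}
	return "UP"
}

func main() {
	thresholds := map[string]int64{"ping": 3, "storage": 3}

	// Same situation as the negative sample: storage has failed once.
	states := []IndicatorState{
		{Name: "ping", Fatal: true, NumFailures: 0},
		{Name: "storage", Fatal: true, NumFailures: 1},
	}
	fmt.Println(OverallStatus(states, thresholds))

	// After the third consecutive failure the overall status flips.
	states[1].NumFailures = 3
	fmt.Println(OverallStatus(states, thresholds))
}
```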

I made a PR in the library to make some of the fields in the response optional, since some of them (like first_failure_at and num_failures) are zero values most of the time when the component is UP, but there has been no response yet.
InVisionApp/go-health#64

coveralls · 2019-06-17T12:55:18Z

Coverage decreased (-0.4%) to 89.377% when pulling 06295e6 on async-health into e9af4a3 on master.

dpanayotov · 2019-06-18T06:37:20Z

I'm not entirely sure we need a new dependency that will span all components just for its <100 lines of async health collection. Let's discuss this.

KirilKabakchiev · 2019-06-18T07:23:51Z

If we use more than the 100 lines of async health collection, it's worth using it; otherwise not. If we decide to use it, we should make sure the pkg package does not expose any imports of this library (i.e. it stays an internal implementation detail of the sm framework).

My idea was apart from the async collection of healths to also configure the library logging to use our logger so that we get health logging.

Also, the library stores health failure timestamps, the number of consecutive failures and other details, which we could look at to decide whether to enrich our health response with them.
It also provides some predefined health checkers such as Reachable, HTTP and SQLDB, but those are not a big deal.

dpanayotov · 2019-06-18T13:16:48Z

pkg/sm/sm.go

+	}
+	smb.RegisterControllers(healthcheck.NewController(healthz, smb.HealthAggregationPolicy, smb.health.FailuresTreshold))
+
+	err := healthz.Start()


somewhere you should listen for ctx.Done() and call healthz.Stop()
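A minimal sketch of that suggestion follows, assuming only that the health checker exposes a Stop() error method. The stopper interface, stopOnCancel and fakeChecker are hypothetical names introduced for the example.

```go
package main

import (
	"context"
	"fmt"
)

// stopper is a hypothetical stand-in for the part of the health checker
// that this sketch needs; only Stop is assumed.
type stopper interface {
	Stop() error
}

// stopOnCancel starts a goroutine that waits for ctx to be cancelled and then
// stops the checker. The returned channel is closed once shutdown has finished.
func stopOnCancel(ctx context.Context, hc stopper) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
		if err := hc.Stop(); err != nil {
			fmt.Println("stopping health checks:", err)
		}
	}()
	return done
}

// fakeChecker records whether Stop was called.
type fakeChecker struct{ stopped bool }

func (f *fakeChecker) Stop() error {
	f.stopped = true
	return nil
}

// runDemo cancels the context and reports whether the checker was stopped.
func runDemo() bool {
	ctx, cancel := context.WithCancel(context.Background())
	hc := &fakeChecker{}
	done := stopOnCancel(ctx, hc)
	cancel()
	<-done // wait for the goroutine so the read below is safe
	return hc.stopped
}

func main() {
	fmt.Println("stopped:", runDemo())
}
```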

dpanayotov · 2019-06-18T13:16:55Z

pkg/sm/sm.go

+	smb.RegisterControllers(healthcheck.NewController(healthz, smb.HealthAggregationPolicy, smb.health.FailuresTreshold))
+
+	err := healthz.Start()
+	if err != nil {


dpanayotov · 2019-06-18T13:18:40Z

storage/interfaces.go

@@ -182,7 +175,6 @@ type TransactionalRepositoryDecorator func(TransactionalRepository) (Transaction
 //go:generate counterfeiter . Storage
 type Storage interface {
 	OpenCloser
-	Pinger


This works just because the storage implementation is hardcoded in sm.New to use postgres.Storage. Assuming it is not, this should be reverted.

dpanayotov · 2019-06-18T13:21:23Z

pkg/health/types.go

+// Settings type to be loaded from the environment
+type Settings struct {
+	FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"`
+	Interval         int64 `description:"seconds between health checks of components"`


Change this to time.Duration and add a mapstructure tag. If it becomes time.Duration, adjust the description.
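To show why the description has to change along with the type, here is a small sketch using only the standard library. It assumes the interval is supplied as a duration string such as "30s"; parseInterval is a hypothetical helper, and how the project's configuration loader decodes the value is not shown.

```go
package main

import (
	"fmt"
	"time"
)

// parseInterval accepts a duration string such as "30s" or "1m" and
// rejects anything that is not strictly positive.
func parseInterval(raw string) (time.Duration, error) {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid interval %q: %v", raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("interval must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	d, err := parseInterval("30s")
	fmt.Println(d, err)

	// A bare integer converted to time.Duration counts nanoseconds,
	// so "seconds between health checks" no longer describes the field.
	fmt.Println(time.Duration(30))

	// A value that is already a Duration must not be multiplied by
	// time.Second a second time.
	fmt.Println(d == 30*time.Second)
}
```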

dpanayotov · 2019-06-18T13:22:08Z

pkg/health/types.go

+	if s.FailuresTreshold < 0 {
+		return fmt.Errorf("validate Settings: FailuresTreshold must be >= 0")
+	}
+	if s.Interval < 0 {


I suppose this should be something larger. Like 30s/60s at least?

dpanayotov · 2019-06-18T13:24:06Z

pkg/health/aggregation_policy.go

 	}
 	return New().WithStatus(overallStatus).WithDetails(details)
 }
+
+// ConvertStatus converts go-health status to Status
+func ConvertStatus(status string) Status {


I can't decide if it would be better to change our statuses from UP and Down to OK and Failed to avoid this.

dpanayotov · 2019-06-19T05:43:05Z

pkg/sm/sm.go

+		<-c.Done()
+		log.C(c).Debug("Context cancelled. Stopping health checks...")
+		if err := healthz.Stop(); err != nil {
+			log.D().Error(err)


dpanayotov · 2019-06-19T05:43:26Z

pkg/health/types.go

-	FailuresTreshold int64 `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"`
-	Interval         int64 `description:"seconds between health checks of components"`
+	FailuresTreshold int64         `mapstructure:"failures_treshold" description:"maximum failures in a row until component is considered down"`
+	Interval         time.Duration `mapstructure:"interval" description:"seconds between health checks of components"`


description should be time between...

dpanayotov · 2019-06-19T05:45:19Z

pkg/health/types.go

+
+// Validate validates the health settings
+func (s *Settings) Validate() error {
+	if s.FailuresTreshold < 0 {


add tests for these

dpanayotov · 2019-06-19T05:49:14Z

storage/healthcheck.go

 	}
-	return healthz.Up()


now that these are not used, we might remove our health.Health altogether because it seems unnecessary conversion from one type to another

dpanayotov · 2019-06-19T06:24:37Z

pkg/health/aggregation_policy.go

 	if len(healths) == 0 {
 		return New().WithDetail("error", "no health indicators registered").Unknown()
 	}
 	overallStatus := StatusUp
 	for _, health := range healths {
-		if health.Status == StatusDown {
+		if health.Status == "failed" && health.ContiguousFailures > failureTreshold {


health.ContiguousFailures > failureTreshold introduces behavior, which I'm not sure we want for all health indicators.
If the storage is down when healthcheck call comes, you might get a response with body that says failed, but you'll get status 200 OK. This will happen if the time between the detection of storage down and the call is less than failureThreshold * interval.
Assuming we configure interval=60s and /v1/monitor/health is called every 5 minutes - on the 4th minute the storage fails and now on the health call we get 200 OK but with failure response. Assuming there is a storage outage for 4 minutes (1 minute before the next health call) then the next health call will report UP again. However, the service has not been operational for 4 minutes and all storage operations were failing.

Does it make sense to have the health configuration be a map[string]Settings where each health indicator can be separately configured with failureThreshold and interval? Then we might configure the storage healthcheck to report down even after 1 failure.
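The timing scenario described in this comment can be replayed with a small simulation. This is a sketch under assumptions: one check per minute, a threshold of 3 (the comment does not state one), an outage covering the checks at minutes 4 to 7, and a monitor polling at minutes 5 and 10. The function name simulate is hypothetical.

```go
package main

import "fmt"

// simulate runs one health check per minute for the given number of minutes.
// The storage check fails for every minute t with downFrom <= t < downUntil.
// It returns the overall status a monitor would see at each polled minute and
// the longest streak of consecutive failures that occurred.
func simulate(minutes, downFrom, downUntil int, threshold int64, polls []int) (map[int]string, int64) {
	pollAt := make(map[int]bool, len(polls))
	for _, p := range polls {
		pollAt[p] = true
	}
	seen := make(map[int]string, len(polls))
	var streak, longest int64
	for t := 1; t <= minutes; t++ {
		if t >= downFrom && t < downUntil {
			streak++
		} else {
			streak = 0
		}
		if streak > longest {
			longest = streak
		}
		if pollAt[t] {
			status := "UP"
			if streak > threshold { // same comparison as in the diff above
				status = "DOWN"
			}
			seen[t] = status
		}
	}
	return seen, longest
}

func main() {
	seen, longest := simulate(10, 4, 8, 3, []int{5, 10})
	fmt.Println("poll at minute 5:", seen[5])
	fmt.Println("poll at minute 10:", seen[10])
	fmt.Println("longest failure streak:", longest)
}
```

With these numbers both polls see UP even though the storage check failed four times in a row between them, which is the gap the comment points out. Lowering the storage threshold makes the poll at minute 5 report DOWN.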

KirilKabakchiev · 2019-06-18T19:37:27Z

api/healthcheck/healthcheck_controller.go

 }

 // NewController returns a new healthcheck controller with the given indicators and aggregation policy
-func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller {
+func NewController(health h.IHealth, aggPolicy health.AggregationPolicy, failuresTreshold int64) web.Controller {


It's not a good idea (for projects with vendor folders) for pkg functions that will be used in other modules to take inputs or return outputs that force the caller to import this dependency as well. This may complicate dependency management for any other project (the proxy, for example). Ideally, creating a controller should require only imports from github.com/Peripli/service-manager.

KirilKabakchiev · 2019-06-18T19:41:48Z

pkg/sm/sm.go

+		err := healthz.AddCheck(&h.Config{
+			Name:     indicator.Name(),
+			Checker:  indicator,
+			Interval: smb.health.Interval * time.Second,


This should probably be configurable per HealthIndicator, as the library allows. The way you implemented it limits us to the same interval for all indicators, which might be OK as a fallback when the indicator didn't specify its own value.

KirilKabakchiev · 2019-06-18T19:44:12Z

storage/healthcheck_test.go

-		healthIndicator = &storage.HealthIndicator{
-			Pinger: pinger,
-		}
+		healthIndicator, _ = storage.NewStorageHealthIndicator(storage.PingFunc(ping))


It's not good practice to skip errors, even in tests. Normally you should handle the err with Expect(err).ShouldNot(HaveOccurred()). Otherwise, if someone changes the logic in NewStorageHealthIndicator and an error occurs and is ignored here, troubleshooting the failing test will be harder.

KirilKabakchiev · 2019-06-18T19:51:35Z

pkg/health/aggregation_policy.go

 // DefaultAggregationPolicy aggregates the healths by constructing a new Health based on the given
 // where the overall health status is negative if one of the healths is negative and positive if all are positive
 type DefaultAggregationPolicy struct {
 }

 // Apply aggregates the given healths
-func (*DefaultAggregationPolicy) Apply(healths map[string]*Health) *Health {
+func (*DefaultAggregationPolicy) Apply(healths map[string]health.State, failureTreshold int64) *Health {


The State also includes details such as

	CheckTime          Time  `json:"check_time"`
	ContiguousFailures int64 `json:"num_failures"`
	TimeOfFirstFailure Time  `json:"first_failure_at"`

It would be useful to have these in the json response for each of the indicators

KirilKabakchiev · 2019-06-18T19:57:58Z

pkg/sm/sm.go

 	}
+
+	healthz := h.New()


The library allows registering IStatusListener that gets called back when the health status changes. We could register one so that we can log health status changes - rough example here https://github.com/InVisionApp/go-health/tree/master/examples/status-listener

Since Recovered can also log the times failures happened before recovery this could be useful information in logs
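A sketch of what such a listener could look like is shown below. To keep it self-contained it does not import go-health: State and StatusListener are hypothetical local types that only imitate the idea of failed and recovered callbacks, so the library's exact method names and parameters should be checked against its documentation.

```go
package main

import "fmt"

// State is a hypothetical, reduced stand-in for the library's per-check state.
type State struct {
	Name string
	Err  string
}

// StatusListener imitates a status-change callback interface. The method
// names and parameters are assumptions, not the library's exact API.
type StatusListener interface {
	HealthCheckFailed(entry State)
	HealthCheckRecovered(entry State, recordedFailures int64)
}

// loggingListener collects log lines in memory so the sketch stays testable;
// a real listener would write to the application logger instead.
type loggingListener struct {
	lines []string
}

func (l *loggingListener) HealthCheckFailed(entry State) {
	l.lines = append(l.lines, fmt.Sprintf("health check %q failed: %s", entry.Name, entry.Err))
}

func (l *loggingListener) HealthCheckRecovered(entry State, recordedFailures int64) {
	l.lines = append(l.lines, fmt.Sprintf("health check %q recovered after %d failures", entry.Name, recordedFailures))
}

func main() {
	l := &loggingListener{}
	var listener StatusListener = l

	// Simulate a status change to DOWN followed by a recovery.
	listener.HealthCheckFailed(State{Name: "storage", Err: "connection refused"})
	listener.HealthCheckRecovered(State{Name: "storage"}, 3)

	for _, line := range l.lines {
		fmt.Println(line)
	}
}
```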

KirilKabakchiev · 2019-06-18T20:43:03Z

pkg/sm/sm.go

@@ -167,9 +178,12 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) (

 // Build builds the Service Manager
 func (smb *ServiceManagerBuilder) Build() *ServiceManager {
-	// setup server and add relevant global middleware
-	smb.installHealth()
+	err := smb.installHealth()


if err := ..; err != nil { ... }

KirilKabakchiev · 2019-06-18T20:43:17Z

pkg/sm/sm.go

+	healthz.Logger = l.New(logger)
+
+	for _, indicator := range smb.HealthIndicators {
+		err := healthz.AddCheck(&h.Config{


if err := ..; err != nil { ... }

KirilKabakchiev · 2019-06-18T20:45:23Z

pkg/health/aggregation_policy.go

 			overallStatus = StatusDown
 			break
 		}
 	}
 	details := make(map[string]interface{})
 	for k, v := range healths {
-		details[k] = v
+		details[k] = ConvertStatus(v.Status)


do you include the name in the details ? it would probably make sense to know what status comes from what indicator

KirilKabakchiev · 2019-06-18T20:51:37Z

storage/healthcheck_test.go

@@ -17,27 +17,22 @@
 package storage_test

 import (
+	"context"


Coverage has dropped by 0.4%, which seems a bit too much. Could you have a look at the newly missed lines and see if they can be covered with meaningful tests?

KirilKabakchiev · 2019-06-18T20:55:09Z

pkg/sm/sm.go

+		return err
+	}
+
+	// Handles safe termination of sm


Maybe something like // Gracefully stop health checks or no comment ?

dpanayotov · 2019-08-20T09:24:52Z

pkg/health/types.go

+}
+
+// ConfigureIndicators configures registry's indicators with provided settings
+func (r *Registry) ConfigureIndicators() {


This seems like a hidden extra step that needs to be performed. What do you think about hiding the indicators and adding a func (Registry) AddHealthIndicator(Indicator) that configures it?

dpanayotov · 2019-08-20T09:25:42Z

pkg/sm/sm.go

-	// setup server and add relevant global middleware
-	smb.installHealth()
+	if err := smb.installHealth(); err != nil {
+		panic(err)


log.C(smb.ctx).Panic()

dpanayotov · 2019-08-20T09:30:23Z

api/healthcheck/healthcheck_controller.go

-// NewController returns a new healthcheck controller with the given indicators and aggregation policy
-func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller {
+// NewController returns a new healthcheck controller with the given health and tresholds
+func NewController(health h.IHealth, indicators []health.Indicator) web.Controller {


pass only the minimum that is required. you don't need the whole indicators slice.

dpanayotov · 2019-08-20T11:49:25Z

pkg/sm/sm.go

@@ -183,7 +183,7 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) (
 // Build builds the Service Manager
 func (smb *ServiceManagerBuilder) Build() *ServiceManager {
 	if err := smb.installHealth(); err != nil {
-		panic(err)
+		log.C(smb.ctx).Panic()


KirilKabakchiev · 2019-08-20T20:47:06Z

pkg/sm/sm.go

@@ -132,7 +134,13 @@ func New(ctx context.Context, cancel context.CancelFunc, cfg *config.Settings) (
 		return nil, fmt.Errorf("error creating core api: %s", err)
 	}

-	API.HealthIndicators = append(API.HealthIndicators, &storage.HealthIndicator{Pinger: storage.PingFunc(smStorage.Ping)})
+	storageHealthIndicator, err := storage.NewStorageHealthIndicator(storage.PingFunc(smStorage.PingContext))


storage.NewHealthIndicator (do not repeat storage)

KirilKabakchiev · 2019-08-20T20:48:46Z

pkg/sm/sm.go

+	}
+
+	API.HealthIndicators = append(API.HealthIndicators, storageHealthIndicator)
+	API.HealthSettings = cfg.Health.IndicatorsSettings


Not sure if it's necessary to copy the settings over into the registry, as the cfg is accessible in most places anyway?

KirilKabakchiev · 2019-08-20T20:53:05Z

pkg/sm/sm.go

+	for _, indicator := range smb.HealthIndicators {
+		if configurableIndicator, ok := indicator.(health.ConfigurableIndicator); ok {
+			if settings, ok := smb.HealthSettings[configurableIndicator.Name()]; ok {
+				configurableIndicator.Configure(settings)


I don't think it's necessary to let the indicators know about these settings. Indicator can have .State() and .Name() and no other interface methods, and the first touchpoint between the settings and the actual indicator will be the call to healthz.AddCheck.

KirilKabakchiev · 2019-08-20T20:55:00Z

storage/healthcheck.go

+	return i.settings.Fatal
+}
+
+func NewStorageHealthIndicator(pingFunc PingFunc) (health.Indicator, error) {


file order - constructors that return concrete types (typically) go right before the struct they return (readability)

KirilKabakchiev · 2019-08-20T20:56:22Z

pkg/health/types.go

+	FailuresTreshold() int64
+
+	// Fatal returns if the health indicator is fatal for the overall status
+	Fatal() bool


What does an indicator with fatal = false and failureTreshold = 5 mean? In other words, do we need both, or does having both just bring confusion?

KirilKabakchiev · 2019-08-20T21:53:42Z

api/healthcheck/healthcheck_controller.go

+	}
+	overallStatus := health.StatusUp
+	for i, v := range state {
+		if v.Fatal && v.ContiguousFailures >= c.tresholds[i] {


should it be >= or > (if threshold is the maximum allowed failures, the = should be removed)
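The difference between the two comparisons shows up only at the boundary. The following sketch uses two hypothetical helper functions to make the off-by-one explicit.

```go
package main

import "fmt"

// downInclusive treats the threshold as the number of consecutive failures
// at which the indicator already counts as down.
func downInclusive(failures, threshold int64) bool {
	return failures >= threshold
}

// downExclusive treats the threshold as the maximum number of tolerated
// failures, so one more failure is needed before reporting down.
func downExclusive(failures, threshold int64) bool {
	return failures > threshold
}

func main() {
	const threshold = 3
	for failures := int64(2); failures <= 4; failures++ {
		fmt.Printf("failures=%d  >=: %t  >: %t\n",
			failures,
			downInclusive(failures, threshold),
			downExclusive(failures, threshold))
	}
}
```

Only the row where failures equals the threshold differs between the two columns, so the choice depends on whether the setting is documented as "failures needed" or "failures allowed".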

KirilKabakchiev · 2019-08-20T21:56:19Z

api/healthcheck/healthcheck_controller.go

+		return health.New().WithDetail("error", "no health indicators registered")
+	}
+	overallStatus := health.StatusUp
+	for i, v := range state {


You don't really use i, so put _ instead. Also rename state to overallState and v to state.

KirilKabakchiev · 2019-08-21T06:01:30Z

pkg/sm/sm.go

+		return nil, fmt.Errorf("error creating storage health indicator: %s", err)
+	}
+
+	API.HealthIndicators = append(API.HealthIndicators, storageHealthIndicator)


You could wrap the access to the HealthIndicators in a register method that also adds defaultsettings for this indicator if they are missing and "completes" any incomplete settings for the indicator that is being registered - this will help get rid of the Configurable interface
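One possible shape for such a register method is sketched below. Registry, Register, SettingsFor and the default values are hypothetical and chosen for illustration. The sketch completes only zero-valued threshold and interval fields, because an unset boolean cannot be told apart from an explicit false.

```go
package main

import (
	"fmt"
	"time"
)

// Settings holds the per-indicator configuration used in this sketch.
type Settings struct {
	Fatal            bool
	FailuresTreshold int64
	Interval         time.Duration
}

// defaultSettings returns fallback values; the numbers are assumptions.
func defaultSettings() Settings {
	return Settings{Fatal: true, FailuresTreshold: 3, Interval: 60 * time.Second}
}

// Indicator is the minimal interface the registry needs here.
type Indicator interface {
	Name() string
}

// Registry keeps indicators together with their effective settings.
type Registry struct {
	indicators []Indicator
	settings   map[string]Settings
}

// NewRegistry copies the externally configured settings into a new registry.
func NewRegistry(configured map[string]Settings) *Registry {
	settings := make(map[string]Settings, len(configured))
	for name, s := range configured {
		settings[name] = s
	}
	return &Registry{settings: settings}
}

// Register adds the indicator, applies defaults when nothing was configured
// for its name, and completes zero-valued threshold and interval fields.
func (r *Registry) Register(i Indicator) {
	r.indicators = append(r.indicators, i)
	defaults := defaultSettings()
	s, ok := r.settings[i.Name()]
	if !ok {
		r.settings[i.Name()] = defaults
		return
	}
	if s.FailuresTreshold == 0 {
		s.FailuresTreshold = defaults.FailuresTreshold
	}
	if s.Interval == 0 {
		s.Interval = defaults.Interval
	}
	r.settings[i.Name()] = s
}

// SettingsFor returns the effective settings for a registered indicator.
func (r *Registry) SettingsFor(name string) (Settings, bool) {
	s, ok := r.settings[name]
	return s, ok
}

// named is a trivial Indicator used only for the demonstration.
type named string

func (n named) Name() string { return string(n) }

func main() {
	r := NewRegistry(map[string]Settings{"storage": {Fatal: true, FailuresTreshold: 1}})
	r.Register(named("storage"))
	r.Register(named("ping"))
	fmt.Println(r.SettingsFor("storage"))
	fmt.Println(r.SettingsFor("ping"))
}
```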

KirilKabakchiev · 2019-08-21T06:03:39Z

pkg/health/types.go

-	Apply(healths map[string]*Health) *Health
+// ConfigurableIndicator is an interface to provide configurable health of a component
+//go:generate counterfeiter . ConfigurableIndicator
+type ConfigurableIndicator interface {


I would suggest trying to get rid of this interface. There are some ideas in the other comments.

KirilKabakchiev · 2019-08-21T06:19:37Z

api/healthcheck/healthcheck_controller.go

-// NewController returns a new healthcheck controller with the given indicators and aggregation policy
-func NewController(indicators []health.Indicator, aggregator health.AggregationPolicy) web.Controller {
+// NewController returns a new healthcheck controller with the given health and tresholds
+func NewController(health h.IHealth, tresholds map[string]int64) web.Controller {


NewController is a public API that will be used in other projects, so you have two options: go for a stable API (pass settings and know that when new settings are added the controller's public API won't change), or pass the minimum and risk having to refactor a public API afterwards. Both approaches are acceptable.

KirilKabakchiev · 2019-08-21T09:19:09Z

pkg/sm/sm.go

-			} else {
-				configurableIndicator.Configure(health.DefaultIndicatorSettings())
-			}
+		settings, ok := smb.cfg.Health.Indicators[indicator.Name()]


You should probably also handle partially configured settings that are present in the map, meaning "complete" them with default values.

Discussed, and it seems that a partially configured indicator will now result in a configuration validation error, which is also OK.

dpanayotov · 2019-08-22T08:39:27Z

storage/healthcheck.go

+	sqlConfig := &checkers.SQLConfig{
+		Pinger: pingFunc,
+	}
+	sqlChecker, err := checkers.NewSQL(sqlConfig)


this is implementation specific in the storage package. Move this to postgres package or rename this to SQLHealthIndicator / PostgresHealthIndicator.

DimitarPetrov requested review from georgifarashev, pankrator, dpanayotov, dotchev, KirilKabakchiev and NickyMateev June 17, 2019 12:42

DimitarPetrov self-assigned this Jun 17, 2019

DimitarPetrov added the 👋request review label Jun 17, 2019

dpanayotov requested changes Jun 18, 2019

DimitarPetrov requested a review from dpanayotov June 18, 2019 15:10

dpanayotov requested changes Jun 19, 2019

KirilKabakchiev requested changes Jun 19, 2019

DimitarPetrov added 🚧 WIP 🚧 and removed 👋request review labels Jun 20, 2019

DimitarPetrov requested review from KirilKabakchiev and dpanayotov June 25, 2019 10:07

DimitarPetrov added 👋request review and removed 🚧 WIP 🚧 labels Jun 25, 2019

DimitarPetrov force-pushed the async-health branch from 85c9baf to b0e93e9 Compare July 8, 2019 06:45

DimitarPetrov force-pushed the async-health branch from 9594304 to f197ae0 Compare August 20, 2019 08:30

dpanayotov requested changes Aug 20, 2019

DimitarPetrov requested a review from dpanayotov August 20, 2019 11:41

dpanayotov reviewed Aug 20, 2019

dpanayotov previously approved these changes Aug 20, 2019

DimitarPetrov added 3 commits August 21, 2019 09:13

Async health check

4139ea3

Tests adaptation

144ea91

Make health configurable externally

61157b4

DimitarPetrov added 11 commits August 21, 2019 09:13

Fix health settings validation

115e2a4

Attach logger to health

37f9bc4

Address PR comments

4f21ed9

Configuration per indicator and refactoring

421d88f

Add health status listener

30ab52e

Minor tweaks

d25a87d

Fix indicator interval type

794c3b0

Fix tests

40fb02b

Remove unused import

37c03dd

Extract indicator configuration and address PR comments

f138807

Add error to panic

ce4d4bb

DimitarPetrov dismissed dpanayotov’s stale review via ce4d4bb August 21, 2019 06:14

DimitarPetrov force-pushed the async-health branch from 88129ba to ce4d4bb Compare August 21, 2019 06:14

KirilKabakchiev requested changes Aug 21, 2019

Address PR comments

3a34381

DimitarPetrov requested review from KirilKabakchiev and dpanayotov August 21, 2019 09:15

KirilKabakchiev reviewed Aug 21, 2019

minor fix

91eb2a9

KirilKabakchiev previously approved these changes Aug 21, 2019

dpanayotov requested changes Aug 22, 2019

rename storage indicator

06295e6

DimitarPetrov dismissed KirilKabakchiev’s stale review via 06295e6 August 22, 2019 12:25

DimitarPetrov requested review from dpanayotov and KirilKabakchiev August 22, 2019 12:26

dpanayotov approved these changes Aug 22, 2019

KirilKabakchiev approved these changes Aug 23, 2019

DimitarPetrov merged commit df40c87 into master Aug 23, 2019

DimitarPetrov deleted the async-health branch August 23, 2019 05:52

DimitarPetrov mentioned this pull request Sep 10, 2019

Platforms connection in health-check #325

Merged
