Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Platforms connection in health-check #325

Merged
merged 36 commits into from
Sep 12, 2019
Merged

Platforms connection in health-check #325

merged 36 commits into from
Sep 12, 2019

Conversation

DimitarPetrov
Copy link
Member

@DimitarPetrov DimitarPetrov commented Aug 27, 2019

Motivation

Connection with platforms should be included in the health-check so it is easy to figure out if any state is propagated to a platform.

Approach

A new indicator for all platforms registered in SM is introduced.

...
health:
  indicators:
    storage:
      fatal: true
      failures_threshold: 3
      interval: 30s
    platforms:
      fatal: true
      failures_threshold: 3
      interval: 30s
...

And health-check response for such a configuration looks like this:

{
  "details": {
    "platforms": {
      "check_time": "2019-09-11T16:17:54.121789+03:00",
      "details": {
        "asd": {
          "details": {
            "since": "0001-01-01T00:00:00Z",
            "type": "cloudfoundry"
          },
          "status": "DOWN"
        },
        "asdasd": {
          "details": {
            "since": "0001-01-01T00:00:00Z",
            "type": "cloudfoundry"
          },
          "status": "DOWN"
        },
        "asdf": {
          "details": {
            "since": "0001-01-01T00:00:00Z",
            "type": "kubernetes"
          },
          "status": "DOWN"
        }
      },
      "error": "there are 3 inactive platforms",
      "fatal": true,
      "first_failure_at": "2019-09-11T16:17:54.121889+03:00",
      "name": "platforms",
      "num_failures": 1,
      "status": "DOWN"
    },
    "storage": {
      "check_time": "2019-09-11T16:17:54.117632+03:00",
      "fatal": true,
      "first_failure_at": "0001-01-01T00:00:00Z",
      "name": "storage",
      "num_failures": 0,
      "status": "UP"
    }
  },
  "status": "UP"
}

@coveralls
Copy link

coveralls commented Aug 27, 2019

Coverage Status

Coverage increased (+0.2%) to 89.626% when pulling cfd8a90 on platform-health into 7b05398 on master.

// but later not registered nothing will happen.
var indicatorNames = [...]string{
StorageIndicatorName,
"kubernetes" + PlatformIndicatorSuffix,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SM must not care about different platform types in the code. These should be a configuration

Copy link
Member

@KirilKabakchiev KirilKabakchiev Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't mention the indicator names here somehow then they cannot be configured via the configuration

@dotchev
Copy link
Contributor

dotchev commented Sep 10, 2019

Describe in the Approach section the meaning of failures_threshold and interval properties

@dotchev
Copy link
Contributor

dotchev commented Sep 10, 2019

Any documentation for this feature?

@DimitarPetrov
Copy link
Member Author

Describe in the Approach section the meaning of failures_threshold and interval properties

This properties are part of different change which introduces async healthcheck.
This is the PR #301 which introduces them.

Copy link
Contributor

@dotchev dotchev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using separate health indicators for each platform type, it would be simpler to have one indicator for all platforms and in its details list the state of each platform together with its type.
Leave SM wrappers decide which platform status is fatal. This decision may not be based on the platform type.

@@ -17,10 +17,12 @@
package healthcheck

import (
"context"
h "github.com/InVisionApp/go-health"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h alias is too short for such a wide scope

@@ -56,7 +58,7 @@ func (c *controller) healthCheck(r *web.Request) (*web.Response, error) {
return util.NewJSONResponse(status, healthResult)
}

func (c *controller) aggregate(overallState map[string]h.State) *health.Health {
func (c *controller) aggregate(ctx context.Context, overallState map[string]h.State) *health.Health {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"overall" suggests the state is already aggregated, better rename it

@@ -70,6 +72,9 @@ func (c *controller) aggregate(overallState map[string]h.State) *health.Health {
details := make(map[string]interface{})
for name, state := range overallState {
state.Status = convertStatus(state.Status)
if strings.Contains(name, health.PlatformIndicatorSuffix) && !web.IsAuthorized(ctx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid parsing strings as it is error prone, better store this data in a structured way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not remove the details for all the health indicators in case the request is not authorised?

@@ -70,6 +72,9 @@ func (c *controller) aggregate(overallState map[string]h.State) *health.Health {
details := make(map[string]interface{})
for name, state := range overallState {
state.Status = convertStatus(state.Status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remap the status?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to adopt library’s status “ok” and “failed” this change should be propagated everywhere since all the components check UP to determine if it is started correctly.

@@ -70,6 +72,9 @@ func (c *controller) aggregate(overallState map[string]h.State) *health.Health {
details := make(map[string]interface{})
for name, state := range overallState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why loop a second time over overallState? can't we do it in one loop?

return nil, fmt.Errorf("could not fetch platforms health from storage: %v", err)
}

details := make(map[string]interface{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map[string]*health.Health

details[platform.Name] = health.New().WithStatus(health.StatusUp)
} else {
details[platform.Name] = health.New().WithStatus(health.StatusDown).WithDetail("since", platform.LastActive)
err = fmt.Errorf("there is inactive %s platforms", pi.platformType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better include the total number of inactive platforms, e.g. "there are %d inactive %s platforms"

// default settings again, but this defaults could be overridden only via application.yml,
// env variables and pflags won't have any effect. If an indicator is specified in this list
// but later not registered nothing will happen.
var indicatorNames = [...]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about indicators outside Peripli? who will add them here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will maintain separate list of indicator names and extend default settings with them while configuring their settings. Other approach is to export this array lets discuss which is better.

BEGIN;

ALTER TABLE platforms ADD COLUMN active boolean NOT NULL DEFAULT '0';
ALTER TABLE platforms ADD COLUMN last_active TIMESTAMP NOT NULL DEFAULT '1970-01-01 00:00:00+00';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's strange, better use NULL is the platform was never active

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use NULL then in Platform struct we need to use sql.NullString instead of time.Time so rows scanning can be done properly for null values and deal with casting, which for me seems like unnecessary complication but i am okay with both solutions.

if err := c.updatePlatform(platform.ID, func(p *types.Platform) {
p.Active = false
p.LastActive = time.Now()
}); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this if is hard to read, break into multiple statements

@dotchev dotchev merged commit 63c53a2 into master Sep 12, 2019
@dotchev dotchev deleted the platform-health branch September 12, 2019 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants