
Rewrite metrics ingestion path and integrate with just the central veneur. #578

Open · wants to merge 17 commits into master

Conversation

@clin-stripe (Contributor) commented Nov 8, 2018

To make this a bit easier to review, check out the commits tab. Things are roughly broken out step-by-step.

Review guide

Sorry this is like 2000 lines. Most of it is testing, although I get that that doesn't make it easier to review. FWIW, I asked @asf-stripe to look at this halfway through and got the thumbs up to move forward.

If you're not sure how to get started, I suggest going in this order:

  1. Review the main business logic: Start with aggingestor.go, go to aggworker.go, go to sinkflusher.go.
  2. Review the mixed histogram implementation if you're familiar with this: go to mixedhisto.go
  3. Review the test framework and maybe skim the test cases (this is most of the lines).

Summary

This PR does the following:

  1. Point the gRPC import server at the new code.
  2. Build out a metricsingester that matches how central veneurs currently work: metrics aggregation is sharded across several workers, and a flush function runs on an interval to collect the aggregated metrics and send them to sinks.

In other words, this is a non-breaking change. But it will allow local veneurs to forward all metrics to the central veneur.

Architecture

The new metrics ingester does everything the old one does, but with a cleaner structure. (A minimal sketch of the shape follows the list.)

  1. AggregatingIngestor is the entry point (aggingestor.go). Through the Ingest and Merge methods, the AggregatingIngestor shards metrics across a series of workers. At a set interval, it flushes the aggregated metrics out of the workers and feeds them into a flusher (see below).

  2. AggregatingWorkers (aggworker.go) take in metrics and feed them into samplers, doing all the work of metrics aggregation. They run as separate goroutines to facilitate parallelism.

  3. sinkflusher.go contains the code to flush metrics to sinks. Given a set of samplers, the sinkflusher generates raw metrics and flushes them to each configured sink.
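To make that shape concrete, here is a minimal sketch of the structure described above. This is illustrative only; the type and field names are hypothetical stand-ins, not the PR's actual code:

package sketch

import "time"

// Metric is a hypothetical stand-in for the PR's metric type.
type Metric struct {
	Name string
	Key  uint32 // hash of name+tags, used for sharding
}

// ingestor shards incoming metrics across workers and flushes on an interval.
type ingestor struct {
	workers []chan Metric
	flush   func() // gathers samplers from every worker and hands them to the sink flusher
}

func (ing *ingestor) Ingest(m Metric) {
	// A given series always hashes to the same worker, so all aggregation
	// for that series happens in exactly one goroutine.
	ing.workers[m.Key%uint32(len(ing.workers))] <- m
}

func (ing *ingestor) run(interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for range t.C {
		ing.flush()
	}
}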

Testing

I wrote a testing framework that lets us express tests as table tests. So while there are only a few top-level tests, we actually capture about 20 different test cases in TestE2EFlushingIngester, TestMixedHistoMerge, and TestMixedHistoSample.
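Concretely, the framework follows Go's standard table-driven pattern; a minimal sketch, with hypothetical case fields standing in for the real ones:

package sketch

import "testing"

func TestE2EFlushingIngester(t *testing.T) {
	cases := []struct {
		name string
		in   []string // metric names to ingest; hypothetical field
		want int      // expected number of flushed series
	}{
		{name: "duplicate counters merge", in: []string{"a.b", "a.b"}, want: 1},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			_ = tc // ingest tc.in through the ingester, trigger a flush, assert on sink output
		})
	}
}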

We could use additional coverage at the ingestor level so that we're not depending on the import path for all of our testing. This will come in a later PR (unless y'all think we should have it now).

We also have the benefit of the old e2e test coverage because those tests should be using the new path as well.

Other things to note

Mixed histogram stuff

  • I wrote a new sampler, MixedHistogramSampler, to capture the mixed histo behavior of "percentiles are aggregated globally, min max count are emitted with host tags". There's a bunch of testing around this.

  • To avoid breakage as we roll out, we must not change the behavior of MixedScope histograms. Therefore, the central veneur does not emit mixed min/max/counts unless a metric is imported with a new scope, MixedGlobal. (See the sketch below.)
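A sketch of that flush rule, with hypothetical names standing in for the real sampler types:

package sketch

type Scope int

const MixedGlobal Scope = iota // the new scope introduced by the PR

type hostAgg struct{ min, max, count float64 }

type histo struct {
	name    string
	perHost map[string]hostAgg
}

func (h *histo) percentile(p float64) float64 { return 0 } // placeholder

// InterMetric is a stand-in for a flushed metric; an empty Host means global.
type InterMetric struct {
	Name  string
	Value float64
	Host  string
}

// Percentiles are emitted once, globally (no host tag); min/max/count are
// emitted per host, and only when the metric was imported as MixedGlobal.
func flushMixedHisto(h *histo, scope Scope) []InterMetric {
	out := []InterMetric{{Name: h.name + ".50percentile", Value: h.percentile(0.5)}}
	if scope == MixedGlobal {
		for host, a := range h.perHost {
			out = append(out,
				InterMetric{h.name + ".min", a.min, host},
				InterMetric{h.name + ".max", a.max, host},
				InterMetric{h.name + ".count", a.count, host},
			)
		}
	}
	return out
}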

Design doc

https://docs.google.com/document/d/1JXHLj0VI1nIiRcNLu3MXvy_qt2V38yvjj6ejW5oa4dA/edit#

r? @aditya-stripe @ChimeraCoder

Motivation

(stripe internal)
https://docs.google.com/document/d/1C1IBh7AbWrHRkAJBbUHM0jx0XKg2lo3ISmffUrzaWT4/edit#

@asf-stripe self-assigned this Nov 12, 2018
@clin-stripe (Contributor, Author) commented:
@asf-stripe feedback: make sure to preserve the e2e local -> central forwarding test.

@clin-stripe changed the title from "WIP Migrate metrics ingestion into new package and have global ingester call it." to "Rewrite metrics ingestion path and integrate with just the central veneur." Nov 20, 2018
@stripe-ci commented:

Gerald Rule: Copy Observability on Veneur, Unilog, Falconer pull requests

cc @stripe/observability
cc @stripe/observability-stripe

@aubrey-stripe (Contributor) commented:

@clin-stripe did you intend to assign me to this PR? What can I do to help here?

@clin-stripe (Contributor, Author) commented:

@aubrey-stripe You mentioned at the falafel that you'd be interested in reviewing so I thought I'd give you a heads up!

@asf-stripe (Contributor) left a review:

OK, I've skimmed this, left comments in places that seemed to need them, and have a few wider-angle questions (besides the failing tests; what's up with those?):

  • I see your work is pulling the Hostname property out from the metrics' tags; that's cool, but I am wondering what we're going to do with the other tags that we currently use to identify metrics from particular subsets of hosts (e.g. our host_type and host_az tag); or let's say I'm confused how the Hostname becomes more important than the others, and how that will relate to what we're going to do when we switch the default on Mixed scope Histograms (I would love for us to not report host_az and host_type on the global histograms, since those are always going to be lies, and at least the AZ tag adds cardinality that we really don't need).

  • There's a comment I left in there about perf-sensitivity; I'd like to make sure we're not accidentally regressing, so do you mind adding benchmarks / running existing ones to compare how local veneur perf will be affected?

Other than that, I think this has the right shape, let's polish it!

// TestE2EFlushingIngester tests the integration of the import endpoint with
// the flushing ingester.
func TestE2EFlushingIngester(t *testing.T) {
for _, tc := range testE2EFlushingCases {
Contributor:

yay table-driven tests!

// }
// })
// }
//}
Contributor:

What are these for? If they're intended to be useful, please uncomment, otherwise delete (:

// PROTOBUF
//

func pbmetrics(ms ...*metricpb.Metric) []*metricpb.Metric {
Contributor:

heh, that's a neat trick for avoiding the []*metricpb.Metric{ ... } syntax grossness.
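For readers following along: the diff view elides the body, but a variadic helper like this plausibly just returns its arguments, letting tests write pbmetrics(m1, m2) instead of []*metricpb.Metric{m1, m2}:

// pbmetrics returns its variadic arguments as a slice.
func pbmetrics(ms ...*metricpb.Metric) []*metricpb.Metric {
	return ms
}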

}
}

func OptLogger(logger *logrus.Logger) ingesterOption {
Contributor:

Believe this is unused, but should be used in server.go, like the OptTraceClient function below.

Contributor:

I'm not sure what happens when you use a nil logger either... does it work, or does it die?
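One way to answer the nil question defensively; a sketch, not the PR's code, assuming a functional-options ingester along these lines:

package sketch

import (
	"io/ioutil"

	"github.com/sirupsen/logrus"
)

type ingester struct{ logger *logrus.Logger }

type ingesterOption func(*ingester)

// OptLogger tolerates a nil logger by substituting a discard logger,
// so a careless call site can't crash the ingester with a nil pointer.
func OptLogger(logger *logrus.Logger) ingesterOption {
	return func(ing *ingester) {
		if logger == nil {
			logger = logrus.New()
			logger.Out = ioutil.Discard
		}
		ing.logger = logger
	}
}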

return ing
}

// TODO(clin): This needs to take ctx.
Contributor:

Agreed, would it expand scope too much to put it in?

Contributor Author:

I'll introduce this in the next PR for adding the ingestor interface for local veneur. This will need context since we'll be making RPCs in that interface.

type aggWorker struct {
samplers samplerEnvelope

inC chan Metric
Contributor:

If I navigated this code correctly, this is a fairly perf-critical bit; we recently (last year? @cory-stripe would know) spent a bunch of time making most channels take pointers to avoid copying. I'm not necessarily saying this should be a pointer channel, but I think I'd like to see a benchmark to ensure we're not regressing.

Contributor Author:

Will write benchmark.

Contributor Author:

@asf-stripe I wrote a benchmark and compared a pointer Metric to a non-pointer Metric in the aggWorker.Ingest() path, but there was no performance difference. On my machine, both clocked in at about 680ns/op.

Contributor:

Yeah, it wasn't always obvious when Go preferred a pointer to avoid copying. We tried it as a blanket thing and didn't see much difference in some cases.
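For reference, the comparison described above boils down to a pair of benchmarks of roughly this shape (a sketch; the Metric fields are hypothetical):

package sketch

import "testing"

type Metric struct {
	Name  string
	Value float64
	Tags  []string
}

func BenchmarkIngestValue(b *testing.B) {
	c := make(chan Metric, 1024)
	done := make(chan struct{})
	go func() {
		for range c { // drain until the channel closes
		}
		close(done)
	}()
	m := Metric{Name: "a.b.c", Value: 1}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c <- m // copies the whole struct into the channel
	}
	close(c)
	<-done
}

func BenchmarkIngestPointer(b *testing.B) {
	c := make(chan *Metric, 1024)
	done := make(chan struct{})
	go func() {
		for range c {
		}
		close(done)
	}()
	m := &Metric{Name: "a.b.c", Value: 1}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c <- m // copies only the pointer
	}
	close(c)
	<-done
}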

if span, ok := opentracing.SpanFromContext(ctx).(*trace.Span); ok {
return log.
WithField("trace_id", span.TraceID).
WithField("span_id", span.SpanID)
Contributor:

This looks interesting, what is it for / how does it look when logging? If it doesn't print hex numbers yet, you might want to make it do that - that's how LS presents them at least, so should make it easier to cross-link between tools.
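If hex is the goal, a small change at the log site would do it; a sketch, assuming the integer ID fields shown in the snippet above:

package sketch

import (
	"context"
	"fmt"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/sirupsen/logrus"

	"github.com/stripe/veneur/trace"
)

func logWithSpanIDs(ctx context.Context, log *logrus.Logger) *logrus.Entry {
	if span, ok := opentracing.SpanFromContext(ctx).(*trace.Span); ok {
		// %x renders the IDs the way the tracing UI shows them.
		return log.
			WithField("trace_id", fmt.Sprintf("%x", span.TraceID)).
			WithField("span_id", fmt.Sprintf("%x", span.SpanID))
	}
	return logrus.NewEntry(log)
}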


tags := map[string]string{"part": "post"}
for _, sinkInstance := range s.sinks {
// TODO(clin): Add back ms once we finalize the ms client pull request.
Contributor:

What's ms?

Contributor Author:

It's supposed to say "metrics", which are back in now; I'll remove the comment.

"github.com/stripe/veneur/trace"
)

type ChannelMetricSink struct {
Contributor:

Neat! I think that dedups a bunch of code from our tests (:
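For context, a test sink of this kind is usually just a channel behind the sink interface; a sketch with a stand-in metric type, not the PR's actual implementation:

package sketch

import "context"

// InterMetric is a stand-in for the flushed-metric type.
type InterMetric struct{ Name string }

// channelSink forwards everything it is asked to flush onto a channel,
// so tests can receive and assert on exactly what was flushed.
type channelSink struct {
	C chan []InterMetric
}

func (s *channelSink) Name() string { return "channel" }

func (s *channelSink) Flush(ctx context.Context, ms []InterMetric) error {
	select {
	case s.C <- ms:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

A test triggers a flush and then reads from C to compare against the expected metrics.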

@clin-stripe (Contributor, Author) commented:

Missed this comment when we went over this together.

> I see your work is pulling the Hostname property out from the metrics' tags; that's cool, but I am wondering what we're going to do with the other tags that we currently use to identify metrics from particular subsets of hosts (e.g. our host_type and host_az tag); or let's say I'm confused how the Hostname becomes more important than the others, and how that will relate to what we're going to do when we switch the default on Mixed scope Histograms (I would love for us to not report host_az and host_type on the global histograms, since those are always going to be lies, and at least the AZ tag adds cardinality that we really don't need).

Hostname is special only because it's used to make grouping decisions. In the special case of MixedHistogram, we need to make sure that hostname is not part of the grouping decision. That is, two mixedhistograms with the same name and tags need to be merged together regardless of their hostname. host_set and host_type, by contrast, are tags that we currently group histograms by.

Just to clarify, we're not planning on dropping mixedhistogram behavior in this change. It would be super nice to drop it of course ... but I think we need to do some work to migrate use cases for this first. Same with dropping host_az.
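In sketch form, the grouping rule described here is just a key function that leaves the hostname out for mixed histograms (names hypothetical):

package sketch

import (
	"sort"
	"strings"
)

// key builds the aggregation key for a metric. Mixed histograms deliberately
// exclude the hostname, so two mixed histograms with the same name and tags
// merge no matter which host reported them; other types still group per host.
func key(name string, tags []string, hostname string, mixedHisto bool) string {
	sorted := append([]string(nil), tags...)
	sort.Strings(sorted)
	k := name + "|" + strings.Join(sorted, ",")
	if !mixedHisto {
		k += "|" + hostname
	}
	return k
}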

@clin88 commented Dec 10, 2019:

y no merge y'all

@clin88 commented Dec 10, 2019:

:'(
