[processor/lsminterval] Define cardinality limits and handle overflows #235
Conversation
Force-pushed from 6aa881a to d47a60c
// Metrics doesn't have overflows (only datapoints have)
// Clone it *without* the datapoint data
Doesn't this mean we risk OOM due to high cardinality metric names? Should the limit be on time series (metric + dimensions) per scope, rather than data points per scope? Or metrics per scope and time series per metric?
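To make the options concrete, the limit hierarchy under discussion could be shaped roughly like the following sketch (hypothetical field names, not the processor's actual config):

type LimitsConfig struct {
	MaxScopesPerResource   int // scopes tracked per resource
	MaxMetricsPerScope     int // bounds metric-name cardinality, addressing the OOM risk above
	MaxDatapointsPerMetric int // bounds time series (attribute combinations) per metric
}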
At a high level I think the approach looks good. As mentioned in a comment, I think we should have limits on all parts of the hierarchy, including metrics.
I ran the benchmarks, and it seems there's quite a drop in throughput.
Yeah, there is a considerable overhead. This is what I am working on right now -- to improve performance and drop some of the unneeded complexity around attribute matching and filtering.
@axw I have addressed all of the comments other than one, and I am marking this ready for review now. The performance consideration is still there; I have a few threads to chase down for improvements, but I expect them to give minor gains, so if you have any ideas (even rough ones) I will be more than happy to chase them down. One of the sources of the extra overhead is creating the lookup maps when unmarshalling the binary data into the Go struct -- previously I did this only when required, but now I always do it to simplify overflow handling a bit. I have left adding metric limits to a follow-up PR. I will also add a few more detailed overflow tests.
Preliminary comments, ran out of time today
Looks good overall, mostly minor comments/questions
// Encode resource tracker at the 0th resource metrics
// TODO (lahsivjar): Is this safe? We don't ever remove
// resource metrics so it should be but best to check.
// Also, the limits checker should ensure max cardinality
// is greater than zero.
Another approach we could take that does not involve piggybacking on attributes would be to first encode the non-overflowed ResourceMetrics, and then encode a tracker for each node in the tree in traversal order -- i.e. one for the Resource, then one for each Scope in that resource, etc.
That may increase the storage size slightly when there are no/few overflows, but it would avoid having to iterate through the attributes when unmarshalling. If we were to do this, we would probably want to add some versioning to the format, e.g. so we can add another level of limits.
Did you consider this? I'm not sure if it's better at all -- what you've done is fairly straightforward. I mostly worry about the performance cost of iterating through attributes when unmarshalling individual batches.
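A minimal sketch of that layout, assuming a hypothetical tracker type (none of these names are from this PR), could look like:

package limits

import (
	"encoding/binary"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

const encodingVersion = 1 // leading version byte so more limit levels can be added later

// tracker is a placeholder for whatever per-node overflow state is kept.
type tracker struct{ overflowCount uint64 }

func (t tracker) appendBinary(b []byte) []byte {
	return binary.AppendUvarint(b, t.overflowCount)
}

// encode writes the version byte, the plain (non-overflowed) pmetric payload,
// and then one tracker per node in depth-first traversal order: each Resource
// followed by each of its Scopes. The trackers slice must be supplied in that
// same traversal order.
func encode(md pmetric.Metrics, trackers []tracker) ([]byte, error) {
	var m pmetric.ProtoMarshaler
	payload, err := m.MarshalMetrics(md)
	if err != nil {
		return nil, err
	}
	b := []byte{encodingVersion}
	b = binary.AppendUvarint(b, uint64(len(payload)))
	b = append(b, payload...)
	for _, t := range trackers {
		b = t.appendBinary(b)
	}
	return b, nil
}

Decoding would then read the trackers back in the same traversal order after unmarshalling the payload, without touching attributes at all.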
Updated the code to encode the tracker separately. This resulted in fewer allocations and better performance. The encoding is pretty straightforward right now, and there is some scope for further improvement, but I don't think it will be a bottleneck.
// Remove any hanging resource or scope which failed to have any entries
// due to children reaching their limits.
Seeing as the scope limit is per resource, and the metric/dp limit is per scope, when would we add an empty resource or scope?
This happens due to metric overflows; I have added a comment to explain this better:

// Remove any hanging metrics, scopes, or resources which failed to have any
// entries due to datapoints overflowing. An overflowing datapoint discards
// its metric and creates a new overflow metric, which might leave the
// original metric and its parents without any datapoints.
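A rough sketch of that cleanup pass using the pdata RemoveIf helpers (the actual code may differ, e.g. in which metric types it covers):

package limits

import "go.opentelemetry.io/collector/pdata/pmetric"

// removeEmptyNodes drops metrics whose datapoints all overflowed, then any
// scopes and resources left hanging as a result.
func removeEmptyNodes(md pmetric.Metrics) {
	md.ResourceMetrics().RemoveIf(func(rm pmetric.ResourceMetrics) bool {
		rm.ScopeMetrics().RemoveIf(func(sm pmetric.ScopeMetrics) bool {
			sm.Metrics().RemoveIf(func(m pmetric.Metric) bool {
				switch m.Type() {
				case pmetric.MetricTypeSum:
					return m.Sum().DataPoints().Len() == 0
				case pmetric.MetricTypeHistogram:
					return m.Histogram().DataPoints().Len() == 0
				case pmetric.MetricTypeExponentialHistogram:
					return m.ExponentialHistogram().DataPoints().Len() == 0
				default:
					return false
				}
			})
			return sm.Metrics().Len() == 0 // scope left empty after metric removal
		})
		return rm.ScopeMetrics().Len() == 0 // resource left empty after scope removal
	})
}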
}
dp := metric.Sum().DataPoints().AppendEmpty()
s.numberLookup[streamID] = dp
return dp, true
Perhaps the caller could be simplified if we called otherDP.CopyTo(dp) here? Then instead of returning true here, we would return true only when the caller should update the returned data point. WDYT? (Same for other metric types of course.)
> Perhaps the caller could be simplified if we called otherDP.CopyTo(dp) here?

The issue with this is the mergeCumulative method, in which we want to copy the datapoint only if it is newer.
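In other words, something along these lines (a hypothetical helper, not the PR's code) has to run on the caller's side, so an unconditional otherDP.CopyTo(dp) inside addDP would clobber a newer stored datapoint:

package limits

import "go.opentelemetry.io/collector/pdata/pmetric"

// mergeCumulativeDP overwrites the stored cumulative datapoint only when the
// incoming one carries a newer timestamp.
func mergeCumulativeDP(to, from pmetric.NumberDataPoint) {
	if from.Timestamp() > to.Timestamp() {
		from.CopyTo(to)
	}
}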
toDP, ok := addDP(toMetricID, fromDP)
if toDP == zero {
	// Overflow, discard the datapoint
	continue
}
if ok {
	// New data point is created so we can copy the old data directly
	fromDP.CopyTo(toDP)
	continue
}
Comparing to zero feels like an unnecessarily expensive way to do this. See my other comment about returning true only when we should update. Then this could become
if toDP, exists := addDP(toMetricID, fromDP); exists {
	switch fromDP := any(fromDP).(type) {
	case pmetric.NumberDataPoint:
		mergeDeltaSumDP(fromDP, any(toDP).(pmetric.NumberDataPoint))
	case pmetric.HistogramDataPoint:
		mergeDeltaHistogramDP(fromDP, any(toDP).(pmetric.HistogramDataPoint))
	case pmetric.ExponentialHistogramDataPoint:
		mergeDeltaExponentialHistogramDP(fromDP, any(toDP).(pmetric.ExponentialHistogramDataPoint))
	}
}
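(For context, the any(...) conversions are the usual workaround here: Go does not allow a type switch directly on a value whose type is a generic type parameter, so the value is first converted to any.)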
Bringing this back to draft, as I have a few good ideas for optimization, and combined with @axw's comments above I have a few more things to do here. (I have also pushed a different way to encode limit trackers, independent of the pmetric data structure.)
(deleted the above comment on performance improvements as they were void and incorrect) I have made a silly mistake in the
After the above changes, below is the benchmark diff from `main`:
TL;DR: there is still ~20% performance degradation with overflows. We could eliminate the overflow path when no overflow limits are defined, but that would only give us a false sense of improvement since, for our use case, we would always have overflows defined. I have another PR open which optimizes the pebble options for our use case; that will bring improvements to this PR too, but the diff between with and without overflow will remain approximately the same.
Related to #141. For more details, see #141 (comment).