# [processor/lsminterval] Handle overflow for metrics aggregations #141
### Current overflow handling in APM Server using apm-aggregation

APM aggregation implements overflow handling logic based on a set of factors, each with its own threshold. These factors are configured in APM Server based on the available resources, or hard-coded if the use case is known. By default, APM Server sets all the limits based on the available memory using a linear scaling formula.

#### Maximum number of services

The number of unique services, for the purpose of aggregated metrics, is defined as the cardinality of a set of labels identified as the service aggregation key.

When the max service limit for an aggregation interval is reached, a new overflow service bucket is created. This overflow service bucket acts as a catch-all for all overflowing aggregated metrics, i.e. in addition to the service count, the overflow service also records service transaction, transaction, and span metrics in their corresponding data types (histograms, aggregate_metric_double, or counters) with dedicated overflow identifiers.

#### Maximum number of service transactions

Service transactions are the number of unique transaction types within a service (this is based on the current key definition and can change in the future). The number of unique service transactions, for the purpose of aggregated metrics, is defined as the cardinality of a set of labels identified as the service transaction aggregation key.

When the maximum service transaction limit within a service is reached, a new overflow metric is created for the service that reached the limit.

#### Maximum number of transactions

Transactions are the number of unique transaction keys within a service, where the transaction key is defined over a set of transaction-level fields.

When the maximum transaction limit within a service is reached, a new overflow metric is created for the service that reached the limit.

#### Maximum number of spans

Spans are the number of unique span keys within a service, where the span key is defined over a set of span-level fields.

When the maximum span limit within a service is reached, a new overflow metric is created for the service that reached the limit.
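As an illustration of the catch-all behaviour described above, here is a minimal Go sketch of a size-limited set of per-service buckets with a single overflow bucket. The type and field names are made up for the example and are not taken from apm-aggregation.

```go
package main

import "fmt"

// serviceKey is an illustrative stand-in for the service aggregation key.
type serviceKey struct {
	Name, Environment, Language, AgentName string
}

// limitedAggregator keeps at most maxServices distinct service buckets;
// anything beyond that is folded into a single catch-all overflow bucket,
// which would be reported with an overflow identifier such as service.name=_other.
type limitedAggregator struct {
	maxServices int
	buckets     map[serviceKey]int64 // int64 stands in for real aggregated state (histograms, counters, ...)
	overflow    int64
}

func newLimitedAggregator(maxServices int) *limitedAggregator {
	return &limitedAggregator{maxServices: maxServices, buckets: make(map[serviceKey]int64)}
}

// observe records one measurement against a service, overflowing if the
// per-interval service limit has already been reached.
func (a *limitedAggregator) observe(k serviceKey, value int64) {
	if _, ok := a.buckets[k]; !ok && len(a.buckets) >= a.maxServices {
		a.overflow += value
		return
	}
	a.buckets[k] += value
}

func main() {
	agg := newLimitedAggregator(2)
	agg.observe(serviceKey{Name: "checkout"}, 1)
	agg.observe(serviceKey{Name: "payments"}, 1)
	agg.observe(serviceKey{Name: "search"}, 1) // third service: goes to the overflow bucket
	fmt.Println(len(agg.buckets), agg.overflow) // prints: 2 1
}
```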
### Proposal for overflow handling in LSM interval processor

The signal-to-metrics connector and the LSM interval processor together provide the OTel-native way to do the aggregations required by APM. The signal-to-metrics connector has the role of extracting the aggregated metrics from the incoming signals, and the LSM interval processor simply aggregates those metrics for the specified aggregation intervals. A sample configuration used for both components to produce APM aggregated metrics can be perused here.

The OTel implementation differs from the APM implementation in that the aggregated metrics cannot be first-class entities. Instead, the aggregated metrics have to be defined like any other metrics and processed by the components. In order to formalize overflow handling with the OTel data model, we have to identify the problem that the overflow handling in APM solves.

#### Problem

The main purpose of aggregating metrics is to reduce storage and query costs by doing aggregations during the ingestion process. Aggregation shines when the cardinality of the attributes being aggregated is bounded. High-cardinality attributes blow up the memory requirements, increasing the resource requirements for aggregation during ingestion. Unbounded cardinality makes things worse, and aggregations over longer intervals become almost impractical. Unbounded/high cardinality is usually due to bugs in instrumentation or simply bad instrumentation.

#### Assumptions

#### Proposal

Solving the problem of unbounded/high-cardinality attributes requires identifying that there is an issue and reporting it to the owners of the instrumentation. To this end, our OTel pipeline needs a way to identify cardinality issues during aggregation. Keeping the above assumptions in mind, the proposal acts like a cardinality limiter over a defined set of attributes, with overflow buckets for when the cardinality exceeds the defined limits. (The configuration names might be a bit confusing; I will update them with better names as we evolve this.)
```yaml
limits:
  - action: oneOf{"drop", "overflow"} # if drop is configured, the metrics exceeding the limit are simply dropped
    resource_attributes: # A list of resource attributes over which to apply the cardinality limits
      - key: <string> # If the resource attribute is not present in the input then an empty value is used for the cardinality calculation
    scope_attributes: [] # A list of scope attributes over which to apply the cardinality limit, empty means no limits
    datapoint_attributes: [] # A list of datapoint attributes over which to apply the cardinality limit
    max_size: <int> # the max cardinality for the above set of attributes
    # The below configuration is only used if action is `overflow`
    overflow: # Defines how the overflow buckets/resource metrics will be constructed
      resource_attributes: # A list of static resource attributes to add to the overflow buckets
        - key: <string>
          value: <any>
      scope_attributes: []
      datapoint_attributes: []
```

A few points to note:
---
Thanks for the writeup @lahsivjar. Seems reasonable overall. It feels a bit awkward that the metric definitions & the overflow logic live in two different places, but I'm not sure if there's a better option.

---
What would help me understand this a bit better is an example configuration of both the signaltometrics connector and the lsminterval processor together.

One aspect of the APM aggregations that I don't see mentioned, and I'm not sure is possible in the proposal, is the protection against a single service consuming most of the cardinality limit.

Aside from that, I'm wondering if it could make the configuration a bit more concise to combine the limit's attribute list with the overflow definition, e.g.:

```yaml
resource_attributes:
  - key: <string>
    overflow_value: <any> # optional
```
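For illustration only, the combined shape suggested above could map onto a collector-style config struct roughly like the following; the type and field names here are hypothetical and are not the processor's actual configuration.

```go
package config

// LimitConfig is a hypothetical representation of a single cardinality limit,
// with the overflow value folded into each attribute entry as suggested above.
type LimitConfig struct {
	Action              string            `mapstructure:"action"` // "drop" or "overflow"
	MaxSize             int               `mapstructure:"max_size"`
	ResourceAttributes  []AttributeConfig `mapstructure:"resource_attributes"`
	ScopeAttributes     []AttributeConfig `mapstructure:"scope_attributes"`
	DatapointAttributes []AttributeConfig `mapstructure:"datapoint_attributes"`
}

// AttributeConfig names one attribute that participates in the cardinality key.
// OverflowValue, if set, is the static value used for this attribute in the
// overflow bucket, which removes the need for a separate overflow section.
type AttributeConfig struct {
	Key           string `mapstructure:"key"`
	OverflowValue any    `mapstructure:"overflow_value"`
}
```

---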
Below I have put a sample configuration based on the proposal:

```yaml
signaltometrics:
  spans:
    - name: transaction.duration.histogram
      description: APM service transaction aggregated metrics as histogram
      include_resource_attributes:
        - key: service.name
        - key: deployment.environment
        - key: telemetry.sdk.language
        - key: agent.name
      attributes:
        - key: transaction.root
        - key: transaction.type
        - key: metricset.name
          default_value: service_transaction
        - key: elasticsearch.mapping.hints
          default_value: [_doc_count]
      unit: us
      exponential_histogram:
        value: Microseconds(end_time - start_time)
lsminterval:
  intervals:
    - duration: 60m
      statements:
        - set(resource.attributes["metricset.interval"], "60m")
        - set(attributes["data_stream.dataset"], Concat([attributes["metricset.name"], "60m"], "."))
        - set(attributes["processor.event"], "metric")
  limits:
    - action: overflow
      resource_attributes:
        - key: service.name
        - key: deployment.environment
        - key: telemetry.sdk.language
        - key: agent.name
      max_size: 100 # Only 100 services can exist, more than that will overflow
      overflow: # Defines what attributes the service overflow will have
        # we will also need single-writer handling as we do in signaltometrics
        resource_attributes:
          - key: service.name
            value: _other
    - action: overflow
      resource_attributes:
        - key: service.name
        - key: deployment.environment
        - key: telemetry.sdk.language
        - key: agent.name
      datapoint_attributes:
        - key: transaction.root
        - key: transaction.type
        - key: metricset.name
      max_size: 1000
      overflow: # Defines what attributes the overflow aggregation will have
        # we will also need single-writer handling as we do in signaltometrics
        datapoint_attributes:
          - key: transaction.type
            value: _other
```
In the proposal, the cardinality limits are applied to a defined set of attributes, and we don't have global limits yet. For apm-aggregation, we have the following limits (ref), where ✅ means the above proposal covers it and ⛔ means it doesn't:

We could apply a global configuration on the limits to take care of this (I proposed this here in my notes too).

I am working on fleshing this out a bit more, as well as thinking about some other mechanisms to reach our end goal. I have an idea to make this work with the current configuration, but I'm not sure about the details; I will follow up on this. @felixbarny WDYT? Does it make sense so far?

---
I think what's missing is a way to enforce isolation/partitioning of services. I'd like to see this as a first-class citizen of the lsmintervalprocessor. The current proposal has a noisy-neighbor problem where a few high-cardinality services can consume aggregation buckets that other services would otherwise get.

I think another big benefit of creating isolated partitions for each service is that the merging process of the LSM can work on smaller units, so that the peak memory requirement can be a lot lower. IIUC, the main source of memory usage is loading all metrics stored in segment files into main memory during the compaction process.

What I'm thinking is that you can define a partitioning key, for example the service. Within each partition (service), we can then limit the cardinality of the different metrics and create overflow buckets. But one partition can never take away resources (aggregation buckets) from another partition.

This also helps to ensure we never run out of memory, as it creates an upper bound for the memory required by a single partition. Based on that, we can calculate how many partitions a single instance can handle per GB of allocated memory.
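As a rough back-of-the-envelope illustration of that last point, here is a small sketch; all numbers below are assumptions for the example, not measurements of the processor.

```go
package main

import "fmt"

func main() {
	// Assumed per-partition upper bound: a cap on aggregation buckets per
	// service times an approximate footprint per bucket (e.g. one histogram).
	const maxBucketsPerService = 2000 // assumption
	const bytesPerBucket = 2 * 1024   // assumption: ~2 KiB per bucket
	const perPartitionBound = maxBucketsPerService * bytesPerBucket

	const memoryBudget = 1 << 30 // 1 GiB allocated to aggregation state
	fmt.Println("partitions per GiB:", memoryBudget/perPartitionBound) // 262 with these assumptions
}
```

---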
Discussed this with @felixbarny IRL:
The current proposal doesn't apply any global limit, so each service (or each distinct set of attribute values) gets its fair share of aggregation buckets. This means the global limit is implicitly defined by the other limits. Another point to note is that, as per the current proposal, the limits are applied in the order they are defined, so the order is important.

In addition to this, we also discussed some other points:

#### Not limiting the number of services for the aggregation

The basic idea here is to have no overflow for the number of services, i.e. each service gets its own aggregation bucket, while the aggregation buckets within a service are bounded. With the current design of the processor we would end up with unbounded memory usage if we adopted this; however, we can make it work with some improvements to the component.

Since the memory usage of the LSM interval processor is proportional to the maximum memory requirement of a partition, we get a bounded memory requirement if the service is used as the partitioning key and overflows are properly defined within a service. This assumes that merges and harvests performed by the LSM aggregator use a concurrency level of 1 (i.e. only one merge/harvest operation at a time).

#### Scaling collectors performing aggregations

Horizontally scaling collectors configured with the LSM interval processor is challenging because resource usage cannot always be reduced proportionally to the number of collector instances performing aggregations. For example: if data for a specific service is sent to all replicas of the collector and we are aggregating using histograms, then the same memory is required for each service on every replica, even though each replica handles a lower throughput of data. This could be addressed by consistent-hash-based routing to collector replicas, or consistent-hash-based partition assignment in the case of Kafka receivers. Alternatively, if we use the service as the partitioning key and overflows are bounded within a service, then we might be able to keep the memory requirement reasonably low and bounded, and not care much about this point.

#### Utilizing the memory limiter processor instead of overflows

The idea here is to piggyback on the memory limiter processor in a way that makes overflows redundant. Since the memory limiter processor can limit the memory used by the collector, we can partition the incoming data so that the limiter can be configured to push back to the receivers for partitions that are using too much memory. This would also allow the limits to be set dynamically based on the number of partitions identified. A major drawback is how pushback to receivers could be performed based on customizable partitions (defined by sets of attributes). The idea also falters for async cases where pushing back to receivers could lead to data drops; however, if receivers are backed by persistence then we can address this.
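To illustrate the hash-based routing idea, here is a minimal sketch that deterministically assigns a service to one of N collector replicas. The key fields and replica count are assumptions for the example; a real deployment would rather rely on something like the collector's loadbalancing exporter or Kafka partition assignment, and plain modulo hashing, unlike true consistent hashing, remaps many services when the replica count changes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// replicaFor maps a service identity to one of n replicas so that all data
// for a given service is aggregated by the same collector instance.
func replicaFor(serviceName, environment string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(serviceName))
	h.Write([]byte{0}) // separator to keep distinct keys from colliding
	h.Write([]byte(environment))
	return h.Sum32() % n
}

func main() {
	const replicas = 4
	for _, svc := range []string{"checkout", "payments", "search"} {
		fmt.Printf("%s -> replica %d\n", svc, replicaFor(svc, "prod", replicas))
	}
}
```

---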
Zooming out a bit, what we're trying to achieve is being able to handle scale within a service (high throughput and/or high-cardinality metrics) as well as to scale the number of services we can aggregate in a multi-tenant environment. I think these are the most important aspects to keep in mind for that:

Based on that, I'm not sure the memory limiter approach, which applies back pressure for high-cardinality services instead of creating overflow buckets, fulfills these requirements. Once a service produces high-cardinality metrics and we're pushing back, it's unlikely that it'll produce metrics with less cardinality in the future. Therefore, we may never be able to catch up with aggregating metrics for the current interval. We also don't necessarily want to have to scale out just to keep misbehaving services from consuming too many resources.

---
I think "guarantee" here is too strict and a bit out of scope for the component. We would probably need to dynamically impose limits based on the available memory to achieve this, which I am not sure is a good idea.
Agreed!
Hmm, it is possible that this happens if the pebble throughput cannot match the input throughput (this would be a factor of disk IO and other parameters related to the input), but I am not sure this is a problem we should be solving in the component. It is more of a scaling problem, and anyone trying to push the component to this limit should be ready to scale horizontally IMO.

Completely agree with doing away with any limits outside of the service boundary, but I don't think we need to introduce the service as a first-class citizen for the limit. It is our use case that doesn't want limits outside the service boundary, but the component should be able to impose limits in general. On a similar note, the ability to horizontally scale is outside the scope of the component IMO. I say this because naively horizontally scaling the component might not always distribute the resource usage (especially memory); it will need to be handled per workload, for example by using Kafka in between with hash-based partitioning, or by using approaches to distribute services between the available instances.
+1 on this, I think we can drop the memory limiter approach.

#### Updated proposal

I still think we can continue with the proposal mentioned here, as it allows us to define limits in a generic way that covers the needs discussed above.

Let me know if the proposal makes sense so far, or if we need to address any of the above (or any other) points in more detail.

#### Next steps
---
I think the proposal above sounds broadly sensible: have a limit on services per project, all other limits are per service. I'd be keen to see a PoC on this.

---
Sorry, I wasn't very clear. In my last comment, I didn't want to imply that all of these aspects would need to be tackled just by the lsm interval processor. These are the end-to-end aspects that the whole system needs to have. Some of them may be in scope for the lsm processor, some may be a part of how we partition the data in the queue. But we should be clear about which aspects are handled where, in a holistic way, so that when we put the different pieces together, we're not missing an important aspect.

Also, it's totally fine to have an iterative approach to this as long as we're not painting ourselves into a corner. Think of it as a kind of acceptance criteria for the end-to-end solution, where any implementation that satisfies these requirements is acceptable. The intention was to give you more freedom in choosing the right approach by laying out the requirements and constraints more explicitly.

I think the proposal looks good, but I wanted to make sure we're also thinking about how the aspects that aren't handled directly by the lsm interval processor fit in.

---
I was thinking that this would limit the number of services a single instance can aggregate, not limit the number of services a project can have. I'm not sure if we want to limit the number of services per project due to:

---
When I said "have a limit on services per project", I meant only in the context of an instance of the aggregator.

---
Quick update on this: I am working on a PoC implementation and it should be ready in a couple of days.

---
Apologies for the delay. I had some hiccups in the implementation related to the complexity of the above model and its performance (especially the overhead of encoding the multiple limits proposed above). I made some simplifications in the overflow handling model:

Here is a draft PR for this model. Note that the PR is still WIP and requires some refactoring and optimizations before it is in a mergeable state (assuming we agree on the approach), but it is in a working state. I have also added some simple tests with different metric types to show how the overflow happens for datapoints.

---
@lahsivjar I think this approach sounds fine; I will need to think more on it as I review the code. If users need more fine-grained control over the limits, perhaps we could add support for conditions to the processor so it only processes certain metrics? Then users could create multiple instances of the processor for different sets of metrics, each with their own limits. I don't think you would be able to share limits across instances though, so it's not a perfect replacement.

---
Aggregations could be unbounded due to huge cardinality or buggy instrumentation. To protect against this, the aggregated metrics should have limits and should overflow into specific buckets once those limits are breached. This would be similar to what is implemented in https://github.com/elastic/apm-aggregation.