
[processor/lsminterval] Behaviour of timestamp for aggregated metrics #158

Open
lahsivjar opened this issue Oct 3, 2024 · 5 comments

@lahsivjar
Contributor

lahsivjar commented Oct 3, 2024

The LSM interval processor creates aggregated metrics without using the timestamp of the incoming events: the timestamp of the aggregated metric is set to the latest timestamp among the aggregated data points. This behaviour comes from the upstream intervalprocessor, on which the lsmintervalprocessor is based.

OTOH, apm-aggregation uses the event timestamp, truncated to the aggregation interval, for bucketing data, so old data is grouped into a separate bucket. This means that aggregations for old spans would get a new timestamp in OTel.
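To make the contrast concrete, here is a rough sketch of the two timestamp choices (function names are invented for illustration and not taken from either codebase):

```go
package main

import (
	"fmt"
	"time"
)

// apm-aggregation style: bucket by the event timestamp truncated to the
// aggregation interval, so late events land in an "old" bucket.
func truncatedEventTimestamp(eventTS time.Time, interval time.Duration) time.Time {
	return eventTS.Truncate(interval)
}

// intervalprocessor / lsmintervalprocessor style: the aggregated metric keeps
// the latest timestamp seen among the aggregated data points.
func latestTimestamp(current, incoming time.Time) time.Time {
	if incoming.After(current) {
		return incoming
	}
	return current
}

func main() {
	event := time.Date(2024, 10, 3, 10, 0, 42, 0, time.UTC)
	fmt.Println(truncatedEventTimestamp(event, time.Minute)) // 2024-10-03 10:00:00 +0000 UTC
}
```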

@felixbarny
Member

Maybe this is something that should be configurable (event timestamp vs arrival timestamp).

If we truncate the event timestamp, we'll need to add another dimension for late arrivals. Otherwise, we could create another data point for the same timestamp for the same time series, leading to duplicate rejections.

This was one of the reasons that made it hard for us to adopt TSDB for APM.

@lahsivjar
Contributor Author

If we truncate the event timestamp, we'll need to add another dimension for late arrivals. Otherwise, we could create another data point for the same timestamp for the same time series, leading to duplicate rejections.

Good point! It seems duplicate rejections would always be an issue with the APM logic for late arrivals that didn't make it into their real-time aggregation bucket. The current lsmintervalprocessor doesn't use the timestamp as an aggregation dimension, but if we wanted to, we would have to truncate the event timestamp, making data rejection much more likely. We could add another dimension, such as the current truncated aggregation processing window, which would make the data point unique, but that would mean a new time series for every aggregation period... not sure if this is a good idea.

Alternatively, if we leave things as they are, i.e. aggregate with the arrival timestamp, the resulting data might be a bit skewed. We could decide to only aggregate late arrivals up to a limit to minimize the skew, but I'm not sure that is any better.

@felixbarny
Member

What I'm thinking of is to add a numeric dimension (for example, named offset) that indicates how "late" the data is. For example, if we aggregate spans for the current time, the offset would be 0. For the 1m bucket, if we aggregate a span from the past minute, the offset is 1, and so on. That way, we "only" create a new time series for actually late-arriving data.
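A minimal sketch of how such an offset could be computed (the helper name and shape are made up for illustration, not part of lsmintervalprocessor); the resulting value would then be written as a data point attribute, so only genuinely late data creates a new time series:

```go
package main

import (
	"fmt"
	"time"
)

// lateOffset is a hypothetical helper: it returns the number of whole
// aggregation intervals between the event timestamp and the interval
// currently being aggregated.
func lateOffset(eventTS, now time.Time, interval time.Duration) int64 {
	current := now.Truncate(interval)
	event := eventTS.Truncate(interval)
	if !event.Before(current) {
		return 0 // data for the current (or a future) interval is not late
	}
	return int64(current.Sub(event) / interval)
}

func main() {
	now := time.Date(2024, 10, 4, 10, 5, 30, 0, time.UTC)
	onTime := now.Add(-10 * time.Second) // same 1m bucket -> offset 0
	late := now.Add(-1 * time.Minute)    // previous bucket -> offset 1
	fmt.Println(lateOffset(onTime, now, time.Minute), lateOffset(late, now, time.Minute))
}
```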

While that would work well with delta temporality, I don't think it would really work with cumulative temporality.

@lahsivjar
Contributor Author

lahsivjar commented Oct 4, 2024

Hmm, that's a neat trick, which gives me another idea. We could use the concept of an offset but, instead of encoding it as a separate attribute, encode it in the timestamp. If we were to emit the event timestamp for aggregated metrics, we would truncate the UNIX timestamp to the given interval. This gives us exact points in time for a 0-offset aggregated metric. For any late arrival, we can calculate its offset and add that many milliseconds to the timestamp. Even with an aggregation interval of 1 minute, we would be able to accommodate up to 1000 hours' worth of late arrivals.
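A minimal sketch of that encoding, assuming 1ms per interval of lateness (names are invented for illustration, not existing lsmintervalprocessor code); with a 1m interval there are 60,000 millisecond slots inside each bucket, i.e. 60,000 minutes = 1000 hours of accommodated lateness, which is where the figure above comes from:

```go
package main

import (
	"fmt"
	"time"
)

// encodedTimestamp is a hypothetical helper: the emitted timestamp is the
// event timestamp truncated to the aggregation interval, plus one
// millisecond per interval of lateness.
func encodedTimestamp(eventTS, now time.Time, interval time.Duration) time.Time {
	bucket := eventTS.Truncate(interval)
	current := now.Truncate(interval)
	var offset time.Duration
	if current.After(bucket) {
		intervalsLate := current.Sub(bucket) / interval // e.g. 3 for data 3 intervals old
		offset = intervalsLate * time.Millisecond
	}
	return bucket.Add(offset)
}

func main() {
	now := time.Date(2024, 10, 4, 10, 5, 0, 0, time.UTC)
	late := now.Add(-3 * time.Minute) // 3 intervals late -> +3ms
	fmt.Println(encodedTimestamp(late, now, time.Minute)) // 2024-10-04 10:02:00.003 +0000 UTC
}
```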

@felixbarny
Member

That won't allow us to accept timestamps that are arbitrarily late. But TSDB already has limitations on late-arriving data, so that's probably fine. Also, if we use nanosecond precision timestamps, it will be even less of an issue.

Maybe we should have a configurable set of strategies, as the ideal strategy probably depends on the capabilities of the backend (nanosecond support, support for late arrivals, etc.).
