
ref(transactions): add functionality to sample transactions in save_event_transaction #81077

Open
wants to merge 6 commits into base: master

Conversation

JoshFerge
Member

first PR of #81065

no behavior changes until the option is turned on. post_process needs to change as well before we can turn the option on; the moved logic is the equivalent of the if is_transaction_event: branch in post_process.
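
Concretely, the gate being added looks roughly like this (a minimal sketch; do_transaction_post_process_work is a hypothetical stand-in for the moved logic, and in_rollout_group is the helper that appears later in the diff):

# in save_event_transaction, after the event has been saved:
if in_rollout_group("transactions.do_post_process_in_save", event_id):
    # the same work the `if is_transaction_event:` branch does in post_process
    do_transaction_post_process_work(jobs, projects)  # hypothetical helper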

@lynnagara
Member

What are the implications of transactions sampling not happening exactly once?

Won't this potentially sample some messages in flight twice, or not at all, depending on the direction of the option change?

@lynnagara
Member

I think #81079 could probably be part of this same PR, otherwise all this option does is double sample.

@JoshFerge
Member Author

I think #81079 could probably be part of this same PR, otherwise all this option does is double sample.

@lynnagara the goal of the PRs is for them to be reviewed separately, not deployed separately. I would never turn the option on without the other one deployed.

@JoshFerge
Member Author

What are the implications of transactions sampling not happening exactly once?
Won't this potentially sample some messages in flight twice, or not at all, depending on the direction of the option change?

from reading about the transaction sampler, I believe it is a feature that helps with parameterizing high-cardinality transaction names. I don't believe double sampling (which wouldn't happen as I've implemented this) or missing samples (which would be the case for a few moments during this deploy) would have a large effect on the product.

https://develop.sentry.dev/api-server/application-domains/transaction-clustering/#accidental-erasure-of-non-identifiers
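
For intuition, the clusterer's job is roughly the following (a conceptual illustration only, not Sentry's actual implementation):

# sampled raw transaction names are collected (in Redis) ...
samples = [
    "/users/123/posts/456",
    "/users/789/posts/42",
    "/users/555/posts/9",
]
# ... and periodically clustered into replacement rules that collapse
# high-cardinality path segments:
rule = "/users/*/posts/*"

Missing one sampling window just means a rule is derived from slightly fewer names, which is why a brief gap during the deploy is tolerable.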

@jjbayer jjbayer (Member) left a comment

We originally put the "record transaction" logic in post process to move it off the critical path. But if that task is going to disappear completely, this makes sense.

src/sentry/options/defaults.py (review thread outdated, resolved)
src/sentry/event_manager.py (review thread outdated, resolved)
@JoshFerge
Member Author

JoshFerge commented Nov 21, 2024

We originally put the "record transaction" logic in post process to move it off the critical path. But if that task is going to disappear completely, this makes sense.

Yeah, that makes sense. Unfortunately, at this point the load we're putting on RabbitMQ and the potential instability it causes aren't worth having a whole separate task for transactions. I think once we have more durable execution with task workers, it's certainly worth investigating breaking it up again.

tests/sentry/event_manager/test_event_manager.py (review thread outdated, resolved)
tests/sentry/event_manager/test_event_manager.py (review thread outdated, resolved)
@@ -1540,6 +1541,133 @@ def test_transaction_event_span_grouping(self) -> None:
# the basic strategy is to simply use the description
assert spans == [{"hash": hash_values([span["description"]])} for span in data["spans"]]

@override_options({"transactions.do_post_process_in_save": True})
Member

What value does this test add in contrast with the next one? I think they're the same, but the next one tests for the two calls.

Member Author

there is a comment below, but: the next test ensures that the functions get called with the correct mocks, while this one ensures that the functions themselves run correctly and don't error. (I didn't see a great way to validate the side effects those functions cause, so I have two.)
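
Roughly, the two tests look like this (a sketch; make_transaction_data is a hypothetical fixture helper, and the mock.patch target path is an assumption):

from unittest import mock

@override_options({"transactions.do_post_process_in_save": True})
def test_save_transaction_post_process_logic_runs(self):
    # end to end: passes as long as the moved functions don't raise
    self.store_event(data=make_transaction_data(), project_id=self.project.id)

@override_options({"transactions.do_post_process_in_save": True})
def test_save_transaction_post_process_logic_called(self):
    # wiring: the moved function is actually invoked exactly once
    with mock.patch(
        "sentry.event_manager.record_transaction_name_for_clustering"
    ) as mock_cluster:
        self.store_event(data=make_transaction_data(), project_id=self.project.id)
        assert mock_cluster.call_count == 1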

@JoshFerge JoshFerge changed the title ref(transactions): add functionality to sample transactions in save_event_transac ref(transactions): add functionality to sample transactions in save_event_transaction Nov 21, 2024
@markstory markstory (Member) left a comment

Looks good with #81079 so that we aren't double sampling.

…_process (#81079)

in lockstep with #81077, stops
doing work in post_process on transactions. part of
#81065
@JoshFerge
Member Author

Looks good with #81079 so that we aren't double sampling.

merged that PR into this one so it's easier to read / so we can ensure they go out together. (won't do anything until the option is modified)

@JoshFerge
Member Author

rollout strategy: #81065 (comment)

):
    continue
project = job["event"].project
record_transaction_name_for_clustering(project, job["event"].data)
Member

I think this is probably fine, although I don't really know how this sampling is used. If this feature relies on the transaction being indexed in Snuba, there could be downstream problems.

Member Author

confirmed it does not rely on the transaction being indexed -- this system uses redis only.

@wedamija
Member

We also log

duration = time() - event.data["received"]
metrics.timing(
    "events.time-to-post-process",
    duration,
    instance=event.data["platform"],
    tags=metric_tags,
)

as part of post process. I'm not sure if there are any Datadog monitors or stat pages that will be affected by removing this. Might be worth adding this to the end of save event for transactions?
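
That would amount to something like this at the end of the transaction save path (a sketch; the tag values and exact placement are assumptions):

# at the end of save_transaction_events, per job:
for job in jobs:
    event = job["event"]
    duration = time() - event.data["received"]
    metrics.timing(
        "events.time-to-post-process",  # or a new save-path metric name
        duration,
        instance=event.data["platform"],
        tags={"type": "transaction"},  # placeholder for post_process's metric_tags
    )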

@lynnagara
Member

lynnagara commented Nov 21, 2024

We have to be very careful about how this impacts the performance of save event. I think we should look into a couple of things before making this change:

  • Can we collect timings on transaction post-processing first, before we move it into save event? Whether this is safe to do depends on the timing. If it's above, say, 10 milliseconds, I think we should consider some other options. (see the timing sketch after this list)
  • Secondly, signals aren't allowed in the save event task, as it is too easy for additional work to be added there. Can we remove those?
  • The volume on this pipeline is expected to increase up to fourfold in the coming months, and we would need to understand what's needed to scale if the save event transactions job gets slower.
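
For the first point, the timing could be collected in post_process before anything moves, along these lines (a sketch; the metric name is hypothetical, and metrics.timer as a context manager is assumed):

from sentry.utils import metrics

# inside post_process, wrapped around the transaction-only work:
with metrics.timer("post_process.transaction_logic"):  # hypothetical metric name
    record_transaction_name_for_clustering(project, event.data)
    # signal / onboarding-task dispatch would be timed here too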

@wedamija
Member

We have to be very careful about how this impacts the performance of save event. I think we should look into a couple of things before making this change:

  • Can we collect timings on transaction post-processing first, before we move it into save event? Whether this is safe to do depends on the timing. If it's above, say, 10 milliseconds, I think we should consider some other options.
  • Secondly, signals aren't allowed in the save event task, as it is too easy for additional work to be added there. Can we remove those?
  • The volume on this pipeline is expected to increase up to fourfold in the coming months, and we would need to understand what's needed to scale if the save event transactions job gets slower.

I agree we should time it to be safe, but I think it's unlikely that it takes very long compared to

_detect_performance_problems(jobs, projects)

which we do in save_transaction_events. That call does a lot of relatively expensive analysis on all the details of the transaction.

@JoshFerge JoshFerge (Member Author) left a comment

TODO: let's use two options for the rollout

@JoshFerge
Member Author

we spent some time looking at the performance of the signals / transaction clustering on current telemetry in post_process, and both looked to be sub-10ms on average, with no large outliers. this PR includes timing telemetry so we can monitor perf in S4S to confirm before any wider rollout takes place.

also confirmed that transaction clustering doesn't look at anything in ClickHouse; it only uses Redis.

i've split the option used to roll this out into two, so we can roll out by turning on transactions.do_post_process_in_save and confirming things look good (during this time we'll also do the same work in post_process, but doing things twice here is fine).

after we confirm things look good, we can flip on transactions.dont_do_transactions_logic_in_post_process, which will stop doing the work in post_process.
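
Sketched out, the two gates would look roughly like this (option names per the comment above; the helper is a hypothetical stand-in):

# step 1 -- in save_event_transaction (duplicates post_process for a while):
if options.get("transactions.do_post_process_in_save"):
    do_transaction_post_process_work(jobs, projects)  # hypothetical helper

# step 2 -- in post_process, once step 1 looks healthy:
if is_transaction_event and options.get(
    "transactions.dont_do_transactions_logic_in_post_process"
):
    return  # skip the now-duplicated transaction logic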


codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #81077      +/-   ##
==========================================
+ Coverage   78.48%   80.36%   +1.87%     
==========================================
  Files        7215     7217       +2     
  Lines      319812   319896      +84     
  Branches    44045    20741   -23304     
==========================================
+ Hits       251012   257078    +6066     
- Misses      62411    62426      +15     
+ Partials     6389      392    -5997     

@lynnagara
Member

lynnagara commented Nov 21, 2024

Currently there are 3 things that happen in post processing:

  • transaction clustering
  • signal + onboarding tasks
  • delete from rc-processing

I think instead of splitting these under different options, you can combine them under a single option, transactions.do_post_process_in_save, and check that option only in save_event_transaction, not in post processing at all. I believe this works without any additional checks in post process, because if you do everything here, including the deletion, then post process will never run anyway. The relevant line that prevents this from being re-post-processed is here:

data = processing_store.get(cache_key)

I think there are a few benefits of this approach:

  • Consistency. You get exactly-once processing without much work. You don't have a minute or two, while the option gets flipped, during which events in the middle of the pipeline may skip this processing.
  • Less complexity.
  • All combinations of options are valid. There is only one option, and turning it on and off is harmless. It's easy for anyone (like SRE) to flip it back in an incident. The other way requires more knowledge of the code to ensure an invalid combination isn't picked.

What do you think?
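
Sketched out, the suggested single-option flow looks roughly like this (names taken from this PR's diff; surrounding code elided):

# in save_event_transaction, gated by the one option:
if in_rollout_group("transactions.do_post_process_in_save", event_id):
    record_transaction_name_for_clustering(project, event.data)
    # ... fire signals / onboarding tasks here ...
    if cache_key:
        processing_store.delete_by_key(cache_key)  # delete from rc-processing

# post_process needs no new check: it begins with
#     data = processing_store.get(cache_key)
# and already bails out when the entry was deleted above.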

@JoshFerge
Member Author

Currently there are 3 things that happen in post processing:

  • transaction clustering
  • signal + onboarding tasks
  • delete from rc-processing

I think instead of splitting these under different options, you can combine them under a single option, transactions.do_post_process_in_save, and check that option only in save_event_transaction, not in post processing at all. I believe this works without any additional checks in post process, because if you do everything here, including the deletion, then post process will never run anyway. The relevant line that prevents this from being re-post-processed is here:

data = processing_store.get(cache_key)

I think there are a few benefits of this approach:

  • Consistency. You get exactly-once processing without much work. You don't have a minute or two, while the option gets flipped, during which events in the middle of the pipeline may skip this processing.
  • Less complexity.
  • All combinations of options are valid. There is only one option, and turning it on and off is harmless. It's easy for anyone (like SRE) to flip it back in an incident. The other way requires more knowledge of the code to ensure an invalid combination isn't picked.

What do you think?

i've gone ahead and implemented this -- it's much cleaner. thank you for the suggestion. closing the follow-up PR.

@JoshFerge
Member Author

updated rollout plan: #81065 (comment)

@lynnagara lynnagara (Member) left a comment

:shipit:

Comment on lines +586 to +595
if (
    consumer_type == ConsumerType.Transactions
    and event_id
    and in_rollout_group("transactions.do_post_process_in_save", event_id)
):
    # we won't use the transaction data in post_process
    # so we can delete it from the cache now.
    if cache_key:
        processing_store.delete_by_key(cache_key)

Member

just double checking - should this be in a finally block, or in an else block?

do we want to run it if we hit the save event exception above?
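
The else-block variant would look something like this (a sketch; the surrounding try/except is assumed from context):

try:
    save_transaction_event(...)  # the save call above (name assumed)
except Exception:
    # on failure, keep the cache entry so retries / post_process can still read it
    raise
else:
    if (
        consumer_type == ConsumerType.Transactions
        and event_id
        and in_rollout_group("transactions.do_post_process_in_save", event_id)
    ):
        if cache_key:
            processing_store.delete_by_key(cache_key)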
