
ref(transactions): add functionality to sample transactions in save_event_transaction #81077

Open
wants to merge 6 commits into base: master

Conversation

JoshFerge
Member

first PR of #81065

no behavior changes until the option is turned on. post_process needs to change as well before we can turn the option on; the moved logic is the equivalent of the if is_transaction_event: branch in post_process.
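
Concretely, the gate being added looks roughly like this (a minimal sketch; do_transaction_post_process_work is a hypothetical stand-in for the moved logic, and in_rollout_group is the helper that appears later in the diff):

# in save_event_transaction, after the event has been saved:
if in_rollout_group("transactions.do_post_process_in_save", event_id):
    # the same work the `if is_transaction_event:` branch does in post_process
    do_transaction_post_process_work(jobs, projects)  # hypothetical helper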

@lynnagara
Member

What are the implications of transactions sampling not happening exactly once?

Won't this potentially sample some messages in flight twice, or not at all, depending on the direction of the option change?

@lynnagara
Member

I think #81079 could probably be part of this same PR, otherwise all this option does is double sample.

@JoshFerge
Member Author

I think #81079 could probably be part of this same PR, otherwise all this option does is double sample.

@lynnagara the goal of the PRs is for them to be reviewed separately, not deployed separately. I would never turn the option on without the other one deployed.

@JoshFerge
Member Author

What are the implications of transactions sampling not happening exactly once?
Won't this potentially sample some messages in flight twice, or not at all, depending on the direction of the option change?

from reading about the transaction sampler, I believe it is a feature that helps with parameterizing high-cardinality transaction names. I don't believe double sampling (which wouldn't happen as I've implemented this) or missing samples (which would be the case for a few moments during this deploy) would have a large effect on the product.

https://develop.sentry.dev/api-server/application-domains/transaction-clustering/#accidental-erasure-of-non-identifiers
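
For intuition, the clusterer's job is roughly the following (a conceptual illustration only, not Sentry's actual implementation):

# sampled raw transaction names are collected (in Redis) ...
samples = [
    "/users/123/posts/456",
    "/users/789/posts/42",
    "/users/555/posts/9",
]
# ... and periodically clustered into replacement rules that collapse
# high-cardinality path segments:
rule = "/users/*/posts/*"

Missing one sampling window just means a rule is derived from slightly fewer names, which is why a brief gap during the deploy is tolerable.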

@jjbayer jjbayer (Member) left a comment

We originally put the "record transaction" logic in post process to move it off the critical path. But if that task is going to disappear completely, this makes sense.

src/sentry/options/defaults.py (review thread outdated, resolved)
src/sentry/event_manager.py (review thread outdated, resolved)
@JoshFerge
Member Author

JoshFerge commented Nov 21, 2024

We originally put the "record transaction" logic in post process to move it off the critical path. But if that task is going to disappear completely, this makes sense.

Yeah, that makes sense. Unfortunately, at this point the load we're putting on RabbitMQ and the potential instability it causes aren't worth having a whole separate task for transactions. I think once we have more durable execution with task workers, it's certainly worth investigating breaking it up again.

tests/sentry/event_manager/test_event_manager.py (review thread outdated, resolved)
tests/sentry/event_manager/test_event_manager.py (review thread outdated, resolved)
@@ -1540,6 +1541,133 @@ def test_transaction_event_span_grouping(self) -> None:
# the basic strategy is to simply use the description
assert spans == [{"hash": hash_values([span["description"]])} for span in data["spans"]]

@override_options({"transactions.do_post_process_in_save": True})
Member

What value does this test add in contrast with the next one? I think they're the same, but the next one tests for the two calls.

Member Author

there is a comment below, but: the next test ensures that the functions get called with the correct mocks, while this one ensures that the functions themselves run correctly and don't error. (I didn't see a great way to validate the side effects those functions cause, so I have two.)
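
Roughly, the two tests look like this (a sketch; make_transaction_data is a hypothetical fixture helper, and the mock.patch target path is an assumption):

from unittest import mock

@override_options({"transactions.do_post_process_in_save": True})
def test_save_transaction_post_process_logic_runs(self):
    # end to end: passes as long as the moved functions don't raise
    self.store_event(data=make_transaction_data(), project_id=self.project.id)

@override_options({"transactions.do_post_process_in_save": True})
def test_save_transaction_post_process_logic_called(self):
    # wiring: the moved function is actually invoked exactly once
    with mock.patch(
        "sentry.event_manager.record_transaction_name_for_clustering"
    ) as mock_cluster:
        self.store_event(data=make_transaction_data(), project_id=self.project.id)
        assert mock_cluster.call_count == 1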

@JoshFerge JoshFerge changed the title ref(transactions): add functionality to sample transactions in save_event_transac ref(transactions): add functionality to sample transactions in save_event_transaction Nov 21, 2024
@markstory markstory (Member) left a comment

Looks good with #81079 so that we aren't double sampling.

…_process (#81079)

in lockstep with #81077, stops
doing work in post_process on transactions. part of
#81065
@JoshFerge
Member Author

Looks good with #81079 so that we aren't double sampling.

merged that PR into this one so it's easier to read / so we can ensure they go out together. (won't do anything until the option is modified)

@JoshFerge
Member Author

rollout strategy: #81065 (comment)

):
    continue
project = job["event"].project
record_transaction_name_for_clustering(project, job["event"].data)
Member

I think this is probably fine, although I don't really know how this sampling is used. If this feature relies on the transaction being indexed in Snuba, there could be downstream problems.

Member Author

confirmed it does not rely on the transaction being indexed -- this system uses redis only.

@wedamija
Member

We also log

duration = time() - event.data["received"]
metrics.timing(
    "events.time-to-post-process",
    duration,
    instance=event.data["platform"],
    tags=metric_tags,
)

as part of post process. I'm not sure if there are any Datadog monitors or stat pages that will be affected by removing this. Might be worth adding this to the end of save event for transactions?
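
That would amount to something like this at the end of the transaction save path (a sketch; the tag values and exact placement are assumptions):

# at the end of save_transaction_events, per job:
for job in jobs:
    event = job["event"]
    duration = time() - event.data["received"]
    metrics.timing(
        "events.time-to-post-process",  # or a new save-path metric name
        duration,
        instance=event.data["platform"],
        tags={"type": "transaction"},  # placeholder for post_process's metric_tags
    )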

@lynnagara
Member

lynnagara commented Nov 21, 2024

We have to be very careful about how this impacts the performance of save event. I think we should look into a couple of things before making this change:

  • Can we collect timings on transaction post-processing first, before we move it into save event? Whether this is safe to do depends on the timing. If it's above, say, 10 milliseconds, I think we should consider some other options. (see the timing sketch after this list)
  • Secondly, signals aren't allowed in the save event task, as it is too easy for additional work to be added there. Can we remove those?
  • The volume on this pipeline is expected to increase up to fourfold in the coming months, and we would need to understand what's needed to scale if the save event transactions job gets slower.
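
For the first point, the timing could be collected in post_process before anything moves, along these lines (a sketch; the metric name is hypothetical, and metrics.timer as a context manager is assumed):

from sentry.utils import metrics

# inside post_process, wrapped around the transaction-only work:
with metrics.timer("post_process.transaction_logic"):  # hypothetical metric name
    record_transaction_name_for_clustering(project, event.data)
    # signal / onboarding-task dispatch would be timed here too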

@wedamija
Member

We have to be very careful about how this impacts the performance of save event. I think we should look into a couple of things before making this change:

  • Can we collect timings on transaction post-processing first, before we move it into save event? Whether this is safe to do depends on the timing. If it's above, say, 10 milliseconds, I think we should consider some other options.
  • Secondly, signals aren't allowed in the save event task, as it is too easy for additional work to be added there. Can we remove those?
  • The volume on this pipeline is expected to increase up to fourfold in the coming months, and we would need to understand what's needed to scale if the save event transactions job gets slower.

I agree we should time it to be safe, but I think it's unlikely that it takes very long compared to

_detect_performance_problems(jobs, projects)

which we do in save_transaction_events. That call does a lot of relatively expensive analysis on all the details of the transaction.

@JoshFerge JoshFerge (Member Author) left a comment

TODO: let's use two options for the rollout

@JoshFerge
Member Author

we spent some time looking at the performance of the signals / transaction clustering on current telemetry in post_process, and both looked to be sub-10ms on average, with no large outliers. this PR includes timing telemetry so we can monitor perf in S4S to confirm before any wider rollout takes place.

also confirmed that transaction clustering doesn't look at anything in ClickHouse; it only uses Redis.

i've split the option used to roll this out into two, so we can roll out by turning on transactions.do_post_process_in_save and confirming things look good (during this time we'll also do the same work in post_process, but doing things twice here is fine).

after we confirm things look good, we can flip on transactions.dont_do_transactions_logic_in_post_process, which will stop doing the work in post_process.
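
Sketched out, the two gates would look roughly like this (option names per the comment above; the helper is a hypothetical stand-in):

# step 1 -- in save_event_transaction (duplicates post_process for a while):
if options.get("transactions.do_post_process_in_save"):
    do_transaction_post_process_work(jobs, projects)  # hypothetical helper

# step 2 -- in post_process, once step 1 looks healthy:
if is_transaction_event and options.get(
    "transactions.dont_do_transactions_logic_in_post_process"
):
    return  # skip the now-duplicated transaction logic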


codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #81077      +/-   ##
==========================================
+ Coverage   78.48%   80.36%   +1.87%     
==========================================
  Files        7215     7217       +2     
  Lines      319812   319896      +84     
  Branches    44045    20741   -23304     
==========================================
+ Hits       251012   257078    +6066     
- Misses      62411    62426      +15     
+ Partials     6389      392    -5997     

@lynnagara
Member

lynnagara commented Nov 21, 2024

Currently there are 3 things that happen in post processing:

  • transaction clustering
  • signal + onboarding tasks
  • delete from rc-processing

I think instead of splitting these under different options, you can combine them under a single option, transactions.do_post_process_in_save, and check that option only in save_event_transaction, not in post processing at all. I believe this works without any additional checks in post process, because if you do everything here, including the deletion, then post process will never run anyway. The relevant line that prevents this from being re-post-processed is here:

data = processing_store.get(cache_key)

I think there are a few benefits of this approach:

  • Consistency. You get exactly-once processing without much work. You don't have a minute or two, while the option gets flipped, during which events in the middle of the pipeline may skip this processing.
  • Less complexity.
  • All combinations of options are valid. There is only one option, and turning it on and off is harmless. It's easy for anyone (like SRE) to flip it back in an incident. The other way requires more knowledge of the code to ensure an invalid combination isn't picked.

What do you think?
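
Sketched out, the suggested single-option flow looks roughly like this (names taken from this PR's diff; surrounding code elided):

# in save_event_transaction, gated by the one option:
if in_rollout_group("transactions.do_post_process_in_save", event_id):
    record_transaction_name_for_clustering(project, event.data)
    # ... fire signals / onboarding tasks here ...
    if cache_key:
        processing_store.delete_by_key(cache_key)  # delete from rc-processing

# post_process needs no new check: it begins with
#     data = processing_store.get(cache_key)
# and already bails out when the entry was deleted above.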

@JoshFerge
Member Author

Currently there are 3 things that happen in post processing:

  • transaction clustering
  • signal + onboarding tasks
  • delete from rc-processing

I think instead of splitting these under different options, you can combine them under a single option, transactions.do_post_process_in_save, and check that option only in save_event_transaction, not in post processing at all. I believe this works without any additional checks in post process, because if you do everything here, including the deletion, then post process will never run anyway. The relevant line that prevents this from being re-post-processed is here:

data = processing_store.get(cache_key)

I think there are a few benefits of this approach:

  • Consistency. You get exactly-once processing without much work. You don't have a minute or two, while the option gets flipped, during which events in the middle of the pipeline may skip this processing.
  • Less complexity.
  • All combinations of options are valid. There is only one option, and turning it on and off is harmless. It's easy for anyone (like SRE) to flip it back in an incident. The other way requires more knowledge of the code to ensure an invalid combination isn't picked.

What do you think?

i've gone ahead and implemented this -- it's much cleaner. thank you for the suggestion. closing the follow-up PR.

@JoshFerge
Member Author

updated rollout plan: #81065 (comment)

@lynnagara lynnagara (Member) left a comment

:shipit:

Comment on lines +586 to +595
if (
    consumer_type == ConsumerType.Transactions
    and event_id
    and in_rollout_group("transactions.do_post_process_in_save", event_id)
):
    # we won't use the transaction data in post_process
    # so we can delete it from the cache now.
    if cache_key:
        processing_store.delete_by_key(cache_key)

Member

just double checking - should this be in a finally block, or in an else block?

do we want to run it if we hit the save event exception above?
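
The else-block variant would look something like this (a sketch; the surrounding try/except is assumed from context):

try:
    save_transaction_event(...)  # the save call above (name assumed)
except Exception:
    # on failure, keep the cache entry so retries / post_process can still read it
    raise
else:
    if (
        consumer_type == ConsumerType.Transactions
        and event_id
        and in_rollout_group("transactions.do_post_process_in_save", event_id)
    ):
        if cache_key:
            processing_store.delete_by_key(cache_key)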
