Elaborate, document and propose remedies for "the 10k span problem" #80

codefromthecrypt · 2017-07-11T08:53:08Z

Traces with thousands of spans can be problematic. They can choke the UI (not just ours) and increase the operating costs of a tracing system. A number of scenarios can result in "the 10k span problem", such as broadcast messaging to an unbounded number of consumers or buggy traced loops. Some workarounds are easier than others. For example, dropping locally reported spans is easier than coordinating message consumers so that they drop theirs.

This issue should clarify the major scenarios, known workarounds and remedies. Hopefully, it results in at least documentation and, ideally, in coding practices that defend against the problem.

Here are some breadcrumbs:

- Notes on 10k span problem from tracing workshop
- Brave 5 wish list (brave#444): span sampler
- zipkin ui very slow when has a lot of span (zipkin#1460): UI choking

ImFlog · 2017-07-11T09:24:06Z

One of the solutions discussed could be to drop spans. The question is where to do it?

- on the Zipkin server side (with a rule system, for example): this could fix the UI issue
- on the collector side: this reduces the load but also introduces complexity. How do we know this is a long trace?

codefromthecrypt · 2017-07-11T09:39:55Z

> One of the solutions discussed could be to drop spans. The question is where to do it?
> - on the Zipkin server side (with a rule system, for example): this could fix the UI issue

By server, I think you mean at query time, right? One tradeoff of dropping at query time is the assumption that the only consumer of the API is the UI, which isn't the case, even though it is the primary consumer. Nested in the attached Google doc is a slight variation, which is to drop or simply collapse (make unrenderable) spans in the client-side JavaScript. This is another option to keep from overloading the UI, and it has the advantage of not requiring a data model change or dropping data.
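To illustrate the collapse idea, here is a minimal sketch of grouping repeated sibling spans at render time. It is written in Python for brevity, although the comment above is about client-side JavaScript. The function name `collapse_spans`, the `max_per_group` parameter and the span keys are assumptions made for this example, not part of the Zipkin UI or API.

```python
from itertools import groupby


def collapse_spans(spans, max_per_group=3):
    """Collapse runs of adjacent spans that share a parent and a name.

    `spans` is a list of dicts with the hypothetical keys "id", "parentId"
    and "name", already ordered for display. Nothing is removed from
    storage; only the rows that would be rendered change.
    """
    rows = []
    by_parent_and_name = lambda s: (s["parentId"], s["name"])
    for (parent_id, name), run in groupby(spans, key=by_parent_and_name):
        run = list(run)
        # Keep the first few spans of the run so the pattern stays visible.
        rows.extend(run[:max_per_group])
        hidden = len(run) - max_per_group
        if hidden > 0:
            # One summary row stands in for the spans that are not drawn.
            rows.append({"collapsed": hidden, "parentId": parent_id, "name": name})
    return rows


# A root span followed by 10,000 identical children, as a buggy loop might produce.
spans = [{"id": "root", "parentId": None, "name": "get /orders"}]
spans += [{"id": f"s{i}", "parentId": "root", "name": "redis get"} for i in range(10_000)]
rows = collapse_spans(spans)
```

Because the full span list is still returned by the API, other consumers are unaffected; only the rendering layer sees the shortened list.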

> - on the collector side: this reduces the load but also introduces complexity. How do we know this is a long trace?

To qualify what you've mentioned here, this is the case where you don't know how many spans will be created in the process (for example, broadcast messaging spans, which fork on receipt). There are also scenarios that create a lot of spans in-process, and the local tracer could sample there without coordination.
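As a rough illustration of local sampling without coordination, a tracer could cap the number of spans it reports per trace within one process. This is a minimal sketch under stated assumptions: `SpanBudget` and its methods are hypothetical names for this example, not a Brave or Zipkin API, and it only addresses the in-process case, not broadcast fan-out across consumers.

```python
from collections import defaultdict


class SpanBudget:
    """Hypothetical per-trace cap on spans reported by one process."""

    def __init__(self, max_spans_per_trace):
        self.max_spans_per_trace = max_spans_per_trace
        self._reported = defaultdict(int)  # trace ID -> spans reported so far
        self._dropped = defaultdict(int)   # trace ID -> spans dropped so far

    def should_report(self, trace_id):
        """Return True while this process is under budget for trace_id."""
        if self._reported[trace_id] < self.max_spans_per_trace:
            self._reported[trace_id] += 1
            return True
        self._dropped[trace_id] += 1
        return False

    def dropped(self, trace_id):
        """Number of spans dropped for trace_id, e.g. to tag on the parent span."""
        return self._dropped[trace_id]

    def forget(self, trace_id):
        """Release state for a finished trace so memory stays bounded."""
        self._reported.pop(trace_id, None)
        self._dropped.pop(trace_id, None)


# A buggy traced loop that would otherwise report 10,000 local spans.
budget = SpanBudget(max_spans_per_trace=100)
reported = sum(budget.should_report("trace-a") for _ in range(10_000))
```

The decision needs only state held by the local tracer, which is why this case is easier than getting independent message consumers to agree on what to drop.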

codefromthecrypt · 2017-07-11T09:44:15Z

So one way to proceed from here could be to enumerate the different patterns and a strategy for each: for example, 10k spans due to local spans, broadcast, RPC, etc. I've created a Google doc here that might help: https://docs.google.com/document/d/1XkFGflrQP4wF8vqv-veFDE-t-V5iyH5bh9VXRYaOROg/edit

You can also look here for some text about common tracing patterns, a summary of which might be helpful in elaborating: https://drive.google.com/drive/u/0/folders/0B0tSnQT3uGdAUVVUcDA5d21rRWM

felixbarny mentioned this issue Aug 30, 2018

[APM] Timeline: Show spans without transaction elastic/kibana#22347

Closed
