Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement tracing support #163

Open
1st1 opened this issue Jun 1, 2018 · 12 comments
Open

Implement tracing support #163

1st1 opened this issue Jun 1, 2018 · 12 comments

Comments

@1st1
Copy link
Member

1st1 commented Jun 1, 2018

This is a meta issue to discuss how new asyncio tracing API should look like. We will implement it first in uvloop, and later, if necessary, port it to asyncio.


Requirements

  1. Base on the new contextvars (PEP 567) API, so that async tasks and callbacks they schedule are traced consistently.

  2. Traces should be hierarchical, i.e. it should be possible to see what and how exactly one Task did, what Tasks that it spawned did, etc.

  3. Low performance overhead. Ideally we should be able to run tracing in production.

  4. Low-level and detailed: we want to be able to measure DNS requests, read/write callbacks performance, throughput, number of callbacks/tasks created etc.

  5. Flexibility: traces should be in easy to work with format, so that it's possible to submit them to existing systems like zipkin.

  6. Extensibility: the new API should be easy to maintain on both ends: in the event loop and in applications that use it. This means that adding a new kind of trace should not require people to write any glue code.

  7. [TBD]

@1st1
Copy link
Member Author

1st1 commented Jun 1, 2018

#160 is the first attempt at implementing the new API.

The main thing I don't like about it is that it's designed around having various _begin and _end callbacks in so called Collector objects. There are a few problems with this approach:

  • This isn't flexible: you'll need to update your collector with new _begin and _end methods whenever a new version of uvloop with new tracing bits comes out.

  • It's too generic. I don't like that users will be able to run arbitrary code at arbitrary places. Writing a proper and low-overhead collector will be a non-trivial task. Speaking from my experience, we will end up in a situation where people do all kinds of crazy stuff in those callbacks, including scheduling more callbacks and doing disk IO.

In my opinion it would be better to create a Span dataclass that has the following attributes:

  • name: str -- name of the trace, like "task.created", or "network.data_received".
  • id: uint64 -- unique span ID
  • parent_id: uint64 -- ID of the parent span
  • type: uint8: (or this can be just type of the class)
    • 0: duration
    • 1: counter

For "duration" spans:

  • created_at: uint64 -- local monotonic time (in nanoseconds?) when the span was created
  • duration: uint64 -- how long the span was active

For "counter" spans:

  • count: uint64 -- how many times a certain thing happened, or how many bytes were transferred

This is just an idea we should play with. The design is inspired by statsd and zipkin.

@pfreixes
Copy link
Contributor

pfreixes commented Jun 2, 2018

Thanks for your feedback @1st1

Regarding your concerns about the implementation proposed by #160 I partially agree on both of them but I have my reasons starting with such proposal.

This first approximation had as a goal implement a generic system that would allow the user to implement on top of it its own logic or implementation, so it would allow the implementation of interfaces that would provide the glue for ZipKin, OpenTracing, AWS Xray, Statsd, and other tools either for tracing or for monitoring. So basically the interface presented wanted to solve both cases: tracing plus monitoring.

It's true that having such open interface it might end up allowing the users using bad practices, and yes that the creation of a new duration span with the pair _begin and _end functions would mean that most of the interfaces should update their implementations.

Your proposal, that goes more aligned with the initial conversations, narrows and makes explicit the implementation, so losing freedom but at the same point having the following advantages:

  • Avoid implicitly the use of bad practices, by design.
  • Puts a clear boundary between the application code and the uvloop code, without having cross communication as we had with the callbacks that must help to keep the footprint of the tracing within the cython figures.
  • Could allow flexibility in terms of maintencance for future wrappers.

I agree with the concept of Span in terms of an operation that takes some time - as it is already described in the tracing world. But I disagree with the other cases that are not timed operations, such as the counters which IMO they are more related to monitoring.

Lets take as an example the following snippet:

async def coro(loop):
    with uvloop.tracing("coro_span") as span:
        connection = await redis.connect("loclahost")
        value = await connection.get("foo")
        span.finish()
        print(span)

The uvloop.tracing will create span such as:

Span(
  name='coro_span',
  created_at=t0,
  duration=x,
  children=[
    Span(
      name='uvloop.getaddrinfo',
      created_at=t1,
      duration=y,
      host='localhost'
      port=6677
      addr='127.0.0.1'
    ),
    Span(
      name='uvloop.connect',
      created_at=t2,
      duration=y,
      addr='127.0.0.1:6667', 
    ),
  ]
)

So basically the Spans are a hierarchy of the duration events with their metadata.

Having something like this, the goal of tracing gets narrowed and it should mean that future wrappers - such us OpenTracing - must be able to implement something enough generic to allow changes on the metadata related to each Span, allowing them to make a pure translation to their span idiom.

Even I guess we might add as metadata some metrics that can be gathered within an ongoing span, such as the counters that we were speaking, for example, the next snippet shows what could implement the callback that is in charge of feed some data from the network

def feed(self, data):
    if span:
        span.rx += len(data)

So later the span will incorporate this data automatically.

But I'm still thinking that we would need a way to provide pure monitoring KPIs with an interface that is not coupled to the tracing interface, allowing the owners of the services that run applications on top of uvloop to gather the operational metrics such as RX, TX, callbacks, tasks, polling time and so on, that is part of the monitoring field rather than tracing.

Regarding the id of the Span, I'm not really sure if we need it, and I'm inclined to think that this is more something that depends on the tracer implementation - the wrapper - that identifies univocally the spans with its system. So, the question will be if the raw implementation really needs it and if an implicit serialization as the one that was presented would be enough.

I finally the implementation of the tracing moves into that direction, meaning that uvloop is gonna provide its own span implementation. As soon as there is a POC of this implementation I would try to make some tests and check how much friction imposes its wrapping for systems such as Zipkin, OpenTracing, AWS Xray and so on.

@pfreixes
Copy link
Contributor

pfreixes commented Jun 4, 2018

I've been digging a bit about how easily will be serializing the uvloop spans that later would be converted to the tracer format that is the one being used by an application such as Jaeger, DDog, AWS Xray and so on.

Take as an example the following code that tries to emulate this serialization for the Jaeger library [1]

from jaeger import tracer

@contextmanager
def jaeger_trace(loop):
    uvloop_span = loop.trace():
    yield
    uvloop_span.finish()

    root = tracer.start_span(
        uvloop_span.name,
        tags=uvloop_span.tags
        start_time=uvloop_span.start_time
    )

    for uvloop_child in uvloop_span.children:
        child = tracer.start_span(
            uvloop_child.name,
            child_of=root,
            tags=uvloop_child.tags,
            start_time=uvloop_child.start_time
        )
        child.finish(finish_time=uvloop_child.stop_time)

    root.finish(finish_time=uvloop_span.stop_time)

Basically it leaves to uvloop to make all of the work and once it is done the wrapper builds programatically the Jaeger spans using as a source the uvloop spans.

In an ideal world all of the libraries that iplement the OpenTracing standard [2] must be able to serialize traces using a piece of code quite similar with the above one. But some libraries that claim to be OpenTracing compatibles do not have an exact implementation, perhaps Data Dog has a Span object that supports the finish time as a parameter of its constructor while the start_span function - the one used to start a new span - does not support it [4]. Others like AWS Xray [5] would need a change in their API to allow serialize tracers programatically that come from a uvloop.

So summary, providing uvloop its own system to create and handle spans might create some resistence to be adopted by some tracers. On the other hand, having a more generic implemenentation as the callback one provided by #160, libraries should have enough freedom to implement their own span system.

Worth mentioning that tracers system place none core fields - id, duration, - under a generic field called tag,meta or other. So if finally uvloop implements its own span system, I would keep the core fields as minimum is possible copying the ones implemented by OpenTracing and putting all of the extra under a namespace such as tags, meta u other.

[1] https://github.com/jaegertracing/jaeger-client-python
[2] https://github.com/opentracing/opentracing-python/blob/master/opentracing/tracer.py#L37
[3] https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/span.py#L51
[4] https://github.com/DataDog/dd-trace-py/blob/master/ddtrace/tracer.py#L137
[5] https://github.com/aws/aws-xray-sdk-python/blob/master/aws_xray_sdk/core/recorder.py#L163

@1st1
Copy link
Member Author

1st1 commented Jun 4, 2018

So summary, providing uvloop its own system to create and handle spans might create some resistence to be adopted by some tracers.

I agree, that will be a problem. The idea here is that we need a mechanism to setup a custom Span constructor per trace.

Here's my current idea of the API. It's quite close to Jaeger and few other tracing libraries.

########################################################################
# Classes that a tracing library would need to provide to
# support uvloop:
########################################################################


class Span(abc.ABC):

    @abc.abstract_method
    def get_parent_span(self) -> Span:
        """Return the parent span."""


class TimerSpan(Span):
    pass


class CounterSpan(Span):
    pass


class Tracer(abc.ABC):

    @abc.abstract_method
    def start_timer_span(self, name: str) -> TimerSpan:
        pass

    @abc.abstract_method
    def end_timer_span(self, span: TimerSpan):
        pass

    @abc.abstract_method
    def start_counter_span(self, name: str) -> CounterSpan:
        pass

    @abc.abstract_method
    def end_counter_span(self, span: CounterSpan):
        pass

    @abc.abstract_method
    def increment_counter_span(self, span: CounterSpan, inc: int):
        pass


########################################################################
# APIs that uvloop will provide:
########################################################################


class TracedContext(abc.ABC):
    """An opaque object that encapsulates information about tracing."""

    def _get_tracer(self) -> Tracer:
        """Return the Tracer that was used to start the context."""
        # Private uvloop API

    def _get_current_span(loop) -> Optional[Span]:
        """Return the current span."""
        # Private uvloop API

    def _is_tracing(loop) -> bool:
        """Return True if tracing is enabled in the current context."""
        # Private uvloop API


def uvloop.start_tracing(loop, tracer: Tracer) -> TracedContext:
    """Start a new TracedContext and set it as current."""


def uvloop.end_tracing(loop, trace_ctx: TracedContext):
    """Stop tracing for the passed TracedContext."""

The actual tracing implementation is deliberately hidden from uvloop users, allowing us to use contextvars or any other mechanism we want.

Open questions:

  1. Should TracedContext support nesting? Maybe TracedContext can be a Span? In principle I'd be OK with uvloop.start_tracing() creating a top-level TimingSpan instance, but I'd like tracing boundaries to be a bit more visible than that.

  2. Ideally we should somehow indicate new Tasks -- so maybe we need a TaskSpan span too.

  3. How can we shield some tasks from tracing? That's needed to spawn new tasks that are not traced (even if the Task that spawns them is traced).

@jettify
Copy link

jettify commented Jun 5, 2018

I like zipkin approach to spans since API is minimalistic and very powerful, here I created tracer for asyncio with aiohttp instrumentation (https://github.com/aio-libs/aiozipkin). I have implementation of spans/tracers/context etc.

Quick question, zipkin/opentracing and co are all about distributed tracing (as described in dapper paper), it helps to find all tasks spawned by particular request on several distributed machines. Given that what purpose of uvloop/asyncio tracing subsystem? Is this only for debugging only local process or distributed also?

Distributed case has some implication, trace or not to trace often decided on separate server per request, if we infer from incoming request that it sampled, subsystem should trace all child tasks and attach it to current span.

@1st1
Copy link
Member Author

1st1 commented Jun 5, 2018

I like zipkin approach to spans since API is minimalistic and very powerful, here I created tracer for asyncio with aiohttp instrumentation (https://github.com/aio-libs/aiozipkin). I have implementation of spans/tracers/context etc.

Nice. What do you think about the approach I outlined in #163 (comment)? Would it work for aiozipkin? If no, then how would you change it?

Quick question, zipkin/opentracing and co are all about distributed tracing (as described in dapper paper), it helps to find all tasks spawned by particular request on several distributed machines. Given that what purpose of uvloop/asyncio tracing subsystem? Is this only for debugging only local process or distributed also?

I'd aim for both.

Distributed case has some implication, trace or not to trace often decided on separate server per request, if we infer from incoming request that it sampled, subsystem should trace all child tasks and attach it to current span.

We'll be using contextvars to store the tracing context, so enabling tracing just for one web request should be naturally supported.

@pfreixes
Copy link
Contributor

pfreixes commented Jun 6, 2018

Regarding your proposal @1st1, TL;DR Ok!.

The proposal circumvents the issue of the friction to allow the wrappers being integrated with not so much pain, indeed this proposal goes more into the direction of the callbacks but narrowing the use case making the API super opinionated.

On the other hand, having this proposal you know that who will dictate the performance degradation will be the tracer, so the crosstalk between Cython code and Python code will be there by design. None sampled spans end up as real Python objects [1]

About the three different spans Timer, Counter and Increment IMO as I said it does not make sense, a Span is an object that represents an execution during a time that is the perfect match for the Timer span, while Counter and Increment are just metrics that have another nature. Take as an example the Jaeger implementation, they do not have any other concept for Span aside of the Timer one, indeed they have a metrics subsystem [2] that allow them to send some related metrics such as counters and increments.

IMO the metrics/stats must be provided by uvloop in another interface perhaps:

ascnc def coro():
    with uvloop.stats() as stats:
        with uvloop.start_tracing(tracer) as span:
            span.set_tag("rx", stats.rx)

Regarding your quesitons:

Should TracedContext support nesting? Maybe TracedContext can be a Span? In principle I'd be OK with uvloop.start_tracing() creating a top-level TimingSpan instance, but I'd like tracing boundaries to be a bit more visible than that.

I dont get this question.

In any case seeing how other traces implement its counterpart start_tracing, this should create the root span.

Ideally we should somehow indicate new Tasks -- so maybe we need a TaskSpan span too.

IMO whatever part of the underlying code that is executed by uvloop that can be isolated and it is interpretable as a span should implement what is needed to be traced, creating the spans as a children of the root span.

So the easy example is the DNS resolution, so having

ascnc def coro():
    with uvloop.start_tracing(tracer) as span:
        hosts = await loop.getaddrinfo(..)

The span object must have a children span that identifies clearly the DNS resolution. Im inclined to think that we have to use the same Span object for all of the spans that uvloop can create, or at maximum just an ephemeral child class that dictates some attributes such as the name.

Then the no easy part, the Task one. If we have a code such as:

async def coro():
    with uvloop.start_tracing(tracer) as span:
        task = loop.create_task(bar())
        await task

Im inclined to think that the await task must be considered as a children span. The same for any of the other primitives that allow you wait for a task. Hence, facing to all of the impediments that we will have without having Asyncio implementing tracing natively.

I would say that this one of the critical parts that would need some answers as soon as possible, having also the option of just leave aside this problematic without implementing them as children spans, so the user losing some granualirity in the trace.

How can we shield some tasks from tracing? That's needed to spawn new tasks that are not traced (even if the Task that spawns them is traced).

Im wondering why someone that is interested on instrumentalize a piece of code gathering the trace and its spans would be interested on discard some specific tasks. I have the feeling that there is no an easy way to make this and the implementation and inteface can end up much more complicated for just do something that might be used by nobody.

And last but not least, did you considere to put the start_tracing as a method of the loop? This defenitly will help in the future to make the convergency with asyncio, or is it too much adventurous?

[1] https://github.com/aio-libs/aiozipkin/blob/master/aiozipkin/span.py#L81
[2] https://github.com/jaegertracing/jaeger-client-python/blob/master/jaeger_client/metrics/metrics.py

@pfreixes
Copy link
Contributor

pfreixes commented Jun 6, 2018

Hi @jettify gad to see you here

Given that what purpose of uvloop/asyncio tracing subsystem?

The goal is give more granularity to the user when the user traces an application that runs on top of Asyncio, IMO the annalogy is other frameworks that they already implement tracing system [1].

Having Asyncio implementing tracing under the hood we must be able to enrich current solutions, such as the Aiohttp middleware for AWS Xray [2] giving to the user a more fine granularity for free.

The question is why not in Asyncio ?

[1] https://github.com/opentracing-contrib
[2] https://github.com/aws/aws-xray-sdk-python/blob/master/aws_xray_sdk/ext/aiohttp/middleware.py#L15

@jettify
Copy link

jettify commented Jun 6, 2018

As as far as aiozipkin concerts, I need only callbacks and way to pass context between them. zipkin spans already has capabilities to: annotate timeline, add metadata and tags, create child span. And developer usually decided how much to trace and what callbacks to instrument.

As for #163 (comment) not sure what the purpose of TimerSpan vs CounterSpan vs TaskSpan. Say I want to trace subtask tree creation, make zipkin span per task with appropriated child parent relation (this way we can see sequential vs parallel task execution as result find places where we can increase parallelism). How mentioned API fits for this?

@pfreixes
Copy link
Contributor

pfreixes commented Jun 9, 2018

Could we start implementing something like:

    with tracer.start_span('TestSpan') as span:
        tracer_ctx = uvloop.stat_tracing(tracer.uvloop)
        .....
        uvloop.stop_stracing(tracer_ctx)

The tracer.start_span its just the Jaeger implementation - could be others such as aiozipkin, etc - and it's irrelevant, the importance here is:

  1. uvloop.stat_tracing flags uvloop to start making the spans at each instrumentalized piece of code
  2. uvloop.stat_tracing receives the interface that allows uvloop to handle spans internally using the external implementation provided by the tracer implementation.
  3. In this POC all spans created by uvloop are just child spans of the parent span create before by the tracer.

@1st1 @jettify thoughts ?

pfreixes added a commit to pfreixes/uvloop that referenced this issue Jun 21, 2018
Implements a new tracing support for uvloop adding the pair methods
`start_tracing` and `stop_tracing` that allows the user to start or
stop the tracing. The user must provide a valid `uvloop.tracing.Tracer`
implementation that would be used internally by uvloop to create spans,
each span must also meet the `uvloop.tracing.Span` class contract.

This POC only implements the creation of spans during underlying calls
to the `getaddrinfo` function.

This POC relates to the conversation in MagicStack#163.
@pfreixes
Copy link
Contributor

pfreixes commented Jun 22, 2018

Regarding #171, the pending discussions that it raises are:

  • MetricsSpan, CounterSpan and TimingSpan controversial, this PR only implements a generic Span that is a timing one.

  • uvloop.tracing.Tracer.current_span needs to be adapted to avoid race conditions, when a span is created the "global" attribute is overwritten. Most likely current_span will end up as another context variable.

  • Agreement of an almost final draft of the start_tracing entry point, right now the implementation hides the TracedContext, so making it usable only internally by uvloop. I'm wondering if we must make it public providing to the user a way of creating spans in his code, so becoming the way of trace asyncio uvloop code.

  • Current implementation implicitly wires the Tracer context using the parent span, so for example when internally a new span is created using the TracedContext.start_span under the hood the parent span is passed to the Tracer.create_span, so allowing to the tracer to fetch some context information from this, also allowing make the new span child of the parent one. Should we reconsider this with a more explicit interface?

Support for other spans within uvloop. But before doing this, we would need a kind of agreement of behind what circumstances the spans are created. A good example is the create_connection loop function, would feel comfortable with something like this:

async def create_connection(...)
    with __traced_context() as span:
        .....

That it would mean that when there is no tracing enabled it will create a noop span, the benefits of this global wrapping make the code cleaner and easier to maintain but the drawbacks are quite clear.

Also worth mentioning that going to a pattern that uses a global context would mean that we will be creating a span when indeed in some failure scenarios we connected to nothing and without yielding the loop.

@pfreixes
Copy link
Contributor

any feedback on this @1st1?

I've been a bit out lately and I don't think that I will have time till the second week of August. But any discussion that helps us to reach some conclusions will be worth it.

Regarding the last summary that I've written down related with the #171 PR I left two open questions which I have some thoughts.

Agreement of an almost final draft of the start_tracing entry point, right now the implementation hides the TracedContext, so making it usable only internally by uvloop. I'm wondering if we must make it public providing to the user a way of creating spans in his code, so becoming the way of trace asyncio uvloop code.

The tracer provider already provides a known API and the user is used to use it, so IMO we must allow the user to keep using the tracer interface that is used to use it and not forcing it to use a new interface. Therefore, the TracedContext and other stuff implemented by uvloop would remain hidden.

we would need a kind of agreement of behind what circumstances the spans are created. A good example is the create_connection loop function, would feel comfortable with something like this:

I'm inclined to think that the span has to be created only when the loop is yielded. For example, for the create_connection loop function if it never reaches the step that yields the loop the span shouldn't be created. So, the __traced_context context wouldn't make sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants