Opentelemetry-specification: Use Context to stop tracing

Created on 28 Mar 2020 · 23Comments · Source: open-telemetry/opentelemetry-specification

There are circumstances for instrumented actions where:

The action is frequent and of low interest: a healthcheck, polling a message queue, etc.
An OpenTracing exporter uses libraries that themselves may be instrumented (risking infinite tracing).
If the current layer (e.g. RPC) happens to offer sufficiently detailed tracing, lower HTTP/DNS/TCP/UDP layers do not need to be traced when invoked with this library. This need is heighten by the use of server/client spans, which AFAIK needs to remove spans in between them. (#526).

(1) has come up in #173

(2) has been a problem in opentelemetry-js (https://github.com/open-telemetry/opentelemetry-js/issues/332) HTTP-based exporters, since the Node.js stdlib is instrumented globally. The solution in that has was adding a special HTTP header x-opentelemetry-outgoing-request headers, that the http instrumentation ignores. In essence it uses HTTP headers as a poor-man's context API. (This may not scale to other APIs.)

It would be nice to cut off all automatic tracing "below this current scope" by setting a context key. The default tracer implementation would create no-op spans if the context disables spans.

EDIT: Case (1) may call for sampling, rather than full disabling.

api after-ga context trace

Source

pauldraper

👍1

Most helpful comment

Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?

dyladan on 30 Mar 2020

👍2

All 23 comments

OpenTelemetry Ruby does this and I really like the idea

dyladan on 28 Mar 2020

👍1

See the Python solution, which I think is rather elegant: https://github.com/open-telemetry/opentelemetry-python/pull/181 (updated in https://github.com/open-telemetry/opentelemetry-python/pull/395 for OTEP 66).

You introduce a conventional context variable "suppres_instrumentation" that all instrumentations check and the span processors set to true while exporting.

Oberon00 on 30 Mar 2020

Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?

dyladan on 30 Mar 2020

👍2

That would also work, but personally I prefer the more explicit no-magic approach here. Also, the instrumentation overhead can be further reduced if the instrumentation checks manually and shortcuts, instead of doing all the usual instrumentation with a no-op Span. Checking Span.IsRecording would probably work in most cases though.

Oberon00 on 30 Mar 2020

I think that is the primary use for the isRecording flag. I just prefer to only have to implement such things in a single place so that there is less risk of forgetting. I'm playing around with such an implementation in JS right now and I set the context in the span processor, then return non-recording spans from the tracer if that context is set. It works well and required very few changes.

dyladan on 30 Mar 2020

👍1

@toumorokoshi @codeboten FYI

Oberon00 on 30 Mar 2020

I agree with @dyladan .

(A) The tracer is already managing context (activeSpan, withActiveSpan).

(B) Fewer places is better.

pauldraper on 31 Mar 2020

@pauldraper The tracer will stop "managing" context but it will still need to access the context when creating a new span, see #527

Oberon00 on 31 Mar 2020

👍1

Hi, another opentelemetry-python contributor chiming in, thanks @Oberon00!

Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?

I think have the tracer check the flag is a good idea. I don't think it's too much magic, and comes at the benefit of not having to require instrumentation authors to implement this pattern. We can control this behavior on the sdk level, which enables stronger guarantees on behavior consistency.

I think that is the primary use for the isRecording flag.

It doesn't look like that's the intention of IsRecording, at least from what I see in the spec:

Returns true if this Span is recording information like events with the AddEvent operation, attributes using SetAttributes, status with SetStatus, etc. There should be no parameter.

This flag SHOULD be used to avoid expensive computations of a Span attributes or
events in case when a Span is definitely not recorded. Note that any child
span's recording is determined independently from the value of this flag
(typically based on the sampled flag of a TraceFlag on
SpanContext).

I haven't seen examples of these expensive computations yet, but it seems unrelated to whether the span itself should be exported. I would argue for a separate context variable (suppressInstrumentation to borrow our current python convention).

The additional pro/con of a suppressInstrumentation flag separate from the Span itself is the fact that both metrics and spans can share that value. But maybe, it would be better if we had those as two separate flags as most likely you will only want to suppress one or the other, most likely in the exporter.

So maybe:

suppressTraceInstrumentation and suppressMetricInstrumentation?

toumorokoshi on 1 Apr 2020

I haven't seen examples of these expensive computations yet

You have to look at everything the instrumentation does for each span. E.g. the sum of all the things that the Python WSGI instrumentation does per request is relatively expensive.

Oberon00 on 1 Apr 2020

@toumorokoshi I was saying the primary use of isRecording is to avoid expensive operations to gather span attributes. @Oberon00 had recommended checking the context for instrumentation disabled flag within instrumentations because it would also avoid said expensive computations.

In my opinion the "correct" implementation on the instrumentation side would be:

instrumentation calls startSpan
tracer sees instrumentation disabled, returns a non-recording span
instrumentation sees span is non-recording and avoids expensive operations

This avoids the need for every instrumentation to check this flag. Because there will be many instrumentations, we cannot guarantee none of them will forget.

The flag is set on the context by span processors before they call exporters, or optionally in the exporter. I have no strong opinion on this difference.

If an instrumentation wants to disable the instrumentation of the lower-level protocols, it may also set the flag. e.g. some database driver which uses http to talk to the database may want to disable the http instrumentation in favor of instrumenting the protocol they've built on top of it.

If starting the span itself requires one of these expensive operations, then nothing is preventing the instrumentations from _also_ checking the context flag.

dyladan on 1 Apr 2020

I haven't seen examples of these expensive computations yet

You have to look at everything the instrumentation does for each span. E.g. the sum of all the things that the Python WSGI instrumentation does per request is relatively expensive.

I think I understand. You're referring to adding a check, per integration, to see if the IsRecording flag is set. if it's false, then we avoid adding attributes altogether, which would skip a non-trivial amount of code to collect and add attributes:

https://github.com/open-telemetry/opentelemetry-python/blob/master/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/__init__.py#L121

I think if the spec called out explicitly that the resulting span would actually be missing attributes that would have been added in an IsRecording scenario, that would have been a little clearer to me. Thank you for clarifying.

In my opinion the "correct" implementation on the instrumentation side would be:

Make sense. As a clarification, is it just a non-recording span? It's a non-recording, no-op span that ensures that the span will never reach the exporter in the first place, correct?

Although it makes sense to me intuitively, I don't see a call-out for a no-op or default span object that would not emit spans to spanprocessors in the specification. So maybe that's something that should be added as well.

toumorokoshi on 2 Apr 2020

I think I understand. You're referring to adding a check, per integration, to see if the IsRecording flag is set. if it's false, then we avoid adding attributes altogether, which would skip a non-trivial amount of code to collect and add attributes

yes

I think if the spec called out explicitly that the resulting span would actually be missing attributes that would have been added in an IsRecording scenario, that would have been a little clearer to me

A non-recording span doesn't record any attributes at all

As a clarification, is it just a non-recording span? It's a non-recording, no-op span that ensures that the span will never reach the exporter in the first place, correct?

A no-op span is one step further than a non-recording span. A non-recording span still propagates context, just doesn't record anything to the backend. It is useful in cases where you do not want to break traces. A no-op span can't be propagated, because it has no trace context or trace state. The non-recording span isn't sent to the backend, but propagates trace context correctly so that traces aren't broken. In this instance, a no-op span would be equally effective and possibly an even better choice.

dyladan on 2 Apr 2020

@toumorokoshi Your link is actually perfect to illustrate why we additionally could use the "surpress_instrumentation" context variable: In https://github.com/open-telemetry/opentelemetry-python/blob/v0.6.0/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/__init__.py#L121, attributes are collected to be passed to start_span (so the sampler has more info and knows e.g. the URL), so the is_recording property could only be queried after.

Oberon00 on 2 Apr 2020

@toumorokoshi Your link is actually perfect to illustrate why we additionally could use the "surpress_instrumentation" context variable: In https://github.com/open-telemetry/opentelemetry-python/blob/v0.6.0/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/__init__.py#L121, attributes are collected to be passed to start_span (so the sampler has more info and knows e.g. the URL), so the is_recording property could only be queried after.

Ah! That's a very good point. I guess between the two, I would probably op to just use the context variable, which can then be wrapped in a utiltity method on the trace module itself like trace.instrumentationEnabled(). And I do agree that, in that case, that may make isRecording redundant.

toumorokoshi on 3 Apr 2020

@pauldraper will a PR be filed on this? I like this approach and would love to see it in the spec so I can implement it.

toumorokoshi on 17 Apr 2020

@trask I wonder, how does auto-instr-java deal with that? It has a gRPC instrumentation, would that not cause a feedback loop with the OTLP exporter currently?

Oberon00 on 25 May 2020

👍1

In auto-instr-java, we load the exporter (and it's dependencies) inside of a separate class loader. This is primarily done to avoid version conflicts, but has the nice the side-effect that we can skip instrumentation of those classes (by looking at what classloader they're in) and avoid the feedback loop.

trask on 25 May 2020

One interesting caveat to the above: https://github.com/open-telemetry/opentelemetry-auto-instr-java/issues/375#issuecomment-630452595

trask on 25 May 2020

👍1

For a reference point, I've run into 3) myself. I find spans from e.g., the netty instrumentation noise when they're being wrapped by a client library, for example the aws sdk. Others may not though meaning some configurability would be nice. But it creates the problem that

1) With auto instrumentation, it wouldn't really be possible to disable netty instrumentation only when called from an RPC SDK, have to disable it globally

2) Even when disabled, it causes issues where semantics depend on whether there is an underlying instrumentation or not like in https://github.com/open-telemetry/opentelemetry-auto-instr-java/pull/323

Being able to pause new spans for a stack frame would help with this.

anuraaga on 26 May 2020

This was discussed in today's Spec SIG.
Would someone involved in the conversation today please post a summary of any decisions that were made?

jmacd on 27 Apr 2021

I created a PR based on the discussion from the SIG #1653

The discussion was primarily around clients creating and exporting APIs that were not specified and how to do it in a way that is compatible with future spec. It was generally agreed that it would be better to export it in some sort of API extensions package that can be rev'd to 2.0 without having to rev the whole API, and would not interfere if the API specification is added later.

In this particular instance, it was agreed that this is a generally good mechanism and should be added to the spec, so that's what #1653 does.

dyladan on 27 Apr 2021

@dyladan was there a decision around what the right way to move forward for suppressing instrumentations? I see the proposed solution of making a key available through the API was closed.