Opentelemetry-specification: Revert use of PropagatedSpan back to using a special Context key for extracted Span Context

Created on 21 Oct 2020 · 21Comments · Source: open-telemetry/opentelemetry-specification

What are you trying to achieve?

In discussion last night in the Spec SIG meeting we talked about the confusion added with PropagatedSpan and whether it didn't cause more confusion than the issue it was trying to resolve (the code required to check if an extracted Span Context exists if there isn't a parent in the current Context or the Context manually passed as the parent to StartSpan.

I'm opening a revert of the PropagatedSpan PR to go along with this issue.

api p1 required-for-ga trace

Source

tsloughter

👎2

Most helpful comment

The use of two context key is not a requirement, the value can be a union or the Either type. Therefore there is no 4 possible states, only 3: nothing, SpanContext, Span.

What I like about it is the API-level enforcement. A remote span context should not be pretending to be a writeable span because the receiver should not be writing to the span created by the caller (Zipkin's shared span model notwithstanding, because it can still be accommodated, like Jaeger SDKs do). So with OneOf[None, SpanContext, Span] approach the user cannot try to do something with SpanContext that should not be possible, enforced at compile time.

Having said that, the situation where we actually need SpanContext in the Context is very limited in scope, it's the time between Extract() and tracer.StartSpan(). Not sure if it's worth exposing the 3-state complexity to every other call site where the SpanContext state would almost never be valid/relevant.

So I'd be ok with just #1131.

yurishkuro on 22 Oct 2020

👍2

All 21 comments

In discussion last night in the Spec SIG meeting we talked about the confusion added with PropagatedSpan and whether it didn't cause more confusion than the issue it was trying to resolve (the code required to check if an extracted Span Context exists if there isn't a parent in the current Context or the Context manually passed as the parent to StartSpan.

It is not about the code requiring to check that, it is about the fact that it is error prone, and will lead to unexpected behaviors.

Let's assume I want to stop tracing in one part of the code:

// I want to decouple the next code from the trace
Context context = SpanUtils.with(Context.current(), null);
try (Scope scope = context.activate()) {
  ... // do some code
  ...
  thirdpartyfoo()
  ...
}

thirdpartyfoo() {
  // Based on my intention this should be a root span, but will not be because there is a SpanContext leaked.
  tracer.startSpan(...);
}

Confusion for other signal correlations (e.g. trying to add ids to logs):
- Do I get the current Span? What do I do if not present?
- Do I get the current SpanContext? What do I do if this is present but no Span?
Batching scenarios:
- With one entry is simple I create a new Span (use links to link to all the other spans).
- With two entries, what do I put for SpanContext? Do I create a "linked" span context?

My conclusion, having 2 entries to the context is way more confusing and error prone than the "Propagation Span".

bogdandrutu on 21 Oct 2020

👍2

Even if you call this a "revert", this is making the spec objectively more complicated. Since we always coalesce null/not existing/invalid into invalid for Span and SpanContext, we currently have only two context states regarding Span: There is one (that is valid) or there is none. If we implement this issue, we have four states:

| Span | SpanContext |
|-------|-------------|
| valid | valid |
| valid | none |
| none | valid |
| none | none |

Especially the first listed state is problematic.

Thus I'm against this change.

Oberon00 on 22 Oct 2020

I think the confusion about PropagatedSpan is better addressed with renaming to NonRecodingSpan, as suggested in #1087

Oberon00 on 22 Oct 2020

👍2

NonRecordingSpan doesn't convey to me that this is the extracted span.

I'm not sure what those states of valid/none are. There is either a span of some sort in the passed context (manual parent), the current active context or an extracted span context.

If the most maintainers find it easier to not have the second context field to check then I'm ok with that. But I would suggest instead of adding NonRecordingSpan why not just set is_recording to false when extracting?

Technically could be another class that inherits from Span or whatever in OO languages but we already have an is_recording so feels confusing, and OO specific, to be adding an additional type of Span that is called NonRecordingSpan.

tsloughter on 22 Oct 2020

@open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers please review/comment. We need a decision on this.

tigrannajaryan on 22 Oct 2020

@tsloughter

NonRecordingSpan doesn't convey to me that this is the extracted span.

It should not convey this, because there may be other reasons to use NonRecordingSpan.

I'm not sure what those states of valid/none are

None means there is none (or equivalently an invalid one). Valid means that there is one (and that is valid).

There is either a span of some sort in the passed context (manual parent), the current active context or an extracted span context.

That, and there can also be both. I see two problems with that:

We need to establish rules for the semantics of having both keys. Depending on the implementation, API users may need to know these rules too.
The distinction between a spancontext and a span on the context is useless and purely an implementation artifact.

But I would suggest instead of adding NonRecordingSpan why not just set is_recording to false when extracting?

Technically could be another class that inherits from Span or whatever in OO languages but we already have an is_recording so feels confusing, and OO specific, to be adding an additional type of Span that is called NonRecordingSpan.

The current spec does not require implementing the non-recording span concept with a NonRecordingSpan type. That is an OO specific way of interpreting the spec. EDIT: See https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#propagated-span-creation. Notice how the spec only says that "The API MUST provide an operation for wrapping a SpanReference with an object implementing the Span interface." and "If a new type is required for supporting this operation, it SHOULD be named PropagatedSpan." (bold added by me)

Oberon00 on 22 Oct 2020

@Oberon00 but you are introducing NonRecordingSpan in the spec when we already having Span and is_recording. Why is that?

tsloughter on 22 Oct 2020

I also thought the rules were pretty clear, if a span exists in the passed context that takes precedence, if not, then if there is one in the current context use that and lastly check the extracted key.

But I think I'd be ok with just putting it in the Context as a usual Span but marked with IsRecording false. At least I haven't thought of a reason that could backfire yet.

tsloughter on 22 Oct 2020

But you are introducing NonRecordingSpan in the spec when we already having Span and is_recording. Why is that?

I do not want to introduce NonRecordingSpan as a a public type. For example in Java the "PropagatedSpan" is implemented as a single public method Span wrap(SpanContext spanContext) and I think that is the ideal implementation of this spec (the implementing type is called DefaultSpan and it is more lightweight than the full Span but that is a detail).

Maybe the spec could be reworded (editorial change) to make more clear that we do not want a new public type if possible.

Oberon00 on 22 Oct 2020

But I think I'd be ok with just putting it in the Context as a usual Span but marked with IsRecording false.

Only problem with that is that you need to have this API working without an SDK. But otherwise this is a valid implementation, that is allowed even today.

Oberon00 on 22 Oct 2020

I will propose a PR to clarify (hopefully really only clarify without changing requiremts) the spec.

Oberon00 on 22 Oct 2020

I created #1131. Does that help?

Oberon00 on 22 Oct 2020

👍2

@tsloughter I think you have a very good view of the problem from a non-oo language perspective, and we may have problems to understand the reasons. So please bear with us and help us understand better. Thanks in advance.

Technically could be another class that inherits from Span or whatever in OO languages but we already have an is_recording so feels confusing, and OO specific, to be adding an additional type of Span that is called NonRecordingSpan.

Indeed in OO languages what we try to say is that this is a different implementation of the Span that always returns false for is_recording. Also one of the main reasons that is a different class is that this functionality requires to be implemented in the API, so the sdk will have its own implementation of the Span (or it may not extend this class) that does all the recordings etc. and returns true on the is_recording.

I also thought the rules were pretty clear, if a span exists in the passed context that takes precedence, if not, then if there is one in the current context use that and lastly check the extracted key.

Here I disagree with you, and confirms somehow that there is a confusion. If users pass a context implicitly to the API when starting a Span the current context should be completely ignored. So my understanding of that would be if there is a Span in the passed context it takes precedence, if not, then if there is an extracted key use that, otherwise the Span has no parent which means it becomes root.

But I think I'd be ok with just putting it in the Context as a usual Span but marked with IsRecording false. At least I haven't thought of a reason that could backfire yet.

That is what we are trying to suggest but as @Oberon00 mentioned, we want this to be possible without an implementation available, because we require that the API propagates the incoming context even without an implementation.

bogdandrutu on 22 Oct 2020

possible without an implementation available

Ooh, maybe this is what I'm missing...

tsloughter on 22 Oct 2020

👍1

I'm personally in support of "propagated span" and #1131 definitely helps. Here's another technical arguments in support of "Revert".

Especially the first listed state is problematic.

I agree it's a little bit problematic, but there are problems everywhere we look. I would define-away this problem by stating that Extract() should reset the Span in the context when starting a new local root in the trace. The API would not let you enter that state.

The use of two context keys was nice, in my opinion, because anywhere in the local sub-trace of a request I'm able to ask what my remote-parent's span context is (and its tracestate, I think?).

jmacd on 22 Oct 2020

@yurishkuro your input would be appreciated.

jmacd on 22 Oct 2020

The use of two context keys was nice, in my opinion, because anywhere in the local sub-trace of a request I'm able to ask what my remote-parent's span context is (and its tracestate, I think?).

If we want that capability, we should explicitly mention it at least in a note. Java never supported that.

Oberon00 on 22 Oct 2020

The use of two context key is not a requirement, the value can be a union or the Either type. Therefore there is no 4 possible states, only 3: nothing, SpanContext, Span.

So I'd be ok with just #1131.

yurishkuro on 22 Oct 2020

👍2

(Marking as Required-for-GA as we need to either accept this or close it before GA).

carlosalberto on 23 Oct 2020

I'd like to be sure everyone who voiced concern on the call the other day is satisfied with #1131 and then I can close this and the PR.

tsloughter on 23 Oct 2020

I think maybe @arminru was the other person who voiced concern and I'd like to get agreement from before closing this.

I think @jmacd has a good point about being able to know the remote parent at any point in the stack is interesting but without a viable use case it isn't worth reverting for as it can probably be added later?

tsloughter on 23 Oct 2020

Was this page helpful?

0 / 5 - 0 ratings