What are you trying to achieve?
In discussion last night in the Spec SIG meeting we talked about the confusion added with PropagatedSpan and whether it didn't cause more confusion than the issue it was trying to resolve (the code required to check if an extracted Span Context exists if there isn't a parent in the current Context or the Context manually passed as the parent to StartSpan.
I'm opening a revert of the PropagatedSpan PR to go along with this issue.
In discussion last night in the Spec SIG meeting we talked about the confusion added with PropagatedSpan and whether it didn't cause more confusion than the issue it was trying to resolve (the code required to check if an extracted Span Context exists if there isn't a parent in the current Context or the Context manually passed as the parent to StartSpan.
It is not about the code requiring to check that, it is about the fact that it is error prone, and will lead to unexpected behaviors.
// I want to decouple the next code from the trace
Context context = SpanUtils.with(Context.current(), null);
try (Scope scope = context.activate()) {
... // do some code
...
thirdpartyfoo()
...
}
thirdpartyfoo() {
// Based on my intention this should be a root span, but will not be because there is a SpanContext leaked.
tracer.startSpan(...);
}
Confusion for other signal correlations (e.g. trying to add ids to logs):
Span? What do I do if not present?SpanContext? What do I do if this is present but no Span?Batching scenarios:
My conclusion, having 2 entries to the context is way more confusing and error prone than the "Propagation Span".
Even if you call this a "revert", this is making the spec objectively more complicated. Since we always coalesce null/not existing/invalid into invalid for Span and SpanContext, we currently have only two context states regarding Span: There is one (that is valid) or there is none. If we implement this issue, we have four states:
| Span | SpanContext |
|-------|-------------|
| valid | valid |
| valid | none |
| none | valid |
| none | none |
Especially the first listed state is problematic.
Thus I'm against this change.
I think the confusion about PropagatedSpan is better addressed with renaming to NonRecodingSpan, as suggested in #1087
NonRecordingSpan doesn't convey to me that this is the extracted span.
I'm not sure what those states of valid/none are. There is either a span of some sort in the passed context (manual parent), the current active context or an extracted span context.
If the most maintainers find it easier to not have the second context field to check then I'm ok with that. But I would suggest instead of adding NonRecordingSpan why not just set is_recording to false when extracting?
Technically could be another class that inherits from Span or whatever in OO languages but we already have an is_recording so feels confusing, and OO specific, to be adding an additional type of Span that is called NonRecordingSpan.
@open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers please review/comment. We need a decision on this.
@tsloughter
NonRecordingSpan doesn't convey to me that this is the extracted span.
It should not convey this, because there may be other reasons to use NonRecordingSpan.
I'm not sure what those states of valid/none are
None means there is none (or equivalently an invalid one). Valid means that there is one (and that is valid).
There is either a span of some sort in the passed context (manual parent), the current active context or an extracted span context.
That, and there can also be both. I see two problems with that:
But I would suggest instead of adding NonRecordingSpan why not just set is_recording to false when extracting?
Technically could be another class that inherits from Span or whatever in OO languages but we already have an is_recording so feels confusing, and OO specific, to be adding an additional type of Span that is called NonRecordingSpan.
The current spec does not require implementing the non-recording span concept with a NonRecordingSpan type. That is an OO specific way of interpreting the spec. EDIT: See https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#propagated-span-creation. Notice how the spec only says that "The API MUST provide an operation for wrapping a SpanReference with an object implementing the Span interface." and "If a new type is required for supporting this operation, it SHOULD be named PropagatedSpan." (bold added by me)
@Oberon00 but you are introducing NonRecordingSpan in the spec when we already having Span and is_recording. Why is that?
I also thought the rules were pretty clear, if a span exists in the passed context that takes precedence, if not, then if there is one in the current context use that and lastly check the extracted key.
But I think I'd be ok with just putting it in the Context as a usual Span but marked with IsRecording false. At least I haven't thought of a reason that could backfire yet.
But you are introducing NonRecordingSpan in the spec when we already having Span and is_recording. Why is that?
I do not want to introduce NonRecordingSpan as a a public type. For example in Java the "PropagatedSpan" is implemented as a single public method Span wrap(SpanContext spanContext) and I think that is the ideal implementation of this spec (the implementing type is called DefaultSpan and it is more lightweight than the full Span but that is a detail).
Maybe the spec could be reworded (editorial change) to make more clear that we do not want a new public type if possible.
But I think I'd be ok with just putting it in the Context as a usual Span but marked with IsRecording false.
Only problem with that is that you need to have this API working without an SDK. But otherwise this is a valid implementation, that is allowed even today.
I will propose a PR to clarify (hopefully really only clarify without changing requiremts) the spec.
I created #1131. Does that help?
@tsloughter I think you have a very good view of the problem from a non-oo language perspective, and we may have problems to understand the reasons. So please bear with us and help us understand better. Thanks in advance.
Technically could be another class that inherits from Span or whatever in OO languages but we already have an is_recording so feels confusing, and OO specific, to be adding an additional type of Span that is called NonRecordingSpan.
Indeed in OO languages what we try to say is that this is a different implementation of the Span that always returns false for is_recording. Also one of the main reasons that is a different class is that this functionality requires to be implemented in the API, so the sdk will have its own implementation of the Span (or it may not extend this class) that does all the recordings etc. and returns true on the is_recording.
I also thought the rules were pretty clear, if a span exists in the passed context that takes precedence, if not, then if there is one in the current context use that and lastly check the extracted key.
Here I disagree with you, and confirms somehow that there is a confusion. If users pass a context implicitly to the API when starting a Span the current context should be completely ignored. So my understanding of that would be if there is a Span in the passed context it takes precedence, if not, then if there is an extracted key use that, otherwise the Span has no parent which means it becomes root.
But I think I'd be ok with just putting it in the Context as a usual Span but marked with IsRecording false. At least I haven't thought of a reason that could backfire yet.
That is what we are trying to suggest but as @Oberon00 mentioned, we want this to be possible without an implementation available, because we require that the API propagates the incoming context even without an implementation.
possible without an implementation available
Ooh, maybe this is what I'm missing...
I'm personally in support of "propagated span" and #1131 definitely helps. Here's another technical arguments in support of "Revert".
Especially the first listed state is problematic.
I agree it's a little bit problematic, but there are problems everywhere we look. I would define-away this problem by stating that Extract() should reset the Span in the context when starting a new local root in the trace. The API would not let you enter that state.
The use of two context keys was nice, in my opinion, because anywhere in the local sub-trace of a request I'm able to ask what my remote-parent's span context is (and its tracestate, I think?).
@yurishkuro your input would be appreciated.
The use of two context keys was nice, in my opinion, because anywhere in the local sub-trace of a request I'm able to ask what my remote-parent's span context is (and its tracestate, I think?).
If we want that capability, we should explicitly mention it at least in a note. Java never supported that.
The use of two context key is not a requirement, the value can be a union or the Either type. Therefore there is no 4 possible states, only 3: nothing, SpanContext, Span.
What I like about it is the API-level enforcement. A remote span context should not be pretending to be a writeable span because the receiver should not be writing to the span created by the caller (Zipkin's shared span model notwithstanding, because it can still be accommodated, like Jaeger SDKs do). So with OneOf[None, SpanContext, Span] approach the user cannot try to do something with SpanContext that should not be possible, enforced at compile time.
Having said that, the situation where we actually need SpanContext in the Context is very limited in scope, it's the time between Extract() and tracer.StartSpan(). Not sure if it's worth exposing the 3-state complexity to every other call site where the SpanContext state would almost never be valid/relevant.
So I'd be ok with just #1131.
(Marking as Required-for-GA as we need to either accept this or close it before GA).
I'd like to be sure everyone who voiced concern on the call the other day is satisfied with #1131 and then I can close this and the PR.
I think maybe @arminru was the other person who voiced concern and I'd like to get agreement from before closing this.
I think @jmacd has a good point about being able to know the remote parent at any point in the stack is interesting but without a viable use case it isn't worth reverting for as it can probably be added later?
Most helpful comment
The use of two context key is not a requirement, the value can be a union or the Either type. Therefore there is no 4 possible states, only 3: nothing, SpanContext, Span.
What I like about it is the API-level enforcement. A remote span context should not be pretending to be a writeable span because the receiver should not be writing to the span created by the caller (Zipkin's shared span model notwithstanding, because it can still be accommodated, like Jaeger SDKs do). So with
OneOf[None, SpanContext, Span]approach the user cannot try to do something with SpanContext that should not be possible, enforced at compile time.Having said that, the situation where we actually need SpanContext in the Context is very limited in scope, it's the time between Extract() and tracer.StartSpan(). Not sure if it's worth exposing the 3-state complexity to every other call site where the SpanContext state would almost never be valid/relevant.
So I'd be ok with just #1131.