It was proposed in https://github.com/open-telemetry/oteps/pull/123#issuecomment-656788771 and subsequent discussion that we should drop Span.Status from the spec. If there is an agreement on that, I can create a PR.
There have been many PRs related to errors so would help if you could summarize the current state of what we have if Status is removed. For example, is there a generic way of knowing whether a span is a success or failure? From what I can tell, presence of exception is that, but I'm not sure
Other affected areas would be:
@anuraaga There is no generic way to detect if a span has failed. The status is also not really suitable for this, except if you also consider things such as HTTP 404 as errors.
Still, I think a generic error=true or error=name_of_attribute_with_detailed_info (e.g. error=http.status_code) could be introduced before removing the status iff we can find agreement on such a thing.
Exceptions are not the same as errors. Often a span with at least one exception will be failed, but all of them could be handled too. Also, errors can occur without exceptions.
Presence of exception does not necessarily mean span was a failure. This was discussed in #697. And I have just discovered that spec does not have any error=true attribute, so something should be proposed to replace status... Not as easy as I thought :)
The simplest replacement would indeed be generic error=<boolean> field or attribute.
Could we create a separate issue for discussing how to indicate errors generically, once status is removed? Or maybe #599 is the place to discuss this.
I am afraid we cannot just remove status without offering any way to signal that this span signifies an error. The simplest way forward, assuming that we have #697 in its current form, is the following:
errorDuring todays Errors WG meeting it was agreed:
Span.status field from API and specSpan.status as deprecated in OTLPstatus to errorAs a separate step to start discussion about adding severity notion to Span reusing log severity concept
A discussion about severity on Span events is happening in #698
Just a top level boolean failed field should be enough to identify the Span as having an error should be sufficient as a quick check. If this is true then the span attributes can be checked for further info. The other change that needs to be made is to add a gRPC semantic convention to report gRPC status code since this will now not be in status.
Collapsing the span status field into a single boolean seems problematic to me. The main problem really lies in the definition of "error" states. If the instrumentation is responsible for adding an error boolean to spans (and, presumably, a matching label to metrics), then it must be aware of what the current user considers an error condition. This implies some configuration API that _something_ uses to set the desired error conditions, and another API that instrumentation must use to configure itself.
The instrumentation should apply as much information it can to the spans and metrics it creates; it's easier to reduce granularity of error conditions downstream than it is to increase it.
In my opinion, this complexity belongs in an exporter, or in a metric view, or in a vendor's back-end.
Collapsing the span status field into a single boolean seems problematic to me.
Rather than collapsing, I think we are exploding it: Instead of a single canonical status we will have:
status.hresult , status.errno, status.win32_last_error, ...Collapsing the span status field into a single boolean seems problematic to me.
Collapsing every conceivable error in a fixed StatusCode was already quite bad. The error boolean is obviously not gonna cut it alone: It's just meant as a supplement, for easier detection, it does not really add that much information in itself.
(Can I take this as an opportunity to plug the idea of setting the value of error not to true, but to the name of the attribute with more information?)
(Can I take this as an opportunity to plug the idea of setting the value of error not to true, but to the name of the attribute with more information?)
No :) One of the benefits of having simple boolean _field_ is ease of reading it. Without the need to parse attributes.
We can, should and will add semantic conventions/more fields with more information about the error state of the given span. But without sacrificing easy and fast way for collector/backend/exporter to get very meaningful information from the span.
Isn't that independent? If the error field/attribute is non-empty, that is equivalent to true. Only if you actually want to access the additional information, then you need to parse the error.
Why do you want that name to be in error field? Why should it be dynamic at all? If you want to have a separate attribute on the Span, containing some detailed information about some error condition (independent from exception events), then we can add a semantic convention for that with a fixed name. And error can remain a simple, efficient, boolean.
What benefit will bring such indirect reference via error field as opposed to a separate semantic attribute?
It will bring no benefit except for conserving a field name. Using a separate attribute like error.seealso sounds good too.
Why should it be dynamic at all? If you want to have a separate attribute on the Span, containing some detailed information about some error condition (independent from exception events), then we can add a semantic convention for that with a fixed name.
Maybe now I understand where we are talking past each other: I think that we already have different names for error detail attributes, and should continue to have and add even more different names. For example http.status_code is one such attribute. It has well-defined semantics beyond just being "detailed error info". Still it would be cool if your backend's UI could not only add an ❌ to the span but could put "(http.status_code=500) next to it, even without knowing about the http.status_code semantic convention specifically.
A! Now I get it! :) I think this is overcomplicating, but I will not object it. If nobody comment on this question here, I will raise this question during Error WG this Thursday.
I'm happy to find a simpler way. But some way to classify errors even without knowing a particular semantic convention the span falls into would be good (although IMHO not necessarily required for GA). Ideally without forcing instrumentations to translate their natural error conditions into some enum. 😃
Still, my main concern is this:
The main problem really lies in the definition of "error" states. If the instrumentation is responsible for adding an error boolean to spans (and, presumably, a matching label to metrics), then it must be aware of what the current user considers an error condition.
Making the instrumentation responsible for knowing what constitutes an error feels like a big can of worms to me. I'd much prefer to see instrumentation add non-opinionated data to spans and metrics, and for views, exporters, and back-ends to apply their own logic to interpret that data.
Hmm, you are right that this error.seealso/whatever attribute could be at most an optional hint. So maybe it is not that useful after all.
@justinfoote Do you oppose adding error field or removing Span.Status? I understand your concern about boolean field. It is indeed may be not that easy for auto-instrumentations to decide. But setting current status is not straightforward as well.
I oppose both. 😄
I like Span.Status because it is not opinionated. It describes the state of the span in a relatively low-cardinality field, with logic that is not configurable. It doesn't make any assumptions about what the user cares about, it just tries to reflect reality in a field with a set enumeration of values.
I also dislike the error boolean for the reasons I've stated.
The current generation New Relic agents use an error boolean, and they must perform a handshake with the backend on startup to resolve error condition settings -- which http status codes represent an error, which exception classes should be ignored, etc... I'd prefer _not_ to see this complexity reproduced in the OpenTelemetry agents.
The current generation New Relic agents use an error boolean, and they must perform a handshake with the backend on startup to resolve error condition settings -- which http status codes represent an error
I don't think this is the direction we plan to go with our error. It simply should be a hint from the instrumentation when it thinks that something is likely an error. I think error=true should simply be equivalent to today's status != OK && status != UNKNOWN. EDIT: The backend could override that decision based e.g. on the http.status_code attribute if it knows about that.
I like Span.Status because it is not opinionated.
The choice of possible status enum values is very opinionated and I think that is what most people (at least myself) did not like about it.
I like
Span.Statusbecause it is not opinionated
I would argue with this. E.g. mapping from http status code to gRPC status is not obvious. Or how do you map SQL exception into it?
OK, I can see your point that status is opinionated. I think I didn't explain myself well enough. It _is_ opinionated, but the opinions will be defined upfront in semantic conventions, not at run-time in the instrumentation.
There are definitely decisions to make about how to map specific error conditions to status code. Some of them are very straightforward (like http 404 maps to NOT_FOUND).
Database exceptions may be less straightforward, but I haven't found errors which would be too difficult to map. This seems like a job for semantic conventions -- we should be able to provide guidelines sufficient for instrumentation providers to provide consistent behavior here.
@Oberon00, this:
I think error=true should simply be equivalent to today's status != OK && status != UNKNOWN.
Is exactly the sort of slippery slope I'm concerned about. Your opinion is that UNKNOWN is not an error, and that every other status _is_. But I would expect that UNKNOWN should be an error. The description of the UNKNOWN status code is:
Unknown error. For example, this error may be returned when a Status value received from another address space belongs to an error space that is not known in this address space. Also errors raised by APIs that do not return enough error information may be converted to this error.
This seems like a good signal that _something_ went wrong, even if the instrumentation doesn't know enough to determine what it was. I would expect that this, along with an http.status_code or error.class attributes would be enough for a back-end to provide a good user experience about this failure.
In addition, I think it's likely that other statuses _should not_ represent errors.
For example, it is very common for our users to ignore HTTP 404; that is, to not count them as errors. So perhaps some customers would want the error condition to be status != OK && status != NOT_FOUND.
In other cases, users ignore unauthenticated requests. Perhaps this is the standard response before their users have logged in, and it represents normal user behavior. For them, perhaps the error condition is status != OK && status != NOT_FOUND && status != UNAUTHENTICATED.
The description of the UNKNOWN status code is: Unknown error
Oh I did not know that. IIRC it was named UnknownError previously in OTel and I thought that the renaming also changed the semantics (#385). Then I would simply go for status != OK, I also think we should be wary of a slippery slope here.
OK, if we use status != OK as the criterion for an error condition, it's almost guaranteed to not match the conventions of some users. And, if the error boolean does not represent the error conditions that the customer cares about, what value does it add?
A sensible default that can be ignored if the backend knows the specific semantic convention, or is specifically configured. E.g. display spans as error if they have error=true, unless they also have http.status_code=404 would work. The error=true boolean is just one more signal that backends can use to determine if a Span was an error.
At the spec SIG on Tuesday I mentioned three components of error handling that need to be addressed.
We have discussed at length that there is not universal agreement on what constitutes an error for an application. I would propose that we will never reach agreement here. However, I think we can agree that if an error does happen, you want to know about it, and there should be a way to indicate this on a span. Annotating whether or not as span has had an error is a requirement, even if we can't agree on what an error is. This is what I'm calling _error identification_.
For the second part of this issue, we do need a way for users to decide that something that has been identified as an error, should not be considered an error for them. This is what I'm referring to as _error suppression_. We should give some thought into how we can facilitate this requirement from the SDK side, as well as what approaches could be used for backends.
Matt, I really like this framing, and I agree that these three components of error handling need to be addressed.
The way I read this debate is something like this:
I'm arguing that they belong downstream.
Also, I believe that error identification and error suppression _could_ be defined as a single concern. There's no reason to identify errors upstream (that is: in instrumentation) only to suppress those errors downstream (possibly in an exporter or in a vendor's back-end).
I may be talking myself out of the canonical status code. Perhaps the best design is for instrumentation to add full details of exceptions and/or status codes, without passing judgement about whether that data qualifies as an error. Any interpretation of that data (that is: flagging errors with a boolean or attempting to categorize _status_ into an enum) belongs downstream.
Exactly _where_ downstream is something to discuss, I believe. For metrics, I think this belongs in a MetricView sort of object. For spans, I'm not sure. In a SpanProcessor? In an exporter?
The main complication, and the core reason that error suppression needs to exist is due to auto-instrumentation. Without auto-instrumentation, an application developer would not set error=true on a span if they didn't consider something to be an error condition and it would be a non-issue.
Instrumentation is where errors are going to encountered and it will to make the best decision it can with the information it has at the time. Often this will be catch an exception, record it, then re-throw. I don't think you can push this downstream because the instrumentation is where you will have a handle on the error and a chance to process it. I would also posit that in most cases, auto-instrumentation makes the right decision for errors. There are edge cases, and matters of interpretation where a user may want override that decision.
Given that auto-instrumentation is the likely generator of false positives (error=true tags or equivalent), I could see a few ways to manage this. It could be handled on the tracing backend completely. Each span has an operation name, and information identifying the instrumentation library that generated it. It would be possible to filter out errors based on user defined rules on the tracing backend.
Another possibility, is that an SDK could choose to implement something such as per instrumentation library exception handlers. Custom code be registered that could mark an exception to be recorded, or ignored on a per component basis.
The per-component exception handlers might be a way to combine identification and suppression into a single concern, but I'm not sure it's completely necessary, and a backend solution might suffice. I'd be interested in any other proposals on how this can handled.
I'm going to weigh in that I don't like the flavor of the current status and greatly prefer the processing simplicity of error=true. Why/how errors happened and what attributes are relevant to that are multi-dimensional - hence the need for http.status_code, stacktrace, etc. (@Oberon00 has a great list above).; for that reason I don't think a "pointer attribute" is a great approach. As for special cases like specific apps/users wanting particular HTTP status codes to be/not be an error, it's easy enough to write processor/UI logic accordingly.
Based on WG meeting, July 23rd, 2020, some changes are in order.
From the backward compatibility point of view it is much easier to add things later, than to remove them. There is no consensus on boolean error field or any other indication of error status of span that can be assigned by instrumenting library (or auto instrumentation).
Based on the above we are going to remove Span.Status without any replacement for the time being. The work to find such a replacement will continue, but will not block this issue's implementation.
@iNikem talked about this during the bug triage session, sounds like a PR to remove the span.status from api would be desirable to finish this issue. cc @open-telemetry/technical-committee
from the conversations, a proposal to replace will be a follow up action separate from this issue.
@andrewhsu I will try to prepare such PR next week. I have a concern about wire protocols. OTLP is declared stable and it has this field. OTLP->Jaeger protocol translation uses this field as well. I will reach to @tigrannajaryan and @bogdandrutu regarding these issues.
We need to decide this by the end of the week. Next steps:
See proposal in https://docs.google.com/document/d/1HgUI69rBridFzCXxXuTjQrkG6jb-YcFQnZjPcyBcK1U/edit made during error WG meeting for a replacement:
If we keep Span.Status, but change the semantics, would that be a better solution?
- UNSPECIFIED the status has not been set and the the backend must determine if an error has occurred by inspecting the span.
- ERROR mark as an error, regardless of the attached semantic convention.
- OK do not mark as an error, regardless of the attached semantic convention.
Note that there was no clear unanimous agreement on this as far as I could see. However, I'd agree that this is an improvement over the current status.
Pinging @lmolkova who originally created #306.
FYI this OTEP addresses this issue: https://github.com/open-telemetry/oteps/pull/136
This issue should now be closed in favor of https://github.com/open-telemetry/opentelemetry-specification/issues/965.
Closing this issue in favor of #965.
Most helpful comment
I am afraid we cannot just remove status without offering any way to signal that this span signifies an error. The simplest way forward, assuming that we have #697 in its current form, is the following:
error