StatusCanonicalCode is suitable for gRPC codes.
Even when it comes to HTTP instrumentation, the real status appears in the attributes rather than Status. The soon-to-be-guidance for mapping will introduce redundancy by bucketing HTTP codes into a bit smaller subset of codes.
For other protocols (or logical operations), it seems everyone has to find the most appropriate status in the gRPC codes and map their error into one from the list.
The feedback I've got from the team inside Azure which did the instrumentation (hi @pakrym), that StatusCanonicalCode is not the perfect match for arbitrary error codes agnostic to the protocol/logical operation.
So Unknown error with elaborate description is the options they go with, at least for now.
Nitpicking: aborted vs cancelled vs DeadlineExceeded and similar ambiguity. Things like Internal seems to be as good as unknown for logical operations.
So I wonder if we can revisit choice for status code and try to come up with
I agree that having to define a somewhat arbitrary mapping to these status code seems to be redundant work at best. The states that I think are interesting for a span are better described with something like:
Especially the fine distinction between a semantically failed operation and an operation that ended with some error indicator is tricky. For example AFAIK we don't mark most non 2xx-HTTP status codes as failed operations in Dynatrace. E.g. a 404 could just mean an application is checking if some entity exists and 404 would be the equivalent of a boolean false returned. I think we even allow users to configure in the Dynatrace Web UI, if for some particular URL (or class of URLs), a particular HTTP status code should be considered a failed operation.
In particular, I don't know which problem the CannonicalStatusCode enumeration is trying to solve.
I guess the broad problem we want to solve with CannonicalStatusCode is grouping spans by the finite number of well-known error codes. Backend UX can do nice things on top of it, users can easily query
The problems are:
I propose the following approach
```
Status
// protocol or library specific status code
int statusCode = 0
string description = null
// optional OTel status code
// indicates if it's a success or failure
Result result = Result.None
enum Result
// not set or hard to tell whether it's success or failure
// exporters may think about it as success or get into the business of parsing statusCode
None,
Ok,
Failure,
// maybe we want to cover a handful of very common client errors
Cancelled
```
(naming needs a lot of improvements)
HTTP:
gRPC:
Internal spans:
I don't think we should have an int statusCode.
For example, for OS errors, a string with the error code constant like ECONNREFUSED would be a better fit (POSIX only specifies the names not the values of the error codes, see https://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/errno.h.html).
Actually, that ECONNREFUSED highlights another use case that is not well-supported by that statusCode field: If we had some generic convention for OS errors, you could do something like setting os.errcode = "ECONNREFUSED", os.errfunc = "connect" on an HTTP span and you would know that there was an OS-level error, not an HTTP-level error.
EDIT: Or maybe err.kind = "os", err.code = "ECONNREFUSED", err.os.func = "connect". Could be adapted to err.kind = "exception" err.code = "java.net.ConnectException", err.msg = "Connection to [::2]:14426 refused.". We need semantic conventions for errors and exceptions!
int statusCode is useful for a lot of things (HTTP, gRPC, Oracle error codes, MS HRESULT, you name it). Historically everyone checks range for HTTP code and with int it's much easier.
With "ECONNREFUSED" it seems it has int value but users are free to not use it and provide common string representation in the description.
My expectation was that exceptions will be covered with Logging API and using span status for it is an abuse.
So your example could be
statusCode = 111 (or 0), description = "ECONNREFUSED", Result = Failure
Plus, in the future, there will be log with error severity that provides exception information. And you could even write it today with log4j or another common logging API, just without OTel support
Typically strongly typed fields are needed when there are experiences for them in SDK or OTcollector. I think in case of status code we may have need for the success/failure bit and maybe for client vs. server indication. statusCode can probably be moved to attributes.
@Oberon00 for your proposal I'd suggest we have a separate "error" type that can be reported independently from the Span. And it may carry all this information. Would it be what you need? Or you think this information is needed on a Span itself?
client/server indication lives in SpanKind. essentially all SpanKind.Client have only client errors.
success/failure may be ok, but not enough (404 is a perfect example).
I agree statusCode can move to the attributes - this will make status ever leaner.
@lmolkova
Of course ECONNREFUSED has a concrete and more or less stable integer value per POSIX implementation but Linux, BSD, MacOS, Windows (WSAECONNREFUSED) might each have different ones. The value is meaningless, the meaning is carried by "ECONNREFUSED". In HTTP it's the other way round, 404 identifies the error and "Not found" is merely a description.
My expectation was that exceptions will be covered with Logging API and using span status for it is an abuse.
An exception that merely occured during an operation: Yes, just log it. An operation that was ended by throwing an exception should be handled differently though.
@lmolkova
Partially disagree with
Client have only client errors.
E.g. a client receiving HTTP 500 would arguably fail with a server error.
@SergeyKanzhelev
@Oberon00 for your proposal I'd suggest we have a separate "error" type that can be reported independently from the Span. And it may carry all this information. Would it be what you need? Or you think this information is needed on a Span itself?
I don't really understand how you would separate an error from a span? The error would describe the result of the operation described by the span, so it would naturally be a property of the span. But I think that errors could be handled through semantic conventions, beyond maybe the very basic success/failure boolean (or maybe 3 to max 4 clearly defined enum values).
Separation from the Span is in the sense that Span semantics only tells whether it's success or failure. And there is something else defines all the properties of this failure.
Many systems has a separate data types to represent errors and exceptions. And we can go with this. Or we can have Event on Span with the specific semantic of an error.
Of course ECONNREFUSED has a concrete and more or less stable integer value per POSIX implementation but Linux, BSD, MacOS, Windows (WSAECONNREFUSED) might each have different ones. The value is meaningless, the meaning is carried by "ECONNREFUSED". In HTTP it's the other way round, 404 identifies the error and "Not found" is merely a description.
Right, what is the problem with description carrying this ECONNREFUSED as a string?
E.g. a client receiving HTTP 500 would arguably fail with a server error.
You don't know which server or maybe it's your own circuit breaker. Arguably what you received is not necessary what server responded with.
An exception that merely occured during an operation: Yes, just log it. An operation that was ended by throwing an exception should be handled differently though.
I think there are different options here. Exceptions are huge, they may have additional information attached (local variables, or links to dumps!), so semantic conversion on span will never be enough to cover all exceptions scenarios.
Arguably exceptions are stored separately. I can easily imagine how you want to sample exceptions differently than spans or track them when you don't do any distributed tracing.
So I agree with @SergeyKanzhelev that exceptions may be represented separately from Spans and correlation ids allow backends to stitch together lean spans with rich exceptions.
Based on this discussion https://github.com/open-telemetry/opentelemetry-specification/issues/67, it looks like Event is a temp API to workaround missing logging. In future users should be able to use logs OR events and semanically it will be the same.
If we decide to follow this route (which I'm not a fan of) conventions on events would allow tracking exceptions. But still, status is not for them.
Right, what is the problem with description carrying this ECONNREFUSED as a string?
My problem with that is that an integer errno like 53 is meaningless unless you know it originated e.g. from a FreeBSD system. On the other hand, the error names are a POSIX standard. So I'd really prefer something like a Span attribute error.errno that is set to a string "ECONNREFUSED". I don't see any added value in the integer number.
An error description would be what strerror returns for ECONNREFUSED, like "Connection refused."
I think there are different options here. Exceptions are huge, they may have additional information attached (local variables, or links to dumps!), so semantic conversion on span will never be enough to cover all exceptions scenarios.
HTTP requests are also huge ๐ I think we only need these two properties to satisfy most use cases for exceptions:
Stack traces would be nice too, but there is often considerable overhead associated with them, both in gathering them (resolving symbols, ...) and of course sending them.
EDIT: I agree that any information beyond the two basic properties type & message can be stitched together on demand and may be delivered as log or something else.
Separation from the Span is in the sense that Span semantics only tells whether it's success or failure. And there is something else defines all the properties of this failure.
I would go a middle course here, as I stated above.
this was moved to the next milestone. Few meeting notes from https://docs.google.com/document/d/1-bCYkN-DWJq4jw1ybaDZYYmx-WAe6HnwfWbkm8d57v8/edit#:
from the spec sig mtg today, looks like this one could be for the v0.5 milestone but sounds like more input would be needed desirably from the .net folks @bruno-garcia @reyang
.NET OTel is going to be leveraging types from the BCL (i.e., the .NET standard library) that are general and can't have a property like status tied to a specific protocol.
It will be good to separate error reporting and span status conversations, but at first, just having a simple span status, e.g.: Ok|Failed simplifies things. The application/library should set the Failed status since the criteria for that can vary, i.e., a failed child span should not automatically trigger the failure of the parent. This approach keeps simple the determination of the span status for the users.
Error reporting/information is a much larger discussion about formats and conventions. The relation with span status is that once it is Failed, there should be a way to indicate which specific error, between a sequence of errors, finally triggered the span status to be Failed. Typically this should be the last error, but there may be cases in which an error is logged after the condition that triggered the failure of the span (e.g., error trying to communicate about the failure itself).
/cc @andrewhsu
I'm not sure I understand this right. Do you mean .NET has chosen to not implement the span status code API? EDIT: Seems it's there https://github.com/open-telemetry/opentelemetry-dotnet/blob/master/src/OpenTelemetry.Api/Trace/Status.cs
@Oberon00 currently the plan is to use Activity as a replacement for OTel Span and drop the old implementation - there is an ongoing discussion if there should be an OTel API wrapping it (at this moment I'm particularly in favor of having the wrapper). The BCL team modified Activity to better align with OTel but there is some differences Status is one of the most visible.
While this seems like a very sensible choice for .NET, I doubt you can call the result OpenTelemetry, as the API/SDK separation with ability for vendors to use their own SDK (incl Span) implementations is IMHO a central part of its promise. But I'm starting to go off-topic here.
While this seems like a very sensible choice for .NET, I doubt you can call the result OpenTelemetry, as the API/SDK separation with ability for vendors to use their own SDK (incl Span) implementations is IMHO a central part of its promise.
I agree with you in this regard - we are having discussions about this on the .NET SIG.
As you said let's focus on the question about Span status here.
I've added this to the 05/19/2020 OpenTelemetry SIG: specifications meeting agenda.
Please join the discussion if you're interested.
I believe that Status can be treated as a semantic convention, and that span attributes can be used to convey the Status property. Instead of a dedicated SetStatus() API, we can use SetAttribute. I would recommend the use of a dictionary-typed value for the status, e.g.,
span.SetAttribute("span.status", { "message": "blah blah", "code": "DEADLINE_EXCEEDED" })
On the other hand, I support the notion that the existing status codes are sufficiently and usefully generic. In the SIG call, one example was given: suppose the user has a "language exception". How should this map into a status code? Language exceptions can mean anything: some language exceptions will map to deadline exceeded, some to permission denied, some to not found, etc. It's fine to leave the default, which is "OK", which gives less information to the backend, but is "correct".
Let's look at examples. What are some common Windows system call errors that you think do not map well into the canonical codes?
(Also @tedsuo mentioned SpanKind during this conversation. It's in the same boat. Why isn't it a semantic convention?)
Next follow up meeting - Thu 5/21/2020 9:00AM-10:00AM PST
Or join by phone (audio only)
+1 425-616-0754,,320785411# United States, Seattle
Phone Conference ID: 320 785 411#
Find local phone number: https://dialin.teams.microsoft.com/8551f4c1-bea3-441a-8738-69aa517a91c5?id=320785411
Why do we want Status? How do we intend to use Status? Is GRPC code a good fit?
Status.Status to cover these scenarios.Should this be a 1st class member, or just a normal custom property plus semantic conventions?
Next steps:
Status - it seems the current preference is YES.Status should be a 1st class member of Span, or something else.Status should be a 1st class member of Span.Status as a custom property instead of a 1st class member field; Later .NET can make an additive change to promote Status to 1st class. Worst case we'll have chance to make the change in a year.Status shim/extension.Status(level, domain, attributes).level is NOT an indication of severity (e.g. VERBOSE vs. WARNING).@bogdandrutu @noahfalk @pjanotti @tarekgh please review/comment and help to keep me honest ๐
@reyang @noahfalk Any progress with the OTEP?
Apologies it took longer than I expected to get to it. I've got a draft written up and I'm waiting to get a little feedback from my coworkers to see if it passes some basic scrutiny ; ) If they like it I'll stick it in the oteps repo as the next step.
^ OTEP that addresses this issue: https://github.com/open-telemetry/oteps/pull/123 "Add support for more expressive span status API" by @noahfalk
Recording exception was addressed in open-telemetry/opentelemetry-specification#697.
As there is a lot of controversy and different opinions about Span.status, there is a proposal to remove this field altogether before GA: #706. It will be much easier to add it back when/if we agree on it, than to support the current status which so many people are uncomfortable with.
I propose to declare that this issue is superseded by #706 and should be closed. @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers If you agree, please close this issue.
@open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers If you agree, please close this issue.
Since there were only two ๐ (althoug no ๐), I would leave that to @open-telemetry/technical-committee but it is probably uncontroversial by now.
From the sig mtg today, discussed and sounds like this should be closed in favor of https://github.com/open-telemetry/opentelemetry-specification/issues/706
Most helpful comment
.NET OTel is going to be leveraging types from the BCL (i.e., the .NET standard library) that are general and can't have a property like status tied to a specific protocol.
It will be good to separate error reporting and span status conversations, but at first, just having a simple span status, e.g.:
Ok|Failedsimplifies things. The application/library should set theFailedstatus since the criteria for that can vary, i.e., a failed child span should not automatically trigger the failure of the parent. This approach keeps simple the determination of the span status for the users.Error reporting/information is a much larger discussion about formats and conventions. The relation with span status is that once it is
Failed, there should be a way to indicate which specific error, between a sequence of errors, finally triggered the span status to beFailed. Typically this should be the last error, but there may be cases in which an error is logged after the condition that triggered the failure of the span (e.g., error trying to communicate about the failure itself)./cc @andrewhsu