StackExchange.Redis: Distributed tracing story with Redis client

Created on 12 May 2018 · 16 Comments · Source: StackExchange/StackExchange.Redis

.NET supports distributed tracing with System.Diagnostics.Activity, which represents the ambient context for a trace, and System.Diagnostics.DiagnosticSource, an event subscription and notification API.

Here is the basic guidance for the library instrumentation.

Following this approach, a tracing system can subscribe to events from any instrumented library without taking a dependency on it and still receive basic information, e.g. operation 'StringGetAsync' took 5 ms and finished successfully with custom data 'cache status': 'miss'. The tracing system is able to correlate this operation with its parent.
Based on this, users can find the root cause of failures or long-running operations no matter how big their system is or how many services are involved in the processing.

Tracing system could build visual experiences like this one

Some principles behind the instrumentation:

In the presence of a compatible tracing system, the end user does not need to do anything to enable the instrumentation.
If no tracing system listens to the particular library, the instrumentation cost is close to 0 (nanoseconds). We allow the tracing system to implement sampling and reduce the number of instrumented operations. For the instrumented operations, we benchmarked the impact to be several microseconds, depending on the particular tracing system (listener) implementation.

Examples of frameworks/libs instrumented in this way are ASP.NET Core, HttpClient in .NET Core, SQL client in .NET Core, Azure ServiceBus and Azure EventHubs SDKs.

Consumers of these events are Microsoft tracing and debugging tools like ApplicationInsights, Visual Studio, Glimpse, and OSS projects like OpenTracing .NET.

Possible high-level implementation details that are described in our guidance:
We'd like to wrap the public APIs that make external calls to the Redis server (perhaps this could be done once in ConnectionMultiplexer.ExecuteAsync to cover all calls):

Before the call, check that someone is listening to Redis diagnostics events
Create a new Activity (it will get a parent automagically)
Add some important info, e.g. server address/name
Perform the operation
When done, fill in the status (ok, failure) and important info (e.g. hit or miss) on the Activity
Stop the Activity
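
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the StackExchange.Redis implementation: the listener name "StackExchange.Redis", the operation name "Redis.Command", the tag keys (borrowed from the OpenTracing conventions) and the wrapper `ExecuteWithTracingAsync` are all hypothetical. Only `DiagnosticListener`, `Activity` and their members come from System.Diagnostics.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical wrapper; names and tag keys are assumptions for illustration.
public static class RedisDiagnostics
{
    private static readonly DiagnosticListener Listener =
        new DiagnosticListener("StackExchange.Redis");

    public static async Task<T> ExecuteWithTracingAsync<T>(
        string command, string server, Func<Task<T>> operation)
    {
        // Fast path: nobody listens, so no Activity or payload is allocated.
        if (!Listener.IsEnabled() || !Listener.IsEnabled("Redis.Command"))
            return await operation().ConfigureAwait(false);

        var activity = new Activity("Redis.Command");
        activity.AddTag("db.type", "redis");
        activity.AddTag("db.statement", command);
        activity.AddTag("peer.address", server);

        // Starts the Activity (its parent is taken from Activity.Current)
        // and writes the "Redis.Command.Start" event.
        Listener.StartActivity(activity, new { Command = command, Server = server });
        try
        {
            T result = await operation().ConfigureAwait(false);
            activity.AddTag("status", "ok");
            return result;
        }
        catch (Exception)
        {
            activity.AddTag("status", "failure");
            throw;
        }
        finally
        {
            // Writes "Redis.Command.Stop" and restores the parent as Activity.Current.
            Listener.StopActivity(activity, new { Command = command });
        }
    }
}
```

A call site would look like `await RedisDiagnostics.ExecuteWithTracingAsync("GET", "localhost:6379", () => db.StringGetAsync(key))`, where `db` stands for whatever performs the real call. The anonymous payload objects are exactly the part debated further down in this thread.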

We'd like to hear your thoughts on tracing for StackExchange.Redis and get your feedback on this approach.

StackExchange.Redis is the top-most client library that ApplicationInsights users request for auto-collection.
Depending on your feedback, we (the Microsoft ApplicationInsights team) would like to contribute this feature and/or help you with the implementation.

lmolkova

👍8

Most helpful comment

Thanks for your responses. I disagree on some of your points but I don't want to argue again - we've done so before in quite a few issues and I've done so by mail with Vance.

I agree on your philosophy bullet points though. Adding a tracer should be as easy & "automagically" as possible. That's why we need to be able to write tracers that don't have to rely on special satellite libraries for everything they're instrumenting. (e.g. MyTracer.EntityFrameworkCore, MyTracer.SqlClient, MyTracer.Whatever). Having reasonable tags and Inject/Extract hooks is a must-have for that IMO. Having a configurable way to disable stuff for performance reasons would be great as well of course. Maybe we could add functionality to let subscribers indicate if they're interested in the tags or so...

Having a way to access the payload (and therefore taking a dependency on the library) is a great feature of Activity but it should always be optional. It should only be there to improve the default output if a tracer wants to do so (e.g. in case of a very large library like ASP.NET Core).

Trying to improve the documentation is great but I'd love to also see some ideas/proposals/PRs about actual API improvements - there hasn't been anything like that in over a year now (unless I've missed something).

cwe1ss on 17 May 2018

👍2

All 16 comments

Hi. I'm not opposed in principle, and I think it has been mentioned in passing a few times in the past. My reservations so far have mostly been:

Unfamiliarity with the API meaning I'm not sure how to implement it most appropriately and performantly
Concern about potential performance impact even in the "that tool isn't active" case

If you have time to help us work through these things in a way that helps users without adversely impacting things, that gets a :thumbsup: from me.

mgravell on 12 May 2018

I've dealt with this quite a bit as also being the primary maintainer of MiniProfiler. Unfortunately the experience is very varied depending on the implementation, even within Microsoft libraries. For example: when a consumer wants to listen to events, they need to know how to analyze the payload. This means they either need to use the diagnostic adapter (generating IL - limited platform support), or have strongly typed objects.

Unfortunately, I've pleaded for this to be implemented in things like ASP.NET Core and have gotten nowhere, because even Application Insights isn't using but 1 of the events (unless this changed recently). Entity Framework was a bit more fruitful as they made a major investment in events, exposed strongly typed objects, event names, etc. This allows a proper OO adapter with compile-time checking rather than dynamic parameter name matching on anonymous objects (requiring the adapter, as is still the case with ASP.NET Core). I doubt this will be a priority since I raised it well before 2.0 and a breaking change to fix all the events wasn't done then either (though EF did so).

Let's say we had strongly typed objects: that's good, but it's still an intermediate object allocation (even if it's a string) per-call which is quite a bit at scale. They cannot reasonably be structs due to how much copying (and no ref control) in the pipe, so we can't avoid it. The dependency isn't terrible, but it is limited in consumers because of the above, and the extra allocation is something we're hoping to avoid. Honestly the dream for us is to have Assembly Neutral Interfaces to decouple all of this and eliminate the intermediate transport object allocation...but that's looking pretty grim.

At the moment, I'd be remiss to recommend DiagnosticSource as the platform to implement because even getting the platform teams on it has seen very limited progress and I wouldn't recommend it as a route for the profiler we maintain either because of that overhead.

I'd be curious what the team's thoughts on the above experiences are. It seems to me that a fundamental conflict exists between wanting to be everywhere, not tying to anything, and providing a structure that anyone can follow or consume...it seems the last one gets dropped in favor of the former 2 in my trials through it so far. Seeing how the Microsoft side sees this changing would help a lot here.

NickCraver on 12 May 2018

Interesting, since I updated my fork yesterday for this exact purpose :)

This gets a big +1 from me since it would be extremely useful due to the zero configuration required and ease of sampling as well to reduce the profiling impact.

Its integration with Activities makes it super useful as well for cross-application tracing.

stebet on 13 May 2018

@NickCraver

A lot of work is going into standardizing the usage, to reduce the need to know the different object layouts for basic profiling/tracing purposes. See the guidelines @lmolkova linked for more info.

Here is an example of the implementation used in HttpClient: https://github.com/dotnet/corefx/blob/master/src/System.Net.Http/src/System/Net/Http/DiagnosticsHandler.cs

@SergeyKanzhelev can also give some more insights on the standardization work as it's a big multi-company effort.

stebet on 13 May 2018

I've looked at the examples there, but I don't see anything has changed - if anything the lessons haven't been learned in new implementations. For example, from Http:
```c#
s_diagnosticListener.StartActivity(activity, new { Request = request });
```

```c#
s_diagnosticListener.Write(DiagnosticsHandlerLoggingStrings.ExceptionEventName, new { Exception = ex, Request = request });
```

Note that the payloads are still anonymous objects. I cannot program against them. Am I missing something here? I've raised all this with the diagnostics team before. Let's say I'm in MiniProfiler, and I want to program and actually use the payloads. How do I do it? My understanding is you have to use the diagnostic adapter to do dynamic binding (and match property names up, only found by viewing source code), and all of this will only fail at runtime. At least, that's been the only way ever provided to me, but information is very sparse on consumers of the diagnostics API. All of the docs are around publishing.
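
To make the complaint concrete, here is a minimal sketch of what a consumer has to do without DiagnosticAdapter: fetch a property from the anonymous payload by name via reflection. `PayloadReader` is a hypothetical helper, not part of any library, and a production listener would presumably cache the `PropertyInfo` rather than look it up on every event.

```csharp
using System;
using System.Reflection;

// Hypothetical helper: reads a property from an anonymous-type payload by name.
public static class PayloadReader
{
    // Returns the property value, or null when the payload is null, the
    // property is missing, or the value has an unexpected type.
    public static T Fetch<T>(object payload, string propertyName) where T : class
    {
        if (payload == null) return null;

        // The name is a magic string: nothing checks it at compile time,
        // and the lookup is case-sensitive ("Request" vs "request").
        PropertyInfo property = payload.GetType().GetProperty(propertyName);
        return property?.GetValue(payload) as T;
    }
}
```

With the HttpClient event above, a listener would call something like `PayloadReader.Fetch<HttpRequestMessage>(evt.Value, "Request")`. If the library renames the property or changes its casing, the call quietly returns null instead of failing the build.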

Now, if I want to use the payloads, I have to reference the library anyway to get those types. So why not strongly type them? Why not get all the advantages of OO while we're in there? I simply cannot understand what's happening here, so I'd like to understand. I'd be ecstatic to have my mind changed on this but as it stands: I wouldn't implement this because of a) performance overhead and b) I, being a profiler author, wouldn't even use it. It's a hard sell when that's the case. And to make these anonymous types concrete types is a breaking change (for most libs, like ASP.NET MVC which has camel-cased properties unlike other framework APIs vs. this Http which is title case). The teams seem to be unwilling to make that breaking change, so what we have is a mess.

Besides all that, the anonymous types require DiagnosticAdapter, which doesn't yet have a stable netstandard2.0 version. I complained about this very early in 2.0 planning. This is a hard blocker on consumption. Even if it weren't, we get no compile-time checking whatsoever.

Please, show me the light here. I really am wanting to understand the picture here and where it's headed. I'm not trying to be difficult, just completely honest on experiences so far. If you can show me what we're not seeing, I'd happily implement more of it in MiniProfiler. But, if my understanding is correct, then I simply disagree with going down this road and the shape of these APIs - and implementing this one now means being (likely) stuck with it in place of something much more appealing later.

So - what am I missing? Are there consumer examples out there that show a better way of handling these? Do those examples require the DiagnosticAdapter?

NickCraver on 13 May 2018

TLDR: No, the DiagnosticAdapter is not required unless you really need to inspect the payload, which for a lot of cases is unnecessary, since the Activity.Current.Tags can provide most of the data needed for profiling à la MiniProfiler (simple type, duration, success tracing) through "standardized" tags.

Long version:
Most of the normal profiling/tracing à la MiniProfiler can be done by just subscribing to the diagnostics events and looking at the Activity.Current.Tags at the time of the ".Stop" event, without having to reflect on the dynamic payload, unless some in-depth data is needed, which probably means you have internal knowledge of the library anyway. See https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table for examples of "standardized" tags.
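
A payload-agnostic listener of that kind might look like the sketch below. It ignores the event payload entirely and reads only `Activity.Current` when a ".Stop" event arrives. The listener name "StackExchange.Redis" is an assumption (no such instrumentation existed when this thread was written), and `ActivityTagObserver` is a hypothetical class.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical listener that never touches the payload: it only reads
// Activity.Current and its tags at the ".Stop" event.
public sealed class ActivityTagObserver :
    IObserver<DiagnosticListener>, IObserver<KeyValuePair<string, object>>
{
    public List<string> Completed { get; } = new List<string>();

    // Called for every DiagnosticListener that exists or is created later.
    public void OnNext(DiagnosticListener listener)
    {
        // Assumed listener name, for illustration only.
        if (listener.Name == "StackExchange.Redis")
            listener.Subscribe(this);
    }

    public void OnNext(KeyValuePair<string, object> evt)
    {
        var activity = Activity.Current;
        if (activity == null || !evt.Key.EndsWith(".Stop", StringComparison.Ordinal))
            return;

        var tags = new List<string>();
        foreach (var tag in activity.Tags)
            tags.Add(tag.Key + "=" + tag.Value);

        Completed.Add(activity.OperationName + " took " +
            activity.Duration.TotalMilliseconds + " ms [" +
            string.Join(", ", tags) + "]");
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}
```

It is wired up once per process with `DiagnosticListener.AllListeners.Subscribe(new ActivityTagObserver())`. Note that the string keys still have to be agreed on out of band, which is the objection raised in the next comment.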

I do agree that it's unfortunate to have these objects dynamic, and hopefully @lmolkova and @SergeyKanzhelev can shed more light on that and on future plans. A strongly typed object would be better, since you'd need to reference the library anyway to use the payload. That way the library could even provide helpers for those in-depth profiling purposes.

So as I see it, you have two separate things here. The Activity should contain enough information in its tags to give you basic profiling/tracing functionality (tags include the type, such as "Redis", "SQL", "RabbitMQ", the target host, etc.; again, see the OpenTracing doc above for details).

Then you have the dynamic payload, which is more for injection/inspection purposes than just logging/dependency tracing/profiling, but which requires you to know the library and have a reference to it if you want to use it, like injecting headers in the HttpRequestMessage or reading response headers from the HttpResponseMessage. That could also be documented properly for those advanced scenarios, and the payload could even be a well-defined known object which could be exclusively cast to via library-provided helpers.

Since you were asking for a consumer example: here is an example of a general consumer of dependency events using just the Activity and its tags: https://github.com/Microsoft/ApplicationInsights-dotnet-server/blob/develop/Src/DependencyCollector/Shared/TelemetryDiagnosticSourceListener.cs

So from my standpoint, the Activity creation with the standardized tags, would be the biggest win i.m.o, the payload is in my mind a nice-to-have, but could also be provided and explained in documentation.

B.t.w, the DiagnosticAdapter has a netstandard2.0 version with the Core 2.1 release, as you can see on the NuGet page.

Edit:
Updated text for clarity, since I typed the earlier comment on my mobile. Also, need to learn to read previous comments better.

stebet on 13 May 2018

Since you were asking for a consumer example

I'm looking for an example of the payload consumption, since that's where all the problems come in at. I've yet to see that.

B.t.w, the DiagnosticAdapter has a netstandard2.0 version with the Core 2.1 release, as you can see on the NuGet page.

Yep, I'm aware, but that's not stable yet. And it has all the runtime exception or failure problems as it ever has.

The theory of Activity is great: "here's the tags everyone needs". But there's a basic flawed assumption in that it contains all the interesting or relevant data one would need. If that were true, we wouldn't have payloads.

Let's take for example MiniProfiler, and with Http (again just as a single example). I want the request because I want to log:

What the method was
What the URL was
Maybe specific items from the headers
Route data provided (e.g. even Controller/Action)
Whether the hit was HTTPS or not (this cannot be determined from the URL alone)

That's just off the top of my head, but the point is: I need the payload.

Activity also just doesn't help much here. It doesn't help the pitfalls of anonymously typed payloads, it just copies the problem into string keys. I have to know the keys to fetch, what the types are, and I have to find all that out by looking at the source code for the other library. I get it, we want to have a weakly-linked no-dependency profiling API. That's great. But if I'm willing to take a dependency, I should get all the benefits of it. Right now we get zero compile time checking for anything with anything pitched here, including activity. All consumers may as well be writing in a dynamic language.

I applaud the effort here, I really do, but from a consumer standpoint, I still find the whole story severely lacking and, barring any new information, am very against implementing it.

Now, let's say we use strongly typed objects for payloads (the only way we'd implement this) - this resolves the problem for StackExchange.Redis, and I could make a MiniProfiler.Providers.* package that references and profiles this way. That'd be acceptable if the performance overhead wasn't too bad (there are allocations here, there's no way around it). There is performance overhead - that's not in question. The question is how bad it is, but that must be analyzed over time with GC stall impact to the application for significant workloads (as we have on Stack Overflow). It's not a simple doesn't-impact-timing-much answer that many people have replied to me with here. If we did all this, we'd still be advancing a very broken ecosystem here where *most* of Microsoft's libraries do not play nicely (e.g. anonymously-typed payloads). I have a lot of disagreements with doing *that*, even if we did it properly - or as properly as the Diagnostics APIs allow us to.

If Microsoft wanted to get all the libraries in line, that can't happen until a major version and breaking changes. Since it doesn't seem to be a company effort to do this (at least it wasn't in 2.0), I admittedly have little faith in it ever happening. I have pushed for this personally - I have a dozen issues or so around it in corefx and aspnet, but I think I've about reached the end of my abilities to effect any change. This has to be an internal push to make the ecosystem work. Microsoft libs may be the first and a small pool in many, but they are the initial examples and this is critical.

I'd love @lmolkova to correct me if I'm wrong, but at the moment I'd suspect: there is no guidance around payloads. These types should be concrete and exposed. If I reference the lib I should be able to see and use the type. I should get compile time checking and all the other benefits that come with a static language. I should see when there's a break across a major version due to something changing (right now this silently fails with DiagnosticAdapter).

Show me what's in store - help change my mind here. The above isn't intended to be a rant, I want to honestly convey what's wrong with this area. I'd love to see this API flourish and us finally have a consistent profiling/timing API available in the ecosystem. This is the best progress so far, but consistent and consumable, it is not. It only works for very basic cases and not even our own.

NickCraver on 13 May 2018

👍2

Thank you all for the feedback! It's great to see you are familiar with the API.

Tracing vs profiling

First, let's agree that profiling and tracing are different beasts. Tracing focuses on high-level, cross-process calls, while profiling requires deeper in-proc details.
DiagnosticSource/Activity, similarly to OpenTracing, OpenCensus, Zipkin and other distributed tracing approaches, is mostly interested in tracing client and server external calls. For this sole purpose, payloads are not actually needed.
Activity serves as a single source of the production-level information the majority of users need: name, Id, timestamp and duration, plus a custom property bag, all essential for tracing. Looking at different tracing systems like Amazon X-Ray and Google Stackdriver, or Zipkin/OpenTracing in the OSS world, they all work with this level of detail.
This also answers the question of why ApplicationInsights uses just a couple of events from AspNetCore: that's ~~almost~~ all users need, i.e. high-level information.
This is the direction we (together with tracing communities outside Microsoft) are going in.

@NickCraver you are right that a contract is needed for Activity.Tags, and there are several attempts at this (more or less all of them are aligned with OpenTracing). Based on these, Google and Zipkin build pretty useful diagnostics and UIs that are used by companies like Uber, Twitter, and many others.

I assume profiler-kind of tools may require another approach, but tracing is required to be enabled in production (maybe with sampling) and should be efficient and not too verbose.

Payloads

Only advanced users/scenarios need event payloads for distributed tracing - tracing systems likely would not read them.

@NickCraver you are right, there is no actual guidance around payloads. The single requirement is backward compatibility: no breaking changes that cause an app to lose functionality. A listener should expect all kinds of things that may happen to the payload in the future: unexpected types, missing properties, etc. A listener MUST NEVER EVER throw and should deal gracefully with missing details or the inability to obtain some property.
While you don't like the lack of policy, it helps us support many scenarios and decouple lib and listener completely.

DiagnosticSource and payloads historically became available before Activity was created and immediately it was clear that some context is needed that would correlate start and end events and would allow correlating things in the scope of this operation.

The main issue here is not a need to parse the payloads. The problem is that listener cannot take a direct compile-time dependency on all possible libraries it listens to: i.e. ApplicationInsights cannot depend on Redis (and MongoDb, and Azure Storage SDK and everything else it wants to listen to). Both of them, however, can take a dependency on things that are already in the .NET SDK, like Activity/DiagnosticSource.
The necessity to install additional libs that would enable tracing for a particular component does not work well with the automagical collection all tracing tools are moving toward.

None of the popular OSS solutions provide the level of detail that DiagnosticSource does with raw request/response objects; they only provide a Span (Activity in .NET), i.e. shared context. So while payloads seemed to be useful with HTTP, we, as a tracing system, do not actually want to be in the business of parsing them anyway.

Having strongly-typed payloads is totally fine, however, it may increase a public API surface of the lib and it becomes harder to have backward compatibility if the lib ever changes payload type. This is the reason why HttpClient has anonymous payload types.

We do not use DiagnosticAdapter, mostly because it does not play well with the advanced sampling scenarios DiagnosticSource provides (https://github.com/aspnet/EventNotification/issues/67). So we parse payloads efficiently ourselves: here are example1 - DiagnosticSource to EventSource bridge and example2 - HttpClient events consumer.
I agree with you that we should do a better job of providing efficient ways to parse payloads and supporting that with DiagnosticSource changes.

Instrumentation and all-Microsoft libs support

DiagnosticSource/Activity is not a replacement for ETW/EventSource. The main focus is instrumenting external outgoing and incoming calls. We covered ASP.NET Core and anything hosted in IIS for the server part. Our focus is Azure SDKs, many of which are covered by HTTP tracing, and we are working with Azure to develop a coherent story, considering mostly other platforms and languages. There are many things to consider there (e.g. we are also driving this: https://github.com/w3c/distributed-tracing), so we are not moving as fast as we'd want to and have not covered all interesting components yet.

Performance

Let's assume only high-level operations are instrumented, so there is one Activity/event/payload allocation per operation (the single event that is required is the Stop event). Note that all allocations are done only when there is a listener for the Redis instrumentation. The tracing system may do sampling and tell beforehand (without even a new Activity being instantiated) whether this operation is interesting or not (likely based on the parent Activity Id). As a part of the instrumentation we can do benchmarking to demonstrate the performance impact (which mostly depends on the tracing system's efficiency), but we are talking about a few microseconds per operation here.
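
The sampling handshake described here can be sketched with the `Subscribe` overload that takes an `isEnabled` predicate. Everything named "Demo" below is hypothetical, and the every-Nth counter is just a deterministic stand-in: a real tracing system would more likely decide from the parent Activity Id, as mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Counts ".Stop" events, i.e. operations that were actually instrumented.
public sealed class StopCounter : IObserver<KeyValuePair<string, object>>
{
    public int Stops { get; private set; }

    public void OnNext(KeyValuePair<string, object> evt)
    {
        if (evt.Key.EndsWith(".Stop", StringComparison.Ordinal)) Stops++;
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

public static class SamplingDemo
{
    private const string Command = "GET";

    // Runs `operations` fake calls while the listener samples every
    // `rate`-th one (rate must be >= 1); returns how many were instrumented.
    public static int Run(int operations, int rate)
    {
        using (var source = new DiagnosticListener("Sampling.Demo"))
        {
            var counter = new StopCounter();
            int seen = 0;

            // Listener side: the predicate decides per operation.
            Func<string, object, object, bool> sampler =
                (name, arg1, arg2) => name == "Demo.Command" && (seen++ % rate) == 0;

            using (source.Subscribe(counter, sampler))
            {
                for (int i = 0; i < operations; i++)
                {
                    // Library side: ask first, allocate the Activity only if sampled in.
                    if (source.IsEnabled("Demo.Command", Command))
                    {
                        var activity = new Activity("Demo.Command");
                        source.StartActivity(activity, null);
                        source.StopActivity(activity, null);
                    }
                }
            }
            return counter.Stops;
        }
    }
}
```

For the operations that are sampled out, the library pays for one predicate call and nothing else. The per-operation allocations discussed in the replies apply only to the sampled-in ones.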

As there is no real guidance on payloads, it's really up to the library to provide payload as tracing system by default won't likely look into them anyway. We found many cases when users want to look into Http requests deeply or modify them, but it may not be the case for Redis or RabbitMQ.

Please let me know if I missed something.

lmolkova on 15 May 2018

@lmolkova Thanks for the response! Thoughts inline:

First, let's agree that profiling and tracing are different beasts.

I don't agree. I don't agree because the API doesn't agree. If you read the posts here overall they read as "payloads don't matter". Then why are they in the API? This API is meant to support both. We're talking about implementing one API, regardless of what it's for. For tracing it's too heavy with allocations and the sheer weight of payloads (and serializing them) at all. For profiling, it effectively lacks payloads when the provision is anonymous types. The API and reasoning can't make up its mind. The proposal here is for tracing, so payloads don't matter. Our primary use case involves the payload. The API could do both, but fails to be usable on the latter half.

While you don't like the lack of policy, it helps us support many scenarios and decouple lib and listener completely.

But it doesn't - that's my gripe. I don't see how it makes any sense at all. To use payloads, it just means devs have to go look at the source and manually match up parameter names and types and almost always take a dependency anyway (the exception can be some abstractions, but that's still a dependency). Having concrete types wouldn't harm anyone here. If you're going to break people with a change then you're going to break people with a change. The only reality here is we're doing it at runtime as a surprise instead. It's effectively an undocumented, unsupported API that can be broken at any time because "it's not a real type". That's a horrendous smell and very painful, a pain which I gave up on. If concrete types were provided people could do what they are doing today or they could take a reference and do much more.

The problem is that listener cannot take a direct compile-time dependency on all possible libraries

They don't have to. What I'm saying with concrete types is they can if they want to use the payloads. The Azure tracing bits would not have to.

Having strongly-typed payloads is totally fine, however, it may increase a public API surface of the lib and it becomes harder to have backward compatibility if the lib ever changes payload type. This is the reason why HttpClient has anonymous payload types.

This is the argument I kept getting from the Microsoft side (not directed at you - I've heard this many times now) and it's unfounded. There is a break either way. Whether the type is anonymous or concrete. If an argument name changes a consumer may throw a null ref. If a type changes they'll get a casting error, etc. The only thing anonymous types do here is a) let the teams break it at will, because it's not really an API or documented anywhere (show me one DiagnosticSource that has fields or types documented), and b) fail to document anything (which has been the case thus far). Frankly, I don't buy it. The lack of concrete & exposed types here is a lack of investment in the API from the teams involved. If they aren't going to provide a stable API, I can't consume a stable API. It's as simple as that.

The Entity Framework team saw this for what it is and exposed the types, many of them. They have one of the largest surface areas for this and it wasn't a problem there. And as soon as they implemented concrete types, I happily built a strongly-typed listener in MiniProfiler. It works great. And the build server will tell me if anything breaks. But it won't, because it's an actual API contract now. That's something I (as a consumer) can depend on.

but we are talking about a few microseconds per operation here

I'm concerned about the allocation, not the time (to be fair you address this somewhat). We're talking about a payload object, whatever is in that object, references held by the object, the dictionary for activity, everything held onto by that dictionary, a larger GC crawl tree from all of the above, and at 250,000 requests/sec/thread in this library for ops - that can add up fast. And note: we'll have to keep additional things around for the life of an operation, beginning to end. We'd need to see how much that is, but compared to current, we'd be keeping tracing information per task.

I really do appreciate the responses here, but at the same time I see nothing to change my mind. @mgravell may have different feelings here, but with the current approach, I have no interest in adding the weight of this profiling to StackExchange.Redis unless Microsoft wants to make a real investment here. If we're serious about it, show me the concrete types. For what it's worth, I posted and tried to progress this over a year ago, and the reason I'm very against it now is I don't see any signs of API progress since then. I want to love DiagnosticSource. I want to recommend it. I want us to have a ubiquitous API for tracing and profiling. But I'm still waiting on all of those.

NickCraver on 15 May 2018

Our primary use case involves the payload.

Are you talking about MiniProfiler? How do you trace Redis these days?

To use payloads, it just means devs have to go look at the source and manually match up parameter names and types and almost always take a dependency anyway (the exception can be some abstractions, but that's still a dependency).

I'd argue about 'almost anyway', as I mentioned there are multiple tracing systems and instrumentations out there and just a few of them (DiagnosticSource, direct profiler APIs and perhaps ILogger scopes) deal with raw objects. EventSource did not have payloads until recently.

There is a break either way. Whether the type is anonymous or concrete.

I totally agree with you. At the same time, we cannot change the fact that some components have chosen anonymous types and do not want to increase their API surface. They have their own reasons for it. There is nothing wrong with using concrete types for payloads.
As you may have noticed, we changed the payloads/events in the HttpClient and Asp.NET Core and we had to make this change backward-compatible anyway, even with anonymous payloads. So there is actually no difference in policy for concrete and anonymous types in terms of support.

To be fair, today's guidance is to use explicit types.

CONSIDER creating an explicit type for the payload. The main value for doing this is that the receiver can cast the received object to that type and immediately fetch fields (with anonymous types reflection must be used to fetch fields). This is both easier to program and more efficient.
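
As a rough sketch of what that guidance could mean for Redis: the library exposes a small public payload class and the listener pattern-matches on it instead of reflecting. `RedisCommandStopPayload` and `TypedRedisObserver` are hypothetical names. StackExchange.Redis defines no such type.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical payload type that a library could expose publicly.
public sealed class RedisCommandStopPayload
{
    public RedisCommandStopPayload(string command, bool cacheHit)
    {
        Command = command;
        CacheHit = cacheHit;
    }

    public string Command { get; }
    public bool CacheHit { get; }
}

// A listener that references the library can cast instead of reflecting.
public sealed class TypedRedisObserver : IObserver<KeyValuePair<string, object>>
{
    public int Hits { get; private set; }
    public int Misses { get; private set; }

    public void OnNext(KeyValuePair<string, object> evt)
    {
        // Unknown or anonymous payloads are simply skipped; a renamed
        // property would break the build rather than fail at runtime.
        if (evt.Value is RedisCommandStopPayload payload)
        {
            if (payload.CacheHit) Hits++; else Misses++;
        }
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}
```

A listener that does not reference the library still receives the same events and can fall back to Activity tags, so the explicit type is opt-in for consumers.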

It works great. And the build server will tell me if anything breaks. But it won't, because it's an actual API contract now. That's something I (as a consumer) can depend on.

Well, the problems start where consumer (tracing system) is injected automagically at runtime - which is also the reality, i.e. user app compiles perfectly but fails at runtime.

Overall, I hear your concerns.

I'd like to see what we can do in two different directions:

Do you have a picture of an API that would satisfy both tracing (essential info) and profiling (raw payloads) scenarios while being performant enough for both cases?

It's unlikely we will be able to, or have enough motivation to, change existing providers like Http to send concrete types in the payload. It serves the purpose quite well for now.
However, we may do the Redis instrumentation in the way that suits your needs.

I have no interest in adding the weight of this profiling to StackExchange.Redis unless Microsoft wants to make a real investment here.

I'd like to understand more what kind of investment you are talking about. Changing the tracing API? Changing guidance?
We can come up with a prototype, measure all kinds of performance impact and adjust it according to your feedback. We can see how big this weight is and whether DiagnosticSource could serve this purpose as both a tracing and a profiling API. I'm not convinced it cannot. Otherwise, we would have a strong real-life case based on which we can improve the .NET tracing API.

Thoughts?

lmolkova on 15 May 2018

Adding my 2 cents (yet again):

I also still think that Microsoft has not taken and does not take this topic seriously enough. Unfortunately, very few people even knew about these additions (it's impossible to github-watch the corefx repo) so the types have been implemented pretty much without any community feedback. Nick, myself and others have raised concerns since these types have been introduced but there has been almost zero (public) effort on Microsoft's side to improve things.
And we all know that this will likely never happen because these types are now part of the .NET base class libraries (as deeply integrated as it gets) and therefore can never be changed. This is very frustrating and makes it really hard for any non-Microsoft person to invest time in proposing changes.

I've said these things many times, so I'm repeating myself, but here's what's wrong with DiagnosticSource/Activity IMO:

Instrumentation code is EXTREMELY inconsistent. There's:
- no consistent naming of events,
- no consistent naming of payload fields (camelCase vs PascalCase),
- no consistent usage of payload types (anonymous vs. explicit types),
- no consistent usage of tags (HttpClient, ASP.NET Core do not set tags).
Instrumentation code is quite complex and has to do weird dances to work "against" the Activity API as it e.g. sometimes has to call activity.Start() and sometimes diagnosticListener.StartActivity() - see e.g. here
Having only an IObservable<KeyValuePair<string, object>> contract is too weak for subscription code. Subscribers have to come up with weird & complex code as well - e.g. to detect if they're actually within an Activity event. I'd argue that that's not even possible in a 100% safe way - especially with generic subscribers.
No one knows if he/she has to instrument with DiagnosticSource and/or EventSource and/or ILogger (and/or a 3rd party API like LibLog). It seems like every library does this differently and there's lots of legacy combinations. ASP.NET Core currently has 283 lines of instrumentation code (not even counting HostingLoggerExtensions.cs) for instrumenting an incoming request.
As libraries use multiple instrumentation APIs, the subscribers have to fiddle around with weird exclusion filters etc to prevent duplicate entries. E.g. if a tracer listens to DiagnosticSource & ILogger events, he has to make sure that he doesn't get the ILogger-events for RequestStart/RequestEnd.
Propagation of headers has been put into the responsibility of instrumentation code and is currently hard-coded. And it's - surprise - inconsistent as well. HttpClient & ASP.NET Core use a Request-Id header, the Azure Event Hubs library uses a Diagnostic-Id header. And - another surprise - all of them will be legacy soon as the industry will standardize on different headers (People from Application Insights are part of that w3c/distributed-tracing standardization group)
- This propagation code should have never been hard-coded and I really can't understand why this was done. Once the new standard is out, .NET can decide if it's going to become/stay proprietary (by only supporting the existing headers) or if it's going to add the new headers - In any case, .NET Core would then happily send the old headers around the world (see "lack of configurability" later) even though pretty much no one needs them. So we'll probably end up with some flags somewhere and even more complex instrumentation code.
There's no way for an application developer to configure anything in this system. Once there's an active subscriber, the whole system just sends & accepts baggage and IDs to/from any external system.
- This happens even if the subscriber doesn't need to correlate requests - e.g. in case of a simple monitoring subscriber like prometheus that "only" wants to track requests.
This lack of configurability is also a privacy & security concern, given that baggage might contain User-ids and therefore maybe even email addresses.
There is no proper documentation. Having a Markdown file somewhere deep in a github repository just isn't enough. There MUST be proper documentation on docs.microsoft.com and blog posts so that library authors can find it.
(And everything else that Nick has said already on this thread)
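
To make the weak-contract point concrete, here is a minimal sketch of what a subscriber has to do when a payload arrives as an anonymous type. The listener name `Example.Library`, the event name `Example.Command` and the `Command` field are hypothetical and chosen only for illustration; they are not taken from any real library.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical subscriber; listener name, event name and payload shape are illustrative only.
public sealed class WeakPayloadObserver : IObserver<KeyValuePair<string, object>>
{
    public List<string> Seen { get; } = new List<string>();

    public void OnNext(KeyValuePair<string, object> evt)
    {
        // The payload is just `object`. An anonymous type can only be read via reflection,
        // and nothing at compile time guarantees the property exists or keeps its casing.
        object command = evt.Value?.GetType().GetProperty("Command")?.GetValue(evt.Value);
        string text = command?.ToString() ?? "<missing>";
        Seen.Add(evt.Key + ": " + text);
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

public static class WeakPayloadDemo
{
    public static List<string> Run()
    {
        var observer = new WeakPayloadObserver();
        using (var listener = new DiagnosticListener("Example.Library"))
        using (listener.Subscribe(observer))
        {
            listener.Write("Example.Command", new { Command = "GET" });
            // A different casing compiles fine on the producer side
            // and silently breaks the reflection lookup above.
            listener.Write("Example.Command", new { command = "GET" });
        }
        return observer.Seen;
    }
}
```

In this sketch the second event reaches the subscriber, but the field lookup quietly returns nothing, which is the kind of runtime-only breakage described in the list above.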

cwe1ss on 15 May 2018

@cwe1ss I think you listed our backlog in your comment =). We have a long way to go. As @lmolkova said, there are many moving parts along the way, but we are getting there. Some of the work is visible; some is not visible outside the company. We are also very concerned about community opinion. For instance, we recently held a workshop with a number of APM vendors to discuss standardizing the API shape for distributed tracing. We listen on GitHub and other channels and adapt, but we operate within time and background limitations.

That said, we believe that DiagnosticSource is the answer for instrumentation for distributed tracing scenarios on .NET. We are working with teams to adopt it and collecting feedback on what should be improved in its APIs.

This is very valid feedback that DiagnosticSource may not be at the level where you are comfortable picking it up. One option for an experiment would be to expose DiagnosticSource via some strongly typed tracing proxy. Doing so would allow customers using this Redis client to see their calls to Redis as dependencies in Application Insights. Other companies like SkyWalking (they are already using DiagnosticSource) could pick them up as well with minimal work on their side. We believe there will soon be more partners who pick up distributed tracing calls via DiagnosticSource instrumentation.
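
One way such a strongly typed proxy could look is sketched below. This is only an assumption about the shape of the idea: `CommandCompletedPayload`, `TypedTracingProxy`, `TypedObserver` and the event name `Example.CommandCompleted` are all hypothetical and do not exist in StackExchange.Redis or .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical payload contract; no such type exists in StackExchange.Redis or .NET.
public sealed class CommandCompletedPayload
{
    public CommandCompletedPayload(string command, TimeSpan elapsed, bool succeeded)
    {
        Command = command;
        Elapsed = elapsed;
        Succeeded = succeeded;
    }

    public string Command { get; }
    public TimeSpan Elapsed { get; }
    public bool Succeeded { get; }
}

// Hypothetical proxy: callers get typed methods, DiagnosticSource stays the transport.
public sealed class TypedTracingProxy
{
    public const string CommandCompleted = "Example.CommandCompleted";
    private readonly DiagnosticSource _source;

    public TypedTracingProxy(DiagnosticSource source) { _source = source; }

    public void OnCommandCompleted(string command, TimeSpan elapsed, bool succeeded)
    {
        // Skip the payload allocation entirely when nobody listens for this event.
        if (_source.IsEnabled(CommandCompleted))
        {
            _source.Write(CommandCompleted, new CommandCompletedPayload(command, elapsed, succeeded));
        }
    }
}

public sealed class TypedObserver : IObserver<KeyValuePair<string, object>>
{
    public List<CommandCompletedPayload> Received { get; } = new List<CommandCompletedPayload>();

    public void OnNext(KeyValuePair<string, object> evt)
    {
        // A type check replaces reflection; a renamed property now fails at compile time.
        if (evt.Value is CommandCompletedPayload payload) Received.Add(payload);
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}
```

The trade-off in this sketch is that a subscriber needs a reference to the assembly that defines the payload type, whereas a generic subscriber can still treat the value as a plain object.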

SergeyKanzhelev on 15 May 2018

@cwe1ss

Our .NET solution is still evolving (perhaps the pace is too slow from your perspective, I can relate to that). Our work on W3C HTTP spec is another step forward in this area.
Should we have waited until all the details around the HTTP standard were clarified and only then delivered a working solution for our users? From the business perspective, we decided to move forward.

Our direction moving forward is definitely support for the W3C HTTP distributed tracing standard, and we'll change the Activity to support it. Will the standard be hardcoded? Likely yes, because it is a standard. Extensibility could be provided in addition to it. As you know, .NET/ASP.NET cannot really afford breaking changes; that's why they are so conservative around new APIs (and extensibility points), so we provide them after we know they are needed. And sometimes it's not really that simple: e.g. configuring header names would not work, as there are different Id format requirements out there.

Configuration is totally possible; however, we rely on the tracing system rather than the user app to control it. The tracing system decides which events to receive and injects/extracts its own headers; it may allow users to configure this.

I totally agree about the lack of documentation. I believe many of your concerns would be clarified if we had done a better job on it.

E.g.

Instrumentation code is quite complex and has to do weird dances to work "against" the Activity API as it e.g. sometimes has to call activity.Start() and sometimes diagnosticListener.StartActivity()

This is a pure optimization (verbosity control) to prevent the Start event and the payload allocation associated with it.
These weird dances and 283 lines of ASP.NET instrumentation ensure that the instrumentation has as small an impact as possible.
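
The pattern being discussed can be sketched roughly as follows. It is a simplified illustration, not the ASP.NET Core or HttpClient code; the listener name `Example.Library`, the operation name `Example.Operation` and the `Execute` helper are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class InstrumentationSketch
{
    // Hypothetical listener and operation names, chosen only for this example.
    private static readonly DiagnosticListener Listener = new DiagnosticListener("Example.Library");
    private const string OperationName = "Example.Operation";

    public static T Execute<T>(string command, Func<T> operation)
    {
        // Cheap path: without a subscriber interested in this operation, no Activity is created.
        if (!Listener.IsEnabled() || !Listener.IsEnabled(OperationName, command))
        {
            return operation();
        }

        var activity = new Activity(OperationName);
        activity.AddTag("command", command);

        // Verbosity control: only allocate and send the Start payload if the subscriber asked for it.
        if (Listener.IsEnabled(OperationName + ".Start"))
        {
            Listener.StartActivity(activity, new { Command = command });
        }
        else
        {
            // Still start the Activity so correlation works, but skip the Start event.
            activity.Start();
        }

        try
        {
            return operation();
        }
        finally
        {
            // StopActivity writes the Stop event and stops the Activity,
            // which restores the previous Activity.Current.
            Listener.StopActivity(activity, new { Command = command });
        }
    }
}
```

A call such as `InstrumentationSketch.Execute("GET", () => 42)` runs the delegate directly when there is no subscriber, which is where the near-zero cost in the unobserved case comes from.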

So yes, we do lack documentation that would explain all this and the motivation behind it. Perhaps md files are not good enough for this purpose. We are working on that and there are many things to consider before putting out official guidance everyone needs to follow.

I can see that some concerns are also relevant for OpenTracing/Zipkin, so perhaps there is some level of complexity that none of the existing systems can solve, at least not yet.
I'd say we are moving forward with OpenTracing-like usage of tags, but tags also do not provide any consistency or contracts: if you are instrumenting a new library, there is no guidance on which tags to use and what info to add. It's also not clear which OSS library to use to instrument the code.

Our philosophy for tracing still has to follow some principles like

- automagical with (almost) zero configuration
- (almost) zero perf impact when nobody listens, and runtime sampling and verbosity control when there is a listener
- no additional dependencies, so .NET is traceable itself
- ability to build rich UI on top of it and run our important customer scenarios

I welcome you to think about how we can address your concerns while still satisfying these requirements. I really do.
Just as a first step, I would really appreciate it if you helped us review this lib instrumentation guidance. We can discuss any concerns around IsEnabled dances and Tags, and I hope you can help me clarify my questions about Tags contracts. We can change this guidance as we agree, and I will be working on publishing the result. I welcome @NickCraver, @stebet and anyone else to join this discussion - https://github.com/Microsoft/ApplicationInsights-dotnet-server/issues/913.

lmolkova on 15 May 2018

Thanks for your responses. I disagree on some of your points but I don't want to argue again - we've done so before in quite a few issues and I've done so by mail with Vance.

cwe1ss on 17 May 2018

Are there any more plans on getting this done for Redis? As a Redis user, it would be really helpful to see Redis calls as dependency calls in Application Insights. This should not be blocked simply because some Microsoft libraries don't behave well in their use of tracing events or because of other issues with DiagnosticSource.

Thanks!