Runtime: ActivityId missing in a newly added HttpTelemetry Stop events.

Created on 13 Aug 2020 · 19Comments · Source: dotnet/runtime

In non-trivial number of cases, the RequestStop event has empty activity id.

9 945 out of 35 452 requests sent during HTTP FunctionalTests were missing activity id.

Other Start/Stop event pairs introduced with the new telemetry are not properly linked with ActivityId as well.

Also easily reproducible with a simple HttpClient scenario:
```C#
class Program
{
static async Task Main(string[] args)
{
using var listener = new HttpEventListener();
using var client = new HttpClient();

    await client.GetAsync(@"https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Http/src/HttpDiagnosticsGuide.md");
}

}

internal sealed class HttpEventListener : EventListener
{
public const EventKeywords TasksFlowActivityIds = (EventKeywords)0x80;
//public const EventKeywords Debug = (EventKeywords)0x20000;

protected override void OnEventSourceCreated(EventSource eventSource)
{
    if (eventSource.Name == "System.Net.Http" ||
        eventSource.Name == "System.Net.Sockets" ||
        eventSource.Name == "System.Net.Security" ||
        eventSource.Name == "System.Net.NameResolution")
    {
        EnableEvents(eventSource, EventLevel.LogAlways);
    }
    else if (eventSource.Name == "System.Threading.Tasks.TplEventSource")
    {
        EnableEvents(eventSource, EventLevel.LogAlways, TasksFlowActivityIds /*| Debug*/);
    }
}

protected override void OnEventWritten(EventWrittenEventArgs eventData)
{
    var sb = new StringBuilder().Append($"{eventData.TimeStamp:HH:mm:ss.fffffff}[{eventData.EventName}]{eventData.ActivityId}.{eventData.RelatedActivityId} ");
    for (int i = 0; i < eventData.Payload?.Count; i++)
    {
        if (i > 0)
            sb.Append(", ");
        sb.Append(eventData.PayloadNames?[i]).Append(": ").Append(eventData.Payload[i]);
    }
    Console.WriteLine(sb.ToString());
}

}


Result (with the change in https://github.com/dotnet/runtime/pull/41022)

09:34:45.3129327[RequestStart]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000 scheme: https, host: github.com, port: 443, pathAndQuery: /dotnet/runtime/blob/master/src/libraries/System.Net.Http/src/HttpDiagnosticsGuide.md, versionMajor: 1, versionMinor: 1
09:34:45.4344957[ResolutionStart]00001011-0000-0000-0000-0000090f9d59.00000011-0000-0000-0000-0000093f9d59 hostNameOrAddress: github.com
09:34:45.4752982[ResolutionStop]00001011-0000-0000-0000-0000090f9d59.00000000-0000-0000-0000-000000000000
09:34:45.4848747[ConnectStart]00002011-0000-0000-0000-0000091f9d59.00000011-0000-0000-0000-0000093f9d59 address: InterNetworkV6:28:{1,187,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,140,82,118,3,0,0,0,0}
09:34:45.5208470[ConnectStop]00002011-0000-0000-0000-0000091f9d59.00000000-0000-0000-0000-000000000000
09:34:45.5347626[HandshakeStart]00003011-0000-0000-0000-0000096f9d59.00000011-0000-0000-0000-0000093f9d59 isServer: False, targetHost: github.com
09:34:45.8030641[HandshakeStop]00003011-0000-0000-0000-0000096f9d59.00000000-0000-0000-0000-000000000000 protocol: Tls13
09:34:45.8070942[Http11ConnectionEstablished]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000
09:34:46.2040073[ResponseHeadersBegin]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000
09:34:46.3278444[RequestStop]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000
```

~Un-paired Stop events:~
~- ResolutionStop: ran in continuation: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.NameResolution/src/System/Net/Dns.cs#L485~
~- ConnectStop: ran in SocketsEventArgs: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs#L744~
~- HandshakeStop: ran in continuation: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs#L240~
They're paired, I misinterpreted the logs. The Stop event doesn't carry parent activity id, only the start one.

What needs to be done to achieve desired statistics:

[x] Fix Request events #41022
[ ] ~Fix Resolution events~ wrongly interpreted logs
[ ] ~Fix Connect events~ wrongly interpreted logs
[ ] ~Fix Handshake events~ wrongly interpreted logs
[x] Add events (Start/Stop) for request headers sent
[x] Add events (Start/Stop) for request body sent
[x] Add events (Start/Stop) for response headers received
[x] Add events (Start/Stop) for response body received when we buffer it #41022
[x] Add events (Start/Stop) to convenience methods using ResponseHeadersRead for the whole method #41022
[x] Add events (Start/Stop) to convenience methods for the response stream reading (the same pair as in for response body received) #41022
[ ] Document how to add custom Start/Stop events for users using ResponseHeadersRead that want to properly time response body reading (based on our solution for convenience methods)

We can also add or leave existing unpaired telemetry (though it shouldn't be named Start/Stop then) for events that cannot be bound by a single async context. We can link them to the Start/Stop events manually either via remembering ActivityId or using some other custom ID.

cc: @samsp-msft @MihaZupan @karelz

area-System.Net.Http bug

Source

ManickaP

👍2

All 19 comments

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

msftbot[bot] on 13 Aug 2020

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

msftbot[bot] on 13 Aug 2020

I might have found the root cause of this problem.
When I fetch an HTTP response without content, the ActivityId is present in the Stop event.

The RequestStop is called only after the response has been read to the end, which happens outside of the async scope started when RequestStart was called.

Details:

RequestStart is called in HttpConnectionPoolManager, several levels down from HttpClient.SendAsync.
RequestStop is called in HttpRequestMessage called when the response content stream has been read: e.g.: ContentLengthReadStream. Which is either called by the user, or in case of response buffering, in HttpClient.FinishSendAsync.

Several options how to solve it:

track only request part of request / response exchange, i.e. call stop sooner.
- maybe add separate tracking for response (no idea if it's doable)
manually handle activity id and pass it to the stop event (no idea if it's doable)
keep it as it is, pairing won't work
- manually add custom identifier into the logging so at least something can be used

ManickaP on 14 Aug 2020

@ManickaP nice investigating 👍
Indeed as you observed, ActivityId is async-local so it flows from parent to child when creating a Task but changes do not flow back from child to parent when an await completes.

I'm not knowledgeable about the networking code, but is there an option to call Start() sooner so that it runs in the same Task where Stop() will eventually be called? Other more customized solutions may also be possible, but I've never attempted to author one and my suspicion is it would get messy.

noahfalk on 14 Aug 2020

👍1

is there an option to call Start() sooner so that it runs in the same Task where Stop() will eventually be called?

We could, theoretically, run it from HttpClient.SendAsync, but that would only help us with buffered responses. In the other case, the request gets finished after the user either disposes it or reads the response stream till the end and we have no control over when and where this happens.

Is there a way to manually retrieve and later attach the ActivityId to the stop event?

ManickaP on 14 Aug 2020

As @ManickaP said, we could only do that for buffered responses. For unbuffered ones, we hand out the response that the user can interact with that will eventaully trigger a Stop event.

If I'm understanding this correctly, we can't flow the ActivityID up to the parent?
In that case, is the HttpClient design (handing out the response to the caller to process) incompatible with a Start/Stop activity tracking the entire processing lifetime?

If we give up on tracking the response portion, we would cut into the value of the activity. At that point we would be telling users to log their own Start/Stop activity enclosing the processing of a request.

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense? In that can we could ensure that everything happens in the same scope.

Something along the lines of (just a rough idea):
c# // On HttpClient Task<HttpResponseMessage> SendAndProcessWithTelemetryAsync(HttpRequestMessage request, HttpCompletionOption completionOption, Func<HttpResponseMessage, Task> responseCallback, CancellationToken cancellationToken) { TelemetryStart(); try { HttpResponseMessage response = await this.SendAsync(request, completionOption, cancellationToken); await responseCallback(response); TelemetryStop(); return response; } catch { TelemetryAbort(); // Which also logs Stop throw; } }

MihaZupan on 14 Aug 2020

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense? In that can we could ensure that everything happens in the same scope.

That would measure/trace different things depending of HttpCompletionOptions, for buffered it'd include reading the response, for read-headers it wouldn't.
It's similar the first solution I suggested in here: https://github.com/dotnet/runtime/issues/40776#issuecomment-673954728.

ManickaP on 14 Aug 2020

for read-headers it wouldn't.

The idea is that it would be captured in the user-provided callback.

Essentially if a user had code like

```c#
Task DoSomethingWithResponseAsync(HttpResponseMessage response) {}


```c#
var response = await client.SendAsync(request, completionOption);
await DoSomethingWithResponseAsync(response);

it would turn into

c# await client.SendAndProcessWithTelemetryAsync(request, completionOption, DoSomethingWithResponseAsync);

That way the Telemetry Start/Stop would be in the enclosing scope of the response processing.

MihaZupan on 14 Aug 2020

👍1

Now I see.

ManickaP on 14 Aug 2020

Is there a way to manually retrieve and later attach the ActivityId to the stop event?

One could save EventSource.CurrentThreadActivityId after the start event, use EventSource.SetCurrentThreadActivityId to temporarily change the activity ID before and after the stop event, and set EventActivityOptions.Disable on the stop event to prevent ActivityTracker from overriding the activity ID. However, if the request causes any other activities, then those would not get the correct parent in their hierarchical GUIDs, so the async solution seems safer.

KalleOlaviNiemitalo on 14 Aug 2020

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense?

Good question, but no :-) We do not want to encourage that style of programming, and it won't fix the 99% case (on top of which it's more expensive).

stephentoub on 14 Aug 2020

One could save EventSource.CurrentThreadActivityId after the start event ...

What @KalleOlaviNiemitalo suggests would work to set the activity id of the stop event, however there is also the consideration of what activity id is attached to all events that happen to occur in between the start and stop. Perhaps we are in position to know that we can control all of those events because we own 100% of the code that executes between those two points? I didn't get a chance to read all the updates above and it is almost 5:30am so I need to sleep before I'll be able to do a good job thinking about this. When I'm awake again I'll dig into it further : )

noahfalk on 14 Aug 2020

👍1

We can move the RequestStart somewhere within the start of HttpClient.SendAsync. This should put all of our other activities in the same async scope. After that, the control is in user's hands, we don't need to track those activities, we just need to link the RequestStop once it happens. For that, we can use the SetCurrentThreadActivityId. The only things that would be excluded are content reading methods (LoadIntoBuffer, SerializeToStream, CreateContentReadStream, etc.), which, AFAIK, don't have any activities, so we should be fine.

All of this is based on premise, that tracing the request until the whole response in consumed is what we want. Alternatively, we could narrow the scope only to the point of creating the response and exclude the response content reading.

ManickaP on 14 Aug 2020

If the response is unbuffered, then we should make the stop be when we hand back control to the user code, as its not ultimately us in control.
If we want to make start/stop events for this that can be correlated, we could create a new event with its own activity ID, that is not parented (so as not to break the async generated IDs) and then reference either it or the other ID in one of other of the payloads, so somebody can track it if they need to. I would put this uber-event at a lower priority for being completed. It may not make sense to be an activity ID, just some other ID that can be used to correlate.

samsp-msft on 17 Aug 2020

In that case, is the HttpClient design (handing out the response to the caller to process) incompatible with a Start/Stop activity tracking the entire processing lifetime?

Yes (at least in the cases where the caller will still need to download the remainder of the response after SendAsync task completes). Apologies for not recognizing the potential for this issue earlier.

The main premises of Activities are:

Contiguous time periods of work on a logical thread (an async Task and any Task children). This requires them to have a start time, an end time, and an implied duration between.
They nest. When a logical thread starts working on a new activity, the new activity needs to finish before work resumes on the parent.

Even if we are capable creating events that don't adhere to these rules, the tools doing analysis are likely to struggle because we expect Activities form a tree. So a generic pattern that is hard for the library to represent with a single Activity looks like this:

```c#
class Worker
{
Task DoWorkPart1(...); // in our case this is HttpClient.SendAsync()
Task DoWorkPart2(...); // and Stream.ReadAsync() on the returned stream
}

A user can consume that API and do very not-treeish things with it:
```c#
List<string> workList = new List<string>();
foreach(string work in workList) { await Worker.DoWorkPart1(work); }
foreach(string work in workList) { await Worker.DoWorkPart2(work); }

The individual parts of the work can be sequenced/nested, but if part1 and part2 are combined into a single activity they don't nest:

time ---->

| work 1.1 | work 2.1 | work 3.1 |  work 1.2 | work 2.2 | work 3.2 |
< -------------- work 1 --------------------->
            <------------------ work 2 ----------------->
                       < ------------------ work 3 ---------------->

However, there are still options we could look at to improve the situation, now and/or in the future:

Create additional (optional) Activities with start/stop events when the response stream is read. We could also create a correlation id so that analysis tools can connect those reads back to the SendAsync which initiated the request. Stream.GetHashCode() might be unique enough to use? (this mirrors @samsp-msft suggestion above)
Collaborate with Visual Studio profiler and/or perfview to add logic correlating the different chunks of work and improve analysis/UI. This type of IO analysis seems likely to be useful so if there isn't already some existing support I can advocate that we build it.
Scope down the activity created by SendAsync so that it sends the stop event at the end of SendAsync.
Create single activity scopes for situations where the library can enforce that all the work happens contiguously, for example GetByteArrayAsync or GetStringAsync
Provide guidance to higher level libraries and app developers showing them how they can create single activities that group together multiple portions of the work when they are in position to know it happens contiguously.

I also want to loop in @brianrob who has probably had more experience with this type of issue than I have. He may know of some pre-existing approaches that handle this better without needing to do anything novel. Hope this helps!

noahfalk on 18 Aug 2020

For 5, I'd recommend we scope down and send the stop at the end of SendAsync, and make it clear either with the event name, or parameters in the payload as to what type of request it is, and so what the Stop will actually correspond to.

samsp-msft on 18 Aug 2020

👍1

FYI: For your test app, make sure you aren’t running under the debugger. VS TPL analysis can conflict with the activity tracking – its something Noah mentioned to me when I was looking at some other event stuff.

Sam

samsp-msft on 19 Aug 2020

Closing as resolved by #41022

MihaZupan on 31 Aug 2020

Fixed in 5.0 in PR #41610