Runtime: Missing GC events from EventListener when in Server Mode

Created on 5 Dec 2018  路  20Comments  路  Source: dotnet/runtime

With .NET Core 2.2, it is now possible to listen to Clr traces in-process via EventListener derived classes
Use IssueEventListener.zip to reproduce the issue (code is based on https://blogs.msdn.microsoft.com/dotnet/2018/12/04/announcing-net-core-2-2/)

If an application runs in Server mode (true in the .csproj) only a few GC related events are received when allocations trigger a GC (or induced GC.Collect()).

EventWrittenEventArgs -- ThreadID = 20708 ID = 35 Name = GCTriggered
        Name = "Reason" Value = "0"
        Name = "ClrInstanceID" Value = "8"
EventWrittenEventArgs -- ThreadID = 5292 ID = 14 Name = GCFinalizersBegin_V1
        Name = "ClrInstanceID" Value = "8"
EventWrittenEventArgs -- ThreadID = 5292 ID = 13 Name = GCFinalizersEnd_V1
        Name = "Count" Value = "1"
        Name = "ClrInstanceID" Value = "8"
EventWrittenEventArgs -- ThreadID = 5292 ID = 14 Name = GCFinalizersBegin_V1
        Name = "ClrInstanceID" Value = "8"
EventWrittenEventArgs -- ThreadID = 5292 ID = 13 Name = GCFinalizersEnd_V1
        Name = "Count" Value = "0"
        Name = "ClrInstanceID" Value = "8"
EventWrittenEventArgs -- ThreadID = 20708 ID = 35 Name = GCTriggered
        Name = "Reason" Value = "10"
        Name = "ClrInstanceID" Value = "8"

If the application runs in workstation mode, all expected events are received.

EventWrittenEventArgs -- ThreadID = 33004 ID = 9 Name = GCSuspendEEBegin_V1
    Name = "Reason" Value = "1"
    Name = "Count" Value = "0"
    Name = "ClrInstanceID" Value = "8"

EventWrittenEventArgs -- ThreadID = 33004 ID = 8 Name = GCSuspendEEEnd_V1
    Name = "ClrInstanceID" Value = "8"

EventWrittenEventArgs -- ThreadID = 33004 ID = 35 Name = GCTriggered
    Name = "Reason" Value = "0"
    Name = "ClrInstanceID" Value = "8"

EventWrittenEventArgs -- ThreadID = 33004 ID = 1 Name = GCStart_V2
    Name = "Count" Value = "1"
    Name = "Depth" Value = "0"
    Name = "Reason" Value = "0"
    Name = "Type" Value = "0"
    Name = "ClrInstanceID" Value = "8"
    Name = "ClientSequenceNumber" Value = "0"

EventWrittenEventArgs -- ThreadID = 33004 ID = 202 Name = GCMarkWithType
    Name = "HeapNum" Value = "0"
    Name = "ClrInstanceID" Value = "8"
    Name = "Type" Value = "0"
    Name = "Bytes" Value = "1053010"

EventWrittenEventArgs -- ThreadID = 33004 ID = 202 Name = GCMarkWithType
    Name = "HeapNum" Value = "0"
    Name = "ClrInstanceID" Value = "8"
    Name = "Type" Value = "1"
    Name = "Bytes" Value = "0"

EventWrittenEventArgs -- ThreadID = 33004 ID = 202 Name = GCMarkWithType
    Name = "HeapNum" Value = "0"
    Name = "ClrInstanceID" Value = "8"
    Name = "Type" Value = "2"
    Name = "Bytes" Value = "53176"

EventWrittenEventArgs -- ThreadID = 33004 ID = 202 Name = GCMarkWithType
    Name = "HeapNum" Value = "0"
    Name = "ClrInstanceID" Value = "8"
    Name = "Type" Value = "3"
    Name = "Bytes" Value = "204911"

EventWrittenEventArgs -- ThreadID = 33004 ID = 205 Name = GCGlobalHeapHistory_V2
    Name = "FinalYoungestDesired" Value = "6291456"
    Name = "NumHeaps" Value = "1"
    Name = "CondemnedGeneration" Value = "0"
    Name = "Gen0ReductionCount" Value = "0"
    Name = "Reason" Value = "0"
    Name = "GlobalMechanisms" Value = "20"
    Name = "ClrInstanceID" Value = "8"
    Name = "PauseMode" Value = "1"
    Name = "MemoryPressure" Value = "0"

EventWrittenEventArgs -- ThreadID = 33004 ID = 204 Name = GCPerHeapHistory_V3
    Name = "ClrInstanceID" Value = "8"
    Name = "FreeListAllocated" Value = "0"
    Name = "FreeListRejected" Value = "0"
    Name = "EndOfSegAllocated" Value = "0"
    Name = "CondemnedAllocated" Value = "0"
    Name = "PinnedAllocated" Value = "0"
    Name = "PinnedAllocatedAdvance" Value = "0"
    Name = "RunningFreeListEfficiency" Value = "0"
    Name = "CondemnReasons0" Value = "0"
    Name = "CondemnReasons1" Value = "0"
    Name = "CompactMechanisms" Value = "0"
    Name = "ExpandMechanisms" Value = "0"
    Name = "HeapIndex" Value = "0"
    Name = "ExtraGen0Commit" Value = "32088"
    Name = "Count" Value = "4"

EventWrittenEventArgs -- ThreadID = 33004 ID = 2 Name = GCEnd_V1
    Name = "Count" Value = "1"
    Name = "Depth" Value = "0"
    Name = "ClrInstanceID" Value = "8"

EventWrittenEventArgs -- ThreadID = 33004 ID = 4 Name = GCHeapStats_V1
    Name = "GenerationSize0" Value = "24"
    Name = "TotalPromotedSize0" Value = "1314320"
    Name = "GenerationSize1" Value = "2134648"
    Name = "TotalPromotedSize1" Value = "0"
    Name = "GenerationSize2" Value = "24"
    Name = "TotalPromotedSize2" Value = "0"
    Name = "GenerationSize3" Value = "282528"
    Name = "TotalPromotedSize3" Value = "0"
    Name = "FinalizationPromotedSize" Value = "32"
    Name = "FinalizationPromotedCount" Value = "1"
    Name = "PinnedObjectCount" Value = "1"
    Name = "SinkBlockCount" Value = "10"
    Name = "GCHandleCount" Value = "50"
    Name = "ClrInstanceID" Value = "8"

EventWrittenEventArgs -- ThreadID = 33004 ID = 7 Name = GCRestartEEBegin_V1
    Name = "ClrInstanceID" Value = "8"

EventWrittenEventArgs -- ThreadID = 33004 ID = 3 Name = GCRestartEEEnd_V1
    Name = "ClrInstanceID" Value = "8"

Also note that if you run perfview in both cases, all expected events are received via etw by perfview

area-Tracing-coreclr

Most helpful comment

What are the chances this will get backported to 2.2? I noticed the current milestone is 3.0. Right now the new metrics aren't usable on linux servers sadly.

All 20 comments

cc @noahfalk @Maoni0

thanks for finding this, @chrisnas!

@noahfalk is on vacation, could someone else from @tommcdon's team take a look?

It seems also that BulkType events are not containing the type details and they are received AFTER the finalizers events (this applies even in Workstation GC mode)

ThreadID = 18860 ID = 29 Name = FinalizeObject
    Name = "TypeID" Value = "1210056592"
    Name = "ObjectID" Value = "1371069040"
    Name = "ClrInstanceID" Value = "8"

ThreadID = 18860 ID = 15 Name = BulkType
    Name = "Count" Value = "1"
    Name = "ClrInstanceID" Value = "8"

Adding @tommcdon @jorive from the diagnostics team.

After investigating missing GCRestartEEBegin event, it seems that (vm\threads.inl)GetThread() returns null in EventPipe::WriteEventInternal()
image
In that case, the test that follows just return without emitting anything
if(pThread == NULL)
{
// We can't write an event without the thread object.
return;
}
Maybe some threads (GC threads for example) are not calling SetThread (or setting manually the corresponding TLS)

I hope this helps

It is by design that GC threads do not have the Thread object. This is bug in EventPipe. It should be able to able to log even without the Thread object.

Updating GC and the rest of the system to work with non-null Thread* for server GC worker threads is likely even more difficult.

Hmm, this will require a bit of thought/care. EventPipe works by having a per-thread buffer that it writes to, and today this hangs off of the runtime's 'Thread' structure. As mentioned currently server GC threads do not have a 'Thread' strcuture defined for them, which is why currently we ignore events raised by server GC threads.

There are a number of possible approaches, one of which is to abstract the per-thread structures needed by the logger to so they can be fetched from th Thread Structure but also on their own and these could be fetched differently for the server GC threads.

We have to decide on an approach first...

This probably explains why some threadpool events are missing as well. I suppose that the threadpool gate thread (that pushes the ThreadPoolWorkerThreadAdjustmentAdjustment event) does not have a Thread object.
There could be plenty of other cases like those ones. I agree that the solution should be at the EventPipe level.

one of which is to abstract the per-thread structures needed by the logger to so they can be fetched from the Thread Structure but also on their own

Why not let EventPipe to always manage the buffers on its own? I do not see the value of this abstraction.

Why not let EventPipe to always manage the buffers on its own? I do not see the value of this abstraction.

We may be talking about the same thing (at least conceptually). Ultimately EventPipe needs a way of hanging its state off a thread. Thus all EventPipe state on the EE Thread class has to be gathered into its own object so that it can be hung off any threads. With that done, we need to be able to fetch it given the current thread. I believe your suggestion is to use C++ thread locals for that which is reasonable.

dotnet/coreclr#21817 was merged to master, so you should start seeing events under server GC if you use the runtime from latest master branch. There are still existing issues with EventPipeBuffer works, but that's a separate issue from this so I filed https://github.com/dotnet/coreclr/issues/22502. Please feel free to reopen if the issue still exists.

What are the chances this will get backported to 2.2? I noticed the current milestone is 3.0. Right now the new metrics aren't usable on linux servers sadly.

@joshgarnett There is a bug bar that needs to be met before we can put anything on 2.2. I'm not entirely sure if this bug meets the bar given the scope of change.

Quick question - Is there a particular reason you can't use 3.0 preview? The fixes I made should ship with .NET Core 3.0 Preview 3, and that should address this issue. If there is a strong reason that you cannot move to 3.0 and stay on 2.2, please let us know since that is what we need before we can put changes to 2.2.

@sywhang - I can't speak for @joshgarnett specifically, but for many customers the choice is usually between waiting for the next major release (3.0) vs. waiting for a servicing release (2.2.x). Preview releases are great for poking around and testing upcoming features, but if you need to deploy to production using a Preview build is a risky proposition.

We can follow up with the servicing folks offline, but at minimum I don't think we could take this fix without also getting dotnet/coreclr#22502 fixed. Otherwise we are trading an old bug for a new bug which we strongly try to avoid in servicing.

@noahfalk is correct. My current team is working on a new service that will likely hit production here soon. I'm always cagey about going with initial major releases as well as there always ends up being some gotcha fixed in a dot release.

Given the fix for dotnet/coreclr#22502 was merged in May, can this fix be shipped in a 2.2 servicing release? Has it already?

After dotnet/coreclr#22502 was completed we found further weaknesses in other parts of the multi-threading support for EventPipe that shipped in 2.2. We have been fixing them as we have identified them in 3.0 but the original changes already weren't small and now the complete set is even larger. Backporting only the original fix or a subset is risky from the perspective that it perturbs a part of the system that was already on shaky foundations. Backporting the entire series of fixes is risky from the perspective that it has become interwoven with other work happening in 3.0 and trying to separate it is labor intensive and error-prone. Also a 2.2 servicing fix at this point probably wouldn't come much earlier on the calendar than the 3.0 release.

I know its not the answer you are hoping for, but I think the best course of action if you want this functionality is to use a 3.0 preview build, or wait for the upcoming 3.0 release build.

@noahfalk No worries, thanks for the confirmation :+1:

Was this page helpful?
0 / 5 - 0 ratings