Runtime: Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister

Created on 25 Apr 2020 · 23 comments · Source: dotnet/runtime

``` System.IO.Tests.FileStream_LockUnlock.OverlappingRegionsFromOtherProcess_ThrowsException(fileLength: 10, firstPosition: 0, firstLength: 10, secondPosition: 1, secondLength: 2) [FAIL]
Half-way through waiting for remote process.
Memory load: 14

  Image Name                     PID Services                                    
  ========================= ======== ============================================
  svchost.exe                    620 BrokerInfrastructure, DcomLaunch, LSM,      
                                     PlugPlay, Power, SystemEventsBroker         
  svchost.exe                    652 RpcEptMapper, RpcSs                         
  svchost.exe                    776 Dhcp, EventLog, lmhosts, vmictimesync,      
                                     Wcmsvc                                      
  svchost.exe                    808 AeLookupSvc, Appinfo, BITS, CertPropSvc,    
                                     gpsvc, iphlpsvc, LanmanServer, ProfSvc,     
                                     sacsvr, Schedule, SENS, SessionEnv,         
                                     ShellHWDetection, Themes, Winmgmt           
  svchost.exe                    888 EventSystem, FontCache, netprofm, nsi,      
                                     W32Time, WinHttpAutoProxySvc                
  svchost.exe                    976 CryptSvc, Dnscache, LanmanWorkstation,      
                                     NlaSvc, WinRM                               
  svchost.exe                    592 BFE, DPS, MpsSvc, pla                       
  svchost.exe                   1136 AppHostSvc                                  
  svchost.exe                   1164 DiagTrack                                   
  svchost.exe                   1328 TrkWks, UALSVC, UmRdpService,               
                                     vmickvpexchange, vmicshutdown, vmicvss      
  svchost.exe                   1352 W3SVC, WAS                                  
  svchost.exe                   2632 TermService                                 
  svchost.exe                   2672 vmicheartbeat, vmicrdv                      
  Timed out at 4/25/2020 1:55:40 AM after 60000ms waiting for remote process.
  Wrote mini dump to: C:\h\w\AE3609A5\w\B05309CA\uploads\3476.zuzob0j0.l20.dmp
    Process ID: 3476
    Handle: 1004
    Name: dotnet
    MainModule: C:\h\w\AE3609A5\p\dotnet.exe
    StartTime: 4/25/2020 1:54:40 AM
    TotalProcessorTime: 00:00:00.2343750
    Threads:
        Thread #1 (OS 0x560)   
            [InlinedCallFrame] (System.Environment._Exit)
            [InlinedCallFrame] (System.Environment._Exit)
            System.Environment.Exit(Int32)
            Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
        Thread #2 (OS 0xA38) [Finalizer] [Background] [MTA]
            [InlinedCallFrame] (Interop+Advapi32.EventUnregister)
            [InlinedCallFrame] (Interop+Advapi32.EventUnregister)
            System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)
            System.Diagnostics.Tracing.EventProvider.EventUnregister(Int64)
            System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)
            System.Diagnostics.Tracing.EventProvider.Dispose()
            System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
            System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)
            System.AppContext.OnProcessExit()
        Thread #3 (OS 0x700) [Thread pool worker] [Background] [MTA]
        Thread #4 (OS 0xE50) [Thread pool worker] [Background] [MTA]

```

area-System.Diagnostics.Tracing blocking-clean-ci

All 23 comments

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

Tagging subscribers to this area: @tarekgh, @tommcdon
Notify danmosemsft if you want to be subscribed.

Saved dump at \\jkotas9\drops\issue-35451

We are stuck inside GetProcAddress for no apparent reason:

 # Child-SP          RetAddr           Call Site
00 0000000f`f392ef48 00007ffa`a429de39 ntdll!ZwWaitForSingleObject+0xa [d:\blue.obj.amd64fre\minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 198] 
01 0000000f`f392ef50 00007ffa`a429b7e4 ntdll!RtlpWaitOnCriticalSection+0xe1 [d:\blue\minkernel\ntdll\resource.c @ 771] 
02 0000000f`f392f020 00007ffa`a429b6dc ntdll!RtlpEnterCriticalSectionContended+0xa4 [d:\blue\minkernel\ntdll\resource.c @ 1220] 
03 0000000f`f392f060 00007ffa`a429f6be ntdll!LdrpAcquireLoaderLock+0x2c [d:\blue\minkernel\ntdll\ldr.c @ 250] 
04 0000000f`f392f0a0 00007ffa`a17d58f0 ntdll!LdrGetProcedureAddressForCaller+0x19e [d:\blue\minkernel\ntdll\ldrapi.c @ 1901] 
05 0000000f`f392f290 00007ffa`803ee358 KERNELBASE!GetProcAddressForCaller+0x70 [d:\blue\minkernel\kernelbase\module.c @ 1891] 
06 0000000f`f392f2e0 00007ffa`803ee17d coreclr!NDirectMethodDesc::FindEntryPoint+0x4c [F:\workspace\_work\1\s\src\coreclr\src\vm\method.cpp @ 5339] 
07 0000000f`f392f340 00007ffa`803edb8b coreclr!NDirect::NDirectLink+0x119 [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7121] 
08 0000000f`f392f670 00007ffa`804a2ae3 coreclr!NDirectImportWorker+0x7b [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7271] 
09 0000000f`f392f6f0 00007ffa`2093d471 coreclr!NDirectImportThunk+0x43 [F:\workspace\_work\1\s\src\coreclr\src\vm\amd64\AsmHelpers.asm @ 198] 
0a 0000000f`f392f760 000e000f`d1613df0 System_Private_CoreLib!System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)+0x81 [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1286] 

This may be a duplicate of #13610.

@noahfalk for thoughts on what might be causing this

I'll see if I can take a closer look tomorrow (I'll need to get back on corpnet to access the dump). Off the cuff, that callstack looked like the thread was blocked on the loader lock, so some other thread is probably holding it. If so, we'll spot the culprit in the dump.

@jkotas - I wasn't able to access the dump on your share. I can list directories in \\jkotas9 and \\jkotas9\public, but I get access denied trying to open \\jkotas9\drops. For comparison I am able to access shares on \\cpvsbuild\ so I suspect this is a security configuration issue on your share rather than a broader issue with my credentials or VPN connection.

Attaching windbg showed the loader lock was unowned:
0:002> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8) : 0x7ffaa438e8a8 [Type: _RTL_CRITICAL_SECTION *]
    [+0x000] DebugInfo        : 0x7ffaa438ebe0 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
    [+0x008] LockCount        : 0 [Type: long]
    [+0x00c] RecursionCount   : 0 [Type: long]
    [+0x010] OwningThread     : 0x0 [Type: void *]
    [+0x018] LockSemaphore    : 0x168 [Type: void *]
    [+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Best guess: the loader lock was owned earlier, causing the OS to switch the thread out; later the lock was released, and for whatever reason the OS hasn't switched the thread back in yet.

Jan's suggestion of #13610 sounds very plausible to me. If we wanted more data on this failure (or failures of this class), we'd probably need to upgrade the telemetry to determine whether the thread is making progress and what is happening machine-wide. A better set of telemetry to capture for a suspected hung test would be:

  1. a process dump
  2. an ETW trace with stack sampling and context switches, machine-wide, for ~30 seconds
  3. another process dump

This would clearly show whether progress was being made, as well as what else the machine is doing; a rough sketch of the dump/trace/dump idea follows.
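
To make the dump → trace → dump idea concrete, here is a rough C++ sketch using DbgHelp's MiniDumpWriteDump. The target PID, the output file names, and the assumption that a machine-wide ETW trace (e.g. via wpr) is started externally during the sleep are all illustrative; a real harness would likely just shell out to procdump.

```
// two_dumps.cpp : sketch of "dump, let a machine-wide trace run ~30s, dump again".
// Link with Dbghelp.lib. The target PID is passed on the command line.
#include <windows.h>
#include <dbghelp.h>
#include <stdio.h>
#include <stdlib.h>

static bool WriteDump(DWORD pid, const wchar_t* path)
{
    HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pid);
    if (proc == nullptr)
        return false;
    HANDLE file = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                              FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE)
    {
        CloseHandle(proc);
        return false;
    }
    BOOL ok = MiniDumpWriteDump(proc, pid, file, MiniDumpWithFullMemory,
                                nullptr, nullptr, nullptr);
    CloseHandle(file);
    CloseHandle(proc);
    return ok != FALSE;
}

int wmain(int argc, wchar_t** argv)
{
    DWORD pid = (argc > 1) ? (DWORD)_wtoi(argv[1]) : 0;
    if (pid == 0)
    {
        fwprintf(stderr, L"usage: two_dumps <pid>\n");
        return 1;
    }

    WriteDump(pid, L"before.dmp");  // snapshot 1
    Sleep(30 * 1000);               // window in which the ETW trace is collected
    WriteDump(pid, L"after.dmp");   // snapshot 2
    return 0;
}
```

Comparing the suspect thread's stack between before.dmp and after.dmp, together with the context-switch data in the trace, would show whether the thread is making any progress or is truly stuck.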

Closing this as a duplicate of #13610.

This was most likely introduced by #33307 according to the CI telemetry. See https://github.com/dotnet/runtime/issues/13610#issuecomment-621523649 for more context.

Another data point for this: based on the Kusto data I see, this only happens on Windows 7 and Windows 8.1 -- I tried to repro locally on Windows 10 and after 150 runs it didn't repro.

Thanks @safern! That's very likely to save @josalem from hitting some dead ends.

this only happens on Windows 7 and Windows 8.1 -- I tried to repro locally on Windows 10 and after 150 runs it didn't repro.

Interesting!

I'll take a look at this today. Thanks for triaging things, @safern! I'm not sure what would be causing this issue, but based on the versions being hit, I'm guessing it might be something with overlapped I/O on older versions of Windows. I'll have to do some investigating.

I spun up a Win7 VM and I'm trying to catch this, but I can't seem to reproduce the failure. I am running the System.Collections.Concurrent.Tests suite in a loop and having it ping me if a run takes longer than 60 seconds. So far I've noticed that the runs either take almost exactly 30 seconds or almost exactly 65 seconds. My VM is 2 vCPUs / 4 GB, so unfortunately it's a little bigger than the machines in CI.

I've narrowed the test cycle down to just the System.Collections.Concurrent.Tests.EtwTests test that I saw a failure on in the other thread. That test usually completes in <1 second, but every once in a while takes >10 seconds (1 in 750+ loops so far). I'll leave it running overnight and see if it gets hit more often. I might try artificially bogging down the machine as well and see if that changes things.

@safern, were you ever able to locally reproduce this?

While I try to get a repro set up, I'll take another look at the dump that Jan shared. Based on what I've already seen and what Noah found, I'm not sure there is enough in the dump to root-cause this. The thing I found interesting is that the dump shows the DiagnosticsServer is inactive, i.e., it's in the middle of a Poll call that is using WaitForMultipleObjects to wait on a named pipe handle. That PR doesn't include any changes to eventing, and since the monitor configuration isn't enabled, it shouldn't be doing anything new. I'm pretty sure this also means we haven't reached the VM's shutdown code, so we wouldn't have called DiagnosticServer::Shutdown yet, so it should be fine that the server is waiting. I can try bisecting my changes with ifdefs to see if something is exacerbating the problem, but I can't do too much until I can reliably reproduce the error. I'll keep digging.

By "ping" I meant that it literally made a ping sound from the console. I had stepped away from my computer for a while, and it started beeping. It seems like after 1000 loops most of the runs are taking >10 seconds! The xunit output claims the tests ran in 5 seconds (still a big increase over the 0.6 second runtime from the first 900 loops), but the RuntTests.cmd script didn't finish before 10 seconds. The logs _also_ say that the tests passed. I'll dig into this more tomorrow with some better logging and see if it actually means anything (and mute my computer before I call it a day 馃槤 ).

I left my script running overnight. I'm seeing some interesting behavior that I'm going to try and reproduce the rest of today.

I wrote up a PowerShell script that does the following in an infinite loop:

$Now = Get-Date
$job = Start-Job -ScriptBlock {
    # invoke RunTests.cmd scoped to just System.Collections.Concurrent.Tests.EtwTests
}

While ($job.State -eq 'Running') {
    # more than 10 seconds of wall-clock time have elapsed since the job started
    if ((Get-Date).AddSeconds(-10) -gt $Now) {
        # beep and log
    }
    Start-Sleep -Milliseconds 250   # poll the job state politely instead of spinning
}

$Output = $job | Receive-Job
Write-Host $Output

The time check is wall-clock time, so it isn't timing how long xunit claims the tests executed for. For the first ~1000 iterations, the job would complete in <10s and xunit would claim ~0.6s of test time. After that, almost every job took >10s and xunit claimed the tests took 3-6s, which is a 10x increase at the upper end. After ~3000 iterations, it looks like something went haywire and the PowerShell job died but was still reported as running: the time-check loop never exited, the state stayed 'Running', yet there was no dotnet.exe process running on the system.

I want to build a more robust script that does some more interesting logging, especially better timing. Right now, my thinking is that there might be some kind of OS scheduling issue that we are exacerbating with high volumes of RemoteExecutor tests, which causes performance of the tests to degrade over time. I'll try running this same experiment with another test suite and see if I see similar behavior. I think it's possible that the tests are cumulatively hitting a timeout as this degradation happens. When we then take a dump, we keep seeing the EventUnregister stack because that method is specifically waiting on the OS to wake it up, but the scheduling degradation is causing a consistent delay, so our dump logic is invoked before the OS "gets around to it". I need more evidence to back this up, though, since it does seem to happen more often after #33307 despite that change not touching this logic and being inert throughout the test I used.

Can we revert the offending PR? This is causing at least one leg in the majority of PRs to fail. It's going to make it much more likely that other regressions slip in, as we're having to rerun lots of legs and merge with failures that might be masking other failures, and it's going to make it more likely devs assume that failures aren't theirs even if they are in fact caused by their PR. Thanks!

I agree. This is being hit in 50+ runs a day, and we're now at a 0% pass rate in rolling builds as well.

Reverting in #35767

I have stepped through the Event Pipe IPC shutdown sequence. I believe that there is a clear HANDLE use-after-close bug:

Handle that is being closed here:

00 CoreCLR!IpcStream::DiagnosticsIpc::Close+0xbb [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 213] 
01 CoreCLR!IpcStreamFactory::ConnectionState::Close+0x42 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.h @ 38] 
02 CoreCLR!IpcStreamFactory::CloseConnections+0x71 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 115] 
03 CoreCLR!DiagnosticServer::Shutdown+0x1e8 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 229] 
04 CoreCLR!EEShutDownHelper+0x1b6 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1212] 
05 CoreCLR!EEShutDown+0x346 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1677] 
06 CoreCLR!CorHost2::UnloadAppDomain2+0xfb [D:\runtime\src\coreclr\src\vm\corhost.cpp @ 942] 

is being waited on - or about to be waited on - here:

00 KERNELBASE!WaitForMultipleObjects [minkernel\kernelbase\synch.c @ 1401] 
01 CoreCLR!IpcStream::DiagnosticsIpc::Poll+0x31f [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 311] 
02 CoreCLR!IpcStreamFactory::GetNextAvailableStream+0x1c2 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 162] 
03 CoreCLR!DiagnosticServer::DiagnosticsServerThread+0x379 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 50] 
04 KERNEL32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
05 ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1153] 

How does this explain the hang:

  • The handle value gets reused for an auto-reset event allocated by the OS loader code
  • The Wait operation in IpcStream will reset the auto-reset event
  • The OS loader code will hang waiting for the event forever because it was already reset

Why is it more likely to repro on Win7/8:

  • Differences in the OS scheduler and differences in the OS handle allocation algorithm

The shutdown sequence of the event pipe needs careful review. Also, it would be a good idea to minimize the work done during shutdown to only what's absolutely necessary; e.g., these event handles do not really need to be closed, since the OS will close them anyway as part of process shutdown.
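
To make the failure mode concrete, here is a minimal standalone Win32 sketch of the hazard. It is illustrative only, not CoreCLR code, and it relies on the common (but not guaranteed) behavior that the OS recycles the just-freed handle slot:

```
// handle_reuse_demo.cpp : illustrates how a closed HANDLE value can be recycled
// for a new auto-reset event, so a stale wait on the old value silently consumes
// the new event's signal.
#include <windows.h>
#include <stdio.h>

int main()
{
    // Component A creates an event and keeps the raw handle value around (the bug).
    HANDLE stale = CreateEventW(nullptr, FALSE /*auto-reset*/, FALSE, nullptr);
    printf("original handle value: %p\n", stale);
    CloseHandle(stale);                 // A closes it, but will wait on the value later

    // Component B (think: OS loader) allocates its own auto-reset event.
    // Windows frequently recycles the just-freed slot, so the values often collide.
    HANDLE loaderEvent = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    printf("new handle value:      %p\n", loaderEvent);
    if (loaderEvent != stale)
    {
        printf("handle value was not recycled this run; the race needs the values to collide\n");
        return 0;
    }

    SetEvent(loaderEvent);              // B signals its event ...

    // ... but A's stale wait runs first; because the values collide, it consumes
    // the signal and resets the auto-reset event.
    DWORD r = WaitForSingleObject(stale, 0);
    printf("stale wait result:  %lu (0 == WAIT_OBJECT_0, signal consumed)\n", r);

    // B now waits for an event that will never be signaled again.
    r = WaitForSingleObject(loaderEvent, 1000 /*timeout so the demo exits*/);
    printf("loader wait result: %lu (258 == WAIT_TIMEOUT, i.e. the hang)\n", r);
    return 0;
}
```

When the numeric values collide, the stale wait silently consumes the new event's signal, which is exactly the situation that leaves the loader waiting forever.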

The dump at \\jkotas9\drops\issue-35451 confirms this: pHandles[0] on thread 3 is 0x168, and CriticalSection.LockSemaphore on thread 2 is 0x168.

Thanks, Jan!

Oof! That's a tricky one! Thanks for pinpointing the issue, @jkotas! I'm writing a fix for it now. Earlier iterations of the PR specifically didn't close all the handles and didn't attempt to cancel any waits, but I recalled #12991, which was "fixed" by closing handles to break out of a blocking wait. I'll come up with a more elegant approach that cancels the blocking Poll call and doesn't attempt subsequent cleanup.
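
For context, one common shape for such a cancellable wait is sketched below. The names and the use of a plain event as a stand-in for the real per-connection pipe handle are my own illustration, not necessarily the fix that was ultimately merged:

```
// cancellable_poll.cpp : sketch of a poll loop that is woken by a dedicated
// shutdown event rather than by closing the handles it waits on.
// g_connectionEvent stands in for the real per-connection handle.
#include <windows.h>
#include <stdio.h>

static HANDLE g_shutdownEvent;     // manual-reset: stays signaled once shutdown starts
static HANDLE g_connectionEvent;   // placeholder for the real connection handle

static DWORD WINAPI PollThread(void*)
{
    HANDLE handles[] = { g_shutdownEvent, g_connectionEvent };
    for (;;)
    {
        DWORD which = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        if (which == WAIT_OBJECT_0)        // shutdown requested
        {
            printf("poll thread: shutdown signaled, exiting without touching handles\n");
            return 0;                      // no CloseHandle here; process exit reclaims them
        }
        if (which == WAIT_OBJECT_0 + 1)
        {
            printf("poll thread: connection activity\n");
            // ... hand the stream to the dispatcher ...
        }
    }
}

int main()
{
    g_shutdownEvent   = CreateEventW(nullptr, TRUE  /*manual-reset*/, FALSE, nullptr);
    g_connectionEvent = CreateEventW(nullptr, FALSE /*auto-reset*/,  FALSE, nullptr);

    HANDLE thread = CreateThread(nullptr, 0, PollThread, nullptr, 0, nullptr);

    Sleep(100);                       // pretend the server ran for a while
    SetEvent(g_shutdownEvent);        // shutdown path: wake the poll, close nothing
    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    return 0;
}
```

The key property is that shutdown wakes the poll by signaling an extra event that only the server thread looks at, so no handle that another thread might still be waiting on ever needs to be closed mid-wait.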
