```
System.IO.Tests.FileStream_LockUnlock.OverlappingRegionsFromOtherProcess_ThrowsException(fileLength: 10, firstPosition: 0, firstLength: 10, secondPosition: 1, secondLength: 2) [FAIL]
Half-way through waiting for remote process.
Memory load: 14
Image Name PID Services
========================= ======== ============================================
svchost.exe 620 BrokerInfrastructure, DcomLaunch, LSM,
PlugPlay, Power, SystemEventsBroker
svchost.exe 652 RpcEptMapper, RpcSs
svchost.exe 776 Dhcp, EventLog, lmhosts, vmictimesync,
Wcmsvc
svchost.exe 808 AeLookupSvc, Appinfo, BITS, CertPropSvc,
gpsvc, iphlpsvc, LanmanServer, ProfSvc,
sacsvr, Schedule, SENS, SessionEnv,
ShellHWDetection, Themes, Winmgmt
svchost.exe 888 EventSystem, FontCache, netprofm, nsi,
W32Time, WinHttpAutoProxySvc
svchost.exe 976 CryptSvc, Dnscache, LanmanWorkstation,
NlaSvc, WinRM
svchost.exe 592 BFE, DPS, MpsSvc, pla
svchost.exe 1136 AppHostSvc
svchost.exe 1164 DiagTrack
svchost.exe 1328 TrkWks, UALSVC, UmRdpService,
vmickvpexchange, vmicshutdown, vmicvss
svchost.exe 1352 W3SVC, WAS
svchost.exe 2632 TermService
svchost.exe 2672 vmicheartbeat, vmicrdv
Timed out at 4/25/2020 1:55:40 AM after 60000ms waiting for remote process.
Wrote mini dump to: C:\h\w\AE3609A5\w\B05309CA\uploads\3476.zuzob0j0.l20.dmp
Process ID: 3476
Handle: 1004
Name: dotnet
MainModule: C:\h\w\AE3609A5\p\dotnet.exe
StartTime: 4/25/2020 1:54:40 AM
TotalProcessorTime: 00:00:00.2343750
Threads:
Thread #1 (OS 0x560)
[InlinedCallFrame] (System.Environment._Exit)
[InlinedCallFrame] (System.Environment._Exit)
System.Environment.Exit(Int32)
Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
Thread #2 (OS 0xA38) [Finalizer] [Background] [MTA]
[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)
System.Diagnostics.Tracing.EventProvider.EventUnregister(Int64)
System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)
System.Diagnostics.Tracing.EventProvider.Dispose()
System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)
System.AppContext.OnProcessExit()
Thread #3 (OS 0x700) [Thread pool worker] [Background] [MTA]
Thread #4 (OS 0xE50) [Thread pool worker] [Background] [MTA]
```
Hit by #35427 in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-35427-merge-6fdb541e69394f00ba/System.Threading.Thread.Tests/console.72689c94.log?sv=2019-02-02&se=2020-05-05T01%3A21%3A24Z&sr=c&sp=rl&sig=Aalm%2BlDqE2HM88uUCuqEEJtCb%2F6EW2vB%2BQH9i9G1F6Y%3D (test System.Threading.Threads.Tests.ThreadTests.WindowsPrincipalPolicyTest_Windows)
Saved dump at \\jkotas9\drops\issue-35451
We are stuck inside GetProcAddress for no apparent reason:
```
# Child-SP RetAddr Call Site
00 0000000f`f392ef48 00007ffa`a429de39 ntdll!ZwWaitForSingleObject+0xa [d:\blue.obj.amd64fre\minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 198]
01 0000000f`f392ef50 00007ffa`a429b7e4 ntdll!RtlpWaitOnCriticalSection+0xe1 [d:\blue\minkernel\ntdll\resource.c @ 771]
02 0000000f`f392f020 00007ffa`a429b6dc ntdll!RtlpEnterCriticalSectionContended+0xa4 [d:\blue\minkernel\ntdll\resource.c @ 1220]
03 0000000f`f392f060 00007ffa`a429f6be ntdll!LdrpAcquireLoaderLock+0x2c [d:\blue\minkernel\ntdll\ldr.c @ 250]
04 0000000f`f392f0a0 00007ffa`a17d58f0 ntdll!LdrGetProcedureAddressForCaller+0x19e [d:\blue\minkernel\ntdll\ldrapi.c @ 1901]
05 0000000f`f392f290 00007ffa`803ee358 KERNELBASE!GetProcAddressForCaller+0x70 [d:\blue\minkernel\kernelbase\module.c @ 1891]
06 0000000f`f392f2e0 00007ffa`803ee17d coreclr!NDirectMethodDesc::FindEntryPoint+0x4c [F:\workspace\_work\1\s\src\coreclr\src\vm\method.cpp @ 5339]
07 0000000f`f392f340 00007ffa`803edb8b coreclr!NDirect::NDirectLink+0x119 [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7121]
08 0000000f`f392f670 00007ffa`804a2ae3 coreclr!NDirectImportWorker+0x7b [F:\workspace\_work\1\s\src\coreclr\src\vm\dllimport.cpp @ 7271]
09 0000000f`f392f6f0 00007ffa`2093d471 coreclr!NDirectImportThunk+0x43 [F:\workspace\_work\1\s\src\coreclr\src\vm\amd64\AsmHelpers.asm @ 198]
0a 0000000f`f392f760 000e000f`d1613df0 System_Private_CoreLib!System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)+0x81 [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1286]
```
This may be a duplicate of #13610.
@noahfalk for thoughts on what might be causing this
I'll see if I can take a closer look tomorrow (I'll need to get back on corpnet to access the dump). Off the cuff, that callstack looked like the thread was blocked on the loader lock, so probably some other thread is holding it. If so, we'll spot the culprit in the dump.
@jkotas - I wasn't able to access the dump on your share. I can list directories in \\jkotas9 and \\jkotas9\public, but I get access denied trying to open \\jkotas9\drops. For comparison I am able to access shares on \\cpvsbuild\ so I suspect this is a security configuration issue on your share rather than a broader issue with my credentials or VPN connection.
Attaching windbg showed the loader lock was unowned:
```
0:002> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8) : 0x7ffaa438e8a8 [Type: _RTL_CRITICAL_SECTION *]
[+0x000] DebugInfo : 0x7ffaa438ebe0 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
[+0x008] LockCount : 0 [Type: long]
[+0x00c] RecursionCount : 0 [Type: long]
[+0x010] OwningThread : 0x0 [Type: void *]
[+0x018] LockSemaphore : 0x168 [Type: void *]
[+0x020] SpinCount : 0x4000000 [Type: unsigned __int64]
```
Best guess: the loader lock was owned earlier, causing the OS to switch the thread out; the lock was later released, and for whatever reason the OS hasn't yet switched the thread back in.
Jan's suggestion of #13610 sounds very plausible to me. If we wanted more data on this failure (or failures of this class), we'd probably need to upgrade the telemetry to determine whether the thread is making progress and what behavior is happening machine-wide. A better set of telemetry to capture for a suspected hung test would be:
Closing this as duplicate of #13610
This was most likely introduced by #33307 according to the CI telemetry. See https://github.com/dotnet/runtime/issues/13610#issuecomment-621523649 for more context.
Another data point for this: based on the Kusto data I see, this only happens on Windows 7 and Windows 8.1 -- I tried to repro locally on Windows 10, and after 150 runs it didn't repro.
Thanks @safern! That's very likely to save @josalem from hitting some dead ends.
> this only happens on Windows 7 and Windows 8.1 -- I tried to repro locally on Windows 10, and after 150 runs it didn't repro.
Interesting!
I'll take a look at this today. Thanks for triaging things @safern! I'm not sure what would be causing this issue, but based on the versions being hit, I'm guessing it might be something with Overlapped IO in old versions of Windows. I'll have to do some investigating.
I spun up a Win7 VM and I'm trying to catch this, but I can't seem to reproduce the failure. I am running the System.Collections.Concurrent.Tests suite in a loop and having it ping me if the test takes longer than 60 seconds. So far I've noticed that the tests either take almost exactly 30 seconds or almost exactly 65 seconds. My VM has 2 vCPUs and 4 GB of RAM, so unfortunately it's a little bigger than the ones in CI.
I've narrowed the test cycle down to just the System.Collections.Concurrent.Tests.EtwTests test that I saw a failure on in the other thread. That test usually completes in <1 second, but every once in a while takes >10 seconds (1 in 750+ loops so far). I'll leave it running overnight and see if it gets hit more often. I might try artificially bogging down the machine as well and see if that changes things.
@safern, were you ever able to locally reproduce this?
While I try to get a repro set up, I'll take another look at the dump that Jan shared. Based on what I've already seen and what Noah found, I'm not sure there is enough in the dump to root-cause this. The thing I found interesting is that the dump shows the DiagnosticsServer is inactive, i.e., it's in the middle of a Poll call that is using WaitForMultipleObjects to wait on a named pipe handle. That PR doesn't include any changes to eventing, and since the monitor configuration isn't enabled, it shouldn't be doing anything new. I'm pretty sure this also means we haven't reached the VM's shutdown code, so we wouldn't have called DiagnosticServer::Shutdown yet, which means it should be fine that the server is waiting. I can try bisecting my changes with ifdefs to see if something is exacerbating the problem, but I can't do too much until I can reliably reproduce the error. I'll keep digging.
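For context, here is a minimal, hypothetical Win32 sketch of that kind of wait (an overlapped ConnectNamedPipe whose completion event is passed to WaitForMultipleObjects). The pipe name and structure are made up for illustration; this is not the actual DiagnosticsIpc::Poll implementation.

```cpp
// Minimal sketch (not coreclr code): an overlapped named-pipe server that
// parks in WaitForMultipleObjects until a client connects or a timeout hits.
#include <windows.h>
#include <stdio.h>

int main()
{
    // Hypothetical pipe name; the real diagnostics server uses its own
    // per-process pipe name.
    HANDLE hPipe = CreateNamedPipeW(
        L"\\\\.\\pipe\\example-diagnostics-pipe",
        PIPE_ACCESS_DUPLEX | FILE_FLAG_OVERLAPPED,
        PIPE_TYPE_BYTE | PIPE_READMODE_BYTE | PIPE_WAIT,
        1, 4096, 4096, 0, NULL);
    if (hPipe == INVALID_HANDLE_VALUE)
        return 1;

    // In this sketch, the overlapped event is what ends up in the wait array.
    OVERLAPPED overlapped = {};
    overlapped.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

    // Start an asynchronous "wait for a client to connect".
    if (!ConnectNamedPipe(hPipe, &overlapped) &&
        GetLastError() != ERROR_IO_PENDING &&
        GetLastError() != ERROR_PIPE_CONNECTED)
    {
        return 1;
    }

    // This is the kind of wait the server thread is parked in.
    HANDLE handles[] = { overlapped.hEvent };
    DWORD result = WaitForMultipleObjects(1, handles, FALSE, 5000 /* ms */);
    printf("WaitForMultipleObjects returned %lu\n", result);

    CloseHandle(overlapped.hEvent);
    CloseHandle(hPipe);
    return 0;
}
```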
By "ping" I meant that it literally made a ping sound from the console. I had stepped away from my computer for a while, and it started beeping. It seems like after 1000 loops most of the runs are taking >10 seconds! The xunit output claims the tests ran in 5 seconds (still a big increase over the 0.6 second runtime from the first 900 loops), but the RuntTests.cmd script didn't finish before 10 seconds. The logs _also_ say that the tests passed. I'll dig into this more tomorrow with some better logging and see if it actually means anything (and mute my computer before I call it a day 馃槤 ).
I left my script running overnight. I'm seeing some interesting behavior that I'm going to try to reproduce over the rest of today.
I wrote up a PowerShell script that does the following in an infinite loop:
```powershell
$Now = Get-Date
$job = Start-Job -ScriptBlock {
    # invoke RunTests.cmd scoped to just System.Collections.Concurrent.Tests.EtwTests
}
While ($job.State -eq 'Running') {
    if ((Get-Date).AddSeconds(-10) -gt $Now) {
        # beep and log
    }
}
$Output = $job | Receive-Job
Write-Host $Output
```
The time check is wall-clock time, so it isn't measuring how long xunit claims the tests executed for. For the first ~1000 iterations, the job would complete in <10s and xunit would claim ~0.6s of test time. After that, almost every job took >10s and xunit claimed the tests took 3-6s, which is a 10x increase at the upper end. After ~3000 iterations, it looks like something went haywire: the PowerShell job died but stayed in the 'Running' state, so the time-check loop never exited, even though there was no dotnet.exe process running on the system.
I want to build a more robust script that does some more interesting logging, especially better timing. Right now, my thinking is there might be some kind of OS scheduling issue that we are exacerbating with high volumes of RemoteExecutor tests, which causes the performance of the tests to degrade over time. I'll try running this same experiment with another test suite and see if I observe similar behavior. I think it's possible that the tests are cumulatively hitting a timeout as this degradation happens. When we then take a dump, we keep seeing the EventUnregister stack because that method is specifically waiting on the OS to wake it up, but the scheduling degradation is causing a consistent enough delay that our dump logic is invoked before the OS "gets around to it". I need more evidence to back this up, though, since it does seem to happen more often after #33307 despite that code not touching this logic and being inert during the test I used.
Can we revert the offending PR? This is causing at least one leg in the majority of PRs to fail. It's going to make it much more likely that other regressions slip in, as we're having to rerun lots of legs and merge with failures that might be masking other failures, and it's going to make it more likely devs assume that failures aren't theirs even if they are in fact caused by their PR. Thanks!
I agree. This is being hit in 50+ runs a day, and we're at a 0% pass rate now in rolling builds as well.
Reverting in #35767
I have stepped through the Event Pipe IPC shutdown sequence. I believe that there is a clear HANDLE use-after-close bug:
Handle that is being closed here:
```
00 CoreCLR!IpcStream::DiagnosticsIpc::Close+0xbb [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 213]
01 CoreCLR!IpcStreamFactory::ConnectionState::Close+0x42 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.h @ 38]
02 CoreCLR!IpcStreamFactory::CloseConnections+0x71 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 115]
03 CoreCLR!DiagnosticServer::Shutdown+0x1e8 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 229]
04 CoreCLR!EEShutDownHelper+0x1b6 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1212]
05 CoreCLR!EEShutDown+0x346 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1677]
05 CoreCLR!CorHost2::UnloadAppDomain2+0xfb [D:\runtime\src\coreclr\src\vm\corhost.cpp @ 942]
```
is being waited on - or about to be waited on - here:
```
00 KERNELBASE!WaitForMultipleObjects [minkernel\kernelbase\synch.c @ 1401]
01 CoreCLR!IpcStream::DiagnosticsIpc::Poll+0x31f [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 311]
02 CoreCLR!IpcStreamFactory::GetNextAvailableStream+0x1c2 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 162]
03 CoreCLR!DiagnosticServer::DiagnosticsServerThread+0x379 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 50]
04 KERNEL32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64]
05 ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1153]
```
How does this explain the hang:
Why is it more likely to repro on Win7/8:
The shutdown sequence of the event pipe needs careful review. It would also be a good idea to minimize the work done during shutdown to only what's absolutely necessary; e.g., these event handles do not really need to be closed, since the OS will close them anyway as part of process shutdown.
The dump at \\jkotas9\drops\issue-35451 confirms this: pHandles[0] on thread 3 is 0x168, and CriticalSection.LockSemaphore on thread 2 is 0x168.
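To make the failure mode concrete, here is a small, hypothetical Win32 sketch of the hazard described above (not the coreclr code): one thread closes a handle that another thread is about to wait on, and since handle values can be recycled, the waiter can end up targeting an unrelated object -- in the dump, the loader lock's LockSemaphore.

```cpp
// Hypothetical illustration of the close-while-waiting hazard (not coreclr code).
#include <windows.h>

static HANDLE g_event;   // handle shared between the two threads

static DWORD WINAPI WaiterThread(LPVOID)
{
    HANDLE handles[] = { g_event };
    // If g_event was closed before this call, the wait either fails
    // (WAIT_FAILED / ERROR_INVALID_HANDLE) or, if the handle value has
    // already been recycled, it silently waits on whatever object owns
    // that value now.
    WaitForMultipleObjects(1, handles, FALSE, 5000 /* ms */);
    return 0;
}

int main()
{
    g_event = CreateEventW(NULL, TRUE, FALSE, NULL);
    HANDLE hWaiter = CreateThread(NULL, 0, WaiterThread, NULL, 0, NULL);

    Sleep(10);              // let the waiter get close to its wait
    CloseHandle(g_event);   // BUG: closed while the waiter may still use it

    // Any object created from here on may receive the recycled handle value.
    HANDLE hOther = CreateEventW(NULL, TRUE, FALSE, NULL);

    WaitForSingleObject(hWaiter, INFINITE);
    CloseHandle(hOther);
    CloseHandle(hWaiter);
    return 0;
}
```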
Thanks, Jan!
Oof! That's a tricky one! Thanks for pinpointing the issue, @jkotas! I'm writing a fix for it now. Earlier iterations of the PR specifically didn't close all the handles and didn't attempt to cancel any waits, but I recalled #12991, which was "fixed" by closing handles to break out of a blocking call. I'll come up with a more elegant approach that cancels the blocking Poll call and doesn't attempt subsequent cleanup.
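For what it's worth, a common shape for that kind of cancellation (sketched here with plain Win32 calls and made-up names, not as the actual fix) is to put a dedicated shutdown event into the poller's wait set and signal it at shutdown, rather than closing handles the poller may still be waiting on:

```cpp
// Hypothetical sketch of cancelling a blocking poll via a shutdown event.
// Handle cleanup is deliberately left to process exit.
#include <windows.h>

static HANDLE g_shutdownEvent;   // manual-reset event, signaled at shutdown

static DWORD WINAPI PollerThread(LPVOID connectEventPtr)
{
    HANDLE connectEvent = (HANDLE)connectEventPtr;
    HANDLE handles[] = { g_shutdownEvent, connectEvent };
    for (;;)
    {
        DWORD wait = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        if (wait == WAIT_OBJECT_0)          // shutdown requested
            return 0;                       // exit without touching handles
        if (wait == WAIT_OBJECT_0 + 1)
        {
            // A client connected; service it, then loop back to waiting.
        }
    }
}

int main()
{
    g_shutdownEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
    HANDLE connectEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
    HANDLE hPoller = CreateThread(NULL, 0, PollerThread, connectEvent, 0, NULL);

    // ... normal operation ...

    SetEvent(g_shutdownEvent);              // cancel the blocking wait
    WaitForSingleObject(hPoller, INFINITE);

    // Skip CloseHandle on the shared handles here; the OS reclaims them at
    // process exit, which sidesteps the use-after-close race entirely.
    return 0;
}
```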