We found that the crossgen’ed PowerShell assemblies (targeting .NET Core) cause an intermittent hang when running our PowerShell class basic parsing tests on Linux/OSX. We have been seeing this issue in our Travis CI builds for some time. I tried to reproduce it locally by running the tests in a loop and found that it reproduces only with the crossgen’ed assemblies (see an example screenshot below).
Repro command: powershell -command 'foreach ($i in 1..20) { Invoke-Pester tests/Scripting.Classes.BasicParsing.Tests.ps1 }'. This runs the test file 20 times in a loop, and a hang occurs on most attempts.
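For anyone who wants to run the repro unattended, here is a minimal sketch that wraps the command above in a timeout so that a hang shows up as a result instead of a stuck terminal. The helper name `hung` and the 600-second timeout are assumptions for illustration, not part of the original report; adjust the timeout to however long 20 clean runs take on your machine.

```python
import subprocess

# The PowerShell command is taken verbatim from the report above.
REPRO_COMMAND = [
    "powershell",
    "-command",
    "foreach ($i in 1..20) { Invoke-Pester tests/Scripting.Classes.BasicParsing.Tests.ps1 }",
]


def hung(command, timeout_s=600):
    """Return True if `command` is still running after `timeout_s` seconds.

    The timeout value is an assumption; it is not from the original report.
    """
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired.
        subprocess.run(
            command,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

Calling `hung(REPRO_COMMAND)` from the repository root would then return True when the loop stalls. A timeout only tells you the process did not finish in time, so a slow machine can produce a false positive.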
@rahku any possibility we can get this fixed in 2.0.0 servicing?
@rahku does not work on CoreCLR anymore.
cc @sergiy-k @russellhadley
I'm surprised it's still not addressed. It would be unpleasant if some Azure service using PowerShell Core were to freeze.
What makes this even worse for users is that we can't even get a dump.
I took a look at this today and was able to reproduce the issue. I think I've figured out why this hang is occurring.
When the LoaderAllocator gets destroyed, it calls LoaderAllocator::GCLoaderAllocators which tries to delete unreferenced domain assemblies. The destructor for Assembly calls Assembly::Terminate which suspends the EE and then calls ExecutionManager::Unload. At this point, the ExecutionManager tries to delete code heaps, but in order to do so, it must acquire a writer lock (which in turn requires that there are no more readers active). This thread is stuck waiting here because...
...on another thread, System.Management.Automation.LocationGlobber.ExpandMshGlobPath threw an ItemNotFoundException, and we're in the process of dispatching that exception. One of the first things we need to do is unwind to the first managed call frame. This means checking if the code is managed and to do so, we first acquire the ExecutionManager's reader lock (because the scan flags for that thread tell us we need to). Now comes the part where ReadyToRun comes in: while we're holding the lock, we call JitCodeToMethodInfo, and the ReadyToRun version of that (ReadyToRunJitManager::JitCodeToMethodInfo) calls ReadyToRunInfo::GetMethodDescForEntryPoint which tries to do a hashmap lookup to find the MethodDesc corresponding to the entry point. However, HashMap::LookupValue tries to RareDisablePreemptiveGC before doing anything, and so the thread gets stuck while still holding the reader lock we got before.
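The two-thread description above boils down to a cycle in a wait-for graph. The sketch below is a toy model of that cycle, not CoreCLR code: the thread and resource labels are paraphrased from the analysis, and the helper names `build_wait_graph` and `find_cycle` are made up for illustration.

```python
def build_wait_graph(holds, wants):
    """Map each waiting thread to the thread holding what it wants."""
    return {thread: holds[res] for thread, res in wants.items() if res in holds}


def find_cycle(waits_for):
    """Return the threads on a wait-for cycle, or None if there is none.

    Assumes each thread waits on at most one other thread.
    """
    for start in waits_for:
        path, seen = [start], {start}
        node = waits_for.get(start)
        while node is not None:
            if node == start:
                return path
            if node in seen:
                break  # reached a cycle that does not include `start`
            seen.add(node)
            path.append(node)
            node = waits_for.get(node)
    return None


# Who currently holds what, per the analysis above (labels are illustrative).
HOLDS = {
    "EE suspension": "unload thread",                    # Assembly::Terminate suspended the EE
    "ExecutionManager reader lock": "exception thread",  # taken during unwinding
}

# What each thread is blocked on.
WANTS = {
    # The writer lock needs every reader to release first.
    "unload thread": "ExecutionManager reader lock",
    # HashMap::LookupValue cannot disable preemptive GC until the suspension ends.
    "exception thread": "EE suspension",
}

cycle = find_cycle(build_wait_graph(HOLDS, WANTS))
print("deadlocked threads:", cycle)
```

If the exception thread did not need to change GC mode while holding the reader lock (a hypothetical variation, not necessarily what the eventual fix does), its entry would drop out of `WANTS` and `find_cycle` would return None.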
@adityamandaleeka it would be great if we can get a fix in 2.0.x servicing
@SteveL-MSFT @daxian-dbw I agree that this issue should be fixed. We'll try to find a good solution.
Out of curiosity, though, have you tried crossgen-ing the assemblies with the FragileNonVersionable switch? If you pass that switch to crossgen, it will generate non-ReadyToRun images, which should work around this issue (at least, that's what I'd assume based on my analysis above). The images generated with the FragileNonVersionable switch will be brittle (not resilient to changes in the runtime/framework or other dependencies), but as far as I can tell you ship all the dependencies in the PowerShell packages anyway, so that might be okay for you.
> have you tried crossgen-ing the assemblies with the FragileNonVersionable switch
These are quite a bit bigger, and we do not have any extensive testing for this config - it is pretty likely you will hit different bugs.
@SteveL-MSFT @daxian-dbw Update on this: I have a PR out with a fix in master. Once that's in, I'll go through the process to get it ported to the release/2.0.0 branch.
@adityamandaleeka Thanks for the fix! When will we have a servicing package that includes the fix? Could you please point me to any docs about how .NET Core servicing works?
@daxian-dbw This will go into the next release after 2.0.3. There will be a pre-release build soon if you're interested in trying that out.