Runtime: Crossgen'ed powershell assemblies cause an intermittent hang when running our tests on Linux and OSX

Created on 23 Feb 2017 · 10 comments · Source: dotnet/runtime

We found that the crossgen'ed PowerShell assemblies (targeting .NET Core) cause an intermittent hang when running our PowerShell class basic parsing tests on Linux/OSX. We have been seeing this issue in our Travis CI builds for some time. I tried to reproduce it locally by running the tests in a loop and found it reproducible only with the crossgen'ed assemblies (see the example screenshot below).

Repro step

  1. Install powershell_6.0.0-alpha.16-1ubuntu1.14.04.1_amd64.deb on an Ubuntu 14.04 x64 machine. The PowerShell assemblies in that package were crossgen'ed, using the "crossgen" executable from ".nuget/packages/runtime.ubuntu.14.04-x64.Microsoft.NETCore.Runtime.CoreCLR/1.1.0/tools/crossgen".
  2. Download the attached "tests.tar.gz", decompress it to get the "tests" folder, and run powershell -command 'foreach ($i in 1..20) { Invoke-Pester tests/Scripting.Classes.BasicParsing.Tests.ps1 }'. This command runs that test file in a loop 20 times, and a hang occurs most of the time.

Things worth mentioning

  1. I couldn't reproduce the hang when using IL assemblies on Linux and OSX.
  2. The crossgen’ed assemblies for Windows work fine. I never saw this hang happen in our AppVeyor CI builds. The hang only happens on Linux and OSX.
  3. Those PowerShell class parsing tests generate a lot of dynamic assemblies. A PowerShell class is essentially a CLR type: PowerShell emits types and creates dynamic assemblies when parsing a class in a script.

tests.tar.gz

[screenshot: hang]

area-CrossGen-coreclr area-ReadyToRun-coreclr blocking

All 10 comments

@rahku any possibility we can get this fixed in 2.0.0 servicing?

@rahku does not work on CoreCLR anymore.

cc @sergiy-k @russellhadley

I'm surprised this still hasn't been addressed. It would be unpleasant if some Azure service using PowerShell Core were to freeze.
What makes this even worse for users is that we can't even get a dump.

I took a look at this today and was able to reproduce the issue. I think I've figured out why this hang is occurring.

When the LoaderAllocator gets destroyed, it calls LoaderAllocator::GCLoaderAllocators which tries to delete unreferenced domain assemblies. The destructor for Assembly calls Assembly::Terminate which suspends the EE and then calls ExecutionManager::Unload. At this point, the ExecutionManager tries to delete code heaps, but in order to do so, it must acquire a writer lock (which in turn requires that there are no more readers active). This thread is stuck waiting here because...

...on another thread, System.Management.Automation.LocationGlobber.ExpandMshGlobPath threw an ItemNotFoundException, and we're in the process of dispatching that exception. One of the first things we need to do is unwind to the first managed call frame. That means checking whether the code is managed, and to do so we first acquire the ExecutionManager's reader lock (because the scan flags for that thread tell us we need to). Here is where ReadyToRun comes in: while holding the lock, we call JitCodeToMethodInfo, and the ReadyToRun version (ReadyToRunJitManager::JitCodeToMethodInfo) calls ReadyToRunInfo::GetMethodDescForEntryPoint, which does a hashmap lookup to find the MethodDesc corresponding to the entry point. However, HashMap::LookupValue calls RareDisablePreemptiveGC before doing anything, and because the EE is suspended, the thread gets stuck while still holding the reader lock it acquired earlier.
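The two threads above form a classic lock cycle: one thread holds the ExecutionManager reader lock and blocks waiting for the suspended EE to resume, while the suspending thread blocks waiting for all readers to drain before it can finish the unload and resume the EE. Below is a minimal Python simulation of that cycle, not CoreCLR code; the event names, the timeouts, and the modeling of "reader lock held" / "EE suspended" as events are illustrative assumptions:

```python
import threading

results = {}
reader_released = threading.Event()  # set when the "reader lock" is dropped
ee_resumed = threading.Event()       # set when the "EE" is resumed

def dispatch_exception():
    # Thread A: holds the reader lock; RareDisablePreemptiveGC
    # blocks until the EE resumes (here: a wait with a demo timeout).
    results["a_timed_out"] = not ee_resumed.wait(timeout=0.5)
    # Only after resuming would the reader lock be released:
    if not results["a_timed_out"]:
        reader_released.set()

def unload_assembly():
    # Thread B: has suspended the EE; the writer lock needs the
    # reader gone before the unload can finish and the EE resume.
    results["b_timed_out"] = not reader_released.wait(timeout=0.5)
    if not results["b_timed_out"]:
        ee_resumed.set()

threads = [threading.Thread(target=dispatch_exception),
           threading.Thread(target=unload_assembly)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both waits time out: each thread was waiting on the other.
print(results)
```

In the real runtime neither wait has a timeout, so instead of two recorded timeouts you get a permanent hang, which matches the intermittent freeze seen in the tests.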

@adityamandaleeka it would be great if we could get a fix into 2.0.x servicing

@SteveL-MSFT @daxian-dbw I agree that this issue should be fixed. We'll try to find a good solution.

Out of curiosity, though, have you tried crossgen-ing the assemblies with the FragileNonVersionable switch? If you pass that switch to crossgen, it will generate non-ReadyToRun images, which should work around this issue (at least, that's what I'd assume based on my analysis above). The images generated with the FragileNonVersionable switch will be brittle (not resilient to changes in the runtime/framework or other dependencies), but as far as I can tell you ship all the dependencies in the PowerShell packages anyway, so that might be okay for you.

have you tried crossgen-ing the assemblies with the FragileNonVersionable switch

These are quite a bit bigger, and we do not have any extensive testing for this config - it is pretty likely you will hit different bugs.

@SteveL-MSFT @daxian-dbw Update on this: I have a PR out with a fix in master. Once that's in, I'll go through the process to get it ported to the release/2.0.0 branch.

@adityamandaleeka Thanks for the fix! When will we have a servicing package that includes the fix? Could you please point me to any docs about how .NET Core servicing works?

@daxian-dbw This will go into the next release after 2.0.3. There will be a pre-release build soon if you're interested in trying that out.
