Runtime: KB4487017 wreaking havoc on CoreCLR

Created on 14 Feb 2019 · 20 Comments · Source: dotnet/runtime

Long story short: two days ago we got some isolated reports of our product failing to start with an error like this:

image

Essentially, the server started and then died. There was nothing in the logs or anywhere else; in the Event Viewer we found this:

Application: Raven.Server.exe
CoreCLR Version: 4.6.27129.4
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FF8E8693B8D (00007FF8E84F0000) with exit code c0000005.
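
A note on reading that entry: exit code c0000005 is the NTSTATUS value STATUS_ACCESS_VIOLATION. Assuming the address in parentheses is the base address of the faulting module (an assumption about the log format, not something stated in the report), the offset of the faulting instruction inside that module is a plain subtraction. A minimal Python sketch, with purely illustrative function names:

```python
# Values copied from the Event Viewer entry above.
FAULT_IP = 0x00007FF8E8693B8D
MODULE_BASE = 0x00007FF8E84F0000  # ASSUMPTION: base address of the faulting module
EXIT_CODE = 0xC0000005

# A few well-known NTSTATUS codes; this table is deliberately tiny.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(code):
    """Map an exit code to a symbolic NTSTATUS name when it is in the table."""
    return KNOWN_NTSTATUS.get(code, "unknown NTSTATUS 0x%08X" % code)


def module_offset(ip, base):
    """Offset of the instruction pointer relative to the module base."""
    return ip - base


print(describe_exit_code(EXIT_CODE))
print(hex(module_offset(FAULT_IP, MODULE_BASE)))
```

An offset like this is what one would typically match against symbols for the module when no dump is available.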

At first we thought it was our fault, but then overnight one of our own environments started failing. The cause was KB4487017, which was killing us with an Access Violation. Some of our devs tried uninstalling it, and then we could run normally... But lightning struck twice: after reinstalling it to double check, we got this:

image

We then labelled this issue critical on our side, to the point that we are issuing a notice to all our clients to delay security patching until we can figure out the issue.

Our whole team has been investigating the issue for the last couple of hours. Here is what we have found so far:

  • The error disappears after uninstalling KB4487017, so the two are linked.
  • Different CoreCLR versions are affected.
  • We caught the error under the debugger in random locations, in both managed and unmanaged calls.
  • Apparently we are not the only ones suffering from it: https://www.reddit.com/r/Warframe/comments/aqj7n9/crash_to_desktop_on_login_pc/
  • Only version 1803 is affected. The fall 2018 update (1809) doesn't include this security patch (it got a different one) and works fine.

Will update this post with any new information we are able to uncover.

tracking-external-issue

All 20 comments

Not directly related but could be a lead...

One of our customers discovered a similar issue yesterday: an update released the same day (not sure which one) removed System.Threading.Tasks.Extensions v4.1.0.0 from the machine, and it was an indirect dependency of something he was using.

It just crashed the process and didn't cause a blue screen.

He had to create a binding redirect to a newer version to get round it.

Not the same issue but possibly related and so may help point you in the right direction.

@redknightlois sorry for the trouble.
Is it a .NET Core or a .NET Framework problem?
Will you be able to share some dumps (privately) for investigation if we need them?

@karelz .NET Core, and yes. I am on it :)

Which version of .NET Core? 2.1 or 2.2?
To submit (private) dumps, etc., please use https://developercommunity.visualstudio.com (when needed); it lets you upload data for MS eyes only. Just give us the report link once you have created the report and uploaded the dumps.

We made it fail on both. 2.2.2 and 2.2.1 for sure. I am checking if we tried on 2.1 (because I don't remember what RDB 4.0 is running on). It is fair to say though, that it fails on all our CoreCLR versions in use.

EDIT: Confirmed with the team that it fails on 2.1.6, 2.1.7 and 2.1.8.

I activated gflags.exe with silent monitoring. These are the dumps, taken from an empty instance (no user data).
Mini Dump: RDB.Empty.zip
Heap Dump: Raven.Server.exe-(PID-35924)-150431375.zip
Heap Dump with tiered compilation disabled: Raven.Server.exe-(PID-20984)-755937.zip

In the full dump, thread 32 is hitting some kind of fatal error during jitting.

00 00007ffc`9d07b153 : coreclr!EEPolicy::HandleFatalError+0x7a [e:\a\_work\62\s\src\vm\eepolicy.cpp @ 1522] 
01 00007ffd`3fc5f7dd : coreclr!ProcessCLRException+0x1081c3 [e:\a\_work\62\s\src\vm\exceptionhandling.cpp @ 1029] 
02 00007ffd`3fbcd856 : ntdll!RtlpExecuteHandlerForException+0xd [minkernel\ntos\rtl\amd64\xcptmisc.asm @ 131] 
03 00007ffd`3fbcbe9a : ntdll!RtlDispatchException+0x3c6 [minkernel\ntos\rtl\amd64\exdsptch.c @ 569] 
04 00007ffd`3c21a388 : ntdll!RtlRaiseException+0x31a [minkernel\ntos\rtl\amd64\raise.c @ 178] 
05 00007ffc`9cfe44e1 : KERNELBASE!RaiseException+0x68 [minkernel\kernelbase\xcpt.c @ 922] 
06 00007ffd`3fc5ed63 : coreclr!__CxxCallCatchBlock+0x151 [f:\dd\vctools\crt\vcruntime\src\eh\frame.cpp @ 1186] 
07 00007ffc`9cf154c6 : ntdll!RcFrameConsolidation+0x3 [minkernel\ntos\rtl\amd64\capture.asm @ 653] 
08 00007ffc`9d04cc7f : coreclr!MethodDesc::JitCompileCodeLocked+0x212 [e:\a\_work\62\s\src\vm\prestub.cpp @ 841] 

OK, an update on where we are now: we built a version of the executable with PrefetchVirtualMemory disabled and it doesn't crash. At least we are onto something.
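
For readers unfamiliar with the call: PrefetchVirtualMemory is a Win32 API exported by kernel32 that asks the memory manager to bring address ranges into memory ahead of use. The actual build change is not shown in this thread; the following is only a rough Python/ctypes sketch of what gating such a call behind a switch could look like. The environment variable name APP_DISABLE_PREFETCH and both function names are hypothetical.

```python
import ctypes
import os
import sys


class Win32MemoryRangeEntry(ctypes.Structure):
    # Mirrors WIN32_MEMORY_RANGE_ENTRY from the Windows SDK.
    _fields_ = [
        ("VirtualAddress", ctypes.c_void_p),
        ("NumberOfBytes", ctypes.c_size_t),
    ]


def prefetch_allowed(environ=os.environ):
    """True only on Windows and only when the (hypothetical) switch is not set."""
    return sys.platform == "win32" and environ.get("APP_DISABLE_PREFETCH") != "1"


def try_prefetch(address, size, environ=os.environ):
    """Ask Windows to prefetch one range; return False when skipped or failed."""
    if not prefetch_allowed(environ):
        return False  # prefetch disabled: the API is never called
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    prefetch = kernel32.PrefetchVirtualMemory
    prefetch.argtypes = [
        ctypes.c_void_p,                        # hProcess
        ctypes.c_size_t,                        # NumberOfEntries (ULONG_PTR)
        ctypes.POINTER(Win32MemoryRangeEntry),  # VirtualAddresses
        ctypes.c_ulong,                         # Flags, must be 0
    ]
    prefetch.restype = ctypes.c_int             # BOOL
    entry = Win32MemoryRangeEntry(address, size)
    return bool(prefetch(kernel32.GetCurrentProcess(), 1, ctypes.byref(entry), 0))
```

With the switch set, try_prefetch returns False without touching the API. Prefetching is only a hint to the operating system, so skipping it should cost performance rather than correctness.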

Repro steps:

  • Windows 10 Version 1803
  • Install KB 4487017
  • Download latest stable 1.4.1 from https://ravendb.net/download
  • Execute run.ps1 or Raven.Server.exe
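
The first two preconditions above can be written down as a small check. This is a sketch under stated assumptions: Windows 10 version 1803 reports OS build 17134, and the list of installed updates is collected separately (for example with the PowerShell Get-HotFix cmdlet). The function name matches_repro is hypothetical.

```python
AFFECTED_BUILD = 17134    # OS build number reported by Windows 10 version 1803
TRIGGER_KB = "KB4487017"  # the update named in the repro steps


def matches_repro(build_number, installed_kbs):
    """Return True when a machine matches the repro preconditions.

    build_number  -- OS build as an int, e.g. 17134
    installed_kbs -- iterable of update IDs such as "KB4487017"
    """
    normalized = {kb.strip().upper() for kb in installed_kbs}
    return build_number == AFFECTED_BUILD and TRIGGER_KB in normalized


# Example with made-up inventory data:
print(matches_repro(17134, ["kb4487017 "]))  # 1803 with the KB installed
print(matches_repro(17763, ["KB4487017"]))   # 1809 build number
```

Normalizing case and whitespace keeps the check tolerant of how the update list was exported.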

@redknightlois Thank you! We really appreciate the detailed bug report. As a result of your last post, my teammate Chris Ahna has successfully reproduced this issue locally. We are working as quickly as possible to figure out the root cause of this issue. We currently suspect something has gone wrong in the Windows memory manager (but that's just a best-guess right now).

I will post updates to this thread as I have them. It sounds like you are unblocked but please let me know if there's anything we can do to help lower the impact for you and your customers as we chase this issue down.

@leculver we will issue a hotfix for those who have the issue, at the expense of performance. I would say that not pushing KB4487017 through Windows Update until it is fixed would be a good idea :) ... We hit the problem on one of our machines (which was good, as we couldn't reproduce it before) because Azure force-updated the VM.

@leculver after careful consideration, we are not going to issue the workaround. If the error (as far as we know right now) is deep in the memory manager, there is no guarantee that the workaround actually works rather than just making the crash less likely, or that it does not break other memory guarantees required to ensure data consistency and safety. For now, our recommendation to pull the plug on the security patch until the real impact assessment is clearer is the safe course of action.

@redknightlois Just to clarify. Does:

our recommendation to pull the plug on the security patch

mean: uninstall KB4487017?

You can either uninstall KB4487017 or upgrade Windows to version 1809 (October 2018 Update)

@PureKrome yes, though what I meant there was that the KB should be withdrawn from compulsory Windows Update installation altogether.

Any news on this?

Current status: the KB was re-installed silently and I ended up with
image

Bump?

Another update, another 3AM call with a production server going down for 2 hours because Azure decided it was a good idea to push a KB on a server on Sunday. Not fun. Any idea what the OS and Azure guys are doing with this? Client opened a ticket on Azure and no response in 3 weeks about the issue.

This seems to be addressed in the 3B OS patch, KB 4489868: https://support.microsoft.com/en-us/help/4489868/windows-10-update-kb4489868

@redknightlois confirmed the problem does not reproduce on Windows 10 version 1809 (it was reproducing on 1803).

Closing as addressed.
