In a managed .NET 5 C# project, I get intermittent errors: sometimes an AccessViolationException, other times an ExecutionEngineException, and even a NullReferenceException (?) with no managed code in sight in the debugger.
The code does mathematical optimization and is fully deterministic; however, the exceptions, while frequent, are not consistent.
In the following scenarios, these exceptions have never appeared so far (I tried a few times):
Unfortunately, I don't know how to create a small program to reproduce the problem, but there seems to be a particular point where the code breaks most often:

For context, _complement is declared as:
private readonly ImmutableArray<ulong> _complement;
An extra detail: in the constructor I initialize a regular ulong[] and use Unsafe.As<ulong[], ImmutableArray<ulong>> to set _complement.
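For illustration (a hedged sketch, not the reporter's actual code; only the field name `_complement` and the Unsafe.As conversion come from the thread, the class name, constructor parameter, and fill logic are assumed), the no-copy initialization looks roughly like:

```csharp
using System.Collections.Immutable;
using System.Runtime.CompilerServices;

public sealed class Solver
{
    private readonly ImmutableArray<ulong> _complement;

    public Solver(int length)
    {
        // Fill a plain mutable array first...
        var buffer = new ulong[length];
        for (int i = 0; i < length; i++)
            buffer[i] = ~(ulong)i; // hypothetical initialization

        // ...then reinterpret it as ImmutableArray<ulong> to skip the
        // defensive copy that ImmutableArray.Create would perform.
        // This relies on ImmutableArray<T> being a struct wrapping a
        // single T[] field -- an implementation detail, not a contract.
        _complement = Unsafe.As<ulong[], ImmutableArray<ulong>>(ref buffer);
    }
}
```

On .NET 8 and later, `ImmutableCollectionsMarshal.AsImmutableArray` offers the same no-copy conversion through a supported API.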
Here is a mysterious NullReferenceException (on a value type?!)

JIT and compiler optimizations have a large impact on performance in this project, so I don't think disabling them is a long-term solution.
.NET 5.0.0-preview.6.20305.6
Also tried preview 4 before and was getting similar exceptions.
Can you provide a more complete code example so we can attempt to reproduce the issue on our end?
It is a rather complex algorithm that sometimes runs for several minutes before hitting these exceptions (and sometimes it doesn't). If I knew how to create a reasonably small code example, I would.
Perhaps via a Live Share session, or with some help, I could try to get more info with WinDbg.
Could you try a current master branch build (see https://github.com/dotnet/installer)?
Another thing to try: run with COMPlus_GCStress=F. If there's a problem with GC information, this might make the failure more deterministically reproducible. (If it's not a GC related issue, it probably won't help. It will also make the app run very slow.)
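For reference (a sketch with an assumed launch command; the COMPlus_GCStress variable itself is the standard CoreCLR stress switch), the test run would look something like:

```shell
# Enable GC stress mode F for this shell session; CoreCLR reads
# COMPlus_* environment variables at process startup.
export COMPlus_GCStress=F
# Run the app as usual -- expect it to be dramatically slower.
dotnet run -c Release
```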
Additional info: I've tried running on .NET Core 3.1 with essentially a single change (because MemoryExtensions.Sort is not present in 3.1) and the exceptions did not occur. I've tested for about an hour on several instances, while on the .NET 5 preview I usually hit these exceptions very frequently.
@BruceForstall, yes I can test the master branch build and get back here to let you know.
I can also try COMPlus_GCStress=F.
A crash dump might be useful.
@BruceForstall: I've got an ExecutionEngineException on 5.0.0-preview.8.20358.9.
I've set COMPlus_GCStress=F, and it took forever just to get Visual Studio/PSCore open. If it is that much slower, we might need hours, and for that I'd have to use another PC/VM, so I'll postpone that test.
@AndyAyersMS: from VS Debug > Save Dump As... I exported a crash dump; it's 201 MB, and I'm not sure whether source code is present in the file. If there is a way to send it to you privately, or to generate a smaller dump without (C#) code present, please let me know.
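As an aside (a sketch using the real `dotnet-dump` diagnostic tool; the process id is a placeholder), a smaller managed dump can be collected from the command line instead of from VS:

```shell
# Install the diagnostic CLI once per machine.
dotnet tool install --global dotnet-dump
# Collect a minidump; --type Mini omits most heap contents,
# so the file is far smaller than a full dump.
dotnet-dump collect --process-id <pid> --type Mini
```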
The dump will contain IL but not sources.
You can share it securely by opening an issue on the VS Developer Portal and then attaching it there, or you can create your own share and email me the access info ([email protected]).
Sent to your e-mail!
Thanks. I should be able to look at it later today.
Working with @dellamonica offline -- initial impression is bad GC reporting by the jit. Will add this to 5.0.
Analysis of various dumps provided by @dellamonica shows potentially corrupted GC info, but if so, it's not clear how it got corrupted. Failures were always in the same method at the same offset, so it doesn't look like random corruption. I can repro the exact jit codegen in a mocked-up version of the method, and get normal-looking GC info.
We are ready to do some more diagnosis to try and pin down what is going on, but apparently the failure doesn't repro like it once did. So we're kind of stuck waiting for this to start failing again...
Failures are reproing once more, am going to look at a simplified repro provided by @dellamonica.
Still working on tracking this down.
@dellamonica has shared some non-crash examples showing the GC info is fine right after jitting. So either the GC info is produced incorrectly at times, or gets corrupted after it's produced. Given how surgical and repeatable the corruption is, the former seems far more likely.
In particular, if the IG flags or the liveness state for an IG can be corrupted, that could lead to exactly the sort of malformed GC info we see.
So we're trying to figure out what could be happening in the jit that leads to occasional corruption of IG state. We're going to enable pageheap to see if we can catch an out-of-bounds write, and I'm also going to look into a special jit build that keeps duplicate IG information and sanity-checks that both copies agree.
Have some strong evidence now this is the jit misbehaving. Still not sure why. Here are two gc info traces, one that is correct, the other incorrect.
```
Register slot id for reg rsi = 3.
Register slot id for reg rcx (byref) = 4.
Register slot id for reg rax (byref) = 5.
Register slot id for reg r9 (byref) = 6.

;; good info
...
Set state of slot 3 at instr offset 0x16 to Live.
Set state of slot 4 at instr offset 0x16 to Live.
Set state of slot 5 at instr offset 0x16 to Live.
Set state of slot 6 at instr offset 0x16 to Live.

;; bad info
Set state of slot 3 at instr offset 0x16 to Live.
Set state of slot 3 at instr offset 0x16 to Dead.
Set state of slot 4 at instr offset 0x19 to Live.
Set state of slot 5 at instr offset 0x20 to Live.
Set state of slot 6 at instr offset 0x27 to Live.
Set state of slot 3 at instr offset 0x2a to Live.
```
Here 0x16 is the start of the method body. In the bad case RSI and some byrefs are left unprotected on entry. Failure is that a GC happens here and RSI's referent gets relocated.
The checked jit always produces good info, as does the release jit with various forms of DirectAlloc/pageheap.
Think I've finally figured this one out.
Recall the release jit only sometimes generates bad GC info, and the checked jit never does.
The release jit can sometimes end up in a situation where fgFirstBB does not have the BBF_HAS_LABEL or BBF_JMP_TARGET flags set. If this happens and there are GC references live into the method body, they won't be reported properly as the jit won't create a label for the first BB.
A checked jit will almost always set the BBF_HAS_LABEL flag on fgFirstBB because of this bit of code:
It turns out fgHasSwitch is never explicitly initialized, so it tends to read as true in checked builds, given the default fill pattern (the only fill pattern that would have exposed this error in a checked jit is 00, which currently can't be used).
In release builds, fgHasSwitch will have a somewhat arbitrary value in methods that don't have switches. Apparently a zero value is rare, so most of the time the release jit will also set the BBF_HAS_LABEL flag on fgFirstBB. But occasionally the value is 0, and the release jit fails to create a label for the first BB, leaving it positioned to generate bad GC info.
The above explains why checked jits always produce good GC info and release jits almost always do.
This is a regression; the precipitating change is quite likely #1309. Before that change it wasn't possible (or at least wasn't common) for the jit to create a scratch BB after building pred lists. The order of these is significant; fgComputePreds sets both BBF_HAS_LABEL and BBF_JMP_TARGET flags on the first block. Post #1309 the jit might create a scratch BB after fgComputePreds has run, and in that case the new first BB does not have the right flags set.
The simple minimal fix is to initialize fgHasSwitch to false in fgInit, and then update fgEnsureFirstBBisScratch to set the necessary flags on the new first BB.
I have sent an updated jit to @dellamonica to test; hopefully this is indeed the problem.
@AndyAyersMS, I've run the crashing repro 8 times with the new JIT and it did not crash once. Before, we had a failure rate of about 60%, so I'm pretty confident that the problem is solved.
Thank you for your efforts!
@dellamonica thanks for reporting this, and for your help and patience in tracking this down.
Closed via #40038.
+1 thanks @dellamonica for helping track this down and fix it before final release. Much appreciated.