In a managed .NET 5 C# project, I get intermittent errors: sometimes an AccessViolationException, other times an ExecutionEngineException, and even a NullReferenceException (?) with no managed code in sight in the debugger.
The code does mathematical optimization and is fully deterministic; however, the exceptions, while frequent, are not consistent.
In the following scenarios, these exceptions have never appeared so far (I tried a few times):
Unfortunately, I don't know how to create a small program to reproduce the problem, but there seems to be a particular point where the code breaks most often:

For context, _complement is declared as:
private readonly ImmutableArray<ulong> _complement;
An extra detail: in the constructor I initialize a regular ulong[] and use Unsafe.As<ulong[], ImmutableArray<ulong>> to set _complement.
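For illustration (a hedged sketch, not the reporter's actual code; only the field name `_complement` and the Unsafe.As conversion come from the thread, the class name, constructor parameter, and fill logic are assumed), the no-copy initialization looks roughly like:

```csharp
using System.Collections.Immutable;
using System.Runtime.CompilerServices;

public sealed class Solver
{
    private readonly ImmutableArray<ulong> _complement;

    public Solver(int length)
    {
        // Fill a plain mutable array first...
        var buffer = new ulong[length];
        for (int i = 0; i < length; i++)
            buffer[i] = ~(ulong)i; // hypothetical initialization

        // ...then reinterpret it as ImmutableArray<ulong> to skip the
        // defensive copy that ImmutableArray.Create would perform.
        // This relies on ImmutableArray<T> being a struct wrapping a
        // single T[] field -- an implementation detail, not a contract.
        _complement = Unsafe.As<ulong[], ImmutableArray<ulong>>(ref buffer);
    }
}
```

On .NET 8 and later, `ImmutableCollectionsMarshal.AsImmutableArray` offers the same no-copy conversion through a supported API.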
Here is a mysterious NullReferenceException (on a value type?!)

JIT and compiler optimizations have a large impact on performance in this project, so I don't think disabling them is a long-term solution.
.NET 5.0.0-preview.6.20305.6
Also tried preview 4 before and was getting similar exceptions.
Can you provide a more complete code example so we can attempt to reproduce the issue on our end?
It is a rather complex algorithm that sometimes runs for several minutes before hitting these exceptions (and sometimes it doesn't). If I knew how to create a reasonably small code example, I would.
Perhaps via a Live Share session, or with some help, I could try to get more info with WinDbg.
Could you try a current master branch build (see https://github.com/dotnet/installer)?
Another thing to try: run with COMPlus_GCStress=F. If there's a problem with GC information, this might make the failure more deterministically reproducible. (If it's not a GC related issue, it probably won't help. It will also make the app run very slow.)
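For reference (a sketch with an assumed launch command; the COMPlus_GCStress variable itself is the standard CoreCLR stress switch), the test run would look something like:

```shell
# Enable GC stress mode F for this shell session; CoreCLR reads
# COMPlus_* environment variables at process startup.
export COMPlus_GCStress=F
# Run the app as usual -- expect it to be dramatically slower.
dotnet run -c Release
```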
Additional info: I've tried running on .NET Core 3.1 with essentially a single change (because MemoryExtensions.Sort is not present in 3.1) and the exceptions did not occur. I've tested for about an hour on several instances, while on the .NET 5 preview I usually hit these exceptions very frequently.
@BruceForstall, yes I can test the master branch build and get back here to let you know.
I can also try COMPlus_GCStress=F.
A crash dump might be useful.
@BruceForstall: I've got an ExecutionEngineException on 5.0.0-preview.8.20358.9.
I've set COMPlus_GCStress=F, and it took forever just to get Visual Studio/PSCore open. If it is that much slower, we might need hours, and for that I'd have to use another PC/VM, so I'll postpone that test.
@AndyAyersMS: from VS Debug > Save Dump As... I exported a crash dump; it's 201 MB, and I'm not sure whether source code is present in the file. If there is a way to send it to you privately, or to generate a smaller dump without (C#) code present, please let me know.
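As an aside (a sketch using the real `dotnet-dump` diagnostic tool; the process id is a placeholder), a smaller managed dump can be collected from the command line instead of from VS:

```shell
# Install the diagnostic CLI once per machine.
dotnet tool install --global dotnet-dump
# Collect a minidump; --type Mini omits most heap contents,
# so the file is far smaller than a full dump.
dotnet-dump collect --process-id <pid> --type Mini
```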
The dump will contain IL but not sources.
You can share it securely by opening an issue on the VS Developer Portal and then attaching it there, or you can create your own share and email me the access info ([email protected]).
Sent to your e-mail!
Thanks. I should be able to look at it later today.
Working with @dellamonica offline -- initial impression is bad GC reporting by the jit. Will add this to 5.0.
Analysis of various dumps provided by @dellamonica shows potentially corrupted GC info, but if so, it's not clear how it got corrupted. Failures were always in the same method at the same offset, so it doesn't look like random corruption. I can repro the exact jit codegen in a mocked-up version of the method, and get normal-looking GC info.
We are ready to do some more diagnosis to try and pin down what is going on, but apparently the failure doesn't repro like it once did. So we're kind of stuck waiting for this to start failing again...
Failures are reproing once more, am going to look at a simplified repro provided by @dellamonica.
Still working on tracking this down.
@dellamonica has shared some non-crash examples showing the GC info is fine right after jitting. So either the GC info is produced incorrectly at times, or gets corrupted after it's produced. Given how surgical and repeatable the corruption is, the former seems far more likely.
In particular, if the IG flags or the liveness state for an IG can be corrupted, that could lead to exactly the sort of malformed GC info we see.
So we're trying to figure out what could be happening in the jit that leads to occasional corruption of IG state. We're going to enable pageheap to see if we can catch an out-of-bounds write, and I'm also going to look into a special jit build that keeps duplicate IG information and sanity-checks that both copies agree.
Have some strong evidence now this is the jit misbehaving. Still not sure why. Here are two gc info traces, one that is correct, the other incorrect.
```
Register slot id for reg rsi = 3.
Register slot id for reg rcx (byref) = 4.
Register slot id for reg rax (byref) = 5.
Register slot id for reg r9 (byref) = 6.

;; good info
...
Set state of slot 3 at instr offset 0x16 to Live.
Set state of slot 4 at instr offset 0x16 to Live.
Set state of slot 5 at instr offset 0x16 to Live.
Set state of slot 6 at instr offset 0x16 to Live.

;; bad info
Set state of slot 3 at instr offset 0x16 to Live.
Set state of slot 3 at instr offset 0x16 to Dead.
Set state of slot 4 at instr offset 0x19 to Live.
Set state of slot 5 at instr offset 0x20 to Live.
Set state of slot 6 at instr offset 0x27 to Live.
Set state of slot 3 at instr offset 0x2a to Live.
```
Here 0x16 is the start of the method body. In the bad case RSI and some byrefs are left unprotected on entry. Failure is that a GC happens here and RSI's referent gets relocated.
The checked jit always produces good info, as does the release jit with various forms of DirectAlloc/pageheap.
Think I've finally figured this one out.
Recall the release jit only sometimes generates bad GC info, and the checked jit never does.
The release jit can sometimes end up in a situation where fgFirstBB does not have the BBF_HAS_LABEL or BBF_JMP_TARGET flags set. If this happens and there are GC references live into the method body, they won't be reported properly as the jit won't create a label for the first BB.
A checked jit will almost always set the BBF_HAS_LABEL flag on fgFirstBB because of this bit of code:
It turns out fgHasSwitch is never explicitly initialized, so it tends to read as true in checked builds, given the default fill pattern (the only fill pattern that would have exposed this error in a checked jit is 00, which currently can't be used).
In release builds, fgHasSwitch will have a somewhat arbitrary value in methods that don't have switches. Apparently a zero value is rare, so most of the time the release jit will also set the BBF_HAS_LABEL flag on fgFirstBB. But occasionally the value is 0, and the release jit fails to create a label for the first BB, leaving it positioned to generate bad GC info.
The above explains why checked jits always produce good GC info and release jits almost always do.
This is a regression; the precipitating change is quite likely #1309. Before that change it wasn't possible (or at least wasn't common) for the jit to create a scratch BB after building pred lists. The order of these is significant; fgComputePreds sets both BBF_HAS_LABEL and BBF_JMP_TARGET flags on the first block. Post #1309 the jit might create a scratch BB after fgComputePreds has run, and in that case the new first BB does not have the right flags set.
The simple minimal fix is to initialize fgHasSwitch to false in fgInit, and then update fgEnsureFirstBBisScratch to set the necessary flags on the new first BB.
I have sent an updated jit to @dellamonica to test; hopefully this is indeed the problem.
@AndyAyersMS, I've run the crashing repro 8 times with the new JIT and it did not crash once. Before, we had a failure rate of about 60%, so I'm pretty confident that the problem is solved.
Thank you for your efforts!
@dellamonica thanks for reporting this, and for your help and patience in tracking this down.
Closed via #40038.
+1 thanks @dellamonica for helping track this down and fix it before final release. Much appreciated.