Some Orleans users have hit an issue when running on .NET Core 2.1 with TieredCompilation enabled. After investigation, I believe the issue is a JIT bug related to TieredCompilation.
Repro gist: https://gist.github.com/ReubenBond/b5bdf98423d5279a8e9ae63a5e6e8f0a
In the repro, the line return buff[offset];
will return buff[0]
when TieredCompilation is enabled and buff[1]
(correct) when TieredCompilation is disabled.
```C#
byte[] CheckLength(out int offset)
{
// In the original code, this logic is more complicated.
// It's simplified here to demonstrate the bug.
offset = 1;
return currentBuffer;
}
public byte ReadByte()
{
int offset;
var buff = CheckLength(out offset);
return buff[offset];
}
```
We have advised the affected users to disable TieredCompilation for now, as a workaround.
With issues like this it seems dangerous to me that TieredCompilation
is enabled by default. If it wasn't and had to be explicitly enabled, our users (and us) would have an easier time finding the root cause. What baffled us here was that a trivial "default" project created with VS wouldn't work while another one with the same application code worked just fine.
@kouvel @noahfalk @dotnet/jit-contrib
@sergeybykov Tiered compilation is not enabled by default for 2.1, only for 2.2 and the master branch, specifically to help flush out issues like this that we haven't encountered in internal testing.
Yeah I think this is the perfect way to catch these issues before this goes into production
This is actually a minopts bug...
We produce
; Assembly listing for method TierJit.BinaryTokenStreamReader:ReadByte():ubyte:this
; Emitting BLENDED_CODE for X64 CPU with AVX
; compiler->opts.MinOpts() is true
...
G_M55073_IG02:
8B75F0 mov esi, dword ptr [rbp-10H] // offset
488D55F0 lea rdx, bword ptr [rbp-10H] // &offset
488B4D10 mov rcx, gword ptr [rbp+10H]
E80BFAFFFF call TierJit.BinaryTokenStreamReader:CheckLength(byref):ref:this
483B7008 cmp rsi, qword ptr [rax+8] // uses old value of offset
730F jae SHORT G_M55073_IG04
488D443010 lea rax, bword ptr [rax+rsi+16] // uses old value of offset
Seems like a missing spill check in the importer... if we go to load a local and it is address exposed we need to spill any calls. We do this for stores but not loads.
Prospective fix is to call impSpillSideEffects
in impLoadVar
for any address-exposed local.
On second thought this is possibly an issue with GT_INDEX_ADDR
(which only exists in minopts) is not evaluating the array operand before the index operand. Would explain why this fairly simple construct regressed only recently.
Odd though that the source code insists index then addr is the right order.
Odd though that the source code insists index then addr is the right order.
Insists in what way? It wouldn't be the first time when ordering is found to be buggy. Unfortunately ordering is "broken" by design in RyuJIT.
I see, GenTreeIndexAddr
sets GTF_REVERSE_OPS
for apparently no reason. Hmm, seems like an obvious oversight.
And then fgSetTreeSeqHelper
double-checks that GTF_REVERSE_OPS
is set.
Anyways undoing both those fixes the bug and doesn't cause any pri1 failures and modest minopts diffs (1147 out of ~192K methods). Will PR this fix and the test case in the morning...
Thank you for root-causing this so quickly. The title does not accurately describe the issue, so please feel free to change it as you see fit.
And then fgSetTreeSeqHelper double-checks that GTF_REVERSE_OPS is set.
Yes, and that code should disappear. GenTreeIndexAddr
is a normal binary op and it shouldn't have any special handling. GenTreeVisitor
already lacks this special handling. (FWIW I'm looking into replacing fgSetTreeSeqHelper
with GenTreeVisitor
so that's why I'm interested in this case :)
@sergeybykov Tiered compilation is not enabled by default for 2.1, only for 2.2 and the master branch, specifically to help flush out issues like this that we haven't encountered in internal testing.
@BruceForstall Sorry, I probably confused it with 2.2. It was surprising that two people in our community ran into this independently with Hello World kind of apps.
What do you recommend with should do at this point. We are holding a release (Orleans 2.1.0) until we sort this issue out. Should we just document that there's a known issue with tiered compilation, and it should be turned off until the issue is fixed? Is there another workaround?
BTW, what's minopts
? Is it independent from tiered compilation?
@sergeybykov, which version of .NET Core is Orleans 2.1.0 targeting? I don't fully understand why this issue should block a release.
In .NET Core 2.1 tiered compilation is disabled by default, so if that's being targeted then some guidance may suffice, we can look into servicing 2.1 with a fix but it's not likely to happen.
.NET Core 2.2 is in preview mode and has not released yet, preview 2 released recently and has tiered compilation enabled by default. We'll definitely want to either fix the issue or disable tiered compilation for 2.2 RTM.
Minopts (minimal optimization) is the fast-jitting mode that is used in tier0. It is also supposed to be the high-reliability jitting mode.
As for a workaround: the bug requires that the call and is immediately followed by indexing into the array. So if you can introduce an extra statement between the call and the array index it should avoid this issue, eg:
```C#
// workaround for CoreCLR 20040 at Tier0
void Nop() {}
public byte ReadByte()
{
int offset;
var buff = CheckLength(out offset);
Nop();
return buff[offset];
}
```
This also should be completely and correctly optimized away with normal jitting and/or Tier1 so won't have any adverse impact on performance.
Would this also repro when running under the debugger in VS regardless of the TieredCompilation
setting? @sergeybykov mentioned that
a trivial "default" project created with VS wouldn't work
so I'm wondering if the debuggable
code that VS requests by default, which is similar to minopts
, is adding to the confusion about when this is likely to repro. If it is limited to debugging scenarios in 2.1 as long as tiering isn't explicitly enabled, then perhaps @AndyAyersMS's workaround would be acceptable(?).
No -- we only use GT_INDEX_ADDR
under minopts, not under debuggable codegen. See dotnet/runtime#8806.
So the bug should only manifest with a release build of the C# project, and then with tiering enabled.
@sergeybykov
What do you recommend with should do at this point. We are holding a release (Orleans 2.1.0) until we sort this issue out. Should we just document that there's a known issue with tiered compilation, and it should be turned off until the issue is fixed? Is there another workaround?
I dont think this should hold up the release -- the same exact issue occurs in the stable version of Orleans already
@sergeybykov, which version of .NET Core is Orleans 2.1.0 targeting? I don't fully understand why this issue should block a release.
We target .NET Standard 2.0. So there is no direct blocking issue here. It's just that we heard from our users that they were runing into this issue, blocking for them, even with trivial Hello World kind of apps. We wanted to understand the root cause before we release final bits.
Would this also repro when running under the debugger in VS regardless of the TieredCompilation setting
It seems to work fine with debugger attached and TieredCompilation
disabled.
I dont think this should hold up the release -- the same exact issue occurs in the stable version of Orleans already
Now that we understand the root cause, I agree.
@AndyAyersMS Thank you for the workaround. I think the problem is that we'd have to put in every piece of code every code path where this bug can manifest. I don't see how we can cover all such possibilities reliably. It seems to me we should still tell our users to disable TieredCompilation
for now. Do you agree?
Later, when this issue is fixed, we can try to reenable TieredCompilation
and run through all our tests to see if any other issues surface. Does this sound like a reasonable plan or you see a better one?
I understand the workaround may be impractical -- I looked over the sources but couldn't tell if the pattern was pervasive or just happened in a few places. The bug requires that exact sequence so I was hopeful it might be something that could be worked around easily.
Telling your users to not enable (2.1) or to disable (2.2, 3.0) tiered compilation for now seems prudent. We should have this fixed shortly in 3.0 and I will propose fixing it in both 2.2 and 2.1.
The fix is in master now so it would be great if you all can confirm the fix with an upcoming 3.0 preview build. I believe it takes a day or two for the previews to pick up these changes. I can point you at the right build once this happens.
Most helpful comment
I understand the workaround may be impractical -- I looked over the sources but couldn't tell if the pattern was pervasive or just happened in a few places. The bug requires that exact sequence so I was hopeful it might be something that could be worked around easily.
Telling your users to not enable (2.1) or to disable (2.2, 3.0) tiered compilation for now seems prudent. We should have this fixed shortly in 3.0 and I will propose fixing it in both 2.2 and 2.1.