Executing the following code:
```C#
private static void Main(string[] args)
{
    var ar = new double[]
    {
        100
    };
    FillTo1(ref ar, 5);
    Console.WriteLine(string.Join(",", ar.Select(a => a.ToString()).ToArray()));
}

public static void FillTo1(ref double[] dd, int N)
{
    if (dd.Length >= N)
        return;
    double[] Old = dd;
    double d = double.NaN;
    if (Old.Length > 0)
        d = Old[0];
    dd = new double[N];
    for (int i = 0; i < Old.Length; i++)
    {
        dd[N - Old.Length + i] = Old[i];
    }
    for (int i = 0; i < N - Old.Length; i++)
        dd[i] = d;
}
```
The result in Debug mode is 100,100,100,100,100, but in Release mode it is 100,100,100,100,0.
Tested on .NET Framework 4.7.1 and .NET Core 2.0.0.
Also posted to Stack Overflow (https://stackoverflow.com/questions/47591915/why-code-behavior-is-different-in-release-debug-mode).
(Comment aimed at the repo maintainers, not the OP, for categorisation.) This looks to be a recent RyuJIT bug; see the answer on the SO question for context.
The code generated for the first for loop is busted:
```
33C0     xor eax, eax
85ED     test ebp, ebp
7E25     jle SHORT G_M24688_IG07

G_M24688_IG06:
488B17          mov rdx, gword ptr [rdi]            ; rdx = dd
8D0C03          lea ecx, [rbx+rax]                  ; 0 + i ?!
4C63C0          movsxd r8, eax
C4817B1044C610  vmovsd xmm0, qword ptr [r14+8*r8+16] ; xmm0 = old[i]
3B4A08          cmp ecx, dword ptr [rdx+8]          ; dd[n - old.Length + i] range check
7344            jae SHORT G_M24688_IG10
4863C9          movsxd rcx, ecx
C4E17B1144CA10  vmovsd qword ptr [rdx+8*rcx+16], xmm0 ; dd[n - old.Length + i] = xmm0
FFC0            inc eax                             ; i++
3BE8            cmp ebp, eax                        ; old.Length > i
7FDB            jg SHORT G_M24688_IG06

G_M24688_IG07:
33C0     xor eax, eax                               ; i = 0
8BDE     mov ebx, esi                               ; ebx = n
2BDD     sub ebx, ebp                               ; n - old.Length
85DB     test ebx, ebx
7E18     jle SHORT G_M24688_IG09                    ; n - old.Length <= 0

G_M24688_IG08:
488B17          mov rdx, gword ptr [rdi]
3B4208          cmp eax, dword ptr [rdx+8]
7322            jae SHORT G_M24688_IG10
4863C8          movsxd rcx, eax
C4E17B1174CA10  vmovsd qword ptr [rdx+8*rcx+16], xmm6
FFC0            inc eax
3BD8            cmp ebx, eax
7FE8            jg SHORT G_M24688_IG08              ; n - old.Length > i
```
Where `N - Old.Length + i` should be, we instead have `0 + i`. `N - Old.Length` is computed and stored in `ebx` only after that loop. For the repro above (`Old = { 100 }`, `N = 5`) this means the first loop copies `Old[0]` to `dd[0]` instead of `dd[4]`; the fill loop then sets `dd[0..3]` to 100, leaving `dd[4]` at its default 0, which is why Release prints 100,100,100,100,0.
If FillTo1 is modified to this:
```C#
public static void FillTo1(ref double[] dd, ref double[] dx, int N)
{
    if (dx.Length >= N)
        return;
```
and called with `FillTo1(ref ar, ref ar, 5)`, the problem disappears. It reappears when `dx` is changed back to `dd`. The only IL difference in that case is `ldarg.1` vs. `ldarg.0`.
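(For completeness, a minimal sketch of the full two-ref variant used in that comparison, under the assumption that the rest of the body is identical to the original FillTo1:)

```C#
public static void FillTo1(ref double[] dd, ref double[] dx, int N)
{
    if (dx.Length >= N)   // the early-out now reads the second ref parameter
        return;
    double[] Old = dd;
    double d = double.NaN;
    if (Old.Length > 0)
        d = Old[0];
    dd = new double[N];
    for (int i = 0; i < Old.Length; i++)
        dd[N - Old.Length + i] = Old[i];
    for (int i = 0; i < N - Old.Length; i++)
        dd[i] = d;
}

// Called as:
// FillTo1(ref ar, ref ar, 5);
```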
If you want a workaround, the best one is to stop using ref. Not only does that avoid the bug, it also results in slightly better code being generated. Something like:
```C#
public static double[] Test(double[] dd, int n)
{
    if (dd.Length >= n)
        return dd;
    double[] nn = new double[n];
    for (int i = 0; i < dd.Length; i++)
        nn[nn.Length - dd.Length + i] = dd[i];
    double d = dd.Length > 0 ? dd[0] : double.NaN;
    for (int i = 0; i < nn.Length - dd.Length; i++)
        nn[i] = d;
    return nn;
}
```
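The call site then changes from passing by ref to reassigning the return value; adapting the original Main, for example:

```C#
var ar = new double[] { 100 };
ar = Test(ar, 5);   // reassign the result instead of passing by ref
Console.WriteLine(string.Join(",", ar.Select(a => a.ToString()).ToArray()));
```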
@mikedn thanks, I just reported it to help get this bug fixed.
There are multiple workarounds, including adding a Console.WriteLine() or Thread.Sleep(1) call, declaring var len = Old.Length and using len instead of Old.Length everywhere, and not using ref as you mentioned.
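For reference, the length-caching workaround would look something like the sketch below (based on the original FillTo1; presumably it helps because `N - len` then contains no nested `Old.Length` expression for CSE to rewrite):

```C#
public static void FillTo1_Workaround(ref double[] dd, int N)
{
    if (dd.Length >= N)
        return;
    double[] Old = dd;
    int len = Old.Length;                       // cache the length once
    double d = len > 0 ? Old[0] : double.NaN;
    dd = new double[N];
    for (int i = 0; i < len; i++)
        dd[N - len + i] = Old[i];               // use len instead of Old.Length everywhere
    for (int i = 0; i < N - len; i++)
        dd[i] = d;
}
```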
A rather interesting bug.
Loop hoisting creates a copy of N - Old.Length in the pre-header of the first loop to allow CSE to pick it up. This copy is almost dead, its result is not used anywhere at this point. But it can't be removed during morphing because Old.Length has an exception side-effect.
The second loop is turned into a do/while loop and in the process a copy of loop's condition - again N - Old.Length - is also made. This works much like loop hoisting except that this copy is actually used.
When CSE starts there are 4 occurrences of N - Old.Length - one def and one use for every loop. There are also a few other CSE candidates, including Old.Length.
So far so good, nothing unusual.
Now CSE happens to process Old.Length first and replaces all the occurrences in N - Old.Length with a local variable. Every time it does that it remorphs statements, and this seems to be where things go down the drain. The statement created by loop hoisting loses its side effects and gets removed as being useless.
When CSE finally processes N - Old.Length it replaces the use inside the first loop with a local variable. But the definition has been removed so the variable will remain uninitialized.
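To make that concrete, here is a source-level sketch (hypothetical names, not actual JIT output) of what the method effectively computes after the bad CSE; the temp that should hold N - Old.Length has lost its only definition, so it reads whatever was left in the register (zero in the repro):

```C#
// Illustrative sketch only. 'len' and 'hoisted' are hypothetical names for the CSE
// temps; in the real code 'hoisted' is simply never assigned because its def was deleted.
public static void FillTo1_AfterBadCse(ref double[] dd, int N)
{
    if (dd.Length >= N)
        return;
    double[] Old = dd;
    double d = Old.Length > 0 ? Old[0] : double.NaN;
    dd = new double[N];

    int len = Old.Length;   // CSE of Old.Length
    int hoisted = 0;        // was meant to be N - len (the hoisted def of N - Old.Length),
                            // but that statement was removed during CSE

    for (int i = 0; i < len; i++)
        dd[hoisted + i] = Old[i];   // writes dd[0] instead of dd[N - len]

    for (int i = 0; i < N - len; i++)
        dd[i] = d;                  // fills dd[0 .. N-len-1]; in the repro dd[4] stays 0
}
```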
That said, I've no idea what changed recently that could have caused this. There were some changes around side-effect updates, I suppose one possibility is that before side-effects were not updated correctly and that prevented the hoisted expression from being removed. But anyway, it looks like morph should not remove CSE candidate trees while CSE is in progress.
cc @dotnet/jit-contrib
Looks like CSE#03 (N - Old.Length) has two defs and has uses of CSE#02 (Old.Length) nested inside. When CSE#02 is processed we decrement def counts for the first def of CSE#03, but we don't do the same decrement for the second def.
So after processing CSE#02, CSE#03 still looks viable despite its def having been removed, and things go off the rails there.
Some possible fixes:
- In this particular case it seems that we should not remove the def, because doing so defeats loop hoisting.
- Or perhaps postpone morphing until CSE is complete? It's not clear why the same statement should be morphed multiple times (apart from wasting cycles...).
Looks like just not removing statements during CSE is the way to go; it is simple and doesn't have much CQ impact. Also should easily backport to 2.0... Will have a PR up in a sec.
We do the re-morphing after introducing a CSE def so that the assignment flag gets set correctly for the new tree.
> so that the assignment flag gets set correctly for the new tree
GTF_ASG? Can't gtUpdateSideEffects be used instead of morphing?
Maybe but that function is a more recent addition, and we need to update all the way up to the top of the tree. When we have a tree with nested CSE we need the assignment flag to preserve the prior CSE defs.
It was updated to be GTF_ASG precise by PR dotnet/coreclr#13668
Reopening since we should port this to 2.0 servicing.
Went and looked at why this doesn't happen in 1.1.
Note CSE numbering differs, but the upshot is that the array length CSE cost went up and so the sort order changed. In 1.1 we CSE'd the outer expression first.
Haven't drilled into what caused the cost change yet, as it seems like the outer expression should still be more costly.
```
;; 1.1
CSE candidate #1, vn=$c2 cseMask=0000000000000001 in BB04, [cost= 3, size= 3]:
N002 ( 3, 3) CSE #1 (use)[000125] ---X-------- * arrLen int $c2
N001 ( 1, 1) [000114] ------------ \--* lclVar ref V02 loc0 u:3 $1c1
CSE candidate #2, vn=$203 cseMask=0000000000000002 in BB06, [cost= 5, size= 5]:
N002 ( 3, 3) CSE #1 (use)[000052] ---X-------- /--* arrLen int $c2
N001 ( 1, 1) [000051] ------------ | \--* lclVar ref V02 loc0 u:3 $1c1
N004 ( 5, 5) CSE #2 (use)[000054] ---X----R--- * - int $203
N003 ( 1, 1) [000050] ------------ \--* lclVar int V01 arg1 u:2 $c0
CSE #2,cseMask=0000000000000002,useCnt=2: [def=100, use=800] :: N004 ( 5, 5) CSE #2 (def)[000213] ---X---HR--- * - int $203
CSE #1,cseMask=0000000000000001,useCnt=7: [def= 50, use=1400] :: N002 ( 3, 3) CSE #1 (def)[000020] ---X-------- * arrLen int $c2

;; latest (and likely 2.0)
CSE candidate #1, vn=$1c0 cseMask=0000000000000001 in BB03, [cost= 3, size= 2]:
N002 ( 3, 2) CSE #1 (use)[000011] *--XG------- * IND ref <l:$1c0, c:$201>
N001 ( 1, 1) [000010] ------------ \--* LCL_VAR byref V00 arg0 u:2 $80
CSE candidate #2, vn=$c2 cseMask=0000000000000002 in BB03, [cost= 3, size= 3]:
N002 ( 3, 3) CSE #2 (use)[000020] ---X-------- * ARR_LENGTH int <l:$c2, c:$c3>
N001 ( 1, 1) [000019] ------------ \--* LCL_VAR ref V02 loc0 u:3 <l:$1c0, c:$201>
CSE candidate #3, vn=$247 cseMask=0000000000000004 in BB06, [cost= 5, size= 5]:
N002 ( 3, 3) CSE #2 (use)[000052] ---X-------- /--* ARR_LENGTH int <l:$c2, c:$c3>
N001 ( 1, 1) [000051] ------------ | \--* LCL_VAR ref V02 loc0 u:3 <l:$1c0, c:$201>
N004 ( 5, 5) CSE #3 (use)[000054] ---X----R--- * SUB int <l:$247, c:$246>
N003 ( 1, 1) [000050] ------------ \--* LCL_VAR int V01 arg1 u:2 $c0
CSE #2,cseMask=0000000000000002,useCnt=8: [def=100, use=1450] :: N003 ( 5, 4) CSE #2 (def)[000003] ---XG------- * ARR_LENGTH int <l:$c2, c:$c1>
CSE #3,cseMask=0000000000000004,useCnt=2: [def=100, use=800] :: N004 ( 5, 5) CSE #3 (def)[000213] ---X---HR--- * SUB int <l:$247, c:$246>
CSE #1,cseMask=0000000000000001,useCnt=1: [def=100, use= 50] :: N002 ( 3, 2) CSE #1 (def)[000002] *--XG------- * IND ref <l:$1c0, c:$200>
```
Looks like in 2.0 there is an extra CSE def of the array length which we pick up and prefer. This new CSE exists because of value numbering through byrefs. Because it has an indir this def has higher cost than the other equivalent CSEs.
Normally for nested CSEs we'd expect the parent CSE to always be more costly than the child. Because of this the algorithm naturally processes nested CSEs from outer->inner and so avoids the bug we see here.
But when, as in this case, the child CSE's expression shape can differ between occurrences and the parent CSE only wraps the cheap occurrences, it is possible to see a child occurrence that is not less costly than the parent. This allows the child CSE to be processed before the parent and exposes the bug.
There are likely ways to expose this in 1.1: if you can create a child CSE with two variant forms that value number the same, put the most expensive one in the dominant position, and then create a wrapping CSE that only wraps the cheaper occurrences, something similar could happen.
```
;; 1.1
N002 ( 3, 3) CSE #1 (def)[000020] ---X-------- --* arrLen int $c2
N001 ( 1, 1) [000019] ------------ --* lclVar ref V02 loc0 u:3 $1c1

;; 2.0
N003 ( 5, 4) CSE #2 (def)[000003] ---XG------- --* ARR_LENGTH int
N002 ( 3, 2) CSE #1 (def)[000002] *--XG------- --* IND ref
N001 ( 1, 1) [000001] ------------ --* LCL_VAR byref V00 arg0 u:2 $80
```
Closing -- the issue is fixed in master via dotnet/coreclr#15323 and fixed in the 2.0 branch (available in 2.0.6) by dotnet/coreclr#15360.