Runtime: Regression in AsSpan for uint[] Datatype

Created on 17 Dec 2018 · 28Comments · Source: dotnet/runtime

Netcoreapp 3.0
Sdk Version 3.0.100-preview-009841


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009630
  [Host]     : .NET Core 3.0.0-preview1-26928-03 (CoreCLR 4.6.26927.03, CoreFX 4.6.26927.03), 64bit RyuJIT
  Job-TQETEJ : .NET Core 3.0.0-preview1-26928-03 (CoreCLR 4.6.26927.03, CoreFX 4.6.26927.03), 64bit RyuJIT

BuildConfiguration=Release-Intrinsics  Toolchain=netcoreapp3.0  InvocationCount=1  
MaxIterationCount=20  UnrollFactor=1  WarmupCount=1

| Method | Mean | Error | StdDev | Median | Extra Metric |
|----------------------- |---------:|----------:|----------:|----------:|-------------:|
| AsSpanForUnitBenchmark | 1.445 us | 0.8379 us | 0.9649 us | 0.8400 us | - |

Sdk version 2.1.401
Netcoreapp2.1


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009630
  [Host]     : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
  Job-JARWGC : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp2.1  InvocationCount=1  
MaxIterationCount=20  UnrollFactor=1  WarmupCount=1

| Method | Mean | Error | StdDev | Median | Extra Metric |
|----------------------- |---------:|---------:|---------:|---------:|-------------:|
| AsSpanForUnitBenchmark | 48.75 ns | 29.60 ns | 29.07 ns | 65.00 ns | - |

The code for this benchmark is
``` C#

public uint[] input;
public int end;

[IterationSetup(Target = nameof(AsSpanForUnitBenchmark))]
public void setup()
{
int min = 0;
int max = 100000;
Random randNum = new Random();
end = randNum.Next(min, max);

    input = new uint[max]
   for (int i = 0; i < input.Length; i++)
    {
           input[i] = (uint)randNum.Next(min, max);
    }

}

[Benchmark]
public void AsSpanForUnitBenchmark()
{
input.AsSpan(0, end);
}
```

cc @danmosemsft @adamsitnik @GrabYourPitchforks

area-CodeGen-coreclr bug tenet-performance

Source

Anipik

👍1

Most helpful comment

I'll take a look, recently I stumbled upon a JIT change that gets rid of some of these funny moves. Maybe we're lucky and it's the same case.

Also, the second mov in

mov     ecx,dword ptr [rax+8]
mov     ecx,ecx

might be eliminated by another change I have.

mikedn on 19 Dec 2018

👍2

All 28 comments

Think was fixed post preview? https://github.com/dotnet/coreclr/pull/21048

/cc @jkotas

benaadams on 17 Dec 2018

so do the coreclr version needs to be updated here ?

Anipik on 17 Dec 2018

That fix was in preview 1.

c:\git\coreclr>git rev-list v3.0.0-preview-27121-03 | find /i "ce7b0d7"
ce7b0d76090ea5306aad983d9e7b5178fcb7b9fb

@Anipik do you have official preview 1 bits? Or easier, just grab the latest 3.0 setup from the https://github.com/dotnet/core-sdk download page.

danmosemsft on 17 Dec 2018

I am using
3.0.100-preview-009841
installing 'https://dotnetcli.azureedge.net/dotnet/Sdk/3.0.100-preview-009841/dotnet-sdk-3.0.100-preview-009841-win-x64.zip' to 'C:\git\machinelearning\Tools\dotnetcli\dotnet-sdk-3.0.100-preview-009841-win-x64.zip'

Anipik on 18 Dec 2018

@jkotas any known active perf issue here?

danmosemsft on 18 Dec 2018

I do not think we have any active perf issue on this one.

I believe this regression was introduced by https://github.com/dotnet/coreclr/pull/20771. Here is the disassembly difference:

Before:

...
cmp     dword ptr [rax+8],0
jb      AsSpanForUnitBenchmark()+...
cmp     dword ptr [rax+8],edx
jb      AsSpanForUnitBenchmark()+...
lea     rcx,[rax+10h]
mov     qword ptr [rsp+28h],rcx
mov     dword ptr [rsp+30h],edx
...

After:

mov     ecx,dword ptr [rax+8]
mov     ecx,ecx
mov     r8d,edx
cmp     rcx,r8
jb      AsSpanForUnitBenchmark()+...
add     rax,10h
mov     rcx,rax
mov     rax,rcx
mov     rcx,rax
mov     qword ptr [rsp+28h],rcx
mov     dword ptr [rsp+30h],edx

The JIT code is bigger and its has more data dependencies, and thus runs slower. I have verified that reverting the commit changes the code to what it used to be.

cc @GrabYourPitchforks @ahsonkhan

jkotas on 18 Dec 2018

👍1

Good find @Anipik 🥂

danmosemsft on 18 Dec 2018

👍1

I don't see benchmark coverage of this scenario. The closest is https://github.com/dotnet/performance/blob/master/src/benchmarks/micro/corefx/System.Runtime/Perf.UInt32.cs which is really testing int.Parse and has a tiny array.

Whoever works on this, please add a benchmark there.

danmosemsft on 18 Dec 2018

@CarolEidt @AndyAyersMS How hard would it be to get rid of the redundant movs that seems to be causing the regression?

jkotas on 19 Dec 2018

I'll take a look, recently I stumbled upon a JIT change that gets rid of some of these funny moves. Maybe we're lucky and it's the same case.

Also, the second mov in

mov     ecx,dword ptr [rax+8]
mov     ecx,ecx

might be eliminated by another change I have.

mikedn on 19 Dec 2018

👍2

Also, the second mov in might be eliminated by another change I have.

Yep, that's definitely dotnet/coreclr#12676.

Still looking at the funny moves. Some unrelated change I have does get rid of some such Span related funny moves but not in this case. Still, I suspect they're related so I'm still looking.

mikedn on 19 Dec 2018

I don't see benchmark coverage of this scenario.

I just sent a PR for that https://github.com/dotnet/performance/pull/206

adamsitnik on 19 Dec 2018

And let's not miss the fact that when start is 0 the comparison should be narrowed to 32 bit

mov     ecx,dword ptr [rax+8]
mov     ecx,ecx
mov     r8d,edx
cmp     rcx,r8
jb      AsSpanForUnitBenchmark()+..

should be

cmp     dword ptr [rax+8],edx
jb      AsSpanForUnitBenchmark()+..

mikedn on 19 Dec 2018

recently I stumbled upon a JIT change that gets rid of some of these funny moves

Turns out that the change I had in mind is unrelated. But I have some other work in progress that gets rid of these funny moves. Ultimately it all has to do with structs, struct promotion, address taken variables and whatnot.

Though I'm not sure if the funny moves are actually caused by dotnet/coreclr#20711. It's possible that there was some other regression in the JIT. Or maybe the change dotnet/coreclr#20711 just had an impact on the surrounding code.

The issue is actually caused by code like
```C#
_pointer = new ByReference(ref Unsafe.Add(ref Unsafe.As(ref array.GetRawSzArrayData()), start));

Among other IR this generates

[000167] ------------ * STMT void (IL 0x002... ???)
[000166] -A---------- --* ASG struct (copy)
[000165] ------------ +--* OBJ struct(8)
[000164] ------------ | --* ADDR byref
[000163] ------------ | --* FIELD struct _pointer
[000136] L----------- | --* ADDR byref
[000137] ------------ | --* LCL_VAR struct(16)(P) V08 tmp3
| --* byref V08._pointer (offs=0x00) -> V15 tmp10
| --* int V08._length (offs=0x08) -> V16 tmp11
[000161] ------------ --* LCL_VAR struct(8)(P) V11 tmp6
--* byref V11._value (offs=0x00) -> V17 tmp12

The JIT fails to cleanup this mess (no idea if it ever did this and this is just a regression) and this interferes with subsequent optimizations (namely - local assertion propagation).

With proper cleanup we get

[000167] ------------ * STMT void (IL 0x002... ???)
[000166] -A---------- --* ASG byref
[000165] ------------ +--* LCL_VAR byref V15 tmp10
[000161] ------------ --* LCL_VAR byref V17 tmp12

and then optimizations kick in and we get
```asm
4C8D4210             lea      r8, bword ptr [rdx+16]

instead of

add     rax,10h
mov     rcx,rax
mov     rax,rcx
mov     rcx,rax

For completeness, the other problem I was talking about is caused by code like
C# ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)), // from SequenceEqual
where a somewhat similar issue blocks local assertion propagation.

mikedn on 19 Dec 2018

I have a WIP PR dotnet/coreclr#19429. I just updated it, and verified that it gets rid of two of the three movs after the add. Like most register allocation changes, it had both positive and negative impact (though more positive by a good margin), but was reticent to merge it without perf results. I'll give the perf runs another go.

CarolEidt on 20 Dec 2018

Like most register allocation changes, it had both positive and negative impact (though more positive by a good margin), but was reticent to merge it without perf results

But is it LSRA's job to deal with poor IR generated by previous phases?

mikedn on 20 Dec 2018

But is it LSRA's job to deal with poor IR generated by previous phases?

Not entirely. But given phase ordering and all, I think it's generally accepted that a register allocator should incorporate copy elimination, either as part of its function, or done prior to.

CarolEidt on 20 Dec 2018

👍1

@mikedn I take it you meant dotnet/coreclr#20771?

The only change early on in the jit IR is from that source-level change to the validation logic in the span constructor.

It would be interesting to revert dotnet/coreclr#20771 and see if perf goes back to where it was in 2.1. If so perhaps we need to evaluate that optimization more broadly.

AndyAyersMS on 20 Dec 2018

@mikedn I take it you meant dotnet/coreclr#20771? The only change early on in the jit IR is from that source-level change to the validation logic in the span constructor.

@jkotas found that. I think there are actually 2 unrelated problems. One was that change, it came with the 64 bit compare that the JIT doesn't narrow to 32 bit. The other is the "funny move" problem, that's probably not related to dotnet/coreclr#20771, it's the usual problem the JIT has with some Span code due to problems with struct promotion.

mikedn on 20 Dec 2018

OK, I tried to revert the change done in dotnet/coreclr#20771. Indeed, the validation logic change did cause the apparition of those funny moves, even though they're actually generated by the ByReference assignment that follows the validation. I'll take a closer look, it's a bit strange how one caused the other.

mikedn on 20 Dec 2018

Those funny moves exist in IR both before and after dotnet/coreclr#20771. The difference is simply in register allocation. Before, all variables involved in the copy chain got r8 and after the change the same variables get both r8 and rdx.

mikedn on 20 Dec 2018

This should have been be improved by dotnet/coreclr#19429 and dotnet/coreclr#22454 ...

;;; 2.1

G_M12530_IG04:
       83780800             cmp      dword ptr [rax+8], 0
       721D                 jb       SHORT G_M12530_IG09
       395008               cmp      dword ptr [rax+8], edx
       7218                 jb       SHORT G_M12530_IG09

G_M12530_IG05:
       488D4810             lea      rcx, bword ptr [rax+16]

G_M12530_IG06:
       48894C2428           mov      bword ptr [rsp+28H], rcx
       89542430             mov      dword ptr [rsp+30H], edx

and master with the above PRs merged does show promise....

;;; master plus 22454

G_M12530_IG04:
       8B4808               mov      ecx, dword ptr [rax+8]
       448BC2               mov      r8d, edx
       493BC8               cmp      rcx, r8
       721B                 jb       SHORT G_M12530_IG09

G_M12530_IG05:
       4883C010             add      rax, 16
       488BC8               mov      rcx, rax

G_M12530_IG06:
       48894C2428           mov      bword ptr [rsp+28H], rcx
       89542430             mov      dword ptr [rsp+30H], edx

AndyAyersMS on 7 Feb 2019

👍1

Still room for improvement -- looks like maybe LSRA is not using casts for preferencing?

AndyAyersMS on 7 Feb 2019

I don't see much perf difference between 3.0 Preview 2 and 2.1 on the test case that was added in dotnet/performance#260 to try and capture this issue.

BenchmarkDotNet=v0.11.3.1003-nightly, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview4-010309
  [Host]     : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  Job-WTCAUC : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  Job-YHFHGF : .NET Core 3.0.0-preview-27412-2 (CoreCLR 4.6.27411.71, CoreFX 4.7.19.11201), 64bit RyuJIT

Runtime=Core  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|----------------------- |-------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
| ArrayAsSpanStartLength | netcoreapp2.1 | 0.3919 ns | 0.0134 ns | 0.0125 ns | 0.3879 ns | 0.3705 ns | 0.4164 ns | 1.00 | 0.00 | - | - | - | - |
| ArrayAsSpanStartLength | netcoreapp3.0 | 0.4198 ns | 0.0288 ns | 0.0269 ns | 0.4166 ns | 0.3848 ns | 0.4819 ns | 1.07 | 0.08 | - | - | - | - |

@Anipik can you remeasure your test with 3.0 preview 2?

AndyAyersMS on 19 Feb 2019

👍1

Am going to close since I think this is fixed -- please reopen if that's not the case.

AndyAyersMS on 26 Feb 2019

👍1

@Anipik please post back when you've verified 3.0 preview 2 or later..

danmosemsft on 26 Feb 2019

yeah I am on it

Anipik on 26 Feb 2019

@AndyAyersMS I verified and regression is no longer there. thanks for all the help and effort in resolving this

Anipik on 26 Feb 2019