Netcoreapp 3.0
Sdk Version 3.0.100-preview-009841
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009630
[Host] : .NET Core 3.0.0-preview1-26928-03 (CoreCLR 4.6.26927.03, CoreFX 4.6.26927.03), 64bit RyuJIT
Job-TQETEJ : .NET Core 3.0.0-preview1-26928-03 (CoreCLR 4.6.26927.03, CoreFX 4.6.26927.03), 64bit RyuJIT
BuildConfiguration=Release-Intrinsics Toolchain=netcoreapp3.0 InvocationCount=1
MaxIterationCount=20 UnrollFactor=1 WarmupCount=1
| Method | Mean | Error | StdDev | Median | Extra Metric |
|----------------------- |---------:|----------:|----------:|----------:|-------------:|
| AsSpanForUnitBenchmark | 1.445 us | 0.8379 us | 0.9649 us | 0.8400 us | - |
Sdk version 2.1.401
Netcoreapp2.1
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.471 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009630
[Host] : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
Job-JARWGC : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT
BuildConfiguration=Release Toolchain=netcoreapp2.1 InvocationCount=1
MaxIterationCount=20 UnrollFactor=1 WarmupCount=1
| Method | Mean | Error | StdDev | Median | Extra Metric |
|----------------------- |---------:|---------:|---------:|---------:|-------------:|
| AsSpanForUnitBenchmark | 48.75 ns | 29.60 ns | 29.07 ns | 65.00 ns | - |
The code for this benchmark is
``` C#
public uint[] input;
public int end;
[IterationSetup(Target = nameof(AsSpanForUnitBenchmark))]
public void setup()
{
int min = 0;
int max = 100000;
Random randNum = new Random();
end = randNum.Next(min, max);
input = new uint[max]
for (int i = 0; i < input.Length; i++)
{
input[i] = (uint)randNum.Next(min, max);
}
}
[Benchmark]
public void AsSpanForUnitBenchmark()
{
input.AsSpan(0, end);
}
```
cc @danmosemsft @adamsitnik @GrabYourPitchforks
Think was fixed post preview? https://github.com/dotnet/coreclr/pull/21048
/cc @jkotas
so do the coreclr version needs to be updated here ?
That fix was in preview 1.
c:\git\coreclr>git rev-list v3.0.0-preview-27121-03 | find /i "ce7b0d7"
ce7b0d76090ea5306aad983d9e7b5178fcb7b9fb
@Anipik do you have official preview 1 bits? Or easier, just grab the latest 3.0 setup from the https://github.com/dotnet/core-sdk download page.
I am using
3.0.100-preview-009841
installing 'https://dotnetcli.azureedge.net/dotnet/Sdk/3.0.100-preview-009841/dotnet-sdk-3.0.100-preview-009841-win-x64.zip' to 'C:\git\machinelearning\Tools\dotnetcli\dotnet-sdk-3.0.100-preview-009841-win-x64.zip'
@jkotas any known active perf issue here?
I do not think we have any active perf issue on this one.
I believe this regression was introduced by https://github.com/dotnet/coreclr/pull/20771. Here is the disassembly difference:
Before:
...
cmp dword ptr [rax+8],0
jb AsSpanForUnitBenchmark()+...
cmp dword ptr [rax+8],edx
jb AsSpanForUnitBenchmark()+...
lea rcx,[rax+10h]
mov qword ptr [rsp+28h],rcx
mov dword ptr [rsp+30h],edx
...
After:
mov ecx,dword ptr [rax+8]
mov ecx,ecx
mov r8d,edx
cmp rcx,r8
jb AsSpanForUnitBenchmark()+...
add rax,10h
mov rcx,rax
mov rax,rcx
mov rcx,rax
mov qword ptr [rsp+28h],rcx
mov dword ptr [rsp+30h],edx
The JIT code is bigger and its has more data dependencies, and thus runs slower. I have verified that reverting the commit changes the code to what it used to be.
cc @GrabYourPitchforks @ahsonkhan
Good find @Anipik 馃
I don't see benchmark coverage of this scenario. The closest is https://github.com/dotnet/performance/blob/master/src/benchmarks/micro/corefx/System.Runtime/Perf.UInt32.cs which is really testing int.Parse and has a tiny array.
Whoever works on this, please add a benchmark there.
@CarolEidt @AndyAyersMS How hard would it be to get rid of the redundant movs that seems to be causing the regression?
I'll take a look, recently I stumbled upon a JIT change that gets rid of some of these funny moves. Maybe we're lucky and it's the same case.
Also, the second mov in
mov ecx,dword ptr [rax+8]
mov ecx,ecx
might be eliminated by another change I have.
Also, the second mov in might be eliminated by another change I have.
Yep, that's definitely dotnet/coreclr#12676.
Still looking at the funny moves. Some unrelated change I have does get rid of some such Span related funny moves but not in this case. Still, I suspect they're related so I'm still looking.
I don't see benchmark coverage of this scenario.
I just sent a PR for that https://github.com/dotnet/performance/pull/206
And let's not miss the fact that when start is 0 the comparison should be narrowed to 32 bit
mov ecx,dword ptr [rax+8]
mov ecx,ecx
mov r8d,edx
cmp rcx,r8
jb AsSpanForUnitBenchmark()+..
should be
cmp dword ptr [rax+8],edx
jb AsSpanForUnitBenchmark()+..
recently I stumbled upon a JIT change that gets rid of some of these funny moves
Turns out that the change I had in mind is unrelated. But I have some other work in progress that gets rid of these funny moves. Ultimately it all has to do with structs, struct promotion, address taken variables and whatnot.
Though I'm not sure if the funny moves are actually caused by dotnet/coreclr#20711. It's possible that there was some other regression in the JIT. Or maybe the change dotnet/coreclr#20711 just had an impact on the surrounding code.
The issue is actually caused by code like
```C#
_pointer = new ByReference
Among other IR this generates
[000167] ------------ * STMT void (IL 0x002... ???)
[000166] -A---------- --* ASG struct (copy)
[000165] ------------ +--* OBJ struct(8)
[000164] ------------ | --* ADDR byref
[000163] ------------ | --* FIELD struct _pointer
[000136] L----------- | --* ADDR byref
[000137] ------------ | --* LCL_VAR struct(16)(P) V08 tmp3
| --* byref V08._pointer (offs=0x00) -> V15 tmp10
| --* int V08._length (offs=0x08) -> V16 tmp11
[000161] ------------ --* LCL_VAR struct(8)(P) V11 tmp6
--* byref V11._value (offs=0x00) -> V17 tmp12
The JIT fails to cleanup this mess (no idea if it ever did this and this is just a regression) and this interferes with subsequent optimizations (namely - local assertion propagation).
With proper cleanup we get
[000167] ------------ * STMT void (IL 0x002... ???)
[000166] -A---------- --* ASG byref
[000165] ------------ +--* LCL_VAR byref V15 tmp10
[000161] ------------ --* LCL_VAR byref V17 tmp12
and then optimizations kick in and we get
```asm
4C8D4210 lea r8, bword ptr [rdx+16]
instead of
add rax,10h
mov rcx,rax
mov rax,rcx
mov rcx,rax
For completeness, the other problem I was talking about is caused by code like
C#
ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)), // from SequenceEqual
where a somewhat similar issue blocks local assertion propagation.
I have a WIP PR dotnet/coreclr#19429. I just updated it, and verified that it gets rid of two of the three movs after the add. Like most register allocation changes, it had both positive and negative impact (though more positive by a good margin), but was reticent to merge it without perf results. I'll give the perf runs another go.
Like most register allocation changes, it had both positive and negative impact (though more positive by a good margin), but was reticent to merge it without perf results
But is it LSRA's job to deal with poor IR generated by previous phases?
But is it LSRA's job to deal with poor IR generated by previous phases?
Not entirely. But given phase ordering and all, I think it's generally accepted that a register allocator should incorporate copy elimination, either as part of its function, or done prior to.
@mikedn I take it you meant dotnet/coreclr#20771?
The only change early on in the jit IR is from that source-level change to the validation logic in the span constructor.
It would be interesting to revert dotnet/coreclr#20771 and see if perf goes back to where it was in 2.1. If so perhaps we need to evaluate that optimization more broadly.
@mikedn I take it you meant dotnet/coreclr#20771? The only change early on in the jit IR is from that source-level change to the validation logic in the span constructor.
@jkotas found that. I think there are actually 2 unrelated problems. One was that change, it came with the 64 bit compare that the JIT doesn't narrow to 32 bit. The other is the "funny move" problem, that's probably not related to dotnet/coreclr#20771, it's the usual problem the JIT has with some Span code due to problems with struct promotion.
OK, I tried to revert the change done in dotnet/coreclr#20771. Indeed, the validation logic change did cause the apparition of those funny moves, even though they're actually generated by the ByReference assignment that follows the validation. I'll take a closer look, it's a bit strange how one caused the other.
Those funny moves exist in IR both before and after dotnet/coreclr#20771. The difference is simply in register allocation. Before, all variables involved in the copy chain got r8 and after the change the same variables get both r8 and rdx.
This should have been be improved by dotnet/coreclr#19429 and dotnet/coreclr#22454 ...
;;; 2.1
G_M12530_IG04:
83780800 cmp dword ptr [rax+8], 0
721D jb SHORT G_M12530_IG09
395008 cmp dword ptr [rax+8], edx
7218 jb SHORT G_M12530_IG09
G_M12530_IG05:
488D4810 lea rcx, bword ptr [rax+16]
G_M12530_IG06:
48894C2428 mov bword ptr [rsp+28H], rcx
89542430 mov dword ptr [rsp+30H], edx
and master with the above PRs merged does show promise....
;;; master plus 22454
G_M12530_IG04:
8B4808 mov ecx, dword ptr [rax+8]
448BC2 mov r8d, edx
493BC8 cmp rcx, r8
721B jb SHORT G_M12530_IG09
G_M12530_IG05:
4883C010 add rax, 16
488BC8 mov rcx, rax
G_M12530_IG06:
48894C2428 mov bword ptr [rsp+28H], rcx
89542430 mov dword ptr [rsp+30H], edx
Still room for improvement -- looks like maybe LSRA is not using casts for preferencing?
I don't see much perf difference between 3.0 Preview 2 and 2.1 on the test case that was added in dotnet/performance#260 to try and capture this issue.
BenchmarkDotNet=v0.11.3.1003-nightly, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview4-010309
[Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
Job-WTCAUC : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
Job-YHFHGF : .NET Core 3.0.0-preview-27412-2 (CoreCLR 4.6.27411.71, CoreFX 4.7.19.11201), 64bit RyuJIT
Runtime=Core IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|----------------------- |-------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|--------:|------------:|------------:|------------:|--------------------:|
| ArrayAsSpanStartLength | netcoreapp2.1 | 0.3919 ns | 0.0134 ns | 0.0125 ns | 0.3879 ns | 0.3705 ns | 0.4164 ns | 1.00 | 0.00 | - | - | - | - |
| ArrayAsSpanStartLength | netcoreapp3.0 | 0.4198 ns | 0.0288 ns | 0.0269 ns | 0.4166 ns | 0.3848 ns | 0.4819 ns | 1.07 | 0.08 | - | - | - | - |
@Anipik can you remeasure your test with 3.0 preview 2?
Am going to close since I think this is fixed -- please reopen if that's not the case.
@Anipik please post back when you've verified 3.0 preview 2 or later..
yeah I am on it
@AndyAyersMS I verified and regression is no longer there. thanks for all the help and effort in resolving this
Most helpful comment
I'll take a look, recently I stumbled upon a JIT change that gets rid of some of these funny moves. Maybe we're lucky and it's the same case.
Also, the second
movinmight be eliminated by another change I have.