Runtime: [Question]Perf comparison local clr vs corefx upstream

Created on 10 Feb 2019 · 21Comments · Source: dotnet/runtime

I'm working on https://github.com/dotnet/corefx/issues/33392

If I compare my local clr bit vs corefx upstream I see great difference for a simple test with no difference(local clr/fx compiled as release)

BenchmarkDotNet=v0.11.3.936-nightly, OS=Windows 10.0.17134.523 (1803/April2018Update/Redstone4)
Intel Core i7-3740QM CPU 2.70GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
Frequency=2628184 Hz, Resolution=380.4909 ns, Timer=TSC
.NET Core SDK=3.0.100-preview4-010285
  [Host]     : .NET Core 3.0.0-preview-27410-2 (CoreCLR 4.6.27409.72, CoreFX 4.7.19.10906), 64bit RyuJIT
  Job-ITFHCS : .NET Core 02e65af1-425c-48f3-a957-732fc73de96f (CoreCLR 4.6.27507.0, CoreFX 4.7.19.11001), 64bit RyuJIT
  Job-OSXLCT : .NET Core bbe35a8d-21f4-4de9-bb5f-575be7c45245 (CoreCLR 4.6.27403.0, CoreFX 4.7.19.10901), 64bit RyuJIT

Runtime=Core  InvocationCount=1  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  UnrollFactor=1  
WarmupCount=1

| Type | Method | Toolchain | Count | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|-------------------- |----------- |------------------------------------------------------------------------------------------------------------------------ |------ |---------:|----------:|----------:|---------:|----------:|---------:|------:|--------:|------------:|------------:|------------:|--------------------:|
| BackPorting<Int32> | Dictionary | \corefx\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 32.47 us | 1.5906 us | 1.7680 us | 33.01 us | 29.969 us | 34.92 us | 3.15 | 0.30 | - | - | - | 36864 B |
| BackPorting<Int32> | Dictionary | \corefxupstream\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 10.39 us | 0.8678 us | 0.8911 us | 10.35 us | 9.588 us | 13.01 us | 1.00 | 0.00 | - | - | - | - |
| | | | | | | | | | | | | | | | |
| BackPorting<String> | Dictionary | \corefx\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 51.10 us | 0.5797 us | 0.5139 us | 51.16 us | 50.396 us | 52.30 us | 2.21 | 0.04 | - | - | - | - |
| BackPorting<String> | Dictionary | \corefxupstream\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 23.08 us | 0.4327 us | 0.3836 us | 22.94 us | 22.563 us | 23.70 us | 1.00 | 0.00 | - | - | - | - |

Is this expected(maybe crossgen/optimizations related)?
I mean does it make sense as comparison or it's bettere compare local clr vs local clr updates?

/cc @danmosemsft @benaadams @adamsitnik @jkotas

area-Meta

Source

MarcoRossignoli

All 21 comments

Is this expected (maybe crossgen/optimizations related)?

It is hard to guess. Number of factors is different between your two setups, e.g. CoreCLR 4.6.27507.0 vs. CoreCLR 4.6.27403.0 that may be like one month apart can make a difference; depends how you build your local bits; ... .

If you want to get to the bottom of this, the best way is to look at code in the hot loop in each of the runs and/or profile each of the runs. 2x+ difference should be easy to see.

jkotas on 10 Feb 2019

👍1

@jkotas fast question if I want to dump asm for some method on System.Private.CoreLib Dic.Add for instance(mscorlib) do I follow Dumping native images?

MarcoRossignoli on 11 Feb 2019

@MarcoRossignoli I think that in this particular case your benchmark is most probably unstable.

private Dictionary<string, string> _collection;

[GlobalSetup]
public void Setup() => this._collection = new Dictionary<string, string>();

[IterationSetup]
public void Reset() => _collection.Clear();

[Benchmark]
public void Dictionary() => this._collection.Add("KEY", "VALUE");

You are using [IterationSetup] here which means: before every benchmark iteration, clean the dictionary. The problem is that the implicit behaviour here is also to run the benchmark once per iteration, if IterationSetup/Cleanup was used (InvocationCount=1)

This is translated to:

Setup();

for(int i = 0; i < iterationCount; i++)
{
    Reset();
    var clock = Stopwatch.Start();
    Dictionary();
    clock.Stop();
}

and the Dictionary method is very fast and we simply can't get accurate results with 1 fast operation per iteration. This is why BDN complains about the MinIterationTime when IterationTime < 100ms.

My suggestion here is to use existing benchmarks from the performance repo that test dictionary.Add:

https://github.com/dotnet/performance/blob/98c4fd1518a43fea966f8adfa3518a85de39038c/src/benchmarks/micro/corefx/System.Collections/Add/AddDefaultSize.cs#L60-L67

Please let me know if you still see such a huge difference for using the existing benchmarks.

adamsitnik on 11 Feb 2019

This is why BDN complains about the MinIterationTime when IterationTime < 100ms.

@adamsitnik I saw.
My attempt was test only Add load, because current tests(AddDefaultSize) include ctor and Add, my current update tries to remove null check moving some initialization on ctor. So I'm not sure what I'm measuring, because I get worste results without null check, that should be unexpected(btw I'm wip on it so maybe the issue isn't here).
I mean correct test should be we create 1 dic and measure only Add with a lot of iteration(current test is WIP for now I'm checking asm so I removed iteration noise)

MarcoRossignoli on 11 Feb 2019

Can use the unconditional setter rather than .Add and not use .Clear e.g.

[Benchmark]
public void Dictionary() => this._collection["KEY"] = "VALUE";

benaadams on 11 Feb 2019

👍1

My attempt was test only Add load, because current tests(AddDefaultSize) include ctor and Add

I would not worry about the ctor because it's a very fast operation and moreover if your base and diff contains ctor the benchmark should show you only the difference of Add ((ctor + AddOld) - (ctor + AddNew) = AddOld - AddNew)

The other thing worth trying isto increase the elements count:

https://github.com/dotnet/performance/blob/98c4fd1518a43fea966f8adfa3518a85de39038c/src/benchmarks/micro/corefx/System.Collections/Add/AddDefaultSize.cs#L21

adamsitnik on 11 Feb 2019

👍1

I would not worry about the ctor because it's a very fast operation

I would agree, but after strange results I'm not sure that new code added to ctor is so "light", so to be sure I want to isolate Add code.
BTW I'm investigating!

MarcoRossignoli on 11 Feb 2019

I want to dump asm for some method on System.Private.CoreLib Dic.Add for instance(mscorlib)

We have tiered JIT on by default, and we focus on throughput performance of code produced by the tiered JIT. You should dump the code produced by the tiered JIT, not the native image.

jkotas on 11 Feb 2019

thank's @jkotas first time with this tool for me a pair of question:
1) Is it correct enable tiered compilation with set COMPlus_TieredCompilation_Test_OptimizeTier0=1 on dump I see

; Tier-1 compilation
; optimized code

I think it's ok
2) I cannot filter correct method for instance (set COMPlus_JitDisasm=Dump)

G_M12026_IG02:
       448B01               mov      r8d, dword ptr [rcx]
       49B85033CB0D31020000 mov      r8, 0x2310DCB3350
       498B10               mov      rdx, gword ptr [r8]
       4C8BC2               mov      r8, rdx
       41B902000000         mov      r9d, 2
       E83E53FFFF           call     Dictionary`2:TryInsert(ref,ref,ubyte):bool:this
       90                   nop

I cannot dump method Dictionary`2:TryInsert(ref,ref,ubyte):bool:this but only Dump

MarcoRossignoli on 11 Feb 2019

set COMPlus_JitDisasm=TryInsert

benaadams on 11 Feb 2019

set COMPlus_JitDisasm=TryInsert

no luck

MarcoRossignoli on 11 Feb 2019

Might be being produced at crossgen time; so switch off ReadyToRun to force jitting of everything

set COMPlus_ReadyToRun=0

benaadams on 11 Feb 2019

❤1

set COMPlus_ReadyToRun=0

thank's @benaadams seems to work!

MarcoRossignoli on 11 Feb 2019

@benaadams can you help me again?I want constructor(Dictionary) asm but again I didn't found right filter

set COMPlus_JitDisasm=TryInsert Resize Ctor

I get SystemString Ctor

MarcoRossignoli on 14 Feb 2019

Maybe?

set COMPlus_JitDisasm=Dictionary*

benaadams on 14 Feb 2019

mmm no way, cc also @AndyAyersMS maybe could help with this I want constructor(Dictionary) asm but again I didn't found right filter

MarcoRossignoli on 14 Feb 2019

Its .ctor for the constructor
set COMPlus_JitDisasm=.ctor

Ctor is an explicit method on string

benaadams on 14 Feb 2019

Or if you just want Dictionary

set COMPlus_JitDisasm=Dictionary`2:.ctor

benaadams on 14 Feb 2019

❤1

Ahhh here we go!Did a lot of permutation but never right one!Thank's Ben!

MarcoRossignoli on 14 Feb 2019

If you're not sure of the right string to use, you can always do a run first with COMPlus_DumpJittedMethods=1 to see what syntax for the methods that get jitted (also helps you check the method is indeed being jitted). Though for some reason that uses :: between class and method when we normally use :. The name match parser is supposed to accept either one.