I'm working on https://github.com/dotnet/corefx/issues/33392
If I compare my local clr bit vs corefx upstream I see great difference for a simple test with no difference(local clr/fx compiled as release)
BenchmarkDotNet=v0.11.3.936-nightly, OS=Windows 10.0.17134.523 (1803/April2018Update/Redstone4)
Intel Core i7-3740QM CPU 2.70GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
Frequency=2628184 Hz, Resolution=380.4909 ns, Timer=TSC
.NET Core SDK=3.0.100-preview4-010285
[Host] : .NET Core 3.0.0-preview-27410-2 (CoreCLR 4.6.27409.72, CoreFX 4.7.19.10906), 64bit RyuJIT
Job-ITFHCS : .NET Core 02e65af1-425c-48f3-a957-732fc73de96f (CoreCLR 4.6.27507.0, CoreFX 4.7.19.11001), 64bit RyuJIT
Job-OSXLCT : .NET Core bbe35a8d-21f4-4de9-bb5f-575be7c45245 (CoreCLR 4.6.27403.0, CoreFX 4.7.19.10901), 64bit RyuJIT
Runtime=Core InvocationCount=1 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 UnrollFactor=1
WarmupCount=1
| Type | Method | Toolchain | Count | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|-------------------- |----------- |------------------------------------------------------------------------------------------------------------------------ |------ |---------:|----------:|----------:|---------:|----------:|---------:|------:|--------:|------------:|------------:|------------:|--------------------:|
| BackPorting<Int32> | Dictionary | \corefx\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 32.47 us | 1.5906 us | 1.7680 us | 33.01 us | 29.969 us | 34.92 us | 3.15 | 0.30 | - | - | - | 36864 B |
| BackPorting<Int32> | Dictionary | \corefxupstream\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 10.39 us | 0.8678 us | 0.8911 us | 10.35 us | 9.588 us | 13.01 us | 1.00 | 0.00 | - | - | - | - |
| | | | | | | | | | | | | | | | |
| BackPorting<String> | Dictionary | \corefx\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 51.10 us | 0.5797 us | 0.5139 us | 51.16 us | 50.396 us | 52.30 us | 2.21 | 0.04 | - | - | - | - |
| BackPorting<String> | Dictionary | \corefxupstream\artifacts\bin\testhost\netcoreapp-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\9.9.9\CoreRun.exe | 512 | 23.08 us | 0.4327 us | 0.3836 us | 22.94 us | 22.563 us | 23.70 us | 1.00 | 0.00 | - | - | - | - |
Is this expected(maybe crossgen/optimizations related)?
I mean does it make sense as comparison or it's bettere compare local clr vs local clr updates?
/cc @danmosemsft @benaadams @adamsitnik @jkotas
Is this expected (maybe crossgen/optimizations related)?
It is hard to guess. Number of factors is different between your two setups, e.g. CoreCLR 4.6.27507.0 vs. CoreCLR 4.6.27403.0 that may be like one month apart can make a difference; depends how you build your local bits; ... .
If you want to get to the bottom of this, the best way is to look at code in the hot loop in each of the runs and/or profile each of the runs. 2x+ difference should be easy to see.
@jkotas fast question if I want to dump asm for some method on System.Private.CoreLib Dic.Add for instance(mscorlib) do I follow Dumping native images?
@MarcoRossignoli I think that in this particular case your benchmark is most probably unstable.
private Dictionary<string, string> _collection;
[GlobalSetup]
public void Setup() => this._collection = new Dictionary<string, string>();
[IterationSetup]
public void Reset() => _collection.Clear();
[Benchmark]
public void Dictionary() => this._collection.Add("KEY", "VALUE");
You are using [IterationSetup] here which means: before every benchmark iteration, clean the dictionary. The problem is that the implicit behaviour here is also to run the benchmark once per iteration, if IterationSetup/Cleanup was used (InvocationCount=1)
This is translated to:
Setup();
for(int i = 0; i < iterationCount; i++)
{
Reset();
var clock = Stopwatch.Start();
Dictionary();
clock.Stop();
}
and the Dictionary method is very fast and we simply can't get accurate results with 1 fast operation per iteration. This is why BDN complains about the MinIterationTime when IterationTime < 100ms.
My suggestion here is to use existing benchmarks from the performance repo that test dictionary.Add:
Please let me know if you still see such a huge difference for using the existing benchmarks.
This is why BDN complains about the MinIterationTime when IterationTime < 100ms.
@adamsitnik I saw.
My attempt was test only Add load, because current tests(AddDefaultSize) include ctor and Add, my current update tries to remove null check moving some initialization on ctor. So I'm not sure what I'm measuring, because I get worste results without null check, that should be unexpected(btw I'm wip on it so maybe the issue isn't here).
I mean correct test should be we create 1 dic and measure only Add with a lot of iteration(current test is WIP for now I'm checking asm so I removed iteration noise)
Can use the unconditional setter rather than .Add and not use .Clear e.g.
[Benchmark]
public void Dictionary() => this._collection["KEY"] = "VALUE";
My attempt was test only Add load, because current tests(AddDefaultSize) include ctor and Add
I would not worry about the ctor because it's a very fast operation and moreover if your base and diff contains ctor the benchmark should show you only the difference of Add ((ctor + AddOld) - (ctor + AddNew) = AddOld - AddNew)
The other thing worth trying isto increase the elements count:
I would not worry about the
ctorbecause it's a very fast operation
I would agree, but after strange results I'm not sure that new code added to ctor is so "light", so to be sure I want to isolate Add code.
BTW I'm investigating!
I want to dump asm for some method on
System.Private.CoreLibDic.Add for instance(mscorlib)
We have tiered JIT on by default, and we focus on throughput performance of code produced by the tiered JIT. You should dump the code produced by the tiered JIT, not the native image.
thank's @jkotas first time with this tool for me a pair of question:
1) Is it correct enable tiered compilation with set COMPlus_TieredCompilation_Test_OptimizeTier0=1 on dump I see
; Tier-1 compilation
; optimized code
I think it's ok
2) I cannot filter correct method for instance (set COMPlus_JitDisasm=Dump)
G_M12026_IG02:
448B01 mov r8d, dword ptr [rcx]
49B85033CB0D31020000 mov r8, 0x2310DCB3350
498B10 mov rdx, gword ptr [r8]
4C8BC2 mov r8, rdx
41B902000000 mov r9d, 2
E83E53FFFF call Dictionary`2:TryInsert(ref,ref,ubyte):bool:this
90 nop
I cannot dump method Dictionary`2:TryInsert(ref,ref,ubyte):bool:this but only Dump
set COMPlus_JitDisasm=TryInsert
set COMPlus_JitDisasm=TryInsert
no luck
Might be being produced at crossgen time; so switch off ReadyToRun to force jitting of everything
set COMPlus_ReadyToRun=0
set COMPlus_ReadyToRun=0
thank's @benaadams seems to work!
@benaadams can you help me again?I want constructor(Dictionary) asm but again I didn't found right filter
set COMPlus_JitDisasm=TryInsert Resize Ctor
I get SystemString Ctor
Maybe?
set COMPlus_JitDisasm=Dictionary*
mmm no way, cc also @AndyAyersMS maybe could help with this I want constructor(Dictionary) asm but again I didn't found right filter
Its .ctor for the constructor
set COMPlus_JitDisasm=.ctor
Ctor is an explicit method on string
Or if you just want Dictionary
set COMPlus_JitDisasm=Dictionary`2:.ctor
Ahhh here we go!Did a lot of permutation but never right one!Thank's Ben!
If you're not sure of the right string to use, you can always do a run first with COMPlus_DumpJittedMethods=1 to see what syntax for the methods that get jitted (also helps you check the method is indeed being jitted). Though for some reason that uses :: between class and method when we normally use :. The name match parser is supposed to accept either one.
thank's guys for help!For now I've all info I need to go on!
I'll close.