Recently I wrote a set of benchmarks to test Unity's Burst compiler against native compilers. I also included Mono and CoreCLR out of curiosity; the code is available here in the .NET folder. I noticed strange results in two tests (Sieve of Eratosthenes and Particle Kinematics) where CoreCLR performs much slower than Mono for some reason. I think this requires in-depth analysis by the appropriate .NET Core developers.
I'm happy to provide any additional info or assistance if required.
category:cq
theme:needs-triage
skill-level:expert
cost:medium
@nxrighthere have you tested Mono-LLVM-JIT (.NET 5 runtime) by the way? 🙂 With recent changes it supports "fast-math" and all Math(F) methods are @llvm.intrinsics (let me know if you need any help setting it up)
Looking at the tests, you may need to do 30+ iterations of the methods for .NET Core 3.0 prior to doing the measurement to allow tiered compilation to kick in.
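The warm-up suggested above could be sketched roughly as follows. This is a minimal illustration, not code from the linked repository: `CountPrimes`, `Measure`, and the warm-up count of 50 are hypothetical placeholders, and the sketch only assumes that calling a method repeatedly before timing gives tiered compilation a chance to promote it.

```csharp
using System;
using System.Diagnostics;

public static class WarmupSketch
{
    // Hypothetical stand-in for a benchmark body (a simple sieve);
    // it is not the implementation used in the linked benchmarks.
    public static int CountPrimes(int limit)
    {
        var composite = new bool[limit + 1];
        int count = 0;
        for (int i = 2; i <= limit; i++)
        {
            if (composite[i]) continue;
            count++;
            // Mark multiples of i, starting at i*i; long avoids int overflow.
            for (long j = (long)i * i; j <= limit; j += i)
                composite[j] = true;
        }
        return count;
    }

    // Calls the workload warmupCalls times untimed, then times one more call.
    public static TimeSpan Measure(Func<int> work, int warmupCalls)
    {
        for (int i = 0; i < warmupCalls; i++)
            work();

        var sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        // 50 is an arbitrary choice above the "30+" mentioned in the comment.
        TimeSpan elapsed = Measure(() => CountPrimes(1_000_000), warmupCalls: 50);
        Console.WriteLine($"Measured call: {elapsed.TotalMilliseconds:F3} ms");
    }
}
```

Promotion to the optimized tier happens in the background, so a harness like this may still catch some calls running unoptimized code; comparing against a run with tiering disabled is a useful cross-check.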
@EgorBo Thanks for the suggestion, going to try to play with it for sure. 👍
@benaadams Yea, I was thinking about it too, going to install .NET Core 3.0 then. Thanks.
Dry run:
(CoreCLR 3.0 vs .NET 5 (Mono-LLVM-JIT runtime); our LLVM backend currently uses only a few optimization passes)
Ubuntu 18
Core i7 4930K (Ivy Bridge)
Also, it seems you use stackalloc a lot, so it probably makes sense to clear InitLocals for the whole project (so the memory will not have to be zeroed every time you allocate it)
@benaadams I've tried .NET Core 3.0.100-preview9 and engaged tiered compilation through heavy iterations, but the results are almost the same, unfortunately. 😢
@EgorBo Interesting, I've never heard of InitLocals before. I see many articles about reflection stuff, but I'm not sure how to use it properly.
cc @BruceForstall @sergiy-k
@dotnet/jit-contrib
@nxrighthere Have you tried disabling tiered compilation (set COMPlus_TieredCompilation=0) to force tier 1 compilation from the start?
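When comparing runs with and without this setting, it can help to print the value next to the results so the two sets of numbers are not mixed up. The sketch below is only an assumption about how one might do that: `TieringReport` and `Describe` are hypothetical names, and the code reads just the `COMPlus_TieredCompilation` environment variable mentioned above, not any other configuration source.

```csharp
using System;

public static class TieringReport
{
    // Turns the raw environment value into a short label for benchmark logs.
    // This sketch treats "0" as "disabled"; an unset variable means the
    // runtime default applies (unless tiering is configured elsewhere).
    public static string Describe(string rawValue)
    {
        if (string.IsNullOrEmpty(rawValue))
            return "default";
        return rawValue.Trim() == "0" ? "disabled" : $"set to '{rawValue.Trim()}'";
    }

    public static void Main()
    {
        var raw = Environment.GetEnvironmentVariable("COMPlus_TieredCompilation");
        Console.WriteLine($"COMPlus_TieredCompilation: {Describe(raw)}");
    }
}
```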
Have you seen https://github.com/dotnet/performance? Maybe you should consider contributing the benchmarks to that set, to be run regularly on .NET?
@BruceForstall Indeed, -set COMPlus_TieredCompilation=0 solved this, here's the diff with disabled Tiered Compilation.
Have you seen https://github.com/dotnet/performance? Maybe you should consider contributing the benchmarks to that set, to be run regularly on .NET?
I was not aware of this repository, will consider contributing directly into it, thank you.
Should I close this issue or keep it open?
@nxrighthere Your linked repo mentions ".NET Core 2.2.402". Have you tried with the latest .NET Core 3.0 build to see if there is any difference?
We're always looking for good benchmarks to use for performance comparison. It looks like you've found some where there are perf gaps between RyuJIT and other options that could be investigated.
Should I close this issue or keep it open?
Seems reasonable to keep it open for now.
@BruceForstall Here's the diff with results for 3.0.100-rc1. There's only one noticeable difference: recursive Fibonacci is slower by 22% with the new version; all other tests remain at nearly the same numbers.
@EgorBo I'm a bit lost with Mono's LLVM. Is the 6.0.0.334 version on the website able to compile the code with --aot=llvm,llvmllc="-mcpu=* -fp-contract=fast"? Also, what should the -mcpu parameter be set to for AMD FX (Vishera)? Thanks.
@nxrighthere --aot=llvm,mcpu=native --ffast-math
But it will be slower than what I tested (mono-netcore runtime, LLVM JIT): you are going to benchmark "legacy" Mono with LLVM AOT (which has some limitations).
It's a bit difficult to set up mono-netcore for now (netcore/./build.sh --llvm -c Release)
@EgorBo Hey Egor, is it possible to build the runtime with LLVM JIT from master on Windows right now?
Looks like we never drilled in to understand why Core is slower -- seems like we ought to do so, there may be one or two things there we can address without needing entire new classes of optimization.
Well, in general, it's all fine right now except places where floating-point arithmetic is involved, since as far as I know there's no equivalent to -ffast-math / /fp:fast in .NET Core.
Is there some writeup you can point me at with more details?
Related issue https://github.com/dotnet/runtime/issues/12753
Thanks. I was actually looking for analysis showing that fast fp is the root cause of the perf differences in Core vs Mono-LLVM. I suspect there's more going on than just that...
Cc @tannergooding
@AndyAyersMS I think one of the low hanging fruits is a*b+c to fma recognition.
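To illustrate what such a recognition would change, here is a small sketch comparing a plain `a * b + c` with an explicit `Math.FusedMultiplyAdd` call, assuming .NET Core 3.0 or later where that method is available. The class name and the chosen inputs are illustrative only. The inputs are picked so that the intermediate rounding of the product is visible, which also shows why a compiler cannot substitute one form for the other without changing results.

```csharp
using System;

public static class FmaSketch
{
    // Two roundings: the product is rounded to double, then the sum is rounded.
    public static double MulAdd(double a, double b, double c) => a * b + c;

    // One rounding: the product is kept exact until the final addition.
    public static double Fused(double a, double b, double c) => Math.FusedMultiplyAdd(a, b, c);

    public static void Main()
    {
        double eps = 1.0 / (1L << 30);   // 2^-30, exactly representable
        double a = 1.0 + eps;
        double b = 1.0 - eps;
        double c = -1.0;

        // Exact product is 1 - 2^-60, which rounds to 1.0 as a double,
        // so the unfused form yields 0 while the fused form keeps -2^-60.
        Console.WriteLine($"a * b + c          = {MulAdd(a, b, c):E}");
        Console.WriteLine($"FusedMultiplyAdd   = {Fused(a, b, c):E}");
    }
}
```

Because the two forms can differ in the last bits like this, contracting `a * b + c` automatically is a semantics change, which is presumably why it ties into the fast-math discussion in this thread.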
@nxrighthere we are still moving things here and there, but it's already possible for macOS and Linux:
./build.sh -c Release /p:MonoEnableLLVM=true
then cd src/mono/netcore
and do
make run-sample
After that you should see a .dotnet-mono folder in the repo root (make sure MONO_ENV_OPTIONS=--llvm is set as an environment variable when you use it to run benchmarks)
After upgrading to .NET 5 Preview 8, I noticed a significant regression in this recursive Fibonacci test. Execution is slower by 40% vs .NET Core 3.1.101 while in other tests .NET 5 shows better results.
If you're talking about a regression in CoreCLR perf, it's likely because of #35020.