Loops with a return statement in the body run slower than those without. It would be good if the JIT had a way to optimize this. If tracking this was too complex perhaps moving return [true|false] statements could be a starting point.
Test code
https://gist.github.com/bbowyersmyth/9514af463745528d8d290e7cd2492660
The very simple loop runs 85.7ns vs 67.4ns (20% difference). The gap can widen with additional instructions added to the body.
Current theory is that this is due to the CPUs complexity rules for the loop stream detector.
Initial suggestion by @jkotas https://github.com/dotnet/coreclr/pull/2667#r49820503
Recent discussion https://github.com/dotnet/coreclr/pull/9213
cc @mikedn
The inline return will add a bunch of additional pop statements that need to be jumped over making the body of the loop larger
Apart from a branch misprediction are there other factors at play that make skipping over multiple instructions more costly than one for a given iteration?
That branch should be correctly predicted so that's not an issue. One possible explanation for the behavior you are seeing is that predicted and taken branches and slightly more expensive than predicated and not taken branches:
G_M11558_IG05:
0FB708 movzx rcx, word ptr [rax]
440FB70A movzx r9, word ptr [rdx]
413BC9 cmp ecx, r9d
; predicted and taken every loop iteration
7407 je SHORT G_M11558_IG07
33C0 xor eax, eax
G_M11558_IG06:
4883C418 add rsp, 24
C3 ret
G_M11558_IG07:
4883C002 add rax, 2
4883C202 add rdx, 2
41FFC8 dec r8d
4585C0 test r8d, r8d
75DD jne SHORT G_M11558_IG05
; versus
G_M50079_IG05:
440FB701 movzx r8, word ptr [rcx]
440FB70A movzx r9, word ptr [rdx]
453BC1 cmp r8d, r9d
; predicted and NOT taken every loop iteration
7518 jne SHORT G_M50079_IG08
4883C102 add rcx, 2
4883C202 add rdx, 2
FFC8 dec eax
85C0 test eax, eax
75E5 jne SHORT G_M50079_IG05
Anyway, it's best to pull non-loop code out of loops for various reasons including branch cost and LSD.
Some perf results from the low-tech dotnet/coreclr#11192:
SelectRandomInsert somewhatThere is also what looks like a win in some of the Linq Where tests. Hard to be 100% sure based on diffs since dotnet/coreclr#11192 is moving blocks that I don't expect it to move -- likely a limitation of using BBF_BACKWARDS_JUMP as a crude in-loop detector.
The tree sort case is a good example of why something more general is warranted, as there are three loop exits, two series of non-loop blocks and two backedges. There's likely an even bigger win if we can generally move all the non-loop code out. And we'd want something that works for breaks from inner loops in nests and not just for returns.
I'm going to put this up for consideration in 2.1 since it looks like we really should fix this.
No idea how advanced LSDs are but is it still able to optimize if the body is not in the loop?
This is a common and measurable optimization on our tightest code. Definitely a win if we can avoid writing so many go-to statements.
this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for https://github.com/dotnet/coreclr/issues/9692
this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for dotnet/coreclr#9692
Most helpful comment
Some perf results from the low-tech dotnet/coreclr#11192:
SelectRandomInsertsomewhatThere is also what looks like a win in some of the Linq
Wheretests. Hard to be 100% sure based on diffs since dotnet/coreclr#11192 is moving blocks that I don't expect it to move -- likely a limitation of usingBBF_BACKWARDS_JUMPas a crude in-loop detector.The tree sort case is a good example of why something more general is warranted, as there are three loop exits, two series of non-loop blocks and two backedges. There's likely an even bigger win if we can generally move all the non-loop code out. And we'd want something that works for breaks from inner loops in nests and not just for returns.
I'm going to put this up for consideration in 2.1 since it looks like we really should fix this.