Runtime: Possible JIT improvement for loops with return statements

Created on 21 Feb 2017  路  8Comments  路  Source: dotnet/runtime

Loops with a return statement in the body run slower than those without. It would be good if the JIT had a way to optimize this. If tracking this was too complex perhaps moving return [true|false] statements could be a starting point.

Test code
https://gist.github.com/bbowyersmyth/9514af463745528d8d290e7cd2492660

The very simple loop runs 85.7ns vs 67.4ns (20% difference). The gap can widen with additional instructions added to the body.
Current theory is that this is due to the CPUs complexity rules for the loop stream detector.

Initial suggestion by @jkotas https://github.com/dotnet/coreclr/pull/2667#r49820503
Recent discussion https://github.com/dotnet/coreclr/pull/9213

cc @mikedn

area-CodeGen-coreclr enhancement optimization tenet-performance

Most helpful comment

Some perf results from the low-tech dotnet/coreclr#11192:

  • Fasta (~6%): moved return block out of loop in SelectRandom
  • TreeSort (~5%): compacted loop body in Insert somewhat

There is also what looks like a win in some of the Linq Where tests. Hard to be 100% sure based on diffs since dotnet/coreclr#11192 is moving blocks that I don't expect it to move -- likely a limitation of using BBF_BACKWARDS_JUMP as a crude in-loop detector.

The tree sort case is a good example of why something more general is warranted, as there are three loop exits, two series of non-loop blocks and two backedges. There's likely an even bigger win if we can generally move all the non-loop code out. And we'd want something that works for breaks from inner loops in nests and not just for returns.

I'm going to put this up for consideration in 2.1 since it looks like we really should fix this.

All 8 comments

The inline return will add a bunch of additional pop statements that need to be jumped over making the body of the loop larger

Apart from a branch misprediction are there other factors at play that make skipping over multiple instructions more costly than one for a given iteration?

That branch should be correctly predicted so that's not an issue. One possible explanation for the behavior you are seeing is that predicted and taken branches and slightly more expensive than predicated and not taken branches:

G_M11558_IG05:
       0FB708               movzx    rcx, word  ptr [rax]
       440FB70A             movzx    r9, word  ptr [rdx]
       413BC9               cmp      ecx, r9d
; predicted and taken every loop iteration
       7407                 je       SHORT G_M11558_IG07
       33C0                 xor      eax, eax
G_M11558_IG06:
       4883C418             add      rsp, 24
       C3                   ret
G_M11558_IG07:
       4883C002             add      rax, 2
       4883C202             add      rdx, 2
       41FFC8               dec      r8d
       4585C0               test     r8d, r8d
       75DD                 jne      SHORT G_M11558_IG05

; versus
G_M50079_IG05:
       440FB701             movzx    r8, word  ptr [rcx]
       440FB70A             movzx    r9, word  ptr [rdx]
       453BC1               cmp      r8d, r9d
; predicted and NOT taken every loop iteration
       7518                 jne      SHORT G_M50079_IG08
       4883C102             add      rcx, 2
       4883C202             add      rdx, 2
       FFC8                 dec      eax
       85C0                 test     eax, eax
       75E5                 jne      SHORT G_M50079_IG05

Anyway, it's best to pull non-loop code out of loops for various reasons including branch cost and LSD.

Some perf results from the low-tech dotnet/coreclr#11192:

  • Fasta (~6%): moved return block out of loop in SelectRandom
  • TreeSort (~5%): compacted loop body in Insert somewhat

There is also what looks like a win in some of the Linq Where tests. Hard to be 100% sure based on diffs since dotnet/coreclr#11192 is moving blocks that I don't expect it to move -- likely a limitation of using BBF_BACKWARDS_JUMP as a crude in-loop detector.

The tree sort case is a good example of why something more general is warranted, as there are three loop exits, two series of non-loop blocks and two backedges. There's likely an even bigger win if we can generally move all the non-loop code out. And we'd want something that works for breaks from inner loops in nests and not just for returns.

I'm going to put this up for consideration in 2.1 since it looks like we really should fix this.

No idea how advanced LSDs are but is it still able to optimize if the body is not in the loop?

This is a common and measurable optimization on our tightest code. Definitely a win if we can avoid writing so many go-to statements.

this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for https://github.com/dotnet/coreclr/issues/9692

this can be removed
C:\git\coreclr\src\mscorlib\src\System\String.Comparison.cs:
47: goto ReturnCharAMinusCharB; // TODO: Workaround for dotnet/coreclr#9692

See https://github.com/dotnet/coreclr/issues/13466

Was this page helpful?
0 / 5 - 0 ratings