Runtime: Unnecessary mov generated in MemoryMarshal.Cast

Created on 10 Jan 2019  路  18Comments  路  Source: dotnet/runtime

This:
```C#
[MethodImpl(MethodImplOptions.NoInlining)]
public static ReadOnlySpan GetBytes(ReadOnlySpan s) => MemoryMarshal.Cast(s);

results in this asm:

G_M23838_IG01:
4883EC28 sub rsp, 40
90 nop

G_M23838_IG02:
488B02 mov rax, bword ptr [rdx]
8B5208 mov edx, dword ptr [rdx+8]
8BD2 mov edx, edx
48C1E202 shl rdx, 2
4881FAFFFFFF7F cmp rdx, 0x7FFFFFFF
770E ja SHORT G_M23838_IG04
488901 mov bword ptr [rcx], rax
895108 mov dword ptr [rcx+8], edx
488BC1 mov rax, rcx

G_M23838_IG03:
4883C428 add rsp, 40
C3 ret

G_M23838_IG04:
E83384C95F call CORINFO_HELP_OVERFLOW
CC int3
`` but thatmov edx, edx` is superfluous.

@AndyAyersMS is investigating a fix.

arch-x64 area-CodeGen-coreclr enhancement optimization

Most helpful comment

No, haven't yet.

I split the redundant/shuffle stuff off to a WIP PR. See dotnet/coreclr#21959 for updates.

All 18 comments

Adding some notes I'd been passing around in email. We can peephole this in CodeGen::genIntToIntCast and it has promising diffs.

Total bytes of diff: -7389 (-0.02% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -1719 : System.Private.CoreLib.dasm (-0.04% of base)
        -590 : System.Memory.dasm (-0.36% of base)
        -510 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.01% of base)
        -465 : System.Private.Xml.dasm (-0.01% of base)
        -416 : Microsoft.CodeAnalysis.CSharp.dasm (-0.01% of base)
76 total files with size differences (76 improved, 0 regressed), 53 unchanged.
Top method improvements by size (bytes):
         -67 (-7.02% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecMul(byref,byref)
         -48 (-1.07% of base) : System.Memory.dasm - SequenceReader`1:TryCopyMultisegment(struct):bool:this (4 methods)
         -48 (-0.99% of base) : System.Memory.dasm - SequenceReader`1:IsNextSlow(struct,bool):bool:this (4 methods)
         -40 (-0.53% of base) : System.Memory.dasm - ReadOnlySequenceDebugView`1:.ctor(struct):this (5 methods)
         -40 (-1.04% of base) : System.Memory.dasm - SequenceReader`1:ResetReader():this (4 methods)
Top method improvements by size (percentage):
          -2 (-20.00% of base) : System.Private.CoreLib.dasm - UInt32:System.IConvertible.ToInt64(ref):long:this
          -2 (-20.00% of base) : System.Private.CoreLib.dasm - UInt32:System.IConvertible.ToUInt64(ref):long:this
          -4 (-18.18% of base) : System.IO.FileSystem.dasm - FILE_TIME:ToTicks():long:this
          -2 (-18.18% of base) : System.Net.HttpListener.dasm - HttpListenerTimeoutManager:get_MinSendBytesPerSecond():long:this
          -2 (-18.18% of base) : System.Net.NetworkInformation.dasm - SystemIcmpV4Statistics:get_MessagesSent():long:this
2203 total methods with size differences (2203 improved, 0 regressed), 191232 unchanged.

But it currently fails two p0f tests. I have been looking into jit\directed\convert\ldind_conv. The IL here is

IL_0001  02                ldarg.0     
IL_0002  46                ldind.i1    
IL_0003  85                conv.ovf.i8.un

and the jit currently generates

; Assembly listing for method Program:Test_ldind_i1_conv_u8(byref,byref)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   byref  ->  rcx        
;  V01 arg1         [V01,T01] (  3,  3   )   byref  ->  rdx        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M33316_IG01:
       0F1F440000           nop      

G_M33316_IG02:
       480FBE01             movsx    rax, byte  ptr [rcx]
       8BC0                 mov      eax, eax                     // can't remove this!
       488902               mov      qword ptr [rdx], rax

G_M33316_IG03:
       C3                   ret 

The mov here is not quite a no-op as it zeros the upper 32 bits. But that zeroing should be redundant.

I don鈥檛 yet understand why the jit is emitting a full long extension in the instruction before:

       480FBE01             movsx    rax, byte  ptr [rcx]

as I think it should only be widening the byref-loaded byte to an int. Ecma-355 says for ldind

[Note:that is integer values smaller than 4 bytes, a boolean, or a character converted to 4 bytes by sign or zero-extension as appropriate. Floating-point values are converted to F type. end note]

So I think the jit should be emitting

       0FBE01               movsx    eax, byte  ptr [rcx]

which would do a 4 byte extension and set the upper 32 bits to zero, and then the following mov could safely be omitted.

Still debugging how we end up with a wider extension there.

Likely suspect:

https://github.com/dotnet/coreclr/blob/a658a8d51307cf619da538519893bd29d3be0d62/src/jit/emitxarch.cpp#L332-L343

Wonder if I could just add in an INS_movsx4 or similar to keep track of the cases we know should only extend to 4 bytes.

We can peephole this in CodeGen::genIntToIntCast and it has promising diffs.

@AndyAyersMS Hmm, what exactly did you do? Isn't this the same as dotnet/coreclr#12676 ?

My prototype is simplistic and only for x64: SkipZeroExtendingSameRegMov. Should be passing pri0 tests.

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -9028 (-0.02% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -2061 : System.Private.CoreLib.dasm (-0.05% of base)
        -685 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.02% of base)
        -602 : System.Memory.dasm (-0.37% of base)
        -592 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.01% of base)
        -502 : System.Data.Common.dasm (-0.04% of base)
80 total files with size differences (80 improved, 0 regressed), 49 unchanged.
Top method improvements by size (bytes):
         -67 (-7.02% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecMul(byref,byref)
         -48 (-1.07% of base) : System.Memory.dasm - SequenceReader`1:TryCopyMultisegment(struct):bool:this (4 methods)
         -48 (-0.99% of base) : System.Memory.dasm - SequenceReader`1:IsNextSlow(struct,bool):bool:this (4 methods)
         -40 (-0.53% of base) : System.Memory.dasm - ReadOnlySequenceDebugView`1:.ctor(struct):this (5 methods)
         -40 (-1.04% of base) : System.Memory.dasm - SequenceReader`1:ResetReader():this (4 methods)
Top method improvements by size (percentage):
          -2 (-20.00% of base) : System.Private.CoreLib.dasm - UInt32:System.IConvertible.ToInt64(ref):long:this
          -2 (-20.00% of base) : System.Private.CoreLib.dasm - UInt32:System.IConvertible.ToUInt64(ref):long:this
          -4 (-18.18% of base) : System.IO.FileSystem.dasm - FILE_TIME:ToTicks():long:this
          -2 (-18.18% of base) : System.Net.HttpListener.dasm - HttpListenerTimeoutManager:get_MinSendBytesPerSecond():long:this
          -2 (-18.18% of base) : System.Net.NetworkInformation.dasm - SystemIcmpV4Statistics:get_MessagesSent():long:this
3184 total methods with size differences (3184 improved, 0 regressed), 190251 unchanged.

Wonder if something similarly low-tech would work on those register shuffle chains...when emitting a mov, look for the previously emitted instr and suppress if it is a dup or shuffle mov. I suspect you can usually get at the previous instrDesc....

Ah, this is quite different from what I have. My change doesn't actually apply to this case because the tree is "split", there's a lclvar between the producer of the value and the cast.

As far as I can tell your change assumes that codegen follows the X64 zero extension model to the letter. I'm not sure if this is a safe assumption because, to the best of my knowledge, nothing in the JIT relies on this today. See also discussion with @pentp in dotnet/coreclr#21553.

ARM64 already messed this up, despite the ISA following the same zero extension model as X64, ARM64 codegen emits ldrsw for TYP_INT indirs so it sign extends.

And regarding 32 bit movsx - I run into that myself in the cast PR. AFAIR I decided not to add a 32 bit movsx because the Intel optimization manual advice was to use the 64 bit version. Though I need to double check that, the advice might have been out of date.

I'm not confident about the zero extension model either. But if this idea of looking back at the previous instrDesc pans out we could use that to safely catch a lot of cases.

Right, I suppose you could make the emitter remember the previously generated instruction. Or perhaps the last couple of instructions. instrDesc is not fixed size so you can't traverse the group backwards.

Wonder if something similarly low-tech would work on those register shuffle chains...when emitting a mov, look for the previously emitted instr and suppress if it is a dup or shuffle mov. I suspect you can usually get at the previous instrDesc....

There are already all kinds of solutions for the move chain issue and they're likely better, in phase order:

  • Some can be handled by location assertion propagation but it's blocked by struct promotion issues and ADDR nodes. This can be avoided by moving struct promotion handling and morphing of IND(ADDR(LCL_VAR)) trees out of morph, into LocalAddressVisitor. The former is not that simple in the general case but it may be enough to handle only 1 field structs for now.
  • The earlyprop stuff I'm working on for String.IsNullOrEmpty can also burn through chain of copies (see comment https://github.com/dotnet/coreclr/issues/914#issuecomment-451837271)
  • CopyProp could do it but what I did also relies on the SSA use count so in a way it's just a worse variation of the earlyprop stuff
  • Carol's LSRA preferencing stuff

Delaying this until codegen/emit means that codegen will extra work to recognize such cases and the same time you save little in terms of throughput because there's not a lot going on after codegen.

The emitter already remembers.

I'm mainly curious to see how viable it is to do some late peephole style opts via this sort of approach. No matter how good the upstream phases are, things always slip through.

the mov here is not quite a no-op as it zeros the upper 32 bits.

I'm mainly curious to see how viable it is to do some late peephole style opts via this sort of approach.

I am not a fan of peephole opts... According to my experience, blindly removing the integer promotion code is dangerous, especially with managed <-> native interoperations. Usually, managed runtimes promote all the small integers to Int32, and native languages have real small integers, but I am not sure whether CoreCLR needs to worry about this.

Additionally, we probably need to figure out why RyuJIT generates inconsistent code for zero-extension, sometimes movzx and sometimes mov.

According to my experience, blindly removing the integer promotion code is dangerous, especially with managed <-> native interoperations.

Yes, but he's trying to look at the previous instruction to ensure that this can be done safely.

Additionally, we probably need to figure out why RyuJIT generates inconsistent code for zero-extension, sometimes movzx and sometimes mov

Hmm? What issue is that?

We already add explicit casts for unmanaged code that returns small ints (or if we're inter-oping with Jit64 generated code). Look for checkForSmallType in the importer.

I am not advocating blindly removing things (despite what the code on my branch is doing now) -- just trying to get a sense of opportunity, cost and risk by working on some of our CQ problems in this way.

My previous conception (and I think the consensus understanding) is that it wasn't viable to do machine-specific peepholes with what we have, and that we needed something like dotnet/runtime#8035. Now I am thinking simple things might be viable at the codegen/emit boundary, and relatively easy to develop, so: modest benefit, low risk, low cost.

Most every compiler I've worked on did some kind of late machine specific opts. Ryujit already does a bit of this here and there but the optimizations are folded into the late phases and not called out as an explicit thing.

Hmm? What issue is that?

Oops, my mistake, movzx does not operate over 32-bit src 馃槃 . But mov eax, eax is really confusing for most people.

Here's the redundant/shuffle move peephole: PeepholeRedundantMoves.

It shows how a one-instruction lookback might work. It is pretty clunky to compare a not yet emitted instruction with one that's already emitted, but for simple cases it seems tolerable.

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -20032 (-0.05% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -4359 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.08% of base)
       -2809 : System.Private.CoreLib.dasm (-0.07% of base)
       -2497 : Microsoft.CodeAnalysis.CSharp.dasm (-0.06% of base)
       -1474 : System.Private.Xml.dasm (-0.04% of base)
        -834 : System.Memory.dasm (-0.51% of base)
90 total files with size differences (90 improved, 0 regressed), 39 unchanged.
Top method improvements by size (bytes):
        -135 (-2.61% of base) : System.IO.FileSystem.dasm - FileSystemEnumerator`1:MoveNext():bool:this (5 methods)
         -90 (-2.29% of base) : System.Private.CoreLib.dasm - TextInfo:ChangeCaseCommon(ref):ref:this (6 methods)
         -90 (-4.27% of base) : System.Private.CoreLib.dasm - EventPipePayloadDecoder:DecodePayload(byref,struct):ref
         -73 (-1.25% of base) : System.Security.Cryptography.Algorithms.dasm - KeyFormatHelper:ReadEncryptedPkcs8(ref,struct,struct,struct,ref,byref,byref) (5 methods)
         -66 (-2.12% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Scanner:ScanXmlString(ushort,ushort,bool):ref:this
Top method improvements by size (percentage):
          -6 (-15.79% of base) : Microsoft.CodeAnalysis.dasm - ModuleMetadata:CreateFromFile(ref):ref
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - CustomEventAccessorSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetAttributeDeclarations():struct:this
          -6 (-12.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMethodSymbol:GetReturnTypeAttributeDeclarations():struct:this
         -22 (-11.40% of base) : System.Private.Xml.dasm - NameTable:ComputeHash32(ref,int,int):int
3138 total methods with size differences (3138 improved, 0 regressed), 190301 unchanged.

Cherry-picked diff from the above:

;; System.Private.Xml      
;; NameTable:ComputeHash32(ref,int,int):int

 G_M17318_IG06:
        mov      rcx, rax
-       mov      rax, rcx
        imul     ecx, r9d, 2
        jo       SHORT G_M17318_IG08
        mov      rdx, rax               // this is also now dead....
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
-       mov      rdx, rax
-       mov      rax, rdx
        mov      rsi, rax
        mov      edi, ecx
        mov      rcx, 0xD1FFAB1E
        mov      edx, 100
        call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE

Looks great. Did you try on Linux? IIRC, Linux has more redundant movs sometimes (e.g., from multi-reg return).

No, haven't yet.

I split the redundant/shuffle stuff off to a WIP PR. See dotnet/coreclr#21959 for updates.

Updated the fork SkipZeroExtendingSameRegMov with a more conservative version that doesn't use the movsx4 trick and instead uses one instruction lookback to see if the previous instruction wrote the right register and zero extended.

Still gets the motivating case here, and decent handful of other cases, and should be less risky.

We lose a fair number of diffs this way vs my earlier prototype. Some of the diffs can be recovered by going back 2, 3, or 4 instructions (4 seems to be a sweet spot, perhaps) to look for the producer. But we'd need to write up some kind of general dependence check and invoking that repeatedly to walk back starts to get more costly and risky.

Total bytes of diff: -3635 (-0.01% of base)
    diff is an improvement.

Top file improvements by size (bytes):
       -1204 : System.Private.CoreLib.dasm (-0.03% of base)
        -370 : System.Memory.dasm (-0.23% of base)
        -232 : System.Net.Http.dasm (-0.04% of base)
        -208 : System.Net.NetworkInformation.dasm (-0.45% of base)
        -182 : System.Runtime.Numerics.dasm (-0.25% of base)

50 total files with size differences (50 improved, 0 regressed), 79 unchanged.

Top method improvements by size (bytes):
         -40 (-0.53% of base) : System.Memory.dasm - ReadOnlySequenceDebugView`1:.ctor(struct):this (5 methods)
         -32 (-0.83% of base) : System.Memory.dasm - SequenceReader`1:ResetReader():this (4 methods)
         -32 (-0.81% of base) : System.Memory.dasm - SequenceReader`1:GetNextSpan():this (4 methods)
         -32 (-0.66% of base) : System.Memory.dasm - SequenceReader`1:IsNextSlow(struct,bool):bool:this (4 methods)
         -28 (-0.76% of base) : System.Memory.dasm - BuffersExtensions:CopyToMultiSegment(byref,struct) (5 methods)

Top method improvements by size (percentage):
          -2 (-20.00% of base) : System.Private.CoreLib.dasm - UInt32:System.IConvertible.ToInt64(ref):long:this
          -2 (-20.00% of base) : System.Private.CoreLib.dasm - UInt32:System.IConvertible.ToUInt64(ref):long:this
          -4 (-18.18% of base) : System.IO.FileSystem.dasm - FILE_TIME:ToTicks():long:this
          -2 (-18.18% of base) : System.Net.HttpListener.dasm - HttpListenerTimeoutManager:get_MinSendBytesPerSecond():long:this
          -2 (-18.18% of base) : System.Net.NetworkInformation.dasm - SystemIcmpV4Statistics:get_MessagesSent():long:this

884 total methods with size differences (884 improved, 0 regressed), 192555 unchanged.

Updated my branch after Carol's preferencing changes and (not surprisingly) wins still look similar to the above.

I suppose I should put this up as a PR -- while the impact is selective these redundant zero extensions come up quite a bit in span related code.

Was this page helpful?
0 / 5 - 0 ratings