Runtime: Bmi2 MultiplyNoFlags2

Created on 19 Nov 2020  路  15Comments  路  Source: dotnet/runtime

Background and Motivation

In .NET 3.1 we introduced the following intrinsics to expose mulx x86 instruction:

  • uint MultiplyNoFlags(uint left, uint right, uint* low)
  • ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)

where low is an out parameter that is used to return the lower 32-bit/64-bit part of 64-bit/128-bit result of left * right multiplication while the return value contains the upper part.

When the instrinsics are used the JIT produces sub-optimal code due to the fact that low has "address-taken" attribute.

For example, the following C# methods
```c#
static unsafe uint mulx(uint a, uint b)
{
uint r;
return Bmi2.MultiplyNoFlags(a, b, &r) + r;
}

static unsafe ulong mulx_64(ulong a, ulong b)
{
ulong r;
return Bmi2.X64.MultiplyNoFlags(a, b, &r) + r;
}

will be compiled down to the following code by the current implementation of the JIT
**mulx**
```asm
G_M48748_IG01:              ;; offset=0000H
       50                   push     rax
       33C0                 xor      rax, rax
       89442404             mov      dword ptr [rsp+04H], eax
       89542418             mov      dword ptr [rsp+18H], edx
                                                ;; bbWeight=1    PerfScore 3.25
G_M48748_IG02:              ;; offset=000BH
       488D442404           lea      rax, bword ptr [rsp+04H]
       448B442418           mov      r8d, dword ptr [rsp+18H]
       8BD1                 mov      edx, ecx
       C4C233F6D0           mulx     edx, r9d, r8d
       448908               mov      dword ptr [rax], r9d
       8BC2                 mov      eax, edx
       03442404             add      eax, dword ptr [rsp+04H]
                                                ;; bbWeight=1    PerfScore 7.00
G_M48748_IG03:              ;; offset=0025H
       4883C408             add      rsp, 8
       C3                   ret

mulx_64

G_M55976_IG01:              ;; offset=0000H
       50                   push     rax
       33C0                 xor      rax, rax
       48890424             mov      qword ptr [rsp], rax
       4889542418           mov      qword ptr [rsp+18H], rdx
                                                ;; bbWeight=1    PerfScore 3.25
G_M55976_IG02:              ;; offset=000CH
       488D0424             lea      rax, bword ptr [rsp]
       4C8B442418           mov      r8, qword ptr [rsp+18H]
       488BD1               mov      rdx, rcx
       C4C2B3F6D0           mulx     rdx, r9, r8
       4C8908               mov      qword ptr [rax], r9
       488BC2               mov      rax, rdx
       48030424             add      rax, qword ptr [rsp]
                                                ;; bbWeight=1    PerfScore 7.00
G_M55976_IG03:              ;; offset=0027H
       4883C408             add      rsp, 8
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.25

However, if the Bmi2.MultiplyNoFlags were implemented instead as
```c#
public static unsafe uint MultiplyNoFlags(uint left, uint right, uint* low)
{
var result = MultiplyNoFlags2(left, right); *low = result.Item1; return result.Item2;
}

public static unsafe ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)
{
var result = MultiplyNoFlags2(left, right); *low = result.Item1; return result.Item2;
}

the JIT as in #37928 would inline `MultiplyNoFlags` and be able to *remove* the address-taken attribute from a local corresponding to `low`:
**mulx**
```asm
G_M48748_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M48748_IG02:              ;; offset=0000H
       C4E27BF6D1           mulx     edx, eax, ecx
       03C2                 add      eax, edx
                                                ;; bbWeight=1    PerfScore 3.25
G_M48748_IG03:              ;; offset=0007H
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

mulx_64

G_M55976_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M55976_IG02:              ;; offset=0000H
       C4E2FBF6D1           mulx     rdx, rax, rcx
       4803C2               add      rax, rdx
                                                ;; bbWeight=1    PerfScore 3.25
G_M55976_IG03:              ;; offset=0008H
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

Proposed API

```c#
namespace System.Runtime.Intrinsics.X86
{
public abstract class Bmi2 : X86Base
{
public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right);

    public abstract class X64: X86Base.X64
    {
        public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right);
    }
}

}
```

Based on work Carol did in #37928

cc @CarolEidt @tannergooding

api-ready-for-review arch-x64 arch-x86 area-System.Runtime.Intrinsics

All 15 comments

Tagging subscribers to this area: @tannergooding, @jeffhandley
See info in area-owners.md if you want to be subscribed.


Issue Details

```c#
namespace System.Runtime.Intrinsics.X86
{
public abstract class Bmi2 : X86Base
{
public static (uint, uint) MultiplyNoFlags2(uint left, uint right);

    public abstract class X64: X86Base.X64
    {
        public static (ulong, ulong) MultiplyNoFlags2(ulong left, ulong right);
    }
}

}
```

Based on work Carol did in #37928

cc @CarolEidt @tannergooding

Author: echesakovMSFT
Assignees: -
Labels: `arch-x64`, `arch-x86`, `area-System.Runtime.Intrinsics`
Milestone: -

It this a workaround for JIT not being able to optimize ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)?

It would be useful to follow the API review template for the top post. Background and Motivation for this one is not obvious.

It this a workaround for JIT not being able to optimize ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)?

Yes. Currently the only way the JIT can handle a node that produces multiple registers is if it produces a struct. An alternative to explicitly adding an API that returns a struct (ValueTuple) would be to synthesize such a struct in the JIT. However, that is not something the JIT currently can do.

Note that, by adding this API and having ulong MultiplyNoFlags(ulong left, ulong right, ulong* low) call it, we can generate identical code for it.

Should consider giving names to the returned tuple elements, e.g., _(int Hi, int Lo)_.

I agree, but prefer Upper and Lower, which we use in most of the other intrinsic APIs.

I updated the description to include Background and Motivation as Jan suggested and added the names to the tuple elements.

Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.

Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.

From my understanding, it would be quite tricky. The JIT doesn't have any mechanism at the moment to prove that an address is not leaking when it's passed to an intrinsic for the same reasons as when it's passed to a function that was not inlined.

Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.

From my understanding, it would be quite tricky. The JIT doesn't have any mechanism at the moment to prove that an address is not leaking when it's passed to an intrinsic for the same reasons as when it's passed to a function that was not inlined.

Also, it is not sufficient to prove that the address doesn't leak. It's also problematic to transform the IR into something that allows the backend to put both results in a register, when the IR has it as a reference. Really, in order to do that, by the time it gets to the backend it needs to be a single instruction that defines two registers. And for now, the best way (really, the only way without major changes) is to have it produce a two-field struct.

Yeah, I figured that the real problem is the representation and not just the address taken that the description at the top highlights.

I figured that the real problem is the representation and not just the address taken

Well, they're both real problems. It would be somewhat easier to transform just the backend to accommodate this (though still requiring major changes), but by that point we would already forced the variable to live on the stack.

I propose naming the new methods Bmi2.Multiply instead of MultiplyNoFlags2 - the flags/no-flags effect isn't observable in any way (at least not until flags returning methods are added, but these would be new methods then).

Video

  • It seems there are two separate issues:

    1. Performance issue, which we can solve with a private intrinsic that uses tuples.

    2. Usability issue due to use of pointers

  • We should fix (1) which doesn't require new public APIs but we should also do a pass of all the APIs that we believe need out or tuple overloads

```C#
namespace System.Runtime.Intrinsics.X86
{
public abstract class Bmi2 : X86Base
{
public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right);

    public abstract class X64: X86Base.X64
    {
        public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right);
    }
}

}
```

@terrajobst, did you mean to close this? Is there a separate issue tracking "we should also do a pass of all the APIs that we believe need out or tuple overloads"? Or for that matter a separate issue tracking doing the perf fix for "Performance issue, which we can solve with a private intrinsic that uses tuples."?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

nvivo picture nvivo  路  174Comments

galvesribeiro picture galvesribeiro  路  185Comments

syeshchenko picture syeshchenko  路  199Comments

ghuntley picture ghuntley  路  158Comments

akoeplinger picture akoeplinger  路  207Comments