In .NET 3.1 we introduced the following intrinsics to expose mulx x86 instruction:
uint MultiplyNoFlags(uint left, uint right, uint* low)ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)where low is an out parameter that is used to return the lower 32-bit/64-bit part of 64-bit/128-bit result of left * right multiplication while the return value contains the upper part.
When the instrinsics are used the JIT produces sub-optimal code due to the fact that low has "address-taken" attribute.
For example, the following C# methods
```c#
static unsafe uint mulx(uint a, uint b)
{
uint r;
return Bmi2.MultiplyNoFlags(a, b, &r) + r;
}
static unsafe ulong mulx_64(ulong a, ulong b)
{
ulong r;
return Bmi2.X64.MultiplyNoFlags(a, b, &r) + r;
}
will be compiled down to the following code by the current implementation of the JIT
**mulx**
```asm
G_M48748_IG01: ;; offset=0000H
50 push rax
33C0 xor rax, rax
89442404 mov dword ptr [rsp+04H], eax
89542418 mov dword ptr [rsp+18H], edx
;; bbWeight=1 PerfScore 3.25
G_M48748_IG02: ;; offset=000BH
488D442404 lea rax, bword ptr [rsp+04H]
448B442418 mov r8d, dword ptr [rsp+18H]
8BD1 mov edx, ecx
C4C233F6D0 mulx edx, r9d, r8d
448908 mov dword ptr [rax], r9d
8BC2 mov eax, edx
03442404 add eax, dword ptr [rsp+04H]
;; bbWeight=1 PerfScore 7.00
G_M48748_IG03: ;; offset=0025H
4883C408 add rsp, 8
C3 ret
mulx_64
G_M55976_IG01: ;; offset=0000H
50 push rax
33C0 xor rax, rax
48890424 mov qword ptr [rsp], rax
4889542418 mov qword ptr [rsp+18H], rdx
;; bbWeight=1 PerfScore 3.25
G_M55976_IG02: ;; offset=000CH
488D0424 lea rax, bword ptr [rsp]
4C8B442418 mov r8, qword ptr [rsp+18H]
488BD1 mov rdx, rcx
C4C2B3F6D0 mulx rdx, r9, r8
4C8908 mov qword ptr [rax], r9
488BC2 mov rax, rdx
48030424 add rax, qword ptr [rsp]
;; bbWeight=1 PerfScore 7.00
G_M55976_IG03: ;; offset=0027H
4883C408 add rsp, 8
C3 ret
;; bbWeight=1 PerfScore 1.25
However, if the Bmi2.MultiplyNoFlags were implemented instead as
```c#
public static unsafe uint MultiplyNoFlags(uint left, uint right, uint* low)
{
var result = MultiplyNoFlags2(left, right); *low = result.Item1; return result.Item2;
}
public static unsafe ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)
{
var result = MultiplyNoFlags2(left, right); *low = result.Item1; return result.Item2;
}
the JIT as in #37928 would inline `MultiplyNoFlags` and be able to *remove* the address-taken attribute from a local corresponding to `low`:
**mulx**
```asm
G_M48748_IG01: ;; offset=0000H
;; bbWeight=1 PerfScore 0.00
G_M48748_IG02: ;; offset=0000H
C4E27BF6D1 mulx edx, eax, ecx
03C2 add eax, edx
;; bbWeight=1 PerfScore 3.25
G_M48748_IG03: ;; offset=0007H
C3 ret
;; bbWeight=1 PerfScore 1.00
mulx_64
G_M55976_IG01: ;; offset=0000H
;; bbWeight=1 PerfScore 0.00
G_M55976_IG02: ;; offset=0000H
C4E2FBF6D1 mulx rdx, rax, rcx
4803C2 add rax, rdx
;; bbWeight=1 PerfScore 3.25
G_M55976_IG03: ;; offset=0008H
C3 ret
;; bbWeight=1 PerfScore 1.00
```c#
namespace System.Runtime.Intrinsics.X86
{
public abstract class Bmi2 : X86Base
{
public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right);
public abstract class X64: X86Base.X64
{
public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right);
}
}
}
```
Based on work Carol did in #37928
cc @CarolEidt @tannergooding
Tagging subscribers to this area: @tannergooding, @jeffhandley
See info in area-owners.md if you want to be subscribed.
Issue Details
```c#
namespace System.Runtime.Intrinsics.X86
{
public abstract class Bmi2 : X86Base
{
public static (uint, uint) MultiplyNoFlags2(uint left, uint right);
public abstract class X64: X86Base.X64
{
public static (ulong, ulong) MultiplyNoFlags2(ulong left, ulong right);
}
}
}
```
Based on work Carol did in #37928
cc @CarolEidt @tannergooding
| Author: | echesakovMSFT |
|---|---|
| Assignees: | - |
| Labels: | `arch-x64`, `arch-x86`, `area-System.Runtime.Intrinsics` |
| Milestone: | - |
It this a workaround for JIT not being able to optimize ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)?
It would be useful to follow the API review template for the top post. Background and Motivation for this one is not obvious.
It this a workaround for JIT not being able to optimize ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)?
Yes. Currently the only way the JIT can handle a node that produces multiple registers is if it produces a struct. An alternative to explicitly adding an API that returns a struct (ValueTuple) would be to synthesize such a struct in the JIT. However, that is not something the JIT currently can do.
Note that, by adding this API and having ulong MultiplyNoFlags(ulong left, ulong right, ulong* low) call it, we can generate identical code for it.
Should consider giving names to the returned tuple elements, e.g., _(int Hi, int Lo)_.
I agree, but prefer Upper and Lower, which we use in most of the other intrinsic APIs.
I updated the description to include Background and Motivation as Jan suggested and added the names to the tuple elements.
Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.
Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.
From my understanding, it would be quite tricky. The JIT doesn't have any mechanism at the moment to prove that an address is not leaking when it's passed to an intrinsic for the same reasons as when it's passed to a function that was not inlined.
Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.
From my understanding, it would be quite tricky. The JIT doesn't have any mechanism at the moment to prove that an address is not leaking when it's passed to an intrinsic for the same reasons as when it's passed to a function that was not inlined.
Also, it is not sufficient to prove that the address doesn't leak. It's also problematic to transform the IR into something that allows the backend to put both results in a register, when the IR has it as a reference. Really, in order to do that, by the time it gets to the backend it needs to be a single instruction that defines two registers. And for now, the best way (really, the only way without major changes) is to have it produce a two-field struct.
Yeah, I figured that the real problem is the representation and not just the address taken that the description at the top highlights.
I figured that the real problem is the representation and not just the address taken
Well, they're both real problems. It would be somewhat easier to transform just the backend to accommodate this (though still requiring major changes), but by that point we would already forced the variable to live on the stack.
I propose naming the new methods Bmi2.Multiply instead of MultiplyNoFlags2 - the flags/no-flags effect isn't observable in any way (at least not until flags returning methods are added, but these would be new methods then).
out or tuple overloads```C#
namespace System.Runtime.Intrinsics.X86
{
public abstract class Bmi2 : X86Base
{
public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right);
public abstract class X64: X86Base.X64
{
public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right);
}
}
}
```
@terrajobst, did you mean to close this? Is there a separate issue tracking "we should also do a pass of all the APIs that we believe need out or tuple overloads"? Or for that matter a separate issue tracking doing the perf fix for "Performance issue, which we can solve with a private intrinsic that uses tuples."?