Runtime: Add DivMod instrinct for intel x86/x64

Created on 2 Sep 2018  路  15Comments  路  Source: dotnet/runtime

Please add hw instrinct support for divide on intel platforms.

Rationale and Usage

The intel instruction div is a bit special since 32bit divide takes a 64 dividend and 32bit divisor and returns both quotient and remainder as result.

It would be very useful, not only for speeding up Math.DivRem (https://github.com/dotnet/coreclr/issues/757 ) but especially for several decimal operations which would make it more feasible to move decimal code to C# (and maybe several number functions).
Significant speedup using div in C++ provided significant speedups in https://github.com/dotnet/coreclr/issues/10642)

Example:

The following code perfomes division of a 96bit unsigned number by a 32bit number.
Updating the 96bit number in-pace and returning the remainder
The example is simplified from actual code and contains only the "worst case" execution path.

```C#
public uint Div96By32(uint [] num, uint denominator)
{
if (System.Runtime.Intrinsics.X86.X86Base.IsSupported)
{
uint remainder;
num[2] = System.Runtime.Intrinsics.X86.X86Base.DivMod(0, num[2], denominator, out remainder);
num[1] = System.Runtime.Intrinsics.X86.X86Base.DivMod(remainder, num[1], denominator, out remainder);
num[0] = System.Runtime.Intrinsics.X86.X86Base.DivMod(remainder, num[0], denominator, out remainder);
return remainder;
}
else { ... }
}
````

It performs the same logic as in https://github.com/dotnet/corert/blob/d82d460a8530a57e4915060be37fb42c7a661f48/src/System.Private.CoreLib/shared/System/Decimal.DecCalc.cs#L228

That code has 2 64bit divides (_much_ slower, especially in 32bit mode), and several multiplications (workarounds instead of using '%' which would double the executed div instructions currently).

Proposed API

The API might need some design, but I think it makes sense to keep it similar to Math.DivRem as below.

```C#

namespace System.Runtime.Intrinsics.X86
{
public static class X86Base
{
/// Perform unsigned division of 64 bit (hi, low) by 32bit divisor
/// returning quotient, and returning remainder as an out parameter
public uint DivRem(uint hi, uint low, uint divisor, out uint remainder);

/// Perform unsigned division of 64 bit (hi, low) by 32 bit divisor
/// returning quotient, and returning remainder as an out parameter
public int DivRem(int hi, uint low, int divisor, out int remainder);

[Intrinsic]
public static class X64
{
    /// Perform unsigned division of 128 bit (hi, low) by 64bit divisor
    /// returning quotient, and returning remainder as an out parameter
    public ulong DivRem(ulong hi, ulong low, ulong divisor, out ulong remainder);

    /// Perform unsigned division of 128 bit (hi, low) by 64 bit divisor
    /// returning quotient, and returning remainder as an out parameter
    public long DivRem(long hi, ulong low, long divisor, out long remainder);
  }

}

Related functionality which would make sense to add.
It could use the instrinct for god performance on x86 platforms.

```csharp
class System.Math
{
   // unsigned overload of  public static int DivRem (int a, int b, out int result);
   public static uint DivRem (uint a, uint b, out uint result);

   // unsigned overload of  public static int DivRem (int a, int b, out int result);
   public static ulong DivRem (ulong a, ulong b, out ulong result);
}

Details

Full description of instruction can be found in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

Updates

api-suggestion arch-x64 area-System.Runtime.Intrinsics

Most helpful comment

There are several issues with the existing Math.DivRem/BigMul methods that are not resolved by making them intrinsic:

  • No unsigned overloads.
  • No 128-bit division/multiplication.
  • No IsSupported mechanism to detect if they are intrinsics or not.

And unlike IDIV, a general purpose division method has dividend size equal to quotient size, which results in long division being performed with a 128-bit IDIV on x64 with a 2-4x latency and 4-15x throughput penalty compared to 64-bit IDIV (x86 goes through a slow helper call).
A general purpose 128-bit division method would have to be a slow helper call on all platforms and pretty much useless (e.g., Decimal has special division methods already for 96/64, 96/32, etc.).

Maybe these intrinsics don't need to be public at first, but they would be useful even just to implement the existing Math functions + Decimal math. But making them internal would limit their usefulness.

All 15 comments

@tannergooding

CC. @CarolEidt, would appreciate your thoughts here.

This falls into the category of "core" instructions (not behind any flag), but which also provide additional functionality that is one or more of:

  • Platform/Architecture Specific
  • Not easily mappable to IL or C# language semantics
  • Potentially useful for certain types of algorithms

There are other instructions that also fall into this category, such as Add with Carry, for which certain native compilers do expose such instrinsic helpers:

@Daniel-Svensson, could you update the original post to roughly follow the layout given as an example here: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md#steps

Could you also update the original post to include the signed-division overloads (which would call IDIV)?

@tannergooding I've tried to update the post based on the example which the documentation linked to.

I've added two versions for IDIV since I was a bit unsure about which version you would be interested in.

I've also tried to add namespaces/classes based on some arm instrinct I found.

Regarding Add with Carry and Subtract with Carry are there any issue tracking those?

It would be very useful to have them as cross platform instrincts, 64bit operations could just fall back to 2x 32bit operations. I've have some C++ improvements to decimal arithmetics with 150%+ speedups around and having those operations would be required to have any managed implementation with similar performance)

I've added two versions for IDIV since I was a bit unsure about which version you would be interested in.

Both the 64-bit / 32-bit, and the 32-bit / 32-bit cases are interesting.

I've also tried to add namespaces/classes based on some arm instrinct I found.

We don't want to have a x64 specific class. Currently we are just exposing them together in the same class and calling the 64-bit only overloads (on a 32-bit box) will throw a PNSE.

Regarding Add with Carry and Subtract with Carry are there any issue tracking those

No, there isn't a tracking issue today.

It would be very useful to have them as cross platform instrincts

We actually have Math.DivRem today (for signed and unsigned versions of Int32 / Int32 and Int64 / Int64). They just don't actually do a single operation today (if the intrinsics are approved, and implemented, we would likely update the existing Math APIs to use those, where possible).

I would like to extend this proposal to also include BigMul as mentioned in https://github.com/dotnet/coreclr/issues/17818#issuecomment-395365269.
128-bit DIV and MUL are often needed together in the same place (e.g., System.Decimal).

```cs
public static class Base
{
// 32 * 32 => 64 bit signed multiply
public uint BigMul(int a, int b, out int hi);
// 32 * 32 => 64 bit unsigned multiply
public uint BigMul(uint a, uint b, out uint hi);
// 64 * 64 => 128 bit signed multiply
public ulong BigMul(long a, long b, out long hi);
// 64 * 64 => 128 bit unsigned multiply
public ulong BigMul(ulong a, ulong b, out ulong hi);
}

We should just make the existing Math.DivRem methods work as intrinsic. These do not need be new HW intrinsics methods in the public surface.

These do not need be new HW intrinsics methods in the public surface.

I think that is (possibly) worth further discussion.

Even if we update Math.DivRem to be intrinsic in 3.0, it will essentially be "this is maybe intrinsic, if your platform/runtime support it". While x86Base.DivRem would be "this is always intrinsic, for x86 using int/uint and for x64 using int/uint or long/ulong" (the latter of which can be important for certain performance oriented scenarios).

DivRem (like the other instructions that would be put up for proposal to include in the x86Base class) is a case where x86/x64 exposes a single instruction that provides functionality not all platforms may support. In this case, it is a combining div and rem (which on ARM, to my knowledge, there is no single instruction equivalent).

There are several issues with the existing Math.DivRem/BigMul methods that are not resolved by making them intrinsic:

  • No unsigned overloads.
  • No 128-bit division/multiplication.
  • No IsSupported mechanism to detect if they are intrinsics or not.

And unlike IDIV, a general purpose division method has dividend size equal to quotient size, which results in long division being performed with a 128-bit IDIV on x64 with a 2-4x latency and 4-15x throughput penalty compared to 64-bit IDIV (x86 goes through a slow helper call).
A general purpose 128-bit division method would have to be a slow helper call on all platforms and pretty much useless (e.g., Decimal has special division methods already for 96/64, 96/32, etc.).

Maybe these intrinsics don't need to be public at first, but they would be useful even just to implement the existing Math functions + Decimal math. But making them internal would limit their usefulness.

A couple more useful additions: BSF and BSR

```C#
public abstract class Base
{
public abstract class X64
{
///


/// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
///

public static int BitScanForward(long mask);
public static int BitScanForward(ulong mask);

    /// <summary>
    /// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
    /// </summary>
    public static int BitScanReverse(long x);
    public static int BitScanReverse(ulong x);
}

/// <summary>
/// Search the mask data from least significant bit (LSB) to the most significant bit (MSB) for a set bit (1).
/// </summary>
public static int BitScanForward(int mask);
public static int BitScanForward(uint mask);

/// <summary>
/// Search the mask data from most significant bit (MSB) to least significant bit (LSB) for a set bit (1).
/// </summary>
public static int BitScanReverse(int x);
public static int BitScanReverse(uint x);

}
```

There are several places in CoreFX that have been updated recently to use the TZCNT and LZCNT HWIntrinsics, but they must fall back to a slow path for processors not supporting the BMI1 or LZCNT ISAs, respectively, because the JIT requires the corresponding CPUID flag be set.

However, as described here, TZCNT is encoded as REP BSF, which means it can execute on downlevel processors (with different behavior for a 0 input).

Software built with an AOT compiler can take advantage of the instruction compatibility to have a single code path when the TZCNT/BSF behavior is known to be the same, but we can't do the same with the JIT. On the other hand, we could code paths for both instructions if BSF were exposed explicitly.

Likewise, in place of LZCNT, we could use 31 ^ BSR(mask) as a fallback path.

Given the number of processors not supporting the BMI1 and LZCNT ISAs, including the modern Intel Goldmont architecture, it would be worthwhile to be able to code both paths.

Also related: https://github.com/dotnet/corefx/issues/32269

I would like to extend this proposal to also include BigMul as mentioned in dotnet/coreclr#17818 (comment).

@pentp I think that would be a better fit to add to Math to complement the currently available signed versions (which I guess might be in a different issue) or wherever new math operations go instead of making it platform specific. It could then be treated as an instrinct by the runtime for optimal performance. I tried to search for such an issue but did not find it.

128-bit MUL does not really fit into the existing Math class - it can only be an intrinsic on some platforms, while on others (e.g. 32-bit x86) the performance falls off a cliff. It follows that an IsSupported check is required, which also does not fit into Math, as commented above.

A Mul128(long, long) is really no different from FusedMultuplyAdd.

For C/C++ they expose a FP_FAST_FMA macro for this.

I dont think it would be unreasonable, or out of scope, to expose similar boolean properties for a small selection of Math APIs, like FusedMultiplyAdd or Mul128

I've updated the description to match that there is now a X86Base instrinct class.
@tannergooding is there anything else that should/can be changed to increase the likelihood of getting this included for 5.0?

I can mark it ready-for-review now but I don't know if this will get on the backlog before we lock everything down.

The API review will also need to go over whether this is actually worthwhile to expose. As @jkotas indicated, this may be sufficiently covered (for the most common cases) via Math.DivRem and we should just optimize it (which I believe is dependent on some work @CarolEidt has been doing around multi-reg returns).

The remaining reason for the intrinsic is to support 128-bit / 64-bit = 64-bit and 64-bit / 32-bit = 32-bit. However, it might also be possible to just handle those via new "general-purpose" Math APIs.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

EgorBo picture EgorBo  路  3Comments

iCodeWebApps picture iCodeWebApps  路  3Comments

nalywa picture nalywa  路  3Comments

omariom picture omariom  路  3Comments

bencz picture bencz  路  3Comments