Rationale
A multiplication followed by an add (d = (a * b) + c) is a frequent operation in code, and modern processors provide instructions to perform this combined operation quickly and efficiently.
Proposal
The runtime should be updated to recognize the d = (a * b) + c pattern and emit FMA instructions in those scenarios.
This pattern should be handled for both scalar and vector operations (float, double, and Vector&lt;T&gt;).
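A minimal sketch of the kind of code the JIT would recognize, in both scalar and vector form (the contraction itself is the proposal here, not something the runtime does today; Vector&lt;T&gt; is System.Numerics.Vector&lt;T&gt;):

```csharp
using System.Numerics;

static class Patterns
{
    // Scalar form: the JIT could contract this into a single FMA instruction.
    static double MulAdd(double a, double b, double c) => (a * b) + c;

    // Vector form: the same contraction applied lane-wise.
    static Vector<float> MulAdd(Vector<float> a, Vector<float> b, Vector<float> c)
        => (a * b) + c;
}
```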
Other Thoughts
The runtime already has support for the AVX instruction set and the VEX encoding, which should make it significantly easier to add support for the FMA instructions.
An explicit API (Math.MultiplyAdd, or something similar) could also be provided so users can explicitly opt into the behavior.
Labels: category:cq, theme:intrinsics, skill-level:intermediate, cost:medium
For both scalar and vector types, this can also cover:
- d = (a * b) - c
- d = -(a * b) + c
- d = -(a * b) - c

For vector types, this can also cover:
- d0 = (a0 * b0) + c0, d1 = (a1 * b1) - c1, d2 = (a2 * b2) + c2, d3 = (a3 * b3) - c3
- d0 = (a0 * b0) - c0, d1 = (a1 * b1) + c1, d2 = (a2 * b2) - c2, d3 = (a3 * b3) + c3

FYI @mellinoe, who may be interested.
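For reference, these forms map onto the x86 FMA hardware intrinsics that were ultimately exposed (via the PR linked later in this thread); a sketch using the shipped System.Runtime.Intrinsics.X86.Fma surface:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FmaForms
{
    static Vector128<float> Forms(Vector128<float> a, Vector128<float> b, Vector128<float> c)
    {
        if (!Fma.IsSupported)                                // guard: FMA3 hardware only
            return Sse.Add(Sse.Multiply(a, b), c);           // two-rounding fallback

        var fmadd  = Fma.MultiplyAdd(a, b, c);               // (a * b) + c
        var fmsub  = Fma.MultiplySubtract(a, b, c);          // (a * b) - c
        var fnmadd = Fma.MultiplyAddNegated(a, b, c);        // -(a * b) + c
        var fnmsub = Fma.MultiplySubtractNegated(a, b, c);   // -(a * b) - c

        // Lane-alternating forms: add in some lanes, subtract in the others.
        var fmaddsub = Fma.MultiplyAddSubtract(a, b, c);
        var fmsubadd = Fma.MultiplySubtractAdd(a, b, c);

        return fmadd;
    }
}
```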
It's my understanding that this does not necessarily have the same behavior as a regular multiply-followed-by-add. Given that this could affect large chunks of "regular" code, I can imagine something out there breaking for some reason.
I guess it would help to understand what other optimizing compilers do. Is this a common automatic optimization to do (not opt-in), with respect to behavior/precision differences?
> It's my understanding that this does not necessarily have the same behavior as a regular multiply-followed-by-add.
That could be the case, but I didn't see anything obvious to this effect in the architecture manual (they generally list the error rate when it differs, as is the case with rsqrt). The FMA Extensions section is Volume 1, 14.5. The actual instruction pages are in Volume 2C.
> Given that this could affect large chunks of "regular" code, I can imagine something out there breaking for some reason.
It should be possible to cordon this off under a switch, like we do with AVX and SIMD support.
> I guess it would help to understand what other optimizing compilers do.
The compiler and assembler support is here: https://en.wikipedia.org/wiki/FMA_instruction_set#Compiler_and_assembler_support
All major compilers support it behind a specific switch. On VC++, this requires enabling AVX2 support (/arch:AVX2) and switching the floating-point mode from precise to fast (/fp:fast). On GCC and others, it just requires a flag such as -mfma.
It might be worth noting that some of our math functions (at least on Windows/MacOS) already take advantage of FMA instructions (if the hardware supports it) in the backing CRT call.
Additionally, if enabling this by default is still a concern, then providing an explicit function to enable use of the intrinsic might be desirable (System.Math.FusedMultiplyAdd).
Yeah, an intrinsic like FusedMultiplyAdd (on the Vector types as well) would be doable and something to investigate regardless.
Are there other optimizations "hidden" behind compiler flags that we do by default today? That could provide some motivation to counter-balance the compat concerns.
I think that AVX/AVX2 instructions are the biggest. They are disabled in most compilers by default (since builds target the lowest common denominator of hardware). However, we have support for them enabled by default on CoreCLR (only actually emitting those instructions when the underlying hardware supports them), but it can be disabled via one of the configuration knobs:
| Name | Description | Type | Class | Default Value | Flags |
| -----|-------------|------|-------|---------------|------- |
| EnableAVX | Enable AVX instruction set for wide operations as default | DWORD | EXTERNAL | EXTERNAL_JitEnableAVX_Default | REGUTIL_default |
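(Configuration knobs like this are surfaced as environment variables with the COMPlus_ prefix, so setting COMPlus_EnableAVX=0 should disable AVX codegen.)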
@mellinoe, looks like the difference in behavior falls out from the intermediate result being held in "infinite precision". So it is possible for fma to successfully compute d = (a * b) + c where the equivalent mulsd/addsd calls would fail.
For example, (1e308 * 2) - 1e308 yields Inf, because the intermediate product 2e308 overflows the double range, but fma(1e308, 2, -1e308) yields 1e308.
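A sketch of that difference using the Math.FusedMultiplyAdd API that was later approved (see further down this thread); it requires a runtime that ships that method:

```csharp
using System;

class FmaPrecision
{
    static void Main()
    {
        double a = 1e308, b = 2.0, c = -1e308;

        // Separate multiply and add: a * b overflows to +Infinity
        // before the subtraction can bring the value back into range.
        Console.WriteLine((a * b) + c);                     // Infinity

        // Fused form: the intermediate product is not rounded (held at
        // "infinite precision"), so no overflow occurs.
        Console.WriteLine(Math.FusedMultiplyAdd(a, b, c));  // 1E+308
    }
}
```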
It also looks like MSVC++ uses fma for constant folding regardless of mode (i.e. it will still do it when no enhanced instruction set is enabled and precise mode is set); it only emits the actual fma instructions when both AVX2 and fast mode are set.
Additionally, it appears that enabling FMA intrinsics will require some rework of the GenTree representation, as it currently only supports unary and binary operand intrinsics, whereas FMA takes three operands.
@dotnet/jit-contrib
As @tannergooding implies, the difference in behavior between FMA and MUL+ADD is that the former rounds once (at the end) while the latter rounds twice. CLR doesn't have particularly strict floating-point semantics, and contraction is explicitly allowed by ECMA 335 (III.1.1.1), so it might be reasonable to do this folding without user consent, but I'm not sure. Certainly doesn't seem like a huge breaking change--at least for .NET Core.
Intel hardware intrinsic API proposal has been opened at dotnet/corefx#22940
Is there any way to see what codegen options the JIT is using on a particular machine? I am working on a complex multi-threaded calculation project, and we are getting different answers running on different hardware. In particular Xeons give different answers to i7s. I thought it might be AVX levels, and tried setting the flag described in this thread, but it made no difference.
If I can understand _why_ the differences are occurring, then I'm happy that the JITter is optimising for available hardware, but at the moment I can't tell whether it is the JIT or some subtle multi-threaded timing artefact!
@OracPrime, you might try the JitDisasm configuration knob: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/clr-configuration-knobs.md (although I think that one only works with debug builds of the runtime)
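(For example, running with COMPlus_JitDisasm=YourMethodName on such a build dumps the generated assembly for that method; the method name here is a placeholder.)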
@tannergooding thanks. What a lot of knobs!
Eventually we discovered the difference was down to FusedMultiplyAdd instructions, which can be controlled by the C RTL call _set_FMA3_enable.
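For anyone hitting the same issue, a hedged sketch of flipping that switch from managed code; this assumes the x64 Universal CRT (ucrtbase.dll) exports _set_FMA3_enable, which may vary across CRT versions:

```csharp
using System.Runtime.InteropServices;

static class CrtFma
{
    // Assumption: _set_FMA3_enable is exported by the x64 UCRT.
    [DllImport("ucrtbase", CallingConvention = CallingConvention.Cdecl)]
    static extern int _set_FMA3_enable(int flag);

    // Passing 0 disables the FMA3-based implementations of the CRT math
    // functions, making results reproducible across FMA and non-FMA
    // hardware (at some cost in speed and last-bit accuracy).
    public static void DisableFma3() => _set_FMA3_enable(0);
}
```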
The PR adding support for the x86 FMA intrinsics is up: https://github.com/dotnet/coreclr/pull/18105
@tannergooding - would it make sense to change the title of this to make it clear that, beyond dotnet/coreclr#18105, this proposes that the JIT do folding of these operations?
Not quite runtime recognition, but the explicit Math/MathF operations are now approved for implementation: dotnet/corefx#31903
@tannergooding - the runtime would require some kind of knob to enable the JIT to optimize these kinds of sequences. Given that, and the fact that fused multiply add intrinsics are supported on both x86/x64 and Arm64, I'd propose that we close this issue since it is a subset of #12753. Any objection?
Sounds reasonable to me!