Rationale
A multiplication followed by an add (d = (a * b) + c) is a frequent operation in code, and modern processors provide instructions to perform this combined operation quickly and efficiently.
Proposal
The runtime should be updated to recognize the d = (a * b) + c pattern and emit FMA instructions in those scenarios.
This pattern should be handled for both scalar and vector operations (float, double, and Vector&lt;T&gt;).
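A minimal sketch of the kind of code the JIT would recognize, in both scalar and vector form (the contraction itself is the proposal here, not something the runtime does today; Vector&lt;T&gt; is System.Numerics.Vector&lt;T&gt;):

```csharp
using System.Numerics;

static class Patterns
{
    // Scalar form: the JIT could contract this into a single FMA instruction.
    static double MulAdd(double a, double b, double c) => (a * b) + c;

    // Vector form: the same contraction applied lane-wise.
    static Vector<float> MulAdd(Vector<float> a, Vector<float> b, Vector<float> c)
        => (a * b) + c;
}
```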
Other Thoughts
The runtime already has support for the AVX instruction set and the VEX encoding, which should make it significantly easier to add support for the FMA instructions.
An explicit API (Math.MultiplyAdd, or something similar) could also be provided so users can explicitly opt into the behavior.
Labels: category:cq, theme:intrinsics, skill-level:intermediate, cost:medium
For both scalar and vector types, this can also cover:
- d = (a * b) - c
- d = -(a * b) + c
- d = -(a * b) - c

For vector types, this can also cover:
- d0 = (a0 * b0) + c0, d1 = (a1 * b1) - c1, d2 = (a2 * b2) + c2, d3 = (a3 * b3) - c3
- d0 = (a0 * b0) - c0, d1 = (a1 * b1) + c1, d2 = (a2 * b2) - c2, d3 = (a3 * b3) + c3

FYI @mellinoe, who may be interested.
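For reference, these forms map onto the x86 FMA hardware intrinsics that were ultimately exposed (via the PR linked later in this thread); a sketch using the shipped System.Runtime.Intrinsics.X86.Fma surface:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FmaForms
{
    static Vector128<float> Forms(Vector128<float> a, Vector128<float> b, Vector128<float> c)
    {
        if (!Fma.IsSupported)                                // guard: FMA3 hardware only
            return Sse.Add(Sse.Multiply(a, b), c);           // two-rounding fallback

        var fmadd  = Fma.MultiplyAdd(a, b, c);               // (a * b) + c
        var fmsub  = Fma.MultiplySubtract(a, b, c);          // (a * b) - c
        var fnmadd = Fma.MultiplyAddNegated(a, b, c);        // -(a * b) + c
        var fnmsub = Fma.MultiplySubtractNegated(a, b, c);   // -(a * b) - c

        // Lane-alternating forms: add in some lanes, subtract in the others.
        var fmaddsub = Fma.MultiplyAddSubtract(a, b, c);
        var fmsubadd = Fma.MultiplySubtractAdd(a, b, c);

        return fmadd;
    }
}
```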
It's my understanding that this does not necessarily have the same behavior as a regular multiply-followed-by-add. Given that this could affect large chunks of "regular" code, I can imagine something out there breaking for some reason.
I guess it would help to understand what other optimizing compilers do. Is this a common automatic optimization to do (not opt-in), with respect to behavior/precision differences?
> It's my understanding that this does not necessarily have the same behavior as a regular multiply-followed-by-add.
That could be the case, but I didn't see anything obvious to this effect in the architecture manual (they generally list the error rate when it differs, as is the case with rsqrt). The FMA Extensions section is Volume 1, 14.5. The actual instruction pages are in Volume 2C.
> Given that this could affect large chunks of "regular" code, I can imagine something out there breaking for some reason.
It should be possible to cordon this off under a switch, like we do with AVX and SIMD support.
> I guess it would help to understand what other optimizing compilers do.
The compiler and assembler support is here: https://en.wikipedia.org/wiki/FMA_instruction_set#Compiler_and_assembler_support
All major compilers support it behind a specific switch. On VC++, this requires enabling AVX2 support (/arch:AVX2) and switching the floating-point mode from precise to fast (/fp:fast). On GCC and others, it just requires a flag such as -mfma.
It might be worth noting that some of our math functions (at least on Windows/MacOS) already take advantage of FMA instructions (if the hardware supports it) in the backing CRT call.
Additionally, if enabling this by default is still a concern, then providing an explicit function to enable use of the intrinsic might be desirable (System.Math.FusedMultiplyAdd).
Yeah, an intrinsic like FusedMultiplyAdd (on the Vector types as well) would be doable and something to investigate regardless.
Are there other optimizations "hidden" behind compiler flags that we do by default today? That could provide some motivation to counter-balance the compat concerns.
I think that AVX/AVX2 instructions are the biggest. They are disabled in most compilers by default (since builds target the lowest common denominator of hardware). However, we have support for them enabled by default on CoreCLR (only actually emitting those instructions when the underlying hardware supports them), but it can be disabled via one of the configuration knobs:
| Name | Description | Type | Class | Default Value | Flags |
| -----|-------------|------|-------|---------------|------- |
| EnableAVX | Enable AVX instruction set for wide operations as default | DWORD | EXTERNAL | EXTERNAL_JitEnableAVX_Default | REGUTIL_default |
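(Configuration knobs like this are surfaced as environment variables with the COMPlus_ prefix, so setting COMPlus_EnableAVX=0 should disable AVX codegen.)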
@mellinoe, looks like the difference in behavior falls out from the intermediate result being held in "infinite precision". So it is possible for fma to successfully compute d = (a * b) + c where the equivalent mulsd/addsd calls would fail.
For example, (1e308 * 2) - 1e308 yields Inf, because the intermediate product 2e308 overflows the double range, but fma(1e308, 2, -1e308) yields 1e308.
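A sketch of that difference using the Math.FusedMultiplyAdd API that was later approved (see further down this thread); it requires a runtime that ships that method:

```csharp
using System;

class FmaPrecision
{
    static void Main()
    {
        double a = 1e308, b = 2.0, c = -1e308;

        // Separate multiply and add: a * b overflows to +Infinity
        // before the subtraction can bring the value back into range.
        Console.WriteLine((a * b) + c);                     // Infinity

        // Fused form: the intermediate product is not rounded (held at
        // "infinite precision"), so no overflow occurs.
        Console.WriteLine(Math.FusedMultiplyAdd(a, b, c));  // 1E+308
    }
}
```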
It also looks like MSVC++ uses fma for constant folding regardless of mode (i.e. it will still do it when no enhanced instruction set is enabled and precise mode is set); it only emits the actual fma instructions when both AVX2 and fast mode are set.
Additionally, it appears that enabling FMA intrinsics will require some rework of the GenTree representation, as it currently only supports unary and binary operand intrinsics, whereas FMA takes three operands.
@dotnet/jit-contrib
As @tannergooding implies, the difference in behavior between FMA and MUL+ADD is that the former rounds once (at the end) while the latter rounds twice. CLR doesn't have particularly strict floating-point semantics, and contraction is explicitly allowed by ECMA 335 (III.1.1.1), so it might be reasonable to do this folding without user consent, but I'm not sure. Certainly doesn't seem like a huge breaking change--at least for .NET Core.
Intel hardware intrinsic API proposal has been opened at dotnet/corefx#22940
Is there any way to see what codegen options the JIT is using on a particular machine? I am working on a complex multi-threaded calculation project, and we are getting different answers running on different hardware. In particular Xeons give different answers to i7s. I thought it might be AVX levels, and tried setting the flag described in this thread, but it made no difference.
If I can understand _why_ the differences are occurring, then I'm happy that the JITter is optimising for available hardware, but at the moment I can't tell whether it is the JIT or some subtle multi-threaded timing artefact!
@OracPrime, you might try the JitDisasm configuration knob: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/clr-configuration-knobs.md (although I think that one only works with debug builds of the runtime)
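(For example, running with COMPlus_JitDisasm=YourMethodName on such a build dumps the generated assembly for that method; the method name here is a placeholder.)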
@tannergooding thanks. What a lot of knobs!
Eventually we discovered the difference was down to FusedMultiplyAdd instructions, which can be controlled by the C RTL call _set_FMA3_enable.
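For anyone hitting the same issue, a hedged sketch of flipping that switch from managed code; this assumes the x64 Universal CRT (ucrtbase.dll) exports _set_FMA3_enable, which may vary across CRT versions:

```csharp
using System.Runtime.InteropServices;

static class CrtFma
{
    // Assumption: _set_FMA3_enable is exported by the x64 UCRT.
    [DllImport("ucrtbase", CallingConvention = CallingConvention.Cdecl)]
    static extern int _set_FMA3_enable(int flag);

    // Passing 0 disables the FMA3-based implementations of the CRT math
    // functions, making results reproducible across FMA and non-FMA
    // hardware (at some cost in speed and last-bit accuracy).
    public static void DisableFma3() => _set_FMA3_enable(0);
}
```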
The PR adding support for the x86 FMA intrinsics is up: https://github.com/dotnet/coreclr/pull/18105
@tannergooding - would it make sense to change the title of this to make it clear that, beyond dotnet/coreclr#18105, this proposes that the JIT do folding of these operations?
Not quite runtime recognition, but the explicit Math/MathF operations are now approved for implementation: dotnet/corefx#31903
@tannergooding - the runtime would require some kind of knob to enable the JIT to optimize these kinds of sequences. Given that, and the fact that fused multiply add intrinsics are supported on both x86/x64 and Arm64, I'd propose that we close this issue since it is a subset of #12753. Any objection?
Sounds reasonable to me!