Runtime: Improve the codegen for Vector4.Clamp

Created on 24 Mar 2020 · 11Comments · Source: dotnet/runtime

https://github.com/SixLabors/ImageSharp makes use of the System.Numerics.Vector* and related types.

However, in https://github.com/SixLabors/ImageSharp/pull/1152 they are needing to replace the usage of Vector4.Clamp with a custom implementation due to the existing implementation being ~5x slower.

We should improve the codegen of this method in particular and also look at many of the other available methods and see where they could be trivially improved and/or accelerated.

area-System.Numerics tenet-performance up-for-grabs

Source

tannergooding

👍3

Most helpful comment

You can see HLSL here: http://shader-playground.timjones.io/0b7eb2cdf1dbf05000e49c6b66c48a6f
It's actually slightly different from what I had above which is the GLSL behavior.

I tested locally for both (and with non constant inputs) but I don't know of a shader-playground equivalent for GLSL

For summary the HLSL results are:

min(1.0f, 3.0f) = 1.0f
min(3.0f, 1.0f) = 1.0f
min(1.0f, NaN) = NaN
min(NaN, 1.0f) = 1.0f
min(NaN, NaN) = NaN
min(+0.0f, -0.0f) = -0.0f
min(-0.0f, +0.0f) = +0.0f
- max(1.0f, 3.0f) = 3.0f
max(3.0f, 1.0f) = 3.0f
max(1.0f, NaN) = NaN
max(NaN, 1.0f) = 1.0f
max(NaN, NaN) = NaN
max(+0.0f, -0.0f) = -0.0f
max(-0.0f, +0.0f) = +0.0f

So while GLSL is the below:

min(x, y) { return (y < x) ? y : x; }
max(x, y) { return (y > x) ? y : x; }
clamp(x, y, z) { return min(max(x, y), z) }

HLSL is basically:

min(x, y) { return (x < y) ? x : y; }
max(x, y) { return (x > y) ? x : y; }
clamp(x, y, z) { return min(max(x, y), z) }

This means that no transform is needed to remain compatible with HLSL (which is what we are documented to do) and implementing Clamp as return Vector4.Min(Vector4.Max(value1, min), max) should be fine.

tannergooding on 24 Mar 2020

👍3

All 11 comments

CC. @pgovind

tannergooding on 24 Mar 2020

Also CC. @JimBobSquarePants as an FYI that we are tracking this and would be happy to hear of any other places where we could improve.

tannergooding on 24 Mar 2020

👍1

I think the main problem with this method is that the logic is replicating the behavior of Clamp in HLSL, rather than the intuitive Min-Max behavior. The codegen could certainly be improved from what it is now, but it can't be made to perform as well as most devs would expect without breaking the current behavior.

I did the same workaround myself after noticing the suboptimal codegen: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Utilities/MathUtil.cs#L57

Seems like more of a documentation issue since the docs don't fully explain the behavior and don't mention that it's not presently a JIT intrinsic at all.

saucecontrol on 24 Mar 2020

I think the main problem with this method is that the logic is replicating the behavior of Clamp in HLSL

HLSL is also trivially min(max(x, minVal), maxVal) where:

min is (y < x) ? y : x
max is (y > x) ? y : x

The issue is that the x86 maxps and minps intrinsics aren't exactly straightforward for 0 or NaN inputs.

That is, for maxps lhs, rhs it returns rhs (on a per element basis) if either input is NaN or if both inputs are 0; regardless of which is + or -.

tannergooding on 24 Mar 2020

Right, but would changing the behavior be acceptable as an optimization? If not, the workaround would still be necessary for cases where minps/maxps would be adequate.

saucecontrol on 24 Mar 2020

I think we should be able to optimize things correctly, it just requires some magic with regards to the ordering of left and right.

That is, since HLSL min(x, y) is (y < x) ? y : x but minps x, y is (x < y) ? x : y (which covers returning y for NaN and both inputs being zero)

Then I believe we just need to swap x and y and things will still be optimized correctly (pending tests, etc).

tannergooding on 24 Mar 2020

👍1

I'll defer to your knowledge on IEEE 754 edge cases. :)

Would certainly be good to have Clamp emit the expected minps/maxps pair on x86 if that's possible.

saucecontrol on 24 Mar 2020

I'll defer to your knowledge on IEEE 754 edge cases. :)

Well the HLSL min/max functions aren't IEEE compliant. If they were they would at least always propagate NaN or not (depending on which min/max function implemented) and would also treat -0.0 as less than +0.0.

But, Vector4 was also explicitly designed for speed and use in graphics applications (hence it mirroring HLSL/GLSL functionality for min/max/clamp), so we just need to continue preserving that.

I'll see about getting some tests covering the various cases (num, num, num, NaN, NaN, num, NaN, NaN, +0.0, -0.0 and -0.0, +0.0) for HLSL, GLSL and existing Vector4 behavior and then will update when I get the chance.

tannergooding on 24 Mar 2020

👍3

You can see HLSL here: http://shader-playground.timjones.io/0b7eb2cdf1dbf05000e49c6b66c48a6f
It's actually slightly different from what I had above which is the GLSL behavior.