https://github.com/SixLabors/ImageSharp makes use of the System.Numerics.Vector* and related types.
However, in https://github.com/SixLabors/ImageSharp/pull/1152 they are needing to replace the usage of Vector4.Clamp with a custom implementation due to the existing implementation being ~5x slower.
We should improve the codegen of this method in particular and also look at many of the other available methods and see where they could be trivially improved and/or accelerated.
CC. @pgovind
Also CC. @JimBobSquarePants as an FYI that we are tracking this and would be happy to hear of any other places where we could improve.
I think the main problem with this method is that the logic is replicating the behavior of Clamp in HLSL, rather than the intuitive Min-Max behavior. The codegen could certainly be improved from what it is now, but it can't be made to perform as well as most devs would expect without breaking the current behavior.
I did the same workaround myself after noticing the suboptimal codegen: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Utilities/MathUtil.cs#L57
Seems like more of a documentation issue since the docs don't fully explain the behavior and don't mention that it's not presently a JIT intrinsic at all.
I think the main problem with this method is that the logic is replicating the behavior of Clamp in HLSL
HLSL is also trivially min(max(x, minVal), maxVal) where:
min is (y < x) ? y : xmax is (y > x) ? y : xThe issue is that the x86 maxps and minps intrinsics aren't exactly straightforward for 0 or NaN inputs.
That is, for maxps lhs, rhs it returns rhs (on a per element basis) if either input is NaN or if both inputs are 0; regardless of which is + or -.
Right, but would changing the behavior be acceptable as an optimization? If not, the workaround would still be necessary for cases where minps/maxps would be adequate.
I think we should be able to optimize things correctly, it just requires some magic with regards to the ordering of left and right.
That is, since HLSL min(x, y) is (y < x) ? y : x but minps x, y is (x < y) ? x : y (which covers returning y for NaN and both inputs being zero)
Then I believe we just need to swap x and y and things will still be optimized correctly (pending tests, etc).
I'll defer to your knowledge on IEEE 754 edge cases. :)
Would certainly be good to have Clamp emit the expected minps/maxps pair on x86 if that's possible.
I'll defer to your knowledge on IEEE 754 edge cases. :)
Well the HLSL min/max functions aren't IEEE compliant. If they were they would at least always propagate NaN or not (depending on which min/max function implemented) and would also treat -0.0 as less than +0.0.
But, Vector4 was also explicitly designed for speed and use in graphics applications (hence it mirroring HLSL/GLSL functionality for min/max/clamp), so we just need to continue preserving that.
I'll see about getting some tests covering the various cases (num, num, num, NaN, NaN, num, NaN, NaN, +0.0, -0.0 and -0.0, +0.0) for HLSL, GLSL and existing Vector4 behavior and then will update when I get the chance.
You can see HLSL here: http://shader-playground.timjones.io/0b7eb2cdf1dbf05000e49c6b66c48a6f
It's actually slightly different from what I had above which is the GLSL behavior.
For summary the HLSL results are:
min(1.0f, 3.0f) = 1.0fmin(3.0f, 1.0f) = 1.0fmin(1.0f, NaN) = NaNmin(NaN, 1.0f) = 1.0fmin(NaN, NaN) = NaNmin(+0.0f, -0.0f) = -0.0fmin(-0.0f, +0.0f) = +0.0fmax(1.0f, 3.0f) = 3.0fmax(3.0f, 1.0f) = 3.0fmax(1.0f, NaN) = NaNmax(NaN, 1.0f) = 1.0fmax(NaN, NaN) = NaNmax(+0.0f, -0.0f) = -0.0fmax(-0.0f, +0.0f) = +0.0fSo while GLSL is the below:
min(x, y) { return (y < x) ? y : x; }max(x, y) { return (y > x) ? y : x; }clamp(x, y, z) { return min(max(x, y), z) }HLSL is basically:
min(x, y) { return (x < y) ? x : y; }max(x, y) { return (x > y) ? x : y; }clamp(x, y, z) { return min(max(x, y), z) }This means that no transform is needed to remain compatible with HLSL (which is what we are documented to do) and implementing Clamp as return Vector4.Min(Vector4.Max(value1, min), max) should be fine.
That's great! Thanks for digging into it!
Brilliant. This looks like an excellent plan going forward. Thanks @tannergooding
Most helpful comment
You can see HLSL here: http://shader-playground.timjones.io/0b7eb2cdf1dbf05000e49c6b66c48a6f
It's actually slightly different from what I had above which is the GLSL behavior.
For summary the HLSL results are:
min(1.0f, 3.0f) = 1.0fmin(3.0f, 1.0f) = 1.0fmin(1.0f, NaN) = NaNmin(NaN, 1.0f) = 1.0fmin(NaN, NaN) = NaNmin(+0.0f, -0.0f) = -0.0fmin(-0.0f, +0.0f) = +0.0fmax(1.0f, 3.0f) = 3.0fmax(3.0f, 1.0f) = 3.0fmax(1.0f, NaN) = NaNmax(NaN, 1.0f) = 1.0fmax(NaN, NaN) = NaNmax(+0.0f, -0.0f) = -0.0fmax(-0.0f, +0.0f) = +0.0fSo while GLSL is the below:
min(x, y) { return (y < x) ? y : x; }max(x, y) { return (y > x) ? y : x; }clamp(x, y, z) { return min(max(x, y), z) }HLSLis basically:min(x, y) { return (x < y) ? x : y; }max(x, y) { return (x > y) ? x : y; }clamp(x, y, z) { return min(max(x, y), z) }This means that no transform is needed to remain compatible with HLSL (which is what we are documented to do) and implementing
Clampasreturn Vector4.Min(Vector4.Max(value1, min), max)should be fine.