Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!
As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.
/cc @Sergio0694 @saucecontrol
Summary of steps currently done by ConvertColorsInto:
[D]: Data representation
(T): Bulk transformation between data representations
Rgba32 buffer| | |
| ------- | :--- |
| [D] | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case case, the input buffer could be handled as homogenous float buffer, where all individual float values should be converted to byte-s. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T> |
| [D] | The result image as an Rgba32 buffer |
Rgb24 buffer| | |
| ------- | :--- |
| [D] | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer, utilizing BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrow operations through Vector<T> |
| [D] | Temporary Rgba32 buffer |
| (T) | PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!) |
| [D] | The result image as an Rgb24 buffer |
Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer| | |
| ------- | :--- |
| D1 | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert, the 3 planes, and write them back to the originating buffers |
| D2 | 3 Planes of Buffer2D<float>, R+G+B) |
| (T) | Narrow the float buffers to byte buffers using SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows |
| D3 | 3 Planes of Buffer2D<byte>, R+G+B |
| (T) | PACK the separate image planes (color channels) into a single Rgb24 buffer |
| D4 | The result image as an Rgb24 buffer |
TPixel case) Y+Cb+Cr planes --> Single TPixel buffer| | |
| ------- | :--- |
| D1 | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | All the steps from the default Rgb24 case |
| D4 | Memory<Rgb24> |
| (T) | Convert the Rgb24 buffer to TPixel buffer using PixelOperations<T> |
| D5 | The result image as an TPixel buffer |
The magic is mostly in the D3->D4 transition, because of the fact that we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that if we decode to Image<Rgb24> (case b) we can omit an unnecessary step.
The best thing is that we can handle this big task incrementally:
PixelOperations<T> by new packing operationsJpegImagePostProcessor as described in the Optimized pipeline paragraph The packing API is pretty straightforward:
public class PixelOperations<TPixel>
{
// ...
public void PackFromRgbPlanes(
Configuration configuration,
ReadOnlySpan<byte> redChannel,
ReadOnlySpan<byte> greenChannel,
ReadOnlySpan<byte> blueChannel,
Span<TPixel> destination);
}
We can define a default implementations in the base PixelOperations<TPixel> class, and specialize it for Rgba32 and Rgb24. Optional hardcore task is to T4 a SIMD implementation it for all the RGB(A)-like formats.
It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task both implementation and architecture-wise, and prefer incremental evolution instead.
@brianpopow is packing useful for other codecs?
You can get a nice easy win just by using the intrinsics for your float->byte conversions, because the float->int instructions include rounding, and the packing instructions are saturating so you can skip the clamping step. That's as simple as this:
You can speed up Y'CbCr->RGB conversion with intrinsics as well, but I was only able to make the SIMD version faster than scalar with AVX. Any matrix multiplication can be accelerated with DPPS (dot product), but since you really only need 4 of the coefficients from the Y'CbCr->RGB matrix, it doesn't work out any faster unless you can operate on the wider vector.
Unrelated: @saucecontrol your repo is seriously impressive, incredible work there! 馃憦
I have a couple questions:
ImageSharp did you use to make those benchmarks? I'm particularly impressed by the very high memory usage it has compared to other libs, especially in the parallel benchmark (207x compared to System.Drawing? Am I missing something here? 馃樁).ImageSharp fails with the OOM error - @antonfirsov's recent PR should've fixed that.Thanks! I did those benchmarks with the current dev at the time (1.0.0-dev003181). I did a quick run yesterday with 1.0.0-unstable0806, and GC allocations were way down but speed was a bit worse. I see @JimBobSquarePants has a new PR that improves perf. Hopefully that gets merged this weekend? I plan on re-running the whole suite since I've got a new MagicScaler version releasing today. Would be good to have all the ImageSharp improvements in for that.
207x compared to System.Drawing? Am I missing something here?
Those numbers will always be misleading because System.Drawing is just a thin wrapper around native code (GDI+). That's why I threw in a more comprehensive test that also measured unmanaged memory allocations.
@saucecontrol thanks for the suggestions! Introducing a System.Runtime.InteropServices AVX2 path for our float -> byte conversion is definitely a long hanging fruit. If we could base our implementation on your code it would be super easy.
For Y'CbCr -> RGB, we already have a SIMD path that compiles to AVX2 under the hood, and it outperforms our previous scalar approaches a lot. It's the slow scalar packing that hurts the most at the moment. It will be way much better to do the float -> byte conversion first, and pack 32 bytes at a single step with AVX2.
You're certainly welcome to borrow any of my code you find useful. My Y'CbCr conversion is here:
You'll notice the scalar float version reads/writes directly to/from the buffers rather than going through a Vector4. Setting the individual fields of Vector4 is always slow because it means a minimum of three shuffle ops, and Intel processors only have a single port that can service those shuffles. It ends up being faster to do the memory read and write since there are more ports to service those and they'll be in cache anyway.
With AVX, it's possible to overcome the extra shuffle overhead by using a DotProduct that operates on two pixels at once, but even that is only profitable if you have access to the full AVX instruction set. I couldn't outdo the scalar code with System.Numerics.
Just realized in looking at the code again, I was doing an extra round of shuffles just to be able to use DotProduct when it's only a gain on the Green output. And with FMA, it's not even a gain there. I simplified the code, and added an SSE version which ends up just over twice as fast as scalar. I haven't had time to draw out ascii diagrams of how that code works, but I'm happy to answer questions if you have any.
I see another quick win possibility in your SIMD Y'CbCr->RGB path if you just hoist those constant Vector<float> values that hold the coefficients out of the loop. RyuJIT has never been smart enough to do that for you, and each of those requires a load/broadcast instruction pair.
@brianpopow is packing useful for other codecs?
WebP lossy format is similar to JPEG as in the image is also represented with YUV and split up into macroblocks. The difference is, that the DCT coefficients are represented as byte rather than float. All the decoding is done with integer operations.
For the other formats we currently have, i also can not see this applied. Unfortunately this optimization looks rather specific to JPEG at the moment.
I'll give this a shot. A couple of questions:
Awesome! Not sure I understand what you mean by the first question but Rgba32 would use an alpha of 255.
We can define a default implementations in the base PixelOperations
class, and specialize it for Rgba32 and Rgb24
@JimBobSquarePants i get the 32/24 behaviour but not sure what it means by default impl
Rgba32 would use an alpha of 255.
Silly me, was thinking of floating point max alpha 馃榿
Ah right.... So we have a "default" PixelOperations
The default version will use vectors and the various members of IPixel interface
https://github.com/SixLabors/ImageSharp/tree/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats
Whereas the explicit versions contain optimizations that take advantage of the size and layout of the individual pixel formats.
https://github.com/SixLabors/ImageSharp/tree/master/src/ImageSharp/PixelFormats/PixelImplementations/Generated
In other words: the default implementation contains generic code, that utilizes the IPixel<T> interface thanks to the constraints:
https://github.com/SixLabors/ImageSharp/blob/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats/PixelOperations%7BTPixel%7D.Generated.cs
However instead of extending that interface (more codebloat for low benefits), I suggest to convert TPixel from Rgb24.
Most helpful comment
In other words: the default implementation contains generic code, that utilizes the
IPixel<T>interface thanks to the constraints:https://github.com/SixLabors/ImageSharp/blob/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats/PixelOperations%7BTPixel%7D.Generated.cs
However instead of extending that interface (more codebloat for low benefits), I suggest to convert
TPixelfromRgb24.