Imagesharp: Speed up JPEG Decoder color conversion

Created on 22 Feb 2020 · 13Comments · Source: SixLabors/ImageSharp

Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!

As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.

/cc @Sergio0694 @saucecontrol

Current pipeline

Summary of steps currently done by ConvertColorsInto:

[D]: Data representation
(T): Bulk transformation between data representations

(case a) Y+Cb+Cr planes --> Single `Rgba32` buffer

| | |
| ------- | :--- |
| [D] | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case case, the input buffer could be handled as homogenous float buffer, where all individual float values should be converted to byte-s. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T> |
| [D] | The result image as an Rgba32 buffer |

(case b) Y+Cb+Cr planes --> Single `Rgb24` buffer

| | |
| ------- | :--- |
| [D] | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert and pack into a single Vector4 buffer |
| [D] | Floating point RGBA data as Memory<Vector4> |
| (T) | Convert the Vector4 buffer to an Rgba32 buffer, utilizing BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrow operations through Vector<T> |
| [D] | Temporary Rgba32 buffer |
| (T) | PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!) |
| [D] | The result image as an Rgb24 buffer |

Optimized pipeline

(default `Rgb24` case) Y+Cb+Cr planes --> Single `Rgb24` buffer

| | |
| ------- | :--- |
| D1 | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | Color convert, the 3 planes, and write them back to the originating buffers |
| D2 | 3 Planes of Buffer2D<float>, R+G+B) |
| (T) | Narrow the float buffers to byte buffers using SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows |
| D3 | 3 Planes of Buffer2D<byte>, R+G+B |
| (T) | PACK the separate image planes (color channels) into a single Rgb24 buffer |
| D4 | The result image as an Rgb24 buffer |

(`TPixel` case) Y+Cb+Cr planes --> Single `TPixel` buffer

| | |
| ------- | :--- |
| D1 | 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr) |
| (T) | All the steps from the default Rgb24 case |
| D4 | Memory<Rgb24> |
| (T) | Convert the Rgb24 buffer to TPixel buffer using PixelOperations<T> |
| D5 | The result image as an TPixel buffer |

The magic is mostly in the D3->D4 transition, because of the fact that we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that if we decode to Image<Rgb24> (case b) we can omit an unnecessary step.

API proposal for packing

The best thing is that we can handle this big task incrementally:

[ ] First, extend PixelOperations<T> by new packing operations
[ ] Then, adapt the changes in JpegImagePostProcessor as described in the Optimized pipeline paragraph

The packing API is pretty straightforward:

public class PixelOperations<TPixel>
{
    // ...

    public void PackFromRgbPlanes(
           Configuration configuration,
       ReadOnlySpan<byte> redChannel, 
       ReadOnlySpan<byte> greenChannel, 
       ReadOnlySpan<byte> blueChannel,
       Span<TPixel> destination);
}

We can define a default implementations in the base PixelOperations<TPixel> class, and specialize it for Rgba32 and Rgb24. Optional hardcore task is to T4 a SIMD implementation it for all the RGB(A)-like formats.

Note

It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task both implementation and architecture-wise, and prefer incremental evolution instead.

performance jpeg help needed

Source

antonfirsov

Most helpful comment

In other words: the default implementation contains generic code, that utilizes the IPixel<T> interface thanks to the constraints:
https://github.com/SixLabors/ImageSharp/blob/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats/PixelOperations%7BTPixel%7D.Generated.cs

However instead of extending that interface (more codebloat for low benefits), I suggest to convert TPixel from Rgb24.

antonfirsov on 15 May 2020

👍2

All 13 comments

@brianpopow is packing useful for other codecs?

antonfirsov on 22 Feb 2020

You can get a nice easy win just by using the intrinsics for your float->byte conversions, because the float->int instructions include rounding, and the packing instructions are saturating so you can skip the clamping step. That's as simple as this:

https://github.com/saucecontrol/PhotoSauce/blob/a9bd6e5162d2160419f0cf743fd4f536c079170b/src/MagicScaler/Magic/Processors/ConvertersFloat.cs#L451-L508

You can speed up Y'CbCr->RGB conversion with intrinsics as well, but I was only able to make the SIMD version faster than scalar with AVX. Any matrix multiplication can be accelerated with DPPS (dot product), but since you really only need 4 of the coefficients from the Y'CbCr->RGB matrix, it doesn't work out any faster unless you can operate on the wider vector.

saucecontrol on 22 Feb 2020

Unrelated: @saucecontrol your repo is seriously impressive, incredible work there! 👏

I have a couple questions:

What version of ImageSharp did you use to make those benchmarks? I'm particularly impressed by the very high memory usage it has compared to other libs, especially in the parallel benchmark (207x compared to System.Drawing? Am I missing something here? 😶).
If you have time and are interested, I'd be curious to see your updated results for the benchmark where ImageSharp fails with the OOM error - @antonfirsov's recent PR should've fixed that.

Sergio0694 on 22 Feb 2020

Thanks! I did those benchmarks with the current dev at the time (1.0.0-dev003181). I did a quick run yesterday with 1.0.0-unstable0806, and GC allocations were way down but speed was a bit worse. I see @JimBobSquarePants has a new PR that improves perf. Hopefully that gets merged this weekend? I plan on re-running the whole suite since I've got a new MagicScaler version releasing today. Would be good to have all the ImageSharp improvements in for that.

saucecontrol on 22 Feb 2020

207x compared to System.Drawing? Am I missing something here?

Those numbers will always be misleading because System.Drawing is just a thin wrapper around native code (GDI+). That's why I threw in a more comprehensive test that also measured unmanaged memory allocations.

saucecontrol on 22 Feb 2020

@saucecontrol thanks for the suggestions! Introducing a System.Runtime.InteropServices AVX2 path for our float -> byte conversion is definitely a long hanging fruit. If we could base our implementation on your code it would be super easy.

For Y'CbCr -> RGB, we already have a SIMD path that compiles to AVX2 under the hood, and it outperforms our previous scalar approaches a lot. It's the slow scalar packing that hurts the most at the moment. It will be way much better to do the float -> byte conversion first, and pack 32 bytes at a single step with AVX2.

antonfirsov on 23 Feb 2020

You're certainly welcome to borrow any of my code you find useful. My Y'CbCr conversion is here:

https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/PlanarConversionTransform.cs#L148-L236

You'll notice the scalar float version reads/writes directly to/from the buffers rather than going through a Vector4. Setting the individual fields of Vector4 is always slow because it means a minimum of three shuffle ops, and Intel processors only have a single port that can service those shuffles. It ends up being faster to do the memory read and write since there are more ports to service those and they'll be in cache anyway.

With AVX, it's possible to overcome the extra shuffle overhead by using a DotProduct that operates on two pixels at once, but even that is only profitable if you have access to the full AVX instruction set. I couldn't outdo the scalar code with System.Numerics.

Just realized in looking at the code again, I was doing an extra round of shuffles just to be able to use DotProduct when it's only a gain on the Green output. And with FMA, it's not even a gain there. I simplified the code, and added an SSE version which ends up just over twice as fast as scalar. I haven't had time to draw out ascii diagrams of how that code works, but I'm happy to answer questions if you have any.

I see another quick win possibility in your SIMD Y'CbCr->RGB path if you just hoist those constant Vector<float> values that hold the coefficients out of the loop. RyuJIT has never been smart enough to do that for you, and each of those requires a load/broadcast instruction pair.

saucecontrol on 23 Feb 2020

@brianpopow is packing useful for other codecs?

WebP lossy format is similar to JPEG as in the image is also represented with YUV and split up into macroblocks. The difference is, that the DCT coefficients are represented as byte rather than float. All the decoding is done with integer operations.

For the other formats we currently have, i also can not see this applied. Unfortunately this optimization looks rather specific to JPEG at the moment.

brianpopow on 25 Feb 2020

👍1

I'll give this a shot. A couple of questions:

what is the default behaviour?
For RGB32, does it just assume alpha value of 1?

john-h-k on 15 May 2020

❤1

Awesome! Not sure I understand what you mean by the first question but Rgba32 would use an alpha of 255.

JimBobSquarePants on 15 May 2020

We can define a default implementations in the base PixelOperations class, and specialize it for Rgba32 and Rgb24

@JimBobSquarePants i get the 32/24 behaviour but not sure what it means by default impl

Rgba32 would use an alpha of 255.
Silly me, was thinking of floating point max alpha 😁

john-h-k on 15 May 2020

Ah right.... So we have a "default" PixelOperations class which we then override with implementations specific to individual pixel formats.

The default version will use vectors and the various members of IPixel interface
https://github.com/SixLabors/ImageSharp/tree/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats

Whereas the explicit versions contain optimizations that take advantage of the size and layout of the individual pixel formats.
https://github.com/SixLabors/ImageSharp/tree/master/src/ImageSharp/PixelFormats/PixelImplementations/Generated

JimBobSquarePants on 15 May 2020

However instead of extending that interface (more codebloat for low benefits), I suggest to convert TPixel from Rgb24.

antonfirsov on 15 May 2020

👍2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Split ImageSharp into multiple projects

tocsoft · 28Comments

Image.Resize fails with InvalidCastException when Vector.IsHardwareAccelerated is false

MaddoScientisto · 28Comments

API docs

JimBobSquarePants · 29Comments

ImageSharp fails to build in Release UWP: invalid method body for CreateFrameQuantizer<TPixel>

Sergio0694 · 42Comments

ImageSharp.Formats.Jpg.DecodedBlock ArrayPool is using up WAY too much memory

DeCarabas · 53Comments

Imagesharp: Speed up JPEG Decoder color conversion

Current pipeline

(case a) Y+Cb+Cr planes --> Single Rgba32 buffer

(case b) Y+Cb+Cr planes --> Single Rgb24 buffer

Optimized pipeline

(default Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer

(TPixel case) Y+Cb+Cr planes --> Single TPixel buffer

API proposal for packing

Note

Most helpful comment

All 13 comments

Related issues

(case a) Y+Cb+Cr planes --> Single `Rgba32` buffer

(case b) Y+Cb+Cr planes --> Single `Rgb24` buffer

(default `Rgb24` case) Y+Cb+Cr planes --> Single `Rgb24` buffer

(`TPixel` case) Y+Cb+Cr planes --> Single `TPixel` buffer