Would make the most sense to address this together with #808, AVX2 optimizing the YCbCr and grayscale path.
Recommend that we pause this until we get the new SIMD APIs in System.Runtime.Intrinsics.
I took a look at this briefly and I do think you will need HWIntrinsics to handle this properly. You might be able to get part of the way with Vector<T>, but it is lacking for things like register blending that would make it quite a bit more efficient.
I think that the CalculateWeights function can also be vectorized, but it would require modifying the IResampler to expose a new method that took a Vector4.
There might also be some small perf wins if the implementors of IResampler were made sealed (which helps with devirtualization) or if you made them structs (which might allow some generic specialization tricks).
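To illustrate the struct idea, here is a minimal sketch. `ISamplerSketch`, `BoxSamplerSketch` and `SpecializedKernel` are hypothetical names made up for this example and are not the library's actual IResampler types. The point is only that a generic method constrained to a struct gets its own JIT-compiled instantiation per value type, so the `GetValue` call can be resolved directly (and possibly inlined) instead of going through interface dispatch.

```csharp
using System;

// Hypothetical stand-in for IResampler, for illustration only.
public interface ISamplerSketch
{
    float GetValue(float x);
}

// A struct implementor: generic code over this type is specialized by the JIT.
public readonly struct BoxSamplerSketch : ISamplerSketch
{
    public float GetValue(float x) => (x > -0.5F && x <= 0.5F) ? 1F : 0F;
}

public static class SpecializedKernel
{
    // The struct constraint means each TSampler gets its own compiled copy,
    // so sampler.GetValue is a direct call rather than an interface call.
    public static float Sum<TSampler>(TSampler sampler, ReadOnlySpan<float> xs)
        where TSampler : struct, ISamplerSketch
    {
        float sum = 0F;
        for (int i = 0; i < xs.Length; i++)
        {
            sum += sampler.GetValue(xs[i]);
        }

        return sum;
    }
}
```

Usage would look like `SpecializedKernel.Sum(new BoxSamplerSketch(), values)`. Whether this gives a measurable win would need benchmarking.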
@tannergooding my understanding is that IResampler invocations are usually not on the hot path (at least not in resize).
I'd like to push towards extending such interfaces with bulk methods (e.g. resampler.SampleValues(sourceSpan, destSpan)). This way we can follow YAGNI and avoid large refactors early on.
What I think we really need is a 100% SIMD convolution (it's on our hottest resize path). For that we need shuffling intrinsics so we can at least temporarily switch from AOS to SOA layout.
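As a rough sketch of the AOS to SOA switch, assuming System.Runtime.Intrinsics is available (.NET Core 3.0 or later): four RGBA pixels stored as interleaved floats are transposed into one vector per channel using SSE unpack and move shuffles. `LayoutSketch` and `Deinterleave4` are hypothetical names for this example. Loads and stores go element by element here to keep the code free of `unsafe`; real code would presumably load straight from memory.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class LayoutSketch
{
    // Input (AOS):  r0 g0 b0 a0 | r1 g1 b1 a1 | r2 g2 b2 a2 | r3 g3 b3 a3
    // Output (SOA): r0 r1 r2 r3, g0 g1 g2 g3, b0 b1 b2 b3, a0 a1 a2 a3
    public static void Deinterleave4(
        ReadOnlySpan<float> aos, Span<float> r, Span<float> g, Span<float> b, Span<float> a)
    {
        if (aos.Length < 16 || r.Length < 4 || g.Length < 4 || b.Length < 4 || a.Length < 4)
        {
            throw new ArgumentException("Buffers are too small.");
        }

        if (Sse.IsSupported)
        {
            Vector128<float> p0 = Vector128.Create(aos[0], aos[1], aos[2], aos[3]);
            Vector128<float> p1 = Vector128.Create(aos[4], aos[5], aos[6], aos[7]);
            Vector128<float> p2 = Vector128.Create(aos[8], aos[9], aos[10], aos[11]);
            Vector128<float> p3 = Vector128.Create(aos[12], aos[13], aos[14], aos[15]);

            Vector128<float> t0 = Sse.UnpackLow(p0, p1);  // r0 r1 g0 g1
            Vector128<float> t1 = Sse.UnpackHigh(p0, p1); // b0 b1 a0 a1
            Vector128<float> t2 = Sse.UnpackLow(p2, p3);  // r2 r3 g2 g3
            Vector128<float> t3 = Sse.UnpackHigh(p2, p3); // b2 b3 a2 a3

            Store(Sse.MoveLowToHigh(t0, t2), r); // r0 r1 r2 r3
            Store(Sse.MoveHighToLow(t2, t0), g); // g0 g1 g2 g3
            Store(Sse.MoveLowToHigh(t1, t3), b); // b0 b1 b2 b3
            Store(Sse.MoveHighToLow(t3, t1), a); // a0 a1 a2 a3
            return;
        }

        // Scalar fallback for hardware without SSE.
        for (int i = 0; i < 4; i++)
        {
            r[i] = aos[4 * i];
            g[i] = aos[(4 * i) + 1];
            b[i] = aos[(4 * i) + 2];
            a[i] = aos[(4 * i) + 3];
        }
    }

    private static void Store(Vector128<float> v, Span<float> dst)
    {
        for (int i = 0; i < 4; i++)
        {
            dst[i] = v.GetElement(i);
        }
    }
}
```

Once the channels are in SOA form, each weight can be multiplied against a whole vector of same-channel values, which is what the convolution loop would want.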
IResampler is on a hot path in our affine and projective transforms. We can't use bulk operations there unfortunately either.
We can't use bulk operations there unfortunately either.
Why? Basically this is what we do in CalculateWeights()
Or am I missing something?
The weights are currently calculated on the fly and per pixel. You have to calculate them based upon the transformed location, not the input one, which isn't an integral vector. Always looking for a better way though.
Yeah, but you are still collecting them into a linear destination buffer. What we can do is to have a method with similar semantics to CalculateWeights() right on IResampler. This way we can SIMD calculate sampler values in the IResampler implementation, for a given range of inputs. (Same way as we do in other bulk methods across the lib.)
A method like SampleValues(Span<float> values, Span<float> result) or probably SampleValues(float startVal, float delta, int count, Span<float> result) can do the job. These APIs could fit the usage in both ResizeKernelMap and TransformKernelMap.
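A rough sketch of what the second signature could look like, using a triangle (bilinear) kernel as the example. `IBulkResamplerSketch` and `TriangleResamplerSketch` are hypothetical names and not the existing IResampler API. The vector path uses only `Vector<T>` and falls back to the scalar `GetValue` for the remainder.

```csharp
using System;
using System.Numerics;

// Hypothetical bulk-capable interface, for illustration only.
public interface IBulkResamplerSketch
{
    float GetValue(float x);

    // Writes GetValue(startVal + i * delta) into result[i] for i in [0, count).
    void SampleValues(float startVal, float delta, int count, Span<float> result);
}

public sealed class TriangleResamplerSketch : IBulkResamplerSketch
{
    public float GetValue(float x)
    {
        x = MathF.Abs(x);
        return x < 1F ? 1F - x : 0F;
    }

    public void SampleValues(float startVal, float delta, int count, Span<float> result)
    {
        if (result.Length < count)
        {
            throw new ArgumentException("Result buffer is too small.", nameof(result));
        }

        int i = 0;
        int width = Vector<float>.Count;

        if (Vector.IsHardwareAccelerated && count >= width)
        {
            // Per-lane offsets: 0, delta, 2 * delta, ...
            Span<float> offsets = stackalloc float[width];
            for (int k = 0; k < width; k++)
            {
                offsets[k] = k * delta;
            }

            Vector<float> lanes = new Vector<float>(offsets);
            Vector<float> one = Vector<float>.One;

            for (; i <= count - width; i += width)
            {
                Vector<float> x = new Vector<float>(startVal + (i * delta)) + lanes;

                // Triangle kernel without branches: max(1 - |x|, 0).
                Vector<float> w = Vector.Max(one - Vector.Abs(x), Vector<float>.Zero);
                w.CopyTo(result.Slice(i));
            }
        }

        // Scalar tail (or the whole range when SIMD is unavailable).
        for (; i < count; i++)
        {
            result[i] = this.GetValue(startVal + (i * delta));
        }
    }
}
```

The vector and scalar paths can differ by floating point rounding in the last bits, since the inputs are accumulated in a slightly different order. Kernels with more complex math (e.g. ones needing sin) would need their own vectorized approximations.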
I think I see where you are going here. I need coffee though, it's been a busy week!