I created the xtransnew branch with the intention of making some speedups for xtrans users.
I have a small speedup for xtrans demosaic on hold, but while playing around with xtrans files I noticed that they almost always need a bit of raw false colour suppression. For that reason, as a first step, I made a speedup of about 50% for raw false colour suppression.
Xtrans demosaic speedups will follow later (xtrans demosaic is really hard stuff to optimize...)
Ingo
I added 1d64c65
Compiles and runs fine, I don't see any differences in saved output between this and master.
@Beep6581 Thanks for testing. I also made a lot of tests during the optimization process and detected no differences in output. I'll merge now and leave the branch open for further work on the xtrans speedup.
Thanks Ingo for this first speedup. I hope you will be able to make a noticeable speedup for demosaicing too.
Sébastien, currently I'm at a 9% speedup. xtrans demosaic is really hard to optimize; the sensor layout doesn't make SSE optimizations easy...
Ingo, I tried your new branch, but running the benchmarkRT script I couldn't see any significant difference from master.
Sébastien, I would expect no differences because I didn't push the changes ;-)
Lol ok, then I'm glad I didn't find any difference ;)
However, I thought you had pushed the changes for the false color suppression speedup? That's what I tried; even with false color suppression turned on, I see no speedup.
benchmarkRT was useful back when a single tool took ages to compute, e.g. Shadows and Highlights with Sharp Mask took 3 and a half minutes to process a 10 megapixel image. But nowadays RawTherapee does the same work in one or two seconds, so the script is no longer useful for measuring such short times. Most tools take just milliseconds. To measure such short times you need to hard-code a timer function into the tool and have RawTherapee print the result into the console.
Maybe the timers should be left in the code, and just surrounded by #ifdef USE_TIMER?
@sguyader Sébastien, the false colour suppression changes are already in master branch
@Beep6581 DrSlony, that's what the BENCHMARK #define and the BENCHFUN macros are for
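For readers unfamiliar with those macros, here is a minimal sketch of how such a compile-time-guarded scope timer can be built (illustrative only; the names and details here are assumptions, not RawTherapee's actual BENCHMARK/BENCHFUN code):

```cpp
#include <chrono>
#include <cstdio>

#ifdef BENCHMARK
// Times the enclosing scope and prints the result on destruction.
class ScopeTimer
{
    const char *name;
    std::chrono::steady_clock::time_point start;
public:
    explicit ScopeTimer (const char *n) : name (n), start (std::chrono::steady_clock::now()) {}
    ~ScopeTimer ()
    {
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds> (std::chrono::steady_clock::now() - start).count();
        std::printf ("%s: %lld ms\n", name, static_cast<long long> (ms));
    }
};
#define BENCHFUN ScopeTimer benchTimer (__func__);
#else
#define BENCHFUN // compiles away when benchmarking is disabled
#endif

void someExpensiveFunction ()
{
    BENCHFUN // prints this function's runtime when built with -DBENCHMARK
    // ... actual work ...
}
```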
@Beep6581 For me, processing a 16MP Xtrans file on my laptop with the neutral profile takes 7.8 to 8 seconds on average, plus another 1.2 seconds for false color suppression. Looking at the variation between 5 consecutive runs, I should be able to see a significant difference if it is above 0.5 seconds.
@sguyader I tested the false colour suppression with a 24MP xtrans file and set false colour suppression steps to 5. Before the changes, processing time for false colour suppression was 1244 ms; after the last commit it is 421 ms.
@sguyader Sébastien, currently the speedup is at 10% (90% of the processing time of the master branch). I can push it to the xtransnew branch if you would like to try it. It also includes a minor bugfix and a slightly different averaging of the multi-pass results, which gives a tiny bit better colours in some areas of the picture.
Ingo
@heckflosse yes Ingo I will test new changes when you push them.
@sguyader Sébastien, great, I'll push them tomorrow after cleaning up the code a bit. On my system, xtrans 3-pass for a 24MP xtrans file went down from 1450 to 1290 ms. Not much, but better than nothing. Maybe I'll find some more small bottlenecks...
@sguyader I found another part I could optimize. Processing time now is 1236 ms. But, because of this new optimization I'll probably not be able to push the changes today.
Ingo, regarding the better colours you mentioned, before the averaging: do we have correct alignment of colours across the various passes?
An example of a possible error is the "interpolate red for blue pixels and vice versa" step. If we interpolate the three neighbouring blues on a red pixel by inverse distance (B1 + B2/1.41 + B3/1.41), then the weight centre is not at the centre of the red pixel; it would be at the centre using inverse square :). And inverse square happens to be easy and fast: RB = (B2+B3)/2 + B1
G0 B1 G0
G0 RB G0
B2 G0 B3
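A minimal standalone check of the claim above (hypothetical snippet, not project code): it computes the weight centre of the three blues for both weightings, confirming that inverse square lands exactly on the RB pixel while inverse distance does not.

```cpp
#include <cstdio>
#include <cmath>

int main ()
{
    // Offsets of the blue neighbours relative to the central RB pixel:
    // B1 at (0,-1), B2 at (-1,+1), B3 at (+1,+1), as in the diagram above.
    const double x[3] = {  0.0, -1.0, 1.0 };
    const double y[3] = { -1.0,  1.0, 1.0 };
    const double d[3] = {  1.0, std::sqrt (2.0), std::sqrt (2.0) };

    for (int p = 1; p <= 2; ++p) { // p = 1: inverse distance, p = 2: inverse square
        double wsum = 0.0, cx = 0.0, cy = 0.0;
        for (int i = 0; i < 3; ++i) {
            const double w = 1.0 / std::pow (d[i], p);
            wsum += w;
            cx += w * x[i];
            cy += w * y[i];
        }
        std::printf ("1/d^%d: weight centre = (%.3f, %.3f)\n", p, cx / wsum, cy / wsum);
    }
    // Output: 1/d^1 gives (0.000, 0.172), 1/d^2 gives (0.000, 0.000).
    return 0;
}
```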

BTW .. I think I am blocked out of your email :(
Ilias, I didn't block anyone from my email. I'll have to check that...
I'll try your suggestion now
Ilias, I think it already is (B2+B3)/2+B1, as you suggest
Fine .. but does this perfect centering hold true for all the interpolations?
About the email .. I sent you a double mail (a double click due to a defective mouse button) some months ago .. maybe this triggered a block.
Ilias, I did check the part of the code you mentioned: 'interpolate red for blue pixels and vice versa'. For the remaining interpolations I'll have to look again. Within the next hour I'll post the current changes concerning the speedup of xtrans demosaic. Some scattered pixels differ from the master branch, because I optimized some calculations and also calculated one LUT in double precision which was calculated in float precision before. Once these changes are tested, I'll merge them into master and continue with your interpolation suggestions (which I really appreciate!!!). If I made them now, I guess it would confuse me, and others (you, Sébastien) too, because we wouldn't be able to tell possible errors in my performance optimizations apart from the effects of the changed interpolation.
I hope you agree with that :)
Ingo
@sguyader I pushed the speedup to the xtransnew branch (without the changes to the averaging of final pixels). Processing time for my test file is now about 1260 ms (master branch: 1450).
@iliasg This optimization process helps me to understand how the current implementation works. It's really complicated, but I've already started to understand it. I guess that might help with further quality improvements.
Addition: with the last commit I also reduced the memory usage of one array in xtrans demosaic from
c * 8 * 122 * 122 bytes
to
c * 8 * 122 * 61 bytes, where c is the number of cores.
With some tricky coding it should be possible to reduce the memory usage of this array further, to
c * 8 * 82 * 41 bytes
Reducing memory usage usually leads to an additional speedup. I'll try this in the next few days...
@heckflosse I'm ready to try your new changes. Can you recommend a method for comparing timings and memory usage between the master and xtransnew branches?
@sguyader Copy lines 39 and 40 and line 4015 into master to have the timing code there. You won't be able to measure the memory usage; it's only a small reduction, mainly there to reduce processing time.
For timings I always measure in the queue to avoid side effects from the progress bar. I put the same image 7 times in the queue and take the median of these 7 values.
@sguyader The line numbers in master are different, of course. The BENCHFUN must go right after these lines:
void RawImageSource::xtrans_interpolate (const int passes, const bool useCieLab)
{
@heckflosse with your changes, xtrans demosaicing on my computer went from ~3475ms down to ~3100ms, about 11% faster, with 5-step false color suppression enabled.
@sguyader Did you place a separate BENCHFUN in the false colour suppression?
@heckflosse I didn't. In fact, at first I thought false colour suppression was part of the xtrans demosaicing algorithm, but I realized it is not.
@sguyader Sébastien, false colour suppression runs after demosaic (for bayer and for xtrans). That's correct so far, but what's not correct in my opinion is that it also runs after Retinex. That doesn't make sense. What do you and the other devs think about that? @Hombre57 @Desmis @Beep6581 @michaelezra @adamreichold
I agree, false color suppression is used to correct artifacts from the demosaicing process, so Retinex should come after it.
I also agree that false colour suppression should be right next to demosaicing, as it is used to correct demosaicing-introduced artifacts.
Ok, so we have an agreement about the place of false colour suppression in the tool chain, great!
Let me explain the consequences for performance in preview mode:
Assume that false colour suppression steps > 0.
In the current implementation, 'fcs' (I'll use that as an abbreviation for 'false colour suppression' from now on) is done not for the whole raw file, but only for the part visible in the preview.
Doing fcs for the whole file directly after demosaic (that's the way I would prefer to go) would mean that the initial time to open an image with fcs enabled increases, but the time to pan and zoom in the preview decreases. In queue processing there should be no change in processing time.
That's fine for me, since as long as we're in a fast demosaicing mode (i.e. < 100% zoom), fcs is not computed, and there would be no slowdown for those reviewing their images.
I'll open an issue for fcs where we can discuss it, because it affects bayer too and this thread is about xtrans.
Opened #3208
@sguyader Sébastien, the choice of tile size has an influence on the processing speed of xtrans demosaic. On my machine, ts = 114 is best. If you have some time left, could you please test different values for ts? Maybe we'll find a good common value. Here are my results: http://filebin.net/gk863gpkmq
You need to change ts in the code, recompile using make install (should be fast), then put an image 7 times in the queue, process the queue, and take the median of the measurements. I know that's a bit of work for all the ts values I used (as you can see in the ods file I linked), but I would really appreciate it.
OK Ingo I'll try to do some testing during the weekend. Maybe not tomorrow, as I'm going to meet Jacques who invited me for lunch!
@heckflosse sorry Ingo, I couldn't send my TS tests earlier. I uploaded my results here: http://filebin.net/onuuozhgqz
Interestingly, the lower the tile size, the faster it runs on my system (processing a 16MP file), by a good 10%. As you will see, I've been down to TS=90; should I try to go lower and see if it gets any better, or should I stop here?
I went a little further, and even tile sizes between 70 and 80 lead to a good speedup. The (marginally) best TS I found is 80 (median time 2764 ms, 89.10% compared to TS=122).
I have a question: is there any drawback to lowering the tile size?
@sguyader There's no drawback to lowering the tile size. We should just try to avoid values which are really bad on some systems, e.g. 96 is 10% faster than 122 on your system, but on my system it's 10% slower.
@sguyader Sébastien, let me try to explain the consequences of the choice of tile size:
In tiled demosaicers you always have a gross and a net tile size. The gross tile size determines the size of the square you need to read from the raw data to produce one demosaiced tile, whose size is the net tile size. The ts variable in xtrans demosaic is the gross tile size, while the net tile size is the gross tile size - 16. What does that mean?
Let me give an example. Assume you have a raw file with output dimensions of 6000x4000.
Now we process it with ts = 116. This results in 2400 tiles, each with an input size of 116x116 and an output size of 100x100. The amount of input data is 2400 x 116 x 116 = 32294400 pixels.
If we process it with ts = 66, we get 9600 tiles with an input size of 66x66 and an output size of 50x50. The amount of input data is 9600 x 66 x 66 = 41817600 pixels, which is about 30% more than with a ts of 116.
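Here's a minimal standalone sketch of that arithmetic (hypothetical snippet; the only relation carried over from the text is net = ts - 16):

```cpp
#include <cstdio>

int main ()
{
    const int width = 6000, height = 4000; // raw output dimensions from the example
    const int border = 16;                 // gross tile size minus net tile size

    for (int ts : { 116, 66 }) {
        const int net = ts - border;
        // number of net tiles needed to cover the image (ceiling division)
        const long tilesX = (width + net - 1) / net;
        const long tilesY = (height + net - 1) / net;
        const long tiles = tilesX * tilesY;
        const long input = tiles * (long) ts * ts; // raw pixels read, including tile overlap
        std::printf ("ts = %d: %ld tiles, %ld input pixels\n", ts, tiles, input);
    }
    // Output: ts = 116: 2400 tiles, 32294400 input pixels
    //         ts = 66:  9600 tiles, 41817600 input pixels
    return 0;
}
```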
Looks like choosing larger ts would be better in any case. What's wrong?
If you lower ts, the amount of data to be processed per tile shrinks. As a consequence you get more L2 cache (and L3 cache, if your CPU has one) hits, which reduces slow memory transfers.
Additionally there are some tile sizes which don't work well, depending on the properties of the L2 cache (most likely because power-of-two strides map many rows to the same cache sets). You can see this effect by looking at the value for ts = 128 in my results.
It's impossible to find a ts which is optimal for all systems. The best we can do is find a value which is good for most systems.
Ingo
Thanks for the explanation Ingo. Is there a way to categorise the types of existing systems based on L2/L3 cache and try to find the best overall tile size? Or is it possible to make tile size a choice in Preferences? Then it would be up to the user to find the best value for his system by trial and error.
Sébastien, I also thought about making tile size a choice in Preferences. But a hard-coded tile size also has an influence on speed, because the compiler can do better optimizations when the tile size is known at compile time.
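To illustrate that point, a minimal hypothetical sketch (not RawTherapee code): with a compile-time constant the loop bounds and array strides are known, so the compiler can unroll and vectorize freely; with a preference-driven value it cannot.

```cpp
constexpr int TS = 114; // hard-coded tile size, known to the compiler

void processTileFixed (float (*tile)[TS])
{
    for (int y = 0; y < TS; ++y)
        for (int x = 0; x < TS; ++x)
            tile[y][x] *= 0.5f; // bounds and stride are compile-time constants
}

void processTileDynamic (float *tile, int ts) // ts read from user preferences
{
    for (int y = 0; y < ts; ++y)
        for (int x = 0; x < ts; ++x)
            tile[y * ts + x] *= 0.5f; // bounds and stride only known at run time
}
```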
About categorizing the systems: I would use two categories, old systems and modern systems. Modern systems often have a lot of L2/L3 cache, for which a higher ts is better, whereas old systems often have less L2 and no L3 cache, for which a lower ts is better.
I would prefer to choose a tile size which is good for modern systems
I agree it is better to target modern systems.
Good list of CPUs including cache types and sizes:
https://en.wikipedia.org/wiki/List_of_Intel_microprocessors
https://en.wikipedia.org/wiki/List_of_AMD_microprocessors
https://en.wikipedia.org/wiki/Category:Lists_of_microprocessors
With 636d0be I vectorized one part of xtrans demosaic. The speedup is about 4%. Another small step...
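For readers curious what SSE vectorization looks like in principle, a hedged, generic sketch (illustrative only, not the code from 636d0be; the xtrans-specific code is far more involved because of the 6x6 sensor layout):

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Scales a float buffer, processing 4 values per iteration.
void scaleBuffer (float *buf, int n, float factor)
{
    const __m128 f = _mm_set1_ps (factor);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps (buf + i);
        _mm_storeu_ps (buf + i, _mm_mul_ps (v, f));
    }
    for (; i < n; ++i) // scalar tail for the remaining values
        buf[i] *= factor;
}
```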
I just found some further possible speedups. I don't know how much we'll gain. I'll try tomorrow.
@sguyader Sébastien, any objections to merging the speedups into the master branch?
No objection Ingo, I tested your last changes and everything looks fine.
Speedups merged into master branch
I pushed another 4% speedup to the xtransnew branch. Additionally, I disabled the CLIP in xtrans demosaic.
@heckflosse after all your changes we'll end up with a 50% speedup compared to the initial demosaicing ;-)
By the way, what is the CLIP?
In the original (integer) version of dcraw's xtrans demosaic, some values had to be clipped to fit into the [0;65535] range. For float that's not necessary, and dropping the clip could additionally give a bit more room for highlight recovery. Other demosaicers (Amaze...) also don't clip.
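For illustration, a hedged sketch of what a clip of this kind typically does (hypothetical; not necessarily RawTherapee's exact CLIP macro):

```cpp
// Clamp a demosaiced value into the 16-bit integer range.
inline float clipToU16 (float v)
{
    return v < 0.f ? 0.f : (v > 65535.f ? 65535.f : v);
}
// With the clip removed, values above 65535 survive demosaic,
// leaving headroom for later highlight recovery.
```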
@sguyader About the 50% speedup: that would need seventeen 4% speedups. I don't know whether I'll succeed...
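(For the record, successive speedups compound multiplicatively, which is where the seventeen comes from: 0.96^17 ≈ 0.50, i.e. seventeen 4% reductions roughly halve the processing time.)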
I know that you do what's humanly possible, Ingo.
:)
I pushed a new speedup (about 11%) to the xtransnew branch.
@heckflosse Very nice speedup Ingo! With the default tile size of 114, I get between 2360 and 2410 ms, and it is now consistently faster than TS=80 (2460 ms), which I previously found to be the fastest tile size. It's funny to me that TS=114 is now better than TS=80 with your new changes.
So, it is now 15% faster with default tile size!
Back from holiday, I would like to merge this speedup (after removing the benchmark code) into master. Any objections?
Ingo
No problem with me. I tried it and it compiles and runs fine (and faster).
Compiles and runs fine here.
By the way, changing the "Sensor with Bayer Matrix" parameters when using X-Trans raws still triggers re-demosaicing.
I merged the speedups into master