I created the xtransnew branch with the intention of making some speedups for xtrans users.
I have a small speedup for xtrans demosaic on hold, but while playing around with xtrans files I noticed that they almost always need a bit of raw false colour suppression. For that reason, as a first step, I made a speedup of about 50% for raw false colour suppression.
Xtrans demosaic speedups will follow later (xtrans demosaic is really hard stuff to optimize...)
Ingo
I added 1d64c65
Compiles and runs fine, I don't see any differences in saved output between this and master.
@Beep6581 Thanks for testing. I also made a lot of tests during the optimization process and detected no differences in output. I'll merge now and leave the branch open for further work on the xtrans speedup.
Thanks Ingo for this first speedup. I hope you will be able to make a noticeable speedup for demosaicing too.
Sébastien, currently I'm at a 9% speedup. xtrans demosaic is really hard to optimize; the sensor layout doesn't make SSE optimizations easy...
Ingo, I tried your new branch, but running the benchmarkRT script I couldn't see any significant difference from master.
Sébastien, I would expect no differences because I didn't push the changes ;-)
Lol ok, then I'm glad I didn't find any difference ;)
However, I thought you had pushed the changes for the false color suppression speedup? That's what I tried; even with false color suppression turned on, I see no speedup.
benchmarkRT was useful back when a single tool took ages to compute, e.g. Shadows and Highlights with Sharp Mask took 3 and a half minutes to process a 10 megapixel image. But nowadays RawTherapee does the same work in one or two seconds, so the script is no longer useful for measuring such short times. Most tools take just milliseconds. To measure such short times you need to hard-code a timer function into the tool and have RawTherapee print the result into the console.
Maybe the timers should be left in the code, and just surrounded by #ifdef USE_TIMER?
@sguyader Sébastien, the false colour suppression changes are already in master branch
@Beep6581 DrSlony, that's what the BENCHMARK #define and the BENCHFUN macros are for
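For readers unfamiliar with those macros, here is a minimal sketch of how such a compile-time-guarded scope timer can be built (illustrative only; the names and details here are assumptions, not RawTherapee's actual BENCHMARK/BENCHFUN code):

```cpp
#include <chrono>
#include <cstdio>

#ifdef BENCHMARK
// Times the enclosing scope and prints the result on destruction.
class ScopeTimer
{
    const char *name;
    std::chrono::steady_clock::time_point start;
public:
    explicit ScopeTimer (const char *n) : name (n), start (std::chrono::steady_clock::now()) {}
    ~ScopeTimer ()
    {
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds> (std::chrono::steady_clock::now() - start).count();
        std::printf ("%s: %lld ms\n", name, static_cast<long long> (ms));
    }
};
#define BENCHFUN ScopeTimer benchTimer (__func__);
#else
#define BENCHFUN // compiles away when benchmarking is disabled
#endif

void someExpensiveFunction ()
{
    BENCHFUN // prints this function's runtime when built with -DBENCHMARK
    // ... actual work ...
}
```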
@Beep6581 For me, processing a 16MP Xtrans file on my laptop with the neutral profile takes 7.8 to 8 seconds on average, plus another 1.2 seconds for false color suppression. Looking at the variation between 5 consecutive runs, I should be able to see a significant difference if it is above 0.5 seconds.
@sguyader I tested the false colour suppression with a 24MP xtrans file and set false colour suppression steps to 5. Before the changes, processing time for false colour suppression was 1244 ms; after the last commit it is 421 ms.
@sguyader Sébastien, currently the speedup is at 10% (90% of the processing time of the master branch). I can push it to the xtransnew branch if you would like to try it. It also includes a minor bugfix and a slightly different averaging of the multi-pass results, which gives a tiny bit better colours in some areas of the picture.
Ingo
@heckflosse yes Ingo I will test new changes when you push them.
@sguyader Sébastien, great, I'll push them tomorrow after cleaning up the code a bit. On my system, xtrans 3-pass for a 24MP xtrans file went down from 1450 to 1290 ms. Not much, but better than nothing. Maybe I'll find some more small bottlenecks...
@sguyader I found another part I could optimize. Processing time now is 1236 ms. But, because of this new optimization I'll probably not be able to push the changes today.
Ingo, regarding the better colours you mentioned, before the averaging: do we have correct alignment of colours across the various passes?
An example of a possible error is the "interpolate red for blue pixels and vice versa" step. If we interpolate the three neighbouring blues on a red pixel by inverse distance (B1 + B2/1.41 + B3/1.41), then the weight centre is not at the centre of the red pixel; it would be at the centre using inverse square :). And inverse square happens to be easy and fast: RB = (B2+B3)/2 + B1
G0 B1 G0
G0 RB G0
B2 G0 B3
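A minimal standalone check of the claim above (hypothetical snippet, not project code): it computes the weight centre of the three blues for both weightings, confirming that inverse square lands exactly on the RB pixel while inverse distance does not.

```cpp
#include <cstdio>
#include <cmath>

int main ()
{
    // Offsets of the blue neighbours relative to the central RB pixel:
    // B1 at (0,-1), B2 at (-1,+1), B3 at (+1,+1), as in the diagram above.
    const double x[3] = {  0.0, -1.0, 1.0 };
    const double y[3] = { -1.0,  1.0, 1.0 };
    const double d[3] = {  1.0, std::sqrt (2.0), std::sqrt (2.0) };

    for (int p = 1; p <= 2; ++p) { // p = 1: inverse distance, p = 2: inverse square
        double wsum = 0.0, cx = 0.0, cy = 0.0;
        for (int i = 0; i < 3; ++i) {
            const double w = 1.0 / std::pow (d[i], p);
            wsum += w;
            cx += w * x[i];
            cy += w * y[i];
        }
        std::printf ("1/d^%d: weight centre = (%.3f, %.3f)\n", p, cx / wsum, cy / wsum);
    }
    // Output: 1/d^1 gives (0.000, 0.172), 1/d^2 gives (0.000, 0.000).
    return 0;
}
```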

BTW .. I think I am blocked out of your email :(
Ilias, I didn't block anyone from my email. I'll have to check that...
I'll try your suggestion now
Ilias, I think it already is (B2+B3)/2+B1, as you suggest
Fine .. but does this perfect centering hold true for all the interpolations?
About the email .. I sent you a double mail (a double click due to a defective mouse button) some months ago .. maybe this triggered a block.
Ilias, I did check the part of the code you mentioned: 'interpolate red for blue pixels and vice versa'. For the remaining interpolations I'll have to look again. Within the next hour I'll post the current changes concerning the speedup of xtrans demosaic. Some scattered pixels differ from the master branch, because I optimized some calculations and also calculated one LUT in double precision which was calculated in float precision before. Once these changes are tested, I'll merge them into master and continue with your interpolation suggestions (which I really appreciate!!!). If I made them now, I guess it would confuse me, and others (you, Sébastien) too, because we wouldn't be able to tell possible errors in my performance optimizations apart from the effects of the changed interpolation.
I hope you agree with that :)
Ingo
@sguyader I pushed the speedup to the xtransnew branch (without the changes to the averaging of final pixels). Processing time for my test file is now about 1260 ms (master branch: 1450).
@iliasg This optimization process helps me to understand how the current implementation works. It's really complicated, but I've already started to understand it. I guess that might help with further quality improvements.
Addition: with the last commit I also reduced the memory usage of one array in xtrans demosaic from
c * 8 * 122 * 122 bytes
to
c * 8 * 122 * 61 bytes, where c is the number of cores.
With some tricky coding it should be possible to reduce the memory usage of this array further, to
c * 8 * 82 * 41 bytes
Reducing memory usage usually leads to an additional speedup. I'll try this in the next few days...
@heckflosse I'm ready to try your new changes. Can you recommend a method for comparing timings and memory usage between the master and xtransnew branches?
@sguyader Copy lines 39 and 40 and line 4015 into master to have the timing code there. You won't be able to measure the memory usage; it's only a small reduction, mainly there to reduce processing time.
For timings I always measure in the queue to avoid side effects from the progress bar. I put the same image 7 times in the queue and take the median of these 7 values.
@sguyader The line numbers in master are different, of course. The BENCHFUN must go right after these lines:
void RawImageSource::xtrans_interpolate (const int passes, const bool useCieLab)
{
@heckflosse with your changes, xtrans demosaicing on my computer went from ~3475ms down to ~3100ms, about 11% faster, with 5-step false color suppression enabled.
@sguyader Did you place a separate BENCHFUN in the false colour suppression?
@heckflosse I didn't. In fact, at first I thought false colour suppression was part of the xtrans demosaicing algorithm, but I realized it is not.
@sguyader Sébastien, false colour suppression runs after demosaic (for bayer and for xtrans). That's correct so far, but what's not correct in my opinion is that it also runs after Retinex. That doesn't make sense. What do you and the other devs think about that? @Hombre57 @Desmis @Beep6581 @michaelezra @adamreichold
I agree, false color suppression is used to correct artifacts from the demosaicing process, so Retinex should come after it.
I also agree that false colour suppression should be right next to demosaicing, as it is used to correct demosaicing-introduced artifacts.
Ok, so we have an agreement about the place of false colour suppression in the tool chain, great!
Let me explain the consequences for performance in preview mode:
Assume that false colour suppression steps > 0.
In the current implementation, 'fcs' (I'll use that as an abbreviation for 'false colour suppression' from now on) is done not for the whole raw file, but only for the part visible in the preview.
Doing fcs for the whole file directly after demosaic (that's the way I would prefer to go) would mean that the initial time to open an image with fcs enabled increases, but the time to pan and zoom in the preview decreases. In queue processing there should be no change in processing time.
That's fine for me, since as long as we're in a fast demosaicing mode (i.e. < 100% zoom), fcs is not computed, and there would be no slowdown for those reviewing their images.
I'll open an issue for fcs where we can discuss it, because it affects bayer too and this thread is about xtrans.
Opened #3208
@sguyader Sébastien, the choice of tile size has an influence on the processing speed of xtrans demosaic. On my machine, ts = 114 is best. If you have some time left, could you please test different values for ts? Maybe we'll find a good common value. Here are my results: http://filebin.net/gk863gpkmq
You need to change ts in the code, recompile using make install (should be fast), then put an image 7 times in the queue, process the queue, and take the median of the measurements. I know that's a bit of work for all the ts values I used (as you can see in the ods file I linked), but I would really appreciate it.
OK Ingo I'll try to do some testing during the weekend. Maybe not tomorrow, as I'm going to meet Jacques who invited me for lunch!
@heckflosse sorry Ingo, I couldn't send my TS tests earlier. I uploaded my results here: http://filebin.net/onuuozhgqz
Interestingly, the lower the tile size, the faster it runs on my system (processing a 16MP file), by a good 10%. As you will see, I've been down to TS=90; should I try to go lower and see if it gets any better, or should I stop here?
I went a little further, and even tile sizes between 70 and 80 lead to a good speedup. The (marginally) best TS I found is 80 (median time 2764 ms, 89.10% compared to TS=122).
I have a question: is there any drawback to lowering the tile size?
@sguyader There's no drawback to lowering the tile size. We should just try to avoid values which are really bad on some systems, e.g. 96 is 10% faster than 122 on your system, but on my system it's 10% slower.
@sguyader Sébastien, let me try to explain the consequences of the choice of tile size:
In tiled demosaicers you always have a gross and a net tile size. The gross tile size determines the size of the square you need to read from the raw data to produce one demosaiced tile, whose size is the net tile size. The ts variable in xtrans demosaic is the gross tile size, while the net tile size is the gross tile size - 16. What does that mean?
Let me give an example. Assume you have a raw file with output dimensions of 6000x4000.
Now we process it with ts = 116. This results in 2400 tiles, each with an input size of 116x116 and an output size of 100x100. The amount of input data is 2400 x 116 x 116 = 32294400 pixels.
If we process it with ts = 66, we get 9600 tiles with an input size of 66x66 and an output size of 50x50. The amount of input data is 9600 x 66 x 66 = 41817600 pixels, which is about 30% more than with a ts of 116.
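Here's a minimal standalone sketch of that arithmetic (hypothetical snippet; the only relation carried over from the text is net = ts - 16):

```cpp
#include <cstdio>

int main ()
{
    const int width = 6000, height = 4000; // raw output dimensions from the example
    const int border = 16;                 // gross tile size minus net tile size

    for (int ts : { 116, 66 }) {
        const int net = ts - border;
        // number of net tiles needed to cover the image (ceiling division)
        const long tilesX = (width + net - 1) / net;
        const long tilesY = (height + net - 1) / net;
        const long tiles = tilesX * tilesY;
        const long input = tiles * (long) ts * ts; // raw pixels read, including tile overlap
        std::printf ("ts = %d: %ld tiles, %ld input pixels\n", ts, tiles, input);
    }
    // Output: ts = 116: 2400 tiles, 32294400 input pixels
    //         ts = 66:  9600 tiles, 41817600 input pixels
    return 0;
}
```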
Looks like choosing larger ts would be better in any case. What's wrong?
If you lower ts, the amount of data to be processed per tile shrinks. As a consequence you get more L2 cache (and L3 cache, if your CPU has one) hits, which reduces slow memory transfers.
Additionally there are some tile sizes which don't work well, depending on the properties of the L2 cache (most likely because power-of-two strides map many rows to the same cache sets). You can see this effect by looking at the value for ts = 128 in my results.
It's impossible to find a ts which is optimal for all systems. The best we can do is find a value which is good for most systems.
Ingo
Thanks for the explanation Ingo. Is there a way to categorise the types of existing systems based on L2/L3 cache and try to find the best overall tile size? Or is it possible to make tile size a choice in Preferences? Then it would be up to the user to find the best value for his system by trial and error.
Sébastien, I also thought about making tile size a choice in Preferences. But a hard-coded tile size also has an influence on speed, because the compiler can do better optimizations when the tile size is known at compile time.
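To illustrate that point, a minimal hypothetical sketch (not RawTherapee code): with a compile-time constant the loop bounds and array strides are known, so the compiler can unroll and vectorize freely; with a preference-driven value it cannot.

```cpp
constexpr int TS = 114; // hard-coded tile size, known to the compiler

void processTileFixed (float (*tile)[TS])
{
    for (int y = 0; y < TS; ++y)
        for (int x = 0; x < TS; ++x)
            tile[y][x] *= 0.5f; // bounds and stride are compile-time constants
}

void processTileDynamic (float *tile, int ts) // ts read from user preferences
{
    for (int y = 0; y < ts; ++y)
        for (int x = 0; x < ts; ++x)
            tile[y * ts + x] *= 0.5f; // bounds and stride only known at run time
}
```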
About categorizing the systems: I would use two categories, old systems and modern systems. Modern systems often have a lot of L2/L3 cache, for which a higher ts is better, whereas old systems often have less L2 and no L3 cache, for which a lower ts is better.
I would prefer to choose a tile size which is good for modern systems
I agree it is better to target modern systems.
Good list of CPUs including cache types and sizes:
https://en.wikipedia.org/wiki/List_of_Intel_microprocessors
https://en.wikipedia.org/wiki/List_of_AMD_microprocessors
https://en.wikipedia.org/wiki/Category:Lists_of_microprocessors
With 636d0be I vectorized one part of xtrans demosaic. The speedup is about 4%. Another small step...
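For readers curious what SSE vectorization looks like in principle, a hedged, generic sketch (illustrative only, not the code from 636d0be; the xtrans-specific code is far more involved because of the 6x6 sensor layout):

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Scales a float buffer, processing 4 values per iteration.
void scaleBuffer (float *buf, int n, float factor)
{
    const __m128 f = _mm_set1_ps (factor);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps (buf + i);
        _mm_storeu_ps (buf + i, _mm_mul_ps (v, f));
    }
    for (; i < n; ++i) // scalar tail for the remaining values
        buf[i] *= factor;
}
```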
I just found some further possible speedups. I don't know how much we'll gain. I'll try tomorrow.
@sguyader Sébastien, any objections to merging the speedups into the master branch?
No objection Ingo, I tested your last changes and everything looks fine.
Speedups merged into master branch
I pushed another 4% speedup to the xtransnew branch. Additionally, I disabled the CLIP in xtrans demosaic.
@heckflosse after all your changes we'll end up with a 50% speedup compared to the initial demosaicing ;-)
By the way, what is the CLIP?
In the original (integer) version of dcraw's xtrans demosaic, some values had to be clipped to fit into the [0;65535] range. For float that's not necessary, and dropping the clip could additionally give a bit more room for highlight recovery. Other demosaicers (Amaze...) also don't clip.
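For illustration, a hedged sketch of what a clip of this kind typically does (hypothetical; not necessarily RawTherapee's exact CLIP macro):

```cpp
// Clamp a demosaiced value into the 16-bit integer range.
inline float clipToU16 (float v)
{
    return v < 0.f ? 0.f : (v > 65535.f ? 65535.f : v);
}
// With the clip removed, values above 65535 survive demosaic,
// leaving headroom for later highlight recovery.
```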
@sguyader About the 50% speedup: that would need seventeen 4% speedups. I don't know whether I'll succeed...
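(For the record, successive speedups compound multiplicatively, which is where the seventeen comes from: 0.96^17 ≈ 0.50, i.e. seventeen 4% reductions roughly halve the processing time.)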
I know that you do what's humanly possible, Ingo.
:)
I pushed a new speedup (about 11%) to the xtransnew branch.
@heckflosse Very nice speedup Ingo! With the default tile size of 114, I get between 2360 and 2410 ms, and it is now consistently faster than TS=80 (2460 ms), which I previously found to be the fastest tile size. It's funny to me that TS=114 is now better than TS=80 with your new changes.
So, it is now 15% faster with default tile size!
Back from holiday, I would like to merge this speedup (after removing the benchmark code) into master. Any objections?
Ingo
No problem with me. I tried it and it compiles and runs fine (and faster).
Compiles and runs fine here.
By the way, changing the "Sensor with Bayer Matrix" parameters when using X-Trans raws still triggers re-demosaicing.
I merged the speedups into master