Server: ffmpeg consumer slow colorspace transform on i59

Created on 3 Mar 2018 · 75 Comments · Source: CasparCG/server

This is sort of a continuation of #883.

Compiled 1fb0d9348d424a008d1e2ee97539aac15a1e0f1f myself.

Command used:
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=out_range=tv:out_color_matrix=bt709,format=yuv422p

The input buffer fills linearly from the moment the consumer starts, then saturates and never drains. There are plenty of resources left on this brand new MacBook Pro.


Labels: feedback/pull request wanted, ffmpeg-consumer, type/enhancement


TomKaltz

All 75 comments

Also please note that the stock 2.0.7 release records this file without dropped frames with the following command:

ADD 1-7 FILE record7.mxf -vcodec dnxhd

TomKaltz on 3 Mar 2018

Yea, but 2.0.7 doesn't output proper colors.

The bottleneck here seems to be the extra scale filter we add to correct the color range (and the fact that we don't run the encoder and filter in parallel).

ronag on 4 Mar 2018

There is no way around this issue. The scale=out_range=tv:out_color_matrix=bt709 filter doesn't run in real time with color transforms unless you use a CPU that is powerful enough in terms of SIMD, frequency and IPC. More cores do not help.

ronag on 4 Mar 2018

Is there any way we could do this at the card level (Decklink/Bluefish)?

5opr4ni on 4 Mar 2018

I'm not sure why it's so slow.

Resolving this requires one of the following:

Faster colorspace transform on CPU
Parallel colorspace transform
Do the transform on GPU

ronag on 4 Mar 2018

@5opr4ni no card is involved here, other than the gpu

ronag on 4 Mar 2018

I wonder why the scale filter is so slow in this position.
I have to check this with standalone ffmpeg next week. Maybe some additional flags provide more performance...
I had no issues with encoding high bitrate long gop h264 4:2:2 on HP Z440 workstation.

premultiply on 4 Mar 2018


Sorry! I missed that this is about the consumer; I was thinking of the producer.

5opr4ni on 4 Mar 2018

@5opr4ni on input it's not a problem as far as I can tell

ronag on 4 Mar 2018


I wish the scale filter had slice threading...

ronag on 4 Mar 2018

It's one of ffmpeg's core components... I can't imagine it performing badly here...

premultiply on 4 Mar 2018

@premultiply: it has fast and slow paths, they don't optimize for every possible use case

ronag on 4 Mar 2018

Take a 59p video and see if you can build a corresponding command string that runs in realtime in standalone ffmpeg.

ronag on 4 Mar 2018
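One way to follow up on the request above is to time the filter chain in standalone ffmpeg with no encoder attached. The sketch below only assembles the command line and does not run it. The `testsrc2` synthetic source, the `format=bgra` prefix (standing in for the mixer's output) and the helper name `build_benchmark_cmd` are assumptions made for illustration; none of them come from this thread.

```python
import shlex

# Filter chain taken from the command in the original report.
SCALE_CHAIN = "interlace,scale=out_range=tv:out_color_matrix=bt709,format=yuv422p"


def build_benchmark_cmd(filtergraph, frames=600):
    """Build an ffmpeg command that pushes 1080p59.94 test frames through
    `filtergraph` and discards the result, so only filter cost is measured."""
    return [
        "ffmpeg", "-hide_banner", "-benchmark",
        # Synthetic 59.94 fps progressive source (assumption: no real clip needed).
        "-f", "lavfi", "-i", "testsrc2=size=1920x1080:rate=60000/1001",
        "-frames:v", str(frames),
        # Force BGRA first to mimic what the consumer receives (assumption).
        "-vf", "format=bgra," + filtergraph,
        # Null muxer: nothing is encoded or written to disk.
        "-f", "null", "-",
    ]


cmd = build_benchmark_cmd(SCALE_CHAIN)
print(shlex.join(cmd))
```

When such a command is run, ffmpeg's progress line reports a `speed=` figure; a value below 1x would suggest the chain cannot keep up with real time on that machine.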

Yes. And I have to check whether it scales twice for some reason.

premultiply on 4 Mar 2018

One other alternative is to make the GPU mixer always output TV range RGB. That way we might not need the extra color transform (except for the screen consumer which could downgrade to experimental until we fix it). @premultiply? This would probably result in less accurate RGB=>YUV transforms though.

I'm not sure exactly how the FFMPEG default RGB->YUV conversion works. But I'm guessing it doesn't apply any color range calculations.

ronag on 4 Mar 2018
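For reference, the arithmetic behind a full-range RGB to TV-range BT.709 conversion can be sketched per pixel as below. This is the textbook formula with the standard BT.709 luma coefficients, not a description of what FFmpeg's default conversion or the CasparCG mixer actually does; the function name is hypothetical.

```python
# BT.709 luma coefficients; Kg follows from Kr + Kg + Kb = 1.
KR, KB = 0.2126, 0.0722
KG = 1.0 - KR - KB


def rgb_full_to_yuv709_tv(r, g, b):
    """Convert one 8-bit full-range RGB pixel to 8-bit limited-range Y'CbCr."""
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    y = KR * rn + KG * gn + KB * bn        # luma in 0..1
    cb = (bn - y) / (2.0 * (1.0 - KB))     # chroma in -0.5..0.5
    cr = (rn - y) / (2.0 * (1.0 - KR))
    # TV range: luma occupies 16..235, chroma 16..240 centred on 128.
    return (round(16 + 219 * y), round(128 + 224 * cb), round(128 + 224 * cr))


print(rgb_full_to_yuv709_tv(255, 255, 255))  # white
print(rgb_full_to_yuv709_tv(0, 0, 0))        # black
```

The range step (the `16 + 219 * y` scaling) and the matrix step (the `KR`/`KB` coefficients) are separate choices, which is why both the range and the color matrix have to be known before converting from RGB.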

Another alternative is to output YUV444 (TV) from the mixer. Hmmm.... I think I like that best of all alternatives.

Will however require an extra GPU pass and also some work in the decklink and especially screen consumer. Would make ffmpeg and decklink consumers faster.

ronag on 4 Mar 2018

No, it is not that easy. When you convert to YUV (never mind the chroma subsampling here) the result is always TV range, BUT you have to decide on the COLOR MATRIX before you do the conversion from RGB. This would only be possible if the color matrix were locked to the channel mode and other input and output modes (resolutions) were not allowed. I do not think we want that.
Converting the color matrix afterwards may be lossy for some color grades. Avoid that where possible (it has to be done for SD-HD-UHD conversion in one step).
The same goes for converting broken full-range YUV material to correct TV-range YUV. This is even worse.
Keep in mind that we are only working with 8 bits per channel here...

The only workaround would be to do all these conversions on the GPU for each consumer. Only the screen consumer does not need any conversion (native progressive RGB from the mixer).
Decklink can do a valid conversion through the SDK or in hardware (needs interlaced RGB).
But the FFMPEG consumer may need a prefiltered combination depending on the user's requested output format.
At the moment our common interface is progressive RGBA to each consumer, and every consumer does the conversion on its own as required.

premultiply on 4 Mar 2018

@premultiply what about 16 bit output from the mixer? We want to do that in the future anyway... i.e. YUV444 (bt601, bt709, bt2020 depending on channel format) 16 bit. YUV444_16 => YUV422_8 should be relatively fast on CPU.

ronag on 4 Mar 2018

Hmm... of course 16 bit to 8 bit will require dithering...

ronag on 4 Mar 2018
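To illustrate why a 16-bit to 8-bit reduction calls for dithering, the sketch below compares plain truncation with a simple one-dimensional error diffusion on a single row of samples. It is a toy model with hypothetical function names, not the dithering that swscale or zscale implement.

```python
def truncate_16_to_8(row):
    """Drop the low byte of each 16-bit sample."""
    return [v >> 8 for v in row]


def dither_16_to_8(row):
    """Round each sample to 8 bits and carry the rounding error to the next one."""
    out, err = [], 0
    for v in row:
        target = v + err
        q = min(255, max(0, (target + 128) >> 8))   # round to nearest, clip to 8 bit
        out.append(q)
        # Clamp the carried error so it cannot grow without bound at the clip limits.
        err = max(-128, min(127, target - (q << 8)))
    return out


# A flat area whose true level lies halfway between two 8-bit codes (128.5).
row = [0x8080] * 8
print(truncate_16_to_8(row))   # every sample lands on the same code
print(dither_16_to_8(row))     # codes alternate so the average is preserved
```

Truncation shifts the whole area to one code and loses the half step, while the diffused version keeps the average level at the cost of a fine pattern. The carried error also means each output sample depends on its neighbours.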

The only clean solution would be to have multiple different outputs from mixer depending on consumer... if we are to do this without CPU involvement...

Easiest is probably if FFMPEG could take advantage of multi-core for these transformations.

ronag on 4 Mar 2018

Since we are not doing any scaling in this conversion I could probably implement a slice threaded color transform util based on sws scale.

ronag on 4 Mar 2018

Ok, I've implemented threaded color transform (https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2). HOWEVER, it will always convert to YUVA444P, BT709 and only work with channel heights divisible by 8. Which is far from optimal but should work well for most cases.

The most problematic case will be RGB(A) and/or full range recording... but that is very unusual.

ronag on 4 Mar 2018
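The general idea of a slice-threaded transform can be sketched as follows: split the frame into horizontal bands, convert each band independently, and stitch the results back together. This is a minimal illustration under stated assumptions and not the code from the commit above; the helper names are hypothetical, only luma is computed, and Python threads are used to show the structure rather than to gain speed.

```python
from concurrent.futures import ThreadPoolExecutor


def to_luma709_tv(r, g, b):
    """Full-range 8-bit RGB to limited-range BT.709 luma for one pixel."""
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return round(16 + 219 * y / 255)


def convert_rows(rows):
    """Convert a list of rows; each row is a list of (r, g, b) tuples."""
    return [[to_luma709_tv(*px) for px in row] for row in rows]


def convert_sliced(frame, slices=8):
    """Convert `frame` in `slices` horizontal bands processed in parallel."""
    height = len(frame)
    if height % slices:
        raise ValueError("height must be divisible by the slice count")
    step = height // slices
    bands = [frame[i:i + step] for i in range(0, height, step)]
    with ThreadPoolExecutor(max_workers=slices) as pool:
        parts = list(pool.map(convert_rows, bands))   # map keeps band order
    return [row for part in parts for row in part]


# Tiny 16-row gradient frame, 4 pixels wide.
frame = [[(y * 16, x * 60, 255 - y * 16) for x in range(4)] for y in range(16)]
print(convert_sliced(frame) == convert_rows(frame))
```

Because this conversion is purely per pixel, the sliced and unsliced results are identical. That would no longer be guaranteed for a transform that carries state across rows, such as error-diffusion dithering.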

@TomKaltz please verify. -filter:v interlace,format=yuv422p

ronag on 4 Mar 2018

Possible further optimizations:

Convert directly to YUV422P, as 4:4:4 formats are unusual for recorded files.
Maybe drop the alpha channel completely for the ffmpeg consumer if this further improves performance (maybe only together with the previous point, yuv422p), as recording with an alpha channel may be rare.
Make the color matrix dependent on channel mode / resolution.

premultiply on 4 Mar 2018

@premultiply: please create separate issue for those

ronag on 4 Mar 2018

@ronag and I iterated on this today and it's getting better but still very inefficient. In my testing it seems if we omit alpha and swscale to AV_PIX_FMT_YUV422P it helps slightly. The best performance I got was manually changing all occurrences of AV_PIX_FMT_YUVA422P to AV_PIX_FMT_YUV422P in ffmpeg_consumer.cpp after commit 0d721847b49d022f7db09f48e92d8732b0db19c8 and using the following command...

add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace -threads:v 4

My brand new quad-core 3.1 GHz MacBook Pro could barely keep up, but it did. I'm hoping the color transform can be moved to the GPU, because currently using the ffmpeg consumer to record broadcast formats is not very performant.

TomKaltz on 5 Mar 2018


Just to make sure: have you verified that it is not ffmpeg's dnxhd codec that kills the performance? Is it the same with x264?

premultiply on 5 Mar 2018

I definitely tested with ProRes, with the same bad results. I did a quick test with x264 defaults and it was slightly better but still not good.

TomKaltz on 5 Mar 2018

ProRes and dnxhd are the slowest codecs in ffmpeg that I know of. Maybe they are single threaded or something like that. I don't know.
Anyway... their performance is bad.

premultiply on 5 Mar 2018

@TomKaltz your computer is 2.3 GHz with 3.1 turbo :)... no? Personally, I don't think recording is something for a laptop.

x264 defaults are not meant for realtime recording. You should be using -preset:v veryfast.

One more optimization would be to do the color conversion and interlacing in the same step... but now we're moving things out of the ffmpeg filter => more complexity.

ronag on 5 Mar 2018

Interlacing outside of the user-defined filter would mean that high quality progressive recording is lost when the channel is set to an interlaced format. Might be ok for most recording applications, but replays will suffer. Mmmmh...

Anyway, I would prefer to switch to AV_PIX_FMT_YUV422P as @TomKaltz said above, as it would reduce the buffer sizes and the amount of data to move, and increase the performance for common use.

premultiply on 5 Mar 2018

We need to use an alpha based pixel format since some users record alpha. We should create a dummy filter graph and check the resolved pixel format and use that.

ronag on 5 Mar 2018

@premultiply: I'm unsure whether performing the transform in slices is actually valid given dithering etc... are you able to find out?

ronag on 5 Mar 2018

I can try to measure it if there is a build available.

premultiply on 5 Mar 2018

The auto build should be running.

ronag on 5 Mar 2018

I'm removing this from 2.2. There is not much more we can do without more effort.

ronag on 6 Mar 2018

From what I can see in a first test, the new transform is inaccurate and shows level shifts. But I have to do multipass tests.

premultiply on 6 Mar 2018

@premultiply I think the issue is more the possible seams between the (8) slices since it might be using dithering for the full => tv range transformation to distribute quantization errors. I'm unsure of the exact implementation and its impact. @5opr4ni maybe you know someone that can shed light on it?

@premultiply maybe you could investigate the implementation (sws flags) and if it is possible to enable/disable?

ronag on 6 Mar 2018

@ronag Can we try replacing swscale with the zscale filter? Maybe it has better performance and accuracy.

premultiply on 7 Mar 2018

-filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p
should do it if RGBA is passed to ffmpeg (again). The dither parameter may be added as well.
https://ffmpeg.org/ffmpeg-filters.html#zscale-1

premultiply on 7 Mar 2018

@premultiply we're not using a scale filter, we're doing the scale manually. We could go back to how it was before. But then we don't get a parallel scale filter.

ronag on 7 Mar 2018

Yes, I know. But it's also swscale. And I'd like to try doing it with zscale in ffmpeg, as before with scale and without the manual preconversion, to compare the performance.

premultiply on 7 Mar 2018

Run some benchmarks with vanilla ffmpeg. If there is any tangible advantage I’ll revert the parallel conversion.

ronag on 7 Mar 2018

@TomKaltz Can you try it again with a build before https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2 and -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p?

premultiply on 7 Mar 2018

I will report back...

TomKaltz on 7 Mar 2018

Compiled commit 54997aedb7ee2372f42b9a10bcfc0304fdb735c3 and ran with 1080i5994 channel.....

```
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p
[2018-03-07 13:50:19.180] [info] Received message from Console: add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p\r\n

202 ADD OK

[2018-03-07 13:50:19.181] [info] ffmpeg[record700.mxf] Initialized.
[2018-03-07 13:50:19.243] [error] [ffmpeg] code 3074: no path between colorspaces
[2018-03-07 13:50:19.243] [error]
[2018-03-07 13:50:19.256] [error] C:\UsersThomas\casparcg\src\modules\ffmpeg\consumer\ffmpeg_consumer.cpp(358): Throw in function void __cdecl caspar::ffmpeg::Stream::send(class caspar::core::const_frame,const struct caspar::core::video_format_desc &,class std::function)>)
[2018-03-07 13:50:19.256] [error] Dynamic exception type: class boost::exception_detail::clone_impl
[2018-03-07 13:50:19.256] [error] [struct boost::errinfo_api_function_ * __ptr64] = av_buffersink_get_frame
[2018-03-07 13:50:19.256] [error] [struct boost::errinfo_errno_ * __ptr64] = 542398533, "Unknown error"
[2018-03-07 13:50:19.256] [error] [struct caspar::tag_stacktrace_info * __ptr64] = 0# 0x00007FF6F994755E in casparcg
[2018-03-07 13:50:19.256] [error] 1# 0x00007FF6F9969AE0 in casparcg
[2018-03-07 13:50:19.256] [error] 2# 0x00007FF6F9A91A0A in casparcg
[2018-03-07 13:50:19.256] [error] 3# 0x00007FF6F9A8F0BB in casparcg
[2018-03-07 13:50:19.256] [error] 4# tbb::interface7::internal::task_arena_base::internal_current_slot in tbb
[2018-03-07 13:50:19.256] [error] 5# 0x00007FF6F9A80AE4 in casparcg
[2018-03-07 13:50:19.256] [error] 6# 0x00007FF6F9A8B4F7 in casparcg
[2018-03-07 13:50:19.256] [error] 7# 0x00007FF6F9A8CD50 in casparcg
[2018-03-07 13:50:19.256] [error] 8# 0x00007FF6F9943849 in casparcg
[2018-03-07 13:50:19.256] [error] 9# iswascii in ucrtbase
[2018-03-07 13:50:19.256] [error] 10# BaseThreadInitThunk in KERNEL32
[2018-03-07 13:50:19.256] [error] 11# RtlUserThreadStart in ntdll
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.315] [info] ffmpeg[record700.mxf] Uninitialized.
```

TomKaltz on 7 Mar 2018

@ronag the parallel conversion is significantly more performant. Is there any downside to having the transform output locked to yuva422p in this way?

TomKaltz on 8 Mar 2018

I found out that the current implementation is wrong, as it assumes the BGRA input from the mixer to be TV range and not full range. This gives wrong levels in the output.

The complete filter chain for builds before https://github.com/CasparCG/server/commit/0d721847b49d022f7db09f48e92d8732b0db19c8 is:
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p

premultiply on 8 Mar 2018

@TomKaltz Thanks for testing! The problem in your case is that dnxhd requires 10 bit which zscale does not support in this conversion path.
But this also hints at why dnxhd performance is low: swscale needs to upscale from 8 to 10 bit first before writing to the dnxhd encoder.

premultiply on 8 Mar 2018

Some sort of XDCAM HD422 flavor (wrong audio track configuration) should give much better performance:
ADD 1-777 FILE test.mxf -codec:v mpeg2video -codec:a pcm_s24le -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p -b:v 50M -maxrate:v 50M -bufsize:v 3835k -minrate:v 50M -profile:v 0 -level:v 2 -flags:v ilme+ildct

premultiply on 8 Mar 2018

Or using zscale:
ADD 1-777 FILE test.mxf -codec:v mpeg2video -codec:a pcm_s24le -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p -b:v 50M -maxrate:v 50M -bufsize:v 3835k -minrate:v 50M -profile:v 0 -level:v 2 -flags:v ilme+ildct

premultiply on 8 Mar 2018

The problem in your case is that dnxhd requires 10 bit

120M is 8 bit, 185M is 10 bits (https://en.wikipedia.org/wiki/List_of_Avid_DNxHD_resolutions)

ronag on 8 Mar 2018

The scaler claims that 10bit is required.

premultiply on 8 Mar 2018

@premultiply Where? How do I reproduce? DNxHD has 8 bit support so that's weird. Is it some other filter in the graph that requires it?

ronag on 8 Mar 2018

Could be a bug in ffmpeg pixel format resolution... hmm...

ronag on 8 Mar 2018

but yes, that could explain the performance issues Tom has been having

ronag on 8 Mar 2018

https://gist.github.com/ronag/f8dac4dae1faa01d7667aff68a747231

premultiply on 8 Mar 2018

120M is for 1080i50 not 1080p50, you need to add interlace in the filter graph for 120M to work.

ronag on 8 Mar 2018

Yes, sorry, you are right about the bitrate.

premultiply on 8 Mar 2018

@TomKaltz ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p works for me

premultiply on 8 Mar 2018

@premultiply old commit 1080i59 realtime?

ronag on 8 Mar 2018

@ronag Old commit but 1080i50 "only". Sorry, I haven't got such a crazy odd video source available. ;-)
But the scaler should not care about the framerate at all.

premultiply on 8 Mar 2018

i59 is heavier than 50i... my server handles 50i... but at 59i it chokes

ronag on 8 Mar 2018
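As a rough sanity check on how much heavier 1080i59.94 is than 1080i50, the pixel throughput of the two formats can be compared directly. This back-of-the-envelope sketch assumes the cost of the color transform scales with the number of pixels per second and ignores everything else in the pipeline.

```python
from fractions import Fraction

WIDTH, HEIGHT = 1920, 1080


def pixels_per_second(frame_rate):
    """Pixels that must be transformed per second at the given frame rate."""
    return WIDTH * HEIGHT * frame_rate


rate_5994i = Fraction(30000, 1001)   # 1080i59.94: 29.97 full frames per second
rate_50i = Fraction(25)              # 1080i50: 25 full frames per second

ratio = pixels_per_second(rate_5994i) / pixels_per_second(rate_50i)
print(f"1080i59.94 moves {float(ratio):.3f}x the pixels of 1080i50")
```

Under that assumption the 59.94 format needs roughly 20% more throughput, so a machine with little headroom at 1080i50 could plausibly fall behind at 1080i59.94.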

HP Z440 has only ~30% CPU load on 1080i50 with this ffmpeg_consumer running (~14% just passing decklink input to decklink output) but I'll try to get a "faster" video source tomorrow.

premultiply on 8 Mar 2018

I will test the zscale stuff, but with the following command on 0d721847b49d022f7db09f48e92d8732b0db19c8 I get very good performance and realtime encoding. The files seemed to have proper colorimetry.

add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,format=yuv422p -threads:v 4

TomKaltz on 8 Mar 2018

No, it's slightly off. You can reproduce this by feeding color bars to the decklink input and watching the server's decklink output on a video monitor (better still, add a waveform/vectorscope). If you record the channel and play back the recorded file with the ffmpeg producer, you can toggle between the signals. You should see a slight level shift because of the wrong scaler input flag.

premultiply on 8 Mar 2018

Blackmagic DeckLink cards use bmdFormat10BitYUV as their main pixel format, which allows copying via DMA. Maybe use it as the main format in the mixer?

drakmor on 9 Mar 2018

@drakmor This would require a complete redesign and rewriting of all producers, mixer and consumers. And the Decklink path is currently fine. There are only some glitches on the ffmpeg_consumer left.

premultiply on 9 Mar 2018

Running at commit 80dec5a1e18d797bf94725461a36e91b49ed9530 1080i5994 channel

Between the commands...
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p -threads:v 4

AND

add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,format=yuv422p -threads:v 4

I get similar performance between the two commands and can encode in realtime with both. The colorimetry seems to be correct in the test.mxf file encoded with zscale but not with the hard-coded scaler.

I'm a little confused because in this commit aren't we forcing the usage of swscale for the color transform?

TomKaltz on 10 Mar 2018

Yes, you have to use a build without the internal prescaler to compare the performance between scale (swscale) and zscale. Please try with 41cf1abd2849e4d0031ab02d45620652f28efdbc.
Download: http://casparcg.com/builds/CasparCG%20Server/master/casparcg-server-41cf1abd2849e4d0031ab02d45620652f28efdbc-windows.zip

zscale:
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p -threads:v 4

swscale:
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p -threads:v 4

You can even try moving the interlace filter to a different position among the other parts of the filter graph in the commands above. Maybe it also has a small impact on the graph's performance. However, I think it should have the best performance in its current position.

premultiply on 10 Mar 2018

Ok with commit https://github.com/CasparCG/server/commit/41cf1abd2849e4d0031ab02d45620652f28efdbc and the commands in the previous comment it seems that swscale is very slightly more performant but I cannot get these commands to run in realtime for very long. Robert's parallel transform makes it realtime but then the colors are still slightly off. Any ideas?

TomKaltz on 14 Mar 2018

@premultiply do you have any idea why the hard-coded color transform is still wrong?

TomKaltz on 17 Mar 2018

Yes, because the filter does not care about the full range property of the input BGRA frames.

premultiply on 17 Mar 2018

Is this an swscale bug or is there anything else we can do to twist the filter into doing what we want?

TomKaltz on 17 Mar 2018

I don't know. I can only hint at what seems to be the problem here.

https://trac.ffmpeg.org/wiki/colorspace

premultiply on 19 Mar 2018
