This is a sort of a continuation of #883
Compiled 1fb0d9348d424a008d1e2ee97539aac15a1e0f1f myself.
Used command....
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=out_range=tv:out_color_matrix=bt709,format=yuv422p
Input buffer fills linearly from the start of the consumer and saturates and never drains. Plenty of resources left in this brand new macbook pro.
Also please note that the stock 2.0.7 release records this file without dropped frames with following command...
ADD 1-7 FILE record7.mxf -vcodec dnxhd
Yea, but 2.0.7 doesn't output proper colors.
The bottleneck here seems to be the extra scale filter we add to correct the color range (and the fact that we don't run the encoder and filter in parallel).
There is no way around this issue. The scale=out_range=tv:out_color_matrix=bt709 filter doesn't run in real time with color transforms unless you use a CPU that is powerful enough in terms of SIMD, frequency and IPC. More cores do not help.
Is there any way we could do this at the card level (Decklink/Bluefish)?
I'm not sure why it's so slow.
Resolving this requires one of the following:
@5opr4ni no card is involved here, other than the gpu
I am wondering why this scale filter should be/is so slow on this position.
I have to check this with standalone ffmpeg next week. Maybe some additional flags provide more performance...
I had no issues with encoding high bitrate long gop h264 4:2:2 on HP Z440 workstation.
Sorry! missed the consumer, thought about the producer.
@5opr4ni on input it's not a problem as far as I can tell
I wish the scale filter had slice threading...
It's one of ffmpeg's core components... Can't imagine a bad performance here...
@premultiply: it has fast and slow paths, they don't optimize for every possible use case
Take a 59p video and see if you can build a corresponding command string that runs in realtime in standalone ffmpeg.
Yes. And have to check if it does scale twice for any reason.
One other alternative is to make the GPU mixer always output TV range RGB. That way we might not need the extra color transform (except for the screen consumer which could downgrade to experimental until we fix it). @premultiply? This would probably result in less accurate RGB=>YUV transforms though.
I'm not sure exactly how the FFMPEG default RGB->YUV conversion works. But I'm guessing it doesn't apply any color range calculations.
Another alternative is to output YUV444 (TV) from the mixer. Hmmm.... I think I like that best of all alternatives.
Will however require an extra GPU pass and also some work in the decklink and especially screen consumer. Would make ffmpeg and decklink consumers faster.
No, it is not that easy. When you convert to YUV (the chroma subsampling does not matter here) the result is always TV range, BUT you have to decide on the COLOR MATRIX before you do the conversion from RGB. This would only be possible if the color matrix is locked to the channel mode and other input and output modes (resolutions) are not allowed. I do not think we want that.
Converting the color matrix afterwards may be lossy in some color grades. Avoid that where possible (it has to be done for SD-HD-UHD conversion in one step).
It's the same for converting broken YUV full-range material to correct YUV TV range. This is even worse.
Keep in mind that we are doing only 8 bit per channel here...
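To make the point about choosing the matrix before the conversion concrete, here is a minimal sketch of a full-range 8-bit RGB to TV-range 8-bit YCbCr conversion. It uses the standard BT.601 and BT.709 luma coefficients; the function name and structure are made up for illustration and are not how swscale or the server implement it.

```python
# Luma coefficients (Kr, Kb) from the BT.601 and BT.709 recommendations.
COEFFS = {"bt601": (0.299, 0.114), "bt709": (0.2126, 0.0722)}


def rgb_full_to_ycbcr_tv(r, g, b, matrix="bt709"):
    """Convert one 8-bit full-range RGB pixel to 8-bit TV-range YCbCr.

    Hypothetical helper for illustration only.
    """
    kr, kb = COEFFS[matrix]
    kg = 1.0 - kr - kb
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    y = kr * rn + kg * gn + kb * bn        # normalized luma, 0..1
    cb = (bn - y) / (2.0 * (1.0 - kb))     # normalized chroma, -0.5..0.5
    cr = (rn - y) / (2.0 * (1.0 - kr))
    # TV range: luma occupies 16..235, chroma 16..240 centered on 128.
    return (round(16 + 219 * y), round(128 + 224 * cb), round(128 + 224 * cr))


# The same RGB green lands on different YCbCr codes depending on the matrix.
print(rgb_full_to_ycbcr_tv(0, 255, 0, "bt709"))
print(rgb_full_to_ycbcr_tv(0, 255, 0, "bt601"))
```

Because the two matrices give different codes for the same RGB value, fixing the matrix afterwards means a second rounding step at 8 bits, which is where the loss comes from.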
The only workaround would be to do all these conversions on GPU for each consumer. Only screen consumer does not need any conversion (native progressive RGB from mixer).
Decklink can do valid conversion through SDK or hardware (needs interlaced RGB).
But FFMPEG consumer may need prefiltered combination on users requested output format.
At the moment our good common interface is progressive RGBA to each consumer, and every consumer does the conversion on its own as required.
@premultiply what about 16 bit output from mixer? We want to do that in the future anyway... i.e. YUV444 (bt601, bt709, bt2020 depending on channel format) 16 bit. YUV444_16 => YUV422_8 should be relatively fast on cpu.
Hmm... of course 16 bit to 8 bit will require dithering...
The only clean solution would be to have multiple different outputs from mixer depending on consumer... if we are to do this without CPU involvement...
Easiest is probably if FFMPEG could take advantage of multi-core for these transformations.
Since we are not doing any scaling in this conversion I could probably implement a slice threaded color transform util based on sws scale.
Ok, I've implemented threaded color transform (https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2). HOWEVER, it will always convert to YUVA444P, BT709 and only work with channel heights divisible by 8. Which is far from optimal but should work well for most cases.
The most problematic case will be RGB(A) and/or full range recording... but that is very unusual.
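For anyone following along, a rough sketch of what "slice threaded" means here: the frame is cut into 8 horizontal bands and each band is converted independently. This is a toy in Python (luma only, hypothetical function names), not the code from the commit, and Python threads will not show the real speedup; it only illustrates the structure and the divisible-by-8 restriction.

```python
from concurrent.futures import ThreadPoolExecutor

SLICES = 8  # mirrors the "height divisible by 8" restriction


def luma_tv(pixel):
    """Full-range 8-bit RGB -> TV-range 8-bit BT.709 luma, no dithering."""
    r, g, b = pixel
    y = (0.2126 * r + 0.7152 * g + 0.0722 * b) / 255.0
    return round(16 + 219 * y)


def convert_rows(rows):
    """Convert a list of rows; each pixel is handled independently."""
    return [[luma_tv(p) for p in row] for row in rows]


def convert_sliced(frame, slices=SLICES):
    """Split the frame into horizontal slices and convert them in a pool."""
    height = len(frame)
    if height % slices != 0:
        raise ValueError("frame height must be divisible by the slice count")
    step = height // slices
    chunks = [frame[i:i + step] for i in range(0, height, step)]
    with ThreadPoolExecutor(max_workers=slices) as pool:
        parts = list(pool.map(convert_rows, chunks))  # map preserves order
    return [row for part in parts for row in part]


def make_test_frame(width=16, height=16):
    """Small synthetic RGB frame, purely for demonstration."""
    return [[((x * 16) % 256, (y * 16) % 256, (x * y) % 256)
             for x in range(width)] for y in range(height)]


frame = make_test_frame()
print(convert_sliced(frame) == convert_rows(frame))
```

As long as the per-pixel transform has no state shared between rows, the sliced result matches the single-pass result exactly.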
@TomKaltz please verify. -filter:v interlace,format=yuv422p
Possible further optimizations:
@premultiply: please create separate issue for those
@ronag and I iterated on this today and it's getting better but still very inefficient. In my testing it seems if we omit alpha and swscale to AV_PIX_FMT_YUV422P it helps slightly. The best performance I got was manually changing all occurrences of AV_PIX_FMT_YUVA422P to AV_PIX_FMT_YUV422P in ffmpeg_consumer.cpp after commit 0d721847b49d022f7db09f48e92d8732b0db19c8 and using the following command...
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace -threads:v 4
My brand new quad-core 3.1ghz macbook pro could barely keep up but it did. I'm hoping the color transform can be moved to GPU because currently using ffmpeg consumer to record broadcast formats is not very performant.
Just to make sure: have you verified that it is not the dnxhd codec from ffmpeg that kills the performance? Same with x264?
I definitely tested with ProRes with same bad results. I did a quick test with x264 defaults and it was slightly better but still not good.
ProRes and dnxhd are the slowest codecs in ffmpeg that I know of. Maybe they are single-threaded or something like that. I don't know.
Anyway... their performance is bad.
@TomKaltz your computer is 2.3 GHz with 3.1 turbo :)... no? Personally, I don't think recording is something for a laptop.
x264 defaults is not for realtime recording. You should be using -preset:v veryfast.
One more optimization would be to do the color conversion and interlacing in the same step... but now we're moving things out of the ffmpeg filter => more complexity.
Interlacing outside of user defined filter would mean that high quality progressive recording is lost when channel is set to interlaced format. Might be ok for most recording applications but replays will suffer. Mmmmh...
Anyway, I would prefer to switch to AV_PIX_FMT_YUV422P as @TomKaltz said above, as it would reduce the buffer sizes and the amount of data to move, and increase performance for common use.
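A quick back-of-the-envelope for the buffer sizes, assuming 8 bits per sample and a 1920x1080 frame (the helper name is made up for illustration):

```python
WIDTH, HEIGHT = 1920, 1080

# Average bytes per pixel for 8-bit formats.
# Planar 4:2:2 stores each chroma plane at half horizontal resolution.
BYTES_PER_PIXEL = {
    "bgra": 4.0,      # B, G, R, A
    "yuva422p": 3.0,  # Y + Cb/2 + Cr/2 + A
    "yuv422p": 2.0,   # Y + Cb/2 + Cr/2
}


def frame_bytes(pix_fmt, width=WIDTH, height=HEIGHT):
    """Size of one frame in bytes for the given pixel format."""
    return int(width * height * BYTES_PER_PIXEL[pix_fmt])


for fmt in BYTES_PER_PIXEL:
    print(fmt, frame_bytes(fmt))
```

So dropping the alpha plane cuts the data handed to the encoder by a third compared to yuva422p, and by half compared to the BGRA frames from the mixer.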
We need to use an alpha based pixel format since some users record alpha. We should create a dummy filter graph and check the resolved pixel format and use that.
@premultiply: I'm unsure whether performing the transform in slices is actually valid given dithering etc... are you able to find out?
I can try to measure it if there is a build available.
The auto build should be running.
I'm removing this from 2.2. There is not much more we can do without more effort.
From what I can see from a first test, the new transform is inaccurate, with level shifts. But I have to do multipass tests.
@premultiply I think the issue is more the possible seams between the (8) slices since it might be using dithering for the full => tv range transformation to distribute quantization errors. I'm unsure of the exact implementation and its impact. @5opr4ni maybe you know someone that can shed light on it?
@premultiply maybe you could investigate the implementation (sws flags) and if it is possible to enable/disable?
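To illustrate why dithering would matter for slicing, here is a toy sketch. It assumes a dither that carries the rounding error from one row to the next (error diffusion). Whether swscale actually does that on this conversion path is not established in this thread; the functions below are hypothetical and only show the mechanism.

```python
def quantize_round(column):
    """16-bit -> 8-bit by plain rounding; each sample is independent."""
    return [min(255, (v + 128) >> 8) for v in column]


def quantize_error_diffusion(column):
    """16-bit -> 8-bit, carrying the rounding error down to the next row."""
    out, err = [], 0
    for v in column:
        t = v + err
        q = max(0, min(255, (t + 128) >> 8))
        err = t - (q << 8)  # residual pushed onto the next sample
        out.append(q)
    return out


def in_slices(fn, column, slices):
    """Apply fn to each slice separately; every slice starts with no history."""
    step = len(column) // slices
    out = []
    for i in range(0, len(column), step):
        out.extend(fn(column[i:i + step]))
    return out


# One pixel column with a flat 16-bit level between 8-bit codes 100 and 101.
column = [25677] * 8

print(quantize_round(column) == in_slices(quantize_round, column, 2))
print(quantize_error_diffusion(column))
print(in_slices(quantize_error_diffusion, column, 2))
```

With plain rounding the sliced and unsliced results are identical. With the error-carrying dither the second slice restarts its pattern, so the output differs from the single-pass result from the slice boundary onward. A position-based (ordered) dither would not have this problem, since it does not depend on neighbouring rows.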
@ronag Can we try to replace swscale by zscale filter? Maybe it has better performance and accuracy.
-filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p
should do it if RGBA is passed to ffmpeg (again). The _dither_ parameter may be added as well.
https://ffmpeg.org/ffmpeg-filters.html#zscale-1
@premultiply we're not using a scale filter, we're doing the scale manually. We could go back to how it was before. But then we don't get a parallel scale filter.
Yes, I know. But it's also swscale. And I'd like to try doing it by zscale with ffmpeg as before with scale and no manual preconversion to compare the performance.
Run some benchmarks with vanilla ffmpeg. If there is any tangible advantage I'll revert the parallel conversion.
@TomKaltz Can you try it again with a build before https://github.com/CasparCG/server/commit/7b94bc6544b620583263bee411f88be99ab6eda2 and -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p?
I will report back...
Compiled commit 54997aedb7ee2372f42b9a10bcfc0304fdb735c3 and ran with 1080i5994 channel.....
```add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p
[2018-03-07 13:50:19.180] [info] Received message from Console: add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:primaries=709:transfer=709:matrix=709:range=limited,format=yuv422p\r\n
[2018-03-07 13:50:19.181] [info] ffmpeg[record700.mxf] Initialized.
[2018-03-07 13:50:19.243] [error] [ffmpeg] code 3074: no path between colorspaces
[2018-03-07 13:50:19.243] [error]
[2018-03-07 13:50:19.256] [error] C:\UsersThomas\casparcg\src\modules\ffmpeg\consumer\ffmpeg_consumer.cpp(358): Throw in function void __cdecl caspar::ffmpeg::Stream::send(class caspar::core::const_frame,const struct caspar::core::video_format_desc &,class std::function
[2018-03-07 13:50:19.256] [error] Dynamic exception type: class boost::exception_detail::clone_impl
[2018-03-07 13:50:19.256] [error] [struct boost::errinfo_api_function_ * __ptr64] = av_buffersink_get_frame
[2018-03-07 13:50:19.256] [error] [struct boost::errinfo_errno_ * __ptr64] = 542398533, "Unknown error"
[2018-03-07 13:50:19.256] [error] [struct caspar::tag_stacktrace_info * __ptr64] = 0# 0x00007FF6F994755E in casparcg
[2018-03-07 13:50:19.256] [error] 1# 0x00007FF6F9969AE0 in casparcg
[2018-03-07 13:50:19.256] [error] 2# 0x00007FF6F9A91A0A in casparcg
[2018-03-07 13:50:19.256] [error] 3# 0x00007FF6F9A8F0BB in casparcg
[2018-03-07 13:50:19.256] [error] 4# tbb::interface7::internal::task_arena_base::internal_current_slot in tbb
[2018-03-07 13:50:19.256] [error] 5# 0x00007FF6F9A80AE4 in casparcg
[2018-03-07 13:50:19.256] [error] 6# 0x00007FF6F9A8B4F7 in casparcg
[2018-03-07 13:50:19.256] [error] 7# 0x00007FF6F9A8CD50 in casparcg
[2018-03-07 13:50:19.256] [error] 8# 0x00007FF6F9943849 in casparcg
[2018-03-07 13:50:19.256] [error] 9# iswascii in ucrtbase
[2018-03-07 13:50:19.256] [error] 10# BaseThreadInitThunk in KERNEL32
[2018-03-07 13:50:19.256] [error] 11# RtlUserThreadStart in ntdll
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.256] [error]
[2018-03-07 13:50:19.315] [info] ffmpeg[record700.mxf] Uninitialized.
```
@ronag the parallel conversion is significantly more performant. Is there any downside to having the transform output locked to yuva422p in this way?
Found out that the current implementation is wrong: it assumes the BGRA input from the mixer is TV range rather than full range. This gives wrong levels in the output.
Complete filter chain for pre https://github.com/CasparCG/server/commit/0d721847b49d022f7db09f48e92d8732b0db19c8 is
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p
@TomKaltz Thanks for testing! The problem in your case is that dnxhd requires 10 bit which zscale does not support in this conversion path.
But this also hints at why dnxhd performance is low: swscale needs to upscale from 8 to 10 bit before writing to the dnxhd encoder.
Some sort of XDCAM HD422 flavor (wrong audio track configuration) should give much better performance:
ADD 1-777 FILE test.mxf -codec:v mpeg2video -codec:a pcm_s24le -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p -b:v 50M -maxrate:v 50M -bufsize:v 3835k -minrate:v 50M -profile:v 0 -level:v 2 -flags:v ilme+ildct
Or using zscale:
ADD 1-777 FILE test.mxf -codec:v mpeg2video -codec:a pcm_s24le -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p -b:v 50M -maxrate:v 50M -bufsize:v 3835k -minrate:v 50M -profile:v 0 -level:v 2 -flags:v ilme+ildct
The problem in your case is that dnxhd requires 10 bit
120M is 8 bit, 185M is 10 bits (https://en.wikipedia.org/wiki/List_of_Avid_DNxHD_resolutions)
The scaler claims that 10bit is required.
@premultiply Where? How do I reproduce? DNxHD has 8 bit support so that's weird. Is it some other filter in the graph that requires it?
Could be a bug in ffmpeg pixel format resolution... hmm...
but yes, that could explain the performance issues Tom has been having
120M is for 1080i50 not 1080p50, you need to add interlace in the filter graph for 120M to work.
Yes, sorry, you are right about the bitrate.
@TomKaltz ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p works for me
@premultiply old commit 1080i59 realtime?
@ronag Old commit but 1080i50 "only". Sorry, I don't have such a crazy odd video source available. ;-)
But the scaler should not care about the framerate at all.
i59 is heavier than 50i... my server handles 50i... but at 59i it chokes
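For reference, the raw pixel throughput difference between the two formats, assuming 1920x1080 frames at 25 and 30000/1001 frames per second:

```python
from fractions import Fraction

WIDTH, HEIGHT = 1920, 1080


def pixels_per_second(frame_rate):
    """Pixels the scaler has to touch per second at the given frame rate."""
    return WIDTH * HEIGHT * frame_rate


rate_50i = pixels_per_second(Fraction(25))             # 1080i50: 25 frames/s
rate_5994i = pixels_per_second(Fraction(30000, 1001))  # 1080i59.94: ~29.97 frames/s
ratio = rate_5994i / rate_50i

print(float(rate_50i), float(rate_5994i), float(ratio))
```

That is roughly 20% more pixels per second at 59.94i, so the scaler itself does not care about the frame rate, but a machine that is close to its limit at 50i can fall over at 59.94i.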
HP Z440 has only ~30% CPU load on 1080i50 with this ffmpeg_consumer running (~14% just passing decklink input to decklink output) but I'll try to get a "faster" video source tomorrow.
I will test the zscale stuff, but with the following command on 0d721847b49d022f7db09f48e92d8732b0db19c8 I get very good performance and realtime encoding. The files seemed to have proper colorimetry.
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,format=yuv422p -threads:v 4
No it's slightly off. You can reproduce by feeding a colorbar to decklink in and watch server decklink output on video monitor (better add a waveform/vectorscope). If you record the channel and playback the recorded file on ffmpeg producer you can toggle between the signals. You should see a slight level shift because of the wrong scaler input flag.
Blackmagic Decklink cards use bmdFormat10BitYUV as their main pixel format, which allows frames to be copied by DMA. Maybe use it as the main format in the mixer?
@drakmor This would require a complete redesign and rewriting of all producers, mixer and consumers. And the Decklink path is currently fine. There are only some glitches on the ffmpeg_consumer left.
Running at commit 80dec5a1e18d797bf94725461a36e91b49ed9530 1080i5994 channel
Between the commands...
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p -threads:v 4
AND
add 1-700 file record700.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,format=yuv422p -threads:v 4
I get similar performance between the two commands and can encode in realtime with both. The colorimetry seems to be correct in the test.mxf file encoded with zscale but not with the hard-coded scaler.
I'm a little confused because in this commit aren't we forcing the usage of swscale for the color transform?
Yes, you have to use a build without the internal prescaler to compare the performance between scale (swscale) and zscale. Please try with 41cf1abd2849e4d0031ab02d45620652f28efdbc.
Download: http://casparcg.com/builds/CasparCG%20Server/master/casparcg-server-41cf1abd2849e4d0031ab02d45620652f28efdbc-windows.zip
zscale:
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,zscale=rangein=full:range=limited:primaries=709:transfer=709:matrix=709,format=yuv422p -threads:v 4
swscale:
ADD 1-700 FILE test.mxf -codec:v dnxhd -b:v 120M -codec:a pcm_s16le -flags:v +ilme+ildct -filter:v interlace,scale=in_range=full:out_range=tv:out_color_matrix=bt709:flags=print_info,format=yuv422p -threads:v 4
You can even try moving the interlace filter to other positions in the filter graph in the commands above. It may also have a small impact on the graph's performance. However, I think it should perform best in its current position.
Ok with commit https://github.com/CasparCG/server/commit/41cf1abd2849e4d0031ab02d45620652f28efdbc and the commands in the previous comment it seems that swscale is very slightly more performant but I cannot get these commands to run in realtime for very long. Robert's parallel transform makes it realtime but then the colors are still slightly off. Any ideas?
@premultiply do you have any idea why the hard-coded color transform is still wrong?
Yes, because the filter does not care about the full range property of the input BGRA frames.
Is this an swscale bug or is there anything else we can do to twist the filter into doing what we want?
I don't know. I can only hint at what seems to be the problem here.