play 1-0 decklink device 1 format 1080i5000
play 2-0 decklink device 2 format 1080i5000
play 3-0 decklink device 3 format 1080i5000
play 4-0 decklink device 4 format 1080i5000
play 1-1 decklink device 5 format 1080i5000
play 2-1 decklink device 6 format 1080i5000
play 3-1 decklink device 7 format 1080i5000
play 4-1 decklink device 8 format 1080i5000
2.2.0 beta 1
Idle at 0% cpu and 0% gpu
1 input 7% cpu and 34% gpu 3,25gb ram
2 inputs 17% cpu and 36% gpu 3,53gb ram
3 inputs 33% cpu and 56% gpu 3,72gb ram
4 inputs 81% cpu and 93% gpu 4,06gb ram. Crashes after a few seconds.
2.1.0 beta 2
Idles at 3% cpu and 0% gpu 3,60gb ram
1 input 3% cpu and 30% gpu 3,72gb ram
2 inputs 4% cpu and 16% gpu 3,84gb ram
3 inputs 5% cpu and 24% gpu 3,98gb ram
4 inputs 5% cpu and 32% gpu 4,15gb ram
5 inputs 6% cpu and 34% gpu 4,42gb ram
6 inputs 7% cpu and 35% gpu 4,70gb ram
7 inputs 9% cpu and 40% gpu 4,98gb ram
8 inputs 9% cpu and 40% gpu 5,24gb ram
2.2 runs in progressive so it basically has 2x GPU usage as well as extra cpu overhead for deinterlacing. This is expected.
Fixing this would require re-introducing native interlaced support, which I don't see happening.
How much memory does that machine have? It seems to crash due to out of memory. The memory usage is a bit interesting. Do you have a log + DIAG screenshot around the time it crashes?
16gb. It crashes with this:
[2018-03-11 18:14:01.209] [fatal] #######################
[2018-03-11 18:14:01.209] [fatal] UNHANDLED EXCEPTION:
[2018-03-11 18:14:01.209] [fatal] Adress:000007FEFD80A06D
[2018-03-11 18:14:01.209] [fatal] Code:3221225477
[2018-03-11 18:14:01.209] [fatal] Flag:0
[2018-03-11 18:14:01.209] [fatal] Info:000000005020C470
[2018-03-11 18:14:01.209] [fatal] Continuing execution.
[2018-03-11 18:14:01.209] [fatal] #######################
Does that happen when a specific number of inputs or memory usage is reached?
Yes. Look at the jump between 3rd and 4th input added. It's massive. Then crashes. Reproducible.
Are you getting drop-frame warnings in DIAG before it happens?
I have run it through the vs cpu profiler and with 4 1080i50 inputs, it is spending 50% of the cpu-time in decklink_producer.VideoInputFrameArrived
30% in calling av_buffersink_get_frame_flags
10% in calling av_buffersrc_write_frame
12% in ffmpeg::make_frame
It doesn't quite max out my cpu with that, and hits 1.6gb of ram in use. That is with one channel with a decklink consumer.
This is probably because the decklink sdk is calling VideoInputFrameArrived faster than we are able to process the frames, so we start buffering up input frames. We should probably move the graph to a separate thread.
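One way to picture the separate-thread idea is a small bounded queue between the capture callback and a worker thread that runs the filter graph. The sketch below is not CasparCG code: bounded_frame_queue and its members are hypothetical names, and dropping the oldest frame when full is an assumed policy chosen so that the callback never waits on processing and memory cannot grow without limit.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical bounded hand-off queue (illustrative only, not CasparCG code).
template <typename Frame>
class bounded_frame_queue
{
  public:
    explicit bounded_frame_queue(std::size_t capacity)
        : capacity_(capacity)
    {
        assert(capacity_ > 0);
    }

    // Called from the capture callback. Never waits for processing: if the
    // worker has fallen behind, the oldest queued frame is discarded.
    // Returns true if a frame had to be dropped to make room.
    bool push(Frame frame)
    {
        bool dropped = false;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (closed_)
                return false; // frames arriving after close() are ignored
            if (frames_.size() == capacity_) {
                frames_.pop_front();
                dropped = true;
            }
            frames_.push_back(std::move(frame));
        }
        cond_.notify_one();
        return dropped;
    }

    // Called from the worker thread. Blocks until a frame is available or
    // the queue is closed; returns std::nullopt once closed and drained.
    std::optional<Frame> pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return closed_ || !frames_.empty(); });
        if (frames_.empty())
            return std::nullopt;
        Frame frame = std::move(frames_.front());
        frames_.pop_front();
        return frame;
    }

    // Wakes the worker so that it can exit once the queue is drained.
    void close()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cond_.notify_all();
    }

  private:
    const std::size_t       capacity_;
    std::mutex              mutex_;
    std::condition_variable cond_;
    std::deque<Frame>       frames_;
    bool                    closed_ = false;
};
```

A worker thread would then loop on pop() and run the graph for each frame it receives, while the callback only calls push() and could log a dropped-frame warning whenever it returns true.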
@ronag I ran without DIAG since it adds 20-25% cpu usage from idle, and I wanted to hold that out of the measurements.
Yeah, it looks to me like we are pulling frames from the graph immediately, and doing a while(true) with no sleeps until a frame is ready. I am assuming av_buffersink_get_frame_flags doesn't block there.
A separate thread sounds like a good idea there
There isn't much to do here... we could possibly avoid the extra overhead in make_frame... and move the processing to a separate thread (to avoid the crash) but that won't actually help with proper playback.
You need a beefier machine or run progressive :).
Maybe we could make the deinterlace filter configurable to allow a faster version.
A separate thread sounds like a good idea there
Yea, it would be appropriate.
Ah, we should probably use AVBufferRef for the frames sent to av_buffersrc_write_frame to avoid the extra memcpy.
The av_buffersrc_write_frame and make_frame overhead should be possible to remove.
Performance has not been a focus for 2.2
The primary goal for this version has been code cleanup, improved performance and improved stability/reliability.
http://casparcg.com/forum/viewtopic.php?f=13&p=30575&sid=a20a315112b8756c9d58d3ca212a501e#p30575
Well, I would disagree with the performance part of that :)
After a bit of a debug, the gpu usage is from 20-25% gpu per active channel, which is maybe to be expected but seems a bit high when there is no compositing involved (on a GTX1070).
The crash was an easily solvable null exception. I'll raise a PR in a minute for that.
The non-linear cpu increase is a more interesting one. I had a look at the diag window (didn't manage to get a screenshot of it) and the frame-time for each producer climbed each time I added another, so it quickly crossed the threshold for being realtime.
If we fall behind far enough, the decklink will send us a null frame (but with audio) and that is the cause of the crash.
So the problem here is the graph taking too long and blocking the decklink callback. A thread could help here. I didn't get to a proper conclusion on the cpu jump between inputs 3 and 4, but I suspect it will be resolved by making the decklink producer use a thread.
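As an illustration of the null-frame case described above, the callback can check the video pointer before touching it and treat an audio-only arrival as a dropped frame. This is a minimal sketch and not the actual fix: video_frame, audio_packet, capture_stats and on_frame_arrived are hypothetical stand-ins for the DeckLink SDK types and the producer's real callback.

```cpp
// Hypothetical stand-ins for the SDK's video frame and audio packet types.
struct video_frame
{
    int width;
    int height;
};

struct audio_packet
{
    int sample_count;
};

// Hypothetical counters, e.g. for reporting dropped frames in DIAG.
struct capture_stats
{
    int processed     = 0;
    int missing_video = 0;
};

// Returns true if the video frame was processed. When the device has fallen
// behind it may deliver audio without video, so the video pointer is checked
// before it is dereferenced.
bool on_frame_arrived(const video_frame* video, const audio_packet* audio, capture_stats& stats)
{
    if (video == nullptr) {
        ++stats.missing_video; // count it as a dropped frame instead of crashing
        return false;
    }

    // Audio may be absent as well; only touch it when present.
    const int samples = (audio != nullptr) ? audio->sample_count : 0;
    (void)samples; // a real producer would forward video and audio here

    ++stats.processed;
    return true;
}
```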
@ronag Well, I would disagree with the performance part of that :)
I'm changing the texts now. It sounded really neat though, but you're totally right! I apologise for the confusion. 😄
I've had more of a look into this and have found the same cpu impact when playing 1080i50 video files too.
It turns out my assumption about av_buffersink_get_frame_flags was wrong: it is what runs the graph, and so it does block. That cpu time is essentially being spent deinterlacing the input video, so I guess it is an expected performance loss, but it affecting video playback makes it much more likely to impact a wider set of users.
I haven't narrowed down the cause of the cpu usage jump between 3 and 4 inputs, as it doesn't happen on my machine, but I know the cpu in question is 4core+ht so I suspect there just aren't enough cores for it to run all the deinterlacing at once
For reference with 4 decklink inputs running 1080i50:
deinterlacing disabled: 8% cpu
deinterlacing enabled: 50% cpu
Playing 4 1080i50 mjpeg videos (from blackmagic media express) uses 60% cpu
I guess the best fix would be to port bwdif to opengl, or alternatively to somehow allow interlaced frames through the progressive pipeline. The second option would need some kind of frame_index passed to frame_producer::receive so that the producer can sync in the interlaced frames on even indices so they come out properly from the mixer (it's hacky).
Porting bwdif to OpenGL should actually be easier than it sounds. I did it from yadif a few years ago without too much hassle. I just can't find the code anymore. Though you might also need a more powerful GPU. But that's probably easier than CPU.
You could also try and run bwdif in non temporal mode which should make it a bit faster.
Disable graph command codes when DIAG command is not in use
Easy way: send the existing graph code as std::function to a helper graph function which will either flush or execute it depending on a std::atomic<bool> graph_disable flag.
Ideal but hard way: send the graph code as async packaged tasks. Could be overwhelming due to the rapid task creation.
General suggestion: I'm quite surprised to find that the casparcg code is not using some kind of thread pool. Most multi-threading code, including the executor in the server code, uses std::thread, which is way slower than a normal function launch.
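The "easy way" above could look roughly like the following sketch. diag_gate, set_enabled and run are hypothetical names, not part of CasparCG, and the sketch takes the callable as a template parameter rather than std::function (an assumption made here to avoid a possible allocation per call; std::function would work the same way).

```cpp
#include <atomic>
#include <utility>

// Hypothetical gate around graph updates (illustrative only).
class diag_gate
{
  public:
    // Would be toggled when the DIAG window is opened or closed.
    void set_enabled(bool enabled) { enabled_.store(enabled, std::memory_order_relaxed); }

    bool enabled() const { return enabled_.load(std::memory_order_relaxed); }

    // Runs the update only while the gate is enabled and returns whether it
    // ran. When disabled, the cost is one atomic load and a branch.
    template <typename F>
    bool run(F&& update)
    {
        if (!enabled())
            return false;
        std::forward<F>(update)();
        return true;
    }

  private:
    std::atomic<bool> enabled_{false};
};
```

Only the work inside the callable is skipped; any timing measurement taken before the call is still paid for by the caller.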
Can we expect a PR from you @seccpur ?
@seccpur
While the graph code is using cpu even when not being displayed, I'm not sure if it can be improved much.
If not being rendered, all the graph type does is track the latest values in a map, and doing that via std::async could well introduce more overhead. As most of these values are time measurements or buffer size checks, the computation of those can't be moved to another thread either.
A lot of the executor code likely dates back to 2.0, and so was written before std::async existed, so some of it could definitely be considered for change. However, not everything wants to be run on a thread pool. It is useful to have threads for different parts, as it is more predictable. And some of them are also acting as locks. For example, the executor in decklink_consumer_proxy acts as a lock for the decklink_consumer it wraps. If this was done with std::async and a lock, then we risk having multiple threads waiting on the lock.
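To illustrate how a single-threaded executor can double as a lock, here is a minimal sketch. serial_executor and invoke are hypothetical names and this is not CasparCG's executor: every task runs on one dedicated thread in submission order, so the object those tasks touch needs no mutex of its own and callers never pile up waiting on one.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-thread executor (illustrative only).
class serial_executor
{
  public:
    serial_executor()
        : thread_([this] { run(); })
    {
    }

    serial_executor(const serial_executor&)            = delete;
    serial_executor& operator=(const serial_executor&) = delete;

    // Finishes all queued tasks, then joins the worker thread.
    ~serial_executor()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cond_.notify_one();
        thread_.join();
    }

    // Queues a task and returns a future for its result. Tasks run one at a
    // time, in the order they were queued.
    template <typename F>
    auto invoke(F&& task) -> std::future<decltype(task())>
    {
        using result_t = decltype(task());
        auto packaged  = std::make_shared<std::packaged_task<result_t()>>(std::forward<F>(task));
        auto future    = packaged->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.emplace_back([packaged] { (*packaged)(); });
        }
        cond_.notify_one();
        return future;
    }

  private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return; // stopping and fully drained
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task(); // run outside the lock so that callers can keep queuing
        }
    }

    std::mutex                        mutex_;
    std::condition_variable           cond_;
    std::deque<std::function<void()>> tasks_;
    bool                              stopping_ = false;
    std::thread                       thread_; // declared last: starts after the other members exist
};
```

A wrapper in the style of decklink_consumer_proxy would own one such executor and route every call to the wrapped object through invoke().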
std::async always spawns a new thread
@Julusian
Thanks for the detailed reply. I agree with the std::async part; however, bypassing the graph code whenever the DIAG command is not used will improve performance. Is it feasible to incorporate the first proposal? IMHO, the existing code could be used in conjunction with a conditional statement, #ifdef, or even a lambda to disable it selectively.