Server: 2.2.0 beta 1 performance

Created on 11 Mar 2018 · 27 Comments · Source: CasparCG/server

Expected Behaviour

Better performance in Decklink producer playback.
Performance that at least matches 2.0.7 and 2.1.0 beta 2.
Near-linear growth of CPU, GPU and RAM consumption as load increases with new sources and layers.

Current Behaviour

Decklink producer playback performs 2-4x worse than in 2.0.7 and 2.1.0 beta 2.
Resource consumption grows exponentially as inputs are added.
Unexpected crashes at some strange threshold when CPU/GPU maxes out.

Steps to Reproduce

4 channels 1080i5000 with no consumers
8 Decklink inputs available
Fire one command at a time, measuring CPU and RAM (Windows Task Manager) and GPU load (GPU-Z) after each:

play 1-0 decklink device 1 format 1080i5000
play 2-0 decklink device 2 format 1080i5000
play 3-0 decklink device 3 format 1080i5000
play 4-0 decklink device 4 format 1080i5000
play 1-1 decklink device 5 format 1080i5000
play 2-1 decklink device 6 format 1080i5000
play 3-1 decklink device 7 format 1080i5000
play 4-1 decklink device 8 format 1080i5000

Environment

CasparCG Server version: 2.2.0 beta 1 vs 2.1.0 beta 2 vs 2.0.7 stable
Operating system: Windows 7

Attachments

2.2.0 beta 1

Idle at 0% cpu and 0% gpu
1 input 7% cpu and 34% gpu 3,25gb ram
2 inputs 17% cpu and 36% gpu 3,53gb ram
3 inputs 33% cpu and 56% gpu 3,72gb ram
4 inputs 81% cpu and 93% gpu 4,06gb ram. Crashes after a few seconds.

2.1.0 beta 2

Idles at 3% cpu and 0% gpu 3,60gb ram

1 input 3% cpu and 30% gpu 3,72gb ram
2 inputs 4% cpu and 16% gpu 3,84gb ram
3 inputs 5% cpu and 24% gpu 3,98gb ram
4 inputs 5% cpu and 32% gpu 4,15gb ram
5 inputs 6% cpu and 34% gpu 4,42gb ram
6 inputs 7% cpu and 35% gpu 4,70gb ram
7 inputs 9% cpu and 40% gpu 4,98gb ram
8 inputs 9% cpu and 40% gpu 5,24gb ram
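
To make the growth pattern in these figures easier to see, here is a minimal sketch that computes how many CPU percentage points each additional input adds. The numbers are copied from the measurements above; the helper name `cpu_deltas` is hypothetical and not part of CasparCG.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU usage (%) copied from the measurements above.
// Index 0 is the idle figure, index n is the figure with n inputs.
const std::vector<int> cpu_2_2_0 = {0, 7, 17, 33, 81};
const std::vector<int> cpu_2_1_0 = {3, 3, 4, 5, 5, 6, 7, 9, 9};

// Hypothetical helper: percentage points added by each additional input.
std::vector<int> cpu_deltas(const std::vector<int>& usage)
{
    assert(!usage.empty());
    std::vector<int> deltas;
    for (std::size_t i = 1; i < usage.size(); ++i)
        deltas.push_back(usage[i] - usage[i - 1]);
    return deltas;
}
```

For 2.2.0 beta 1 the increments are 7, 10, 16 and 48 points, while 2.1.0 beta 2 never adds more than 2 points per input.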

type/enhancement

Source

jesperstarkar

Most helpful comment

After a bit of debugging, the GPU usage comes out at 20-25% per active channel, which is maybe to be expected but seems a bit high when there is no compositing involved (on a GTX 1070).
The crash was an easily solvable null exception. I'll raise a PR in a minute for that.
The non-linear CPU increase is a more interesting one. I had a look at the diag window (didn't manage to get a screenshot of it) and the frame time for each producer climbed each time I added another, so it quickly crossed the threshold for being realtime.
If we fall behind far enough, the decklink will send us a null frame (but with audio) and that is the cause of the crash.
So the problem here is the graph taking too long and blocking the decklink callback. A thread could help here. I didn't get to a proper conclusion on the CPU jump between inputs 3 and 4, but I suspect it will be resolved by making the decklink producer use a thread.

Julusian on 11 Mar 2018

🎉3
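
The idea in the comment above, keeping slow graph work out of the capture callback, can be sketched as follows. This is an illustration under stated assumptions, not the CasparCG implementation: `frame`, `frame_worker`, the bounded queue and the drop-oldest policy are all hypothetical. The callback side only enqueues, and null frames are rejected before they reach any processing.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Stand-in for a captured video frame (hypothetical type).
struct frame
{
    int number;
};

// Hypothetical helper: accepts frames from a capture callback and processes
// them on its own thread, so the callback returns immediately.
class frame_worker
{
  public:
    frame_worker(std::size_t capacity, std::function<void(const frame&)> process)
        : capacity_(capacity), process_(std::move(process)), thread_([this] { run(); })
    {
        assert(capacity > 0);
    }

    ~frame_worker()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cond_.notify_one();
        thread_.join(); // remaining queued frames are processed before exit
    }

    // Called from the capture callback; never waits for processing.
    // Returns false if the frame was null and therefore ignored.
    bool push(std::shared_ptr<frame> f)
    {
        if (!f)
            return false;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.size() == capacity_) {
                queue_.pop_front(); // falling behind: drop the oldest frame
                ++dropped_;
            }
            queue_.push_back(std::move(f));
        }
        cond_.notify_one();
        return true;
    }

    std::size_t dropped() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return dropped_;
    }

  private:
    void run()
    {
        for (;;) {
            std::shared_ptr<frame> f;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty())
                    return; // stopping and fully drained
                f = std::move(queue_.front());
                queue_.pop_front();
            }
            process_(*f); // slow work happens outside the lock
        }
    }

    const std::size_t capacity_;
    std::function<void(const frame&)> process_;
    mutable std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<std::shared_ptr<frame>> queue_;
    std::size_t dropped_ = 0;
    bool stopping_ = false;
    std::thread thread_; // declared last so it starts after the other members exist
};
```

In a capture producer the callback would call `push` and return, while `process` would run the filter graph. Dropping the oldest frame when the queue is full is one possible policy; it bounds memory at the cost of skipped frames.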

All 27 comments

2.2 runs in progressive so it basically has 2x GPU usage as well as extra cpu overhead for deinterlacing. This is expected.

ronag on 11 Mar 2018
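
The "2x" figure follows from simple pixel-rate arithmetic, sketched below. It assumes the deinterlacer emits one full progressive frame per incoming field; the helper `pixels_per_second` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Pixels per second for a given frame size and full-frame rate.
std::int64_t pixels_per_second(int width, int height, int frames_per_second)
{
    assert(width > 0 && height > 0 && frames_per_second > 0);
    return static_cast<std::int64_t>(width) * height * frames_per_second;
}

// 1080i50: 50 fields per second, each half a frame, so 25 full frames' worth of pixels.
const std::int64_t interlaced_rate = pixels_per_second(1920, 1080, 25);

// Assumption: one full progressive frame per field, so 50 full frames per second.
const std::int64_t progressive_rate = pixels_per_second(1920, 1080, 50);
```

Under that assumption each input carries 103,680,000 pixels per second through the progressive pipeline instead of 51,840,000.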

Fixing this will require re-introducing native interlaced support. Which I don't see happening.

ronag on 11 Mar 2018

How much memory does that machine have? It seems to crash due to out of memory. The memory usage is a bit interesting. Do you have a log + DIAG screenshot around the time it crashes?

ronag on 11 Mar 2018

16gb. It crashes with this:

[2018-03-11 18:14:01.209] [fatal]   #######################
[2018-03-11 18:14:01.209] [fatal]    UNHANDLED EXCEPTION: 
[2018-03-11 18:14:01.209] [fatal]   Adress:000007FEFD80A06D
[2018-03-11 18:14:01.209] [fatal]   Code:3221225477
[2018-03-11 18:14:01.209] [fatal]   Flag:0
[2018-03-11 18:14:01.209] [fatal]   Info:000000005020C470
[2018-03-11 18:14:01.209] [fatal]   Continuing execution. 
[2018-03-11 18:14:01.209] [fatal]   #######################

jesperstarkar on 11 Mar 2018

Does that happen when a specific number of inputs or memory usage is reached?

ronag on 11 Mar 2018

Yes. Look at the jump between 3rd and 4th input added. It's massive. Then crashes. Reproducible.

jesperstarkar on 11 Mar 2018

Are you getting drop-frame warnings in DIAG before it happens?

ronag on 11 Mar 2018

I have run it through the VS CPU profiler, and with four 1080i50 inputs it is spending 50% of the CPU time in decklink_producer.VideoInputFrameArrived:

30% in calling av_buffersink_get_frame_flags
10% in calling av_buffersrc_write_frame
12% in ffmpeg::make_frame

It doesn't quite max out my CPU with that, and hits 1.6 GB of RAM in use. One channel with decklink consumer.

Julusian on 11 Mar 2018

This is probably because the decklink SDK is calling VideoInputFrameArrived faster than we are able to process the frames, so we start buffering up input frames. We should probably move the graph to a separate thread.

ronag on 11 Mar 2018

@ronag I ran without DIAG since it adds 20-25% CPU usage from idle, and I wanted to keep that out of the measurements.

jesperstarkar on 11 Mar 2018

Yeah, it looks to me like we are pulling frames from the graph immediately, and doing a while(true) with no sleeps until a frame is ready. I am assuming av_buffersink_get_frame_flags doesn't block there.
A separate thread sounds like a good idea there.

Julusian on 11 Mar 2018

There isn't much to do here... we could possibly avoid the extra overhead in make_frame... and move the processing to a separate thread (to avoid the crash), but that won't actually help with proper playback.

You need a beefier machine or run progressive :).

Maybe we could make the deinterlace filter configurable to allow a faster version.

ronag on 11 Mar 2018

A separate thread sounds like a good idea there

Yea, it would be appropriate.

ronag on 11 Mar 2018

Ah, we should probably use AVBufferRef for the frames sent to av_buffersrc_write_frame to avoid the extra memcpy.

The av_buffersrc_write_frame and make_frame overhead should be possible to remove.

ronag on 11 Mar 2018

Performance has not been a focus for 2.2.

ronag on 11 Mar 2018

The primary goal for this version has been code cleanup, improved performance and improved stability/reliability.

http://casparcg.com/forum/viewtopic.php?f=13&p=30575&sid=a20a315112b8756c9d58d3ca212a501e#p30575

Julusian on 11 Mar 2018

Well, I would disagree with the performance part of that :)

ronag on 11 Mar 2018

Julusian on 11 Mar 2018

🎉3

@ronag Well, I would disagree with the performance part of that :)

I'm about to change the text now. It sounded really neat though, but you're totally right! I apologise for the confusion. 😄

dotarmin on 12 Mar 2018

😄2

I've had more of a look into this and have found the same CPU impact when playing 1080i50 video files too.
It turns out my assumption about av_buffersink_get_frame_flags was wrong: it is what runs the graph, and so it does block. That CPU time is essentially being spent deinterlacing the input video, so I guess it is an expected performance loss, but because it also affects video playback it is much more likely to impact a wider set of users.

I haven't narrowed down the cause of the CPU usage jump between 3 and 4 inputs, as it doesn't happen on my machine, but I know the CPU in question is 4 cores + HT, so I suspect there just aren't enough cores for it to run all the deinterlacing at once.

For reference with 4 decklink inputs running 1080i50:
deinterlacing disabled: 8% cpu
deinterlacing enabled: 50% cpu

Playing 4 1080i50 mjpeg videos (from blackmagic media express) uses 60% cpu

Julusian on 13 Mar 2018

I guess the best fix would be to port bwdif to OpenGL or, alternatively, somehow allow interlaced frames through the progressive pipeline. The second option would need some kind of frame_index passed to frame_producer::receive so that the producer can sync in the interlaced frames on even indices so they come out properly from the mixer (it's hacky).

Porting bwdif to OpenGL should actually be easier than it sounds. I did it from yadif a few years ago without too much hassle. I just can't find the code anymore. Though you might also need a more powerful GPU. But that's probably easier than CPU.

ronag on 13 Mar 2018

You could also try and run bwdif in non temporal mode which should make it a bit faster.

ronag on 13 Mar 2018

Disable graph command codes when DIAG command is not in use

Graph-related code is verbose and eats CPU cycles, yet it is required only during analysis.

Easy way: pass the existing graph code as a std::function to a helper graph function, which will either discard or execute it depending on a std::atomic<bool> graph_disable flag.

Ideal but hard way: send the graph code as async packaged tasks. This could be overwhelming due to the rapid task creation.

General suggestion: I am quite surprised to find that the CasparCG code does not use some kind of thread pool. Most of the multi-threading code, including the executor in the server code, uses std::thread, which is way slower than a normal function launch.

seccpur on 18 Apr 2018
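
The "easy way" proposed above can be sketched as follows. This is only an illustration of the proposal, not existing CasparCG code: `graph_disable` is the flag name from the comment, while `update_graph` and its return value are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Flag from the proposal: true while no DIAG window is consuming graph data.
std::atomic<bool> graph_disable{true};

// Hypothetical helper: runs the graph update only when diagnostics are enabled.
// Returns true if the update was executed, false if it was skipped.
bool update_graph(const std::function<void()>& update)
{
    if (graph_disable.load(std::memory_order_relaxed))
        return false; // diagnostics off: skip the update entirely
    assert(update && "update must be callable");
    update();
    return true;
}
```

Wrapping every call in a std::function has a cost of its own, so whether this saves CPU time in practice would need to be measured.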

❤1 👎1 👍1

Can we expect a PR from you @seccpur ?

jesperstarkar on 18 Apr 2018

@seccpur
While the graph code is using cpu even when not being displayed, I'm not sure if it can be improved much.
If not being rendered, all the graph type does is track the latest values in a map, and doing that via std::async could well introduce more overhead. As most of these values are time measurements or buffer size checks, their computation can't be moved to another thread either.

A lot of the executor code likely dates back to 2.0, and so was written before std::async existed, so some of it could definitely be considered for change. However, not everything wants to be run on a thread pool. It is useful to have threads for different parts, as it is more predictable. And some of them are also acting as locks. For example, the executor in decklink_consumer_proxy acts as a lock for the decklink_consumer it wraps. If this was done with std::async and a lock, then we risk having multiple threads waiting on the lock.

Julusian on 18 Apr 2018

👍2
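
The "executor acting as a lock" pattern described above can be sketched like this. It is a simplified illustration, not the CasparCG executor: `serial_executor` and `counter_proxy` are hypothetical names. Because a single thread runs all tasks in order, the wrapped state needs no mutex of its own.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-threaded executor: tasks run one at a time, in order.
class serial_executor
{
  public:
    serial_executor() : thread_([this] { run(); }) {}

    ~serial_executor()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cond_.notify_one();
        thread_.join();
    }

    // Queues a task and hands back a future for its result.
    std::future<int> invoke(std::function<int()> task)
    {
        assert(task);
        auto packaged = std::make_shared<std::packaged_task<int()>>(std::move(task));
        std::future<int> result = packaged->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back([packaged] { (*packaged)(); });
        }
        cond_.notify_one();
        return result;
    }

  private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return; // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread thread_; // declared last so it starts after the other members exist
};

// Stand-in for a wrapped object that is not thread-safe on its own.
class counter_proxy
{
  public:
    std::future<int> increment()
    {
        return executor_.invoke([this] { return ++value_; });
    }

  private:
    int value_ = 0;            // only ever touched on the executor thread
    serial_executor executor_; // declared last: joined before value_ is destroyed
};
```

Callers from any thread only ever queue work, so none of them blocks on a contended lock; they wait on the returned future only if they need the result.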

std::async always spawns a new thread

ronag on 18 Apr 2018

@Julusian
Thanks for the detailed reply. I agree with the std::async part; however, bypassing the graph code whenever the DIAG command is not in use will improve performance. Is it feasible to incorporate the first proposal? IMHO, the existing code could be used in conjunction with a conditional statement, an #ifdef or even a lambda to disable it selectively.

seccpur on 19 Apr 2018
