Server: Channel output freezes (including html overlays etc.)

Created on 19 Jan 2017 · 17 Comments · Source: CasparCG/server

We are experiencing complete freezing of channel output when playing a list of clips. It doesn't happen often: about once every few days on a 24/7 playlist. We haven't succeeded in manually reproducing the problem. The complete composited output of one channel freezes (i.e. HTML overlays etc. are included).

Details:

It's always on the last frame of a clip.
All clips that freeze have been played out without issue many times before.
If we play the same clip manually in a separate channel, everything also goes fine, so the clip is probably not corrupt.
Overlays like html templates also freeze.
Other channels stay functional.
Commands (like PLAY, CLEAR, INFO) issued from a telnet client connected after the freezing do nothing (no response at all, empty commands still give a 400 ERROR). Only on the frozen channel though. Other channels keep working fine.
Strangely, however, incoming commands from a connected node.js client (connected before the freeze) are seemingly executed fine by CasparCG according to the logs. If you looked only at the standard logs, you would think CasparCG was still playing out fine on the frozen channel.
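Because a frozen channel gives no response at all while the standard logs look healthy, one way to detect the state from outside is to send a channel command over AMCP and treat a missing reply as a freeze. The following is only a minimal sketch of that idea: the function name `probe`, the timeout value, and the choice of command are illustrative assumptions, not something taken from our setup.

```python
import socket


def probe(sock, command, timeout=2.0):
    """Send one AMCP command and return the first reply line.

    Returns None if nothing arrives within `timeout` seconds, which is
    the symptom described above for commands sent to a frozen channel.
    """
    sock.settimeout(timeout)
    # AMCP commands are terminated by CRLF.
    sock.sendall((command + "\r\n").encode("utf-8"))
    buf = b""
    try:
        while b"\r\n" not in buf:
            chunk = sock.recv(4096)
            if not chunk:
                break  # connection closed by the peer
            buf += chunk
    except socket.timeout:
        return None
    line = buf.split(b"\r\n", 1)[0].decode("utf-8", "replace")
    return line or None
```

A watchdog could open a fresh connection with `socket.create_connection((host, 5250))` (5250 being the default AMCP port; the host is whatever your server uses) and call `probe(sock, "INFO 1")` periodically. A `None` result on one channel while other channels still answer would match the behaviour listed above.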

Only when we turned on trace-level logging did we find some pointers that something changes when the channel freezes. After the clip that freezes, the clips that follow no longer show the ffmpeg_input[...] Received EOF and Shutting down ffmpeg_input messages in the logs.
Perhaps relevant: during normal playout (before the freeze) clips sometimes show multiple 'Received EOF' messages. Is this correct behavior?
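To locate the moment these messages stop, a trace log can be scanned for the last timestamp at which each message appears. This is a rough sketch only: it assumes each log line starts with a bracketed timestamp such as `[2017-01-10 16:05:13.418]` and contains the message text verbatim; the function name `last_seen` is made up for illustration.

```python
import re

# Assumed line prefix: "[YYYY-MM-DD HH:MM:SS.mmm]"
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]")
MARKERS = ("Received EOF", "Shutting down ffmpeg_input")


def last_seen(lines, markers=MARKERS):
    """Return {marker: timestamp of its last occurrence, or None}."""
    latest = {marker: None for marker in markers}
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip continuation lines without a timestamp
        for marker in markers:
            if marker in line:
                latest[marker] = match.group(1)
    return latest
```

Passing an open log file (for example `last_seen(open(path))`, where `path` is your own trace log) gives the last time each message was logged. If that time is well before the end of the log while PLAY commands are still being logged, the channel has likely frozen around that point.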

The issue occurs on two different systems:

Ubuntu 16.04.1 LTS, CasparCG 2.1.0.294 a8775b6 Beta 1, DeckLink Quad 2 with driver 10.8.1, NVIDIA Quadro m2000
Ubuntu 16.04.1 LTS, CasparCG 2.1.0.294 a8775b6 Beta 1, DeckLink Duo 2 with driver 10.8.4, NVIDIA Quadro m2000

Does this type of error ring a bell with someone?

type/bug


saltomodules

Most helpful comment

I can confirm that my reproduction script, which had a 100% failure rate, is now unable to crash CasparCG. I'm really happy you looked into it and found the problem. We're running CasparCG in a 24/7 environment, and this race condition wasn't a hypothetical one.

@HellGore Please send me ([email protected]) your contact info / address so we can send some love from Amsterdam ;-)

toontoet on 23 Feb 2017

❤1 🎉1

All 17 comments

It may be a problem with the Decklink driver. I had the same problems on a Windows 7 machine with Decklink 4K Extreme.
Please try it with Decklink driver 10.5.4. This version stopped the freezes on my machine.

premultiply on 19 Jan 2017

> Please try it with Decklink driver 10.5.4. This version stopped the freezes on my machine.

Thanks for the suggestion. Sadly though, the Quad 2 is not supported before version 10.6.1.

saltomodules on 20 Jan 2017

I've attached an excerpt from our logs for extra information.
170110 log excerpt frozen channel output.txt

The channel freeze happens (very close to) [2017-01-10 16:05:13.418]. This is the last time a clip shows a Shutting down ffmpeg_input message. The channel freezes on the last frame of this clip (584feea88483016388cf31bb_broadcast.mp4)
Note that the clip that follows directly after in the logs (57f776992b6095b7cce79c86_broadcast.mp4) is the last clip to have a Received EOF message at [2017-01-10 16:05:13.427]. Also note that this clip and all that follow only play out according to the logs. The real channel output remains frozen.
The Dynamic exception type: boost::bad_rational error at [2017-01-10 16:06:51.001] is almost certainly issue #536. Since this starts happening continuously after the freeze, it's likely that the clip that follows the frozen clip is stuck in LOADBG mode.

saltomodules on 20 Jan 2017

Yesterday we experienced a freeze again. We've got two channels running 1 video layer and 5 CG HTML templates each. When @saltomodules filed the report, we were running one channel with and another without CG layers.
This time both channels froze at the same time.

When we ran some diagnostics like INFO QUEUES, INFO THREADS, etc., we got proper results. When we did an INFO SERVER, we didn't get any response anymore: not from the INFO SERVER, nor from any other command we issued afterwards. It seemed to lock the system up.

Because the other channel now froze as well, we figured it might have something to do with the HTML rendering.

When we killed the zygote thread, which contains a bunch of renderer threads, commands started to give responses again. This didn't result in a functioning system, however, because our command queue was full and 500s were returned.

screen shot 2017-01-27 at 10 24 27

While the channels were frozen, the thread list also showed a very large number of ffmpeg threads. It looked as if the threads were started but hung and never got closed, probably because the channel wasn't playing anything and our controller software kept issuing PLAY commands.

threads.txt

We also forced some memory leaks in the HTML layers to see if that could cause similar behaviour. This only caused the specific layer to stop; there were no fatal crashes of the channel.

Any suggestions on how to debug this any further?

toontoet on 27 Jan 2017

@toontoet Do you have something like executor[html_producer] Overflow. Blocking caller. in the log?

HellGore on 27 Jan 2017

@HellGore nope. Only regular [debug] Shutting down html_producer. But those are expected after finishing a PLAY [html] ... command right?

toontoet on 27 Jan 2017

This past weekend we were able to narrow it down a bit more. We are now able to reproduce the freezing, and it seems it has nothing to do with the html_producers.

To reproduce it, we wrote a small script that fires a bunch of LOADBGs and PLAYs. This simulates the script we were running during the random freezes. To speed up the process, we decreased the time between the LOADBGs and the PLAYs, so that it tries to load and play about 5 clips in 3 seconds. The clips have different resolutions. No HTML layers were used.

crash.txt
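For readers without access to the attachment, the general shape of such a salvo can be sketched as follows. This is not the attached script: the function names, the clip names you pass in, the host, and the 0.3 second gap (chosen so that 5 LOADBG/PLAY pairs take roughly 3 seconds) are all illustrative assumptions.

```python
import socket
import time


def build_salvo(clips, channel=1, layer=1):
    """Return the AMCP commands for one cycle: LOADBG then PLAY per clip."""
    target = f"{channel}-{layer}"
    commands = []
    for clip in clips:
        commands.append(f'LOADBG {target} "{clip}"')
        commands.append(f"PLAY {target}")
    return commands


def run(host, port, clips, cycles=25, gap=0.3):
    """Fire the salvo repeatedly with a short pause between commands.

    Replies are deliberately not read here; a real script would want to
    read them to notice when the channel stops answering.
    """
    with socket.create_connection((host, port)) as sock:
        for _ in range(cycles):
            for command in build_salvo(clips):
                sock.sendall((command + "\r\n").encode("utf-8"))
                time.sleep(gap)
```

Calling `run("localhost", 5250, clips)` with a list of five clip names of differing resolutions approximates the load pattern described above. Whether it triggers the freeze on a given build and consumer configuration is something to verify rather than assume.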

Within a minute or so, one or two channels freeze completely, with exactly the same behaviour as before: no response to any command for the crashed channel (like INFO 1, PLAY 1-1 ..., etc.).
Also, the number of ffmpeg_input threads starts to increase.
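On Linux, the growth in thread count can be watched by reading the per-thread names under `/proc/<pid>/task`. The sketch below is an assumption-laden helper, not something we ran: it assumes the server's threads carry a recognisable name there (the kernel truncates thread names to 15 characters), and `count_threads` is a hypothetical name.

```python
from pathlib import Path


def count_threads(pid, needle, proc_root="/proc"):
    """Count threads of process `pid` whose name contains `needle`."""
    count = 0
    for comm in Path(proc_root, str(pid), "task").glob("*/comm"):
        try:
            name = comm.read_text().strip()
        except OSError:
            continue  # the thread exited while we were scanning
        if needle in name:
            count += 1
    return count
```

Sampling `count_threads(pid, "ffmpeg")` every few seconds while the reproduction runs should show a steadily rising number if input threads are being started but never shut down.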

One might argue that the behaviour of the script is not common, but due to network latency or other circumstances a salvo of commands might occur.

The output looks OK, skipping quickly through the clips, but after a few cycles we get the freeze.

When we were able to reproduce a freeze we did some experimentation.
We tried different builds, but no difference there.

But when we started experimenting with different consumers, it became clear that it only happened when we used the DECKLINK consumers in our channel config. If we removed the DeckLinks, it kept running OK for a long time. If we added a bunch of stream producers to increase the server load, it kept running. No freezes.
But if we included the decklinks, it froze again.

We also tried different settings for and with no effect.

We ran the test on the machine having the DeckLink Duo 2 with driver 10.8.4 on Ubuntu 16.04.

Maybe this information can help in finding out what causes the crash.

toontoet on 30 Jan 2017

Here are 2 log files captured during the tests:

Running 25 cycles of commands WITH decklink, causing a crash on channel 1.
caspar_2017-01-30_crashed.txt

Running 25 cycles of commands WITHOUT decklink, running fine:
caspar_2017-01-30_nocrash.txt

toontoet on 30 Jan 2017

We just reproduced the crash on the machine with the DeckLink Quad 2 card as well. Same result: a running server, no errors, and no response to channel-related commands like INFO 1 or INFO SERVER.

toontoet on 3 Feb 2017

I just compiled a debug version on our server. It runs OK, and I can still reproduce the crash / freeze. I don't have a real clue where to look next to find the cause of the freezing.
Anybody here able to help out?

toontoet on 6 Feb 2017

Well I'm going to look at it and hopefully find the problem.

HellGore on 6 Feb 2017

❤1

Any update on this? I'm willing to pay a small reward for this to get solved.

toontoet on 15 Feb 2017

I am really confident that I have fixed this now in the latest build. Linux is built, Windows version building now.

HellGore on 23 Feb 2017

🎉2

Woohooo!!! Can't wait to do some testing on it. Will let you know the results!! :-) 👍

toontoet on 23 Feb 2017

I was just really happy that it was something I had done wrong and not blackmagic. It is so much easier when it is in our codebase.