We are experiencing complete freezing of channel output when playing a list of clips. It doesn't happen often: about once every few days on a 24/7 playlist. We haven't succeeded in manually reproducing the problem. The complete composited output of one channel freezes (i.e. HTML overlays etc. are included).
Details:
The issue occurs on two different systems:
Does this type of error ring a bell with anyone?
It may be a problem with the Decklink driver. I had the same problems on a Windows 7 machine with Decklink 4K Extreme.
Please try it with Decklink driver 10.5.4. This version stopped the freezes on my machine.
Thanks for the suggestion. Sadly though, the Quad 2 is not supported before version 10.6.1.
I've attached an excerpt from our logs for extra information.
170110 log excerpt frozen channel output.txt
The channel freeze happens (very close to) [2017-01-10 16:05:13.418]. This is the last time a clip shows a Shutting down ffmpeg_input message. The channel freezes on the last frame of this clip (584feea88483016388cf31bb_broadcast.mp4)
Note that the clip that follows directly after in the logs (57f776992b6095b7cce79c86_broadcast.mp4) is the last clip to have a Received EOF message at [2017-01-10 16:05:13.427]. Also note that this clip and all that follow only play out according to the logs. The real channel output remains frozen.
The Dynamic exception type: boost::bad_rational error at [2017-01-10 16:06:51.001] is almost certainly issue #536. Since this starts happening continuously after the freeze, it's likely that the clip that follows the frozen clip is stuck in LOADBG mode.
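For anyone who wants to pin down the same moment in their own logs, here is a minimal sketch of how the last occurrence of a message could be located. It assumes every relevant log line starts with a bracketed timestamp like the ones quoted above; the function name and the regular expression are my own and not part of CasparCG.

```python
import re

# Assumed line shape: "[YYYY-MM-DD HH:MM:SS.mmm] ... message".
# Adjust the pattern if the real log lines look different.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]")


def last_occurrence(lines, marker):
    """Return the timestamp of the last line containing marker, or None."""
    last = None
    for line in lines:
        if marker in line:
            match = TIMESTAMP.match(line)
            if match:
                last = match.group(1)
    return last
```

Calling it with the opened log file and the marker "Shutting down ffmpeg_input", and again with "Received EOF", gives the two timestamps to compare.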
Yesterday we experienced a freeze again. We've got two channels running 1 video layer and 5 CG HTML templates each. When @saltomodules filed the report we were running one channel with and another without CG layers.
This time both channels froze at the same time.
When we ran some diagnostics like INFO QUEUES, INFO THREADS etc. we got proper results. When we issued INFO SERVER we no longer got any response, neither to INFO SERVER nor to any other command we issued afterwards. It seemed to lock the system up.
Because the other channel froze as well this time, we figured it might have something to do with the HTML rendering.
When we killed the zygote thread, which contains a bunch of renderer threads, commands started to give responses again. This didn't result in a functioning system, however, because our command queue was full and 500s were returned.

While the channels were frozen, the thread list also showed a great many ffmpeg threads. It looked as if the threads were started but hung and never got closed, probably because the channel wasn't playing anything and our controller software kept issuing PLAY commands.
We also forced some memory leaks in the HTML layers to see if that could cause similar behaviour. This only caused the specific layer to stop; there were no fatal crashes of the channel.
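To watch the thread count over time rather than eyeballing the thread list, something like the sketch below could be polled on Linux. It reads thread names from /proc/PID/task/TID/comm; that the server's threads show up there under a recognisable name such as ffmpeg_input is an assumption, and count_threads is a hypothetical helper, not a CasparCG tool.

```python
import os


def count_threads(pid, name_part, proc_root="/proc"):
    """Count threads of process pid whose name contains name_part (Linux only)."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    count = 0
    for tid in os.listdir(task_dir):
        try:
            # Each thread exposes its (possibly truncated) name in "comm".
            with open(os.path.join(task_dir, tid, "comm")) as handle:
                if name_part in handle.read():
                    count += 1
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
    return count
```

Logging this number every few seconds next to the server log should show whether the count keeps climbing once a channel has frozen.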
Any suggestions on how to debug this any further?
@toontoet Do you have something like executor[html_producer] Overflow. Blocking caller. in the log?
@HellGore nope. Only regular [debug] Shutting down html_producer. But those are expected after finishing a PLAY [html] ... command right?
Past weekend we were able to narrow it down a bit more. We are now able to reproduce the freezing and it seems it has nothing to do with the html_producers.
To reproduce, we wrote a small script that fires a bunch of LOADBGs and PLAYs. This simulates the script we were running during the random freezes. To speed the process up a bit we decreased the time between the LOADBGs and the PLAYs, so that it tries to load and play about 5 clips in 3 seconds. The clips are of different resolutions. No HTML layers were used.
Within a minute or so, one or two channels freeze completely, showing exactly the same behaviour as before: no response to any command for the crashed channel (like INFO 1, PLAY 1-1 ..., etc.).
Also, the number of ffmpeg_input threads starts to increase.
One might argue that the behaviour of the script is not common, but due to network latency or other circumstances a salvo of commands might occur.
The output looks OK, skipping quickly through the clips, but after a few cycles we get the freeze.
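For reference, a script of this kind could look roughly like the sketch below. It is not the exact script we ran. It assumes the server accepts AMCP over TCP on the default port 5250 with commands terminated by CRLF; the function names, the clip names you pass in, and the 0.3 second delay (10 commands in about 3 seconds for 5 clips) are illustrative choices.

```python
import socket
import time


def build_cycle(clips, channel=1, layer=1):
    """Return the AMCP commands for one cycle: LOADBG then PLAY for each clip."""
    commands = []
    for clip in clips:
        commands.append(f'LOADBG {channel}-{layer} "{clip}"')
        commands.append(f"PLAY {channel}-{layer}")
    return commands


def run(host, clips, cycles=25, delay=0.3, port=5250):
    """Send the cycle repeatedly; replies are not read in this sketch."""
    with socket.create_connection((host, port), timeout=5) as conn:
        for _ in range(cycles):
            for command in build_cycle(clips):
                conn.sendall((command + "\r\n").encode("utf-8"))
                time.sleep(delay)
```

Running something like run("localhost", [...]) with five clip names of different resolutions approximates the salvo described above. A longer-running version should also read the server's replies so the socket buffer does not fill up.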
When we were able to reproduce a freeze we did some experimentation.
We tried different builds, but no difference there.
But when we started experimenting with different consumers, it became clear that it only happened when we used the DECKLINK consumers in our channel config. If we removed the decklinks, it kept running OK for a long time. If we added a bunch of stream producers to increase the server load, it kept running. No freezes.
But if we included the decklinks, it froze again.
We also tried different settings for
We ran the test on the machine having the DeckLink Duo 2 with driver 10.8.4 on Ubuntu 16.04.
Maybe this information can help in finding out what causes the crash.
Here are 2 log files captured during the tests:
Running 25 cycles of commands WITH decklink, causing a crash on channel 1.
caspar_2017-01-30_crashed.txt
Running 25 cycles of commands WITHOUT decklink, running fine:
caspar_2017-01-30_nocrash.txt
We just reproduced the crash on the machine with the Decklink QUAD 2 card as well. Same result: a running server, no errors and no response to channel-related commands like INFO 1 or INFO SERVER.
I just compiled a debug version on our server. It runs OK, and I can still reproduce the crash / freeze. I don't have a real clue where to look next to find the cause of the freezing.
Anybody here able to help out?
Well I'm going to look at it and hopefully find the problem.
Any update on this? I'm willing to pay a small reward for this to get solved.
I am really confident that I have fixed this now in the latest build. Linux is built, Windows version building now.
Woohooo!!! Can't wait to do some testing on it. Will let you know the results!! :-) 👍
I was just really happy that it was something I had done wrong and not blackmagic. It is so much easier when it is in our codebase.
I can confirm that my reproduction script, which had a 100% failure rate, is now unable to crash CasparCG. I'm really happy you looked into it and found the problem. We're running CasparCG in a 24/7 environment and this race condition wasn't a hypothetical one.
@HellGore Please send me ([email protected]) your contact info / address so we can send some love from Amsterdam ;-)
Please try with 2.2 and reopen if it is still an issue.