Server: INFO command can crash CasparCG 2.2

Created on 19 Aug 2018  路  5Comments  路  Source: CasparCG/server

Expected Behaviour

INFO command has no effect on stability whatsoever

Current Behaviour

Our 2.2 test setup was humming along nicely for multiple days until a single INFO 1 command took down the entire server:

[2018-08-17 14:22:50.296] [info]    async_event_server[:5250] Accepted connection from 127.0.0.1 (5 connections).
[2018-08-17 14:22:50.299] [info]    Received message from 127.0.0.1: INFO 1\r\n
[2018-08-17 14:22:50.300] [debug]   Executing command: INFO
Aborted (core dumped)

Steps to Reproduce

To see if we could reproduce we created a test script that sends fifty INFO commands per second. In some variations of the command the script toppled the server indeed within half a minute or so:

  • INFO with small return message. Doesn't crash. Test ended after 40 minutes:
[2018-08-18 17:40:27.911] [info]    async_event_server[:5250] Accepted connection from 127.0.0.1 (5 connections).
[2018-08-18 17:40:27.915] [info]    Received message from 127.0.0.1: INFO\r\n
[2018-08-18 17:40:27.915] [debug]   Executing command: INFO
[2018-08-18 17:40:27.915] [debug]   Executed command (0s): INFO
[2018-08-18 17:40:27.915] [info]    Sent message to 127.0.0.1:200 INFO OK\r\n1 1080i5000 PLAYING\r\n2 PAL PLAYING\r\n3 PAL PLAYING\r\n4 PAL PLAYING\r\n\r\n
[...]
[2018-08-18 18:21:37.502] [info]    Received message from 127.0.0.1: INFO\r\n
[2018-08-18 18:21:37.503] [debug]   Executing command: INFO
[2018-08-18 18:21:37.503] [debug]   Executed command (0s): INFO
[2018-08-18 18:21:37.503] [info]    Sent message to 127.0.0.1:200 INFO OK\r\n1 1080i5000 PLAYING\r\n2 PAL PLAYING\r\n3 PAL PLAYING\r\n4 PAL PLAYING\r\n\r\n
[2018-08-18 18:21:37.523] [info]    Received message from 127.0.0.1: INFO\r\n
[2018-08-18 18:21:37.523] [debug]   Executing command: INFO
[2018-08-18 18:21:37.524] [debug]   Executed command (0.001s): INFO
  • INFO 1 with large xml return message. Crashes in thirteen seconds:
[2018-08-18 17:21:06.762] [info]    async_event_server[:5250] Accepted connection from 127.0.0.1 (5 connections).
[2018-08-18 17:21:06.766] [info]    Received message from 127.0.0.1: INFO 1\r\n
[2018-08-18 17:21:06.766] [debug]   Executing command: INFO
[2018-08-18 17:21:06.766] [debug]   Executed command (0s): INFO
[2018-08-18 17:21:06.767] [info]    Sent more than 512 bytes to 127.0.0.1
[...]
[2018-08-18 17:21:19.533] [info]    Received message from 127.0.0.1: INFO 1\r\n
[2018-08-18 17:21:19.533] [debug]   Executing command: INFO
[2018-08-18 17:21:19.534] [debug]   Executed command (0.001s): INFO
[2018-08-18 17:21:19.534] [info]    Sent more than 512 bytes to 127.0.0.1
[2018-08-18 17:21:19.555] [info]    Received message from 127.0.0.1: INFO 1\r\n
[2018-08-18 17:21:19.556] [debug]   Executing command: INFO
Aborted (core dumped)
  • INFO 1-1 with large xml return message. Crashes in fourty seconds:
[2018-08-18 18:27:30.123] [info]    async_event_server[:5250] Accepted connection from 127.0.0.1 (5 connections).
[2018-08-18 18:27:30.126] [info]    Received message from 127.0.0.1: INFO 1-1\r\n
[2018-08-18 18:27:30.126] [debug]   Executing command: INFO
[2018-08-18 18:27:30.127] [debug]   Executed command (0.001s): INFO
[2018-08-18 18:27:30.127] [info]    Sent more than 512 bytes to 127.0.0.1
[...]
[2018-08-18 18:28:13.128] [info]    Received message from 127.0.0.1: INFO 1-1\r\n
[2018-08-18 18:28:13.129] [debug]   Executing command: INFO
[2018-08-18 18:28:13.129] [debug]   Executed command (0s): INFO
[2018-08-18 18:28:13.129] [info]    Sent more than 512 bytes to 127.0.0.1
[2018-08-18 18:28:13.150] [info]    Received message from 127.0.0.1: INFO 1-1\r\n
[2018-08-18 18:28:13.151] [debug]   Executing coAborted (core dumped)
  • INFO 1-1 on empty channel with small xml return message. Doesn't crash. Test ended after 45 minutes:
[2018-08-18 18:47:47.553] [info]    async_event_server[:5250] Accepted connection from 127.0.0.1 (1 connections).
[2018-08-18 18:47:47.556] [info]    Received message from 127.0.0.1: INFO 1-1\r\n
[2018-08-18 18:47:47.556] [debug]   Executing command: INFO
[2018-08-18 18:47:47.556] [debug]   Executed command (0s): INFO
[2018-08-18 18:47:47.557] [info]    Sent message to 127.0.0.1:201 INFO OK\r\n<?xml version="1.0" encoding="utf-8"?>\n<channel/>\n\r\n
[...]
[2018-08-18 19:33:35.370] [info]    Received message from 127.0.0.1: INFO 1-1\r\n
[2018-08-18 19:33:35.371] [debug]   Executing command: INFO
[2018-08-18 19:33:35.371] [debug]   Executed command (0s): INFO
[2018-08-18 19:33:35.371] [info]    Sent message to 127.0.0.1:201 INFO OK\r\n<?xml version="1.0" encoding="utf-8"?>\n<channel/>\n\r\n
[2018-08-18 19:33:35.379] [info]    async_event_server[:5250] Client 127.0.0.1 disconnected (3 connections).
  • Perhaps of use, on build 2.2.0 d3063ce33 Beta 4 these variations of the crash message showed up on the console:
[2018-08-19 18:21:20.608] [info]    Received message from 127.0.0.1: INFO 1\r\n
[2018-08-19 18:21:20.608] [debug]   Executing command: INFO
[2018-08-19 18:21:20.608] [debug]   Executed command (0s): INFO
casparcg: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
[2018-08-19 18:21:20.608] [info]Aborted (core dumped)


[2018-08-19 18:24:23.409] [info]    Received message from 127.0.0.1: INFO 1-1\r\n
[2018-08-19 18:24:23.410] [debug]   Executing command: INFO
Segmentation fault (core dumped)

From these tests it seems the issue is somehow related to (the creation of) the large xml response. A threading issue when querying the resources needed for the xml response perhaps?

Environment

CasparCG Server version:
2.2.0 062c4e09a Beta 1  and 2.2.0 d3063ce33 Beta 4
video-mode 1080i5000    

Operating system:
Ubuntu 18.04

Graphics driver:
Nvidia 390.42  (for GTX1050)

Decklink drivers:
Blackmagic Design BlackmagicIO driver 10.10a3>
typbug

Most helpful comment

Hopefully this is fixed in latest master.

I was very easily able to get it crashing within 10s by running

LOAD 1-10 amb
LOAD 1-11 amb
LOAD 1-12 amb
LOAD 1-13 amb
LOAD 1-14 amb

And the following script

const net = require('net')

var client = new net.Socket()
client.connect(5250, '127.0.0.1')

client.on('data', (d) => {
    client.write('info 1\r\n')
})

client.write('info 1\r\n')

It looks like the probelm was concurrent access of the state object, so hopefully this is a sufficient fix to stop it from happening now.

I left this running for 10 mins with the same steps and it didn't crash.

All 5 comments

Experiencing same problem

A useful debugging technique for segfaults is to run the program with valgrind, and see if valgrind can provide more information about why the code aborts.

Hopefully this is fixed in latest master.

I was very easily able to get it crashing within 10s by running

LOAD 1-10 amb
LOAD 1-11 amb
LOAD 1-12 amb
LOAD 1-13 amb
LOAD 1-14 amb

And the following script

const net = require('net')

var client = new net.Socket()
client.connect(5250, '127.0.0.1')

client.on('data', (d) => {
    client.write('info 1\r\n')
})

client.write('info 1\r\n')

It looks like the probelm was concurrent access of the state object, so hopefully this is a sufficient fix to stop it from happening now.

I left this running for 10 mins with the same steps and it didn't crash.

Ran an INFO 1 hammer test for one hour and no crash. So fix definitely seems to be working here. Great!

Issue https://github.com/CasparCG/server/issues/1128 with the segfaults after a LOADBG looks somewhat similar in behavior so we reran reproduction tests for that too.

Those tests sadly still result in a segfault. Might very well be a concurrency issue of some other nature though.

A way to discover concurrency issues is static code analysis. A fairly
well working system for such static code analysis is Coverity, and while
a commercial product, they run a program where free software projects
can get the source scanned without having to pay any money. Perhaps
this is an idea for CasparCG, to avoid problems like these?

See I use it with a few of the projects I am involved it.

I also notice a list of free software alternatives listed on
I do not know how good they are with C++ and this kind of bugs.

--
Happy hacking
Petter Reinholdtsen

Was this page helpful?
0 / 5 - 0 ratings

Related issues

grahamspr86 picture grahamspr86  路  26Comments

jesperstarkar picture jesperstarkar  路  53Comments

jesperstarkar picture jesperstarkar  路  27Comments

ronag picture ronag  路  29Comments

jesperstarkar picture jesperstarkar  路  51Comments