MAVLink over USB fails when the flight controller is on battery power if there's no receiver running on the PC side during the first 1-2 seconds after the connection.
To Reproduce
~~1. Make sure no MAVLink receivers (QGroundControl, MAVROS, etc) are running on your PC
Alternatively,
Expected behavior
MAVLink connection should be established (QGroundControl should display current FCU status, MAVROS should report heartbeats, mavlink_shell should display nsh > prompt)
Actual behavior
The underlying serial device (/dev/ttyACM0 on Linux) appears in the system, but MAVLink connection is not established (QGroundControl suggests rebooting the vehicle, mavlink_shell doesn't go anywhere past Connecting to MAVLINK... message)
I've noticed this behavior on a Pixracer.
Additional context
This issue was present in v1.8.2, but was harder to notice. In v1.8.2, the "timeout" started after I've connected the USB cable to the PC, not straight after FCU boot. Apparently this was enough to let our companion computer finish booting and start MAVROS. This was a bug in Ubuntu 16.04 and is not related to this issue
mavlink process doesn't seem to report any problems in the debug console - in fact, mavlink status shows an instance running over /dev/ttyACM0 on the PX4 side. USB status is shown correctly in the system_power message (running listener system_power shows a message with usb_connected: 1 and usb_valid: 1 when the cable is plugged and usb_connected: 0 and usb_valid: 0 when it isn't).
This is a showstopper for us, since we're using USB for our companion computer connection.
I just tried to reproduce this using a Pixracer and Pixhawk 4 and Fedora 30 and I don't see the issue. I can wait 10 seconds and more, then start QGC and still connect fine. Also, I'm doing this sort of connecting many times every week since months while developing on 1.9 and I never saw it as a regression.
Does this happen on every host computer or just the companion computer that you're using? And have you ruled out a hardware issue by trying with another Pixracer?
@julianoes, thank you for your reply! I've done some more testing, and it appears there are at least two issues there, one of which is actually with my OS. I've tried plugging the flight controller over USB on Ubuntu 18.04 and could not reproduce the issue (my initial testing was done on Ubuntu 16.04). I'll edit the issue to reflect that.
The issue with external power still stands, though. I've tried to use another Pixracer and an older Pixhawk 1, and they all wouldn't connect over USB shortly after booting. The issue is reproducible on Ubuntu 18.04 and Raspbian.
Oh, I missed the part about the external power source! Let me try again.
Ok, I can reproduce this with a Pixhawk 4 on external power.
@davids5 is this a known issue?
I've tried bisecting the bug, and it seems that the bug manifests itself in https://github.com/PX4/Firmware/commit/e12acc4b282d26ae1b8c1e1cb4220b53ef54b097 (previous revision, https://github.com/PX4/Firmware/commit/8b71bb45621d682834807086b82ac65cce179d41, is fine). Seems like the only difference is starting mavlink on USB later than on serial ports.
I've tried moving the commands that start mavlink over USB to /etc/init.d/rcS, and it seems like the USB instance should be started after all other instances: the bug went away if I ran mavlink over USB after /etc/init.d/rc.board_extras. Not sure whether that breaks UART instances, though.
@davids5 please see above, that's very interesting.
@julianoes, I can reproduce this issue on v1.9.1, but I cannot reproduce it in current master. Is this what you observe as well?
I have not tried master.
@dagar , @julianoes , the commit moving mavlink touORB subscription fixes the issue, so this issue does not exist in current PX4:master.
I will work on finding a fix that can work for 1.9.1 and update you as I make progress.
My current understanding is that the Mavlink class isn't handling the error case of not succeeding in creating a mavlink stream and increasing the _instance_count value. If this issue happens on the last stream to be started it doesn't matter, but if the failure isn't the last stream to be started it fails silently and doesn't set the _mavlink_channel.
We might want to add logic to check the count value here and do something about it when it fails::
if (ic == Mavlink::instance_count() ||
count == limit) {
Hm, so you're saying it's the ordering. And where would it set the _mavlink_channel? I think we should fix this locking mechanism properly, it seems very ugly to me.
And for a 1.9 backport fix, I wonder if we can just revert https://github.com/PX4/Firmware/commit/e12acc4b282d26ae1b8c1e1cb4220b53ef54b097?
@julianoes @mcsauder Do we believe this is not OS related, or do you need me to look into it?
@julianoes , the easier backport is to move the rc.mavlink call down below the rc.board_extras call so that the USB mavlink stream is the last stream to be started. I can submit that PR and create an issue to flag the mavlink problem.
@davids5 , I don't think it is an issue with the OS.
This is what the mavlink stream should look like when the ttyACM0 stream is started last and functions:
instance #2:
mavlink chan: #2
type: USB CDC
flow control: OFF
rates:
tx: 0.000 kB/s
txerr: 0.556 kB/s
tx rate mult: 0.050
tx rate max: 800000 B/s
rx: 0.000 kB/s
FTP enabled: YES, TX enabled: YES
mode: Config
MAVLink version: 1
transport protocol: serial (/dev/ttyACM0 @2000000)
And this is what it looks like if it is not started last and the USB port is plugged in after the startup script has completed:
instance #2:
mavlink chan: #0
type: USB CDC
flow control: OFF
rates:
tx: 0.000 kB/s
txerr: 0.000 kB/s
tx rate mult: 1.000
tx rate max: 800000 B/s
rx: 0.000 kB/s
FTP enabled: YES, TX enabled: YES
mode: Config
MAVLink version: 1
transport protocol: serial (/dev/ttyACM0 @2000000)
Notice that the channel number is wrong corresponding to an incorrect _instance_id.
The proper fix is with the Mavlink Class. Give me just a bit of time to see if I can remedy the issue there before I submit a change to the startup script order.
If the USB device is unplugged, Mavlink::open_uart_port() gets stuck here in an infinite while loop trying to open it until something gets plugged in.
@julianoes and @dagar , because the USB mavlink stream is a config stream, and because the infinite loop was designed this way intentionally to never give up, is the correct solution really just to make sure this config mavlink stream always gets called last in the rcS script?
An interesting note is that at present and for the solution I inquired about above, for every autopilot running without a USB cable attached to ttyACM0, every 100ms we are trying to open the port for the entire time the autopilot is powered on and running.
@mcsauder
We could use
int errcode = errno;
/* ENOTCONN means that the USB device is not yet connected */
usleep(errcode == ENOTCONN ? 1000 000 : 100000 );
For disconnect we can use POLLHUP
This issue is reproducible on fmu-v5 and v1.9.2. To reproduce, flash v1.9.2 on a system with power supplied other than over USB. Boot with USB cable connected and with QGC running, QGC will successfully communicate. Next, unplug the USB cable without removing external power, the system will reboot, then plug in the USB cable. QGC will no longer be able to communicate with the system.
@sfalexrog, PR #12428 corrects this issue in all of the tests I have conducted. If you'd like to test this patch applied to v1.9.2, you can checkout, build, and flash my branch here.
Closing via #12428 . If the problem persists for any reason feel free to re-open this issue!