Px4-autopilot: Navigator running erratic or very slow

Created on 20 Nov 2019 · 23Comments · Source: PX4/PX4-Autopilot

Describe the bug
We had a flight where the vehicle_status.nav_state was set to NAVIGATION_STATE_AUTO_RTL for about 0.6 seconds and no message or triplet from navigator was published, as if it was not running.

In the logfile below there are times where RTL should have been engaged by navigator because the nav state was set:

279 s / 7.02 --> No reaction from navigator
280.5 s --> RTL actions were taken again
284.5 s --> RTL actions were taken again

To Reproduce
Not sure.

Expected behavior
RTL should have been initiated.

Log Files and Screenshots
https://logs.px4.io/plot_app?log=7c4e2956-6ab5-409f-a2ac-0b908652a1de

rtl times

Additional context
Navigator is polling on local_position and there are regular updates of this topic.

We need to investigate the effects of accessing the SD card in some of the navigator modes when not inactive. E.g. Mission and Rtl modes check mission items respective rally points even if not active! Also, we check for geofences set in every iteration.

FYI @MaEtUgR

bug

Source

julianoes

👀1

Most helpful comment

The safe points are tiny, let's just keep a full copy of them in RTL.

This is currently not even about safe points. This is about geofence points.

It's both. For RTL we can store all the safe points in a small array. In geofence (IIRC) the polygon is already stored, but dataman is still checked constantly for an update. This is the part we get rid of with something similar to ORB_ID(mission) (https://github.com/PX4/Firmware/blob/master/msg/mission.msg). The mission message could turn into a more general dataman message that includes info about all the dataman entries, or we could add separate messages for each type (safe points, fence, waypoints).

I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.

Dataman is already like a poor man's async. Caching the first entry (where safe points and geofence store the number of items) might be a quick way to eliminate a lot of the disk reads (similar to the orb idea).

dagar on 21 Nov 2019

👍2

All 23 comments

I just tried in SITL and did some rough measurements about what navigator is doing just sitting in manual:

no waypoints and no rally points set: ~ 22 dm_read per second.
no waypoints and 4 rally points set: ~ 94 dm_read per second.
10 waypoints and 4 rally points set: ~ 94 dm_read per second, 174 dm_read right after upload.
10 waypoints and 4 rally points, and 4 geofence points: ~ 130 dm_read per second.

FYI @sfuhrer. We'll have to look into this.

julianoes on 20 Nov 2019

RTL - constantly re-reading the safepoints when inactive

https://github.com/PX4/Firmware/blob/4ce03dfc1e77425f4ccab6c67f3d357e29127e18/src/modules/navigator/rtl.cpp#L61

One potential quick fix would be to add something to throttle RTL::find_closest_landing_point().

https://github.com/PX4/Firmware/blob/master/src/modules/navigator/rtl.cpp#L66

dagar on 20 Nov 2019

@julianoes yes we should look into this. Let's though first get in https://github.com/PX4/Firmware/pull/13320 . It's mainly about enabling the rally points also in other than RTL_TYPE = 3 mode. Note that the flight with which we had the issue was actually already with this PR.

sfuhrer on 20 Nov 2019

👍1

@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?

sfuhrer on 20 Nov 2019

@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?

We don't really have a mechanism for that, but something similar to mission (https://github.com/PX4/Firmware/blob/master/msg/mission.msg) would probably make sense.

In this case it's a bit different though because it's rechecking based on current position.

dagar on 20 Nov 2019

not sure if this info helps: we also have large delays with a merge that is about 3/4 years behind current master. We have logs where there is more than a second delay between vehicle_status and triplet message.

Stifael on 21 Nov 2019

👀1 👍1

we also have large delays with a merge that is about 3/4 years behind current master.

@dagar this hints that it has nothing to do with the recent addition of rally points or even your bigger architectural changes recently. It sounds more like something off with how/when navigator runs.

julianoes on 21 Nov 2019

@Stifael when this happened, did you have a geofence set, and or a mission?

julianoes on 21 Nov 2019

The people in kunshan reproduced the error several times by:

load mission
Manual takeoff
Switch into mission

Not sure if geofence was active. Let me check

Stifael on 21 Nov 2019

Not sure if geofence was active. Let me check

Geofence was set to default, which is Warning.

Stifael on 21 Nov 2019

@Stifael right but were there geofence items set? This would cause SD card access.

julianoes on 21 Nov 2019

This is in HITL:

navigator: 134324 events, 231147587us elapsed, 1720.82us avg, min 14us max 1401415us 26976.525us rms

I read this as 1.4 seconds worst case.

julianoes on 21 Nov 2019

@Stifael right but were there geofence items set? This would cause SD card access.

How do I see that from the log?

Stifael on 21 Nov 2019

How do I see that from the log?

I don't think you do. But it doesn't matter. I can reproduce it without a geofence set. According to my profiling using perf_counters it seems like the dm_read() call is what takes so much time.

julianoes on 21 Nov 2019

@dagar @bkueng @davids5 I narrowed the call talking in the order of 1s to:

https://github.com/PX4/Firmware/blob/1371887578af6af9fbd3775104c505ce1fc69e92/src/modules/dataman/dataman.cpp#L412-L413

more specifically the actual read() call:
https://github.com/PX4/Firmware/blob/1371887578af6af9fbd3775104c505ce1fc69e92/src/modules/dataman/dataman.cpp#L639

I can reproduce it in HITL on a Pixracer flying a mission. It probably happens without flying a mission, the main point is that position is initialized and navigator runs.

Edit: I could not reproduce read times above about 5ms when logger is not running!

julianoes on 21 Nov 2019

The safe points are tiny, let's just keep a full copy of them in RTL.

Then let's add another orb message with full dataman metadata so we can limit updates.

After that I'd still like to see what the remaining dataman access pattern is like for typical missions. Some simple caching and prefetching in dataman might be interesting.

dagar on 21 Nov 2019

@bkueng just mentioned that this could be happening with cheaper SD cards, and indeed I can reproduce it with a cheap SanDisk EDGE 8GB class 4 (which presumably comes with Pixhawk 4) but I don't see reads above 11ms with a more expensive SanDisk Extreme PRO U3 card.

julianoes on 21 Nov 2019

The safe points are tiny, let's just keep a full copy of them in RTL.

This is currently not even about safe points. This is about geofence points.

I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.

julianoes on 21 Nov 2019

The safe points are tiny, let's just keep a full copy of them in RTL.

This is currently not even about safe points. This is about geofence points.

I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.

dagar on 21 Nov 2019

👍2

TODO: review dataman locking. We don't seem to be using it consistently.

dagar on 21 Nov 2019

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

stale[bot] on 20 Feb 2020

Regarding geofence. We performed some tests with 60-point poly fence. Read+calculations takes from 30ms to 200ms on different UAVs. Looks like it depends on SD card condition. But 200ms is pretty high value.

@dagar Do you have any plans to rework this logic?