Describe the bug
We had a flight where the vehicle_status.nav_state was set to NAVIGATION_STATE_AUTO_RTL for about 0.6 seconds and no message or triplet from navigator was published, as if it was not running.
In the logfile below there are times where RTL should have been engaged by navigator because the nav state was set:
To Reproduce
Not sure.
Expected behavior
RTL should have been initiated.
Log Files and Screenshots
https://logs.px4.io/plot_app?log=7c4e2956-6ab5-409f-a2ac-0b908652a1de

Additional context
Navigator is polling on local_position and there are regular updates of this topic.
We need to investigate the effects of accessing the SD card in some of the navigator modes when not inactive. E.g. Mission and Rtl modes check mission items respective rally points even if not active! Also, we check for geofences set in every iteration.
FYI @MaEtUgR
I just tried in SITL and did some rough measurements about what navigator is doing just sitting in manual:
dm_read per second.dm_read per second.dm_read per second, 174 dm_read right after upload.dm_read per second.FYI @sfuhrer. We'll have to look into this.
One potential quick fix would be to add something to throttle RTL::find_closest_landing_point().
https://github.com/PX4/Firmware/blob/master/src/modules/navigator/rtl.cpp#L66
@julianoes yes we should look into this. Let's though first get in https://github.com/PX4/Firmware/pull/13320 . It's mainly about enabling the rally points also in other than RTL_TYPE = 3 mode. Note that the flight with which we had the issue was actually already with this PR.
@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?
@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?
We don't really have a mechanism for that, but something similar to mission (https://github.com/PX4/Firmware/blob/master/msg/mission.msg) would probably make sense.
In this case it's a bit different though because it's rechecking based on current position.
not sure if this info helps: we also have large delays with a merge that is about 3/4 years behind current master. We have logs where there is more than a second delay between vehicle_status and triplet message.
we also have large delays with a merge that is about 3/4 years behind current master.
@dagar this hints that it has nothing to do with the recent addition of rally points or even your bigger architectural changes recently. It sounds more like something off with how/when navigator runs.
@Stifael when this happened, did you have a geofence set, and or a mission?
The people in kunshan reproduced the error several times by:
Not sure if geofence was active. Let me check
Not sure if geofence was active. Let me check
Geofence was set to default, which is Warning.
@Stifael right but were there geofence items set? This would cause SD card access.
This is in HITL:
navigator: 134324 events, 231147587us elapsed, 1720.82us avg, min 14us max 1401415us 26976.525us rms
I read this as 1.4 seconds worst case.
@Stifael right but were there geofence items set? This would cause SD card access.
How do I see that from the log?
How do I see that from the log?
I don't think you do. But it doesn't matter. I can reproduce it without a geofence set. According to my profiling using perf_counters it seems like the dm_read() call is what takes so much time.
@dagar @bkueng @davids5 I narrowed the call talking in the order of 1s to:
more specifically the actual read() call:
https://github.com/PX4/Firmware/blob/1371887578af6af9fbd3775104c505ce1fc69e92/src/modules/dataman/dataman.cpp#L639
I can reproduce it in HITL on a Pixracer flying a mission. It probably happens without flying a mission, the main point is that position is initialized and navigator runs.
Edit: I could not reproduce read times above about 5ms when logger is not running!
The safe points are tiny, let's just keep a full copy of them in RTL.
Then let's add another orb message with full dataman metadata so we can limit updates.
After that I'd still like to see what the remaining dataman access pattern is like for typical missions. Some simple caching and prefetching in dataman might be interesting.
@bkueng just mentioned that this could be happening with cheaper SD cards, and indeed I can reproduce it with a cheap SanDisk EDGE 8GB class 4 (which presumably comes with Pixhawk 4) but I don't see reads above 11ms with a more expensive SanDisk Extreme PRO U3 card.
The safe points are tiny, let's just keep a full copy of them in RTL.
This is currently not even about safe points. This is about geofence points.
I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.
The safe points are tiny, let's just keep a full copy of them in RTL.
This is currently not even about safe points. This is about geofence points.
It's both. For RTL we can store all the safe points in a small array. In geofence (IIRC) the polygon is already stored, but dataman is still checked constantly for an update. This is the part we get rid of with something similar to ORB_ID(mission) (https://github.com/PX4/Firmware/blob/master/msg/mission.msg). The mission message could turn into a more general dataman message that includes info about all the dataman entries, or we could add separate messages for each type (safe points, fence, waypoints).
I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.
Dataman is already like a poor man's async. Caching the first entry (where safe points and geofence store the number of items) might be a quick way to eliminate a lot of the disk reads (similar to the orb idea).
TODO: review dataman locking. We don't seem to be using it consistently.
This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.
Regarding geofence. We performed some tests with 60-point poly fence. Read+calculations takes from 30ms to 200ms on different UAVs. Looks like it depends on SD card condition. But 200ms is pretty high value.
@dagar Do you have any plans to rework this logic?
I have plans but no time. @dagar might know better.
Most helpful comment
It's both. For RTL we can store all the safe points in a small array. In geofence (IIRC) the polygon is already stored, but dataman is still checked constantly for an update. This is the part we get rid of with something similar to ORB_ID(mission) (https://github.com/PX4/Firmware/blob/master/msg/mission.msg). The mission message could turn into a more general dataman message that includes info about all the dataman entries, or we could add separate messages for each type (safe points, fence, waypoints).
Dataman is already like a poor man's async. Caching the first entry (where safe points and geofence store the number of items) might be a quick way to eliminate a lot of the disk reads (similar to the orb idea).