Describe the bug
Inconsistent and unexpected simulation performance issues when running SITL jMAVsim on Windows that end up in messages like this and very small to huge visible twitches in the simulation result. It can even lead to loss of global position and a crash of the simulated vehicle.
ERROR [sensors] Accel #0 fail: TIMEOUT!
ERROR [sensors] Sensor Accel #0 failed. Reconfiguring sensor priorities.
WARN [sensors] Remaining sensors after failover event 0: Accel #0 priority: 1
WARN [sensors] Remaining sensors after failover event 0: Accel #1 priority: 1
ERROR [sensors] Gyro #0 fail: TIMEOUT!
ERROR [sensors] Sensor Gyro #0 failed. Reconfiguring sensor priorities.
WARN [sensors] Remaining sensors after failover event 0: Gyro #0 priority: 1
WARN [ekf2] accel id changed, resetting IMU bias
WARN [ekf2] accel id changed, resetting IMU bias
ERROR [sensors] Accel #0 fail: TIMEOUT!
ERROR [sensors] Sensor Accel #0 failed. Reconfiguring sensor priorities.
WARN [sensors] Remaining sensors after failover event 0: Accel #0 priority: 1
WARN [sensors] Remaining sensors after failover event 0: Accel #1 priority: 1
ERROR [sensors] Gyro #0 fail: TIMEOUT!
ERROR [sensors] Sensor Gyro #0 failed. Reconfiguring sensor priorities.
WARN [sensors] Remaining sensors after failover event 0: Gyro #0 priority: 1
WARN [ekf2] accel id changed, resetting IMU bias
To Reproduce
Steps to reproduce the behavior:
make posix jmavsim on latest masterExpected behavior
Simulation should work just fine on a modern computer that isn't overloaded. It also worked fine before, I'm not sure but I think on 1.8 I didn't have any obvious problem not even on a slower laptop.
Log Files and Screenshots
I can provide a simulation log if that helps.
Add screenshots to help explain your problem.
Additional context
If I look closely at the description of this issue it looks very similar: https://github.com/PX4/Firmware/issues/10098
Machine is fast enough and not overloaded and it doesn't always happen but regularly/consistently.
I'd suspect threading priority problems with Windows if none of this happens on any other platform but there might be a general problem that just shows off on Windows. While I ported SITL to Cygwin I had to adjust some priority numbers. https://github.com/PX4/Firmware/pull/8407/files#diff-b3ac3b41912e12a40f44757b561a0e4dR216 could be related.
Another general question is why should the simulation not work fine just slower if the physics simulation cannot run in real time. I think this was discussed before.
What do you see in task manager while this is happening? CPU, memory, and network utilization for both PX4 and jmavsim.
What do you see in uorb top?
@julianoes Will your SITL changes likely help with this issue?
@julianoes Will your SITL changes likely help with this issue?
Not in principle but I'm now debugging issues like this hoping to find out why uorb drops messages.
@dagar Thanks for the hints. CPU usage in task manager looks fine with PX4 taking 0-0.2% and jmavsim 2-5%. Here's an uorb top output snapshot:
update: 1s, num topics: 83
TOPIC NAME INST #SUB #MSG #LOST #QSIZE
actuator_controls_0 0 7 171 287 1
battery_status 0 6 79 206 1
ekf2_innovations 0 1 114 0 1
ekf2_timestamps 0 1 169 0 1
ekf_gps_drift 0 1 9 0 1
estimator_status 0 4 114 392 1
geofence_result 0 1 4 0 1
rate_ctrl_status 0 1 171 0 1
sensor_accel 0 2 174 133 1
sensor_accel 1 1 250 86 1
sensor_baro 0 1 99 16 1
sensor_bias 0 4 114 221 1
sensor_combined 0 4 169 236 1
sensor_gyro 0 3 174 136 1
sensor_mag 0 2 99 71 1
sensor_preflight 0 1 169 0 1
vehicle_air_data 0 5 83 91 1
vehicle_attitude 0 9 169 334 1
vehicle_attitude_groundtruth 0 1 140 0 1
vehicle_global_position 0 5 114 271 1
vehicle_global_position_groundtruth 0 1 140 0 1
vehicle_gps_position 0 5 9 8 1
vehicle_local_position 0 8 114 321 1
vehicle_local_position_groundtruth 0 1 140 0 1
vehicle_magnetometer 0 4 83 95 1
vehicle_rates_setpoint 0 4 152 0 1
Is there something striking?
Update: I talked to @Stifael who uses linux jmavsim and there seems to be a problem in simulation in general, it probably just hides better on linux. But it can always happen with a certain CPU load and timing jitter that your sensors are producing errors and especially that you loose global position and do not regain it until you restart simulation...
I have seen that too. Let's hope that it will be fixed with #10648.
I think I have the same issue.
Environment:
MacOS 10.14.1
Macbook Pro, i7
Branch:
tag/v1.8.2
Command:
make posix jmavsim
Output:
ERROR [sensors] Accel #0 fail: TIMEOUT!
ERROR [sensors] Sensor Accel #0 failed. Reconfiguring sensor priorities.
WARN [sensors] Remaining sensors after failover event 0: Accel #0 priority: 1
WARN [sensors] Remaining sensors after failover event 0: Accel #1 priority: 1
ERROR [sensors] Gyro #0 fail: TIMEOUT!
ERROR [sensors] Sensor Gyro #0 failed. Reconfiguring sensor priorities.
WARN [sensors] Remaining sensors after failover event 0: Gyro #0 priority: 1
Performance:
JarSrcLoader takes about 40% of a core
px4 takes another 20%
Behavior:
Vehicle tends to crash into the ground randomly, although I'm not sure this is related as I am not familiar with the sim and I may be sending bad commands.
edit: the issue is gone in the same environment with latest master.
Just as an update. Since the lockstep was introduced the result of this problem is the whole simulation stops. At least you can assume now that before it stops nothing unexpected like a simulated drone crash happens. I'm not sure why it should stop but a first guess would be that a packet and hence the "lockstep token" gets lost e.g. in the simulator and there's a deadlock.
Fun :sob:, so when does it stop? Straight away or after a while of flying? And does it always happen consistently?
It always happens but not consistently after the same amount of time. I just compared to linux, there I can easily let jmavsim fly for an hour without problem but on windows it happens quite often that the simulation completely locks before I did my tests which usually take about a minute or two. It also happens when I don't fly at all: I just start sitl jmavsim and do something else when I come back I see the symptom "global position loss" all the time.
In my tests after #11177 this issue is gone. Thanks to @bkueng!
For more details read: https://github.com/PX4/Firmware/pull/11177#issuecomment-453973931