Inaccurate timeouts since d1ae0d570ceac567849881dd75639c57e93de05a was merged. This affects the Wi-SUN frequency hopping timer and potentially other Nanostack timers too.
Measured with an oscilloscope: a configured periodic timeout of 255 milliseconds initially works properly, but after a couple of minutes it starts growing, and the error can eventually reach several hundred milliseconds. For example, 30 minutes after the test started, the 255 ms periodic timeout was actually jumping somewhere between 400 and 500 milliseconds.
It seems that at least K64F, K66F and Disco_F769NI don't work properly.
For some reason, this issue doesn't affect Nucleo_F429ZI.
GCC for Arm (gcc-arm-none-eabi-9-2019-q4-major)
d1ae0d570ceac567849881dd75639c57e93de05a
mbed-cli 1.2.2
Build a simple Wi-SUN network consisting of a Border router (nanostack-border-router) and a Router (mbed-os-example-mesh-minimal). mbed-mesh-test-application could also be used to configure a device as BR or Router.
mbed-mesh-test-application commands for Border router:
ifconfig --extension Wi-Sun
ifconfig mesh0 --mode brouter
ifup
mbed-mesh-test-application commands for Router:
ifconfig --extension Wi-Sun
ifup
Wait until Router joins the network. This issue should prevent it from joining.
PR that includes the sha referenced above: https://github.com/ARMmbed/mbed-os/pull/12425 (Chrono update)
For some reason, this issue doesn't affect Nucleo_F429ZI.
One question: if this is the only target that works and it's tickless, how do the others differ? Is this related to deep sleep locking or not?
cc @kjbracey-arm @ARMmbed/mbed-os-core
Nucleo 429ZI is the only non-tickless target in the list, from my reading of targets.json.
To double-check what you've previously told me - the measurement is coming purely from GPIO instrumentation of the point of call to Timeout::attach_us, and the routine called by Timeout::attach_us, yes?
Which would mean no possible factors from RTOS/event queue timing calculations or scheduling, so it has to be from the usticker-based Timeout calculation, or general system IRQ load/locking or wakeup problems?
Yes, the GPIO is toggled in the callback from the FHSS timer driver: https://github.com/ARMmbed/mbed-os/blob/master/features/nanostack/nanostack-hal-mbed-cmsis-rtos/arm_hal_fhss_timer.cpp#L99
Every callback immediately starts a new timeout. No events are used there.
I did look at the FHSS while doing the Chrono work - PR #12903 adapts it to use new APIs, including a Timeout::remaining_time() call I added specifically for it to remove the need for a separate Timer and local start_time and stop_time variables.
Because you start a Timer and leave it always running, that effectively locks deep sleep forever, or should. The new version does that explicitly. I'd like you to double-check the current version by putting in the explicit lock call there as well as the Timer start, to make absolutely sure.
I'm struggling to think of a mechanism that can make you a massive 100ms late aside from deep sleep wakeup problems. That's not interrupt timescale, that's very bad event loop timescale.
Ultimately, to avoid any sort of drift problems accumulating, this stuff could be using absolute time. All the framework is now in place - it could be platform_fhss_timer_start_absolute(abs_time) -> Timeout::attach_absolute.
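To illustrate the drift argument above, here is a small host-side model in plain C++ (not Mbed OS code) comparing a timer that is re-armed relative to "now" with one that follows an absolute schedule. The function names and the per-callback latency vector are hypothetical, and the model assumes each latency is shorter than the period. It only shows how small latencies can accumulate; it is not an explanation of the several-hundred-millisecond regression reported here.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

using std::chrono::microseconds;

// Relative re-arm: every callback schedules "now + period", so the latency
// that delayed the callback is baked into all later deadlines.
inline microseconds last_fire_relative(microseconds period,
                                       const std::vector<microseconds> &latency)
{
    assert(period.count() > 0);
    microseconds now{0};
    for (microseconds l : latency) {
        now += period + l;  // expires `period` after re-arm, callback runs `l` late
    }
    return now;
}

// Absolute re-arm: every deadline is "previous deadline + period", so a late
// callback delays only its own firing, not the schedule.
inline microseconds last_fire_absolute(microseconds period,
                                       const std::vector<microseconds> &latency)
{
    assert(period.count() > 0);
    microseconds deadline{0};
    microseconds fired{0};
    for (microseconds l : latency) {
        deadline += period;    // schedule is independent of callback latency
        fired = deadline + l;  // this firing is late, the next deadline is not
    }
    return fired;
}
```

With a 255 ms period and a hypothetical 100 us latency on each of 1000 callbacks, the relative variant ends 100 ms behind the ideal schedule, while the absolute variant ends 100 us behind it.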
You've said MBED_TICKLESS off makes K64F pass.
Another thing to try is passing an empty function to Kernel::attach_idle_hook. That will make it do nothing when idle rather than trying to enter full sleep.
If that works, then try passing a function that just does __WFE() to that - the lightest possible sleep.
Both of those make the tickless-built system semi-tickless. Retains the infrastructure for tickless, but keeps the ticker always running, rather than suspending the OS. Will narrow it down.
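A minimal sketch of the two idle hook experiments described above, assuming an RTOS (not bare-metal) Mbed OS build. rtos::Kernel::attach_idle_hook and __WFE() are the calls named in this thread; the enum, the function names and the host-side stand-in under #else are hypothetical and exist only so the sketch also compiles off-target.

```cpp
#include <cassert>

#if defined(__MBED__)
#include "mbed.h"
#else
// Host-side stand-in (NOT Mbed OS code): records the hook so the sketch
// compiles and can be checked off-target.
using idle_hook_t = void (*)(void);
static idle_hook_t g_host_idle_hook = nullptr;
namespace rtos {
namespace Kernel {
inline void attach_idle_hook(idle_hook_t fptr) { g_host_idle_hook = fptr; }
}
}
#endif

enum class IdleExperiment { DoNothing, WaitForEvent };

// Experiment 1: do nothing when idle, so the core never enters sleep.
static void idle_do_nothing() {}

// Experiment 2: the lightest possible sleep, a single wait-for-event.
static void idle_wait_for_event()
{
#if defined(__MBED__)
    __WFE();
#endif
}

// Installs one of the two experimental idle hooks in place of the default.
inline void select_idle_experiment(IdleExperiment which)
{
    assert(which == IdleExperiment::DoNothing ||
           which == IdleExperiment::WaitForEvent);
    rtos::Kernel::attach_idle_hook(which == IdleExperiment::DoNothing
                                       ? idle_do_nothing
                                       : idle_wait_for_event);
}
```

On target, select_idle_experiment(IdleExperiment::DoNothing) would be called once early in main(), before the FHSS timer is started, and the measurement repeated with IdleExperiment::WaitForEvent.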
I'm afraid that tickless mode may cause drift in timers, which is critical for FHSS. Why is tickless mode the default?
It's default because it saves power. At a latency cost. That latency cost should largely be dispelled by the fact you've done a Timer->start() or manual deep sleep lock in init though. That should stop deep sleep ever being entered at runtime, which does in turn make tickless somewhat pointless. Your systems would probably be better built with it off, if you will always have FHSS active. But if you ever stopped FHSS network operations, it would be different.
edit: going to take that back - there's still a benefit to tickless, even with deep sleep disabled. It stops you waking up from your shallow sleep every millisecond.
But there's clearly a massive regression here that needs to be investigated - those platforms have been tickless for a couple of years. Their performance shouldn't have gotten worse. And this isn't just "getting worse", it's going drastically wrong.
Software timer drift would not be an issue if absolute time was used - there's always a continuous monotonic timebase that can be used to trigger stuff on any strict schedule. But I guess it would still need adjustment for long-term hardware crystal drift.
@kjbracey-arm Actually K66F, but yes, turning off MBED_TICKLESS made it work.
The next test was to call sleep_manager_lock_deep_sleep in the FHSS timer driver init after timer->start(), but it didn't help.
Thanks - keep giving me info. Remaining requests are the idle hook suggestions, and a further Git bisection.
Still thinking through it, but not got anything yet.
OK, I tested the idle hook with an empty function and with a __WFE() call. Both seem to fix the timer issue.
Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers.
Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2656
Hi
From my side, I noticed that tests-mbed_drivers-lp_timeout has been failing on the test case 'Timing drift (attach)' since #12425 was merged (on several targets).
@jeromecoutant I'll browse the nightly tests now to check (how come we haven't seen it in PR testing? Will check)
how come we haven't seen it in PR testing
Maybe you have the SKIP_TIME_DRIFT_TESTS macro defined in CI...
I don't see them in the nightly results. @jeromecoutant can you add details on how to reproduce? It would be great to have an easy-to-reproduce test case.
I see, that could be it, but I would expect these time drift tests to run at least once in a while :/
@kjbracey-arm can you reproduce the TESTS/mbed_drivers/lp_timeout/main.cpp test failure locally?
Only have a K64F here - I'll look. @jeromecoutant - do you have a list of platforms you have/haven't seen fails on? Any patterns, eg TICKLESS?
@jeromecoutant - do you have a list of platforms you have/haven't seen fails on? Any patterns, eg TICKLESS?
Easy! All platforms, all tool chains
You mean all ST platforms, I assume? Not tested any others?
(If it turns out CI has not been testing timing stuff, I'm going to be a bit grumpy)