https://github.com/ARMmbed/mbed-os/pull/10818 highlighted a failure of the wait_ns test when it is compiled with ARMC6 and run on NRF52_DK.
On my setup (a Mac), the test doesn't fail out of the box; it fails when compiled with the following flags, which replicate the ones used in the CI:
-DMBED_HEAP_STATS_ENABLED=1 -DMBED_STACK_STATS_ENABLED=1 -DMBED_TRAP_ERRORS_ENABLED=1 -DMBED_CPU_STATS_ENABLED -DMBED_THREAD_STATS_ENABLED -DMBED_SYS_STATS_ENABLED -DMBED_ALL_STATS_ENABLED
On @LDong-Arm's setup (a Linux installation), it fails out of the box, without adding extra flags to mbed test -n mbed-os-tests-mbed_platform-wait_ns. On both setups, ARM Compiler 6 version 11 was used. (Note: version 12 fails as well.)
When the duration tested is changed - I've tried from 20 ms to 5 s - the drift remains constant: 1.64% of the expected duration. The test fails with the same ratio whether the lp ticker or the us ticker is used to track the time.
The test results also seem highly dependent on compilation options and on what is put in the binary:
The test passes when -O0 or -O2 is used; it fails only with -Os.
Replacing
Case cases[] = {
#if DEVICE_LPTICKER
Case("Test: wait_ns - compare with lp_timer 1s", test_wait_ns_time_measurement<1000, LowPowerTimer>),
#endif
Case("Test: wait_ns - compare with us_timer 1s", test_wait_ns_time_measurement<1000, Timer>)
};
with
Case cases[] = {
#if DEVICE_LPTICKER
Case("Test: wait_ns - compare with lp_timer 1s", test_wait_ns_time_measurement<1000, LowPowerTimer>),
#endif
Case("Test: wait_ns - compare with us_timer 1s", test_wait_ns_time_measurement<1000, Timer>),
#if DEVICE_LPTICKER
Case("Test: wait_ns - compare with lp_timer 1s", test_wait_ns_time_measurement<1000, LowPowerTimer>),
#endif
Case("Test: wait_ns - compare with us_timer 1s", test_wait_ns_time_measurement<1000, Timer>)
};
in the wait_ns test makes the test pass.
For reference, the delay loop is:
void delay_loop(uint32_t count) {
while (--count) {
__NOP();
__NOP();
}
}
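For context, a rough sketch of what the test does, inferred only from the test names above and the log output later in this thread; the actual test_wait_ns_time_measurement source may differ, and the assertion shown is illustrative:

#include "mbed.h"
#include "unity.h"

/* Hypothetical reconstruction: time a wait_ns() call against a Timer or
 * LowPowerTimer and check the measured/requested ratio. */
template <uint32_t delay_ms, class TimerType>
void test_wait_ns_time_measurement()
{
    TimerType timer;
    timer.start();
    wait_ns(delay_ms * 1000000);   /* busy-wait the requested duration */
    timer.stop();

    /* The log later in the thread shows an upper bound of 1.2x for the 1 s
     * case; the exact bounds used by the real test are not shown here. */
    float ratio = timer.read() / (delay_ms / 1000.0f);
    TEST_ASSERT(ratio <= 1.2f);
}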
[ ] Question
[ ] Enhancement
[x] Bug
@kjbracey-arm Do you have any idea of what may be going on here?
Internal Jira reference: https://jira.arm.com/browse/MBOCUSTRIA-1293
Fascinating. If you can reproduce this in a build you can debug, or pull apart with fromelf, can you see what exactly is being generated in wait_ns and its call to delay_loop_code?
I note that delay_loop has been cast to void(*)(), not void(*)(uint32_t), which is fishy, but should be legal as C99 and C11.
Try changing that? If it was being compiled as C++, I can imagine the compiler deciding it had no parameters and didn't need to calculate+pass count.
I also know I've been meaning to change the | 1 to + 1 in the delay_loop definition, as that can be done at link time via relocation; the | 1 ends up happening at run time. Try that, although I don't know how it would help here.
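A minimal sketch of the two ways of forming the entry address, assuming the structure discussed here; the actual code in mbed_wait_api_no_rtos.c may differ:

#include <stdint.h>

/* delay_loop_code is the hand-assembled Thumb routine referenced above;
 * only a declaration is needed for this sketch. */
extern const uint16_t delay_loop_code[];

/* Casting to the full signature, void (*)(uint32_t), tells the compiler the
 * count argument must be computed and passed in R0. */
typedef void (*delay_loop_fn)(uint32_t);

static void run_delay_loop(uint32_t count)
{
    /* "| 1" sets the Thumb bit but is not a plain symbol-plus-offset, so it
     * ends up as a run-time ORR #1 after the relocation (visible in the
     * disassembly later in this thread). */
    delay_loop_fn loop = (delay_loop_fn)((uintptr_t)delay_loop_code | 1);

    /* The "+ 1" alternative is a symbol-plus-offset that the linker can fold
     * into the address constant via relocation, with no extra instruction:
     * loop = (delay_loop_fn)((uintptr_t)delay_loop_code + 1); */

    loop(count);
}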
I am compiling with ARMC 6.11 on a Windows machine (not sure why it would matter, but there seems to be a difference between macOS and Linux mentioned above). For me the failure doesn't occur in the develop build profile, even after I add the compilation flags mentioned in the description. However, the test did fail for me after I added -O1 to the otherwise unmodified debug.json:
mbed test --compile -t ARM -m NRF52_DK -n tests-mbed_platform-wait_ns --profile debug -v
The failure mode is the following:
[1561121102.29][CONN][INF] found KV pair in stream: {{__testcase_start;Test: wait_ns - compare with lp_timer 1s}}, queued...
mbedgt: :80::FAIL: Expected 1.200000 Was 1.656494
[1561121103.98][CONN][RXD] :80::FAIL: Expected 1.200000 Was 1.656494
When I go back to -Os the test is passing again.
The exact number I get varies by +/- 0.01
There is some issue causing the ARM compiler to miss source code changes from time to time, but in general @kjbracey-arm's proposal to replace the "|" with "+" in the delay_loop code indeed fixes the problem for me.
With "|" the test is failing.
With "+" the test is passing.
If I switch back and forth between the two, I get repeatable results.
I have not come to an exact explanation, so it is hard for me to judge whether it's a bug in the compiler (the other compilers seem to handle it just fine) or some undefined corner case that is better avoided...
@michalpasztamobica Thanks for the input; I will try the delay_loop definition with + 1 on Mac and Linux on Monday.
Could be a compiler bug. | 1 is a very odd thing to do to a function pointer, so could tickle a particular optimisation failure that nobody else has spotted.
Still want to see the assembly output here - how on earth does that 1.65 factor get reliably produced?
Unfortunately, I continue to see the same error with + 1; I also tried with a cast to void(*)(uint32_t) and the issue persists.
Here's some assembly:
delay_loop_code:
0000b8d8: subs r0, r0, #1
0000b8da: nop
0000b8dc: nop
0000b8de: bcs.n 0xb8d8 <delay_loop_code>
0000b8e0: bx lr
0000b8e2: movs r0, r0
wait_ns:
0000ab16: movw r1, #564 ; 0x234
0000ab1a: movw r2, #56963 ; 0xde83
0000ab1e: movt r1, #8192 ; 0x2000
0000ab22: movt r2, #17179 ; 0x431b
0000ab26: ldr r1, [r1, #0]
0000ab28: umull r1, r2, r1, r2
0000ab2c: lsrs r1, r2, #18
0000ab2e: muls r0, r1
0000ab30: movw r1, #5977 ; 0x1759
0000ab34: movt r1, #53687 ; 0xd1b7
0000ab38: umull r0, r1, r0, r1
0000ab3c: lsrs r0, r1, #12
0000ab3e: movw r1, #47320 ; 0xb8d8
0000ab42: movt r1, #0
0000ab46: orr.w r1, r1, #1
0000ab4a: bx r1
delay_loop_code:
0000d5a0: subs r0, r0, #1
0000d5a2: nop
0000d5a4: nop
0000d5a6: bcs.n 0xd5a0 <delay_loop_code>
0000d5a8: bx lr
wait_ns:
00003860: ldr r3, [pc, #24] ; (0x387c <wait_ns+28>)
00003862: ldr r2, [pc, #28] ; (0x3880 <wait_ns+32>)
00003864: ldr r3, [r3, #0]
00003866: udiv r3, r3, r2
0000386a: muls r0, r3
0000386c: movw r3, #5000 ; 0x1388
00003870: udiv r0, r0, r3
00003874: ldr r3, [pc, #12] ; (0x3884 <wait_ns+36>)
00003876: orr.w r3, r3, #1
0000387a: bx r3
// note: literal pool constants start here, but are displayed as if they were instructions
0000387c: lsls r4, r0, #16
0000387e: movs r0, #0
00003880: negs r0, r0
00003882: movs r7, r1
00003884: bpl.n 0x37c8 <sleep_manager_sleep_auto+80> // note: address of delay_loop_code
00003886: movs r0, r0
The ARMC6 output seems correct to me. My thinking may have got slightly off-course originally because of the original report, which said "1.65%" and had me thinking it was miles out, hence suspecting a compiler bug.
But if the factor is actually 1.65, then it could just be another alignment/zero-wait-state issue.
Maybe the loop is sometimes taking 8 cycles instead of 5 (8/5 = 1.6, close to the observed ratio), but depending on what - alignment? Can we see a comparison of working and failing builds from @michalpasztamobica?
Putting the thing in the data array was an attempt to force alignment to reduce variability, but some platforms still may be iffy. More alignment? #10683 gives you the option to give up on running from flash, and run from RAM instead like #10741, if your RAM is faster, but that has a higher start-up overhead to turn off RAM execution protection :/
Your intuition was good; I've just tried the test with the instruction cache enabled and it passes.
Yep, this test has shaken out a couple of platforms with their core instruction or flash controller caches disabled...
The assembly listings highlight why we shouldn't be doing the | 1. The compiler and linker are arranging to find the address of delay_loop_code between them, with the compiler ensuring it's something the linker can relocate. But then the compiler has to apply the ORR #1 at run time to the result of the relocation.
Whereas +1 is a simple offset - the compiler+linker know how to make an address constant of delay_loop_code + 1 without any extra code having to be generated.
(Mind you, if this was ARM, we could use ORR PC, R1, #1, but for some unaccountable reason they didn't think that was an important instruction, so it's not permitted in Thumb...)
The assembly listings also highlight the cost of SystemCoreClock being a variable not a constant - would save a lot of start-up calculation time if it was known up-front.
Guess it's not too bad in those cases, with the UMULL and/or UDIV, but ARMv6-M or ARMv8-M baseline are going to be much worse, needing a library division.
Could benefit from a similar compile-time optimisation to #10609
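Roughly, the wait_ns listings above correspond to C of this shape; this is a reconstruction from the disassembly for illustration only, with names and the exact arithmetic inferred rather than taken from the source:

#include <stdint.h>

extern uint32_t SystemCoreClock;            /* CMSIS clock variable, read at run time */
extern void (*const delay_loop)(uint32_t);  /* entry to delay_loop_code with the Thumb bit set */

void wait_ns_sketch(unsigned int ns)
{
    /* Run-time divide (or reciprocal multiply) because SystemCoreClock is
     * only known at run time - the cost noted above. */
    uint32_t cycles_per_us = SystemCoreClock / 1000000;

    /* 5000 = 5 cycles per loop iteration x 1000 ns per microsecond,
     * matching the divide-by-5000 visible in both listings. */
    uint32_t count = ns * cycles_per_us / 5000;

    delay_loop(count);  /* tail call into the hand-assembled loop */
}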
This also affected rc4, as expected. What is the resolution for this issue? Is it to use + 1 in https://github.com/ARMmbed/mbed-os/blob/9974899a6b1947f0ad4c729a3608b2d172d6eb28/platform/mbed_wait_api_no_rtos.c#L113 ?
PR #10609 will go into 5.14, but this affects 5.13.x as well.
Sounds like the resolution is to turn the instruction cache on. @pan-?
What remains unknown is why the behaviour is unpredictable between builds with it off - something alignment-related maybe. The | 1 effect was probably just another random alignment shift like changing build options.
I can confirm that enabling nrf52 cache helps.
For both the ARM and IAR compilers, with the debug and develop profiles, the test passed regardless of the delay_loop Thumb indication (| or +).
NRF_NVMC->ICACHECNF = 0x01; needs to be added to SystemInit() in system_nrf52.c.
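A minimal sketch of that addition; the placement within SystemInit() is approximate and only the register write itself comes from the comment above:

/* system_nrf52.c */
#include "nrf.h"   /* Nordic device header providing NRF_NVMC */

void SystemInit(void)
{
    /* ... existing errata workarounds and clock configuration ... */

    /* ICACHECNF bit 0 (CACHEEN) enables the nRF52 flash instruction cache. */
    NRF_NVMC->ICACHECNF = 0x01;
}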
And the other effect observed was that the slowdown with cache off occurs when the delay loop code is aligned at 16n+8. It's fast if aligned at 16n+0.
Maybe the loop should be 16-byte aligned, to avoid similar glitches with other platforms.
Worth trying!
Adding MBED_ALIGN(16) to static const uint16_t delay_loop_code[] also causes the test to pass, even with the cache off; a sketch of the change follows below.
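A rough sketch of that change; the array encodings are reconstructed from the mnemonics in the disassembly earlier in this thread, not copied from the source, so verify against mbed_wait_api_no_rtos.c:

#include <stdint.h>
#include "mbed_toolchain.h"   /* provides MBED_ALIGN */

/* 16-byte alignment keeps the loop at a 16n+0 address, avoiding the slow
 * 16n+8 placement observed above. */
MBED_ALIGN(16)
static const uint16_t delay_loop_code[] = {
    0x1E40, /* SUBS R0, R0, #1 */
    0xBF00, /* NOP */
    0xBF00, /* NOP */
    0xD2FB, /* BCS back to the SUBS */
    0x4770  /* BX LR */
};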
So I will create a PR with:
MBED_ALIGN(16) for delay_loop_code
NRF_NVMC->ICACHECNF = 0x01; in SystemInit() in system_nrf52.c
Is that OK?
Both those, yes, but they feel like separate PRs to me. Alignment is a minor patch, effectively a bug fix, and the other one is a target update that will have much wider performance effects than just this wait_ns. Submit them separately for separate review/discussion/application.
OK, so two separate PRs created:
cache https://github.com/ARMmbed/mbed-os/pull/10933
alignment https://github.com/ARMmbed/mbed-os/pull/10932
Closing as the fixes were provided. Please reopen if this is still an issue.