https://github.com/ARMmbed/mbed-os/pull/10818 highlighted a failure of the wait_ns test when it is compiled with ARMC6 and run on NRF52_DK.
On my setup (a Mac), the test doesn't fail out of the box; it fails when compiled with the following flags, which replicate the ones used in the CI:
-DMBED_HEAP_STATS_ENABLED=1 -DMBED_STACK_STATS_ENABLED=1 -DMBED_TRAP_ERRORS_ENABLED=1 -DMBED_CPU_STATS_ENABLED -DMBED_THREAD_STATS_ENABLED -DMBED_SYS_STATS_ENABLED -DMBED_ALL_STATS_ENABLED
On @LDong-Arm's setup (a Linux installation), it fails out of the box, without adding extra flags to mbed test -n mbed-os-tests-mbed_platform-wait_ns. On both setups, ARM Compiler 6 version 11 was used. (Note: version 12 fails as well.)
When the duration tested is changed - I've tried from 20 ms to 5 s - the drift remains constant: 1.64% of the expected duration. The test fails with the same ratio whether the lp ticker or the us ticker is used to track the time.
The test results also seem highly dependent on compilation options and on what is put in the binary:
The test passes when -O0 or -O2 is used; it fails only with -Os.
Replacing
Case cases[] = {
#if DEVICE_LPTICKER
Case("Test: wait_ns - compare with lp_timer 1s", test_wait_ns_time_measurement<1000, LowPowerTimer>),
#endif
Case("Test: wait_ns - compare with us_timer 1s", test_wait_ns_time_measurement<1000, Timer>)
};
with
Case cases[] = {
#if DEVICE_LPTICKER
Case("Test: wait_ns - compare with lp_timer 1s", test_wait_ns_time_measurement<1000, LowPowerTimer>),
#endif
Case("Test: wait_ns - compare with us_timer 1s", test_wait_ns_time_measurement<1000, Timer>),
#if DEVICE_LPTICKER
Case("Test: wait_ns - compare with lp_timer 1s", test_wait_ns_time_measurement<1000, LowPowerTimer>),
#endif
Case("Test: wait_ns - compare with us_timer 1s", test_wait_ns_time_measurement<1000, Timer>)
};
in the wait_ns test makes the test pass.
For reference, the delay loop is:
void delay_loop(uint32_t count) {
while (--count) {
__NOP();
__NOP();
}
}
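For context, a rough sketch of what the test does, inferred only from the test names above and the log output later in this thread; the actual test_wait_ns_time_measurement source may differ, and the assertion shown is illustrative:

#include "mbed.h"
#include "unity.h"

/* Hypothetical reconstruction: time a wait_ns() call against a Timer or
 * LowPowerTimer and check the measured/requested ratio. */
template <uint32_t delay_ms, class TimerType>
void test_wait_ns_time_measurement()
{
    TimerType timer;
    timer.start();
    wait_ns(delay_ms * 1000000);   /* busy-wait the requested duration */
    timer.stop();

    /* The log later in the thread shows an upper bound of 1.2x for the 1 s
     * case; the exact bounds used by the real test are not shown here. */
    float ratio = timer.read() / (delay_ms / 1000.0f);
    TEST_ASSERT(ratio <= 1.2f);
}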
[ ] Question
[ ] Enhancement
[x] Bug
@kjbracey-arm Do you have any idea of what may be going on here?
Internal Jira reference: https://jira.arm.com/browse/MBOCUSTRIA-1293
Fascinating. If you can reproduce this in a build you can debug, or pull apart with fromelf, can you see what exactly is being generated in wait_ns and its call to delay_loop_code?
I note that delay_loop has been cast to void(*)(), not void(*)(uint32_t), which is fishy, but should be legal as C99 and C11.
Try changing that? If it was being compiled as C++, I can imagine the compiler deciding it had no parameters and didn't need to calculate+pass count.
I also know I've been meaning to change the | 1 to + 1 in the delay_loop definition, as that can be done at link time via relocation; the | 1 ends up happening at run time. Try that, although I don't know how it would help here.
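A minimal sketch of the two ways of forming the entry address, assuming the structure discussed here; the actual code in mbed_wait_api_no_rtos.c may differ:

#include <stdint.h>

/* delay_loop_code is the hand-assembled Thumb routine referenced above;
 * only a declaration is needed for this sketch. */
extern const uint16_t delay_loop_code[];

/* Casting to the full signature, void (*)(uint32_t), tells the compiler the
 * count argument must be computed and passed in R0. */
typedef void (*delay_loop_fn)(uint32_t);

static void run_delay_loop(uint32_t count)
{
    /* "| 1" sets the Thumb bit but is not a plain symbol-plus-offset, so it
     * ends up as a run-time ORR #1 after the relocation (visible in the
     * disassembly later in this thread). */
    delay_loop_fn loop = (delay_loop_fn)((uintptr_t)delay_loop_code | 1);

    /* The "+ 1" alternative is a symbol-plus-offset that the linker can fold
     * into the address constant via relocation, with no extra instruction:
     * loop = (delay_loop_fn)((uintptr_t)delay_loop_code + 1); */

    loop(count);
}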
I am compiling with ARMC 6.11 on a Windows machine (not sure why it would matter, but there seems to be a difference between macOS and Linux mentioned above). For me the failure doesn't occur in the develop build profile, even after I add the compilation flags mentioned in the description. However, the test did fail for me after I added -O1 to the otherwise unmodified debug.json:
mbed test --compile -t ARM -m NRF52_DK -n tests-mbed_platform-wait_ns --profile debug -v
The failure mode is the following:
[1561121102.29][CONN][INF] found KV pair in stream: {{__testcase_start;Test: wait_ns - compare with lp_timer 1s}}, queued...
mbedgt: :80::FAIL: Expected 1.200000 Was 1.656494
[1561121103.98][CONN][RXD] :80::FAIL: Expected 1.200000 Was 1.656494
When I go back to -Os the test is passing again.
The exact number I get varies by +/- 0.01
There is some issue causing the ARM compiler to miss source code changes from time to time, but in general @kjbracey-arm's proposal to replace the "|" with "+" in the delay_loop code indeed fixes the problem for me.
With "|" the test is failing.
With "+" the test is passing.
If I switch back and forth between the two, I get repeatable results.
I have not come to an exact explanation, so it is hard for me to judge whether it's a bug in the compiler (the other compilers seem to handle it just fine) or some undefined corner case that is better avoided...
@michalpasztamobica Thanks for the input; I will try the delay_loop definition with + 1 on Mac and Linux on Monday.
Could be a compiler bug. | 1 is a very odd thing to do to a function pointer, so could tickle a particular optimisation failure that nobody else has spotted.
Still want to see the assembly output here - how on earth does that 1.65 factor get reliably produced?
Unfortunately, I continue to see the same error with + 1; I also tried with a cast to void(*)(uint32_t) and the issue persists.
Here's some assembly:
delay_loop_code:
0000b8d8: subs r0, r0, #1
0000b8da: nop
0000b8dc: nop
0000b8de: bcs.n 0xb8d8 <delay_loop_code>
0000b8e0: bx lr
0000b8e2: movs r0, r0
wait_ns:
0000ab16: movw r1, #564 ; 0x234
0000ab1a: movw r2, #56963 ; 0xde83
0000ab1e: movt r1, #8192 ; 0x2000
0000ab22: movt r2, #17179 ; 0x431b
0000ab26: ldr r1, [r1, #0]
0000ab28: umull r1, r2, r1, r2
0000ab2c: lsrs r1, r2, #18
0000ab2e: muls r0, r1
0000ab30: movw r1, #5977 ; 0x1759
0000ab34: movt r1, #53687 ; 0xd1b7
0000ab38: umull r0, r1, r0, r1
0000ab3c: lsrs r0, r1, #12
0000ab3e: movw r1, #47320 ; 0xb8d8
0000ab42: movt r1, #0
0000ab46: orr.w r1, r1, #1
0000ab4a: bx r1
delay_loop_code:
0000d5a0: subs r0, r0, #1
0000d5a2: nop
0000d5a4: nop
0000d5a6: bcs.n 0xd5a0 <delay_loop_code>
0000d5a8: bx lr
wait_ns:
00003860: ldr r3, [pc, #24] ; (0x387c <wait_ns+28>)
00003862: ldr r2, [pc, #28] ; (0x3880 <wait_ns+32>)
00003864: ldr r3, [r3, #0]
00003866: udiv r3, r3, r2
0000386a: muls r0, r3
0000386c: movw r3, #5000 ; 0x1388
00003870: udiv r0, r0, r3
00003874: ldr r3, [pc, #12] ; (0x3884 <wait_ns+36>)
00003876: orr.w r3, r3, #1
0000387a: bx r3
// note: literal pool constants start here, but are displayed as if they were instructions
0000387c: lsls r4, r0, #16
0000387e: movs r0, #0
00003880: negs r0, r0
00003882: movs r7, r1
00003884: bpl.n 0x37c8 <sleep_manager_sleep_auto+80> // note: address of delay_loop_code
00003886: movs r0, r0
The ARMC6 output seems correct to me. My thinking may have got slightly off-course originally because of the original report, which said "1.65%" and had me thinking it was miles out, hence suspecting a compiler bug.
But if the factor is actually 1.65, then it could just be another alignment/zero-wait-state issue.
Maybe the loop is sometimes taking 8 cycles instead of 5 (8/5 = 1.6, close to the observed ratio), but depending on what - alignment? Can we see a comparison of working and failing builds from @michalpasztamobica?
Putting the thing in the data array was an attempt to force alignment to reduce variability, but some platforms still may be iffy. More alignment? #10683 gives you the option to give up on running from flash, and run from RAM instead like #10741, if your RAM is faster, but that has a higher start-up overhead to turn off RAM execution protection :/
Your intuition was good; I've just tried the test with the instruction cache enabled and it passes.
Yep, this test has shaken out a couple of platforms with their core instruction or flash controller caches disabled...
The assembly listings highlight why we shouldn't be doing the | 1. The compiler and linker are arranging to find the address of delay_loop_code between them, with the compiler ensuring it's something the linker can relocate. But then the compiler has to apply the ORR #1 at run time to the result of the relocation.
Whereas +1 is a simple offset - the compiler+linker know how to make an address constant of delay_loop_code + 1 without any extra code having to be generated.
(Mind you, if this was ARM, we could use ORR PC, R1, #1, but for some unaccountable reason they didn't think that was an important instruction, so it's not permitted in Thumb...)
The assembly listings also highlight the cost of SystemCoreClock being a variable not a constant - would save a lot of start-up calculation time if it was known up-front.
Guess it's not too bad in those cases, with the UMULL and/or UDIV, but ARMv6-M or ARMv8-M baseline are going to be much worse, needing a library division.
Could benefit from a similar compile-time optimisation to #10609
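Roughly, the wait_ns listings above correspond to C of this shape; this is a reconstruction from the disassembly for illustration only, with names and the exact arithmetic inferred rather than taken from the source:

#include <stdint.h>

extern uint32_t SystemCoreClock;            /* CMSIS clock variable, read at run time */
extern void (*const delay_loop)(uint32_t);  /* entry to delay_loop_code with the Thumb bit set */

void wait_ns_sketch(unsigned int ns)
{
    /* Run-time divide (or reciprocal multiply) because SystemCoreClock is
     * only known at run time - the cost noted above. */
    uint32_t cycles_per_us = SystemCoreClock / 1000000;

    /* 5000 = 5 cycles per loop iteration x 1000 ns per microsecond,
     * matching the divide-by-5000 visible in both listings. */
    uint32_t count = ns * cycles_per_us / 5000;

    delay_loop(count);  /* tail call into the hand-assembled loop */
}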
This also affected rc4, as expected. What is the resolution for this issue? Is it to use + 1 in https://github.com/ARMmbed/mbed-os/blob/9974899a6b1947f0ad4c729a3608b2d172d6eb28/platform/mbed_wait_api_no_rtos.c#L113 ?
PR #10609 will go into 5.14, but this affects 5.13.x as well.
Sounds like the resolution is to turn the instruction cache on. @pan-?
What remains unknown is why the behaviour is unpredictable between builds with it off - something alignment-related maybe. The | 1 effect was probably just another random alignment shift like changing build options.
I can confirm that enabling nrf52 cache helps.
For both the ARM and IAR compilers, with the debug and develop profiles, the test passed regardless of the delay_loop Thumb indication (| or +).
NRF_NVMC->ICACHECNF = 0x01; needs to be added to SystemInit() in system_nrf52.c.
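A minimal sketch of that addition; the placement within SystemInit() is approximate and only the register write itself comes from the comment above:

/* system_nrf52.c */
#include "nrf.h"   /* Nordic device header providing NRF_NVMC */

void SystemInit(void)
{
    /* ... existing errata workarounds and clock configuration ... */

    /* ICACHECNF bit 0 (CACHEEN) enables the nRF52 flash instruction cache. */
    NRF_NVMC->ICACHECNF = 0x01;
}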
And the other effect observed was that the slowdown with cache off occurs when the delay loop code is aligned at 16n+8. It's fast if aligned at 16n+0.
Maybe the loop should be 16-byte aligned, to avoid similar glitches with other platforms.
Worth trying!
Adding MBED_ALIGN(16) to static const uint16_t delay_loop_code[] also causes the test to pass, even with the cache off; a sketch of the change follows below.
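A rough sketch of that change; the array encodings are reconstructed from the mnemonics in the disassembly earlier in this thread, not copied from the source, so verify against mbed_wait_api_no_rtos.c:

#include <stdint.h>
#include "mbed_toolchain.h"   /* provides MBED_ALIGN */

/* 16-byte alignment keeps the loop at a 16n+0 address, avoiding the slow
 * 16n+8 placement observed above. */
MBED_ALIGN(16)
static const uint16_t delay_loop_code[] = {
    0x1E40, /* SUBS R0, R0, #1 */
    0xBF00, /* NOP */
    0xBF00, /* NOP */
    0xD2FB, /* BCS back to the SUBS */
    0x4770  /* BX LR */
};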
So I will create a PR with:
MBED_ALIGN(16) for delay_loop_code
NRF_NVMC->ICACHECNF = 0x01; in SystemInit() in system_nrf52.c
Is that OK?
Both those, yes, but they feel like separate PRs to me. Alignment is a minor patch, effectively a bug fix, and the other one is a target update that will have much wider performance effects than just this wait_ns. Submit them separately for separate review/discussion/application.
OK, so two separate PRs created:
cache https://github.com/ARMmbed/mbed-os/pull/10933
alignment https://github.com/ARMmbed/mbed-os/pull/10932
Closing as the fixes were provided. Please reopen if this is still an issue.