Greentea SysTimer wake up from deep sleep test fails on NUCLEO_L073RZ when linking with microlib. See more details in https://github.com/ARMmbed/mbed-os/pull/12014#issuecomment-562565672
NUCLEO_L073RZ
ARM toolchain with the C microlib.
To reproduce, you need https://github.com/ARMmbed/mbed-os/pull/12068 to build the greentea tests with microlib.
Create a test.json in TESTS\configs with the following:
{
    "target_overrides": {
        "*": {
            "target.default_lib": "small"
        }
    }
}
and run the test:
mbed test -m NUCLEO_L073RZ -t ARM -n *systimer --app-config TESTS/configs/test.json
Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2476
@ARMmbed/team-st-mcd can you please have a look at this?
I cherry-picked #12608 and tried to reproduce, but I only get a build failure:
$ mbed test -m NUCLEO_L073RZ -t ARM -n *systimer --app-config TESTS/configs/test.json -v
Failed to build library
There must be something missing from #12608 or another dependency...
I probably misled you in my earlier comment:
Add
{
    "target_overrides": {
        "*": {
            "target.default_lib": "small"
        }
    }
}
to https://github.com/ARMmbed/mbed-os/blob/master/TESTS/configs/baremetal.json
@evedon I still see the same.
Then when I launch the test, I just get the error message below:
$ mbed test -m NUCLEO_L073RZ -t ARM -n *systimer --app-config TESTS/configs/baremetal.json -v
Failed to build library
mbedgt: greentea test automation tool ver. 1.7.3
mbedgt: test specification file '.\BUILD\tests\NUCLEO_L073RZ\ARM\test_spec.json' (specified with --test-spec option)
mbedgt: using '.\BUILD\tests\NUCLEO_L073RZ\ARM\test_spec.json' from current directory!
("TestSpec::load('.\\BUILD\\tests\\NUCLEO_L073RZ\\ARM\\test_spec.json')", "[Errno 2] No such file or directory: '.\\\\BUILD\\\\tests\\\\NUCLEO_L073RZ\\\\ARM\\\\test_spec.json'")
mbedgt: detecting connected mbed-enabled devices...
mbedgt: detected 1 device
| platform_name | platform_name_unique | serial_port | mount_point | target_id |
|---------------|----------------------|-------------|-------------|--------------------------|
| NUCLEO_L073RZ | NUCLEO_L073RZ[0] | COM66 | D: | 07600221357F00130A3B964E |
mbedgt: shuffle seed: 0.7357669713
mbedgt: no platform/target matching tests were found!
mbedgt: no matching platforms were found!
mbedgt: completed in 0.13 sec
[mbed] Working path "C:\Data\Workspace\mbed" (program)
[mbed] Exec "c:\python27\python.exe -u C:\Data\Workspace\mbed\tools\test.py -D MBED_TEST_MODE -t ARM -m NUCLEO_L073RZ --source . --build .\BUILD\tests\NUCLEO_L073RZ\ARM --test-spec .\BUILD\tests\NUCLEO_L073RZ\ARM\test_spec.json --build-data .\BUILD\tests\NUCLEO_L073RZ\ARM\build_data.json -n *systimer -v --app-config TESTS/configs/baremetal.json --greentea" in "C:\Data\Workspace\mbed"
The test works OK as long as I do not add the "target.default_lib": "small" config.
Is there some extra change needed for the NUCLEO_L073RZ target?
OK, I finally found out: targets.json also needs to be updated for NUCLEO_L073RZ with
"supported_c_libs": {
"arm": ["std"],
"gcc_arm": ["std", "small"],
"iar": ["std"]
}
Why not apply the above to all targets by default?
I agree it may not have been tested for all of them yet, but what would prevent a target from supporting one of the small configurations?
@evedon
Actually, compilation seems OK now, but the link fails with the error below:
[DEBUG] Errors: Error: L6218E: Undefined symbol __sync_fetch_and_add_4 (referred from C:\Keil_v5\ARM\ARMCLANG\bin\..\lib\libcxx\libcppabi_p.l(cxa_handlers.cpp.o)).
[DEBUG] Errors: Finished: 0 information, 3 warning and 1 error messages.
Which version of ARMC6 are you running?
We build successfully with ARM Compiler 6.11 and 6.13
$ /c/Keil_v5/ARM/ARMCLANG/bin/armclang.exe --version
Product: MDK Professional 5.26
Component: ARM Compiler 6.10.1
Tool: armclang [5d143200]
I upgraded.
$ /c/Keil_v5/ARM/ARMCLANG/bin/armclang.exe --version
Product: MDK-ARM Lite 5.29
Component: ARM Compiler 6.13.1
Tool: armclang [5d895d00]
EDIT: seems to be okay now - looks like the license manager took some time to run its checks...
@evedon
So I have reproduced the issue and I confirm that the deep sleep exit takes more time with microlib, but I don't know why ...
Below are a few measurements of the most time consuming part of the deep sleep exit (https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/sleep.c#L261)
ForceOscOutofDeepSleep();     // microlib 1 ms vs. standard ARM lib 0.640 ms
ForcePeriphOutofDeepSleep();  // microlib 1.39 ms vs. standard ARM lib 0.7 ms
SetSysClock();                // microlib 2.28 ms vs. standard ARM lib 1.21 ms
When executed in the microlib configuration, the same code seems to be about 2x slower.
And all this code is executed under a critical section...
Is there a specific compiler optimization that is turned OFF with microlib ?
@kjbracey-arm any other idea ?
from http://www.keil.com/arm/microlib.asp
_As MicroLib has been optimized to minimize code size, some functions will execute more slowly than the standard C library routines available in the ARM compilation tools._
Then there is a link to get more details, but that link is broken... Could someone from Arm help find out what those differences are, to get some clues about the major ones?
One more test result: by just adding the __MICROLIB compilation flag and the --library_type=microlib linker option to the release profile, I can reproduce the issue, i.e. the execution time gets longer.
Tested with the release.json modifications below:
diff --git a/tools/profiles/release.json b/tools/profiles/release.json
index 2aab24f6c8..384bbb7f31 100644
--- a/tools/profiles/release.json
+++ b/tools/profiles/release.json
@@ -20,12 +20,12 @@
"-Wno-reserved-user-defined-literal", "-Wno-deprecated-register",
"-DMULADDC_CANNOT_USE_R7", "-fdata-sections",
"-fno-exceptions", "-MMD", "-D_LIBCPP_EXTERN_TEMPLATE(...)=",
- "-fshort-enums", "-fshort-wchar", "-DNDEBUG"],
+ "-fshort-enums", "-fshort-wchar", "-DNDEBUG", "-D__MICROLIB"],
"asm": [],
"c": ["-D__ASSERT_MSG", "-std=gnu11"],
"cxx": ["-fno-rtti", "-std=gnu++14"],
"ld": ["--show_full_path", "--legacyalign", "--any_contingency",
- "--keep=os_cb_sections"]
+ "--keep=os_cb_sections", "--library_type=microlib"]
},
"ARM": {
"common": ["-c", "--gnu", "-Ospace", "--split_sections",
with the command line:
mbed test -m NUCLEO_L073RZ -t ARMC6 -n *systimer --profile=tools/profiles/release.json -v
Further investigation shows that the line below (the / 1000) is a big part of the problem.
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/hal_tick_overrides.c#L61
It takes more than 300 µs to execute with microlib and is called several times... this has a high impact on the system.
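For context, a minimal sketch of the kind of conversion on that line (names are placeholders of mine, not the exact mbed-os source): a wide microsecond timestamp is scaled down to a 1 ms tick count, and on an M0/M0+ built with microlib the / 1000 turns into a call to the library's division helper.

#include <stdint.h>

/* Sketch of a HAL_GetTick-style conversion; us_timestamp is a placeholder
 * argument standing in for the platform's microsecond ticker value. */
static uint32_t us_to_tick_ms(uint64_t us_timestamp)
{
    /* The division below is the operation measured above; with microlib on a
     * Cortex-M0/M0+ it expands to a slow library routine instead of inline code. */
    return (uint32_t)(us_timestamp / 1000u);
}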
@evedon @kjbracey-arm @MarceloSalazar this seems to be a limitation of microlib, which introduces a relatively large overhead vs. the other toolchains. At first glance, there are quite a few occurrences of / 1000 (or / 100 or / 10) in mbed-os.
So my current conclusion is that this is a limitation (bug?) of microlib, not a bug of device:st ...
Any feedback / comments are welcome.
FYI @jeromecoutant @ABOSTM
I would agree that this seems like a limitation of microlib.
For the resolution of the original issue, I suggest that we increase DEEP_SLEEP_DELAY_DELTA_US for microlib.
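For concreteness, a minimal sketch of how such a guard might look (the values are placeholders of mine, not a tested change; __MICROLIB is the flag already used in the release-profile experiment above):

/* Hypothetical illustration only: relax the allowed deep-sleep wake-up
 * latency when the image is built against microlib. Both numbers are
 * placeholders, not values taken from mbed-os. */
#if defined(__MICROLIB)
#define DEEP_SLEEP_DELAY_DELTA_US   2000ULL
#else
#define DEEP_SLEEP_DELAY_DELTA_US   1000ULL
#endif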
@evedon
I would agree that this seems like a limitation of microlib.
could you re-assign to the proper team for further analysis and feedback ?
For the resolution of the original issue, I suggest that we increase DEEP_SLEEP_DELAY_DELTA_US for microlib.
I'm not in favor of such a workaround. The HAL_GetTick() function, which includes the "/ 1000" operation, is widely used in the ST driver layers, so the target's performance would be highly impacted and degraded by this limitation. Knowing this, I'm not in favor of deploying microlib support, otherwise we will most probably face other side effects. The question is also whether this limitation only applies to Cortex-M0 devices or to other ones as well...
M3 or M4 devices wouldn't need to use the library to do / 1000 - the compiler should use a UMULL instruction to do a multiplication-by-reciprocal (something like * ((2**32) / 1000) then >> 32). Same applies for any 32-bit division by constant. (I know ARM and GNU compilers do this, and I'd hope IAR does too).
So this particular problem is only going to hit devices like the M0, M0+ and M23.
IMO, 300 µs for a simple integer division is too high for stability. That loses you a few serial characters at 115200 bps if anyone does division in an interrupt handler (at 115200 bps a character takes roughly 10/115200 s ≈ 87 µs, so a 300 µs stall spans three to four character times). If you were to enable microlib on such a device, it would have to be with some very big provisos about it potentially breaking things.
For comparison, how long does that division take with the normal library on the same device?
300µs is not the absolute measurement, it is the overhead when moving to microlib.
On NUCLEO_L476RG, including GPIO toggling for instrumentation, I am observing 3µs max (and down to 200ns) for the / 1000 operation ... and I confirm that there is no visible overhead when moving to microlib.
@LMESTM As it is, is your suggestion that the NUCLEO_L073RZ target is not suitable for supporting the small default lib?
If the performance characteristics are as described, I'd be somewhat wary of enabling it at all on anything M0-like, unless we can redirect division to a faster implementation. We do perform divisions in middleware code in interrupt handlers - it's not just the STM HAL.
I'm going to check what microlib is doing - we might want to forcibly intercept its division and redirect to a "standard" division routine. (32/32->32 division should take maximum about 130 cycles, from references I've seen, so ~4µs at 32MHz)
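For illustration, a rough C sketch (my own, not any library's source) of the kind of straightforward shift-subtract division such a replacement routine could use - it first aligns the divisor with the dividend, so the loop only runs over the significant bits of the result:

#include <stdint.h>

/* Restoring long division with an alignment pre-pass: roughly "pay per
 * significant bit of the quotient" rather than a fixed number of iterations. */
static uint32_t udiv32(uint32_t num, uint32_t den)
{
    uint32_t quot = 0;
    if (den == 0 || num < den) {
        return 0;   /* divide-by-zero handling left to the caller */
    }
    /* Shift the divisor up until it is about to pass the dividend. */
    int shift = 0;
    while ((den << shift) < num && !((den << shift) & 0x80000000u)) {
        shift++;
    }
    /* One quotient bit per iteration, starting from the aligned position. */
    for (; shift >= 0; shift--) {
        if (num >= (den << shift)) {
            num -= den << shift;
            quot |= 1u << shift;
        }
    }
    return quot;
}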
If the performance characteristics are as described, I'd be somewhat wary of enabling it at all on anything M0-like, unless we can redirect division to a faster implementation. We do perform divisions in middleware code in interrupt handlers - it's not just the STM HAL.
Agree.
I'm going to check what microlib is doing - we might want to forcibly intercept its division and redirect to a "standard" division routine. (32/32->32 division should take maximum about 130 cycles, from references I've seen, so ~4µs at 32MHz)
Sounds like the best way forward.
Actually, it seems like the standard M0 division is likely a fair bit slower than the 130 cycles I said above - that number would have been for the traditional ARM, which can munch 1 bit in 3 cycles, giving you a base worst case of 96 cycles, plus extra start-up+loop overhead
RSBS ip, a1, a2, LSR #2
SUBCS a2, a2, a1, LSL #2
ADC a3, a3, a3
The poor M0+ has to take 4-6 cycles per bit, so worst case 192 cycles, plus overhead
LSRS a4,a1,#2
CMP a4,a2
BCC bit1 ; if taken, 3 cycles (M0) or 2 cycles (M0+)
LSLS a4,a2,#2
SUBS a1,a1,a4
ADCS a3,a3,a3
Looking at what I'm getting from microlib (ARM Compiler 6.13), its division doesn't seem that bad. It is just a long division loop, possibly written in C rather than assembler, with no unrolling. 12-16 cycles per bit.
__aeabi_uidivmod
0x0800031a: b530 0. PUSH {r4,r5,lr}
0x0800031c: 460b .F MOV r3,r1
0x0800031e: 4601 .F MOV r1,r0
0x08000320: 2000 . MOVS r0,#0
0x08000322: 2220 " MOVS r2,#0x20
0x08000324: 2401 .$ MOVS r4,#1
0x08000326: e009 .. B 0x800033c ; __aeabi_uidiv + 34
0x08000328: 460d .F MOV r5,r1
0x0800032a: 40d5 .@ LSRS r5,r5,r2
0x0800032c: 429d .B CMP r5,r3
0x0800032e: d305 .. BCC 0x800033c ; __aeabi_uidiv + 34
0x08000330: 461d .F MOV r5,r3
0x08000332: 4095 .@ LSLS r5,r5,r2
0x08000334: 1b49 I. SUBS r1,r1,r5
0x08000336: 4625 %F MOV r5,r4
0x08000338: 4095 .@ LSLS r5,r5,r2
0x0800033a: 1940 @. ADDS r0,r0,r5
0x0800033c: 4615 .F MOV r5,r2
0x0800033e: 1e52 R. SUBS r2,r2,#1
0x08000340: 2d00 .- CMP r5,#0
0x08000342: dcf1 .. BGT 0x8000328 ; __aeabi_uidiv + 14
0x08000344: bd30 0. POP {r4,r5,pc}
My calculations say it should come out at 400-520 cycles per division, depending on value. There's no specific short-circuit - it always loops 32 times.
So that should still be only ~16 µs at 32 MHz (520 cycles / 32 MHz ≈ 16 µs). How fast is this device?
The device is 32MHz but runs at lower speed (~2MHz or so) during deep sleep exit. This is indeed probably why this takes so long.
At full speed I'm seeing that it rather takes around 13 µs - good to get back to theory sometimes :-)!
Aha. Okay, doing the division while at slow speed makes sense. So microlib's division is 2-and-a-bit times slower, and it matters because you're running 16 times slower at that instant, and the division(s) dominate.
Okay, given that info, I now think:
Substituting the microlib divide routines would presumably get wake performance back to where it was, but is there anything else that can be done in the HAL to reduce the total number of divisions on wake, or at least do more of them after putting the CPU back up to normal speed?
Could register settings be cached from before sleep, rather than recalculated, for example?
Or could the CPU be switched to a fixed faster clock temporarily, before going for the full restore to desired runtime state?
General HAL improvements would benefit users of both microlib and any other library.
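To illustrate the caching idea above, a hypothetical sketch (names are mine, not from mbed-os or the ST HAL): do the division once at full speed before sleeping, keep the result, and reuse it on wake-up rather than recomputing it on the slow post-wake clock.

#include <stdint.h>

static uint32_t cached_ms_per_tick;   /* hypothetical cached value */

void before_deep_sleep(uint32_t timer_clock_hz)
{
    /* Expensive division done here, while still running at full speed. */
    cached_ms_per_tick = timer_clock_hz / 1000u;
}

uint32_t ms_per_tick_on_wake(void)
{
    /* No division on the wake path - just reuse the cached value. */
    return cached_ms_per_tick;
}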
So microlib's division is 2-and-a-bit times slower
Actually, here are my measurements: the same division-by-1000 line, running at the same speed, measured using GPIOs (so with GPIO handling overhead):
General HAL improvements would benefit users of both microlib and any other library.
Yes, but clock trees can be MCU-specific and also depend on the board (whether there is an XTAL or not, etc.). Doing per-MCU optimization is always feasible, but it is not something we can afford - so I'd like to stick to a generic implementation for all STM32s. I may consider optimizing this specific function with a locally defined shift-based function to replace the division... do you think that would be acceptable?
EDIT: actually the clock out of deep sleep seems to be in the range of 1 MHz rather than 2 MHz, and I am checking whether a higher speed than that is available on the applicable STM32 families.
I've pushed #12202 for increasing the clock speed out of deep sleep - that will be helpful for any toolchain, thanks for this @kjbracey-arm :-)
Actually here are my measurements.
Ah, right. My "two-and-a-bit" was comparing the worst-case for each, because I was thinking of interrupt latency. The standard routine is actually a lot faster in its best case (when the divide result is small - you basically pay per bit of the answer).
I may consider optimizing this specific function with a locally defined shift-based function to replace division ... do you think that would be acceptable ?
It may well be worthwhile doing that for the speed-critical bits here. (Historically the ARM compiler+library used to have a dedicated shift-based divrem10 routine to cover the frequent use of that in printing.)
Whether you want to be shift or multiply based will depend on how your M0 core was configured - if you have the fast 1-cycle multiplier, you're probably best off just using it.
uint32_t div1000(uint32_t x)
{
    uint64_t m = (uint64_t) x * 274877907; // multiply by (1/1000) * (2**38)
    return (uint32_t) (m >> 38);
}
Although, having tried that, I find that actually you're somewhat scuppered by the fact that, with microlib, the 64x64->64 multiply needed for the reciprocal trick is itself a library call - and, as noted below, an even slower one than the division.
Looking at it, I get the impression that all the microlib runtime routines were written in C from scratch, and not actually compared to see if they were smaller than the standard assembler versions(!)
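As an aside, a quick host-side spot-check (hypothetical, not from the thread) that the reciprocal constant used in div1000 above is exact: since 274877907 * 1000 - 2**38 = 56, the scaled error stays below 1/1000 for every 32-bit input, so the shortcut matches / 1000 exactly.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Sampled with a stride to keep the run short; an exhaustive loop over
     * all 32-bit values gives the same result. */
    for (uint64_t i = 0; i <= 0xFFFFFFFFull; i += 997) {
        uint32_t x = (uint32_t)i;
        uint32_t fast = (uint32_t)(((uint64_t)x * 274877907u) >> 38);
        if (fast != x / 1000u) {
            printf("mismatch at %u\n", (unsigned)x);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}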
It turns out that div1000 does not seem to improve performance :-(
Board | Toolchain | Division | Operation time (incl. I/O toggling) | Frequency during wake-up
-- | -- | -- | -- | --
L476RG | uARM | Native ( / ) | 440 ns | 48 MHz
L476RG | uARM | Div1000 | 480 ns | 48 MHz
L476RG | ARMC6 | Native ( / ) | 440 ns | 48 MHz
L476RG | ARMC6 | Div1000 | 480 ns | 48 MHz
L073RZ | uARM | Native ( / ) | 86 µs | 4 MHz
L073RZ | uARM | Div1000 | 226 µs | 4 MHz
L073RZ | uARM | Div10 (3 times) | 18 µs | 4 MHz
L073RZ | ARMC6 | Native ( / ) | 11.6 µs | 4 MHz
L073RZ | ARMC6 | Div1000 | 13 µs | 4 MHz
L073RZ | ARMC6 | Div10 (3 times) | 18 µs | 4 MHz
If using Div10 three times instead, it seems to show an improvement, but only with uARM. I'm not really willing to introduce toolchain-specific code...
static signed Div10(signed n)
{
    n = (n >> 1) + (n >> 2);
    n += n < 0 ? 9 : 2;
    n = n + (n >> 4);
    n = n + (n >> 8);
    n = n + (n >> 16);
    n = n >> 3;
    return n;
}
Note: with #12202 all tests now pass.
Shall we track the microlib division performance issue in a separate ticket if needed?
Thanks a lot for the numbers! Really useful.
Yeah, we can carry on microlib stuff on a new ticket. That 64x64->64 multiply is a real disaster, even slower than the division. It's actually even worse for the M3 - the microlib code is 6 times bigger and 50 times slower than the standard library.
But fortunately, for the M3 the compiler seems to inline the multiply rather than call the library.
@kjbracey-arm I submitted a new Issue #12221
This one can be closed when #12202 is merged.
Thank you @LMESTM for fixing this issue.