Greentea SysTimer wake up from deep sleep test fails on NUCLEO_L073RZ when linking with microlib. See more details in https://github.com/ARMmbed/mbed-os/pull/12014#issuecomment-562565672
NUCLEO_L073RZ
ARM toolchain with the C microlib.
To reproduce, you need https://github.com/ARMmbed/mbed-os/pull/12068 to build the greentea tests with microlib.
Create a test.json in TESTS\configs with the following:
{
    "target_overrides": {
        "*": {
            "target.default_lib": "small"
        }
    }
}
and run the test:
mbed test -m NUCLEO_L073RZ -t ARM -n *systimer --app-config TESTS/configs/test.json
Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2476
@ARMmbed/team-st-mcd can you please have a look at this?
I cherry-picked #12608 and tried to reproduce, but I only get a build failure:
$ mbed test -m NUCLEO_L073RZ -t ARM -n *systimer --app-config TESTS/configs/test.json -v
Failed to build library
There must be something missing from #12608 or another dependency...
I probably misled you in my earlier comment:
Add
{
    "target_overrides": {
        "*": {
            "target.default_lib": "small"
        }
    }
}
to https://github.com/ARMmbed/mbed-os/blob/master/TESTS/configs/baremetal.json
@evedon I still see the same.
Then when I launch the test, I just get the error message below:
$ mbed test -m NUCLEO_L073RZ -t ARM -n *systimer --app-config TESTS/configs/baremetal.json -v
Failed to build library
mbedgt: greentea test automation tool ver. 1.7.3
mbedgt: test specification file '.\BUILD\tests\NUCLEO_L073RZ\ARM\test_spec.json' (specified with --test-spec option)
mbedgt: using '.\BUILD\tests\NUCLEO_L073RZ\ARM\test_spec.json' from current directory!
("TestSpec::load('.\\BUILD\\tests\\NUCLEO_L073RZ\\ARM\\test_spec.json')", "[Errno 2] No such file or directory: '.\\\\BUILD\\\\tests\\\\NUCLEO_L073RZ\\\\ARM\\\\test_spec.json'")
mbedgt: detecting connected mbed-enabled devices...
mbedgt: detected 1 device
| platform_name | platform_name_unique | serial_port | mount_point | target_id |
|---------------|----------------------|-------------|-------------|--------------------------|
| NUCLEO_L073RZ | NUCLEO_L073RZ[0] | COM66 | D: | 07600221357F00130A3B964E |
mbedgt: shuffle seed: 0.7357669713
mbedgt: no platform/target matching tests were found!
mbedgt: no matching platforms were found!
mbedgt: completed in 0.13 sec
[mbed] Working path "C:\Data\Workspace\mbed" (program)
[mbed] Exec "c:\python27\python.exe -u C:\Data\Workspace\mbed\tools\test.py -D MBED_TEST_MODE -t ARM -m NUCLEO_L073RZ --source . --build .\BUILD\tests\NUCLEO_L073RZ\ARM --test-spec .\BUILD\tests\NUCLEO_L073RZ\ARM\test_spec.json --build-data .\BUILD\tests\NUCLEO_L073RZ\ARM\build_data.json -n *systimer -v --app-config TESTS/configs/baremetal.json --greentea" in "C:\Data\Workspace\mbed"
The test works OK as long as I do not add the "target.default_lib": "small" config.
Is there some extra change needed for the NUCLEO_L073RZ target?
OK, I finally found out: targets.json also needs to be updated for NUCLEO_L073RZ with
"supported_c_libs": {
"arm": ["std"],
"gcc_arm": ["std", "small"],
"iar": ["std"]
}
Why not apply the above to all targets by default?
I agree it may not have been tested for all of them yet, but what would prevent a target from supporting one of the small configurations?
@evedon
Actually, compilation seems OK now, but the link fails with the error below:
[DEBUG] Errors: Error: L6218E: Undefined symbol __sync_fetch_and_add_4 (referred from C:\Keil_v5\ARM\ARMCLANG\bin\..\lib\libcxx\libcppabi_p.l(cxa_handlers.cpp.o)).
[DEBUG] Errors: Finished: 0 information, 3 warning and 1 error messages.
Which version of ARMC6 are you running?
We build successfully with ARM Compiler 6.11 and 6.13
$ /c/Keil_v5/ARM/ARMCLANG/bin/armclang.exe --version
Product: MDK Professional 5.26
Component: ARM Compiler 6.10.1
Tool: armclang [5d143200]
I upgraded.
$ /c/Keil_v5/ARM/ARMCLANG/bin/armclang.exe --version
Product: MDK-ARM Lite 5.29
Component: ARM Compiler 6.13.1
Tool: armclang [5d895d00]
EDIT: seems to be okay now - looks like the license manager took some time to run its checks...
@evedon
So I have reproduced the issue and I confirm that the deep sleep exit takes more time with microlib, but I don't know why ...
Below are a few measurements of the most time consuming part of the deep sleep exit (https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/sleep.c#L261)
ForceOscOutofDeepSleep();     // microlib 1 ms vs. standard ARM lib 0.640 ms
ForcePeriphOutofDeepSleep();  // microlib 1.39 ms vs. standard ARM lib 0.7 ms
SetSysClock();                // microlib 2.28 ms vs. standard ARM lib 1.21 ms
When executed in the microlib configuration, the same code seems to be about 2x slower.
And all this code is executed under a critical section...
Is there a specific compiler optimization that is turned OFF with microlib ?
@kjbracey-arm any other idea ?
from http://www.keil.com/arm/microlib.asp
_As MicroLib has been optimized to minimize code size, some functions will execute more slowly than the standard C library routines available in the ARM compilation tools._
Then there is a link to get more details, but that link is broken... Could someone from Arm help find out what those differences are, to get some clues about the major ones?
One more test result: by just adding the __MICROLIB compilation flag and the --library_type=microlib linker option to the release profile, I can reproduce the issue, i.e. the execution time gets longer.
Tested with the release.json modifications below:
diff --git a/tools/profiles/release.json b/tools/profiles/release.json
index 2aab24f6c8..384bbb7f31 100644
--- a/tools/profiles/release.json
+++ b/tools/profiles/release.json
@@ -20,12 +20,12 @@
"-Wno-reserved-user-defined-literal", "-Wno-deprecated-register",
"-DMULADDC_CANNOT_USE_R7", "-fdata-sections",
"-fno-exceptions", "-MMD", "-D_LIBCPP_EXTERN_TEMPLATE(...)=",
- "-fshort-enums", "-fshort-wchar", "-DNDEBUG"],
+ "-fshort-enums", "-fshort-wchar", "-DNDEBUG", "-D__MICROLIB"],
"asm": [],
"c": ["-D__ASSERT_MSG", "-std=gnu11"],
"cxx": ["-fno-rtti", "-std=gnu++14"],
"ld": ["--show_full_path", "--legacyalign", "--any_contingency",
- "--keep=os_cb_sections"]
+ "--keep=os_cb_sections", "--library_type=microlib"]
},
"ARM": {
"common": ["-c", "--gnu", "-Ospace", "--split_sections",
with the command line:
mbed test -m NUCLEO_L073RZ -t ARMC6 -n *systimer --profile=tools/profiles/release.json -v
Further investigation shows that the line below (the / 1000) is a big part of the problem.
https://github.com/ARMmbed/mbed-os/blob/master/targets/TARGET_STM/hal_tick_overrides.c#L61
It takes more than 300 µs to execute with microlib and is called several times... this has a high impact on the system.
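For context, a minimal sketch of the kind of conversion on that line (names are placeholders of mine, not the exact mbed-os source): a wide microsecond timestamp is scaled down to a 1 ms tick count, and on an M0/M0+ built with microlib the / 1000 turns into a call to the library's division helper.

#include <stdint.h>

/* Sketch of a HAL_GetTick-style conversion; us_timestamp is a placeholder
 * argument standing in for the platform's microsecond ticker value. */
static uint32_t us_to_tick_ms(uint64_t us_timestamp)
{
    /* The division below is the operation measured above; with microlib on a
     * Cortex-M0/M0+ it expands to a slow library routine instead of inline code. */
    return (uint32_t)(us_timestamp / 1000u);
}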
@evedon @kjbracey-arm @MarceloSalazar this seems to be a limitation of microlib, which introduces a relatively large overhead vs. the other toolchains. At first glance, there are quite a few occurrences of / 1000 (or / 100 or / 10) in mbed-os.
So my current conclusion is that this is a limitation (bug?) of microlib, not a bug of device:st ...
Any feedback / comments are welcome.
FYI @jeromecoutant @ABOSTM
I would agree that this seems like a limitation of microlib.
For the resolution of the original issue, I suggest that we increase DEEP_SLEEP_DELAY_DELTA_US for microlib.
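For concreteness, a minimal sketch of how such a guard might look (the values are placeholders of mine, not a tested change; __MICROLIB is the flag already used in the release-profile experiment above):

/* Hypothetical illustration only: relax the allowed deep-sleep wake-up
 * latency when the image is built against microlib. Both numbers are
 * placeholders, not values taken from mbed-os. */
#if defined(__MICROLIB)
#define DEEP_SLEEP_DELAY_DELTA_US   2000ULL
#else
#define DEEP_SLEEP_DELAY_DELTA_US   1000ULL
#endif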
@evedon
I would agree that this seems like a limitation of microlib.
could you re-assign to the proper team for further analysis and feedback ?
For the resolution of the original issue, I suggest that we increase DEEP_SLEEP_DELAY_DELTA_US for microlib.
I'm not in favor of such a workaround. The HAL_GetTick() function, which includes the "/ 1000" operation, is widely used in the ST driver layers, so the target's performance would be highly impacted and degraded by this limitation. Knowing this, I'm not in favor of deploying microlib support, otherwise we will most probably face other side effects. The question is also whether this limitation only applies to Cortex-M0 devices or to other ones as well...
M3 or M4 devices wouldn't need to use the library to do / 1000 - the compiler should use a UMULL instruction to do a multiplication-by-reciprocal (something like * ((2**32) / 1000) then >> 32). Same applies for any 32-bit division by constant. (I know ARM and GNU compilers do this, and I'd hope IAR does too).
So this particular problem is only going to hit devices like the M0, M0+ and M23.
IMO, 300 µs for a simple integer division is too high for stability. That loses you a few serial characters at 115200 bps if anyone does division in an interrupt handler (at 115200 bps a character takes roughly 10/115200 s ≈ 87 µs, so a 300 µs stall spans three to four character times). If you were to enable microlib on such a device, it would have to be with some very big provisos about it potentially breaking things.
For comparison, how long does that division take with the normal library on the same device?
300µs is not the absolute measurement, it is the overhead when moving to microlib.
On NUCLEO_L476RG, including GPIO toggling for instrumentation, I am observing 3µs max (and down to 200ns) for the / 1000 operation ... and I confirm that there is no visible overhead when moving to microlib.
@LMESTM As it is, is your suggestion that the NUCLEO_L073RZ target is not suitable for supporting the small default lib?
If the performance characteristics are as described, I'd be somewhat wary of enabling it at all on anything M0-like, unless we can redirect division to a faster implementation. We do perform divisions in middleware code in interrupt handlers - it's not just the STM HAL.
I'm going to check what microlib is doing - we might want to forcibly intercept its division and redirect to a "standard" division routine. (32/32->32 division should take maximum about 130 cycles, from references I've seen, so ~4µs at 32MHz)
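For illustration, a rough C sketch (my own, not any library's source) of the kind of straightforward shift-subtract division such a replacement routine could use - it first aligns the divisor with the dividend, so the loop only runs over the significant bits of the result:

#include <stdint.h>

/* Restoring long division with an alignment pre-pass: roughly "pay per
 * significant bit of the quotient" rather than a fixed number of iterations. */
static uint32_t udiv32(uint32_t num, uint32_t den)
{
    uint32_t quot = 0;
    if (den == 0 || num < den) {
        return 0;   /* divide-by-zero handling left to the caller */
    }
    /* Shift the divisor up until it is about to pass the dividend. */
    int shift = 0;
    while ((den << shift) < num && !((den << shift) & 0x80000000u)) {
        shift++;
    }
    /* One quotient bit per iteration, starting from the aligned position. */
    for (; shift >= 0; shift--) {
        if (num >= (den << shift)) {
            num -= den << shift;
            quot |= 1u << shift;
        }
    }
    return quot;
}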
If the performance characteristics are as described, I'd be somewhat wary of enabling it at all on anything M0-like, unless we can redirect division to a faster implementation. We do perform divisions in middleware code in interrupt handlers - it's not just the STM HAL.
Agree.
I'm going to check what microlib is doing - we might want to forcibly intercept its division and redirect to a "standard" division routine. (32/32->32 division should take maximum about 130 cycles, from references I've seen, so ~4µs at 32MHz)
Sounds like the best way forward.
Actually, it seems like the standard M0 division is likely a fair bit slower than the 130 cycles I said above - that number would have been for the traditional ARM, which can munch 1 bit in 3 cycles, giving you a base worst case of 96 cycles, plus extra start-up+loop overhead
RSBS ip, a1, a2, LSR #2
SUBCS a2, a2, a1, LSL #2
ADC a3, a3, a3
The poor M0+ has to take 4-6 cycles per bit, so worst case 192 cycles, plus overhead
LSRS a4,a1,#2
CMP a4,a2
BCC bit1 ; if taken, 3 cycles (M0) or 2 cycles (M0+)
LSLS a4,a2,#2
SUBS a1,a1,a4
ADCS a3,a3,a3
Looking at what I'm getting from microlib (ARM Compiler 6.13), its division doesn't seem that bad. It is just a long division loop, possibly written in C rather than assembler, with no unrolling. 12-16 cycles per bit.
__aeabi_uidivmod
0x0800031a: b530 0. PUSH {r4,r5,lr}
0x0800031c: 460b .F MOV r3,r1
0x0800031e: 4601 .F MOV r1,r0
0x08000320: 2000 . MOVS r0,#0
0x08000322: 2220 " MOVS r2,#0x20
0x08000324: 2401 .$ MOVS r4,#1
0x08000326: e009 .. B 0x800033c ; __aeabi_uidiv + 34
0x08000328: 460d .F MOV r5,r1
0x0800032a: 40d5 .@ LSRS r5,r5,r2
0x0800032c: 429d .B CMP r5,r3
0x0800032e: d305 .. BCC 0x800033c ; __aeabi_uidiv + 34
0x08000330: 461d .F MOV r5,r3
0x08000332: 4095 .@ LSLS r5,r5,r2
0x08000334: 1b49 I. SUBS r1,r1,r5
0x08000336: 4625 %F MOV r5,r4
0x08000338: 4095 .@ LSLS r5,r5,r2
0x0800033a: 1940 @. ADDS r0,r0,r5
0x0800033c: 4615 .F MOV r5,r2
0x0800033e: 1e52 R. SUBS r2,r2,#1
0x08000340: 2d00 .- CMP r5,#0
0x08000342: dcf1 .. BGT 0x8000328 ; __aeabi_uidiv + 14
0x08000344: bd30 0. POP {r4,r5,pc}
My calculations say it should come out at 400-520 cycles per division, depending on value. There's no specific short-circuit - it always loops 32 times.
So that should still be only ~16 µs at 32 MHz (520 cycles / 32 MHz ≈ 16 µs). How fast is this device?
The device is 32MHz but runs at lower speed (~2MHz or so) during deep sleep exit. This is indeed probably why this takes so long.
At full speed I'm seeing that it rather takes around 13 µs - good to get back to theory sometimes :-)!
Aha. Okay, doing the division while at slow speed makes sense. So microlib's division is 2-and-a-bit times slower, and it matters because you're running 16 times slower at that instant, and the division(s) dominate.
Okay, given that info, I now think:
Substituting the microlib divide routines would presumably get wake performance back to where it was, but is there anything else that can be done in the HAL to reduce the total number of divisions on wake, or at least do more of them after putting the CPU back up to normal speed?
Could register settings be cached from before sleep, rather than recalculated, for example?
Or could the CPU be switched to a fixed faster clock temporarily, before going for the full restore to desired runtime state?
General HAL improvements would benefit users of both microlib and any other library.
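To illustrate the caching idea above, a hypothetical sketch (names are mine, not from mbed-os or the ST HAL): do the division once at full speed before sleeping, keep the result, and reuse it on wake-up rather than recomputing it on the slow post-wake clock.

#include <stdint.h>

static uint32_t cached_ms_per_tick;   /* hypothetical cached value */

void before_deep_sleep(uint32_t timer_clock_hz)
{
    /* Expensive division done here, while still running at full speed. */
    cached_ms_per_tick = timer_clock_hz / 1000u;
}

uint32_t ms_per_tick_on_wake(void)
{
    /* No division on the wake path - just reuse the cached value. */
    return cached_ms_per_tick;
}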
So microlib's division is 2-and-a-bit times slower
Actually, here are my measurements: the same division-by-1000 line, running at the same speed, measured using GPIOs (so with GPIO handling overhead):
General HAL improvements would benefit users of both microlib and any other library.
Yes, but clock trees can be MCU-specific and also depend on the board (whether there is an XTAL or not, etc.). Doing per-MCU optimization is always feasible, but it is not something we can afford - so I'd like to stick to a generic implementation for all STM32s. I may consider optimizing this specific function with a locally defined shift-based function to replace the division... do you think that would be acceptable?
EDIT: actually the clock out of deep sleep seems to be in the range of 1 MHz rather than 2 MHz, and I am checking whether a higher speed than that is available on the applicable STM32 families.
I've pushed #12202 for increasing the clock speed out of deep sleep - that will be helpful for any toolchain, thanks for this @kjbracey-arm :-)
Actually here are my measurements.
Ah, right. My "two-and-a-bit" was comparing the worst-case for each, because I was thinking of interrupt latency. The standard routine is actually a lot faster in its best case (when the divide result is small - you basically pay per bit of the answer).
I may consider optimizing this specific function with a locally defined shift-based function to replace division ... do you think that would be acceptable ?
It may well be worthwhile doing that for the speed-critical bits here. (Historically the ARM compiler+library used to have a dedicated shift-based divrem10 routine to cover the frequent use of that in printing.)
Whether you want to be shift or multiply based will depend on how your M0 core was configured - if you have the fast 1-cycle multiplier, you're probably best off just using it.
uint32_t div1000(uint32_t x)
{
    uint64_t m = (uint64_t) x * 274877907; // multiply by (1/1000) * (2**38)
    return (uint32_t) (m >> 38);
}
Although, having tried that, I find that actually you're somewhat scuppered by the fact that, with microlib, the 64x64->64 multiply needed for the reciprocal trick is itself a library call - and, as noted below, an even slower one than the division.
Looking at it, I get the impression that all the microlib runtime routines were written in C from scratch, and not actually compared to see if they were smaller than the standard assembler versions(!)
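As an aside, a quick host-side spot-check (hypothetical, not from the thread) that the reciprocal constant used in div1000 above is exact: since 274877907 * 1000 - 2**38 = 56, the scaled error stays below 1/1000 for every 32-bit input, so the shortcut matches / 1000 exactly.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Sampled with a stride to keep the run short; an exhaustive loop over
     * all 32-bit values gives the same result. */
    for (uint64_t i = 0; i <= 0xFFFFFFFFull; i += 997) {
        uint32_t x = (uint32_t)i;
        uint32_t fast = (uint32_t)(((uint64_t)x * 274877907u) >> 38);
        if (fast != x / 1000u) {
            printf("mismatch at %u\n", (unsigned)x);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}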
It turns out that div1000 does not seem to improve performance :-(
Board | Toolchain | Division | Operation time (incl. I/O toggling) | Frequency during wake-up
-- | -- | -- | -- | --
L476RG | uARM | Native ( / ) | 440 ns | 48 MHz
L476RG | uARM | Div1000 | 480 ns | 48 MHz
L476RG | ARMC6 | Native ( / ) | 440 ns | 48 MHz
L476RG | ARMC6 | Div1000 | 480 ns | 48 MHz
L073RZ | uARM | Native ( / ) | 86 µs | 4 MHz
L073RZ | uARM | Div1000 | 226 µs | 4 MHz
L073RZ | uARM | Div10 (3 times) | 18 µs | 4 MHz
L073RZ | ARMC6 | Native ( / ) | 11.6 µs | 4 MHz
L073RZ | ARMC6 | Div1000 | 13 µs | 4 MHz
L073RZ | ARMC6 | Div10 (3 times) | 18 µs | 4 MHz
If using Div10 three times instead, it seems to show an improvement, but only with uARM. I'm not really willing to introduce toolchain-specific code...
static signed Div10(signed n)
{
    n = (n >> 1) + (n >> 2);
    n += n < 0 ? 9 : 2;
    n = n + (n >> 4);
    n = n + (n >> 8);
    n = n + (n >> 16);
    n = n >> 3;
    return n;
}
Note: with #12202 all tests now pass.
Shall we track the microlib division performance issue in a separate ticket if needed?
Thanks a lot for the numbers! Really useful.
Yeah, we can carry on microlib stuff on a new ticket. That 64x64->64 multiply is a real disaster, even slower than the division. It's actually even worse for the M3 - the microlib code is 6 times bigger and 50 times slower than the standard library.
But fortunately, for the M3 the compiler seems to inline the multiply rather than call the library.
@kjbracey-arm I submitted a new Issue #12221
This one can be closed when #12202 is merged.
Thank you @LMESTM for fixing this issue.