Mbed-os: Nucleo F429ZI hardfaults with debug profile, possibly due to flash cache corruption

Created on 21 Jan 2020 · 32 Comments · Source: ARMmbed/mbed-os

Description of defect

Priority: Blocker

Compiling the device management client example for Nucleo F429ZI with the debug profile and the GCC_ARM compiler reliably results in a crash.

++ MbedOS Fault Handler ++

FaultType: HardFault

Context:
R0   : 00000000
R1   : 00000001
R2   : E000ED00
R3   : 08084F5F
R4   : 00000000
R5   : 00000000
R6   : 00000000
R7   : 00000000
R8   : 00000000
R9   : 00000000
R10  : 00000000
R11  : 00000000
R12  : 00000000
SP   : 20012F68
LR   : 0808D2DF
PC   : 0808D2E0
xPSR : 61000000
PSP  : 20012F48
MSP  : 2002FFC0
CPUID: 410FC241
HFSR : 40000000
MMFSR: 00000000
BFSR : 00000000
UFSR : 00000008
DFSR : 00000000
AFSR : 00000000
Mode : Thread
Priv : Privileged
Stack: PSP

-- MbedOS Fault Handler --



++ MbedOS Error Info ++
Error Status: 0x80FF013D Code: 317 Module: 255
Error Message: Fault exception
Location: 0x808D2E0
Error Value: 0x200001B0
Current Thread: rtx_idle Id: 0x200129C8 Entry: 0x8085E7D StackSize: 0x280 StackMem: 0x20012D10 SP: 0x20012F68
For more info, visit: https://mbed.com/s/error?error=0x80FF013D&tgt=NUCLEO_F429ZI
-- MbedOS Error Info --
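The status registers in the dump above can be decoded with a few bit tests. The following is a minimal sketch, not part of Mbed OS: the helper names `ufsr_cause` and `hardfault_is_forced` are hypothetical, and the bit positions are the ones defined by the ARMv7-M architecture for Cortex-M3/M4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fault status bits as defined by the ARMv7-M architecture (Cortex-M3/M4). */
#define HFSR_FORCED      (1u << 30)  /* a configurable fault escalated to HardFault */
#define UFSR_UNDEFINSTR  (1u << 0)   /* undefined instruction */
#define UFSR_INVSTATE    (1u << 1)   /* invalid execution state */
#define UFSR_INVPC       (1u << 2)   /* invalid PC load on exception return */
#define UFSR_NOCP        (1u << 3)   /* coprocessor instruction, no coprocessor */
#define UFSR_UNALIGNED   (1u << 8)   /* trapped unaligned access */
#define UFSR_DIVBYZERO   (1u << 9)   /* trapped divide by zero */

/* Hypothetical helper: name of the lowest usage-fault bit set in ufsr. */
static const char *ufsr_cause(uint32_t ufsr)
{
    if (ufsr & UFSR_UNDEFINSTR) return "UNDEFINSTR";
    if (ufsr & UFSR_INVSTATE)   return "INVSTATE";
    if (ufsr & UFSR_INVPC)      return "INVPC";
    if (ufsr & UFSR_NOCP)       return "NOCP";
    if (ufsr & UFSR_UNALIGNED)  return "UNALIGNED";
    if (ufsr & UFSR_DIVBYZERO)  return "DIVBYZERO";
    return "none";
}

/* Hypothetical helper: true when the HardFault is an escalated fault. */
static int hardfault_is_forced(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0;
}

/* Worked check against register values reported in this thread. */
static void check_reported_faults(void)
{
    assert(hardfault_is_forced(0x40000000u));                    /* HFSR */
    assert(strcmp(ufsr_cause(0x00000008u), "NOCP") == 0);        /* UFSR, NUCLEO_F429ZI dump */
    assert(strcmp(ufsr_cause(0x00000001u), "UNDEFINSTR") == 0);  /* UFSR, UBLOX_EVK_ODIN_W2 dump */
}
```

Both usage-fault causes are the kind one would expect if the core tried to execute data or stale bytes as an instruction, though the dump alone does not prove that.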

Two workarounds allow the target to boot up correctly:

Disabling sleep in the HAL code:

diff --git a/targets/TARGET_STM/TARGET_STM32F4/device/stm32f4xx_hal_pwr.c b/targets/TARGET_STM/TARGET_STM32F4/device/stm32f4xx_hal_pwr.c
index dffb78e25c..e699fd9fff 100644
--- a/targets/TARGET_STM/TARGET_STM32F4/device/stm32f4xx_hal_pwr.c
+++ b/targets/TARGET_STM/TARGET_STM32F4/device/stm32f4xx_hal_pwr.c
@@ -391,7 +391,7 @@ void HAL_PWR_EnterSLEEPMode(uint32_t Regulator, uint8_t SLEEPEntry)
   if(SLEEPEntry == PWR_SLEEPENTRY_WFI)
   {   
     /* Request Wait For Interrupt */
-    __WFI();
+    __NOP();
   }
   else
   {

or changing the flash cache configuration at application start, at the beginning of main:

FLASH->ACR = 0x405;
FLASH->ACR = 0xC05;

FLASH->ACR = 0x405;
FLASH->ACR = 0x705;
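To make the two pairs of writes easier to read, the values can be split into their fields. This is a sketch only: `decode_flash_acr` and the struct are hypothetical names, and the bit positions are assumed from the STM32F42x/F43x reference manual layout of FLASH_ACR (LATENCY in bits 3:0, PRFTEN bit 8, ICEN bit 9, DCEN bit 10, ICRST bit 11, DCRST bit 12).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* FLASH_ACR bit positions, assumed from the STM32F42x/F43x reference manual. */
#define ACR_LATENCY_MASK 0xFu
#define ACR_PRFTEN (1u << 8)   /* prefetch enable */
#define ACR_ICEN   (1u << 9)   /* instruction cache enable */
#define ACR_DCEN   (1u << 10)  /* data cache enable */
#define ACR_ICRST  (1u << 11)  /* instruction cache reset */
#define ACR_DCRST  (1u << 12)  /* data cache reset */

typedef struct {
    unsigned wait_states;
    bool prefetch, icache, dcache, icache_reset, dcache_reset;
} flash_acr_fields;

/* Hypothetical helper: split a raw FLASH_ACR value into its fields. */
static flash_acr_fields decode_flash_acr(uint32_t acr)
{
    flash_acr_fields f;
    f.wait_states  = (unsigned)(acr & ACR_LATENCY_MASK);
    f.prefetch     = (acr & ACR_PRFTEN) != 0;
    f.icache       = (acr & ACR_ICEN) != 0;
    f.dcache       = (acr & ACR_DCEN) != 0;
    f.icache_reset = (acr & ACR_ICRST) != 0;
    f.dcache_reset = (acr & ACR_DCRST) != 0;
    return f;
}

/* Worked check for the three values used in the workaround. */
static void check_workaround_values(void)
{
    flash_acr_fields a = decode_flash_acr(0x405u);  /* I-cache and prefetch off */
    assert(a.wait_states == 5 && a.dcache && !a.icache && !a.prefetch);

    flash_acr_fields b = decode_flash_acr(0xC05u);  /* I-cache reset bit set, still off */
    assert(b.wait_states == 5 && b.icache_reset && b.dcache && !b.icache);

    flash_acr_fields c = decode_flash_acr(0x705u);  /* I-cache and prefetch back on */
    assert(c.wait_states == 5 && c.prefetch && c.icache && c.dcache && !c.icache_reset);
}
```

Read this way, the first pair disables the instruction cache and prefetch and then resets the instruction cache, leaving it off; the second pair disables them and then re-enables both. All three values keep 5 wait states and the data cache enabled.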

Target(s) affected by this defect ?

Nucleo F429ZI

tested with both LWIP and WISUN configurations.

Toolchain(s) (name and version) displaying this defect ?

gcc-arm-none-eabi-9-2019-q4-major

This _does not_ reproduce with ARMC6 compiler.

What version of Mbed-os are you using (tag or sha) ?

Mbed OS 5.15.0

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

Mbed CLI 1.10.2

How is this defect reproduced ?

mbed import https://github.com/armmbed/mbed-cloud-client-example (4.2.1 version with Mbed OS 5.15.0).
mbed compile -m NUCLEO_F429ZI --profile debug

Crashes immediately on application start.

Also originally verified with internal test-tool which failed the same way.

CLOSED st mirrored bug

teetak01

Most helpful comment

Thanks for the info @se7ensong. With further investigation, we believe we have found the cause:
it isn't related to flash, but reset is a trigger.

When pyOCD or another debug tool resets the target, it does not clear the DBGMCU_CR register. Sometimes, if the flashed image is built with the debug profile, it deliberately sets DBGMCU_CR to 0x7.

When this register is set, especially the DBG_SLEEP bit, and the target enters sleep mode, it crashes at the WFI instruction. For details, see the STM32F4 errata, chapter 2.1.3 - Debugging Sleep/Stop mode with WFE/WFI entry.

I tried our image with DBGMCU_CR cleared manually, and the crash is gone.

Also, based on the errata, two other conditions must be met to see the crash:

The number of wait states configured on the Flash interface is higher than 0
And the linker places WFE or WFI instructions on 4-byte aligned addresses
(0x080xx_xxx4)
Maybe these add some of the randomness in when we see the crash happen.

BTW, we tried the workaround of adding three NOPs after WFI, and that solution seems to work for us. If you can confirm whether it works for you or not, that would be great.

jamesbeyond on 26 Mar 2020

👍5
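The "manually clear the DBGMCU_CR" step from the comment above could look roughly like this. It is a minimal sketch, not the code that was actually used: `clear_lowpower_debug_bits` is a hypothetical helper, and the bit positions (DBG_SLEEP bit 0, DBG_STOP bit 1, DBG_STANDBY bit 2, together 0x7) are assumed from the STM32F4 reference manual. The register is passed as a pointer so the logic can also be exercised off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DBGMCU_CR low-power debug bits, assumed from the STM32F4 reference manual. */
#define DBG_SLEEP    (1u << 0)
#define DBG_STOP     (1u << 1)
#define DBG_STANDBY  (1u << 2)

/* Hypothetical helper: clear the three low-power debug bits, keep the rest. */
static void clear_lowpower_debug_bits(volatile uint32_t *dbgmcu_cr)
{
    assert(dbgmcu_cr != NULL);
    *dbgmcu_cr &= ~(uint32_t)(DBG_SLEEP | DBG_STOP | DBG_STANDBY);
}

/* Worked check using a plain variable in place of the hardware register. */
static void check_clear(void)
{
    volatile uint32_t fake_cr = 0x7u;   /* value mentioned in the comment above */
    clear_lowpower_debug_bits(&fake_cr);
    assert(fake_cr == 0u);
}
```

On the target, with the CMSIS device header included, the call would presumably be `clear_lowpower_debug_bits(&DBGMCU->CR);` early in main. Clearing these bits also means a debugger may lose its connection while the core sleeps.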

All 32 comments

@evedon @bulislaw can we get someone to help investigate this? Teemu has said this is a blocker for client team...

adbridge on 21 Jan 2020

I've been working with Teemu - I think we need @ARMmbed/team-st-mcd input.

Symptoms are consistent with the flash cache being corrupted during sleep. When it occurs and doesn't occur isn't clear, but we're currently thinking certain alignments are probably a factor.

In 2 failing cases we've failed with PC = 0x......80 and 0x......E0 - same 32n+0 alignment - where the code looks like

xxxxxx7A   BL    HAL_PWR_EnterSLEEPMode
xxxxxx7E   B     backwards branch                  <-- apparently didn't take this branch
xxxxxx80   DCD   not_an_instruction              <-- crashes trying to execute this

HAL_PWR_EnterSLEEPMode finishes by executing

WFI
BX     LR

I would assume the BX LR is in the CPU's pipeline, so isn't fetched from the I-bus after wake-up, but the code at 7E/DE after returning would be the first thing fetched.

kjbracey-arm on 21 Jan 2020

@kjbracey-arm So it seems playing around with the FLASH->ACR values helps to work around the issue. Alternatively, adding enough __NOP() calls before the __WFI() also seems to help, so this could be an alignment issue as you suggested.

Crashing: 80568ea: bf30 wfi (default application)
Working: 8056902: bf30 wfi (disabling flash cache)
Working: 8056902: bf30 wfi (disable and enable)
Working: 8056900: bf30 wfi (adding 11 __NOP() before __WFI() ).

teetak01 on 21 Jan 2020
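The WFI addresses listed above can be compared by their offset inside an aligned block. This is a small sketch; `offset_in_block` is a hypothetical helper, and the choice of a 16-byte block is an assumption based on the 128-bit flash read width of STM32F4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of addr inside a power-of-two sized block. */
static unsigned offset_in_block(uint32_t addr, uint32_t block_size)
{
    /* block_size must be a non-zero power of two */
    assert(block_size != 0u && (block_size & (block_size - 1u)) == 0u);
    return (unsigned)(addr & (block_size - 1u));
}

/* Worked check on the addresses reported in this thread. */
static void check_reported_addresses(void)
{
    /* Offset inside a 4-byte word */
    assert(offset_in_block(0x080568EAu, 4u) == 2u);    /* crashing */
    assert(offset_in_block(0x08056902u, 4u) == 2u);    /* working */
    assert(offset_in_block(0x08056900u, 4u) == 0u);    /* working */

    /* Offset inside a 16-byte block (assumed flash line size) */
    assert(offset_in_block(0x080568EAu, 16u) == 10u);  /* crashing */
    assert(offset_in_block(0x08056902u, 16u) == 2u);   /* working */
    assert(offset_in_block(0x08056900u, 16u) == 0u);   /* working */

    /* Faulting PC from the original report sits on a 32-byte boundary */
    assert(offset_in_block(0x0808D2E0u, 32u) == 0u);
}
```

The crashing build and one working build share the same offset within a 4-byte word, so these three samples alone do not isolate alignment as the cause, especially since the flash cache configuration also changed in two of them.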

Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2515

ciarmcom on 21 Jan 2020

playing around with the FLASH->ACR values helps to work around the issue

Do you have evidence of that? Can you make it work at the ea alignment with a different ACR value? Those results just show it always working at 00 or 02.

And it seems the address of the BL may not be relevant, only the WFI?

kjbracey-arm on 22 Jan 2020

@ARMmbed/team-st-mcd can you please review and comment on this issue asap?
Have you seen this in other cases or know where the problem might be coming from?

The current workaround is to disable sleep as commented by @teetak01 .

MarceloSalazar on 23 Jan 2020

Maybe an issue could be raised in https://github.com/STMicroelectronics/STM32CubeF4

jeromecoutant on 29 Jan 2020

I came across this thread while facing a similar issue. I am using Mbed OS 5.15.0 with IAR EWB 8.42.1. It doesn't always crash, but when it does, it cannot be recovered easily. Previously, I just erased the whole flash. Now, @teetak01's method works great for me.

A typical error message is:
01-29 13:26:14 UART-RX DEBUG logStr=FaultType: HardFault
01-29 13:26:14 UART-RX DEBUG logStr=
01-29 13:26:14 UART-RX DEBUG logStr=Context:
01-29 13:26:14 UART-RX DEBUG logStr=R0 : E000ED10
01-29 13:26:14 UART-RX DEBUG logStr=R1 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R2 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R3 : 00000001
01-29 13:26:14 UART-RX DEBUG logStr=R4 : 200004F0
01-29 13:26:14 UART-RX DEBUG logStr=R5 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R6 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R7 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R8 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R9 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R10 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R11 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=R12 : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=SP : 2000CD64
01-29 13:26:14 UART-RX DEBUG logStr=LR : 0811ACEF
01-29 13:26:14 UART-RX DEBUG logStr=PC : 0811ACF6
01-29 13:26:14 UART-RX DEBUG logStr=xPSR : 61000200
01-29 13:26:14 UART-RX DEBUG logStr=PSP : 2000CD40
01-29 13:26:14 UART-RX DEBUG logStr=MSP : 20011050
01-29 13:26:14 UART-RX DEBUG logStr=CPUID: 410FC241
01-29 13:26:14 UART-RX DEBUG logStr=HFSR : 40000000
01-29 13:26:14 UART-RX DEBUG logStr=MMFSR: 00000000
01-29 13:26:14 UART-RX DEBUG logStr=BFSR : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=UFSR : 00000001
01-29 13:26:14 UART-RX DEBUG logStr=DFSR : 0000000B
01-29 13:26:14 UART-RX DEBUG logStr=AFSR : 00000000
01-29 13:26:14 UART-RX DEBUG logStr=Mode : Thread
01-29 13:26:14 UART-RX DEBUG logStr=Priv : Privileged
01-29 13:26:14 UART-RX DEBUG logStr=Stack: PSP
01-29 13:26:14 UART-RX DEBUG logStr=
01-29 13:26:14 UART-RX DEBUG logStr=-- MbedOS Fault Handler --
01-29 13:26:14 UART-RX DEBUG logStr=
01-29 13:26:14 UART-RX DEBUG logStr=
01-29 13:26:14 UART-RX DEBUG logStr=
01-29 13:26:14 UART-RX DEBUG logStr=++ MbedOS Error Info ++
01-29 13:26:24 UART-RX DEBUG logStr=Error Status: 0x80FF013D Code: 317 Module: 255
01-29 13:26:24 UART-RX DEBUG logStr=Error Message: Fault exception
01-29 13:26:24 UART-RX DEBUG logStr=Location: 0x811ACF6
01-29 13:26:24 UART-RX DEBUG logStr=Error Value: 0x2000F73C
01-29 13:26:24 UART-RX DEBUG logStr=Current Thread: rtx_idle Id: 0x2000FB1C Entry: 0x80FB945 StackSize: 0x280 StackMem: 0x2000CAF8 SP: 0x2000CD64
01-29 13:26:24 UART-RX DEBUG logStr=For more info, visit: https://mbed.com/s/error?error=0x80FF013D&tgt=UBLOX_EVK_ODIN_W2
01-29 13:26:24 UART-RX DEBUG logStr=-- MbedOS Error Info --

se7ensong on 30 Jan 2020

Maybe an issue could be raised in https://github.com/STMicroelectronics/STM32CubeF4

@jeromecoutant Was this created there? Just checking if this can be fixed in near future.

0xc0170 on 24 Feb 2020

@se7ensong Thanks for the report. Would you be able to share how to reproduce the issue? Your app might do different things but end with the same result as this issue, so it would be good to have the steps to reproduce it. Do you have a code snippet that would allow us to reproduce it locally?

0xc0170 on 24 Feb 2020

@0xc0170 , I will try to get a minimal example for you as I cannot use the exact code I am currently working on. FYI, I am running this on ODIN-EVK-UBLOX-W2 and will try 5.15.0 instead of 5.15.1.

se7ensong on 24 Feb 2020

👍1

@ARMmbed/team-st-mcd has there been any update? Was this shared with the Cube4 team ?

0xc0170 on 26 Feb 2020

How is this defect reproduced ?

mbed import https://github.com/armmbed/mbed-cloud-client-example (4.2.1 version with Mbed OS 5.15.0).
mbed compile -m NUCLEO_F429ZI --profile debug

Crashes immediately on application start.

I am sorry but I couldn't reproduce the crash...

console.log

mbed import https://github.com/armmbed/mbed-cloud-client-example -v
cd mbed-cloud-client-example
<update mbed_cloud_dev_credentials.c>
mbed compile -t GCC_ARM -m NUCLEO_F429ZI -v --profile debug -f

jeromecoutant on 3 Mar 2020

Hi @jeromecoutant,
Thanks for helping with the investigation. We are currently looking at this internally. Here are our findings so far:

We can reproduce a crash consistently.
The crash occurs with debug profile images, immediately at the boot-up stage.
The reset button cannot recover the target; a full power cycle is needed.
This crash seems related to pyOCD and how the tool flashes the image. There is no clear evidence yet that it is an issue for STM targets.

But based on the team's observations, there seems to be another crash, which occurs with develop profile builds and happens randomly after the target has booted up.
It is not conclusive whether these two crashes have the same cause. We haven't been able to reproduce the second type of crash reliably.

jamesbeyond on 4 Mar 2020

Hi,
I don't have too many details to disclose, but we are experiencing a very similar issue on our STM32L4 target. A few observations:

We have seen various behaviors when waking from sleep, depending on optimization. It seems to work with -O2, but with -Os it just fails to wake from sleep after being programmed with openOCD. With -flto it crashes trying to wake from sleep.
With -Os a power cycle will make it work as expected.
Reset button is not able to recover the target.
Replacing __WFI() with __NOP() as a workaround works.

chopbo on 4 Mar 2020

👍1

Please raise an issue in pyOCD

jeromecoutant on 5 Mar 2020

Any updates on this issue?

LarsTimm on 17 Mar 2020

@se7ensong and @chopbo did you use pyocd when flashing?

TuomoHautamaki on 20 Mar 2020

@TuomoHautamaki, no, I use IAR. I still see the issue sometimes, but cannot reproduce it 100% yet.

se7ensong on 20 Mar 2020

Just answering on @chopbo's behalf, as we work on the same project. We don't use pyOCD but flash via openOCD.

LarsTimm on 20 Mar 2020

👍2

@jamesbeyond have we made any progress since your last comment?

bulislaw on 20 Mar 2020

Thanks everyone for the reports, useful to have multiple records - to see the scope of this (multiple toolchain/debug tools and targets).

Please raise an issue in pyOCD

openOCD and IAR also show this, so either all three share the same bug or, rather, the problem is in this codebase.

0xc0170 on 23 Mar 2020

I am now trying to run my application on UBLOX_EVK_ODIN_W2 without the above workarounds. The error is not 100% reproducible, but once it occurs, a reset does not recover the board.

The latest finding is that I can get it working (without reprogramming etc.) by the following steps:

Unplug the dev board from the laptop. USB power is used.
Press and hold the reset button and reconnect the dev board to the laptop.
Only release the button when the dev board is successfully populated and you can visit the drive from the file browser
Press the reset button again to restart the program

So far, these steps work for me 100%.

se7ensong on 24 Mar 2020

👍1

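Thanks for the info @se7ensong. With further investigation, we believe we have found the cause:
it isn't related to flash, but reset is a trigger.

When pyOCD or another debug tool resets the target, it does not clear the DBGMCU_CR register. Sometimes, if the flashed image is built with the debug profile, it deliberately sets DBGMCU_CR to 0x7.

When this register is set, especially the DBG_SLEEP bit, and the target enters sleep mode, it crashes at the WFI instruction. For details, see the STM32F4 errata, chapter 2.1.3 - Debugging Sleep/Stop mode with WFE/WFI entry.

I tried our image with DBGMCU_CR cleared manually, and the crash is gone.

Also, based on the errata, two other conditions must be met to see the crash:

The number of wait states configured on the Flash interface is higher than 0
And the linker places WFE or WFI instructions on 4-byte aligned addresses
(0x080xx_xxx4)
Maybe these add some of the randomness in when we see the crash happen.

BTW, we tried the workaround of adding three NOPs after WFI, and that solution seems to work for us. If you can confirm whether it works for you or not, that would be great.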

jamesbeyond on 26 Mar 2020

👍5

Thank you @jamesbeyond ! I have now modified HAL_PWR_EnterSLEEPMode in the "stm32f4xx_hal_pwr.c". I will let you know if the bug happens anymore.

se7ensong on 26 Mar 2020

Awesome work @jamesbeyond.
I have just tried to clear the DBGMCU_CR register as the first thing in our main (On a STM32L4 board), and now our application works with -Os and -flto.
I will try with the three NOP also.

LarsTimm on 26 Mar 2020

👍1

Three times NOP in HAL_PWR_EnterSLEEPMode also works here.

LarsTimm on 26 Mar 2020

🎉1
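The "three NOPs after WFI" workaround reported above can be sketched as follows. This is an illustration under stated assumptions, not the content of the actual fix: `sleep_wfi_padded`, `CPU_WFI` and `CPU_NOP` are hypothetical names, and on a non-Cortex-M build the instructions are replaced by counters so the sequence can be checked on a host.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
/* Target build: emit the real instructions. */
#define ON_CORTEX_M 1
#define CPU_WFI() __asm volatile ("wfi")
#define CPU_NOP() __asm volatile ("nop")
#else
/* Host build: count calls instead of executing instructions. */
#define ON_CORTEX_M 0
static unsigned host_wfi_count;
static unsigned host_nop_count;
#define CPU_WFI() ((void)host_wfi_count++)
#define CPU_NOP() ((void)host_nop_count++)
#endif

/* Hypothetical sleep entry: WFI followed by three padding NOPs. */
static void sleep_wfi_padded(void)
{
    CPU_WFI();  /* wait for interrupt */
    CPU_NOP();  /* padding executed right after wake-up */
    CPU_NOP();
    CPU_NOP();
}

#if !ON_CORTEX_M
/* Host-only check: one WFI and three NOPs per call. */
static void host_check_padding(void)
{
    sleep_wfi_padded();
    assert(host_wfi_count == 1u);
    assert(host_nop_count == 3u);
}
#endif
```

In the HAL, the equivalent change would presumably go into the WFI branch of HAL_PWR_EnterSLEEPMode, using the CMSIS `__WFI()` and `__NOP()` intrinsics that the file already uses.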

Thank you @jamesbeyond ! I have now modified HAL_PWR_EnterSLEEPMode in the "stm32f4xx_hal_pwr.c". I will let you know if the bug happens anymore.

So far, during development, it hasn't happened once. Thank you so much for the fix!

se7ensong on 26 Mar 2020

ST_INTERNAL_REF 83447

jeromecoutant on 27 Mar 2020

The errata also specifies "if the application software disables the Prefetch queue". Have we done that? (Are the conditions all supposed to be "and"?)

kjbracey-arm on 30 Mar 2020

Fix is now on master, I'll close this as resolved

0xc0170 on 30 Mar 2020

PR that fixes this: https://github.com/ARMmbed/mbed-os/pull/12717 (for tracing purposes). It should make Mbed OS 5.15.2.