Zephyr: eth: stm32h747i_disco: sem timeout and hang on debug build

Created on 10 Nov 2020 · 14Comments · Source: zephyrproject-rtos/zephyr

Describe the bug
When building the dumb http server with CONFIG_DEBUG enabled, for the stm32h747i_disco_m7 board, we very fast get a semaphore timeout on waiting for transmission complete callback from ST HAL layer. We can see

[00:00:02.719,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:02.719,000] <err> eth_stm32_hal: eth packet timeout
98 fa 9b 39 67 d7 02 80  e1 67 51 79 08 00 45 00 |...9g... .gQy..E.
00 28 00 00 00 00 40 06  64 a8 a9 fe 61 15 a9 fe |.(....@. d...a...
61 16 1f 90 ed 02 04 81  fc b9 99 ef e4 5e 50 10 |a....... .....^P.
05 00 08 90 00 00                                |......

Often it hangs and doesn't recover. Build without CONFIG_DEBUG and it works flawless. Increasing semaphore timeout time doesn't do any difference.
What have you tried to diagnose or workaround this issue?
With instruction cache disabled, it works flawlessly with CONFIG_DEBUG enabled. I managed to track it down to
modules/hal/stm32/stm32cube/stm32h7xx/drivers/src/stm32h7xx_hal_eth.c: 2979. If we enable instruction cache before this line, timeout. If we enable it after, it works. If we insert a data barrier after, it works:

/* Mark it as LAST descriptor */
SET_BIT(dmatxdesc->DESC3, ETH_DMATXNDESCRF_LD);
__DSB();

I guess this is a fix, but really not that nice to be messing around in ST's HAL, and I'm also wondering if this fix actually fixes a problem we are causing in the driver. We might be doing something wrong in our stm32h7 driver? Comparing with ST samples for STM32H743 it looks correct. I didn't find anybody else having this issue with ST's HAL on stm32h747i MCU.
I've seen some people from ST contributing here, maybe somebody can take a look at this?
Messing with buffer alignments doesn't have any effect either; I tried alignment 256 bytes, with confirmation it was aligned.
Please note that this issue was seen when making this driver as well (https://github.com/zephyrproject-rtos/zephyr/pull/27188#discussion_r475380472)

To Reproduce
Steps to reproduce the behavior:

add to /boards/arm/stm32h747i_disco/stm32h747i_disco.dtsi

&mac {
    status="okay";
    pinctrl-0 = <&eth_tx_en_pg11 &eth_txd1_pg12
                 &eth_txd0_pg13 &eth_mdc_pc1
                 &eth_mdio_pa2 &eth_ref_clk_pa1
                 &eth_crs_dv_pa7 &eth_rxd0_pc4
                 &eth_rxd1_pc5>;
};

add to samples/net/sockets/dumb_http_server/prj.conf

CONFIG_NET_L2_ETHERNET=y
CONFIG_ETH_STM32_HAL=y
CONFIG_DEBUG=y

west build -b stm32h747i_disco_m7 zephyr/samples/net/sockets/dumb_http_server/
west flash
Open COM port and do repeated http requests to devkit over ethernet, for example ab -n 100 -c 1 http://192.0.2.1:8080/
The error will be printed in the terminal instantly, and it will hang subsequent transmissions

Expected behavior
Ethernet tx complete semaphore should not timeout. Temporary fix proves this is possible.

Impact
Showstopper if it hangs, which it appears to do. We realize it only happens when building for debug, but it's not sustainable to not be able to debug properly. Something's wrong

Logs and console output
Please note I am using different IP's than example.

*** Booting Zephyr OS build zephyr-v2.4.0-1314-gc3ac3027a17a  ***
[00:00:00.006,000] <inf> net_config: Initializing network
[00:00:00.006,000] <inf> net_config: Waiting interface 1 (0x24004ae4) to be up...
Single-threaded dumb HTTP server waits for a connection on port 8080...
[00:00:02.000,000] <inf> net_config: Interface 1 (0x24004ae4) coming up
[00:00:02.000,000] <inf> net_config: IPv4 address: 169.254.97.21
[00:00:02.991,000] <err> eth_stm32_hal: HAL_ETH_TransmitIT tx_int_sem take timeout
[00:00:02.991,000] <err> eth_stm32_hal: eth packet timeout
98 fa 9b 39 67 d7 02 80  e1 97 e2 22 08 06 00 01 |...9g... ..."....
08 00 06 04 00 02 02 80  e1 97 e2 22 a9 fe 61 15 |........ ..."..a.
98 fa 9b 39 67 d7 a9 fe  61 16                   |...9g... a.      
Connection #0 from 169.254.97.22
Connection from 169.254.97.22 closed

Environment (please complete the following information):

OS: Ubuntu 20.04
Toolchain Zephyr SDK
Commit SHA or Version used: c3ac3027a17a89fd30863db818ab320e5bf8b14e (v2.4.99)

Additional context
Ethernet cable connected straight to computer

Drivers Ethernet bug STM32 low

Source

emillindq

Most helpful comment

@emillindq Have you already created PR in stm32_hal repo? I can confirm, that __DSB() approach (no matter if in HAL or driver without CONFIG_DEBUG) increases stability with my board and samples. I have observed it with my latest CivetWeb update (currently as PR).

https://github.com/STMicroelectronics/STM32CubeH7/issues/95
I just did!

emillindq on 23 Nov 2020

👍2

All 14 comments

Hey @emillindq,

Thank you for your great job and finding this bug and another big thanks for figuring out the solution. I can see same behavior on my nucleo_h745zi_q board while compiling it for M7 core.

In my Embedded Development experience I had already same situation with stm32h7 SoC series: It was just saving a byte of data into some on start non erasable registers, which were defined by the system design. The problem was, that writing a byte does not triggers exact writing to corresponding registers, until __DSB() were executed. Digging deeper into it, showed, that CPU writes this memory, only as it accumulates 4 bytes (e.g. 32 bit - bus width of M7 core). Thus __DSB() helped a lot at that time.

I can only approve your solution. Moreover, it appears with CONFIG_DEBUG=y, cause some number of commands may be not optimized away (see this).

It is also known, that ST's dirvers can behave buggy under special circumstances (@erwango :eyes: ). But ST seems to do a good job fixing it continuously.

I would not recomend adding __DSB() to stm32h7xx_hal_eth.c: 2979, since it may impact other HAL driver users. I've also tested adding __DSB() to eth_stm32_hal.c: 234 right after hal_ret = HAL_ETH_Transmit_IT(heth, &tx_config); and it also solved the problem. From my side it would be preferable solution. I would be happy if you @emillindq can create a PR with this solution, since it is mostly your merit, figuring out this bug.

Edit: Also putting SCB_InvalidateICache() instead of __DSB() worked for me. I would prefer this solution even more.

P.S.: @emillindq If you like you can read on example of D-Cache, why such things are happening.

Nukersson on 10 Nov 2020

Thank you for the encouragement :smiley:

Edit: Also putting SCB_InvalidateICache() instead of __DSB() worked for me. I would prefer this solution even more.

At first sight it seems like SCB_InvalidateICache() works but I am able to get it to timeout, albeit a lot harder. It appears to only happen just after reset; on average 1/5 times. I am running ab test while I'm resetting and it happens just after network comes up. Increasing timeout to 100ms didn't help. It's not like that using __DSB() in the HAL. Trying with __DSB() after HAL_ETH_Transmit_IT makes it hang completely. The same is true for __ISB().

I ran four tests (hint: case 3 & 4 interesting)
timeout default (20ms), test duration 5min, reset every 7s, running ab test continuously making GET requests**

__DSB() in HAL.
Result: About 700 requests made between reset, no hangs, no sem timeouts, 30 000 connections in total
SCB_InvalidateICache() in eth_stm32_hal.c after HAL_ETH_Transmit_IT().
Result: About 700 requests between reset, 45 resets and ten sem timeouts, no hangs, about 30 000 connections in total. The timeout exclusively happens just after boot.
__DSB() before and after HAL_ETH_Transmit_IT().
Result: Same result as 1.
__ISB() before and __DSB() after HAL_ETH_Transmit_IT().
Result: Same result as 1.

With nothing after HAL_ETH_Transmit_IT(), always failure in all cases.

So now we have a solution (3 or 4) that doesn't involve messing around in the HAL; however it just feels a bit ghetto. I don't know why it suddenly works with a sync barrier before and after function call. Is this good enough?

emillindq on 11 Nov 2020

👍1

@emillindq Good idea with tests. @erwango Would we be able integrate this tests into Zephyrs CI?

Even if solutions 3 and 4 are not optimal, they are better than changing the HAL layer itself. For the first the solution will be good enough.

Of course you can write your own functions instead ones of ST' HAL (see it here). But, again you will probably find other bugs and the solutions 3 and 4 are good enough.
In the end wins the driver implementation, which is most intensively tested and works flawlessly.

P.S.: It would be also interesting to declare HAL_ETH_Transmit_IT() section as threads critical region. Can't find currently Zephyrs critical section declaration from my smartphone.

Nukersson on 11 Nov 2020

I ran the ab loop with test setup 3 for a longer time and it went on for almost 1 million connections without sem timeout or hang. I will make a PR with this change so we have something to look at.

emillindq on 11 Nov 2020

👍2

To answer your last thought, I tried to wrap HAL_ETH_Transmit_IT() around both k_sched_(un)lock() and __dis(en)able_irq(), still with timeout and hang.

emillindq on 11 Nov 2020

🚀1 👍1

Thank you @emillindq and @Nukersson for figuring all of this out !
I don't have much experience with cache on H7 series, but it looks like it might be an issue in the future if we have to sync CACHE data each time we are dealing with RAM descriptors for peripherals and not registers.

Seems like you found out that if the cache data is not in sync with RAM when calling DMA (which is accessed using registers this time, so no caching is done right ?), this can lead to mismatch in configuration data and in the end, Ethernet IP is not set to fire the TX complete interrupt.

I never thought that this would be a real problem until now so thanks for the discovery 👍

Also great thanks for the fix ! Will try it out as soon as possible.

lochej on 11 Nov 2020

👍2

As mentioned in the PR, still issues with CONFIG_NO_OPTIMIZATIONS

emillindq on 12 Nov 2020

@emillindq, does the initial fix in ST HAL that you mentioned earlier actually fixes the issue in all cases ?

If there's an actual bug in the HAL driver, it is not excluded to fix it there directly.
You can report the issue there: https://github.com/STMicroelectronics/STM32CubeH7 (you can also directly propose a PR to fix it). The only issue with putting fixes in Zephyr replication of HAL is that they need to be maintained, which becomes a heavy process as far as the list grows.
If the issue is acknowledged in the official repo, we can put a temporary fix in the Zephyr HAL replication waiting the official delivery.

erwango on 12 Nov 2020

👍1

@emillindq, does the initial fix in ST HAL that you mentioned earlier actually fixes the issue in all cases ?

If there's an actual bug in the HAL driver, it is not excluded to fix it there directly.
You can report the issue there: https://github.com/STMicroelectronics/STM32CubeH7 (you can also directly propose a PR to fix it). The only issue with putting fixes in Zephyr replication of HAL is that they need to be maintained, which becomes a heavy process as far as the list grows.
If the issue is acknowledged in the official repo, we can put a temporary fix in the Zephyr HAL replication waiting the official delivery.

@erwango yes adding __DSB() in the HAL fixes it in all cases. Let's see what happens with the PR here and I'll create an issue/PR in official ST repo if required.

emillindq on 13 Nov 2020

@erwango yes adding __DSB() in the HAL fixes it in all cases.

Ok, thanks for confirming.

Let's see what happens with the PR here and I'll create an issue/PR in official ST repo if required.

Well, before merging the PR here, we need to be clear on the strategy vs HAL. So please address the HAL official issue/PR first.

erwango on 13 Nov 2020

@emillindq Have you already created PR in stm32_hal repo? I can confirm, that __DSB() approach (no matter if in HAL or driver without CONFIG_DEBUG) increases stability with my board and samples. I have observed it with my latest CivetWeb update (currently as PR).

Nukersson on 22 Nov 2020

@emillindq Have you already created PR in stm32_hal repo? I can confirm, that __DSB() approach (no matter if in HAL or driver without CONFIG_DEBUG) increases stability with my board and samples. I have observed it with my latest CivetWeb update (currently as PR).

https://github.com/STMicroelectronics/STM32CubeH7/issues/95
I just did!

emillindq on 23 Nov 2020

👍2

Hi!

I also ran into this problem and I noticed that the DMA descriptors and RX buffers in eth_stm32_hal.c are placed in the default system RAM. Wouldn't it be more robust in general to configure the memory as described here, instead of introducing additional memory barriers?

reloZid on 27 Nov 2020

Hi!

I also ran into this problem and I noticed that the DMA descriptors and RX buffers in eth_stm32_hal.c are placed in the default system RAM. Wouldn't it be more robust in general to configure the memory as described here, instead of introducing additional memory barriers?

Absolutely agree with you. Thank you for this notice. If you try to run nucleo_h743zi or nucleo_h745zi_q boards with sram1 istead of sram0 you'll see more responsive Ethernet. Current problem is not in h7 series, but in Zephyr itself, which does not allow (cry if it is not!) putting multiple none-consequtive SRAM regions (sram0, sram1, etc.) in one build together. I've already asked it here: @erwango @lochej @gmarull - please take a look.

@reloZid Does your described solution helped you?

@emillindq Could you change sram0 in zephyr/dts/arm/st/h7/stm32h747Xi_m7.dtsi to