I'm developing a HomeKit IP camera based on various ESP32-CAM modules (in this particular case I am testing on ESP-EYE). Part of the functionality includes streaming live H.264 encoded video with RTP protocol. The protocol is based on UDP. At the time of streaming there is also at least one TCP connection open (for HomeKit management).
The problem I'm having is: when stream starts it obviously sends UDP packets (<1400 bytes) at some rate. Sometimes a after sending several dozens packets sendto() call fails with errno 12 (Out of memory) and after that the whole system becomes unresponsive over network and device needs to be rebooted. What's odd is that sometimes it works and I am able to start and stop streaming sessions multiple times.
It feels it's not only about UDP, it's a combination of factors (UDP + TCP + high load on CORE1 maybe?). Though sometimes it works just fine and it seems that it is somehow determined on first stream: if first stream works, subsequent streams will work as well. I tested it multiple times and when it does not hang on first stream, it works every stream. If it fails on first stream though, you can not continue testing as it becomes unresponsive over network.
Network calls do not fail with an error. Module continue to be accessible via network.
Network API call (sendto()) fails with errno 12 (Out of memory) and module stops responding via network.
There are no crashes per se, it's just returns error on API call and at the same time stops any (visible) network activity.
build folder (note this may contain all the code details and symbols of your project.)@maximkulkin Can you give me the wrong log?
Sure. I'm attaching full log of my device, but most of it is probably not interesting to you. The culprit is these two messages after which in a couple of seconds it hangs completely:
W (32998) wifi: alloc eb len=24 type=3 fail, heap:3004276
W (32998) wifi: m f null
Hi @maximkulkin ,
This looks very similar to #3592. The root cause is that Wi-Fi can't use all of the internal memory correctly, and a fix for this is being merged soon (the issue thread has some examples of fixes you can make in IDF, in the meantime). There is a secondary issue that this error message reports internal+external free heap, but only internal memory is full.
I am also seeing similar errors on ESP-WROVER + esp-idf v4.0 which I understand has this commit merged 02f6bc5 (amongst other fixes).
E (842801507) wifi: esf_buf: t=2 l=76 max:32, alloc:32 no eb, TXQ_BLOCK=0
W (842801507) wifi: alloc eb len=76 type=2 fail, heap:4235316
W (842801507) wifi: m f probe req l=0
...
I (842802602) DIAG: >>> Heap internal: 68536, spiram: 4175880 <<<
...
I've ruled out memory leaks on my end, as far as I am aware. There is ample internal and spiram heap available.
Can you clarify if there is a workaround for this and when a fix is expected please?
Hi @lerebel103 please help to provide following info:
menuconfig => Component Config => ESP32 Specific => SPI RAM Config => Try to allocate memories of WiFi and LWIP in SPIRAM firstly..... enabled?sure,
Here is the information you requested:
#ESP-IDF
$ git status
HEAD detached at v4.0
nothing to commit, working tree clean
$ git show-ref -s v4.0
f9cb434eeb90be4f34893a278d0a70a8b21e4ca2
a4009277a7b33d2af1cc721302cf0dd8a0b23091
Hi @lerebel103 ,
The buffer for probe req is allocated from internal memory, whether it can be allocated successful depends on two factors:
Could you tell me in what scenario this issue happen?
Thanks @liuzfesp,
This is very informative. Right, as you say, it sounds as though the 32 allocated buffers aren't recycled if the WiFi hardware can't send these. If this is the case I presume CONFIG_WIFI_LWIP_ALLOCATION_FROM_SPIRAM_FIRST=y would probably not be the ultimate fix. What do you think?
I can provide some background on this specific app. It actually spends most of its time light sleeping( esp_pm_config_esp32_t.light_sleep_enable = true and esp_wifi_set_ps(WIFI_PS_MAX_MODEM)) and maintains a single TCP connection in this state. This seems to work pretty well almost indefinitely. The device occasionally wakes up (light_sleep_enable = false and WIFI_PS_NONE) and sends/receives a few payloads of around 4kb or so tops, most are probably under 1Kb. Other than TCP keep-alive, I don't believe much data, if any, is Rx/Tx whilst the device is in light sleep.
Can you think of a scenario where Light Sleeping and/or Radio power states could be a source of error? I haven't been able to trap this error as it happens for the very first time to see if it is during a light sleep or active phase.
So if the 32 buffers are getting exhausted, even with low throughput, what would be the best way of recovering from this situation in your opinion?
HI @lerebel103,
Yes, you are right, configuring CONFIG_WIFI_LWIP_ALLOCATION_FROM_SPIRAM_FIRST=y doesn't fix your issue.
The issue has no relationship with TCP data, because TCP data is allocated from different WiFi Buffer (the WiFi Buffer for WiFi Data). The error log E (842801507) wifi: esf_buf: t=2 l=76 max:32, alloc:32 no eb, TXQ_BLOCK=0 indicates that the WiFi buffer type t=2, which is only for WiFi Management Data, is out of memory. Generally speaking, only the WiFi Driver sends WiFi Management Data except that the application calls esp_wifi_80211_tx().
According to the log W (842801507) wifi: m f probe req l=0, when the problem happens, the ESP32 Station is sending a probe request frame, it only happens when the ESP32 Station is scanning or when ESP32 Station is disconnected because beacon timeout. So I suspect that issue may happen in following scenario:
As for the opinion to solve this issue, I think we should find the root cause first. Could you help to provide the whole log for me to do further analysis?
Thanks @liuzfesp,
Sure, happy to help and thanks for the info on management buffers, that's great.
It may be that W (842801507) wifi: m f probe req l=0 seen in the log snippet is coincidental rather than causal, at that point network had dropped off for some time and the device is trying to reconnect at regular intervals on SYSTEM_EVENT_STA_DISCONNECTED with esp_wifi_connect().
I can setup a continuous test to this effect. Can you let me know what log levels and from which components you need these and anything else in particular in terms of event logging and I'll work something out. I have seen this error come up anywhere from a few hours of uptime to weeks.
As for the opinion to solve this issue, I think we should find the root cause first. Could you help to provide the whole log for me to do further analysis?
Sure, I was asking in your opinion, what is an immediate workaround we can implement. I am not sure there is a way to detect that all the management buffers are exhausted, so have elected to simply issue esp_restart() if WiFi fails to reconnect after a period of time - which is not ideal but better than having devices drop off the network forever.
Thanks.
HI @lerebel103 ,
I suspect it's caused by a WiFi internal bug which is fixed by below commit:
d2af36c9fb5bfa26dc8eb9b2ae59a4bdf3c9b205
Please update your IDF to include above commit and do a test.
As for the workaround, the esp_restart() or esp_wifi_deinit()/esp_wifi_stop() maybe the candidates, but they are not good.
Ok, great, have just pulled the tip of release/v4.0 as hash 718af6ed0773df8ee372fd147871b21f6ca860e7 and will see what happens.
@maximkulkin @lerebel103 Thanks for reporting and sharing the updates. Would you please help share if any more updates about this issue? Thanks.
Hi @Alvin1Zhang,
No problem, I have had good results and not seen this issue since updating esp-idf to the above mentioned hash, or forward of it.
Hope this helps.
Will.
@lerebel103 Thanks for sharing the updates, feel free to reopen if the issue still happens. Thanks.
Most helpful comment
Hi @maximkulkin ,
This looks very similar to #3592. The root cause is that Wi-Fi can't use all of the internal memory correctly, and a fix for this is being merged soon (the issue thread has some examples of fixes you can make in IDF, in the meantime). There is a secondary issue that this error message reports internal+external free heap, but only internal memory is full.