Esp-idf: [BUG] send() sometimes gets stuck ? (IDFGH-461)

Created on 19 Nov 2018  路  21Comments  路  Source: espressif/esp-idf

As i have tested tcp client example i ran it for a couple of hours until send() got stuck and didn't return any value

I'm calling it this way : send(sock, data_to_send, strlen(data_to_send), 0);

This bug is repreducable

All 21 comments

Hi @Lir10 , could you help to provide more information about this issue, such as:

  • IDF commit ID
  • Log
  • How long does TCP client get stuck? Does it recoverable?

Hi @liuzfesp ,

* IDF commit ID

fb7ba1b

* Log

I can only see that the task is stuck - nothing special in the log
Also, if i'm adding prints after send i can't see them since it got stuck on send() function

* How long does TCP client get stuck?  Does it recoverable?

if i don't reboot the ESP it stays like that indefinitely

Hi @Lir10 I don't think the task is stuck on send(), look at the code in tcp_client_task()::

     while (1) {
        int err = send(sock, payload, strlen(payload), 0);
        if (err < 0) {
            ESP_LOGE(TAG, "Error occured during sending: errno %d", errno);
            break;
        }
       int len = recv(sock, rx_buffer, sizeof(rx_buffer) - 1, 0);
        // Error occured during receiving
        if (len < 0) {
            ESP_LOGE(TAG, "recv failed: errno %d", errno);
            break;
        }
        // Data received
        else {
            rx_buffer[len] = 0; // Null-terminate whatever we received and treat like a string
            ESP_LOGI(TAG, "Received %d bytes from %s:", len, addr_str);
            ESP_LOGI(TAG, "%s", rx_buffer);
        }
        vTaskDelay(2000 / portTICK_PERIOD_MS);
    }

The task may be blocked by recv() just because the timeout is not configured for it and there is nothing to receive. If so, it's expected behavior.

Hi @liuzfesp

The task looks stuck on send() - I have added prints before and after send().

I can see the print before send()but not after - i'm sure it's stuck.

Also, i'm sending and receiving without any problem - it works like that for hours until it gets stuck

Hi @Lir10

I have tested with the commit fb7ba1b for hours, but I did not reproduce your problem.
I use the tcp_client example, sending and receving data with the tcp server, it works well all the time.
Do you still have this problem?

Hi @zhangyanjiaoesp

Thanks for checking it out
Currently I can still see the problem

As a workaround i'm rebooting the esp if the task gets stuck - it's working fine for hours until it happens

@Lir10
We will debug the problem as soon as possible

@Lir10
Test tcp_client example, hang up for 2 days, no recurrence problem,do you still have the problem

Hello, I am encountering a similar issue.

I am running a WROVER device - ESP32D0WDQ5 (revision 1), ESP-IDF official version 3.3.1.

When sending a large data buffer (> 1MB) from PSRAM every 30 seconds to a server which resides on an UBUNTU machine I get this behavior after a couple of iterations.

I also used the tcp-client example, modified for this scenario.

This is the modified tcp_client_task() function:

#define MAX_SEND_SIZE        CONFIG_TCP_SND_BUF_DEFAULT
#define PSRAM_START_ADDR     (0x3F800000)
#define LARGE_DATA_START     PSRAM_START_ADDR
#define LARGE_DATA_SIZE      (1295701)

static int send_large_data(int sock, uint8_t *buffer, size_t size)
{
    size_t remaining_size = size;
    size_t total_sent = 0;
    uint8_t *curr = buffer;

    ESP_LOGW(TAG, "Going to send total of %u bytes", size);
    while (remaining_size > 0) {
        size_t send_size = (remaining_size > MAX_SEND_SIZE) ? MAX_SEND_SIZE : remaining_size;
        int sent = send(sock, curr, send_size, 0);
        if (sent < 0) {
            ESP_LOGE(TAG, "Error occurred during sending: errno %d", errno);
            return -1;
        }
        remaining_size -= sent;
        total_sent += sent;
        ESP_LOGW(TAG, "Sent %u bytes from 0x%08x (%u/%u)", sent, (uint32_t)curr, total_sent, size);
        curr += sent;
    }
    return total_sent;
}

static void tcp_client_task(void *pvParameters)
{
    char rx_buffer[128];
    char addr_str[128];
    int addr_family;
    int ip_protocol;

    while (1) {
#ifdef CONFIG_EXAMPLE_IPV4
        struct sockaddr_in destAddr;
        destAddr.sin_addr.s_addr = inet_addr(HOST_IP_ADDR);
        destAddr.sin_family = AF_INET;
        destAddr.sin_port = htons(PORT);
        addr_family = AF_INET;
        ip_protocol = IPPROTO_IP;
        inet_ntoa_r(destAddr.sin_addr, addr_str, sizeof(addr_str) - 1);
#else // IPV6
        struct sockaddr_in6 destAddr;
        inet6_aton(HOST_IP_ADDR, &destAddr.sin6_addr);
        destAddr.sin6_family = AF_INET6;
        destAddr.sin6_port = htons(PORT);
        addr_family = AF_INET6;
        ip_protocol = IPPROTO_IPV6;
        inet6_ntoa_r(destAddr.sin6_addr, addr_str, sizeof(addr_str) - 1);
#endif

        int sock =  socket(addr_family, SOCK_STREAM, ip_protocol);
        if (sock < 0) {
            ESP_LOGE(TAG, "Unable to create socket: errno %d", errno);
            break;
        }
        ESP_LOGI(TAG, "Socket created");

        int err = connect(sock, (struct sockaddr *)&destAddr, sizeof(destAddr));
        if (err != 0) {
            ESP_LOGE(TAG, "Socket unable to connect: errno %d", errno);
            break;
        }
        ESP_LOGI(TAG, "Successfully connected");

        while (1) {
            int err = send_large_data(sock, (uint8_t *)LARGE_DATA_START, LARGE_DATA_SIZE);
            if (err < 0) {
                ESP_LOGE(TAG, "Error occured during sending: errno %d", errno);
                break;
            }
            ESP_LOGI(TAG, "Going to sleep for 30 seconds");
            vTaskDelay(30000 / portTICK_PERIOD_MS);
        }

        if (sock != -1) {
            ESP_LOGE(TAG, "Shutting down socket and restarting...");
            shutdown(sock, 0);
            close(sock);
        }
    }
    vTaskDelete(NULL);
}

I use the default configuration, with only these changes in sdkconfig.defaults:

CONFIG_WIFI_SSID=<MY_SSID>
CONFIG_WIFI_PASSWORD=<MY_PASSWORD>
CONFIG_EXAMPLE_IPV4_ADDR=<MY_SERVER_IP>

CONFIG_SPIRAM_SUPPORT=y
CONFIG_SPIRAM_BOOT_INIT=y

In Wireshark on the server-side I can see that the device sent an out-of-order packet (packet 15110) which the server responds with a duplicate ack for the needed seq.num (packet 15111), then it is stuck for 20 minutes until the retransmission actually happen (packet 22681), but by this time the operation fails with NO ROUTE TO HOST error (I guess the routing table entry has aged by this time).

stuck_test

Logs from device:

rst:0x1 (POWERON_RESET),boot:0x32 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:6788
load:0x40078000,len:13232
load:0x40080400,len:7172
entry 0x40080778
I (73) boot: Chip Revision: 1
I (73) boot_comm: chip revision: 1, min. bootloader chip revision: 0
I (39) boot: ESP-IDF v3.3.1 2nd stage bootloader
I (39) boot: compile time 18:32:44
I (39) boot: Enabling RNG early entropy source...
I (43) boot: SPI Speed      : 40MHz
I (47) boot: SPI Mode       : DIO
I (51) boot: SPI Flash Size : 2MB
I (55) boot: Partition Table:
I (59) boot: ## Label            Usage          Type ST Offset   Length
I (66) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (74) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (81) boot:  2 factory          factory app      00 00 00010000 00100000
I (89) boot: End of partition table
I (93) boot_comm: chip revision: 1, min. application chip revision: 0
I (100) esp_image: segment 0: paddr=0x00010020 vaddr=0x3f400020 size=0x20a14 (133652) map
I (156) esp_image: segment 1: paddr=0x00030a3c vaddr=0x3ffb0000 size=0x030a4 ( 12452) load
I (161) esp_image: segment 2: paddr=0x00033ae8 vaddr=0x40080000 size=0x00400 (  1024) load
0x40080000: _WindowOverflow4 at /home/michaels/esp/esp-idf-v3.3.1/components/freertos/xtensa_vectors.S:1779

I (163) esp_image: segment 3: paddr=0x00033ef0 vaddr=0x40080400 size=0x0c120 ( 49440) load
I (193) esp_image: segment 4: paddr=0x00040018 vaddr=0x400d0018 size=0x70a10 (461328) map
0x400d0018: _stext at ??:?

I (355) esp_image: segment 5: paddr=0x000b0a30 vaddr=0x4008c520 size=0x08750 ( 34640) load
0x4008c520: prvCheckTasksWaitingTermination at /home/michaels/esp/esp-idf-v3.3.1/components/freertos/tasks.c:3507

I (383) boot: Loaded app from partition at offset 0x10000
I (383) boot: Disabling RNG early entropy source...
I (384) psram: This chip is ESP32-D0WD
I (390) spiram: Found 64MBit SPI RAM device
I (393) spiram: SPI RAM mode: flash 40m sram 40m
I (398) spiram: PSRAM initialized, cache is in low/high (2-core) mode.
I (405) cpu_start: Pro cpu up.
I (409) cpu_start: Application information:
I (414) cpu_start: Project name:     esp-idf
I (419) cpu_start: App version:      1
I (423) cpu_start: Compile time:     May  5 2020 18:32:42
I (429) cpu_start: ELF file SHA256:  e81b19025ddbc8a2...
I (435) cpu_start: ESP-IDF:          v3.3.1
I (440) cpu_start: Starting app cpu, entry point is 0x400813d8
0x400813d8: call_start_cpu1 at /home/michaels/esp/esp-idf-v3.3.1/components/esp32/cpu_start.c:269

I (0) cpu_start: App cpu up.
I (1331) spiram: SPI SRAM memory test OK
I (1332) heap_init: Initializing. RAM available for dynamic allocation:
I (1332) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
I (1338) heap_init: At 3FFB90B0 len 00026F50 (155 KiB): DRAM
I (1344) heap_init: At 3FFE0440 len 00003AE0 (14 KiB): D/IRAM
I (1351) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
I (1357) heap_init: At 40094C70 len 0000B390 (44 KiB): IRAM
I (1364) cpu_start: Pro cpu start user code
I (1368) spiram: Adding pool of 4096K of external SPI memory to heap allocator
I (48) cpu_start: Starting scheduler on PRO CPU.
I (0) cpu_start: Starting scheduler on APP CPU.
I (49) spiram: Reserving pool of 32K of internal memory for DMA/internal allocations
I (109) wifi: wifi driver task: 3ffc8c9c, prio:23, stack:3584, core=0
I (109) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
I (109) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
I (149) wifi: wifi firmware version: ac331d7
I (149) wifi: config NVS flash: enabled
I (149) wifi: config nano formating: disabled
I (149) wifi: Init dynamic tx buffer num: 32
I (149) wifi: Init data frame dynamic rx buffer num: 32
I (159) wifi: Init management frame dynamic rx buffer num: 32
I (159) wifi: Init management short buffer num: 32
I (169) wifi: Init static tx buffer num: 16
I (169) wifi: Init static rx buffer size: 1600
I (169) wifi: Init static rx buffer num: 10
I (179) wifi: Init dynamic rx buffer num: 32
I (179) example: Setting WiFi configuration SSID miki_hadar...
I (279) phy: phy_version: 4102, 2fa7a43, Jul 15 2019, 13:06:06, 0, 0
I (279) wifi: mode : sta (30:ae:a4:df:10:fc)
I (289) example: Waiting for AP connection...
I (289) example: SYSTEM_EVENT_STA_START
I (409) wifi: new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (1539) wifi: state: init -> auth (b0)
I (1559) wifi: state: auth -> assoc (0)
I (1569) wifi: state: assoc -> run (10)
I (1629) wifi: connected with miki_hadar, aid = 1, channel 1, BW20, bssid = 10:5a:f7:20:1e:84
I (1629) wifi: security type: 3, phy: bgn, rssi: -59
I (1639) wifi: pm start, type: 1

I (1699) wifi: AP's beacon interval = 102400 us, DTIM period = 1
I (2599) event: sta ip: 10.100.102.19, mask: 255.255.255.0, gw: 10.100.102.1
I (2599) example: SYSTEM_EVENT_STA_GOT_IP
I (3099) example: SYSTEM_EVENT_STA_GOT_IP6
I (3099) example: IPv6: FE80::32AE:A4FF:FEDF:10FC
I (3099) example: Connected to AP
I (3099) example: Socket created
I (3449) example: Successfully connected
W (3449) example: Going to send total of 1295701 bytes
W (3449) example: Sent 5744 bytes from 0x3f800000 (5744/1295701)
W (3529) example: Sent 5744 bytes from 0x3f801670 (11488/1295701)
W (3559) example: Sent 5744 bytes from 0x3f802ce0 (17232/1295701)
W (3609) example: Sent 5744 bytes from 0x3f804350 (22976/1295701)
...
...
W (6529) example: Sent 5744 bytes from 0x3f93a200 (1292400/1295701)
W (6539) example: Sent 3301 bytes from 0x3f93b870 (1295701/1295701)
I (6539) example: Going to sleep for 30 seconds
...
...
W (1261749) example: Sent 5744 bytes from 0x3f87ccf0 (516960/1295701)
W (1261749) example: Sent 5744 bytes from 0x3f87e360 (522704/1295701)
W (1261799) example: Sent 5744 bytes from 0x3f87f9d0 (528448/1295701)
W (1261819) example: Sent 5744 bytes from 0x3f881040 (534192/1295701)
W (1262099) example: Sent 5744 bytes from 0x3f8826b0 (539936/1295701)
W (1264899) example: Sent 5744 bytes from 0x3f883d20 (545680/1295701)
W (1295539) example: Sent 5744 bytes from 0x3f885390 (551424/1295701)
E (2446359) example: Error occurred during sending: errno 113
E (2446359) example: Error occured during sending: errno 113
E (2446359) example: Shutting down socket and restarting...

hi @mikisch81
I will use your code to reproduce this problem. Thank you for your feedback

Hi @freakyxue did you manage to reproduce it?

hi @mikisch81
LWIP_IP_FRAG
Is this enabled in menuconfig????

@freakyxue No, it is not set, My sdkconfig.defaults looks like this:

CONFIG_WIFI_SSID=<MY_SSID>
CONFIG_WIFI_PASSWORD=<MY_PASSWORD>
CONFIG_EXAMPLE_IPV4_ADDR=<MY SERVER_IP>

CONFIG_SPIRAM_SUPPORT=y
CONFIG_SPIRAM_BOOT_INIT=y

So the generated sdkconfig does not set LWIP_IP_FRAG.

make menuconfig
Component config-->LWIP->ENABLE fragment outgoing ip packets

Is this what is causing the issue?

yes , Enable this option, test it, thank you

No, it didn't help

hi @mikisch81
I didn't reproduce the problem locally,
From the log view, the error "route problem" is returned, which leads to the sending failure, right?
If so, I think during the transmission process, WiFi may be disconnected.

@freakyxue If you are referring to these lines:

W (1295539) example: Sent 5744 bytes from 0x3f885390 (551424/1295701)
E (2446359) example: Error occurred during sending: errno 113

Then how would you explain the huge time difference between them?

hi @mikisch81
Because the length of each transmission you send is equal to TCP_ Wind, after sending data, you need to wait for TCP_ Ack to release TCP_ Wind, can carry on data transmission

The last log shows that the sending data stops. I think the device end does not receive the ACK sent by the service end, or this loss causes the send_ Wind was not released, resulting in the send function never returning

I recommend that you do not send data each time with a length equal to TCP_ Wind, the best length is equal to TCP_ MSS

Thanks for reporting, feel free to reopen.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

hex007 picture hex007  路  4Comments

LosDeiblos picture LosDeiblos  路  4Comments

howroyd picture howroyd  路  3Comments

beriberikix picture beriberikix  路  3Comments

kylefelipe picture kylefelipe  路  3Comments