Mbed-os: Ticker causes big delays in TCPSocket::recv

Created on 10 Jan 2018 · 15 comments · Source: ARMmbed/mbed-os

Description

  • Type: Bug | Question
  • Priority: Major

Bug

Target
NUCLEO_F429ZI

Toolchain:
GCC_ARM

mbed-os
5.6.2, 5.7.1

Expected behavior

  • TCPSocket::recv is consistent in time
  • detaching Ticker should return system to correct function

Actual behavior

  • when a Ticker is attached with a function that blocks for a few microseconds (20 µs or more, every 1 ms), the socket slows down
  • this happens after a few minutes of operation (not more than 2 minutes); until this critical time the system works as expected
  • unexpected behavior: reading segments of a packet with TCPSocket::recv returns NSAPI_ERROR_WOULD_BLOCK (there are delays of 300 ms or more between segments, up to seconds)
  • once this critical time is reached, detaching, deleting, or re-attaching the Ticker with different options (for example, blocking for 1 µs every 1 s) doesn't restore the system to its original state (the only way is to restart the MCU)

Steps to reproduce

  1. run a TCP server on your PC (e.g. SocketTest) on port 61112
  2. compile and run the code below (replace the IP address with your PC's IP address)
#include "mbed.h"
#include "EthernetInterface.h"

TCPSocket socket;
EthernetInterface eth;

#define WITH_TICKER 1

void tick_func(){
    wait_us(90);
    return;
}



int main()
{
    eth.connect();
#if WITH_TICKER
    Ticker tick;
    tick.attach(&tick_func,0.001);
#endif
    socket.open(&eth);
    socket.connect("192.168.67.38", 61112);
    char sbuffer[] = "hello!\r\n";
    int scount = socket.send(sbuffer, sizeof sbuffer);
    printf("sent %d [%.*s]\r\n", scount, (int)(strstr(sbuffer, "\r\n") - sbuffer), sbuffer);

    char rbuffer[2048];
    Timer tim;
    bool first=true;

    while(true){

        int rcount = socket.recv(rbuffer, sizeof rbuffer);
        if(rcount>0){
            if(first){
                tim.reset();
                tim.start();
                first=false;
            }
            if(rcount != 536){
                first=true;
                tim.stop();
            }
#if WITH_TICKER
            if(tim.read_ms() > 300){
                printf("detaching\n");
                tick.detach();
            }
#endif
            printf("recv %d in %d\r\n", rcount,tim.read_ms());
        }else{
            Thread::wait(10);
        }

    }

}
  3. after the NUCLEO connects to your TCP server, send some packets (1100 B, so there are three segments of data)
  4. the output should be something like this:
sent 9 [hello!]
recv 536 in 0
recv 536 in 1
recv 230 in 2
  5. wait a while and send the same packet again; now the response is as follows (reading the whole packet took more than 300 ms, so the Ticker was detached):
recv 536 in 0
recv 536 in 9
detaching
recv 230 in 539
  6. send the same packet again to see whether detaching helped
  7. you can recompile the whole code with #define WITH_TICKER 0 to confirm that the problem is caused by the attached Ticker (the issue should not reproduce)

Note:
The code just simulates the real situation (some pin actions in the Ticker); I don't want to use any waits in the Ticker.


Question

Am I doing something wrong? I can accept that 10% CPU usage by the Ticker function is a bit much, but in my opinion detaching the Ticker should bring the system back to a correct state. The other thing I don't understand is why the packet read delay starts after about 2 minutes and not immediately.



All 15 comments

@kjbracey-arm @mikaleppanen

I would hazard a guess that this is a problem in the Nucleo network driver triggered by an interrupt latency spike - if it doesn't respond fast enough to a packet-received interrupt, the driver state becomes inconsistent, e.g. a buffer index getting out of sync with the hardware.

Random large packet delays are generally a symptom of the driver and hardware ring-buffer pointers getting out of sync. Might I suggest CCing a Nucleo person?

@ARMmbed/team-st-mcd

Hi
Does adding a DMB, as in #5720, solve anything?
Regards,

Nice thought, but I doubt it's a related issue. Still, it's worth trying, just in case.

I didn't add the DMB to the other drivers because I didn't believe they'd achieve anything on pre-M7 devices, and was aiming for a quick urgent fix for the M7. But I'd say they should be there in all driver variants, just to make the code robust against future devices.

Hi @jeromecoutant, thanks for the suggestion. Unfortunately, adding those three lines to stm32f4xx_hal_eth.c doesn't seem to fix the issue.
Best regards

@martinnov71 Callbacks registered on Ticker instances are executed in handler mode (within an IRQ); depending on IRQ priorities this can prevent other interrupts from being served. Depending on the maximum interrupt latency tolerated by the peripherals or drivers present in the system, this may be an issue.

I would advise you to use the event queue facility to defer the work done in the ticker handler to thread mode.

@pan- - good advice in general, but in this case he's reporting 90 µs in the interrupt handler, which is within the 100 µs region I've always used as a rule of thumb for "tolerable IRQ latency" on a conventional system. I'd expect the network to cope with that sort of latency, and certainly not to suffer lasting harm once the handler's gone.

@pan- I agree that a thread-based solution would be better, but it can cause jitter in the function calls. I tried this, but for me it is a workaround, not a solution. I use the Ticker for display drawing, so I can clearly see flicker when I use the thread-based solution.
@kjbracey-arm Yeah, it seems that some pointers get confused in the EMAC driver or thereabouts. It looks like everything becomes delayed by one frame, so the first parts are OK, but the last one is delayed and waits for a retransmission, which makes the original frame available. But I have no idea how to debug/solve it.

I saw there is a branch of mbed-os called _feature-emac_. Is it possible that those changes could prevent getting into this inconsistent state?

Ah, we have been investigating various ST driver performance issues - see #6262. And one of the key fixes proposed there has made it onto the feature-emac branch, which is scheduled for merge into 5.9.

https://github.com/ARMmbed/mbed-os/pull/6425

That's the change that makes the STM driver process multiple packets per interrupt instead of assuming a single one. That driver flaw would be consistent with the problem you're seeing, since it makes the driver vulnerable to interrupt latency.

We don't currently have the equivalent fix scheduled for a patch release - maybe @pauluap would care to submit his? It would be superseded by the feature-emac branch, though. Or adapt the commit from feature-emac?

@martinnov71, I've isolated the fixes that kjbracey was talking about on the https://github.com/pauluap/mbed-os/tree/stm32xxx_emac_drain_all_the_drains branch. You can try out the changes there. If they work for you, then I can turn it into a PR.

Note, if you start exploring all the branches I made on #6262, you may see that the version I provided has a windup issue because the ISR kicks a semaphore with an unlimited count. I left that out to concentrate on fixing the packet-drop problem.

The feature-emac branch has the flag implementation that resolves the windup issue, and I'm willing to wait for that. If you want to tackle it yourself, though, my version of the implementation is at https://github.com/pauluap/mbed-os/commit/8ba6d92148e88c56947e41eefdc5a1bcadacf90a

Thank you very much.

I've tested https://github.com/pauluap/mbed-os/tree/stm32xxx_emac_drain_all_the_drains and I can confirm it solves this issue; packets are delivered as expected. I'm glad to hear it should be merged into 5.9.

ARM Internal Ref: MBOTRIAGE-292

Closing this issue since it appears to have been resolved.

