Target
NUCLEOF429ZI
Toolchain:
GCC_ARM
mbed-os
5.6.2, 5.7.1
Expected behavior
Actual behavior
Steps to reproduce
#include "mbed.h"
#include "EthernetInterface.h"
TCPSocket socket;
EthernetInterface eth;
#define WITH_TICKER 1
void tick_func(){
wait_us(90);
return;
}
int main()
{
eth.connect();
#if WITH_TICKER
Ticker tick;
tick.attach(&tick_func,0.001);
#endif
socket.open(ð);
socket.connect("192.168.67.38", 61112);
char sbuffer[] = "hello!\r\n";
int scount = socket.send(sbuffer, sizeof sbuffer);
printf("sent %d [%.*s]\r\n", scount, strstr(sbuffer, "\r\n")-sbuffer, sbuffer);
char rbuffer[2048];
Timer tim;
bool first=true;
while(true){
int rcount = socket.recv(rbuffer, sizeof rbuffer);
if(rcount>0){
if(first){
tim.reset();
tim.start();
first=false;
}
if(rcount != 536){
first=true;
tim.stop();
}
#if WITH_TICKER
if(tim.read_ms() > 300){
printf("detaching\n");
tick.detach();
}
#endif
printf("recv %d in %d\r\n", rcount,tim.read_ms());
}else{
Thread::wait(10);
}
}
}
sent 9 [hello!]
recv 536 in 0
recv 536 in 1
recv 230 in 2
recv 536 in 0
recv 536 in 9
detaching
recv 230 in 539
#define WITH_TICKER 0 to see that problem is caused by attached ticker (should not be reproduced)Note:
Code just simulates real situation (some pin actions in Ticker), I dont want to use any waits in Ticker.
Am I doing something wrong? I can agree that 10% of CPU usage by Ticker function is a little much, but in my opinion deteching Ticker should bring system back to correct state. Next thing I don't understand is that packet read delay starts after 2 mins and not immediately.
@kjbracey-arm @mikaleppanen
I would hazard a guess that this is a problem in the Nucleo network driver triggered by an interrupt latency spike - if it doesn't respond fast enough to some packet received interrupt, the driver state becomes inconsistent, eg buffer index getting out of sync with hardware.
Random large packet delays are generally a symptom of driver and hardware ring buffer pointers getting out of sync. Might I suggest CCing a Nucleo person?
@ARMmbed/team-st-mcd
Hi
Does adding DMB as in #5720 solve anything ?
Regards,
Nice thought, but I doubt it's a related issue. Still, it's worth trying, just in case.
I didn't add the DMB to the other drivers because I didn't believe they'd achieve anything on pre-M7 devices, and was aiming for a quick urgent fix for the M7. But I'd say they should be there in all driver variants, just to make the code robust against future devices.
Hi @jeromecoutant, thanks for suggestion. Unfortunatelly adding those three lines to stm32f4xx_hal_eth.c doesn't seem to fix the issue.
Best regards
@martinnov71 Callbacks registered in ticker instances are executed in handler mode (within an IRQ); depending on IRQ priorities it can prevents other interrupts to be served. Depending on the maximum interrupt latency accepted by peripherals or drivers present in the system, this may be an issue.
I would advise you to use the event queue facility to defer the work made in the ticker handler to thread mode.
@pan- - good advice in general, but in this case he's claiming 90us in the interrupt handler, which is within the 100us area that I've always used as a rule of thumb for "tolerable IRQ latency" on a conventional system. I'd expect the network to cope with that sort of latency, and certainly for it to not cause lasting harm once the handler's gone.
@pan- I agree that Thread-like solution would be better, but it can cause function calling jitter. I tried this but for me it is workaround, not solution. I use Ticker for display drawing so I can exactly see blinking when I use Thread solution.
@kjbracey-arm Yeah, it seems that some pointers are confused in emac driver or so. It looks like that everything becames delayed by 1 frame so first parts are OK, but last one is delayed and waits for retransmission which makes original frame available. But I have no idea how so debug/solve it.
I saw there is branch of mbed-os called _feature-emac_. Is it possible that those changes could prevent getting into inconsistent state?
Ah, we have been investigating various ST driver performance issues - see #6262. And one of the key fixes proposed there there has made it onto the feature-emac branch, which is scheduled for merge into 5.9.
https://github.com/ARMmbed/mbed-os/pull/6425
That's the change to make the STM driver not assume a single packet per interrupt, but process multiple. That driver flaw would be consistent with the problem you're seeing, making the driver vulnerable to interrupt latency.
We don't currently have the equivalent fix scheduled for a patch release - maybe @pauluap would care to submit his? It would be superceded by the feature-emac branch though. Or adapt the commit from feature-emac?
@martinnov71, I've isolated the fixes that kjbracey was talking about on the https://github.com/pauluap/mbed-os/tree/stm32xxx_emac_drain_all_the_drains branch. You can try out the changes there. If they work for you, then I can turn it into a PR.
Note, if you start exploring all the branches that I made on #6262, you may see that the version I provided has an windup issue because the ISR kicks a semaphore with unlimited count. I left that out to concentrate on fixing the problem of packet droppage.
The feature-emac branch has the flag implementation that resolves the windup issue, I'm willing to wait for that. If you want to do it though, my version of the implementation is at https://github.com/pauluap/mbed-os/commit/8ba6d92148e88c56947e41eefdc5a1bcadacf90a
Thank you very much.
I've tested https://github.com/pauluap/mbed-os/tree/stm32xxx_emac_drain_all_the_drains and I can confirm it solves this issue, packets are delivered as expected. I'm glad to hear it should be merged into 5.9.
ARM Internal Ref: MBOTRIAGE-292
Closing this issue since it appears that the issue was resolved.
Most helpful comment
@pan- I agree that Thread-like solution would be better, but it can cause function calling jitter. I tried this but for me it is workaround, not solution. I use Ticker for display drawing so I can exactly see blinking when I use Thread solution.
@kjbracey-arm Yeah, it seems that some pointers are confused in emac driver or so. It looks like that everything becames delayed by 1 frame so first parts are OK, but last one is delayed and waits for retransmission which makes original frame available. But I have no idea how so debug/solve it.