Mbed-os: ISR Queue overflow when signalling from CAN RX ISR

Created on 23 Apr 2018 · 30Comments · Source: ARMmbed/mbed-os

Description of defect

When signalling from CAN using RX ISR, we are seeing the ISG Queue overflowing.
This looks to be related to ARM-software/CMSIS_5#283.

Type: Bug
Priority: Major

Expected behavior
The program keeps running.
Actual behavior
The program crashes with the message CMSIS-RTOS error: ISR Queue overflow (status: 0x2, task ID: 0x0, object ID: 0x20001014)

Target(s) affected by this defect ?

NUCLEO_F767ZI

Toolchain(s) (name and version) displaying this defect ?

arm-none-eabi-g++ (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]

What version of Mbed-os are you using (tag or sha) ?

f9ee4e8

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

mbed-cli version:
mbed --version
1.2.2

How is this defect reproduced ?

I've connected a Peak PCAN USB through a TJA1051T/3 with proper termination on both ends. I'm periodically sending dummy messages using the PCAN. I've tried frequencies as low as 10hz.

I've looked at the trace and the error happens inside the isrRtxThreadFlagsSet function.

I've sporadically run into this or a very similar issue (the ISR queue overflow) in the past when signaling/releasing a semaphore/inserting into a queue, but was never able to consistently reproduce it before.

#include "mbed.h"

static const uint32_t READ_IRQ =    1 << 0;
CAN bus(PB_5, PB_6, 500000);

rtos::Thread thread;
void read_isr() {
    thread.signal_set(READ_IRQ);
}

void can_main() {
    CANMessage canmsg;
    while(true) {
        osEvent evt = rtos::Thread::signal_wait(READ_IRQ);
        while (bus.read(canmsg)) {
        }
        break;
        thread.signal_clr(READ_IRQ);
    }
}

// main() runs in its own thread in the OS
int main() {
    printf("Starting\r\n\n");
    thread.start(can_main);
    wait_ms(10);
    bus.attach(read_isr, CAN::RxIrq);
    while (true) {
        wait(100);
    }
}

IOTOSM-2718 open for community contribution bug

Source

rutgervandenberg

Most helpful comment

As an update, polling CAN::read has proven to be too slow for me as well. This issue is now blocking. Going to look into writing my own driver for the Canbus or using pre-existing HAL drivers.

Still, would be great if the CAN library in mbed was fixed!

swansk on 6 Aug 2019

👍4

All 30 comments

@SenRamakri @bulislaw Please review

0xc0170 on 24 Apr 2018

I did some more testing and it seems that the ISR is called at 5 us intervals and the handling process never becomes active.

can_isr_timings

#include "mbed.h"

static const uint32_t READ_IRQ =    1 << 0;
CAN bus(PB_5, PB_6, 500000);
DigitalOut isr_on(PA_3, 0);
DigitalOut processing(PC_0, 0);
rtos::Thread thread;
void read_isr() {
    isr_on = 1;
    thread.signal_set(READ_IRQ);
    isr_on = 0;
}

void can_main() {
    CANMessage canmsg;
    while(true) {
        processing = 0;
        osEvent evt = rtos::Thread::signal_wait(READ_IRQ);
        processing = 1;
        while (bus.read(canmsg)) {
        }
        thread.signal_clr(READ_IRQ);
    }
}

// main() runs in its own thread in the OS
int main() {
    printf("Starting\r\n\n");
    thread.start(can_main);
    wait_ms(10);
    bus.attach(read_isr, CAN::RxIrq);
    while (true) {
        wait(100);
    }
}

rutgervandenberg on 24 Apr 2018

@rutgervandenberg - I don't know what your application requirements are, one option is to try increase the depth of ISR queue using OS_ISR_FIFO_QUEUE. But I would suggest you to profile your driver/interrupt behavior to make sure you are not running into spurious interrupts issue here.

SenRamakri on 3 May 2018

ARM Internal Ref: MBOTRIAGE-67

ciarmcom on 1 Jun 2018

The issue was:
CAN.read() uses a mutex, which means it cannot be used in an interrupt, which lead me to believe that it should be used out of interrupt context. The result was that the interrupt was never reset, and therefore kept triggering the interrupt handler, preventing any thread from becoming active.

I 'fixed' the issue by extending the CAN class and disabling the mutex lock, which allows me to use CAN.read() in the interrupt handler, which in turn resets the interrupt. I'm guessing this behaviour is not intended, but I see no other way to use interrupts using (only) the mbed CAN class.

rutgervandenberg on 8 Jun 2018

@rutgervandenberg So as I understand it, using the EventQueue was what caused the interrupt to trigger infinitely? Assuming there isn't a way to prevent this.

I am suffering from the same problem currently, surprising that after a year the CAN class has not been improved specifically with can.read(). Have you found any other workarounds as of yet?

swansk on 24 Jul 2019

@checked01 How badly do you need it improved?

We have not come across many customer projects in our direction that need CAN, but we know our way around Mbed.

40Grit on 25 Jul 2019

Thanks for the quick response @40Grit!

It would be excellent to have the ability to use CAN::read in the CAN interrupt. I've tried working with mbed-events (EventsQueue specifically), executing the CAN::read outside of the ISR attached to the CAN object, but ended up with an ISR queue overflow. This seems to be the same problem @rutgervandenberg had.

I additionally tried the workaround involving disabling the mutex lock in CAN.h under CAN::read, interestingly this did not resolve the mutex lock error for me.

I've settled on using a separate thread to wait for and read incoming CAN messages on a millisecond delay. While I'm glad to have a working solution, it would certainly be nice if we were able to use the interrupt without spinning up a new thread for reading the CAN.

One other quick note, I am developing with mbed on PlatformIO. I was able to use CAN::read in the interrupt no problem when using the mbed library without building the project with the full mbedOS for the RTOS features.

swansk on 25 Jul 2019

@checked01 The reason the interrupt was constantly called is that I never cleared it in the handler. CAN::read resets it, so calling that in the handler solved the issue. I had to disable the lock to do that though.

Unfortunately I have to this day not been able to get CAN fully reliable using the mbed driver on my custom platform.

I considered your solution as well, but 1ms waits will cause me to miss messages, and to my knowledge mbed does not have sub-ms delays that don't busy wait.

rutgervandenberg on 26 Jul 2019

The point I'm trying to imply is. Do either of you have a customer backing your efforts or are being prevented from going to production?

40Grit on 26 Jul 2019

@rutgervandenberg sorry to hear that, the micro that I'm using has a fairly deep FIFO CAN queue built in, and luckily I don't have many messages to read. So here's to hoping that I won't miss any messages!

For some reason I am unable to use your solution, even after disabling the lock and attempting a read in the interrupt I am still receiving a Mutex abuse error over my Serial debug.

@40Grit I am developing a CAN enabled charger for a robotics company, we are currently not blocked by this issue as the threaded CAN::read workaround is functional and should serve our purposes. It would be nice to have CAN::read functional in the interrupt, but is not mission critical at this point.

swansk on 26 Jul 2019

As an update, polling CAN::read has proven to be too slow for me as well. This issue is now blocking. Going to look into writing my own driver for the Canbus or using pre-existing HAL drivers.

Still, would be great if the CAN library in mbed was fixed!

swansk on 6 Aug 2019

👍4

@checked01 ARM uses an internal Jira so a lot of people end up not seeing closed github issue comments.
Re-open this or make a new issue and mention you are working towards production on a product and are blocked.
@linlingao

Alternatively if you need third party assistance Embedded Planet might be able to help.

loverdeg on 6 Aug 2019

This issue was actually closed on Github first by the person who created this issue:
@rutgervandenberg rutgervandenberg closed this on Jun 12, 2018

Corresponding Jira issue was then closed on the same day. I checked in Jira, there's no comments.
If there's still lingering issues, you can reopen it or create a new one.

linlingao on 6 Aug 2019

@linlingao This time I'm not blaming the bot haha.

Just passing @checked01 a tip and copied the first ARM person I knew of that has visibility over closed issues.

loverdeg on 6 Aug 2019

Thanks for the quick response guys, @linlingao would it be possible to re-open this issue? I can create a new one otherwise but there is a lot of good information in this thread, and the issues would be identical.

swansk on 6 Aug 2019

@checked01 Sure, we'll reopen this.

linlingao on 6 Aug 2019

Wanted to make sure, has this been reopened in JIRA as well?

I noticed the issue still has the "closed_in_jira" label.

swansk on 7 Aug 2019

This is blocking us from releasing production code for a client.
I'm just setting an EventFlags bit in the CAN read callback to signal my main thread, but the interrupt keeps firing before my main thread can read the message.

bikeNomad on 26 Nov 2019

@0xc0170 , @kjbracey-arm

40Grit on 27 Nov 2019

Here's my quick hack work-around: in CAN::_irq_handler(), disable the interrupt enable bit for the active interrupt. In CAN::read(), re-enable the RxIrq interrupt if there is a registered handler.
For the other interrupt types, I have to re-enable the interrupts using CAN::attach().

This at least gets my test program going; I will test it more thoroughly tomorrow.

--- a/drivers/source/CAN.cpp
+++ b/../Targets/targets/TARGET_STM/TARGET_STM32F0/TARGET_STM32F072VB/CAN.cpp
@@ -79,6 +79,12 @@ int CAN::read(CANMessage &msg, int handle)
     lock();
     int ret = can_read(&_can, &msg, handle);
     unlock();
+    // re-enable Rx interrupt if there is a handler
+    __disable_irq();
+    if (_irq[(CanIrqType)RxIrq]) {
+      can_irq_set(&_can, (CanIrqType)RxIrq, 1);
+    }
+    __enable_irq();
     return ret;
 }

@@ -152,6 +158,9 @@ void CAN::attach(Callback<void()> func, IrqType type)
 void CAN::_irq_handler(uint32_t id, CanIrqType type)
 {
     CAN *handler = (CAN *)id;
+    // disable the interrupt.
+    // To re-enable interrupts other than the Rx, call CAN::attach again.
+    can_irq_set(&handler->_can, type, 0);
     if (handler->_irq[type]) {
         handler->_irq[type].call();
     }

bikeNomad on 27 Nov 2019

👍1

cc @ARMmbed/mbed-os-core

0xc0170 on 27 Nov 2019

Looking at the CMSIS team answer from 283 issue referenced here: https://github.com/ARM-software/CMSIS_5/issues/283#issuecomment-472316216

ISR queue handling is processed immediately after ISRs and before returning to threads. ISR queue overflow indicates that the system is busy executing ISRs most of the time and has no time to execute threads. Usually it indicates a problem in the application design and it is questionable if it will function as intended.

Short burst of interrupts that put items into the ISR queue can be simply solved by increasing the ISR queue size. But if there are sustained interrupts that trigger postprocessing then the above solutions might not help much.

@bikeNomad let us know how it goes, there are some suggestions above similar to to the comment ^^

0xc0170 on 27 Nov 2019

I still regard this as an RTX bug. RTX could totally avoid this. Discussion here: https://github.com/ARM-software/CMSIS_5/issues/283

Maybe I should put my money where my mouth is and actually make a patch addressing it.

Here's my quick hack work-around

Disabling the IRQ until you need it again is a correct solution, but should be up to the application to do it.

An alternative and roughly equivalent solution is to not signal the flag again if the handler still hasn't dealt with the last flag setting. I would do it that way just to avoid potential HAL glitches with attaching and deattaching the IRQ.

This is bit tricky to get right and I would code it as:

mstd::atomic_flag handler_pending; // flag must be atomic

application_irq_handler()
{
    if (!handler_pending.test_and_set()) {
        event_flags.set(CAN_IRQ);
        // or
       event_queue.call(can_handler);
    }
}

can_handler()
{
    handler_pending.clear(); // must clear flag before attempting read to avoid race
    for (;;) { // must keep reading until we see it's empty (may be triggered once for multiple IRQs)
        if (!can.read(&msg)) {
            break;
        }
        // process message
    }
}

You can find similar logic in InternetSocket and InternetDatagramSocket, with its _pending flag - it uses the older C equivalent of the C++ above.

kjbracey-arm on 27 Nov 2019

👍2 👀1

The above code assumes that there is some sort of queue in the HAL, such that it's acceptable for multiple IRQs to arrive before you get around to a read.

If there was a requirement that you read before the next IRQ to not lose data, you have a worse problem.

First thing to check is the priority of the thread - it should be "high" or "real time". If using a shared event queue, it should be the "high priority" one. If that still isn't sufficient, then you might need to actually read in your interrupt handler.

You can do that - you would just need to deactivate the mutexes. This can be done by making an UnlockedCAN or CriticalCAN class derived from CAN. Override the lock() and unlock() virtual methods with either empty functions or "enter/exit_critical" calls, and then you can use read from the IRQ handler. That override of the lock and unlock is intended use.

kjbracey-arm on 27 Nov 2019

For some reason I am unable to use your solution, even after disabling the lock and attempting a read in the interrupt I am still receiving a Mutex abuse error over my Serial debug.

Serial is not safe to use from IRQ. You can use RawSerial instead. But don't put serial debug in interrupt handlers. It's too slow.

kjbracey-arm on 27 Nov 2019

👍1

An alternative and roughly equivalent solution is to not signal the flag again if the handler still hasn't dealt with the last flag setting. I would do it that way just to avoid potential HAL glitches with attaching and deattaching the IRQ.

This is bit tricky to get right and I would code it as:

Thanks @kjbracey-arm ! I'll give that a try. I'm new to RTX and haven't poked around in its guts yet.

As for having a queue in the HAL, most parts have a FIFO in hardware (the ST parts have two 3-message deep FIFOs but the mbed code only uses one of them).

bikeNomad on 27 Nov 2019

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers.
Internal Jira reference: https://jira.arm.com/browse/IOTOSM-2718

ciarmcom on 9 Nov 2020

This issue is on our backlog, but given our current priorities we can make no
commitment to address this in the near term.

As an open source project, we welcome fixes to Mbed OS, so we have opened this
issue up for community contribution.

If this is an issue you think you could address yourselves, then we'd welcome the
contribution. Please see our guidelines here -
https://os.mbed.com/docs/mbed-os/latest/contributing/workflow.html#contributions .

ciarmcom on 23 Nov 2020

@bikeNomad @kjbracey-arm any progress or conclusions. I'm still facing the problem. @bikeNomad solution seems to work for me.