The drivers in the CAN subsystem suffer from priority inversion. This causes urgent frames to be delayed arbitrarily and, in the case of the FlexCAN drivers, causes the CPU to spin for an arbitrary time.
I explain CAN priority inversion in a blog post here:
https://kentindell.github.io/2020/06/29/can-priority-inversion/
The solution to this is to rewrite the CAN drivers. Some software may rely on FIFO processing (e.g. for segmented messages) and a separate API for sending frames according to FIFO will probably be necessary. The blog post discusses this in more detail.
@kentindell As you already noticed, some CAN subsystems rely on FIFO processing of the messages. Unfortunately, the CAN controllers can either have priority by CAN-ID OR FIFO. A combination is regrettably not possible.
The flexcan driver does NOT busy-wait for a free slot.
The thread is suspended and waits on the semaphore.
If you need to have a priority based on the ID, you can submit a PR with a Kconfig option to switch to that mode. It would be possible for flexcan and STM32. Relying on a FIFO ordering is not a bad API design but necessary in the case of ISO-TP and 6LoCAN.
Using only a single hardware buffer, as suggested in your blog, is a bad idea. It is almost impossible to get a frame ready during the interframe space (3 bit times). Hence the frame is too late, and another node can send a frame (maybe a lower-priority one) before we get our frame ready. I agree that this delay is bounded (1 frame), but it also comes with a high cost in bandwidth.
I’m not suggesting a combination: the hardware must be put into ID priority mode to avoid priority inversion.
For segmented messaging, like diagnostics, FIFO is necessary but only with respect to the frames containing segments. This is not a mixed strategy: it operates at a higher level. In my blog post I explain that a frame is taken from a software FIFO (specific to the segmented message and not shared with other frames) and put into the driver’s (priority) queue; when that frame has been sent, the next one is pulled from the FIFO and put in the queue, and so on.
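The mechanism described above can be sketched roughly as follows. All names are illustrative (this is not the Zephyr API); the point is that only one segment at a time sits in the ID-sorted queue:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative frame type -- not Zephyr's frame structure. */
struct can_frame {
    uint32_t id;      /* CAN ID: lower value = higher priority */
    uint8_t data[8];
    uint8_t dlc;
};

#define FIFO_DEPTH 8
#define PQ_DEPTH 8

/* Software FIFO holding the segments of ONE segmented message. */
struct seg_fifo {
    struct can_frame frames[FIFO_DEPTH];
    int head, count;
};

/* Driver-level transmit queue, kept sorted by CAN ID (lowest ID first). */
struct prio_queue {
    struct can_frame frames[PQ_DEPTH];
    int count;
};

static void pq_insert(struct prio_queue *q, const struct can_frame *f)
{
    int i = q->count++;

    /* Insertion sort: shift lower-priority (higher-ID) frames back. */
    while (i > 0 && q->frames[i - 1].id > f->id) {
        q->frames[i] = q->frames[i - 1];
        i--;
    }
    q->frames[i] = *f;
}

/* Transmit-complete hook for a FIFO frame: pull the next segment from
 * the per-message FIFO and hand it to the priority queue. */
static void seg_fifo_on_tx_done(struct seg_fifo *sf, struct prio_queue *pq)
{
    if (sf->count > 0) {
        pq_insert(pq, &sf->frames[sf->head]);
        sf->head = (sf->head + 1) % FIFO_DEPTH;
        sf->count--;
    }
}
```

Because at most one segment is in the priority queue at a time, an urgent frame with a lower CAN ID always overtakes the remaining segments.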
My reference to a single buffer is for the MCP2515 only: there is almost nothing else that can be done. And yes, I am aware of the problem of releasing access to the CAN engine for longer than the IFS: I am the inventor of the three transmit buffer design (first applied to MSCAN on the 68HC08) designed to avoid this race condition. The MCP2515 simply can’t be used properly for real-time CAN, and the best that can be done is to use a single buffer slot.
@kentindell Thank you for your report on this.
I am the original author of the FlexCAN shim driver in Zephyr. The original driver used the CAN-ID for priority, but this was changed at a later point to match the behaviour of other CAN drivers in Zephyr. See https://github.com/zephyrproject-rtos/zephyr/commit/ec0e19920657e0b074d424410381d12b788494c9 for further details.
I agree that this needs to be addressed in Zephyr.
With hindsight it probably would've been best to change the other drivers :)
I can see how it gets to this point: there is a need for FIFO queueing for frames with the same CAN ID, and in nearly all cases the hardware doesn't do that (it chooses a frame arbitrarily). So this looks like a FIFO problem, and so the drivers are hacked to produce FIFO ordering, and then we get priority inversion. But what's really needed is two things: FIFO ordering for the frames of a segmented message, and priority queueing by CAN ID for everything else.
I've written drivers for the controller on the SAMC21 and also the ST bxCAN that do this. The easiest approach is for the priority-queue driver's transmit-complete event handler to know that the sent frame came from a FIFO queue (e.g. the frame data structures include a FIFO tag); as an action it fetches the frame at the head of that FIFO and queues it in the priority queue.
There is no other way around this. FIFO transmission is required for segmented messaging and priority queues are required to avoid priority inversion.
I do recognize that when a FIFO frame is transmitted then the CAN bus is released while the next FIFO frame is fetched and one lower priority frame can get in each time, so the maximum throughput of CAN frames transmitted back-to-back is curtailed. That's an issue for sure, but much, much less of a problem than priority inversion, which is devastating to a critical real-time system.
@kentindell thanks for pointing this out!
I need to think about a realization. Maybe we can somehow put the next frame into the mailbox when a frame is just about to be sent. For now we are safe in our subsystems, because we don't rely on priorities. Maybe we should add a warning to the documentation until we find a solution. I think we can discuss the possibilities in this issue.
I think you might be able to do something clever on the FlexCAN: do an experiment and find out how frames with the same ID are chosen. The documentation says that it does a sweep down the mailboxes looking for the highest priority, and assuming it sweeps in the same direction (very likely) then it will probably pick frames of the same ID from either lower or higher slot numbers (depending on sweep direction).
What you can do then is put _two_ frames with the same ID from the software FIFO into FlexCAN mailboxes, arranging to put them in slot order _with respect to each other_ (kind of like the drivers do now). Then when there is a notification of one of the FIFO frames being sent, the other FIFO frame will still be in the controller and be entered into arbitration while the driver is working. The driver then has time to take the next frame from the software FIFO and find a free mailbox that will be behind the current FIFO frame. The bus arbitration will not allow a lower priority frame on the bus to get in between in this case and the FIFO frames can go out back-to-back if no other higher priority traffic comes along. Sort of like double buffering the FIFO frames. But it's also free of priority inversion because higher priority frames will still jump ahead of these FIFO frames.
If there are multiple segmented messages needing FIFO ordering with respect to each other then each FIFO requires two slots in the FlexCAN. But maybe that's not a problem: probably only one or two FIFO queues would be needed in most applications.
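Under the (unverified) assumption that the controller resolves equal-ID mailboxes in a fixed sweep order, the double-buffering idea could be sketched like this. `hw_load_mailbox_behind_current()` is a hypothetical hook standing in for the mailbox-slot placement logic, stubbed here to count loads:

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS_PER_FIFO 2    /* two HW mailboxes reserved per software FIFO */
#define FIFO_DEPTH 16

struct can_frame {
    uint32_t id;
    uint8_t data[8];
    uint8_t dlc;
};

struct fifo_state {
    struct can_frame pending[FIFO_DEPTH]; /* queued segments */
    int head, count;
    int in_mailbox;  /* FIFO frames currently held in HW mailboxes */
};

/* Hypothetical HW hook: writes the frame into a free mailbox whose slot
 * number places it behind the FIFO frame already under arbitration.
 * Stubbed out here so the refill logic can be exercised. */
static int mailbox_loads;

static void hw_load_mailbox_behind_current(const struct can_frame *f)
{
    (void)f;
    mailbox_loads++;
}

/* Keep up to two same-ID frames in the controller so the next segment
 * enters arbitration immediately after the previous one is sent. */
static void fifo_refill(struct fifo_state *fs)
{
    while (fs->in_mailbox < SLOTS_PER_FIFO && fs->count > 0) {
        hw_load_mailbox_behind_current(&fs->pending[fs->head]);
        fs->head = (fs->head + 1) % FIFO_DEPTH;
        fs->count--;
        fs->in_mailbox++;
    }
}

static void on_fifo_frame_sent(struct fifo_state *fs)
{
    fs->in_mailbox--;  /* one mailbox has been freed */
    fifo_refill(fs);   /* top back up while the other frame arbitrates */
}
```

The invariant is that while the FIFO is non-empty, one segment is always in arbitration while the driver refills the second slot, so no bus gap opens up between segments.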
@kentindell As I understand it, a software buffer suffers from the same problem as the hardware buffers do. If a frame with a lower ID must be sent and the HW and SW buffers are full, we cannot do anything, because we can't just remove other frames without putting them in the queue again. This is an issue due to the async behavior of the CAN API. For synchronous calls, we can just keep a copy of the frame on the stack and put it into the queue later on. For async send, we don't have a stack to keep it on, unfortunately.
A solution could be to use a buffer and context provided by the user for async transfers. The async context holds the callback + callback arg, the semaphore, a list pointer, and a pointer to N frames, where the N frames could be used for sequential transfers.
The contexts can form a list and can be moved forward/backward according to their ID. The driver needs to be able to remove frames from its mailboxes to avoid priority inversion. The synchronous calls can hold that context on the stack. The async calls need to make sure that the memory stays valid until the callback is called. The drawback is that if the user provides a buffer and places it on the stack, the whole list could be corrupted.
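A rough sketch of such a caller-owned context and the sorted list insertion (all field and function names are hypothetical, and locking is omitted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct can_frame; /* frames live in caller-owned memory */

/* Hypothetical caller-provided async TX context: the caller must keep
 * this memory valid until the completion callback has run. */
struct can_tx_ctx {
    uint32_t id;                          /* CAN ID used for ordering    */
    void (*callback)(int err, void *arg); /* completion notification     */
    void *cb_arg;
    const struct can_frame *frames;       /* N frames, sent sequentially */
    int num_frames;
    struct can_tx_ctx *next;              /* driver-managed linkage      */
};

/* Insert into the pending list sorted by CAN ID, so the head is always
 * the highest-priority pending transfer. */
static void tx_list_insert(struct can_tx_ctx **head, struct can_tx_ctx *ctx)
{
    while (*head != NULL && (*head)->id <= ctx->id) {
        head = &(*head)->next;
    }
    ctx->next = *head;
    *head = ctx;
}
```

Since the driver only links caller-owned nodes, it needs no frame storage of its own, which is exactly why a context accidentally placed on a dying stack frame corrupts the whole list.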
What do you guys think about it @henrikbrixandersen @karstenkoenig @nixward ?
The software queues would have the same problem as the hardware queues if they aren’t big enough, that’s true. My MicroPython bxCAN drivers can be compiled to support 32 or 64 frames, but it’s not easy to have an arbitrary number.
For serialising a segmented message, you could have an API where the whole message is stored in RAM (allocated by the application but with ‘ownership’ of the RAM handed to the CAN driver) and the CAN driver assembles a new frame into the driver’s priority queue on a callback when the previous frame has been sent. Obviously there are memory management issues (bad idea to have that message stored on the stack!).
I think this scheme is workable. It’s important to be careful about CPU time in the TX callback - it will need to be kept short, particularly for the bxCAN driver which will also have the job of filling a hardware mailbox from the priority queue (there is probably an optimisation when the new frame is also the one that should go into the mailbox - can combine the two operations into one).
The TX callback (ISR context) will only move the head of the sorted queue to the HW mailbox. This operation is quite short and O(1). The sorting (Insertion sort, which is O(n)) will happen in the context of the sending thread.
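The thread/ISR split described above could look something like this (simplified: no locking is shown, though in a real driver the list walk would run under a spinlock or with interrupts masked):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tx_node {
    uint32_t id;           /* CAN ID: lower value = higher priority */
    struct tx_node *next;
};

static struct tx_node *tx_head;

/* Thread context: O(n) insertion sort by CAN ID. */
static void tx_enqueue(struct tx_node *n)
{
    struct tx_node **p = &tx_head;

    while (*p != NULL && (*p)->id <= n->id) {
        p = &(*p)->next;
    }
    n->next = *p;
    *p = n;
}

/* ISR context: O(1) -- just detach the head for the freed HW mailbox. */
static struct tx_node *tx_pop_for_mailbox(void)
{
    struct tx_node *n = tx_head;

    if (n != NULL) {
        tx_head = n->next;
    }
    return n;
}
```

Keeping the O(n) work in the sending thread means the interrupt latency stays constant regardless of how many frames are queued.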