Linux: SPI transfers sometimes corrupt last received byte

Created on 18 Sep 2017  路  25Comments  路  Source: raspberrypi/linux

I use several Raspberry Pis to communicate with other devices via SPI. There, I observed a strange data corruption that I was able to reproduce with a Raspberry Pi 3 B alone, i.e. with SPI in loopback mode (MOSI connected to MISO). I did several experiments to narrow down the problem:

  1. The problem happens only for data that is received, not for data that is sent.
  2. The data on the line looks fine, my logic analyzer shows the correct data.
  3. It is always the last byte of the transfer that is affected. It seems that this byte is not received at all but instead taken from the previous transfer.
  4. Only certain transfer sizes seem to be affected (at 5 MHz). I tested 1 to 150 byte transfers and saw the problem only for 56 to 72 byte transfers and for 111 to 136 byte transfers. Notice how these are roughly the same ranges, just twice the size. For all other sizes 500k transfers of random data did not show the problem.
  5. Only certain clock frequencies seem to be affected (for 64 byte transfers). I tested 1 to 62.5 MHz (in steps of 500 kHz; with 63 MHz and beyond no reliable communication seems possible) and saw the problem only for 4 to 9.5 MHz.
  6. Raspberry Pi 2 B is affected as well.
  7. (Almost) current mainline kernel (4.13-rc2) is affected, as well as some older version (4.11-rc4).

For my experiments on the Raspberry Pi 3 (with kernel 4.13-rc2) I used the following Python script:

import spidev

spi = spidev.SpiDev()
spi.open(32766, 1)

spi.max_speed_hz = 5000000
size = 64

for iteration in range(500000):
    data_send = size * [iteration % 256]
    data_recv = spi.xfer(data_send)
    if data_send != data_recv:
        print(iteration)
        print(data_send)
        print(data_recv)
        break

When running this script, you should see that as soon as it detects a mismatch, the wrong byte belongs to the previous transfer (for example, when it transferred 64 times the value 111, the last byte received will still have the value 110 from the previous transfer).

And for reference, this is the device tree overlay used to activate spidev:

/dts-v1/;
/plugin/;

/ {
    fragment@0 {
        target-path = "/soc/gpio@7e200000";
        __overlay__ {
            spi0_pins: spi0_pins {
                brcm,pins = <9 10 11>;
                brcm,function = <4>;
            };

            spi0_cs_pins: spi0_cs_pins {
                brcm,pins = <8 7>;
                brcm,function = <1>;
            };
        };
    };

    fragment@1 {
        target-path = "/soc/spi@7e204000";
        __overlay__ {
            #address-cells = <1>;
            #size-cells = <0>;
            pinctrl-names = "default";
            pinctrl-0 = <&spi0_pins &spi0_cs_pins>;
            status = "okay";

                spidev1: spi@1{
                compatible = "spidev";
                reg = <1>;
                spi-max-frequency = <5000000>;
            };
        };
    };
};
Waiting for external input Waiting for internal comment

All 25 comments

Interesting. Can you clarify a few points?:

  1. Is it the case that the current 4.9 kernels don't exhibit this corruption?
  2. Does your test need any external wiring other than a patch cable between GPIOs 9 and 10?
  3. Why do you use an overlay to enable SPI when "dtparam=spi=on" ought to do the same thing?
  1. Is it the case that the current 4.9 kernels don't exhibit this corruption?

I did not test that initially, but I have just now verified that a 4.9.50 kernel (built from 12bafdab8214b23223603b75845c5c88f7444b6f, non-mainline) on a Raspberry Pi 2 B is also affected (for 5 MHz, 64 bytes).

  1. Does your test need any external wiring other than a patch cable between GPIOs 9 and 10?

No, connecting BCM 9 and 10 (pins 19 and 21 on the header) is sufficient.

  1. Why do you use an overlay to enable SPI when "dtparam=spi=on" ought to do the same thing?

Sorry, forgot to mention that. The Raspberry Pi 3 is running an AArch64 mainline kernel and the dtparam does not seem to work there. It works fine for the Raspberry Pi 2 I used in the test above.

Thanks for the clarifications. This is a very niche issue so is going to be a low priority, but the easier it is to reproduce the more likely it is to get looked at. How long does it typically take to see a failure?

How long does it typically take to see a failure?

It depends. The average for all my tests is probably several ten thousand iterations with the script above.

Unfortunately, the real-world application with which I first hit the problem, seems to be more likely to see it. I use those Raspberry Pis to test kernel drivers for TPMs (tpm_tis_spi in this case). For those devices, the protocol specifies a maximum transfer size of 64 bytes, which lies perfectly in the affected range. There I see the problem with only a few hundred high-level commands (that might be split up into ~10 SPI transfers each).

Is there anything else I could test? Is there an easy way to get the Linux kernel out of the way and run code directly on the board, so that we can distinguish between a kernel bug and a hardware bug? There was an attempt to port MicroPython to a (bare metal) Raspberry Pi, but I am not sure whether this is usable yet.

My gut feel is that this is a hardware problem, but in which block? All data is transferred by DMA, and the transfer completion interrupt should only occur once the requested number of bytes has been transferred, so an apparent short transfer with old data left behind can only be explained by DMA miscounting or by a cache problem. If only the last byte is lost and the buffers are at least word aligned, a cache maintenance problem seems unlikely.

I have seen the error many times now, and it was always the last byte and only the last byte that was wrong.

When you are talking about DMA you mean the hardware peripheral, not the kernel driver, right? The kernel driver uses DMA only for transfers of 96 bytes and more.

So if this is indeed a hardware bug, the question is: can we fix it in software?

This reminds me of a separate problem we had with the Raspberry Pi's SPI master: Sometimes it seemed to release the CS line too early, while the last bit was still in transmission. We worked around that in the TPM driver (5cc0101d1f88500f8901d01b035af743215d4c3a), but never investigated the problem further. Should I create another issue for that?

Assuming this smells like a clock-crossing bug (which seems likely), can we force synchronous clocks to the SPI peripheral in the same manner as we do for the eMMC host? It may be sufficient to just run the SPI clock from the same PLL source as core clock.
Edit: the divisor is internal to the SPI peripheral and runs from the core clock, so it's synchronous.

One thought: with smaller data sizes the CPU is reading the FIFO rather than DMA. Is the CPU running in polled mode or irq-driven mode when reading the last byte? What happens if a (small) delay is inserted in the reading loop in bcm2835_rd_fifo(), say 1uS or so?

What happens if a (small) delay is inserted in the reading loop in bcm2835_rd_fifo(), say 1uS or so?

That seems to fix the issue, at least for my test of 64-byte transfers at 5 MHz. It also makes the whole test slower by about 25% (the 500k iterations of the script above need roughly 2 minutes on my Raspberry Pi 2, with the fix it's 2.5 minutes). 128-byte transfers are also fixed with this change, so there seems to be something else that prevents the driver from using DMA, even for larger transfers.

This is the patch I used:

diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
index f35cc10..73aa9c1 100644
--- a/drivers/spi/spi-bcm2835.c
+++ b/drivers/spi/spi-bcm2835.c
@@ -107,6 +107,7 @@ static inline void bcm2835_rd_fifo(struct bcm2835_spi *bs)

    while ((bs->rx_len) &&
           (bcm2835_rd(bs, BCM2835_SPI_CS) & BCM2835_SPI_CS_RXD)) {
+       udelay(1);
        byte = bcm2835_rd(bs, BCM2835_SPI_FIFO);
        if (bs->rx_buf)
            *bs->rx_buf++ = byte;

If adding a small delay fixes the issue in the definite polling case, then it's probably a case of FIFO pointers not getting updated before FIFO flags getting updated.

It may be sufficient to just do a dummy read (or several) of another register inside the BCM2835 SPI register block in place of the udelay() call in order to force a few cycles of idle time. Maybe the CDIV register?

@pelwell could there be a missing memory barrier in driver?
e.g. like https://github.com/raspberrypi/linux/commit/b84ce141b9306937f96aaec682e93b0d2058dce9

https://github.com/raspberrypi/linux/blob/rpi-4.9.y/drivers/spi/spi-bcm2835.c#L94 - bcm2835_rd|wr use readl() and writel() respectively - so there should be barriers in the right places.

readl and writel on ARM use rmb and wmb respectively. Theoretically rmb should only provided synchronisation with respect to reads, and wmb to writes, but AFAICT on ARM they are stronger than that and the SPI driver should be safe in that regard.

I'm not convinced the rx_len and rx_buf handling is safe. I can't see a failure mechanism yet, but you could add a "wmb();" after *bs->rx_buf++ = byte;:

        byte = bcm2835_rd(bs, BCM2835_SPI_FIFO);
        if (bs->rx_buf) {
            *bs->rx_buf++ = byte;
            wmb();
        }
        bs->rx_len--;

I did some more tests:

  • ndelay(10) does not seem to be really different from udelay(1), i.e. it also fixes the problem, but the test takes roughly the same time.
  • Adding the udelay(1) call only for the last byte (prefix call with if (bs->rxlen == 1)), does not fix the problem.
  • Reading the same register twice in the condition of the while loop does not fix the problem.
  • The wmb() call does not fix the problem.

And just to be sure, does sprinkling a few mb()s do anything?

        byte = bcm2835_rd(bs, BCM2835_SPI_FIFO);
        mb();
        if (bs->rx_buf) {
            *bs->rx_buf++ = byte;
            mb();
        }
        bs->rx_len--;
        mb();

@pelwell: No, that does not help.

Thanks for trying.

This appears to have stalled, but clearly is an issue. Anyone have any further ideas/comments/fixes?

I tried everything that has been suggested so far, but nothing seems to help. I'm open to try more ideas, if there are any. Did anyone try to reproduce the issue?

For now I'm running the SPI clock outside of the affected range, though that is not a nice solution.

@pelwell Anything changed on SPI since the last post on here?

No, the driver hasn't really changed in the last two years.

Two users have, independently, discovered and fixed the driver bug that is almost certainly responsible for this data loss. @niklasekstrom submitted a Pull Request with a fix, but instead we are waiting for Lukas Wunner's set of improvements to be accepted upstream.

@pelwell Was that fix ever upstreamed?

They're in 5.0 and later. It isn't a clean back-port, but could be done.

@pelwell I guess depends on when we will be moving to 5.x

Was this page helpful?
0 / 5 - 0 ratings