Circuitpython: Teensy 4.0 / i.MXRT1062 displayio via SPI very slow

Created on 22 Jun 2020 · 14 Comments · Source: adafruit/circuitpython

Trying out the Adafruit 2.0" 320x240 ST7789 based display, I get an incredibly slow refresh rate considering the speed of this MCU. A full refresh takes about 2 seconds.
The display works over SPI and I tried setting higher SPI clock rates but that does not seem to have any effect.
The same code works about 20x as fast on a Feather M4. I haven't done any measurements but just from looking at it I get an instant screen update on the M4 while the Teensy needs about 1-2 seconds to draw.
You can try the example provided with the displays guide: https://github.com/adafruit/Adafruit_CircuitPython_ST7789/blob/master/examples/st7789_320x240_simpletest.py

My guess is that the SPI clock defaults to a very low rate and that setting it via spi.configure() has no effect, but I have yet to verify this.

mimxrt10xx

All 14 comments

So I hooked up my scope and I get 17.5 MHz no matter what I feed into spi.configure().

This is the modified example with setting SPI clock.

import board
import terminalio
import displayio
from adafruit_display_text import label
from adafruit_st7789 import ST7789

# Release any resources currently in use for the displays
displayio.release_displays()

spi = board.SPI()
tft_cs = board.D5
tft_dc = board.D6

while not spi.try_lock():
    spi.configure(baudrate=32000000)
    pass
spi.unlock()

display_bus = displayio.FourWire(
    spi, command=tft_dc, chip_select=tft_cs, reset=board.D9
)

display = ST7789(display_bus, width=320, height=240, rotation=90)

# Make the display context
splash = displayio.Group(max_size=10)
display.show(splash)

color_bitmap = displayio.Bitmap(320, 240, 1)
color_palette = displayio.Palette(1)
color_palette[0] = 0x00FF00  # Bright Green

bg_sprite = displayio.TileGrid(color_bitmap, pixel_shader=color_palette, x=0, y=0)
splash.append(bg_sprite)

# Draw a smaller inner rectangle
inner_bitmap = displayio.Bitmap(280, 200, 1)
inner_palette = displayio.Palette(1)
inner_palette[0] = 0xAA0088  # Purple
inner_sprite = displayio.TileGrid(inner_bitmap, pixel_shader=inner_palette, x=20, y=20)
splash.append(inner_sprite)

# Draw a label
text_group = displayio.Group(max_size=10, scale=3, x=57, y=120)
text = "Hello World!"
text_area = label.Label(terminalio.FONT, text=text, color=0xFFFF00)
text_group.append(text_area)  # Subgroup for text scaling
splash.append(text_group)

while True:
    pass

Interestingly, if I put a print() inside the while not spi.try_lock(): loop, nothing gets printed, yet the display still draws to the screen.
I'm not sure how that can work, but it may be the root cause of not being able to set the clock rate.

The try_lock grabs the lock immediately and skips the loop body. The configured rate of the SPI is ignored by the display code as well. Instead, give the baudrate to FourWire so it knows and can call configure every time it transmits to the display.
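A minimal sketch of why the loop body in the modified example never runs. FakeSPI is a hypothetical stand-in written only for this illustration (it is not part of CircuitPython); it mimics just enough of the try_lock/configure/unlock behaviour to run on a desktop Python.

```python
class FakeSPI:
    """Hypothetical stand-in for busio.SPI, for illustration only."""

    def __init__(self):
        self.locked = False
        self.baudrate = None  # None means configure() was never called

    def try_lock(self):
        # Succeeds on the first call when nobody else holds the bus.
        if self.locked:
            return False
        self.locked = True
        return True

    def configure(self, baudrate):
        self.baudrate = baudrate

    def unlock(self):
        self.locked = False


# Pattern from the modified example: configure() sits inside the wait loop.
broken = FakeSPI()
while not broken.try_lock():
    broken.configure(baudrate=32000000)  # never reached: the lock was free
broken.unlock()

# Usual pattern: wait for the lock, then configure while holding it.
fixed = FakeSPI()
while not fixed.try_lock():
    pass
fixed.configure(baudrate=32000000)
fixed.unlock()

print(broken.baudrate, fixed.baudrate)
```

On the board itself, the suggestion above amounts to passing the rate to the bus constructor, for example displayio.FourWire(spi, command=tft_dc, chip_select=tft_cs, reset=board.D9, baudrate=32000000), so that the display code applies it on every transfer.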

I tried this as well, but unfortunately it makes no difference.

Hrm. Could be that the SPI setup code can't do anything faster.

I have an application where the Teensy 4.0 running CircuitPython should be ideal.

After reading some comments (issue #3062) about how slow the Teensy 4.0 is updating a SPI driven display, I used a Teensy 4.0 in a Teensy-to-Feather Adaptor to compare performance of the M4E and Teensy 4.0.

Both T4.0 and M4E are running CircuitPython 5.3.1 and bootloader 3.10.0.

I reproduced the complaint about the T4.0 taking about four times as long to update a (SPI) TFT Display.

The Teensy is running an NXP iMX RT1062 Cortex M7 which is allegedly a 600 MHz processor.

The M4E is a Microchip/Atmel ATSAMD51 Cortex M4 at 120 MHz.

A simple pin toggling loop exhibits the behavior.
The M4E can toggle a single pin at 72.2 kHz
The T4.0 can only toggle a single pin at 24.6 kHz.

import board
import digitalio

pin = digitalio.DigitalInOut(board.D10)  # J3-P7 is D10 on M4E and T4.0 Adapter
pin.direction = digitalio.Direction.OUTPUT

while True:
    pin.value = True
    pin.value = False

Now I examine the SPI performance on a scope, sending an array of ten bytes in a single chip select event.

I request a 10 MHz SPI clock to keep things at about the same clock speed.
Result:
M4E: get 12.0 MHz
T4.0: get 11.6 MHz
So close enough, and nowhere near an explanation for what is going on. This tells me that once the data byte gets into the SPI hardware peripheral, both boards clock it out at about the same rate.

Results:

Frame transmission rate (a frame is ten 8-bit SPI transmissions inside one CS cycle, so CS pre and post time is included):
M4E: sends at a frame rate of 22.9 kHz (~44 us per SPI frame of ten bytes)
T4.0: sends at a frame rate of 6.8 kHz (~147 us per SPI frame of ten bytes)

CS management (no lock acquisition or release included in the loop):
M4E: Lowers CS 16 us before first SPI data starts. Raises CS 8 us after last SPI data complete.
T4.0: Lowers CS 40 us before first SPI data starts. Raises CS 22 us after last SPI data complete.

Spacing between inner single byte transmission events:
M4E: SPI single byte transfer events are spaced 550 ns apart
T4.0: SPI single byte transfer events are spaced 7200 ns (7.2 us) apart

Time to send 10 bytes (start of first byte to end of last byte, CS pre and post time not included):
M4E: 11.25 us
T4.0: 72 us

So, I can conclude that the CP performance of a T4.0 running CP 5.3.1 is about
four times slower, not four times faster, than an M4E.
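As a rough cross-check of the scope readings above, the arithmetic can be sketched in a few lines. The variable names are made up for this illustration; the numbers are the ones quoted in this comment, and the "wire time" assumes the clock runs at the measured SCK rate for all 80 bits.

```python
# Scope readings quoted from the comment above.
BITS = 10 * 8  # ten 8-bit bytes per frame

boards = {
    # name: (measured SCK in Hz, first-to-last-byte time in s, frame rate in Hz)
    "M4E": (12.0e6, 11.25e-6, 22.9e3),
    "T4.0": (11.6e6, 72e-6, 6.8e3),
}

results = {}
for name, (sck_hz, payload_s, frame_hz) in boards.items():
    wire_s = BITS / sck_hz  # time the clock must run to shift out 80 bits
    results[name] = {
        "wire_us": wire_s * 1e6,
        "idle_us": (payload_s - wire_s) * 1e6,  # time with no clocking between bytes
        "busy_pct": 100 * wire_s / payload_s,   # share of payload time spent clocking
        "frame_us": 1e6 / frame_hz,             # full CS-to-CS period
    }
    print(name, {k: round(v, 1) for k, v in results[name].items()})

payload_ratio = boards["T4.0"][1] / boards["M4E"][1]
frame_ratio = boards["M4E"][2] / boards["T4.0"][2]
print(round(payload_ratio, 1), round(frame_ratio, 1))
```

By this estimate the M4E keeps the clock running for roughly 59% of the payload time while the T4.0 manages just under 10%. The payload takes 6.4 times as long on the T4.0 and a whole frame about 3.4 times as long, which points at the gaps between bytes rather than the clock rate.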

The iMX RT-1062 is a wicked complicated beast. The only way to get the 600 MHz performance is to execute the code from tightly coupled RAM. If you use their "XIP" (Execute in Place) option, executing directly in external (132 MHz?) QSPI Flash, your execution speed could drop down to 60 MHz or so.

I have scope pictures if you want them, but this is easy to reproduce.

import busio
import board
import digitalio

cs = digitalio.DigitalInOut(board.D10)     # J3-P7 is D10 on M4E and T4.0 Adapter
cs.direction = digitalio.Direction.OUTPUT
cs.value = True


spi = busio.SPI(board.SCK, board.MOSI, board.MISO)

while not spi.try_lock():
    pass
spi.configure(baudrate=10000000)    # Request 10 MHz
spi.unlock()

out_array = bytearray([0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99])

while not spi.try_lock():
    pass

while True:
    cs.value = False
    spi.write(out_array)
    cs.value = True

@phrogger the comment has been edited to remove words like "disappointing", as it isn't constructive or helpful.

@phrogger Very interesting findings! The port is still in an early stage, and the development focus of the core team has currently shifted in favour of the ESP32-S2. I would also be interested in further work on this port, and you seem to know a fair bit about the i.MX RT. I think everyone would welcome more helping hands on the port if you would like to help out! I wanted to dive deeper into it but unfortunately got distracted by other things, and I didn't know all that much about this MCU to begin with.

@PTS93 I am currently doing a C-Language project on iMX RT1052 bare metal, no RTOS, and still learning how to manage the beast. At least in my case, the entire compiled program and data will fit in the tightly coupled RAM without paging, so I expect to get the full performance of the processor.

Paging can slow things down, and the speed of the memory/bus where the code is resident can change the overall performance.

There is also an option called XIP (execute in place) where instead of paging the code into RAM and executing from RAM, you actually execute directly from permanent storage memory. The docs warn that if you do this for the on-board FLASH on the iMX-RT1062, that it will slow down the effective execution speed to about 130 MHz.

My thought is that if you do it out of QSPI, it will take you two clock cycles (at 132 MHz?) to retrieve a single byte from the QSPI chip, which sounds really ugly.

It appears that the Teensy 4.0 CP 5.3.1 implementation is running at an effective clock speed down around 40 or 30 MHz, if you use the 120 MHz M4E as a benchmark, at least for simple loops and managing the SPI hardware peripheral.
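The "40 or 30 MHz" figure can be reproduced as a back-of-the-envelope calculation, sketched below. It assumes throughput scales linearly with clock speed, which is only a rough model; the names are illustrative and the rates are the scope readings quoted earlier in the thread.

```python
M4E_CLOCK_MHZ = 120.0  # SAMD51 core clock used as the yardstick

# name: (M4E rate in kHz, T4.0 rate in kHz), both as measured on the scope
measurements_khz = {
    "pin toggle": (72.2, 24.6),
    "10-byte SPI frame": (22.9, 6.8),
}

effective_mhz = {}
for name, (m4e_khz, t40_khz) in measurements_khz.items():
    # Scale the M4E clock by how much slower the T4.0 ran the same loop.
    effective_mhz[name] = M4E_CLOCK_MHZ * t40_khz / m4e_khz
    print(f"{name}: ~{effective_mhz[name]:.0f} MHz equivalent")
```

The pin toggle loop comes out at roughly 41 MHz and the SPI frame loop at roughly 36 MHz on this scale.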

I think the XIP option is ON by default in the MCUXpresso IDE, so beware.

What is the tool chain used for building CP on the iMX-RT1062?
Will the source just drop into the MCUXpresso IDE and build/run?
Or is it a big project to even get started?

--- Graham

Yea afaik there were some plans for CPY's architecture on how to deal with the segregation of memory. @tannewt or @arturo182 might have more to say about this.

If you want to evaluate further, it is probably also a good idea to work with version 6 beta of CPY or a nightly release. It won't change very much for the i.MX RT, but it is closer to the main development branch.

Compiling CPY is very straightforward and not hard at all to get started on. For the i.MX RT port the NXP SDK is pulled in via a git submodule, and you don't really need anything other than a couple of standard dependencies like GCC for ARM.
If you are on Windows a very quick and easy way is to use Windows Subsystem for Linux:
https://learn.adafruit.com/building-circuitpython/windows-subsystem-for-linux
You can also work natively with something like MinGW, but that is a bit more fiddly. WSL is set up with a few clicks.

If you are on Linux just follow this section of the guide:
https://learn.adafruit.com/building-circuitpython/linux

After setting that up you can use any IDE you want and compile CPY for your board via make as described here: https://learn.adafruit.com/building-circuitpython/build-circuitpython

If you want to start contributing, it's probably good to read the design guide, as it explains the whole folder structure and how CPY is set up: https://circuitpython.readthedocs.io/en/latest/docs/design_guide.html
Also the contributing guidelines: https://github.com/adafruit/circuitpython/blob/main/CONTRIBUTING.md

> T4.0 SPI single byte transfer events are spaced 7200 ns (7.2 us) apart

This looks like a red flag to me. The SAMD51 is fast with SPI because it uses DMA. If the iMX isn't using DMA then this could be the culprit.

My intent was to set up the iMX so that the core CircuitPython VM code lives in ITCM, the stack lives in DTCM, and the caches are left for the parts of CircuitPython used off and on. There is definitely tuning to do.

As I wrote that, I realized displayio buffers live on the stack, which may prevent DMA from working. The example uses a CircuitPython bytearray, though, and that should live in ORAM (iirc).

I believe the memory bus connected to ORAM runs at 1/4 the core frequency, which places it in similar territory to the SAMD51's memory speed. CPU speed isn't the only thing that matters. :-)

Everything you say makes sense, but since everything measured on the iMX RT is slower than on the M4E, there is still some fundamental memory management thing to "tune". I am still learning the vocabulary that NXP uses to talk about their memories and management, much less understanding how to do it yet.

In case you haven't seen it yet, this is a nice app note by NXP specifically focusing on their "FlexRAM" architecture and its limitations: https://www.nxp.com/doc/AN12077

There is also AN12437, "i.MX RT Series Performance Optimization", which assumes that you have already read AN12077.
https://www.nxp.com/docs/en/application-note/AN12437.pdf
