Circuitpython: CircuitPython 4.x Performance issue with the pygamer vs. MicroPython

Created on 9 Jun 2019  路  14Comments  路  Source: adafruit/circuitpython

Hi,

I did a simple loop test in python to compare with my STM32F4 based MicroPython board and the results are quite unexpected. CircuitPython 4.x is 16x slower than MicroPython for an equivalent CPU frequency.

Details available here:
https://forums.adafruit.com/viewtopic.php?f=60&t=152372&p=753830#p753830

Test code:
````python
def test():
s = time.monotonic
count = 0
r = range(0, 10000000)

startTime = s()
for count in r:
    pass
endTime = s()

duration = endTime - startTime

return duration

````

Results:
````

  • 18 766ms on Meowbit/MP at 56 MHz (using pyb.millis)
  • 12 506ms on my Meowbit/MP at 84 MHz (using pyb.millis)
  • 138 390ms on my Pygamer/CP at 120 MHz (using time.monotonic)
  • 86 000ms on my Pygamer/CP at 200 MHz (using time.monotonic)
    ````

C module reference:
````C

define NOP asm volatile(" nop \n\t")

uint64_t start = ticks_ms;  
uint32_t count = 0;  
uint64_t end = 0; 

for(count=0; count<10000000; count++) { NOP; }

end = ticks_ms;  

uint32_t delta = end - start;
printf("Delta: %lu\n", delta);

````

Results:
````

  • 715ms at 56MHz on Meowbit/MP
  • 477ms at 84MHz on Meowbit/MP
  • 334ms at 120MHz on pygamer/CP
  • 200ms at 200MHz on pygamer/CP
    ````

Real issue or I did something wrong ?

enhancement

All 14 comments

This doesn't surprise me because performance hasn't been our focus. I'll probably take a glance for simple improvements in the next couple weeks but we'd totally take PRs that improve things.

I did some research on this. I looked at the stm32 port settings in MicroPython. In our ports, we use MICROPY_VM_HOOK_LOOP and MICROPY_VM_HOOK_RETURN to call run_background_tasks() very frequently. This happens after every jump opcode and every return. We do this to service USB and other background tasks that need to use polling instead of interrupts. The stm32 port does not do this.

If do not define the HOOK macros to call run_background_tasks() , your timing test and a similar one done with explicit increment and test code run run 3-4 times faster.

I could only test this in CircuitPython 3.1.2. It works at least for a short while without calling background tasks frequently, but eventually the REPL connection goes away. CircuitPython 4.0.1 doesn't work at all if I eliminate the background task call.

That accounts for a large fraction of your performance difference, but not all, and further research would uncover more.

Thanks Dan! Interesting findings! "This happens after every jump opcode and every return" looks pretty scary :)

so everything happens here, ans especially usb_background(), right?

````C

define MICROPY_VM_HOOK_LOOP run_background_tasks();

define MICROPY_VM_HOOK_RETURN run_background_tasks();

void run_background_tasks(void) {
// Don't call ourselves recursively.
if (running_background_tasks) {
return;
}
assert_heap_ok();
running_background_tasks = true;

#if (defined(SAMD21) && defined(PIN_PA02)) || defined(SAMD51)
audio_dma_background();
#endif
#if CIRCUITPY_DISPLAYIO
displayio_refresh_displays();
#endif

#if CIRCUITPY_NETWORK
network_module_background();
#endif
filesystem_background();
usb_background();
running_background_tasks = false;
assert_heap_ok();

last_finished_tick = ticks_ms;

}
````

And yep MICROPY_VM_HOOK_* are called in a lot of places...

Further testing: I have turned on MICROPY_OPT_COMPUTED_GOTO which is a performance enhancement in the VM interpreter (adds about 1700 bytes ti firmware). This further speeds up the basic timing loop by a factor of 5. Together these seem to account for pretty much all the speed differences.

The basic test I am using (to avoid any issues about how range() is implemented:

import time
def test():
    i = 100000
    startTime = time.monotonic()
    while i > 0:
        i -= 1
    endTime = time.monotonic()
    return endTime - startTime

print(test())

Interesting ! I'll try tonight on my pygamer to enable MICROPY_OPT_COMPUTED_GOTO and disable CIRCUITPY_DISPLAYIO. I'll post my results here.

@dhalbert, with the last CircuitPython source code, when I add #define MICROPY_OPT_COMPUTED_GOTO (1) I got this error at compilation time:

````Makefile
In file included from ../../py/vm.c:129:0:
../../py/vmentrytable.h:27:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
#if __clang__
^~~
../../py/vmentrytable.h:116:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
#if __clang__
^
~~
cc1: all warnings being treated as errors

make: * [../../py/mkrules.mk:82: build-pygamer/genhdr/qstr.i.last] Error 123
````

Anything I should do ?

Change both #if __clang__ to #ifdef __clang__. The original is a bit sloppy and we turn on warnings on that.

Thanks!

So I ran again the same test with MICROPY_OPT_COMPUTED_GOTO enabled and CIRCUITPY_DISPLAYIO disabled:

Results: 35.3ms

So a huge improvement but still some room for optimization to reach the ~10ms on an equivalent 120MHz STM32F4 board with MicroPython.

Good job @dhalbert !

Thanks! The next step is more work, which is to figure out how to call the background tasks only as often as necessary. That will get the next 2-3x speedup, I think.

We really appreciate your taking the time to do this comparison and point out the big disparity.

Just tried at 200Mhz: 21.2 ms.

Ah ah, no issue :) It forces me to go into the details to get a better understanding of CircuitPython, I like that. But a beer in the Walking City downtown would be appreciated too :)

Closing in favor of #2142.

trying with
Adafruit CircuitPython 5.0.0-beta.0-109-g15886b150 on 2019-12-03; Adafruit Feather M4 Express with samd51j19

got 18.765 seconds @ 120 MHz, so a 7.4x improvement over original 138.39 :)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bbtinkerer picture bbtinkerer  路  26Comments

DWiskow picture DWiskow  路  29Comments

kattni picture kattni  路  49Comments

tannewt picture tannewt  路  41Comments

PaulStoffregen picture PaulStoffregen  路  29Comments