ESP-IDF: A simple loop is slower than on the ESP8266

Created on 9 Nov 2016 · 4 Comments · Source: espressif/esp-idf

long t;int i;int j; for(i=0;i<10000;i++) for(j=0;j<1000;j++) t=t+j;

Most helpful comment

Wow, that's a very interesting observation!

GCC 4.8.5 for the ESP8266 and GCC 4.8.5 for the ESP32 both produce exactly identical instruction sequences for this loop.
The difference is that on the ESP8266 the inner loop takes 5 clock cycles on average to execute, while
on the ESP32 it takes 6 clock cycles on average.
So I would say this is not an issue with ESP-IDF: code generation is the last stage that ESP-IDF can influence, and the resulting code is the same as on the 8266.

One might be curious about two things:
1) why does the same instruction sequence take one cycle longer to execute on the ESP32 than on the ESP8266?
2) why does the compiler not optimize the loop away?

For fun and learning, I suggest figuring out the answers yourself (it won't take much time...). Then you can check your answers in the code fold below.


Answers

  1. The cores inside the ESP32 have a 7-stage pipeline, while the ESP8266 has a 5-stage pipeline. On average, if there are no pipeline stalls, the Xtensa architecture executes one instruction per cycle. A branch costs 2 (3) extra cycles with a 5-stage pipeline and 3 (4) extra cycles with a 7-stage pipeline. The numbers in brackets apply when the branch target is unaligned, but GCC does a nice job of aligning them.
    The inner loop requires three instructions:
begin:
  $a12 += $a3 // add.n    $a12, $a12, $a3
  $a3 += 0x1 // addi.n   $a3, $a3, 0x1
  if ($a3 != $a2) goto begin // bne      $a3, $a2, begin

So with the 5-stage pipeline, one iteration takes 5 cycles in total (3 instructions plus 2 cycles for the branch), and with the 7-stage pipeline it takes 6 cycles (3 plus 3). Branches are therefore less efficient on the ESP32, but thanks to the longer pipeline it can suffer fewer stalls during longer sequences of instructions, especially when more registers are used and when memory reads are performed.
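
The cycle counts above can be reproduced with a back-of-the-envelope model. This is only a sketch and not part of the original answer: the function names are made up, and it assumes every instruction takes exactly one cycle, the branch penalty is paid on every iteration, and outer-loop overhead is negligible.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles for one loop iteration under the simple model from the answer:
 * one cycle per instruction plus a fixed penalty for the branch. */
static uint32_t cycles_per_iteration(uint32_t instructions, uint32_t branch_penalty)
{
    assert(instructions > 0); /* the loop contains at least the branch itself */
    return instructions + branch_penalty;
}

/* Total cycles for the nested loop (outer * inner iterations),
 * ignoring the overhead of the outer loop. */
static uint64_t total_cycles(uint32_t outer, uint32_t inner, uint32_t per_iteration)
{
    return (uint64_t)outer * inner * per_iteration;
}
```

With 3 instructions and a penalty of 2, `cycles_per_iteration` gives 5 (the ESP8266 case); with a penalty of 3 it gives 6 (the ESP32 case). `total_cycles(10000, 1000, 6)` then comes to 60 million cycles for the whole loop on the ESP32.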

  2. Both the ESP8266 and ESP32 SDKs use GCC 4.8.5 right now. GCC 4.8.5 can optimize loops _like_ this into a constant. The problem with this loop is that it invokes undefined behavior. On these targets long can only hold values up to (2^31 - 1), so after ~4k complete runs of the inner loop (that is, ~4k iterations of the outer loop), t will overflow. This wouldn't have been a problem if t were an unsigned long, but overflow of a signed integer is undefined in C. If you compile this code with a toolchain based on GCC 5.2.0 (and silence/ignore the warnings about UB), the whole function executes in just 11 cycles (instead of ~60 million). The result, however, is not guaranteed to be correct. In this specific case, the compiler will produce the same result as the ESP32 chip, but in other cases the story may be different. See this Stack Overflow question for an interesting case.
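
To make the overflow concrete, here is a minimal sketch (not from the original thread) of the same loop written with well-defined arithmetic. It assumes t starts at 0, which the original snippet does not guarantee, and the function and macro names are hypothetical. Fixed-width types are used because long is 32 bits on these chips but may be 64 bits on a desktop machine.

```c
#include <assert.h>
#include <stdint.h>

#define OUTER 10000 /* outer loop bound from the original snippet */
#define INNER 1000  /* inner loop bound from the original snippet */

/* 32-bit unsigned accumulator: wrap-around modulo 2^32 is defined
 * behaviour in C, unlike signed overflow. */
static uint32_t sum_wrapping_u32(void)
{
    uint32_t t = 0; /* assumption: the original left t uninitialised */
    for (int i = 0; i < OUTER; i++)
        for (int j = 0; j < INNER; j++)
            t += (uint32_t)j;
    return t;
}

/* 64-bit accumulator: large enough that no overflow occurs. */
static int64_t sum_exact_i64(void)
{
    int64_t t = 0;
    for (int i = 0; i < OUTER; i++)
        for (int j = 0; j < INNER; j++)
            t += j;
    return t;
}

/* Closed form a compiler may fold such a loop into:
 * outer * (0 + 1 + ... + (inner - 1)). */
static int64_t sum_closed_form(int64_t outer, int64_t inner)
{
    assert(outer >= 0 && inner >= 0);
    return outer * (inner * (inner - 1) / 2);
}
```

The exact sum is 4,995,000,000, which is more than twice the 2,147,483,647 that a 32-bit signed long can hold. The unsigned version returns that value reduced modulo 2^32.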

All 4 comments

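long long t = 0; int i; int j;
for (i = 0; i < 10000; i++)
    for (j = 0; j < 100; j++)
    {
        t = t + j + 1;
        t = t + j + 2;
        t = t + j + 3;
        t = t + j + 4;
        t = t + j + 5;
        t = t + j + 6;
        t = t + j + 7;
        t = t + j + 8;
        t = t + j + 9;
        t = t + j + 10;
        t = t + j + 11;
        t = t + j + 12;
        t = t + j + 13;
        t = t + j + 14;
        t = t + j + 15;
        t = t + j + 16;
        t = t + j + 17;
        t = t + j + 18;
        t = t + j + 19;
        t = t + j + 20;
        t = t + j + 21;
        t = t + j + 22;
        t = t + j + 23;
        t = t + j + 24;
        t = t + j + 25;
        t = t + j + 26;
        t = t + j + 27;
        t = t + j + 28;
        t = t + j + 29;
        t = t + j + 30;
        t = t + j + 31;
        t = t + j + 32;
        t = t + j + 33;
        t = t + j + 34;
        t = t + j + 35;
        t = t + j + 36;
        t = t + j + 37;
        t = t + j + 38;
        t = t + j + 39;
        t = t + j + 40;
    }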
I used long long, and the speed ratio is 3:2 (ESP32:ESP8266), the same as the clock ratio of 240 MHz:160 MHz. But the ESP32 has a 7-stage pipeline and the ESP8266 has a 5-stage pipeline, so why isn't the ESP32 a little faster than that?

Your inner loop doesn't cause pipeline stalls, except for the branch instruction, so both ESP32 and ESP8266 execute 1 instruction per cycle.
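
The 3:2 ratio follows from the same one-instruction-per-cycle model. As a rough sketch (hypothetical function names; the clock rates are the 240 MHz and 160 MHz quoted in the question, and the branch cost is assumed to be 3 extra cycles on the ESP32 and 2 on the ESP8266 for every iteration), the branch penalty matters less as the loop body grows, so the run-time ratio approaches the clock ratio:

```c
#include <assert.h>

/* Wall-clock seconds for `iterations` passes of a loop whose body is
 * `body_instructions` long (branch included), one cycle each, plus
 * `branch_penalty` extra cycles per pass for the branch. */
static double loop_seconds(double iterations, unsigned body_instructions,
                           unsigned branch_penalty, double clock_hz)
{
    assert(clock_hz > 0.0);
    return iterations * (body_instructions + branch_penalty) / clock_hz;
}

/* How many times faster the ESP32 finishes than the ESP8266
 * under this simplified model. */
static double esp32_speedup(unsigned body_instructions)
{
    const double n = 1e6; /* arbitrary iteration count; it cancels out */
    double t_esp8266 = loop_seconds(n, body_instructions, 2, 160e6);
    double t_esp32 = loop_seconds(n, body_instructions, 3, 240e6);
    return t_esp8266 / t_esp32;
}
```

Under these assumptions `esp32_speedup(3)` is 1.25 for the three-instruction loop from the first post, and the value tends toward 1.5 as `body_instructions` grows, which matches the 3:2 ratio reported for the long unrolled body.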

This subject is very interesting. Could you create a wiki page about it?
Adding other measurements ...
