long t;int i;int j;
for(i=0;i<10000;i++)
for(j=0;j<1000;j++)
t=t+j;
Wow, that's a very interesting observation!
GCC 4.8.5 for ESP8266 and GCC 4.8.5 for the ESP32 both produce exactly identical instruction sequences for this loop.
The difference is that on the ESP8266 the inner loop takes 5 clock cycles on average to execute, while on the ESP32 it takes 6 clock cycles on average.
So I would say this is not an issue with ESP-IDF: code generation is the last stage that ESP-IDF can affect, and the resulting code is the same as on the ESP8266.
One might be curious about two things:
1) why does the same instruction sequence take one cycle longer to execute on the ESP32 than on the ESP8266?
2) why does the compiler not optimize the loop away?
For fun and learning, I suggest figuring out the answers yourself (it won't take much time...). Then you can check your answers in the code fold below.
Answers
The inner loop requires three instructions:
begin:
$a12 += $a3 // add.n $a12, $a12, $a3
$a3 += 0x1 // addi.n $a3, $a3, 0x1
if ($a3 != $a2) goto begin // bne $a3, $a2, begin
So with the 5-stage pipeline it totals 5 cycles per iteration, and with the 7-stage pipeline it totals 6 cycles per iteration: three instructions plus the cycles lost each time the taken branch stalls the pipeline. Branches are therefore less efficient on the ESP32, but thanks to the longer pipeline it can suffer fewer stalls in longer instruction sequences, especially when more registers are used and when memory reads are performed.
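The per-iteration figures can be reproduced with a small back-of-the-envelope model. This is only a sketch: it assumes every instruction issues in one cycle and that the only stall is a fixed penalty on the taken branch (2 cycles on the ESP8266 and 3 on the ESP32, values inferred from the 5 and 6 cycle measurements in this thread, not taken from a datasheet). The function name `loop_cycles` is made up for illustration.

```c
#include <stdint.h>

/* Rough cycle estimate for a loop, assuming one cycle per instruction
 * plus a fixed stall each time the loop branch is taken.
 * The penalty passed in is an assumption inferred from the measurements
 * in this thread, not a documented Xtensa figure. */
uint64_t loop_cycles(uint32_t instrs_per_iter,
                     uint32_t branch_penalty,
                     uint64_t iterations)
{
    return (uint64_t)(instrs_per_iter + branch_penalty) * iterations;
}
```

For the 10000 * 1000 iterations of the original loop, `loop_cycles(3, 2, 10000000)` gives 50,000,000 cycles for the ESP8266 and `loop_cycles(3, 3, 10000000)` gives 60,000,000 cycles for the ESP32. The model ignores the outer loop overhead and the final, not-taken branch of each inner loop, which are negligible here.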
On these targets a long can only hold values up to (2^31 - 1), so after about 4.3k iterations of the outer loop (each full pass of the inner loop adds 499,500), t will overflow. This wouldn't have been a problem if t was an unsigned long, but overflow of a signed integer is undefined in C. If you compile this code with a toolchain based on GCC 5.2.0 (and silence/ignore the warnings about UB), the whole function would execute in just 11 cycles (instead of ~60 million). The result, however, would not be guaranteed to be correct. In this specific case, the compiler will produce the same result as the ESP32 chip, but in some other case the story may be different. See this Stack Overflow question for an interesting case.
long long t=0;int i;int j;
for(i=0;i<10000;i++)
for(j=0;j<100;j++)
{t=t+j+1;
t=t+j+2;
t=t+j+3;
t=t+j+4;
t=t+j+5;
t=t+j+6;
t=t+j+7;
t=t+j+8;
t=t+j+9;
t=t+j+10;
t=t+j+11;
t=t+j+12;
t=t+j+13;
t=t+j+14;
t=t+j+15;
t=t+j+16;
t=t+j+17;
t=t+j+18;
t=t+j+19;
t=t+j+20;
t=t+j+21;
t=t+j+22;
t=t+j+23;
t=t+j+24;
t=t+j+25;
t=t+j+26;
t=t+j+27;
t=t+j+28;
t=t+j+29;
t=t+j+30;
t=t+j+31;
t=t+j+32;
t=t+j+33;
t=t+j+34;
t=t+j+35;
t=t+j+36;
t=t+j+37;
t=t+j+38;
t=t+j+39;
t=t+j+40;
}
I used long long, and the speed ratio is 3:2 (ESP32:ESP8266), the same as the clock ratio 240 MHz:160 MHz. But the ESP32 has a 7-stage pipeline and the ESP8266 has a 5-stage pipeline, so why isn't the ESP32 a little faster?
Your inner loop doesn't cause pipeline stalls, except for the branch instruction, so both ESP32 and ESP8266 execute 1 instruction per cycle.
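To see why the measured ratio lands on the clock ratio, the one-instruction-per-cycle picture can be combined with the clock frequencies. A rough sketch, assuming 160 MHz and 240 MHz clocks (the figures quoted in the question), taken-branch penalties of 2 and 3 cycles (inferred from the 5 and 6 cycle results for the three-instruction loop) and no other stalls. `speedup_esp32_over_esp8266` is a hypothetical helper, and since the instruction count of the unrolled long long body is not given in the thread, it is left as the parameter `n`.

```c
/* Estimated wall-clock speedup of the ESP32 over the ESP8266 for a loop
 * whose body has n single-cycle instructions (including the branch).
 * Assumptions, not datasheet values: 160 MHz / 240 MHz clocks and
 * taken-branch penalties of 2 / 3 cycles. */
double speedup_esp32_over_esp8266(unsigned n)
{
    const double esp8266_hz = 160e6, esp32_hz = 240e6;
    const double esp8266_penalty = 2.0, esp32_penalty = 3.0;

    /* seconds per loop iteration on each chip */
    double t_esp8266 = (n + esp8266_penalty) / esp8266_hz;
    double t_esp32   = (n + esp32_penalty)   / esp32_hz;

    return t_esp8266 / t_esp32;
}
```

Under these assumptions the three-instruction loop gives a speedup of 1.25, while a body of around 200 instructions gives about 1.49. The extra branch cycle is amortized over the long body, so the ratio approaches 3:2, and a deeper pipeline by itself does not make stall-free straight-line code any faster per clock.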
This subject is very interesting. Could you create a wiki page about it?
Adding other measurement ...