Creating a new thread for this so I don't clutter the other sharing threads
Current performance figures:

Slider 1: Loop time excluding raster and TFT draw time (4ms)
Slider 2: TFT draw time (248ms)
Slider 3: Raster time (187ms)
The main loop without rasterizing and drawing to the screen is running at ~240Hz which is crazy fast for such a small device (160MHz ESP32), no need for any more performance gains here.
The TFT library is pushed right to its limit, only way to get it any faster will be to crank up the SPI speed or somehow work out how to only draw to the parts of the screen that have actually changed between refreshes (this would be handy for other devices with less RAM).
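A rough sketch of the "only draw what changed" idea, for anyone who wants to try it. It keeps one 32-bit hash per row instead of a second framebuffer, so the extra RAM is small. The names (`RowRange`, `hashRow`, `findDirtyRows`) are made up for this example and are not part of ImDuino or the TFT library, and a hash collision could in principle hide a changed row.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct RowRange { int first, last; };  // inclusive; first > last means "nothing changed"

// FNV-1a hash over one row of 16-bit pixels.
inline uint32_t hashRow(const uint16_t* row, int width)
{
    uint32_t h = 2166136261u;
    for (int x = 0; x < width; ++x) {
        h = (h ^ (row[x] & 0xFFu)) * 16777619u;  // low byte
        h = (h ^ (row[x] >> 8)) * 16777619u;     // high byte
    }
    return h;
}

// Updates rowHashes in place and returns the span of rows whose hash changed
// since the previous call. Only that span needs to be pushed to the display.
RowRange findDirtyRows(const uint16_t* frame, int width, int height,
                       std::vector<uint32_t>& rowHashes)
{
    rowHashes.resize(static_cast<size_t>(height), 0);
    RowRange dirty{height, -1};
    for (int y = 0; y < height; ++y) {
        const uint32_t h = hashRow(frame + static_cast<size_t>(y) * width, width);
        if (h != rowHashes[static_cast<size_t>(y)]) {
            rowHashes[static_cast<size_t>(y)] = h;
            if (y < dirty.first) dirty.first = y;
            dirty.last = y;
        }
    }
    return dirty;
}
```

The returned range could then be used to set a smaller window on the display before writing pixels, at the cost of hashing every row each frame.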
The rasterizer is still running a little slow, but as mentioned before it can be optimised with a heap of special cases, some of which have been tested but are currently disabled for debugging purposes.
Test if 2 triangles are actually a square
Test if triangle is a flat colour with basically no UV map, if it is don't bother interpolating values
Test if triangle is actually a line/single pixel
Faster lerp factor calculation for grid aligned triangles/rectangles
Remove rounding
Special cases for alpha blending
Less specific to the rasterizer but not rendering the window or background makes sense on a platform like this and should also give a performance boost
I'm going to continue to try and get as much performance as I can out of the software rasterizer for use both on Arduino and for general PC use, perhaps it will be a good base for an sr example or the regression testing system mentioned in an earlier thread.
If anyone has any working examples for optimisations that would certainly help speed me up
Code: https://github.com/LAK132/ImDuino
Only necessary modification to the base ImGui library was the removal of #include <memory.h> from stb_truetype.h
draw (subsection of) font bitmap directly when drawing characters.
Treat the texture as bitmap - if there's a white pixel, draw a pixel to your framebuffer, else do nothing. No need for alpha blending there.
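Something along these lines, as a sketch only. `blitGlyph` and its parameters are hypothetical, the atlas is assumed to be Alpha8 (one byte per pixel) and the framebuffer 16-bit, and the `threshold` for "white" is an arbitrary choice.

```cpp
#include <cstddef>
#include <cstdint>

// Copy a w x h glyph from an Alpha8 atlas into a 16-bit framebuffer.
// Texels above `threshold` are drawn in `colour`; all others are skipped,
// so no blending is performed.
void blitGlyph(uint16_t* fb, int fbW, int fbH,
               const uint8_t* atlas, int atlasW,
               int srcX, int srcY, int w, int h,
               int dstX, int dstY, uint16_t colour, uint8_t threshold = 127)
{
    for (int y = 0; y < h; ++y) {
        const int fy = dstY + y;
        if (fy < 0 || fy >= fbH) continue;        // row is off screen
        for (int x = 0; x < w; ++x) {
            const int fx = dstX + x;
            if (fx < 0 || fx >= fbW) continue;    // column is off screen
            if (atlas[static_cast<size_t>(srcY + y) * atlasW + (srcX + x)] > threshold)
                fb[static_cast<size_t>(fy) * fbW + fx] = colour;
        }
    }
}
```

This only works when glyphs are drawn unscaled, so that one atlas texel maps to one screen pixel.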
Yes it is also worth noting the atlas texture can be output as Alpha8, so 1 byte per pixel without color information.
Interesting you should mention that, it is currently using Alpha8 but I had to immediately MemFree the pixel buffer it returns as it is way too big to fit in the ESP32's RAM. Luckily I managed to store it as a constant at the top of the ino so it is read from flash rather than RAM.
Might be able to get it into RAM if it was returned as a more space efficient 2D array rather than one large 1D array
If you're low on ram, convert it to 1bit per pixel format, that should reduce it to 4kB at the cost of a few bitops more per texture access.
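For illustration, a minimal sketch of that packing, assuming an Alpha8 source buffer and a simple threshold to decide which pixels count as set. The function names and the threshold value are assumptions of this example, not anything from ImGui.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack an Alpha8 image (one byte per pixel) into 1 bit per pixel.
// Rows are padded to whole bytes; the most significant bit is the leftmost pixel.
// `threshold` is an assumed cut-off: alpha values at or above it count as set.
std::vector<uint8_t> packAlpha8To1bpp(const uint8_t* alpha8, int width, int height,
                                      uint8_t threshold = 128)
{
    const int stride = (width + 7) / 8;  // bytes per packed row
    std::vector<uint8_t> packed(static_cast<size_t>(stride) * height, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (alpha8[static_cast<size_t>(y) * width + x] >= threshold)
                packed[static_cast<size_t>(y) * stride + x / 8] |= uint8_t(0x80u >> (x % 8));
    return packed;
}

// Read one texel back: an index calculation, a shift and a mask per access.
inline bool texel1bpp(const uint8_t* packed, int width, int x, int y)
{
    const int stride = (width + 7) / 8;
    return ((packed[static_cast<size_t>(y) * stride + x / 8] >> (7 - x % 8)) & 1u) != 0;
}
```

When the width is a multiple of 8 the packed buffer is exactly one eighth of the Alpha8 size. On a device like this the packing would more likely be done offline, with the result stored as a constant array.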
Reading flash is probably still faster than doing bitops, but I'd need to actually test that to be sure
Yes in the case of the default font, even 1bpp would work.
By the way I just added two flags to ImFontAtlas which are helpful in that sort of situation:
enum ImFontAtlasFlags_
{
    ImFontAtlasFlags_NoPowerOfTwoHeight = 1 << 0,   // Don't round the height to next power of two
    ImFontAtlasFlags_NoMouseCursors     = 1 << 1    // Don't build software mouse cursors into the atlas
};
// Use with
io.Fonts->Flags |= ImFontAtlasFlags_NoPowerOfTwoHeight | ImFontAtlasFlags_NoMouseCursors;
ImFontAtlasFlags_NoPowerOfTwoHeight is probably usable with most backends, though I'm not sure how it may impact performance on modern GPUs.
With default flags: 32768 bytes

Without mouse cursors, without rounding height to next power-of-two: 13824 bytes

There's also a ProggyTiny font in misc/fonts you may use for that sort of screen.
This thread is beautiful, I have some ESP32 here in the office getting dust ;)
ImFontAtlasFlags_NoPowerOfTwoHeight is probably usable with most backends, though I'm not sure how it may impact performance on modern GPUs.
On one small texture, they won't even notice :) IIRC I used single npot screen-sized texture per frame on Radeon 9600 some 10 years ago (for a video player) and generally there was tiny perf difference (if any at all, and it was certainly faster than stretching the image to power of two dimensions before sending it to gpu).
First lot of optimizations more than halved the raster time (180ms -> 80ms). Roughly 11FPS excluding screen updates

I also added support for 8bit, 16bit, 24bit and 32bit textures. Might be able to speed the raster time up further if you only use one type, but potentially at the cost of space (which the ESP32 doesn't have much of)
It's a little curious how you are using SliderFloat to display times, instead of, say ImGui::Text("%f ms", time);
Which optimizations of the ones above have you applied?
Just removed rounding and added special cases for alpha blending (return if 0, don't blend if 255). Currently working on adding more
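For readers following along, a hedged sketch of what those two alpha special cases could look like for an RGB565 framebuffer. `blendPixel565` is a made-up name and this is not the code in the repository. Whether the early-outs actually pay off depends on the compiler and the data, so it is worth measuring.

```cpp
#include <cstdint>

// Blend an RGB565 source colour over an RGB565 destination pixel.
// alpha == 0 leaves dst untouched, alpha == 255 overwrites it,
// anything else does a per-channel linear blend.
inline void blendPixel565(uint16_t& dst, uint16_t src, uint8_t alpha)
{
    if (alpha == 0) return;                    // fully transparent: nothing to do
    if (alpha == 255) { dst = src; return; }   // fully opaque: plain copy
    const uint32_t a = alpha, ia = 255u - alpha;
    const uint32_t r = (((src >> 11) & 0x1Fu) * a + ((dst >> 11) & 0x1Fu) * ia) / 255u;
    const uint32_t g = (((src >> 5) & 0x3Fu) * a + ((dst >> 5) & 0x3Fu) * ia) / 255u;
    const uint32_t b = ((src & 0x1Fu) * a + (dst & 0x1Fu) * ia) / 255u;
    dst = static_cast<uint16_t>((r << 11) | (g << 5) | b);
}
```

The division truncates rather than rounds, in keeping with the "remove rounding" item.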
Alright, I think this is about as good as I'm gonna get it
That needs to be at minimum 10 times faster to be usable, let's make it happen :)
You still have WindowRounding and borders visible in the video. The rounding will cause your window background to use large thin triangles instead of one rectangle. You'll probably double your speed for that given code just by disabling WindowRounding. Have you got anti-aliasing enabled? Rounding plus borders with AA costs you roughly double the number of vertices in that shot.
I'm not sure I understand why you have those 8/16/24/32 paths, especially for textures as you know your texture is 1bpp or 8bpp?
You detect rectangle by comparing vertex contents whereas you could compare indices.
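A sketch of the index-based check. It assumes a quad arrives as six indices in the pattern a,b,c, a,c,d, which I believe is how Dear ImGui emits rectangle primitives, but that should be verified against the version in use. `Vertex` is a stand-in for `ImDrawVert`, and both function names are hypothetical.

```cpp
#include <cstdint>

struct Vec2 { float x, y; };
struct Vertex { Vec2 pos; Vec2 uv; uint32_t col; };  // stand-in for ImDrawVert

// True if six consecutive indices follow the pattern a,b,c, a,c,d,
// i.e. two triangles sharing the diagonal a-c of one quad.
inline bool sharesQuadIndices(const uint16_t* idx)
{
    return idx[3] == idx[0] && idx[4] == idx[2];
}

// Cheap index test first, then confirm the four corners are axis aligned,
// since rotated or skewed quads share the same index pattern.
inline bool isAxisAlignedQuad(const Vertex* v, const uint16_t* idx)
{
    if (!sharesQuadIndices(idx)) return false;
    const Vec2 a = v[idx[0]].pos, b = v[idx[1]].pos;
    const Vec2 c = v[idx[2]].pos, d = v[idx[5]].pos;
    return (a.y == b.y && b.x == c.x && c.y == d.y && d.x == a.x) ||
           (a.x == b.x && b.y == c.y && c.x == d.x && d.y == a.y);
}
```

The index comparison is two integer tests, so most triangle pairs that are not quads are rejected before any vertex data is read.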
The triangle rasterization could be done much faster, maybe look up state-of-the-art triangle rasterization.
Not sure why you go and do all those extraction of colors when it's not necessary for case where we don't blend?
And you can switch to ProggyTiny (10 px) instead of ProggyClean (13 px) for that sort of screen.
I think I'm going to run a little bounty challenge for that tonight! It would be useful to have a good specialized software rasterizer available for imgui. Someone specialized in that sort of thing (not me) could probably get us 100 times faster. I guess using a lot of floating point on the ESP32 isn't exactly desirable?
EDIT Also added a link to my comment in the gallery thread: https://github.com/ocornut/imgui/issues/1269#issuecomment-364374265 for people stumbling here.
Alright, that points me in the right direction for more optimisations at least. Currently the font atlas is 8bit, the screen is 16bit and ImGui seems to work in 24/32, and I didn't see any performance impact from having them all supported by texture_t. I also found that checking for the cases where it doesn't blend was actually slower than just blending. Might have something to do with the compiler's optimisations?
As pointed out by Per on twitter (I had dumbly overlooked the actual numbers) the raster cost is only a fifth of the cost, so while ultimately we can drive that down, it should probably be tackled along with the final blitting, which is currently the slowest part.
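To make the format gap concrete, here is a small sketch of converting a 32-bit packed colour to the 16-bit RGB565 the screen uses. It assumes red sits in the low byte of the 32-bit value, which matches Dear ImGui's default packing as far as I know (a BGRA-packed build would differ). The function name is hypothetical.

```cpp
#include <cstdint>

// Convert a 32-bit packed colour (red in the low byte, alpha in the high byte)
// to RGB565 by dropping the low bits of each channel. Alpha is ignored.
inline uint16_t rgba8888ToRgb565(uint32_t col)
{
    const uint32_t r = col & 0xFFu;
    const uint32_t g = (col >> 8) & 0xFFu;
    const uint32_t b = (col >> 16) & 0xFFu;
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Converting once per vertex colour, rather than once per pixel, keeps this out of the inner loop for flat-coloured shapes.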
Where is the drawBitmap() function you are calling in UpdateScreen?
https://github.com/LAK132/ImDuino/blob/master/ImDuino.ino#L30
If I look here there's no copy of drawBitmap() that matches your exact prototype
https://github.com/Nkawu/TFT_22_ILI9225/blob/master/src/TFT_22_ILI9225.cpp
The good news is that this TFT_22_ILI9225 code seemingly has immense room for optimization.
Check the PRs on that repo, my version is several times faster (4s vs 250ms)
Thanks!
I'll post it here https://github.com/Nkawu/TFT_22_ILI9225/pull/23
I asked on twitter for people to try to help solve it (with a bounty).
OK so you've already done a good job optimizing that part from the original version, which leaves us with less obvious avenues.
I had built a soft renderer for imgui too, but have no idea whether it will be faster on ESP32 (also, not interested in the bounty, attribution is more than enough) - https://github.com/AlgoTradingHub/imgui_rt
@wizzard0 I'll see if I can get it running on my ESP tomorrow afternoon
Here is an idea: instead of optimizing the rasterizer, ImGui could support terminal rendering (prototype is here: https://github.com/jonvaldes/tear_imgui, video https://www.youtube.com/watch?v=OEGb4HrMkDo). This way you don't have to optimize generalized polygon rendering; rather, you focus only on terminal text rendering.
Haven't started optimising yet, but I did add a screen clip. Worst case it's 4 if/pixel slower, best case it doesn't draw to the screen at all. Current test case is 2x faster:
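For comparison, a sketch of a slightly different technique: clipping once per rectangle rather than testing every pixel. The comparisons are then paid once per primitive, and a fully off-screen rectangle is rejected before any pixel work. `Rect` and `clipToScreen` are names invented for this example.

```cpp
#include <algorithm>
#include <cstdint>

struct Rect { int x0, y0, x1, y1; };  // half-open: [x0, x1) x [y0, y1)

// Clamp a rectangle to the screen bounds.
// Returns false if nothing of the rectangle remains visible.
inline bool clipToScreen(Rect& r, int screenW, int screenH)
{
    r.x0 = std::max(r.x0, 0);
    r.y0 = std::max(r.y0, 0);
    r.x1 = std::min(r.x1, screenW);
    r.y1 = std::min(r.y1, screenH);
    return r.x0 < r.x1 && r.y0 < r.y1;
}
```

The same clamp can be applied to a triangle's bounding box before scanning it.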

This version requires the testing branch of my fork of the TFT library https://github.com/LAK132/TFT_22_ILI9225/tree/testing
None of the triangle render functions are working yet, but the new rectangle functions seem to be a heap faster (down to 2~3ms)

EDIT: Current version no longer crashes on renderTri but it still isn't drawing correctly. Raster time is now at 13ms, a little over 10x faster than the first version

The rewrite has been successful (as far as I can tell), it's well over 10x faster with WindowRounding disabled
With WindowRounding:

Without WindowRounding:

I made a software rasterizer for Dear ImGui which is NOT made for Arduino (it relies heavily on floating point math), but it could maybe be a useful reference: https://github.com/emilk/imgui_software_renderer/blob/master/src/imgui_sw.cpp
I'm close to breaking that 10x faster threshold with some more modifications to the TFT library

My version now looks like this
void TFT_22_ILI9225::_spiWrite16(uint16_t s)
{
#ifdef HSPI_WRITE16
    if (_clk < 0) {
        HSPI_WRITE16(s);
        return;
    }
#endif
    _spiWrite((uint8_t)(s >> 8));
    _spiWrite((uint8_t)s);
}

void TFT_22_ILI9225::drawBitmap(uint16_t x1, uint16_t y1,
                                const uint16_t* bitmap, int16_t w, int16_t h)
{
    _setWindow(x1, y1, x1 + w - 1, y1 + h - 1, L2R_TopDown);
    startWrite();
    SPI_DC_HIGH();
    SPI_CS_LOW();
#ifdef HSPI_WRITE_PIXELS
    if (_clk < 0) {
        HSPI_WRITE_PIXELS(bitmap, w * h * sizeof(uint16_t));
    } else
#endif
    for (uint16_t i = 0; i < h * w; ++i) {
        _spiWrite16(bitmap[i]);
    }
    SPI_CS_HIGH();
    endWrite();
}
This is with the hardware SPI clocked at 20MHz. The ESP32 can handle 40MHz (and even 80MHz iirc), but the cables I'm using aren't good enough for that kind of speed.
And there we have it, software rasteriser running the test code in under 10ms!
Rasteriser is roughly 20x faster than in the original post, full loop roughly 10x faster!

I have also moved some stuff around, softraster is now in the misc folder and there is an example impl for it:
https://github.com/LAK132/ImDuino/blob/master/ImDuino.ino
https://github.com/LAK132/ImDuino/blob/master/misc/softraster/softraster.h
https://github.com/LAK132/ImDuino/blob/master/examples/imgui_impl_softraster.h
Great work! I'm excited to use this in a future project.
It looks like there's still a few more kinks to work out, mainly texture mapping and alpha blending, but performance looks rock solid even on PC!

Nice! Would you mind updating the wiki (root page and/or back-end page) with any useful applicable link? Thank you!