Emscripten: LTO in wasm backend not that effective

Created on 28 Feb 2020 · 15 comments · Source: emscripten-core/emscripten

E.g. on Box2D it increases code size by 2.5%, and users report similar things on the mailing list - results with fastcomp used to be better. (tested with -flto at compile and -flto --llvm-lto 1 at link)

Perhaps LLVM LTO has changed a lot between the LLVM versions, and no longer shrinks code size as much? Or perhaps we are not doing something right? cc @sbc100
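For concreteness, the flag combination mentioned above corresponds to a build along these lines (a sketch; file names are hypothetical):

```shell
# Compile each translation unit to LLVM bitcode for LTO.
emcc -O2 -flto -c a.cpp -o a.o
emcc -O2 -flto -c b.cpp -o b.o

# Link with LLVM LTO enabled (--llvm-lto 1 runs the LTO passes at link time).
emcc -Os -flto --llvm-lto 1 a.o b.o -o out.js
```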


All 15 comments

Which box2d exactly are you looking at? I'd like to reproduce your results and take a look.

Just the Box2D in the test suite, tests/box2d. I tinkered a little with the Makefile,

diff --git a/tests/box2d/Makefile b/tests/box2d/Makefile
index ee1f4b5e2..12afde209 100644
--- a/tests/box2d/Makefile
+++ b/tests/box2d/Makefile
@@ -51,11 +51,14 @@ OBJECTS = \
 all: box2d.a

 %.o: %.cpp
-       $(CXX) $(CFLAGS) -I. $< -o $@ -O2 -c -fno-exceptions -fno-rtti
+       $(CXX) $(CFLAGS) -I. $< -o $@ -O2 -c -fno-exceptions -fno-rtti -flto

 box2d.a: $(OBJECTS)
        $(AR) rvs $@ $(OBJECTS)

+bench: box2d.a
+       $(CXX) $(CFLAGS) Benchmark.cpp box2d.a -I. -Os -fno-exceptions -fno-rtti -o bench-lto.js -flto --llvm-lto 1
+
 clean:
-       rm box2d.a
+       rm box2d.a $(OBJECTS)

I can reproduce those results, yes. It does seem strange. Presumably LTO can sometimes increase size because of inlining?

Yeah, I believe on fastcomp we had the option to disable inlining (if you linked with -s INLINING_LIMIT=1 then it passed some flag to LLVM for that), so that might be part of the difference.
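For reference, the fastcomp-era inlining knob mentioned here was a link-time setting (a sketch; the object and output names are hypothetical):

```shell
# fastcomp only: -s INLINING_LIMIT=1 passed a flag to LLVM that
# effectively disabled inlining, which kept LTO from growing code size.
emcc -O2 --llvm-lto 1 -s INLINING_LIMIT=1 main.o box2d.a -o bench.js
```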

However, if the source files were compiled with -Os then it should be conservative about such size increases, I would think? (It doesn't change the results on box2d for me.) Perhaps something changed there.

We can run the entire benchmark suite to get more data. Overall I am worried because I've not seen LTO be a win since we switched to the wasm backend, and users have reported similar things on the mailing list - which suggests maybe it's not just bad luck that I've had...

As a random test I compiled BananaBread, and it's 8% larger with LTO.

From my (somewhat sloppy) tests with my emulators (https://floooh.github.io/tiny8bit/):

  • WASM size increase about 10%
  • no "noticeable" perf increase

(just "eyeballing" the time-spent-in-emulator in the top-right corner: https://floooh.github.io/tiny8bit/kc85-ui.html?type=kc85_4, on Chrome because Chrome doesn't round to milliseconds - but still that's not exactly "proper benchmarking" of course).

Anyway:

I would expect that size increases because LTO has more inlining opportunities (OTOH it might also do better dead-code elimination). The 10% size increase would be alright IMHO if there were a performance benefit to counter the increased size.

In the past (at least with asm.js) I saw a performance benefit for inlined code versus function call overhead, and that's where LTO came in handy.

My guess is that this clear performance benefit for inlining vs not inlining doesn't exist anymore.

Maybe current WASM engines have less function call overhead, or (more likely I guess) they are doing such optimizations (e.g. inlining) on the fly while compiling the WASM?

PS: --llvm-lto 1 vs 3 didn't make much of a difference on the emulator code base.

Oh, about performance, one big change compared to the asm.js days is that after we added wasm support we added the binaryen optimizer, which can inline on the wasm directly (so even if LLVM LTO is not run). So I would expect LLVM LTO's speed benefit to be somewhat lower because we get some of that benefit anyhow (in both fastcomp and upstream, and for a few years now). However, LLVM LTO should be doing more than inlining, so even the performance side of things is still puzzling to me.

(edit: To my knowledge no wasm VM inlines, which is why we added it to binaryen.)

so even the performance side of things is still puzzling to me

I think we need to look at more different code bases there. The "hot path" in my emulator is all quite simple "bit-twiddling" code with few optimization opportunities (except for one giant switch-case statement in a single function in the CPU emulator, and that should be optimized the same with or without LTO).

Inlining, on the other hand, does have some benefit in that code base, because an emulated system calls out into various per-chip tick functions; theoretically the entire tick function for a complete emulated system could be merged into a single big function, and removing the function call overhead might help there.

Binaryen doing the inlining explains to me why I'm not seeing any benefits anymore from LTO, because apart from inlining, there isn't much that optimizer passes can do here (it's the opposite of a complex C++ code base, where the optimizer needs to work hard to resolve all the "zero-cost abstractions").

PS: another aspect which reduces LTO effectiveness on my emulator code base is that I shuffled the code structure around a bit. Large parts of the emulators are now implemented in STB-style header-only libs, and the header implementations for most of the emulator are now included in a single source file. That way I'm getting LTO-like effects (e.g. "cross-function optimizations") even without enabling LTO, because the compiler sees a handful of big compilation units instead of many small ones.

...taking all that into account it's rather surprising though that I'm seeing that 10% size increase when enabling LTO ;)

Thought I'd share some metrics of my codebase, sorry if it's not very helpful.

Taisei Project (coroutines branch)

-Os -flto --llvm-lto 1: 3321659 bytes (3.2 MB)
-Os: 3481082 bytes (3.4 MB)
-O3 -flto --llvm-lto 1: 4322883 bytes (4.2 MB)
-O3: 3756842 bytes (3.6 MB)

All dependencies have been built with the same flags (cglm, freetype, libpng, libwebp, libogg, libopus, libopusfile, sdl2, sdl2_mixer, zlib).

So LTO actually still helps a little bit with -Os in my case. I can't really speak about performance, though.

I am currently doing a benchmark of a huge C++ code base I am working on (a project based on the OpenCascade geometric kernel). I'll provide numbers (and hopefully publish the project once ready), but just to give you an idea of the potential issue so far:

O3-lto: size increase is almost +30%, but perf is more or less the same (maybe 1/2% better)
Oz-lto: size increase is 5%, but perf is more or less the same (maybe 1/2% better)

I guess it means LTO is inlining quite a lot at O3 (which is not surprising given the number of object files in the project)... but in the end, this has no effect on performance. Most probably because the optimized paths are only "cold" ones. "Hot paths" are already well optimized by the regular compiler, as they are math-intensive or tightly coupled complex code already in single files.

It could probably be interesting to have selective LTO (whitelist/blacklist), so it avoids applying LTO to unneeded stuff. But maybe such selective LTO is too tricky to implement :)
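Something close to selective LTO can already be approximated, since -flto is a per-object decision: only objects compiled with -flto carry bitcode and participate in LTO at link time (a sketch; file names are hypothetical, and whether mixing pays off depends on the toolchain):

```shell
# Hot, already well-optimized code: no -flto, stays a regular wasm object.
emcc -O3 -c hot_kernel.cpp -o hot_kernel.o

# Cold/glue code: -flto, participates in cross-module optimization.
emcc -Os -flto -c glue.cpp -o glue.o

# At link time, LTO only runs across the bitcode objects.
emcc -Os -flto --llvm-lto 1 hot_kernel.o glue.o -o out.js
```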

Here are the numbers, on a sample program that loads a ~6MB STEP file (CAD/BREP geometry) and tessellates it using fixed parameters:

Os

  • Wasm Size: 6508 KB
  • Time: 8682ms (NB: the lower, the better)

Os+LTO

  • Wasm Size: 7434 KB (+14% compared to Os)
  • Time: 9542ms (+10% compared to Os)

O3

  • Wasm Size: 14978 KB
  • Time: 6069ms

O3+LTO

  • Wasm Size: 18427 KB (+23% compared to O3)
  • Time: 10188ms (+68% compared to O3)

Compilation was done with Emscripten 1.39.7 without particular additional flags (except -fno-exceptions). No Asyncify either.

As you can see, LTO really gives worse results both in terms of size AND in terms of performance. While the size increase is tolerable (+14% for Os and +23% for O3), the performance drop is very surprising: +10% time with Os and a massive +68% time with O3.

Such a big performance drop in the O3+LTO case is a bit suspicious to me (I need to re-check that I did not make a mistake in the compilation flags). Or it just means LTO is definitely NOT good for this particular program/benchmark: mostly parsing, data structure traversal (the BREP), analytic maths (tessellation of restricted NURBS and other kinds of surfaces), not real-time, and probably already well-optimized code.

Hopefully I'll have time to release that stuff for people to investigate on this, but that's not ready yet :(

All I can say is that the difference used to be smaller in earlier Emscripten versions (fastcomp). Unfortunately, I can no longer check this easily, as my program would have to be adapted again for earlier versions of Emscripten (API changes and such).

Note also the comparison between Os and O3:

O3 is 2.3x bigger than Os in terms of Wasm size, but only about 45% faster.

Difficult to arbitrate which is really better: a 6.5MB Wasm binary, or a 15MB one with ~45% better performance. 15MB, that's huge to start a Web page with :)

Very interesting, thanks @gabrielcuvillier ! We should really examine that slowdown when you can provide builds (preferably with --profiling), looks quite bad.

Any chance you can build the project natively as well, also with and without LTO, and compare to that?

I did a bunch of tests on the emscripten benchmark suite. Overall, most benchmarks neither benefit nor suffer much from LTO, though many gain a little size. For example zlib is 1% smaller, lzma 4% larger.

But there are several interesting things:

  • skinning is 25% faster. It is also 1% smaller. (That 1% improvement goes away if we don't run the binaryen optimizer, fwiw, but it's pretty small anyhow). A similar case is matrix_multiply which is 20% faster.
  • fasta_float is 10% faster and 18% smaller.
  • lua is 47% larger (!). It's also 3% faster, which is nice, but it's doubtful that's worth it given the size...
  • fannkuch is 70% SLOWER. That's very weird, but I double-checked.

The last two extreme cases, lua and fannkuch, both only happen with -O3 - when building with -Os the bad effects (and the 3% speedup in lua) go away. So the obvious answer of inlining seems very likely here, but why it hurts fannkuch is less obvious - perhaps it makes a function big enough to hit a bad case in a wasm VM. I'll investigate that more.

Just to add another example: when I was compiling with fastcomp, LTO reduced the size by ~30% (~4MB -> ~3MB), and I didn't notice any performance change (I didn't actually profile this).

However, my general understanding of LTO is that it often increases code size significantly, I believe due to inlining. So I don't really think the point of LTO is size, but speed.

Comparing to fastcomp, it's true LTO sometimes hurts a little in upstream, but that happens in fastcomp too - maybe only slightly less often.

But upstream either with or without LTO is smaller than fastcomp on all benchmarks (and usually faster too). LTO on fastcomp helps it get closer to upstream, but not match it, on several tests.

The large lua code size change happens in fastcomp too, as does the fannkuch speed regression - so nothing really stands out as a regression in upstream's LTO. Everything looks good.

Overall, LTO is sometimes a noticeable improvement, but usually upstream without LTO is about as good. The one big risk with LTO is larger size due to inlining, but compiling with -Os avoids that.

