Go: wasm: improve Efficiency on Init and Constants

Created on 26 Jul 2018 · 11Comments · Source: golang/go

The WASM code emitted for unicode.init is ridiculously large and inefficient for what it does (22446 instructions as of this writing). This is undoubtedly due to unicode/tables. There are a ton of blocks, a big table switch, and a bunch of get global + const + store + set global. This should really be fixed IMO.

Ideally, this would leverage the data section. While I have not looked at the ssa specifically, I understand that the WASM compiler is likely just handling it generically as would be expected. There are a few options here:

Change the WASM code generator to, when it recognizes a large variable assignment of many literals to use a local first to build the entire assignment, and then put the local in memory or in the global. EDIT: After some thought, this is not how WASM works, but surely can do all the const+store in mem and local ptr val or even better a stack val to keep re-setting instead of a bunch of set global, get global.
Have maketables build a version specific to the wasm arch (and make the other !wasm) that builds those range tables out of a large init-local embedded byte array.
Change how unicode/tables.go works for all archs to use a more packed format at compile time that doesn't require so much struct/slice creation and have an as-fast-as-today runtime lookup (e.g. inlined R16() might return pseudo-pointer to Range16 arr, and then can be indexed and return Lo on request). I think this could even be an opt-in build tag and could result in faster startup time (not that Go's startup time is slow or anything) Breaks backwards compat guarantees, forgot all of these vars are exposed

Not sure the best solution here, but any improvement would be welcome to WASM startup time and WASM binary size. Selfishly, I tried to compile unicode.init to JVM in my non-web WASM impl and exceeded the JVM method size limit.

Arch-Wasm FrozenDueToAge NeedsInvestigation

Source

cretz

👍1

Most helpful comment

Ok, analysis done...

Expand to see some detailed notes

For the following Go code:

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Hello, World!")
}

Here are some quick notes in bulleted form (note, a lot of them are irrelevant or you surely already know them, but interesting maybe to an outsider):

This compiles to a ~2300KB WASM file (a simple println w/ no fmt import is a ~1200KB WASM file)
There is currently no way to compile and/or export a function like a library, only main
There are two exports, the run func and the mem instance.
There are exactly 1600 functions
The largest function in terms of insn size is unicode.init at 22446 insns, the next largest is fmt.__pp_.printValue at 10864, and the next largest is runtime.gentraceback and it goes down from there
Of those insns in unicode.init, get_global is the most frequently called at 3375 times, followed by i32.const at 2471, set_global at 2024, i64.store at 2020, and so on.
The func has 32 locals it uses, the first half are all i64s, the second are all f64s
There are 1791 ends, 896 for blocks, 894 for ifs, and 1 for loop. So suspension points and WASMs design aren't even the biggest burden.
The start of the func is the loop, followed by all the blocks, followed by a br_table of 1345 cases (and 1 default case)
I have a new algo for func splitting that finds the section of code that, when broken out to a new func and consts made as params, will save the most insns. It found that the best set was a set of 9 insns that are repeated 675 times in the function: get_global(idx=2), i32.const, i32.sub, set_global(idx=2), get_global(idx=2), i64.const, i64.store, i32.const, set_global(idx=2) (moved to this separate common function and called w/ the consts as params would save 3366 insns). Of course, this only made it clear to me which part was being repeated and we saved even more because some consts don't change in this pattern at runtime (see suggestion below)
I also dumped the ssa form of unicode's init area by setting env vars GOSSAFUNC to init, GOOS to js, and GOARCH to wasm and then running go build ./ in src/unicode.
After last opt pass, it was ~13100 ssa insns for tables.go. As expected, there are plenty of coroutine markers
There are several "get R0" and "set R0" which are presumably getting and setting a register which happens to be at wasm global index 2

After a lot of trying different things, I have only one concrete suggestion right now:

Make a helper function (i.e. what the JVM might call a "synthetic" function) for the call-prep insns. Specifically, these 9 insns can be extracted to a helper function taking a single i64 param, change the 6th from a const to a get local of 0 for the param, and it is invoked via 2 insns (push the const on the stack, do a call) saving 7 insns each use. In the hello world WASM, this pattern was invoked 12176 times. Performing this replacement/extraction using some of my own tools to a function called runtime.$prepCall reduced file size from 2419913 bytes to 2261282 bytes, so saved 158631 bytes or > 6%. It can be argued that inlining these 9 insns has performance improvement over function invocation, but I don't believe the costs outweigh the benefits (WASM implementation dependent of course...my implementation uses a VM that benefits greatly from a shared "hot" function to JIT).

Expand to see some other possibilities I considered

There are of course other possibilities that I could not easily validate their benefits:

Write a de-dupe algorithm. Though in practice I have found this doesn't help too much.
Not sure about current code details, but maybe have different params for when Go should inline on WASM.
Do not combine init calls into one init. This doesn't really help much except for single-function size, overall size is still the same (may help a bit with source maps guaranteeing no funcs span multiple files, but I doubt it)
Create an SSA optimization pass that determines slice+struct literal creation from only constant expressions all the way down. Then generate the struct/slice memory interpretation from that at compile time and embed in the data section. Ideally use a fixed pointer offset in the data section for the var but if this cannot be done, even looping over that fixed place in mem to a more dynamic one could have benefits. Bulk mem operations are coming to WASM one day that may help with this.
For WASM arch only, reduce the suspension point count. Maybe don't do it in init for non-branching sections of code or something, not sure the best limit.
For inits, maybe determine starting fixed offset in mem/heap so it doesn't constantly have to be calculated at runtime. I suspect this would only have value if it was specifically for packages like unicode. But knowing at compile time the mem location could help.
Beg the Go authors to break backwards compat and remove all the exposed vars in the unicode package :-) Or write a very WASM-specific part of it, for instance a script that runs unicode.init at some kind of codegen time, grab the block of mem that it affects, puts that in a data section and change the vars to start at those pointers.

I did not do much analysis beyond the unicode init, so it's quite possible there are lots of savings elsewhere in more complex code.

cretz on 30 Jul 2018

👍4

All 11 comments

/cc @neelance

agnivade on 29 Jul 2018

There are a ton of blocks, a big table switch, and a bunch of get global + const + store + set global.

What you see is a workaround for https://github.com/WebAssembly/design/issues/796#issuecomment-401310366. If you want to see this improved, please consider reaching out to the WebAssembly team to get them to take this issue seriously.

neelance on 29 Jul 2018

If you mean just the part I said about blocks, then I think it's not a good workaround for inits. Maybe the ssa compiler doesn't give you a choice, but the init function should delegate to functions representing other inits. But that doesn't address size much.

Regardless, my concerns about the tens of thousands of instructions should not be dismissed with a link to wasm's block design. That Go choose to eagerly instantiate dozens of publicly visible yet possibly unused structs for string usage is annoying, and other runtimes do it to though they often wait for explicit charset use first. But aside from that, surely the algorithm could be smarter to use, say, the stack instead of updating then setting, then re-getting the same global.

Surely, if possible, the best approach to building structs from all constant expressions would be to use the data section and load the fixed struct memory representation into memory instead. Especially on slices of these. Though I know recognizing cases where this optimization may apply is extra work, using the data sections and doing memory init with fewer instructions would be ideal.

There has to be some way to reduce the startup function instruction count. Even if you were right and it's wasm's fault for their block design and they had arbitrary jumps, you'd just trade block+end with a label and still have a large jump table. And then you'd still have the tons of other instructions completely unrelated to your link. I understand that you had to use jump tables to have suspension points for your resumable coroutines, but that's orthogonal to what I'm talking about with the other costs of memory and package initialization.

cretz on 29 Jul 2018

Sure, there are probably quite a few optimizations specific to WebAssembly that we should add, but I think they should not require very deep changes. This is because the reasoning of the comment I linked above also applies to other issues: The SSA that the Go compiler generates works fine for all other architectures that Go targets. If WebAssembly requires that the Go compiler now has to do something entirely different, then this is primarily a limitation of WebAssembly, not of the Go compiler.

I think it would be most helpful if you could make some very concrete suggestions, e.g. pick some portion of the generated wasm code and show how we could simplify it. Then we can talk about the feasibility of such simplification.

neelance on 30 Jul 2018

Ok, I will do more in-depth analysis of the ssa insns, wasm insns, and make specific concrete suggestions. I will update when I have more.

cretz on 30 Jul 2018

Ok, analysis done...

Expand to see some detailed notes

For the following Go code:

package main

import (
    "fmt"
)

func main() {
    fmt.Println("Hello, World!")
}

Here are some quick notes in bulleted form (note, a lot of them are irrelevant or you surely already know them, but interesting maybe to an outsider):

This compiles to a ~2300KB WASM file (a simple println w/ no fmt import is a ~1200KB WASM file)
There is currently no way to compile and/or export a function like a library, only main
There are two exports, the run func and the mem instance.
There are exactly 1600 functions
The largest function in terms of insn size is unicode.init at 22446 insns, the next largest is fmt.__pp_.printValue at 10864, and the next largest is runtime.gentraceback and it goes down from there
Of those insns in unicode.init, get_global is the most frequently called at 3375 times, followed by i32.const at 2471, set_global at 2024, i64.store at 2020, and so on.
The func has 32 locals it uses, the first half are all i64s, the second are all f64s
There are 1791 ends, 896 for blocks, 894 for ifs, and 1 for loop. So suspension points and WASMs design aren't even the biggest burden.
The start of the func is the loop, followed by all the blocks, followed by a br_table of 1345 cases (and 1 default case)
I have a new algo for func splitting that finds the section of code that, when broken out to a new func and consts made as params, will save the most insns. It found that the best set was a set of 9 insns that are repeated 675 times in the function: get_global(idx=2), i32.const, i32.sub, set_global(idx=2), get_global(idx=2), i64.const, i64.store, i32.const, set_global(idx=2) (moved to this separate common function and called w/ the consts as params would save 3366 insns). Of course, this only made it clear to me which part was being repeated and we saved even more because some consts don't change in this pattern at runtime (see suggestion below)
I also dumped the ssa form of unicode's init area by setting env vars GOSSAFUNC to init, GOOS to js, and GOARCH to wasm and then running go build ./ in src/unicode.
After last opt pass, it was ~13100 ssa insns for tables.go. As expected, there are plenty of coroutine markers
There are several "get R0" and "set R0" which are presumably getting and setting a register which happens to be at wasm global index 2

After a lot of trying different things, I have only one concrete suggestion right now:

Expand to see some other possibilities I considered

There are of course other possibilities that I could not easily validate their benefits:

Write a de-dupe algorithm. Though in practice I have found this doesn't help too much.
Not sure about current code details, but maybe have different params for when Go should inline on WASM.
Do not combine init calls into one init. This doesn't really help much except for single-function size, overall size is still the same (may help a bit with source maps guaranteeing no funcs span multiple files, but I doubt it)
Create an SSA optimization pass that determines slice+struct literal creation from only constant expressions all the way down. Then generate the struct/slice memory interpretation from that at compile time and embed in the data section. Ideally use a fixed pointer offset in the data section for the var but if this cannot be done, even looping over that fixed place in mem to a more dynamic one could have benefits. Bulk mem operations are coming to WASM one day that may help with this.
For WASM arch only, reduce the suspension point count. Maybe don't do it in init for non-branching sections of code or something, not sure the best limit.
For inits, maybe determine starting fixed offset in mem/heap so it doesn't constantly have to be calculated at runtime. I suspect this would only have value if it was specifically for packages like unicode. But knowing at compile time the mem location could help.
Beg the Go authors to break backwards compat and remove all the exposed vars in the unicode package :-) Or write a very WASM-specific part of it, for instance a script that runs unicode.init at some kind of codegen time, grab the block of mem that it affects, puts that in a data section and change the vars to start at those pointers.

I did not do much analysis beyond the unicode init, so it's quite possible there are lots of savings elsewhere in more complex code.

cretz on 30 Jul 2018

👍4

Thanks for your investigation.

reduced file size from 2419913 bytes to 2261282 bytes, so saved 158631 bytes

Could you please also compare the sizes of the gzipped versions of those two? And please also compare the startup time on the latest Firefox Developer Edition, since this is currently the best WebAssembly runtime and thus a good benchmark.

neelance on 30 Jul 2018

I was going to use it for the non-web use. I wasn't really wanting to do the before steps of investigation, or these next steps, or the ones following. I'll understand if y'all close the issue as not enough evidence of improvement. I'm thinking the 6% saved probably won't show much improvement in download size or startup time for web browsers. I actually think my one concrete suggestion is not enough and the shear size of the payload has many angles to tackle from. But at this point, I accept that this issue doesn't have enough real improvement or suggestions.

cretz on 30 Jul 2018

It is good that you are thinking about improvements. But WebAssembly and its runtimes are still in an early stage. For example Firefox Developer Edition is able to load a large WebAssembly binary much more quickly than current Chrome. This is evidence that there are techniques that WebAssembly runtimes can and should apply. Any optimizations of the WebAssembly binary itself should be benchmarked against a sufficiently mature WebAssembly runtime. Otherwise we may optimize for the wrong things.

This being said, the goto/jmp limitation is really the severest right now. It is also a blocker for some other optimizations that I have in mind. I would really appreciate if we could somehow get that ball rolling. I'm not yet sure how...

neelance on 30 Jul 2018

👍2

@ianlancetaylor I think we can close this.

neelance on 18 Oct 2018

I toyed around w/ some ideas at https://github.com/cretz/go-wasm-bake. Still nothing concrete to suggest to the compiler.

cretz on 22 Oct 2018

❤1

Was this page helpful?

0 / 5 - 0 ratings