The WASM code emitted for unicode.init is ridiculously large and inefficient for what it does (22446 instructions as of this writing). This is undoubtedly due to unicode/tables. There are a ton of blocks, a big table switch, and a bunch of get global + const + store + set global. This should really be fixed IMO.
Ideally, this would leverage the data section. While I have not looked at the ssa specifically, I understand that the WASM compiler is likely just handling it generically as would be expected. There are a few options here:
Not sure the best solution here, but any improvement to WASM startup time and WASM binary size would be welcome. Selfishly, I tried to compile unicode.init to JVM in my non-web WASM impl and exceeded the JVM method size limit.
/cc @neelance
> There are a ton of blocks, a big table switch, and a bunch of get global + const + store + set global.
What you see is a workaround for https://github.com/WebAssembly/design/issues/796#issuecomment-401310366. If you want to see this improved, please consider reaching out to the WebAssembly team to get them to take this issue seriously.
If you mean just the part I said about blocks, then I think it's not a good workaround for inits. Maybe the ssa compiler doesn't give you a choice, but the init function should delegate to functions representing other inits. But that doesn't address size much.
Regardless, my concerns about the tens of thousands of instructions should not be dismissed with a link to wasm's block design. That Go chose to eagerly instantiate dozens of publicly visible yet possibly unused structs for string usage is annoying, and other runtimes do it too, though they often wait for explicit charset use first. But aside from that, surely the algorithm could be smarter and use, say, the stack instead of updating, setting, then re-getting the same global.
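To illustrate what I mean (a hand-written sketch, not actual compiler output; the instruction operands and offsets here are invented), the repeated global round-trips could collapse into one read plus a local or stack reuse:

```wat
;; Illustrative only: the emitted pattern re-reads the same global around
;; every store, roughly:
;;   get_global 2        ;; read
;;   i32.const 8
;;   i32.sub
;;   set_global 2        ;; write back
;;   get_global 2        ;; read again
;;   i64.const 1234
;;   i64.store
;; Reading once and keeping the value on the operand stack / in a local
;; avoids the re-get:
(func $init_fragment
  (local $p i32)
  get_global 2
  i32.const 8           ;; illustrative decrement
  i32.sub
  tee_local $p          ;; keep the new value in a local AND on the stack
  set_global 2
  get_local $p          ;; reuse the local instead of re-reading the global
  i64.const 1234        ;; illustrative value
  i64.store)
```

This is just a sketch of the shape of the savings, not a claim about what the SSA backend can easily emit.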
Surely, if possible, the best approach to building structs from all-constant expressions would be to use the data section and load the fixed struct memory representation directly, especially for slices of these. Though I know recognizing the cases where this optimization applies is extra work, using data sections and doing memory init with fewer instructions would be ideal.
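As a sketch (the offset and bytes here are made up for illustration), the fixed table image would become a data segment that the runtime copies into memory at instantiation, instead of thousands of executed store instructions:

```wat
;; Illustrative sketch: the fully-initialized memory image of a constant
;; table, emitted as a data segment. Offset and bytes are hypothetical.
(module
  (memory 1)
  (data (i32.const 65536)                   ;; hypothetical table address
    "\10\00\01\00\20\00\01\00\2d\00\01\00"  ;; baked struct/slice bytes
  )
)
;; At instantiation the bytes land in memory with zero executed insns;
;; the init code would only need to point vars/slices at this region.
```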
There has to be some way to reduce the startup function instruction count. Even if you were right that it's wasm's fault for its block design, and it did have arbitrary jumps, you'd just trade block+end for a label and still have a large jump table. And you'd still have the tons of other instructions completely unrelated to your link. I understand that you had to use jump tables to have suspension points for your resumable coroutines, but that's orthogonal to what I'm talking about: the other costs of memory and package initialization.
Sure, there are probably quite a few optimizations specific to WebAssembly that we should add, but I think they should not require very deep changes. This is because the reasoning of the comment I linked above also applies to other issues: The SSA that the Go compiler generates works fine for all other architectures that Go targets. If WebAssembly requires that the Go compiler now has to do something entirely different, then this is primarily a limitation of WebAssembly, not of the Go compiler.
I think it would be most helpful if you could make some very concrete suggestions, e.g. pick some portion of the generated wasm code and show how we could simplify it. Then we can talk about the feasibility of such simplification.
Ok, I will do more in-depth analysis of the ssa insns, wasm insns, and make specific concrete suggestions. I will update when I have more.
Ok, analysis done...
For the following Go code:

```go
package main

import (
	"fmt"
)

func main() {
	fmt.Println("Hello, World!")
}
```

Here are some quick notes in bulleted form (note: a lot of them are irrelevant or you surely already know them, but maybe interesting to an outsider):
- The hello-world WASM file is ~2420KB (using `println` w/ no `fmt` import is a ~1200KB WASM file)
- `unicode.init` is the largest function at 22446 insns; the next largest is `fmt.__pp_.printValue` at 10864, then `runtime.gentraceback`, and it goes down from there
- In `unicode.init`, `get_global` is the most frequent insn at 3375 occurrences, followed by `i32.const` at 2471, `set_global` at 2024, `i64.store` at 2020, and so on.
- … the first are all `i64`s, the second are all `f64`s
- Of the `end`s, 896 are for `block`s, 894 for `if`s, and 1 for a `loop`. So suspension points and WASM's design aren't even the biggest burden.
- There is a `br_table` of 1345 cases (and 1 default case)
- The sequence `get_global(idx=2)`, `i32.const`, `i32.sub`, `set_global(idx=2)`, `get_global(idx=2)`, `i64.const`, `i64.store`, `i32.const`, `set_global(idx=2)` (moved to a separate common function and called w/ the consts as params) would save 3366 insns. Of course, this only made it clear to me which part was being repeated, and we saved even more because some consts don't change in this pattern at runtime (see suggestion below)
- I dumped the SSA by setting `GOSSAFUNC` to `init`, `GOOS` to `js`, and `GOARCH` to `wasm` and then running `go build ./` in `src/unicode` (for `tables.go`). As expected, there are plenty of coroutine markers
After a lot of trying different things, I have only one concrete suggestion right now:
Make a helper function (i.e. what the JVM might call a "synthetic" function) for the call-prep insns. Specifically, these 9 insns can be extracted to a helper function taking a single `i64` param, changing the 6th insn from a const to a `get_local 0` for the param; each occurrence is then invoked via 2 insns (push the const on the stack, do a call), saving 7 insns per use. In the hello world WASM, this pattern appeared 12176 times. Performing this replacement/extraction with some of my own tools, to a function called `runtime.$prepCall`, reduced file size from 2419913 bytes to 2261282 bytes, a saving of 158631 bytes or > 6%. It can be argued that inlining these 9 insns performs better than a function invocation, but I don't believe the costs outweigh the benefits (WASM implementation dependent, of course; my implementation uses a VM that benefits greatly from a shared "hot" function to JIT).
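In WAT form, the extraction looks roughly like this (a sketch only: the body mirrors the 9-instruction sequence listed in the notes above, the concrete const values are invented, and `runtime.$prepCall` is just the name my tooling used):

```wat
;; Shared helper: the 9 repeated insns, with the varying i64.const (the
;; 6th instruction) replaced by the parameter. The remaining consts are
;; sketched as fixed values per the pattern noted above.
(func $runtime.$prepCall (param $v i64)
  get_global 2
  i32.const 8        ;; illustrative const
  i32.sub
  set_global 2
  get_global 2
  get_local $v       ;; was: i64.const <varying value>
  i64.store
  i32.const 0        ;; illustrative const
  set_global 2)

;; Each former 9-insn occurrence becomes 2 insns:
;;   i64.const <value>
;;   call $runtime.$prepCall
```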
There are of course other possibilities whose benefits I could not easily validate:

- Run `unicode.init` at some kind of codegen time, grab the block of mem that it affects, put that in a data section, and change the vars to start at those pointers.

I did not do much analysis beyond the unicode init, so it's quite possible there are lots of savings elsewhere in more complex code.
Thanks for your investigation.
> reduced file size from 2419913 bytes to 2261282 bytes, so saved 158631 bytes
Could you please also compare the sizes of the gzipped versions of those two? And please also compare the startup time on the latest Firefox Developer Edition, since this is currently the best WebAssembly runtime and thus a good benchmark.
I was going to use it for non-web use. I wasn't really wanting to do the earlier investigation steps, or these next steps, or the ones following. I'll understand if y'all close the issue as not having enough evidence of improvement. I'm thinking the 6% saved probably won't show much improvement in download size or startup time for web browsers. I actually think my one concrete suggestion is not enough, and the sheer size of the payload has many angles to tackle it from. But at this point, I accept that this issue doesn't have enough real improvement or suggestions.
It is good that you are thinking about improvements. But WebAssembly and its runtimes are still in an early stage. For example Firefox Developer Edition is able to load a large WebAssembly binary much more quickly than current Chrome. This is evidence that there are techniques that WebAssembly runtimes can and should apply. Any optimizations of the WebAssembly binary itself should be benchmarked against a sufficiently mature WebAssembly runtime. Otherwise we may optimize for the wrong things.
This being said, the goto/jmp limitation is really the most severe one right now. It is also a blocker for some other optimizations that I have in mind. I would really appreciate it if we could somehow get that ball rolling. I'm not yet sure how...
@ianlancetaylor I think we can close this.
I toyed around w/ some ideas at https://github.com/cretz/go-wasm-bake. Still nothing concrete to suggest to the compiler.