I was porting some frontend Go code to be compiled to WebAssembly instead of GopherJS, and noticed a significant performance drop. The Go code in question makes a lot of DOM manipulation calls and queries, so I decided to benchmark the performance of making calls from WebAssembly to the JavaScript APIs via syscall/js.
I found it's approximately 10x slower than native JavaScript.
Results of running a benchmark in Chrome 75.0.3770.80 on macOS 10.14.5:
131.212518 ms/op - WebAssembly via syscall/js
61.850000 ms/op - GopherJS via syscall/js
12.040000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
11.320000 ms/op - native JavaScript
Here's the benchmark code I used, written to be self-contained:
Source Code
package main

import (
	"fmt"
	"runtime"
	"syscall/js"
	"testing"
	"time"

	"honnef.co/go/js/dom/v2"
)

var document = dom.GetWindow().Document().(dom.HTMLDocument)

func main() {
	loaded := make(chan struct{})
	switch readyState := document.ReadyState(); readyState {
	case "loading":
		document.AddEventListener("DOMContentLoaded", false, func(dom.Event) { close(loaded) })
	case "interactive", "complete":
		close(loaded)
	default:
		panic(fmt.Errorf("internal error: unexpected document.ReadyState value: %v", readyState))
	}
	<-loaded

	for i := 0; i < 10000; i++ {
		div := document.CreateElement("div")
		div.SetInnerHTML(fmt.Sprintf("foo <strong>bar</strong> baz %d", i))
		document.Body().AppendChild(div)
	}
	time.Sleep(time.Second)

	runBench(BenchmarkGoSyscallJS, WasmOrGJS+" via syscall/js")
	if runtime.GOARCH == "js" { // GopherJS-only benchmark.
		runBench(BenchmarkGoGopherJS, "GopherJS via github.com/gopherjs/gopherjs/js")
	}
	runBench(BenchmarkNativeJavaScript, "native JavaScript")

	document.Body().Style().SetProperty("background-color", "lightgreen", "")
}

func runBench(f func(*testing.B), desc string) {
	r := testing.Benchmark(f)
	msPerOp := float64(r.T) * 1e-6 / float64(r.N)
	fmt.Printf("%f ms/op - %s\n", msPerOp, desc)
}

func BenchmarkGoSyscallJS(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = 0
		divs := js.Global().Get("document").Call("getElementsByTagName", "div")
		for j := 0; j < divs.Length(); j++ {
			total += divs.Index(j).Call("getBoundingClientRect").Get("top").Float()
		}
	}
	_ = total
}

func BenchmarkNativeJavaScript(b *testing.B) {
	js.Global().Set("NativeJavaScript", js.Global().Call("eval", nativeJavaScript))
	b.ResetTimer()
	js.Global().Get("NativeJavaScript").Invoke(b.N)
}

const nativeJavaScript = `(function(N) {
	var i, j, total;
	for (i = 0; i < N; i++) {
		total = 0;
		var divs = document.getElementsByTagName("div");
		for (j = 0; j < divs.length; j++) {
			total += divs[j].getBoundingClientRect().top;
		}
	}
	var _ = total;
})`
// +build wasm

package main

import "testing"

const WasmOrGJS = "WebAssembly"

func BenchmarkGoGopherJS(b *testing.B) {}
// +build !wasm

package main

import (
	"testing"

	"github.com/gopherjs/gopherjs/js"
)

const WasmOrGJS = "GopherJS"

func BenchmarkGoGopherJS(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = 0
		divs := js.Global.Get("document").Call("getElementsByTagName", "div")
		for j := 0; j < divs.Length(); j++ {
			total += divs.Index(j).Call("getBoundingClientRect").Get("top").Float()
		}
	}
	_ = total
}
I know syscall/js is documented as "Its current scope is only to allow tests to run, but not yet to provide a comprehensive API for users", but I wanted to open this issue to discuss the future. Performance is important for Go applications that need to make a lot of calls into the JavaScript world.
What is the current state of syscall/js performance, and are there known opportunities to improve it?
/cc @neelance @cherrymui @hajimehoshi
It would also be good to benchmark with Firefox and see the results.
IIUC, you are just benchmarking DOM manipulation. And since DOM manipulation happens outside wasm anyway, this is really just measuring the price of the context jump from wasm land to browser land and back. In that case, I wonder if it is even within the control of syscall/js, rather than the underlying wasm engine.
It would also be good to benchmark equivalent code using Rust and C and compare the results. I think that may be a better apples-to-apples comparison of syscall/js performance against other languages.
As @agnivade said, probably worth trying Firefox. V8 is known to have some performance problems with the Wasm code generated by the Go compiler.
> It would also be good to benchmark with Firefox and see the results.
Agreed. I'll do this later and share results.
> IIUC, you are just benchmarking DOM manipulation. And since DOM manipulation happens outside wasm anyway, this is really just measuring the price of the context jump from wasm land to browser land and back. In that case, I wonder if it is even within the control of syscall/js, rather than the underlying wasm engine.
Yes. When I said syscall/js, I meant the entire performance cost of jumping from Wasm to the browser APIs and back. It's what the user sees when they use the API to interact with the JavaScript world.
> It would also be good to benchmark equivalent code using Rust and C and compare the results. I think that may be a better apples-to-apples comparison of syscall/js performance against other languages.
Agreed, that would be good and more representative of the actual WebAssembly <-> JS call overhead. Doing that would give us more information. I won't have a chance to do this myself, but if someone else can, it'd be helpful.
Perhaps it's not worth doing anything substantial here before something like WASI is standardized. @neelance even did a WIP implementation at https://github.com/golang/go/issues/31105.
I've tried the benchmark again with recent development versions of 3 browsers:
Chrome Canary
Version 77.0.3824.0 (Official Build) canary (64-bit)
114.154496 ms/op - WebAssembly via syscall/js
63.350000 ms/op - GopherJS via syscall/js
11.740000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
11.360000 ms/op - native JavaScript
Firefox Nightly
69.0a1 (2019-06-13) (64-bit)
94.150003 ms/op - WebAssembly via syscall/js
85.300000 ms/op - GopherJS via syscall/js
7.695000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
7.405000 ms/op - native JavaScript
Safari Technology Preview
Release 85 (Safari 13.0, WebKit 14608.1.28.1)
57.249996 ms/op - WebAssembly via syscall/js
42.866666 ms/op - GopherJS via syscall/js
5.536666 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
5.073333 ms/op - native JavaScript
The results are pretty consistent across the 3 browsers in that doing lots of DOM queries via WebAssembly was about 10x slower than with pure JavaScript.
Could you share the code you used to run the benchmark and output these values (ms/op)?
Thanks for the tests @dmitshur. I would have thought that after https://hacks.mozilla.org/2018/10/calls-between-javascript-and-webassembly-are-finally-fast-%F0%9F%8E%89/, the DOM access overhead in Firefox would have been reduced. And it's interesting that Safari is much faster for DOM access than Firefox.
The tests with Rust/C should give us a better idea of what exactly can be improved on the Go side. If anybody can post results for that, that'll be great.
@hajimehoshi Sure. I've updated the source code in the original post.
Change https://golang.org/cl/183457 mentions this issue: runtime,syscall/js: reuse wasm memory DataView
@martisch suggested that I add a "real-world" example that demonstrates the performance hit of WebAssembly compared to running natively. A good example is the "gophers" demo from Gio (gioui.org). With modules enabled and using Go 1.13 (tip), you can build and run the demo with the following commands:
$ export GO111MODULE=on
$ go run gioui.org/cmd/gio -target js gioui.org/apps/gophers -stats # for building gophers
$ go run github.com/shurcooL/goexec 'http.ListenAndServe(":8080", http.FileServer(http.Dir("gophers")))' # for serving gophers on localhost:8080
Then, open http://localhost:8080 in a browser. The target frame time is ~16.7ms (60 Hz), but on my MacBook Pro it almost never hits the target.
Running the example natively,
$ go run gioui.org/apps/gophers -stats
it easily hits the 60 Hz target.
In both Chrome and Firefox the built-in profiler is a great way to see what takes up the time. I've attached a screenshot of a single frame from Chrome's "Performance" tab. The frame time is 24ms.

Unfortunately, the function names are all mangled ("wasm-function[...]"). (Fixed by not passing -w -s to ldflags.)
CC @cherrymui who recently optimized wasm.
Thanks @eliasnaur. I have sent CL 183457, which should alleviate the DOM overhead to some extent. Would you be able to try it and check if it helps at all? Just a note: the CL only optimizes the DOM call overhead, so if your app is heavy on computation in wasm land itself, it might not help very much.
Regarding profiles, yes, the Chrome profiler is a great tool. The wasm-function[] names are indeed a bother (see the bug I filed with the Chrome team). Until that's fixed, may I take this chance to suggest wasmbrowsertest? It was mentioned in @johanbrandhorst's GopherCon talk. With it, you can take CPU profiles for wasm natively, just as you would for amd64. It automagically converts wasm-function indices to their proper names, and you can analyze the profiles directly with go tool pprof 🙂. That should give you better insight into what's going on in your app, and you can see whether there is a possibility to optimize hot functions.
Thanks @agnivade, wasmbrowsertest is definitely useful for running benchmarks and standalone tests on wasm. However, full drawing and rendering to a window doesn't lend itself to that model yet.
Fortunately, I figured out how to bring back the function names: the gioui.org/cmd/gio command passed -ldflags=-w -s, which as a side effect stripped the function names from browser debuggers. I've removed the flags, which didn't save much space anyway.
Finally, I updated my comment to add the -stats flag that enables profiling without Ctrl-P.
Re: function names, it is a cold-cache phenomenon as far as I understand. On the first load, the names come up as wasm-function[]; on subsequent reloads, the proper names show up. It is hard to reproduce, though. See the bug I filed.
Anyway, I see some syscall/js.ValueCall in the profile, so my CL should™ be able to help. Feel free to give it a try whenever you have a chance.
I tested with your CL 183457, which seems to help: the frame times are lower and more consistent. This is an example of a 17ms frame (the above profile had frame times above 20ms):

However, the CPU usage still seems too high. According to the profile, almost 10ms of CPU time is spent building the vector shape for the frame timer in the top right corner. The text layout code is definitely CPU heavy and unoptimized, but 10ms seems excessive.
To verify the profile, I cut out the rendering of the statistics label and redid the profile:

Firefox also misses the frame target:

It looks like CPU-heavy code is faster in Firefox, whereas DOM calls are slower. Perhaps the DOM calls are only slower because Firefox's WebGL implementation is slower.
In summary, it looks like the demo is CPU bound, supporting the claim that Go generates inefficient WebAssembly code.
I'll work on preparing a benchmark that can run in wasmbrowsertest and that skips all rendering/DOM calls.
Great stuff! I think we are getting somewhere. Yes, the wasm code generation could use some love. I have a couple of CLs that apply some rewrite optimizations present on amd64 but absent on wasm; they should go in when the tree opens.
But it would be great if you could prepare a standalone benchmark. That would allow us to compare the generated code with amd64 and see if there are obvious places for improvement.
I split the UI update from its rendering and added a benchmark. To see the difference, I ran:
$ go test -bench . -count 8 -cpu 1 gioui.org/apps/gophers > native.bench
$ GOOS=js GOARCH=wasm go test -exec ~/go/bin/wasmbrowsertest -bench . -count 8 gioui.org/apps/gophers > wasm.bench
$ benchstat native.bench wasm.bench
name  old time/op  new time/op   delta
UI    14.9µs ± 1%  216.5µs ±22%  +1354.01%  (p=0.000 n=7+8)
So more than 10 times slower on wasm compared to native code, at least on my 2014 MBP.
I investigated the profiles and started looking at the GOSSAFUNC output of some hot functions. The amd64 code showed lots of (MUL/DIV)SS. The wasm code, however, showed something interesting: there were lots of F32DemoteF64 and F64PromoteF32 instructions in the generated code. For example:
v395 00419 (14) F32Load "".ctrl1+32(SP)
v395 00420 (14) F64PromoteF32
v395 00421 (14) F64Sub
v395 00422 (14) F64Const $(0.5)
And in fact, code like this was generated several times:
v403 00474 (213) I32WrapI64
v403 00475 (213) F32Load ""..autotmp_318-64(SP)
v403 00476 (213) F64PromoteF32
v403 00477 (213) F32DemoteF64
v403 00478 (213) F32Store $0
This means all 32-bit FP values are being promoted to 64 bits, operated on, and then demoted back to 32 bits before being written to memory.
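To make the pattern concrete, here is a hypothetical float32 function of my own (loosely modeled on the ctrl1 code in the listing above, not taken from the Gio source). On GOARCH=wasm, the subtraction against the 0.5 constant should lower to roughly the F32Load / F64PromoteF32 / F64Const / F64Sub / F32DemoteF64 shape shown:

```go
package main

import "fmt"

// step is pure float32 arithmetic; on the wasm backend each operand is
// promoted to 64 bits, the subtraction and multiplication run as 64-bit
// ops, and the result is demoted back to 32 bits before being stored.
func step(ctrl1, t float32) float32 {
	return (ctrl1 - 0.5) * t
}

func main() {
	fmt.Println(step(1.5, 2)) // (1.5-0.5)*2 = 2
}
```

Setting GOSSAFUNC=step while building with GOOS=js GOARCH=wasm dumps the SSA and generated code for a sketch like this, which is a quick way to inspect the promote/demote pairs.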
A quick look into WasmOps.go revealed that the 32-bit FP instructions were missing, and then I understood why: all the FP registers (F0-F15) are treated as 64-bit registers.
Now here is where my speculation begins. Since Go's SSA works only with registers, these virtual registers were created for SSA to work with. But in the generated code, all references to registers are rewritten to local.(get|set|tee). So theoretically it should be possible to construct another set of 32-bit registers and add 32-bit FP instructions that deal only with them, avoiding this 32-64 round trip.
@neelance / @cherrymui - Is this analysis correct? If so, how would you recommend extending the F0-F15 register set to include 32-bit registers too? I have a local CL where I have already added the 32-bit instructions; now I just need to make these local.(get|set|tee) work with 32-bit values.
The F64PromoteF32+F32DemoteF64 combination should only happen if rounding to 32 bits is actually necessary. In many cases the Go spec allows using 64-bit precision for float32 values.
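For illustration (my own example, not from the thread's code): the rounding is observable whenever a float32 result is stored or compared, which is exactly when the demote cannot be dropped. The spacing between adjacent float32 values near 1e8 is 8, so adding 1 must be lost after rounding:

```go
package main

import "fmt"

func main() {
	// 1e8 is exactly representable in float32, but 1e8+1 is not:
	// adjacent float32 values there are 8 apart, so the sum must
	// round back to 1e8 when computed at 32-bit precision.
	var a float32 = 1e8
	fmt.Println(a+1 == a) // true: the +1 is lost to 32-bit rounding

	// The same addition is exact at 64-bit precision.
	var b float64 = 1e8
	fmt.Println(b+1 == b) // false
}
```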
Yes, it is possible to add registers for 32-bit floats, but I'm not sure how much this would affect performance, because I suspect that CPUs are not faster on 32-bit floats than on 64-bit floats (I might be wrong).
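One rough way to test that guess (a sketch of mine, with illustrative function names) is a pair of benchmarks driven from main, the same way the DOM benchmarks earlier in this thread are, so it can run both natively and under wasmbrowsertest:

```go
package main

import (
	"fmt"
	"testing"
)

// benchFloat32 and benchFloat64 run the same multiply-add recurrence at
// the two precisions; comparing their ns/op on amd64 vs wasm would show
// whether dedicated 32-bit FP handling is worth pursuing.
func benchFloat32(b *testing.B) {
	var total float32
	for i := 0; i < b.N; i++ {
		total = total*0.999 + float32(i)
	}
	_ = total
}

func benchFloat64(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = total*0.999 + float64(i)
	}
	_ = total
}

func main() {
	fmt.Println("float32:", testing.Benchmark(benchFloat32))
	fmt.Println("float64:", testing.Benchmark(benchFloat64))
}
```

The numbers would of course be machine- and browser-dependent, so only the float32/float64 ratio on each target is meaningful.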
> The F64PromoteF32+F32DemoteF64 combination should only happen if rounding to 32 bits is actually necessary.
I think you are referring to this:

case ssa.OpWasmLoweredRound32F:
	getValue64(s, v.Args[0])
	s.Prog(wasm.AF32DemoteF64)
	s.Prog(wasm.AF64PromoteF32)
I actually found another code path, in case ssa.OpWasmF32Store, where getValue64 generates an F64PromoteF32 and then, because of the if v.Op == ssa.OpWasmF32Store check, another AF32DemoteF64 gets added. I did not look much deeper into it, though.
> Yes, it is possible to add registers for 32-bit floats, but I'm not sure how much this would affect performance,
Sure, if there is no perf boost, then there is no use. But I would like to try it and check the benchmarks. What is the right way to add 32-bit registers? Just add F16-F32? Or is there another way?
Is the F64PromoteF32 the one emitted by case ssa.OpLoadReg: of ssaGenValueOnStack? If yes, then this is indeed something we could optimize.
> What is the right way to add 32-bit registers? Just add F16-F32? Or is there another way?
This is not easy to describe in a few words...
@agnivade Does your CL contain fixes for the 32-bit integer instructions? It also amazed me to see such 32-64-32 int/fp conversions.
I have not sent any CL yet. And no, I have not looked into the 32-bit integer instructions.