I was porting some frontend Go code to be compiled to WebAssembly instead of GopherJS, and noticed a significant performance drop. The Go code in question makes a lot of DOM manipulation calls and queries, so I decided to benchmark the performance of making calls from WebAssembly to the JavaScript APIs via syscall/js.
I found it's approximately 10x slower than native JavaScript.
Results of running a benchmark in Chrome 75.0.3770.80 on macOS 10.14.5:
131.212518 ms/op - WebAssembly via syscall/js
61.850000 ms/op - GopherJS via syscall/js
12.040000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
11.320000 ms/op - native JavaScript
Here's the benchmark code I used, written to be self-contained:
Source Code
package main

import (
	"fmt"
	"runtime"
	"syscall/js"
	"testing"
	"time"

	"honnef.co/go/js/dom/v2"
)

var document = dom.GetWindow().Document().(dom.HTMLDocument)

func main() {
	loaded := make(chan struct{})
	switch readyState := document.ReadyState(); readyState {
	case "loading":
		document.AddEventListener("DOMContentLoaded", false, func(dom.Event) { close(loaded) })
	case "interactive", "complete":
		close(loaded)
	default:
		panic(fmt.Errorf("internal error: unexpected document.ReadyState value: %v", readyState))
	}
	<-loaded

	for i := 0; i < 10000; i++ {
		div := document.CreateElement("div")
		div.SetInnerHTML(fmt.Sprintf("foo <strong>bar</strong> baz %d", i))
		document.Body().AppendChild(div)
	}
	time.Sleep(time.Second)

	runBench(BenchmarkGoSyscallJS, WasmOrGJS+" via syscall/js")
	if runtime.GOARCH == "js" { // GopherJS-only benchmark.
		runBench(BenchmarkGoGopherJS, "GopherJS via github.com/gopherjs/gopherjs/js")
	}
	runBench(BenchmarkNativeJavaScript, "native JavaScript")

	document.Body().Style().SetProperty("background-color", "lightgreen", "")
}

func runBench(f func(*testing.B), desc string) {
	r := testing.Benchmark(f)
	msPerOp := float64(r.T) * 1e-6 / float64(r.N)
	fmt.Printf("%f ms/op - %s\n", msPerOp, desc)
}

func BenchmarkGoSyscallJS(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = 0
		divs := js.Global().Get("document").Call("getElementsByTagName", "div")
		for j := 0; j < divs.Length(); j++ {
			total += divs.Index(j).Call("getBoundingClientRect").Get("top").Float()
		}
	}
	_ = total
}

func BenchmarkNativeJavaScript(b *testing.B) {
	js.Global().Set("NativeJavaScript", js.Global().Call("eval", nativeJavaScript))
	b.ResetTimer()
	js.Global().Get("NativeJavaScript").Invoke(b.N)
}

const nativeJavaScript = `(function(N) {
	var i, j, total;
	for (i = 0; i < N; i++) {
		total = 0;
		var divs = document.getElementsByTagName("div");
		for (j = 0; j < divs.length; j++) {
			total += divs[j].getBoundingClientRect().top;
		}
	}
	var _ = total;
})`
// +build wasm

package main

import "testing"

const WasmOrGJS = "WebAssembly"

func BenchmarkGoGopherJS(b *testing.B) {}
// +build !wasm

package main

import (
	"testing"

	"github.com/gopherjs/gopherjs/js"
)

const WasmOrGJS = "GopherJS"

func BenchmarkGoGopherJS(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = 0
		divs := js.Global.Get("document").Call("getElementsByTagName", "div")
		for j := 0; j < divs.Length(); j++ {
			total += divs.Index(j).Call("getBoundingClientRect").Get("top").Float()
		}
	}
	_ = total
}
I know syscall/js is documented as "Its current scope is only to allow tests to run, but not yet to provide a comprehensive API for users", but I wanted to open this issue to discuss the future. Performance is important for Go applications that need to make a lot of calls into the JavaScript world.
What is the current state of syscall/js performance, and are there known opportunities to improve it?
/cc @neelance @cherrymui @hajimehoshi
It would also be good to benchmark with Firefox and see the results.
IIUC, you are just benchmarking DOM manipulation. And since DOM manipulation happens outside wasm anyway, this is really just measuring the price of the context jump from wasm land to browser land and back. In that case, I wonder if it is even within the control of syscall/js, rather than the underlying wasm engine.
It would also be good to benchmark equivalent code using Rust and C and compare the results. I think that may be a better apples-to-apples comparison of syscall/js performance against other languages.
As @agnivade said, probably worth trying Firefox. V8 is known to have some performance problems with the Wasm code generated by the Go compiler.
> It would also be good to benchmark with Firefox and see the results.
Agreed. I'll do this later and share results.
> IIUC, you are just benchmarking DOM manipulation. And since DOM manipulation happens outside wasm anyway, this is really just measuring the price of the context jump from wasm land to browser land and back. In that case, I wonder if it is even within the control of syscall/js, rather than the underlying wasm engine.
Yes. When I said syscall/js, I meant the entire performance cost of jumping from Wasm to the browser APIs and back. It's what the user sees when they use the API to interact with the JavaScript world.
> It would also be good to benchmark equivalent code using Rust and C and compare the results. I think that may be a better apples-to-apples comparison of syscall/js performance against other languages.
Agreed, that would be good and more representative of the actual WebAssembly <-> JS call overhead. Doing that would give us more information. I won't have a chance to do this myself, but if someone else can, it'd be helpful.
Perhaps it's not worth doing anything substantial here before something like WASI is standardized. @neelance even did a WIP implementation at https://github.com/golang/go/issues/31105.
I've tried the benchmark again with recent development versions of 3 browsers:
Chrome Canary
Version 77.0.3824.0 (Official Build) canary (64-bit)
114.154496 ms/op - WebAssembly via syscall/js
63.350000 ms/op - GopherJS via syscall/js
11.740000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
11.360000 ms/op - native JavaScript
Firefox Nightly
69.0a1 (2019-06-13) (64-bit)
94.150003 ms/op - WebAssembly via syscall/js
85.300000 ms/op - GopherJS via syscall/js
7.695000 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
7.405000 ms/op - native JavaScript
Safari Technology Preview
Release 85 (Safari 13.0, WebKit 14608.1.28.1)
57.249996 ms/op - WebAssembly via syscall/js
42.866666 ms/op - GopherJS via syscall/js
5.536666 ms/op - GopherJS via github.com/gopherjs/gopherjs/js
5.073333 ms/op - native JavaScript
The results are pretty consistent across the 3 browsers in that doing lots of DOM queries via WebAssembly was about 10x slower than with pure JavaScript.
Could you share the code you used to run the benchmark and output these values (ms/op)?
Thanks for the tests @dmitshur. I would have thought that after https://hacks.mozilla.org/2018/10/calls-between-javascript-and-webassembly-are-finally-fast-%F0%9F%8E%89/, the DOM access overhead in Firefox would have been reduced. And it's interesting that Safari is much faster for DOM access than Firefox.
The tests with Rust/C should give us a better idea of what exactly can be improved on the Go side. If anybody can post results for that, that'll be great.
@hajimehoshi Sure. I've updated the source code in the original post.
Change https://golang.org/cl/183457 mentions this issue: runtime,syscall/js: reuse wasm memory DataView
@martisch suggested that I add a "real-world" example that demonstrates the performance hit of WebAssembly compared to running natively. A good example is the "gophers" demo from Gio (gioui.org). With modules enabled and using Go 1.13 (tip), you can build and run the demo with the following commands:
$ export GO111MODULE=on
$ go run gioui.org/cmd/gio -target js gioui.org/apps/gophers -stats # for building gophers
$ go run github.com/shurcooL/goexec 'http.ListenAndServe(":8080", http.FileServer(http.Dir("gophers")))' # for serving gophers on localhost:8080
Then, open http://localhost:8080 in a browser. The target frame time is ~16.7ms (60 Hz), but on my MacBook Pro it almost never hits the target.
Running the example natively,
$ go run gioui.org/apps/gophers -stats
it easily hits the 60 Hz target.
In both Chrome and Firefox the built-in profiler is a great way to see what takes up the time. I've attached a screenshot of a single frame from Chrome's "Performance" tab. The frame time is 24ms.

Unfortunately, the function names are all mangled ("wasm-function[...]"). (Fixed by not passing -w -s to ldflags.)
CC @cherrymui who recently optimized wasm.
Thanks @eliasnaur. I have sent CL 183457, which should alleviate the DOM overhead to some extent. Would you be able to try it and check if it helps at all? Just a note: the CL only optimizes the DOM call overhead, so if your app is heavy on computation in wasm land itself, it might not help very much.
Regarding profiles, yes, the Chrome profiler is a great tool. The wasm-function[] names are indeed a bother (see the bug I filed with the Chrome team). Until that's fixed, may I take this chance to suggest wasmbrowsertest? It was mentioned in @johanbrandhorst's GopherCon talk. With it, you can take CPU profiles for wasm natively, just as you would for amd64. It automagically converts wasm-function indices to their proper names, and you can analyze the profiles directly with go tool pprof 🙂. That should give you better insight into what's going on in your app, and you can see whether there is a possibility to optimize hot functions.
Thanks @agnivade, wasmbrowsertest is definitely useful for running benchmarks and standalone tests on wasm. However, full drawing and rendering to a window doesn't lend itself to that model yet.
Fortunately, I figured out how to bring back the function names: the gioui.org/cmd/gio command passed -ldflags=-w -s, which as a side effect stripped the function names from browser debuggers. I've removed the flags, which didn't save much space anyway.
Finally, I updated my comment to add the -stats flag that enables profiling without Ctrl-P.
Re: function names, it is a cold-cache phenomenon as far as I understand. On the first load, the names come up as wasm-function[]; on subsequent reloads, the proper names show up. It is hard to reproduce, though. See the bug I filed.
Anyway, I see some syscall/js.ValueCall in the profile, so my CL should™ be able to help. Feel free to give it a try whenever you have a chance.
I tested with your CL 183457, which seems to help: the frame times are lower and more consistent. This is an example of a 17ms frame (the above profile had frame times above 20ms):

However, the CPU usage still seems too high. According to the profile, almost 10ms of CPU time is spent building the vector shape for the frame timer in the top right corner. The text layout code is definitely CPU heavy and unoptimized, but 10ms seems excessive.
To verify the profile, I cut out the rendering of the statistics label and redid the profile:

Firefox also misses the frame target:

It looks like CPU-heavy code is faster in Firefox, whereas DOM calls are slower. Perhaps the DOM calls are only slower because Firefox's WebGL implementation is slower.
In summary, it looks like the demo is CPU bound, supporting the claim that Go generates inefficient WebAssembly code.
I'll work on preparing a benchmark that can run in wasmbrowsertest and that skips all rendering/DOM calls.
Great stuff! I think we are getting somewhere. Yes, the wasm code generation could use some love. I have a couple of CLs that apply some rewrite optimizations present on amd64 but absent on wasm; they should go in when the tree opens.
But it would be great if you could prepare a standalone benchmark. That would allow us to compare the generated code with amd64 and see if there are obvious places for improvement.
I split the UI update from its rendering and added a benchmark. To see the difference, I ran:
$ go test -bench . -count 8 -cpu 1 gioui.org/apps/gophers > native.bench
$ GOOS=js GOARCH=wasm go test -exec ~/go/bin/wasmbrowsertest -bench . -count 8 gioui.org/apps/gophers > wasm.bench
$ benchstat native.bench wasm.bench
name  old time/op  new time/op   delta
UI    14.9µs ± 1%  216.5µs ±22%  +1354.01%  (p=0.000 n=7+8)
So more than 10 times slower on wasm compared to native code, at least on my 2014 MBP.
I investigated the profiles and started looking at the GOSSAFUNC output of some hot functions. The amd64 code showed lots of (MUL/DIV)SS. The wasm code, however, showed something interesting: there were lots of F32DemoteF64 and F64PromoteF32 instructions in the generated code. For example:
v395 00419 (14) F32Load "".ctrl1+32(SP)
v395 00420 (14) F64PromoteF32
v395 00421 (14) F64Sub
v395 00422 (14) F64Const $(0.5)
And in fact, code like this was generated several times:
v403 00474 (213) I32WrapI64
v403 00475 (213) F32Load ""..autotmp_318-64(SP)
v403 00476 (213) F64PromoteF32
v403 00477 (213) F32DemoteF64
v403 00478 (213) F32Store $0
This means all 32-bit FP values are being promoted to 64 bits, operated on, and then demoted back to 32 bits before being written to memory.
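To make the pattern concrete, here is a hypothetical float32 function of my own (loosely modeled on the ctrl1 code in the listing above, not taken from the Gio source). On GOARCH=wasm, the subtraction against the 0.5 constant should lower to roughly the F32Load / F64PromoteF32 / F64Const / F64Sub / F32DemoteF64 shape shown:

```go
package main

import "fmt"

// step is pure float32 arithmetic; on the wasm backend each operand is
// promoted to 64 bits, the subtraction and multiplication run as 64-bit
// ops, and the result is demoted back to 32 bits before being stored.
func step(ctrl1, t float32) float32 {
	return (ctrl1 - 0.5) * t
}

func main() {
	fmt.Println(step(1.5, 2)) // (1.5-0.5)*2 = 2
}
```

Setting GOSSAFUNC=step while building with GOOS=js GOARCH=wasm dumps the SSA and generated code for a sketch like this, which is a quick way to inspect the promote/demote pairs.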
A quick look into WasmOps.go revealed that the 32-bit FP instructions were missing, and then I understood why: all the FP registers (F0-F15) are treated as 64-bit registers.
Now here is where my speculation begins. Since Go's SSA works only with registers, these virtual registers were created for SSA to work with. But in the generated code, all references to registers are rewritten to local.(get|set|tee). So theoretically it should be possible to construct another set of 32-bit registers and add 32-bit FP instructions that deal only with them, avoiding this 32-64 round trip.
@neelance / @cherrymui - Is this analysis correct? If so, how would you recommend extending the F0-F15 register set to include 32-bit registers too? I have a local CL where I have already added the 32-bit instructions; now I just need to make these local.(get|set|tee) work with 32-bit values.
The F64PromoteF32+F32DemoteF64 combination should only happen if rounding to 32 bits is actually necessary. In many cases the Go spec allows using 64-bit precision for float32 values.
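For illustration (my own example, not from the thread's code): the rounding is observable whenever a float32 result is stored or compared, which is exactly when the demote cannot be dropped. The spacing between adjacent float32 values near 1e8 is 8, so adding 1 must be lost after rounding:

```go
package main

import "fmt"

func main() {
	// 1e8 is exactly representable in float32, but 1e8+1 is not:
	// adjacent float32 values there are 8 apart, so the sum must
	// round back to 1e8 when computed at 32-bit precision.
	var a float32 = 1e8
	fmt.Println(a+1 == a) // true: the +1 is lost to 32-bit rounding

	// The same addition is exact at 64-bit precision.
	var b float64 = 1e8
	fmt.Println(b+1 == b) // false
}
```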
Yes, it is possible to add registers for 32-bit floats, but I'm not sure how much this would affect performance, because I suspect that CPUs are not faster on 32-bit floats than on 64-bit floats (I might be wrong).
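One rough way to test that guess (a sketch of mine, with illustrative function names) is a pair of benchmarks driven from main, the same way the DOM benchmarks earlier in this thread are, so it can run both natively and under wasmbrowsertest:

```go
package main

import (
	"fmt"
	"testing"
)

// benchFloat32 and benchFloat64 run the same multiply-add recurrence at
// the two precisions; comparing their ns/op on amd64 vs wasm would show
// whether dedicated 32-bit FP handling is worth pursuing.
func benchFloat32(b *testing.B) {
	var total float32
	for i := 0; i < b.N; i++ {
		total = total*0.999 + float32(i)
	}
	_ = total
}

func benchFloat64(b *testing.B) {
	var total float64
	for i := 0; i < b.N; i++ {
		total = total*0.999 + float64(i)
	}
	_ = total
}

func main() {
	fmt.Println("float32:", testing.Benchmark(benchFloat32))
	fmt.Println("float64:", testing.Benchmark(benchFloat64))
}
```

The numbers would of course be machine- and browser-dependent, so only the float32/float64 ratio on each target is meaningful.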
> The F64PromoteF32+F32DemoteF64 combination should only happen if rounding to 32 bits is actually necessary.
I think you are referring to this:

case ssa.OpWasmLoweredRound32F:
	getValue64(s, v.Args[0])
	s.Prog(wasm.AF32DemoteF64)
	s.Prog(wasm.AF64PromoteF32)
I actually found another code path, in case ssa.OpWasmF32Store, where getValue64 generates an F64PromoteF32 and then, because of the if v.Op == ssa.OpWasmF32Store check, another AF32DemoteF64 gets added. I did not look much deeper into it, though.
> Yes, it is possible to add registers for 32-bit floats, but I'm not sure how much this would affect performance,
Sure, if there is no perf boost, then there is no use. But I would like to try it and check the benchmarks. What is the right way to add 32-bit registers? Just add F16-F32? Or is there another way?
Is the F64PromoteF32 the one emitted by case ssa.OpLoadReg: of ssaGenValueOnStack? If yes, then this is indeed something we could optimize.
> What is the right way to add 32-bit registers? Just add F16-F32? Or is there another way?
This is not easy to describe in a few words...
@agnivade Does your CL contain fixes for the 32-bit integer instructions? It also amazed me to see such 32-64-32 int/fp conversions.
I have not sent any CL yet. And no, I have not looked into the 32-bit integer instructions.