Hi folks,
I was asked by @jfbastien to post here based on a twitter conversation we had: https://twitter.com/jfbastien/status/941170112014327808
AutoCAD web is a flavor of AutoCAD which runs entirely in the browser thanks to WebAssembly. We took our core engine and removed everything we could to get it to the smallest possible size. The resulting wasm file is currently 29.6 MB. It's now in beta if you want to try it out: http://client.autocad360.com/
Our team has been working on this for a while now, and we expect many different users with slow internet connections to use the application. We want to optimize for first-time use as much as possible. We also realize that this represents one of the largest web applications out there.
Ideally, we'd break up the wasm file into smaller chunks, where the first chunks downloaded would only represent the minimum code necessary to display graphics, show the cursor, and let the user zoom and pan around while the commands and other modules are lazy loaded.
We don't currently have a good strategy for defining split points. The desktop variant of the product uses a virtual memory manager (VMM) and profile guided optimization (PGO) to optimize startup time, and our hunch is that this is a good strategy for partitioning our code into "hot" and "cold" chunks. It's not feasible for us to shrink the core engine any further by manually defining split points.
We think we can solve this problem by investing in internal tooling that extends PGO to emit two wasm files: a hot wasm file containing stubs and a cold wasm file. A stub, when called, would block until the cold wasm file is downloaded and started. Managing two wasm files in JavaScript feels like a pretty nasty hack since the boundary between the hot and cold wasm files would most likely involve crossing from wasm to js to wasm again.
It would be even better if we could work with the browser vendors and solve the problem by extending the design of WebAssembly to support this use case. I expect lots of larger applications would be able to take advantage of it.
For example, what if we could pipeline a PGO-optimized wasm file so that it was downloading, compiling, instantiating, and executing in a streaming process? The process could also raise events when a cold stub was hit, allowing us to design an experience around startup.
@szilvaa feel free to chip in if I missed anything.
This certainly seems like a use case that we should have an answer for. Unfortunately, I don't have any outstanding starting points ATM for a proposal. I do have one comment on your "hot" to "cold" bridge paying the cost of Wasm->JS->Wasm every time, however.
I think if you made the bridge calls into indirect calls from a Table then you would only have to do a Wasm -> JS call once and it could stub its entry in the table with the "cold" Wasm function. At least in JSC, although I expect all engines do this, if we see a Wasm -> Wasm indirect call we will bypass the JS entrypoint code. Thus, only the first "cold" call will pay the Wasm->JS->Wasm boundary cost.
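The self-patching stub described above can be modelled in a few lines. This is only a sketch: a plain array stands in for a `WebAssembly.Table`, the cold function is assumed to be available synchronously, and every name (`loadColdFunction`, `callIndirect`, `COLD_SLOT`, `bridgeCrossings`) is hypothetical.

```typescript
// A plain array stands in for a WebAssembly.Table (assumption for illustration).
type Fn = (x: number) => number;

const table: Fn[] = [];
let bridgeCrossings = 0; // counts how often the slow bridge path runs

// Hypothetical loader; a real one would obtain the function from the cold module.
function loadColdFunction(): Fn {
  return (x) => x * 2;
}

const COLD_SLOT = 0;

// Stub installed at startup. On its first call it crosses the bridge, fetches
// the real function, overwrites its own table slot, then forwards the call.
table[COLD_SLOT] = (x) => {
  bridgeCrossings++;
  const real = loadColdFunction();
  table[COLD_SLOT] = real;
  return real(x);
};

// Hot code always calls indirectly through the table slot.
function callIndirect(slot: number, x: number): number {
  return table[slot](x);
}

const first = callIndirect(COLD_SLOT, 21); // goes through the stub
const second = callIndirect(COLD_SLOT, 4); // goes straight to the real function
```

After the first call the stub is gone from the table, so later indirect calls never touch the bridge again.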
Twitter context, with @TheLarkInn saying that Webpack is looking at doing this 😁
Of course other tools doing it would be great.
One sad thing about inserting wasm->js->wasm calls where there used to only be wasm->wasm is that tools now need to re-write all i64 parameters to be i32 pairs.
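To illustrate what that rewrite means, here is a rough sketch of passing one 64-bit value as a low/high pair of 32-bit halves. It uses plain numbers, so it only covers non-negative integers below 2^53; `splitU64`, `joinU64` and `I32Pair` are hypothetical names, not part of any toolchain.

```typescript
const TWO_32 = 4294967296; // 2^32
const MAX_SAFE = 9007199254740991; // 2^53 - 1

interface I32Pair {
  lo: number; // low 32 bits, as an unsigned value
  hi: number; // high bits
}

// Split a non-negative safe integer into two 32-bit halves.
function splitU64(value: number): I32Pair {
  if (Math.floor(value) !== value || value < 0 || value > MAX_SAFE) {
    throw new RangeError("expected a non-negative safe integer");
  }
  return { lo: value >>> 0, hi: Math.floor(value / TWO_32) };
}

// Recombine the two halves on the other side of the boundary.
function joinU64(pair: I32Pair): number {
  return pair.hi * TWO_32 + (pair.lo >>> 0);
}

const pair = splitU64(4294967301); // 2^32 + 5
const roundTrip = joinU64(pair);
```

Every i64 parameter becomes two i32 parameters in the bridged signature, which is why the tools have to rewrite both the caller and the callee.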
@camwest you mentioned the streaming process. WebAssembly supports instantiation using a stream (probably your HTTP request). Have you tried this approach?
In our Cheerp C++ compiler we are currently exploring a solution based on tagging functions/classes/namespaces at the C++ level (something like [[cheerp::hot]] or [[cheerp::first]], as we already do to automatically generate JS bridges for DOM interaction) to mark them for inclusion in a loader module. A PGO-based approach could also be considered.
Agreed that it would be cool if tools could help here.
@camwest Is this 29.6mb before Content-Encoding:gzip? Assuming the usual 3x reduction we see with gzip, is the download time of the ~10mb compressed payload problematic, or is it more the subsequent compilation time you see in release browsers? If it's the latter, that should be improving significantly in the coming months, especially if you use the streaming compilation API so that compilation can overlap download.
If it's the latter, that should be improving significantly in the coming months
Still, the fastest compilation will occur when we don't have to compile anything 😄
@lukewagner What we (I work with @camwest) are hoping is that streaming compilation can be further pipelined all the way to the "run" stage so in the end we could start running wasm before it is fully downloaded. We expect this would further improve performance provided that the code in the wasm is arranged such that the first bytes are the first to run (via PGO)
This would mimic what the virtual memory manager does on desktop: when you load an executable the code is simply mapped into the virtual address space (no i/o). Then as the address space is used the VMM receives the page-fault and brings the code in page by page.
@jfbastien Agreed.
@szilvaa I can see why that's attractive, but having an arbitrary synchronous wasm call block on the network seems to risk the app freezing (if the user does something outside the profiled path or if the network is extra slow) and is also at odds with the general non-blocking-io design of the web. Also, the network can be a lot slower than local i/o, so it's not quite so analogous to what native does when launching a local app.
Running WASM code before the file is fully downloaded is not possible with the current design: the download process must check for the presence of a data section (which is after the code section) before the instance can be returned.
If the implementation had an option to disable the data section, then there's potential.
A solution where WASM exports could be directly connected to imports without bridging through JavaScript seems ideal to me. It provides an efficient solution to this problem (i.e., on-demand loading of rarely-used code) and enables 64-bit integers (and future types) to be directly transferred.
@RyanLamansky I don't quite understand how the WASM import/export mechanism avoids the problem that @lukewagner mentions above (i.e. arbitrary wasm call may be blocked on network I/O). The import dependency is either resolved at instantiation time in which case it really does not help at all with startup performance. Or it is resolved at runtime in which case the provider of the import may not be present yet so the call must block.
I think the only way to avoid "freezing" the UI thread is to run your WASM on a worker (which is what we do). This suggests that maybe this sort of pipelined code execution should only be available in a worker.
cc @sokra for coverage. This scenario is something we'll likely discuss having webpack help solve.
webpack will not help you with a big wasm file.
It supports Code Splitting with import() for wasm. So if you can (manually) split up your WASM into multiple pieces you will be able to load these pieces on demand. It's an async call (Promise), so you probably need to handle it in your wasm (probably via some callback called from JS on Promise completion).
This probably requires you to restructure your native code, at least on these boundaries where you want to load on demand. It's a kind of distributed architecture: Multiple wasm components communicate async over JS.
@szilvaa Yeah, I can imagine a pure toolchain solution almost working in workers; the main limitation is effective lack of new WebAssembly.Module on Chrome which means no synchronous compilation to handle the "call before downloaded+compiled" case.
Regarding "We don't currently have a good strategy for defining split points." in the OP, have you considered a coarse-grained strategy of splitting the app up into an exe with asynchronously-loaded DLLs? IIUC, Emscripten provides support for dynamic linking (where the exe and each dll turn into a .wasm) and, now that we have Table, it should be pretty efficient.
@lukewagner Yes, of course, we have considered this. In fact, the code already has exe/dlls break-up on windows/osx but these boundaries are not on the hot vs. cold code boundary for our current web scenarios.
But let's say we have a hot.wasm and a cold.wasm.
As far as I understand we couldn't use an import Table in hot.wasm because these imports would have to be satisfied at instantiation time (which defeats our purpose here). So we would have to create some sort of custom thunking mechanism that allows the imports to be delay loaded (runtime-loaded). @sokra Is that what webpack has built or building? Can you link me to some more info?
@szilvaa That's only a restriction for load time dynamic linking. Tables are fully mutable at runtime (via set()) and so the toolchain can implement dlopen after instantiation time. I'm not actually up to date on the state of toolchain support here, but I had thought it worked already.
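A small sketch of that runtime mutability, using the JavaScript `WebAssembly.Table` API. The byte array is a hand-assembled minimal module (an illustration, not toolchain output) that exports a single function `f` returning 42; a real dlopen would install functions from a separately loaded module instead.

```typescript
// Minimal hand-assembled module exporting f(): i32, which returns 42.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7f,       // type section: () -> i32
  0x03, 0x02, 0x01, 0x00,                         // function section: one function
  0x07, 0x05, 0x01, 0x01, 0x66, 0x00, 0x00,       // export section: "f"
  0x0a, 0x06, 0x01, 0x04, 0x00, 0x41, 0x2a, 0x0b, // code section: i32.const 42
]);

// Synchronous compilation is fine for a module this tiny.
const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
const f = instance.exports.f as () => number;

// The table starts with one empty slot, long before f is installed.
const table = new WebAssembly.Table({ element: "anyfunc", initial: 1 });
const before = table.get(0); // empty slot

// Fill the slot at runtime, after the table already exists.
table.set(0, f);
const after = (table.get(0) as () => number)();
```

Note that `set()` only accepts exported WebAssembly functions (or `null`), not arbitrary JavaScript functions.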
@sokra Is that what webpack has built or building?
Yep, I imagined a JS bridge between two WASM modules with an async API in between. The bridge would use import() to on-demand-load the second WASM module on first use.
But @lukewagner's approach where JS only fills imports into a Table also sounds nice. I guess this results in faster WASM to WASM calls and you could use a sync interface.
Sync download + instantiation in WebWorkers doesn't look like a nice approach from UX to me. You basically block your complete native part while parts are downloaded.
I guess this results in faster WASM to WASM calls and you could use a sync interface.
Yes, it should basically be the same call as a plain pointer-to-function call which is going to be a factor faster than thunking through JS.
Sync download + instantiation in WebWorkers doesn't look like a nice approach from UX to me. You basically block your complete native part while parts are downloaded.
So it sounds like the current impl of dlopen in Emscripten does synchronous wasm compilation for bytes that are supposed to have been preloaded in the filesystem image. So that's not suitable for your use case here. I think instead you would want a function that takes a URL and C callback and then does instantiateStreaming(fetch(url)) under the hood. If that sounds right, I'd suggest filing an Emscripten issue; @kripken said it wouldn't be hard to add if someone wanted it.
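On the JavaScript side, such a function might look roughly like the sketch below. `loadModuleAsync` and `LoadCallback` are hypothetical names, not an existing Emscripten API; in a real integration the callback would forward into a C function pointer.

```typescript
// Callback invoked once with either an error or the finished instance.
type LoadCallback = (
  error: Error | null,
  instance: WebAssembly.Instance | null
) => void;

// Fetch, compile and instantiate a module while its bytes are still arriving.
function loadModuleAsync(
  url: string,
  imports: WebAssembly.Imports,
  done: LoadCallback
): void {
  WebAssembly.instantiateStreaming(fetch(url), imports).then(
    (result) => done(null, result.instance),
    (err) => done(err instanceof Error ? err : new Error(String(err)), null)
  );
}
```

`instantiateStreaming` expects the response to be served with the `application/wasm` MIME type, so the server configuration matters here.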
Managing two wasm files in JavaScript feels like a pretty nasty hack since the boundary between the hot and cold wasm files would most likely involve crossing from wasm to js to wasm again.
Rather than communicating through JavaScript, a better approach may be to reload the entire application: ultimately, the application is binary data that can be modified with string concatenation. You can download a "hot" wasm file and then a patch: the difference between the "hot" wasm file and the wasm file for the entire application. It should be possible to save the state from the hot application and then restore it in the full application.
Or, you could streaming-instantiate the new module in a new WebWorker, and also instantiate the existing module, by passing the WebAssembly.Module as a Transferable, in the new worker.
This way they are in one context and can get optimized calls through a Table. And after you loaded your data into the new worker, just nuke the old one.
@qm3ster , but the application would still be running midway when the data is loaded into the new worker so wouldn't we need to exit the application first?
@awtcode not before the application is finished loading in the new worker.
The old worker could still do things like rendering even as the new worker is loading the data, it should just avoid mutable operations.