Emscripten: Coroutines are broken with Asyncify on upstream WASM backend

Created on 14 Jul 2019 · 38 comments · Source: emscripten-core/emscripten

Test case:

#include <stdio.h>
#include <emscripten.h>

void cofunc(void *data) {
    printf("co: 1\n");
    emscripten_yield();
    printf("co: 2\n");
    emscripten_yield();
    printf("co: 3\n");
}

int main(int argc, char **argv) {
    emscripten_coroutine co = emscripten_coroutine_create(cofunc, NULL, 64 * 1024);
    printf("main: 1\n");
    emscripten_coroutine_next(co);
    printf("main: 2\n");
    emscripten_coroutine_next(co);
    printf("main: 3\n");
    emscripten_coroutine_next(co);
}

Compile and run:

$ emcc -o cotest.js -s ASYNCIFY=1 cotest.c
$ node cotest.js

Fastcomp output (correct):

main: 1
co: 1
main: 2
co: 2
main: 3
co: 3

Upstream wasm output (broken):

main: 1
co: 1
co: 2
co: 3
main: 2
co: 1
co: 2
co: 3
main: 3
co: 1
co: 2
co: 3
help wanted

All 38 comments

I'm aware that asyncify is deprecated, but there's nothing else to support coroutines in the upstream wasm backend right now. I'm looking forward to Bysyncify support.

It should be pretty simple to implement your own yield by calling Asyncify.handleSleep, right?

Like in the Asyncify docs: https://emscripten.org/docs/porting/asyncify.html#making-async-web-apis-behave-as-if-they-were-synchronous
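For what it's worth, the control flow can be modeled in plain JS without any wasm at all. In this sketch (every name is invented for illustration, none of this is real Emscripten API), a queue of continuations stands in for the wakeUp callbacks that handleSleep would hand out, and parking/draining that queue plays the role of unwinding and rewinding the stack:

```javascript
// Runnable model of the yield/next pattern (no Emscripten involved).
// Continuations stand in for the stack state Asyncify would save/restore.
const parked = [];   // continuations of suspended "coroutines"
const log = [];

function yield_(cont) { parked.push(cont); }               // like parking wakeUp
function next() { const c = parked.shift(); if (c) c(); }  // like calling wakeUp

// "create" parks the entry point instead of running it, so the first next()
// starts the coroutine, mirroring emscripten_coroutine_create's semantics
parked.push(() => {
  log.push("co: 1");
  yield_(() => {
    log.push("co: 2");
    yield_(() => { log.push("co: 3"); });
  });
});

log.push("main: 1"); next();
log.push("main: 2"); next();
log.push("main: 3"); next();
console.log(log.join("\n"));   // same interleaving as the fastcomp output above
```

The real difficulty, of course, is that Asyncify has to reconstruct a genuine call stack rather than a hand-written continuation, but the bookkeeping shape is the same.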

I have very little idea of how asyncify works, and library_async.js looks arcane to me. The purpose of Asyncify.handleSleep, from that example, seems to be to make asynchronous JS calls block in WASM code. But then it just returns back to the call site. For coroutines you'd need a way to keep multiple contexts/stacks around and switch between them, and I don't know how to implement that right now.

I am also a bit new to Emscripten and WASM, but I will try to do it as an exercise :)

@corwin-of-amber Sounds great! :) Let me know if you have any questions about asyncify or the emscripten runtime that I can help with.

@kripken, are coroutines expected to work with asyncify/llvm backend? It looks like we currently don't test them (tests/test_coroutines.cpp is only tested with EMTERPRETIFY).

@sbc100 they are unimplemented on the llvm backend, and both fastcomp implementations (emterpreter and old asyncify) are currently broken.

OK, on ToT we currently report: missing function: emscripten_coroutine_create, which is at least accurate.

Should we re-title this bug "Coroutines are missing ..."?

I have created this gist with a prototype coroutine implementation based on Asyncify. For simplicity of demonstration, it allows a single main and several coroutines, although with a bit more tweaking I suspect it can be made to allow coroutines which invoke other coroutines as well.

@corwin-of-amber thank you for this prototype, it helped me get started with Asyncify a lot.

I've been trying to implement some more complete support. However, rather than extending the prototype, I went another way about it for now. I've tried to implement a simpler symmetric coroutine interface (or "fibers" as they are sometimes called). This is because I already have a library in my project that implements asymmetrical coroutines on top of various backends with symmetric "fiber-like" semantics, that is known to work. This way the JS part is kept to a bare minimum and most of the coroutine bookkeeping is done on the C/WASM side.

Here is my implementation so far.

The results were initially promising: it seems to pass any contrived test-cases I've thrown at it so far, and I actually haven't been able to isolate a failure yet. However, things still break in very peculiar ways in the real application (which schedules 100s of coroutines with lots of nesting, resumptions across main loop frames, etc.). I suspect #9401 and #9153 may be related in some way, but I don't yet understand exactly what I'm doing well enough to identify and fix the issues.

Also, using setTimeout for every context switch absolutely destroys performance. I'm talking about single-digit FPS in a game that can easily pull 1000s on the desktop build. Fortunately, we don't actually need to yield to the browser here, so we could simply use Asyncify.afterUnwind instead... except doing so overflows the JS call stack pretty quickly. I've found a hacky compromise here: use Asyncify.afterUnwind for the first ~1000 or so switches, then do one via setTimeout to reset the call stack, then ~1000 more via afterUnwind, and so on. Seems to work ok, but there's probably no robust way to estimate the optimal magic number.
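The compromise boils down to a counter-gated trampoline. A runnable model (names and the limit value are illustrative, not taken from any real API): invoking continuations directly nests the JS stack the way Asyncify.afterUnwind does, and every SWAP_LIMIT-th switch bounces through setTimeout so the stack can fully unwind.

```javascript
// Model of the "mostly-direct, occasionally-deferred" context switch hack.
const SWAP_LIMIT = 1000;   // the magic number; no principled way to pick it
let swapCounter = 0;

function scheduleSwitch(cont) {
  if (++swapCounter < SWAP_LIMIT) {
    cont();                // fast path: immediate call, JS stack keeps growing
  } else {
    swapCounter = 0;
    setTimeout(cont, 0);   // slow path: lets the JS call stack reset, but is
                           // orders of magnitude slower per switch
  }
}
```

Below the limit everything runs synchronously in one tick, which is why it is fast; the cost is that the nesting depth, and therefore the safe value of SWAP_LIMIT, depends on the engine's stack size.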

@kripken, perhaps you have some more insight on this?

I have actually encountered both #9401 and #9153; the latter is the reason I used setTimeout to avoid the nested calls to handleSleep. The former is confusing but can be explained by the fact that the asynchrony-inducing function is called a second time when rewinding the stack. Therefore the entire body of the function should be inside handleSleep.

This does not explain your intermittent failures with stress payloads. Also it does not explain why afterUnwind exhausts the stack - after all, it happens after unwinding.

Also it does not explain why afterUnwind exhausts the stack - after all, it happens after unwinding.

Unwinding only applies to the wasm part of the stack, I believe. afterUnwind happens somewhere inside the handleSleep call, so when the continuation wants to do another context switch (either to resume or yield from a coroutine), it happens to call handleSleep recursively, and this never unwinds.

One thing I don't see in the prototype code is C stack handling (stackSave, stackRestore) - I think each coroutine needs its own, otherwise if it allocates a little (using C alloca(), or otherwise needing some C stack), then is paused, then we go to another one which frees some C stack, it will corrupt the first. Maybe that explains some of the problems being seen?

Regarding afterUnwind, I think we need to find a way to unwind the stack up to the top, where "runtime" code calls the next thing that needs to execute. That may include the non-coroutine code too, i.e. treat the main execution context as a coroutine as well.

Oh, and nice work btw! :) Exciting to see experimentation and progress here.

Thanks for looking at this! I have been wondering about stackSave/stackRestore, from looking at the WAT it seems that these only manage the stack pointer, not the stack content... so according to what you say, if both coroutines use alloca(), then they might clobber each other's scratch space.

Yes, exactly. The simple thing might be to allocate (using malloc) some C stack space, then when a coroutine is started or resumed to use stackRestore to set the stack pointer to that location. Then when it pauses, to remember that location using stackSave and return to the previous stack location from before it was resumed, etc.
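That bookkeeping can be modeled in a few lines. In this sketch nothing is real wasm: stackSave/stackRestore only ever move the stack pointer (not its contents), so a plain number stands in for the pointer, and the addresses and sizes are made up for illustration.

```javascript
// Runnable model of per-coroutine stack-pointer save/restore.
let sp = 0x100000;                          // current "wasm" stack pointer
const stackSave = () => sp;                 // stand-ins for the real exports
const stackRestore = (p) => { sp = p; };

function fiberCreate(base, size) {
  // in the real thing, base would come from malloc; the stack grows downward,
  // so a fresh fiber starts with its pointer at the top of its region
  return { sp: base + size, callerSp: 0 };
}

function fiberResume(f) {
  f.callerSp = stackSave();                 // remember the caller's stack
  stackRestore(f.sp);                       // switch onto the fiber's stack
}

function fiberPause(f) {
  f.sp = stackSave();                       // remember how far the fiber got
  stackRestore(f.callerSp);                 // back onto the caller's stack
}
```

Because only the pointer moves, a paused fiber's live stack frames stay untouched in its malloc'd region, which is exactly what prevents the clobbering described above.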

By the way, is it possible to somehow store a reference to the jump/wakeUp function inside the coroutine struct on the heap, or would I have to track it on the JS side via e.g. a mapping from coroutine pointers to the corresponding resumption functions?

Thanks for the pointers @kripken, it looks like I actually got it working! Taisei's coroutines branch works beautifully with Emscripten now (except for one mysterious bug that only manifests when optimizations are enabled, but I don't think it is coroutine-related).

I must say I actually prefer this symmetrical "fiber" API for my use-case, so I'll probably keep using it in my project after some clean-up. Given that Emscripten's coroutines API has been unusable for a while now (even in fastcomp), perhaps it's ok to deprecate/remove it in favor of a new symmetrical API? I think it's both easier to implement and to work with. But I'm ok with implementing the existing API too, if you want to keep it around. I could also leave that for @corwin-of-amber if he wants to do that :stuck_out_tongue:

Nice @Akaricchi , glad to hear things work!

By a more symmetrical API, do you mean this?

I think it would be useful to support the existing API for existing users - would be great if you or @corwin-of-amber want to open a PR with that! But also if there is a nicer API for this, I'm open to that too (there likely are not that many existing users of coroutines, so a breaking change for a good reason might be justified).

@kripken

By a more symmetrical API, do you mean this?

No, libkoishi actually provides an asymmetrical public API - which just means that coroutines have this sort of caller-callee relationship with resume/yield semantics built into the API.

With a symmetrical API, resume and yield are not distinct from each other. There is only one context switching operation, which means every "yield" must be explicit about what context/coroutine to return control to. A low-level symmetrical API doesn't have to manage as much state as an asymmetrical one, making the implementation smaller and simpler.

Now, it just so happens that pretty much all of libkoishi backends are implemented on top of symmetrical context switching APIs, such as Boost's fcontext, POSIX ucontext, and win32 fibers. The old emscripten coroutine backend is actually the only exception to this. For this reason, there's a bit of thin glue code shared by all those backends, which implements the public asymmetrical API on top of symmetrical context switching semantics. As you can probably see, this mapping is very straightforward.

With the new Asyncify backend, I leveraged the same glue code, by implementing a very minimal symmetrical context switching API in JS. Note how it doesn't even care about which coroutine/context is the "current" one, and doesn't save any caller information in the control structure. It doesn't need to, because it simply takes the caller/current context as a parameter to the swap function.
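The surface of such a symmetric API is tiny. Here is a runnable model of it (all names invented for this sketch; continuations stand in for Asyncify's saved stacks): there is exactly one switching primitive, and the caller always names both contexts, so nothing has to track which one is "current".

```javascript
// Model of a koishi-style symmetric context switch.
function makeContext(entry) { return { resume: entry }; }

// the single primitive: suspend `from` (recording where it continues)
// and transfer control to `to`
function swap(from, to, contOfFrom) {
  from.resume = contOfFrom;
  to.resume();
}

const log = [];
const mainCtx = makeContext(null);
const fib = makeContext(() => {
  log.push("fiber: 1");
  swap(fib, mainCtx, () => {       // "yield" is just a swap that names main
    log.push("fiber: 2");
    swap(fib, mainCtx, () => {});
  });
});

log.push("main: start");
swap(mainCtx, fib, () => {         // "resume" is a swap that names the fiber
  log.push("main: middle");
  swap(mainCtx, fib, () => { log.push("main: end"); });
});
```

Note that in this model each swap invokes the next continuation directly, so the JS stack nests on every switch, which is precisely the afterUnwind stack-growth problem discussed earlier in the thread.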

I find this approach simpler and more maintainable compared to the older coroutine implementations in emscripten, which conflate the context-switching logic with the higher-level bookkeeping. There may have been a good reason to do it that way (implementation details of old asyncify, perhaps?), but it doesn't seem to apply to the new Asyncify at least.

TL;DR

So basically, my suggestion is to add something like libkoishi's internal JS API to emscripten. Then people can either use it directly as a low-level context switching mechanism and a building block for stuff like resume/yield coroutines, async/await, etc. or use libkoishi as a ready-made wrapper on top of it. We can even try to expose it as a subset of the ucontext API. Some code using ucontext will then probably Just Werk, although sjlj hacks commonly used to speed it up are still off limits.

We can then also implement emscripten's old coroutines API on top of it quite easily, though I'm honestly not sure there's much point in continuing to support it.

+1 to shipping the lower-level primitive and implementing the existing API on top of it. Or just dropping the existing API if there are no users. I guess it depends how hard it would be to implement the existing API on top of the primitive one (I'm guessing not that hard).

Cool, that definitely handles the stack allocation. Have you found this to be stable w.r.t. your code that has hundreds of coroutines?

I'm not sure I understand what makeSetValueAsm is used for, esp. since f_old is passed as a string? https://github.com/taisei-project/koishi/blob/b16c82fc6e86a1623d8ea553bfb934aee05033dc/src/asyncify/library_fiber.js#L75
I probably need to look at the runtime code for that.

since f_old is passed as a string?

Oh it's a macro thingy... * feeling stupid *

Have you found this to be stable w.r.t. your code that has hundreds of coroutines?

Yeah. I'll be doing more testing later, but so far it seems to work fine. Performance is not ideal, but seems acceptable for now. Even though I really dislike the hack that makes it at all viable.

If you're curious, my use case is in this game, where we're currently refactoring game logic around coroutines in the coroutines branch.

Wow and you made it work in the browser? Pretty amazing!

Also, the game looks super hard 😄

Interesting, so basically we can add a new Fiber API as in that link, and implement the existing Coroutine API on top of it? Sounds good!

Just to be a little forward-looking, we should make sure that whatever we do here aligns with C++20 coroutines, so we don't need to invent another primitive when we get around to implementing those.

From what I understand, C++ coroutines are stackless, thus operate in a fundamentally different way - with different limitations as well, e.g. they can't yield across a function call. LLVM has some intrinsics and code transformations for those, but I have no idea how/if we can leverage them.

It should be possible to fake stackless coroutines on top of fibers (inefficiently), but not the other way around.
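JS generators make the stackless limitation concrete: only the coroutine function itself can suspend, while an ordinary function it calls cannot yield on its behalf. This mirrors the C++20 restriction mentioned above (stackful fibers have no such restriction). A small illustrative sketch:

```javascript
// Stackless suspension modeled with a generator.
function helper() {
  // there is no way to suspend the enclosing coroutine from down here;
  // this function must run to completion before control can leave the task
  return "helper result";
}

function* stacklessTask() {
  yield 1;              // suspending at the coroutine's own level: fine
  const v = helper();   // a callee cannot yield across this call boundary
  yield v;
}
```

A stackful coroutine, by contrast, could suspend from arbitrarily deep inside helper, because its entire call stack is preserved across the switch.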

In any case, this issue has always been about stackful coroutines (i.e. those that emscripten's API was supposed to provide and that happen to fit my use case). C++ coroutines are a different beast entirely as far as I'm concerned, and it's honestly way over my head right now.

I agree, I think C++20 coroutines are different enough that we'll need to think about support for them separately, when that time comes. (But we may end up using some of the same tools in both, perhaps.)

Oh wow, I didn't realize C++ coroutines were stackless, that makes them a whole lot easier to implement.

From the above doc: "Coroutines are stackless: they suspend execution by returning to the caller and the data that is required to resume execution is stored separately from the stack."

Coincidentally, at the implementation level it seems that asyncify is also stackless, since it doesn't actually switch stacks; it stashes everything it needs to resume in a side buffer, which seems to be exactly what they're talking about in the spec here. The main difference being that asyncify does it all under the hood, so you don't realize it's stackless, I guess?

Yeah, Asyncify is stackless from C's point of view, it has its own buffer on the side. I do hope we can implement C++20 coroutines using it, but maybe we'll have native wasm support by then anyhow...

For now, I don't think any of that blocks any of the current stuff here, a PR would be great!

Interesting, so basically we can add a new Fiber API as in that link, and implement the existing Coroutine API on top of it? Sounds good!

Yes, though a couple issues remain:

  • We should really do something about the performance hack, if at all possible.
  • My current implementation has to track the continuation function on the JS side per-fiber. That's because it needs to store the wakeUp callback from handleSleep, which is a closure. Not really a problem in itself, but read on.
  • The coroutines API is missing a free/cleanup function. I consider this to be a design flaw. Coroutines are, apparently, supposed to be automagically cleaned up after they finish (an odd choice for a C API), and it's not accounted for that sometimes you may want to interrupt a coroutine prematurely, or have one that never returns normally. Previously, users have been advised to simply call free in such cases (#6540). However, this won't fly anymore because of the previous point.
  • I have no idea how any of this currently interacts with threads at all. Generally, you should be able to suspend a fiber on one thread and resume it on another just fine (given proper synchronization). Not relevant to my use-case, but worth considering.

If you think about it, exposing fibers puts the current coroutines API in an odd place.

Since it's so bare-bones, you'll probably never see it used directly. Users will most likely end up writing wrappers to implement things like querying coroutine status (running/suspended/dead), querying the currently running coroutine, cancelling a coroutine, passing values across resume/yields, etc. (something like the libkoishi API basically). At this point they might as well wrap the fibers API directly and skip some unnecessary indirection.

If they are porting code that already uses coroutines (I suspect this is the most likely use case), then fibers are a more natural fit as well, since all the native stackful coroutine implementations are based on very similar context switching primitives.

The only reason to keep it around may be compatibility with existing code. But given that it's been broken for several months with virtually no one noticing, I honestly doubt there's any code that relies on it.

So I'm leaning towards not implementing it, and officially declaring it deprecated/removed after fibers land. Of course, we can always implement it later if someone wants it for some odd reason. I don't have strong feelings about it though.

SGTM

Coroutines are already missing on the wasm backend, and you seem to be the only person asking for them so far. So if nobody else asks for them, they will simply disappear when we delete fastcomp.

Sounds good to me too!

What's the performance hack mentioned earlier (with swapCounter == 1000)?

@kripken see the last paragraph in this comment. Basically, setTimeout is way too slow to call on every context switch, but never doing so causes the call stack to build up (and eventually overflow).

I see @Akaricchi , yeah, sounds like you need that hack. I don't see a way around it.
