Design: Typed continuations to model stacks

Created on 29 Jul 2020  Â·  68Comments  Â·  Source: WebAssembly/design

Motivation

Wasm currently lacks any support for _multiple stacks_ of execution and switching between them.
That prevents it from efficiently representing most _control flow abstractions_ that appear in many modern programming languages, such as

  • lightweight (“green”) threads
  • coroutines
  • generators
  • async/await
  • effect handlers
  • call/cc
  • …

In the spirit of Wasm as a low-level engine, it's highly undesirable to add a multitude of ad-hoc features for each of this mechanisms individually.
Instead, it's preferable to identify a sufficiently general mechanism for managing execution stacks that allows expressing all of them.
More concretely, Wasm needs a way to represent, create, and switch between stacks.

Challenges

  1. In order to be able to express features like green threads and others, switching stacks must be possible at any stack depth, not just at the bottom of the stack (“deep coroutines”).

  2. In order to be able to express features like generators, it must be possible to pass values (back and forth!) when switching to a different stack.

  3. In order to be able to validate the usage of stacks, both stacks and the values being passed when switching must be properly typed.

  4. In order to inform the semantics of the interaction with other control flow, esp exceptions, it is desirable to have a semantic framework that can explain all such constructs.

  5. In order for multiple of the above features to be usable at the same time, they must be expressible in a _composable_ fashion that allows arbitrary nesting.

  6. The mechanism must not require copying or moving stack frames, since some engines could not easily do that due to the heterogeneous nature of their stacks.

  7. It must not be possible to create cyclic references to stacks, since that would require GC.

Proposal

One known way to address the above challenges is to view stacks slightly more abstractly, through the notion of a _delimited continuation_.
Broadly speaking, that represents “the rest of” a computation (a continuation) “up to” a certain point (delimited).
Concretely, that certain point is the bottom of a specific stack, while the stack frames on the stack essentially describe the remaining tasks to be completed.

In particular, we can apply known ways to type such continuations, which is not a problem with an immediately obvious solution otherwise.
One specific way of typing continuations and the values communicated back and forth is by following the approach taken by so-called _effect handlers_, one modern way of representing delimited continuations, which essentially are exception handlers with resumption, where both throw and resume can carry along arguments.

Like with _asymmetric coroutines_, switching between stacks then occurs through a pair of resume and suspend instructions.
The suspend instruction “throws” a value akin to an exception — we call it an _event_ below — which the resume handles.
The event's tag, like a regular exception tag, determines the types of values passed from suspend to resume point, but also those passed back.

Details

The first central addition is a new form of _resumable exception_, a.k.a. event.
Such an event is declared with a respective new form of exception definition
(in the binary format this is an exception definition with a certain flag bit):

(event $evt (param tp*) (result tr*))

Unlike a regular exception, events can have return values.

The other central addition is a new (reference) type of continuations:

(cont $ft)

where $ft is a type index denoting a function type [t1*] -> [t2*].
Intuitively, this describes a suspended stack that is resumable with values of types t1* and will ultimately terminate with values of types t2*.

A new stack is created with the instruction

cont.new : [(ref $ft)] -> [(cont $ft)]

which takes a function reference to run on the new stack.
Execution isn’t started before resuming the continuation.

The instruction

cont.resume (event $evt $handler)* : [t1* (cont $ft)] -> [t2*]

resumes such a continuation, by passing it the expected arguments t1*.
Once the computation terminates, it returns with t2*.
If the computation _suspends_ before termination (see next instruction), with one of the event tags listed, then execution branches to the respective label $handler.
This happens _without unwinding the stack_.
The label receives the event arguments tp* and a continuation of type (cont $ft'), where $ft' is the function type [tr*] -> [t2*] of the _next_ continuation.

Continuations are _single-shot_, that is, they cannot be resumed more than once.
An attempt to resume an already resumed continuation traps.

A computation running on its own stack (i.e., initiated as a continuation), can suspend itself with the following instruction:

cont.suspend $evt : [tp*] -> [tr*]

where $evt is an event of respective type [tp*] -> [tr*].
Essentially, this passes control back to the parent stack.
That is, like throw the exception takes arguments tp*, and transfers control to the innermost cont.resume, or rather, its handler label, respectively.
But unlike regular exceptions, execution can also be resumed, which returns here with values of type tr*.

As described above, the innermost active handler matching the event tag receives the event's tp* and a new continuation.
Resuming this continuation (with cont.resume) will therefore pass back tr* to the originating cont.suspend.

Resumption can nest, i.e., a continuation may itself resume another continuation.
Consequently, an event may generally pass through multiple parent stacks before hitting a matching resume handler.
The same chain of stacks is maintained when resuming.

Note how the exception tag ties together the types of the values passed in both direction between a matching suspend and resume.
Different suspension points within the same computation can use different exception tags and thereby pass different types of values back and forth between parent and child stack.

Continuations are first-class values that are allowed to escape a handler.
That way, a stack can be kept live to be resumed at a later point.
The stack’s lifetime ends when the final resume terminates regularly (or with an exception, see below).
Managing the lifetime of continuations/stacks and evtrefs generally requires reference counting in the VM.
However, it’s impossible for them to form cycles, so general garbage collection is not needed.

Finally,

cont.throw $exn : [te* (cont $ft)] -> [t2*]

aborts a continuation by injecting the (regular) exception $exn with arguments of type te* at the suspension point.
This is a way to explicitly unwind the stack associated with the continuation.
In practice, a compiler should make sure that every continuation is used linearly, i.e., either resumed or aborted with a throw.
However, there is no simple way to enforce this in a type system, so Wasm validation won’t be able to check this constraint.

Example: A simple generator

To see how these mechanisms compose, consider encoding a generator enumerating i64 integers from 0 up until a certain condition is met.
The consumer can signal back this condition by returning a bool (i32) to the generator.
This can be expressed by the following event.

(event $enum-yield (param i64) (result i32)) 

Here is the actual generator:

(func $enum-until
 (param $b i32)
  (local $n i64)

  (local.set $n (i64.const -1))

  (br_if 0 (local.get $b))
  (loop $l

    (local.set $n (i64.add (local.get $n) (i64.const 1)))

    (cont.suspend $enum-yield (local.get $n))

    (br_if $l)

  )

)

Here is one possible consumer, which runs the generator until a given max value is reached:

(func $run-upto (param $max i64)

  (local $n i64)

  (local $cont (cont (param i32)))
  (local.set $cont (cont.new $enum-until))

  (loop $l

    (block $h (result i64 (cont (param i32)))
      (cont.resume (event $enum-yield $h)
        (i64.ge_u (local.get $n) (local.get $max))

        (local.get $cont)
      )
      (return)
    )

    (local.set $cont)
    (local.set $n)
    ;; ...process $n...
    (br $l)
  )
)

Note how the handler $h takes both the i64 argument produced by the generator and the next continuation.

Example: A simple thread scheduler

For a more interesting example, consider a simple scheduler for green threads.
We define two events with which a thread can signal the scheduler to either yield or spawn a new thread:

(event $yield)
(event $spawn (param (ref $proc)))

where

(type $proc (func))

We further assume that there is a global variable encoding a thread queue in form of a list of coninuations:

(global $queue (list-of (cont $proc)) ...)

The exact representation does not matter, so list-of is just pseudo code.
All we need is three obvious operations on this queue, ignoring their details:

(func $enqueue (param (cont $proc)) …)
(func $dequeue (result (cont $proc)) …)
(func $queue_empty (result i32) …)

Given these prerequisites, a simple scheduler loop can be implemented as follows:

(func $scheduler (param $main (ref $proc))
  (cont.new (local.get $main)) (call $enqueue)
  (loop $l
    (if (call $queue_empty) (then (return)))   ;; program is done
    (block $on_yield (result (cont $proc))
      (block $on_spawn (result (ref $proc) (cont $proc))
        (call $dequeue)
        (cont.resume (event $yield $on_yield) (event $spawn $on_spawn))
        (br $l)                                ;; thread terminated
      )
      ;; on $spawn, proc and cont on stack
      (call $enqueue)                          ;; continuation of old thread
      (cont.new) (call $enqueue)               ;; new thread
      (br $l)
    )
    ;; on $yield, cont on stack
    (call $enqueue)
    (br $l)

  )

)

TODO: More examples.

Interaction with regular exceptions

Regular exceptions and events are disjoint.
An exception handler will not catch events and vice versa.

If resuming a continuation throws a regular exception that is uncaught on its stack then it propagates through the active resume, unwinding the stack encapsulated in the continuation and ending its lifetime.

Yielding an event on the other hand does not trigger exception handlers between the suspend and the resume point.
They become active again when the continuation resumes and execution thereby switches back to their stack.

Note that this dichotomy could, in principle, be used to implement exceptions with two-phase unwind: the first phase is represented by an event, whose handler then injects back a proper exception to unwind. However, such a an implementation would be fairly expensive, since every exception handler would have to run on its own stack.
It is still useful as a semantic explanation of what _ought_ to happen, and how two-phase unwind would interact with single-phase exceptions.

Most helpful comment

Ah, thanks for the clarification @sabine! The link @rossberg above should be useful for understanding the evaluation-context approach. @kayceesrk mentions the CEK machine, which is another approach. It makes the stack explicit part of the machine state, much like the heap. So if the stack of the machine state looks like (cons K K'), i.e. a stack K' with a pointer to another stack K on top, then a stack switch changes the stack of the machine state to look like (cons K' K). The stack.switch instruction then does something similar, but further incorporating exceptions to manage untyped communication. Hope that helps.

All 68 comments

Is this identical to what you presented in the past @rossberg ? (If not a summary of the changes might be useful.)

@kripken, yes, up to some renaming and minor tweaks.

Got it, thanks @rossberg .

Is there another proposal on the way on how to support applications of stack inspection, e.g. two-phase exception handling, resumable exceptions, C#/Python-style generators, stack tracing, linear-memory garbage collection, stack duplication?

cont.throw doesn't take an evtref, so it's not possible to respond to an unknown suspension with an exception. Is that intentional?

The new function type $ft' : [tr*] -> [t2*] isn't a type that the module specifies, but one that is synthesized from other types. Doesn't that make for significantly more work at validation-time, as the synthesized type has to be checked for compatibility with a (later) specified function type?

@RossTate Doesn't this support those, if indirectly, since the exceptions are resumable?

@taralx I believe you're right that all of those features can be implemented in terms of this proposal. However, for many of those features, allocating an entire new stack would be overkill. Proposals for stack inspection, two-phase unwinding, etc. are essentially special cases of the feature proposed here that can be implemented more efficiently.

Yes, and on top of that, this proposal requires running code just to find out if an inspection site is relevant to the inspection at hand. So I'm wondering if interaction with (more efficient) stack inspection has been taken into account in this design.

Putting stack inspection aside, the lightweight-threads example seems to be a relatively inefficient implementation of lightweight threads. Ideally, switching between threads just involves a few instructions to switch contexts. But in this proposal, switching threads involves a stack walk to find the relevant br_on_evt, which is not necessarily cheap.

@taralx:

cont.throw doesn't take an evtref, so it's not possible to respond to an unknown suspension with an exception. Is that intentional?

That's a good point. But you cannot actually get an evtref for an unknown suspension either, since each handler is guarded by a list of known events. The idea here was that a handler has no business interacting with an unknown suspension -- that would destroy composition. But maybe there are use cases for a handle-all.

The new function type $ft' : [tr] -> [t2] isn't a type that the module specifies, but one that is synthesized from other types. Doesn't that make for significantly more work at validation-time, as the synthesized type has to be checked for compatibility with a (later) specified function type?

I don't think so, since validation already has to be able to treat multiple definitions of the same function type as compatible. Function types are structural, even in the MVP.

FWIW, there is a similar question regarding func.bind and other such instructions. Currently, it takes a target type as immediate to side-step the issue, but it would actually be enough (and cheaper to check) if it just had an numeric immediate that indicates the number of parameters to bind and synthesise the resulting function type.

So generally speaking, we have to decide whether we want to avoid synthesising, at the price of more type annotations and checking on instructions like these, or if we're fine with synthesising such types (which may require slightly more work in the spec primarily).

@RossTate Doesn't this support those, if indirectly, since the exceptions are resumable?

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

@tlively is correct however that there is one special case where creating a new stack can be overkill, namely when the continuation does not escape the handler. In practice, though, handling some of these cases without a stack would require the introduction of either fragmented stack frames, display chains, or scope restrictions that essentially make it equivalent to just calling another function. It is worth investigating if adding those mechanisms is a relevant win over implementing them in user space.

@RossTate:

Putting stack inspection aside, the lightweight-threads example seems to be a relatively inefficient implementation of lightweight threads. Ideally, switching between threads just involves a few instructions to switch contexts. But in this proposal, switching threads involves a stack walk to find the relevant br_on_evt, which is not necessarily cheap.

That is not correct. Handlers are associated with resume points, not branches. Consequently, a handler is only present "at the bottom" of each non-suspended stack, and there is no need to walk an actual stack. Moreover, each resume declares a list of tags it handles. So transferring control to a handler can e.g. be implemented by an indirect jump through a (thread-local) side table associating each event tag with its active handler.

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

Yes, and the same is true for every other suggestion that has been made for directly supporting these features. Only the suggestion of implementing shadow stacks within WebAssembly without direct support is not composable.

however that there is one special case

Let's work through the first such "special case" that was discussed four months ago: linear-memory garbage collection. Runtimes and programs with their own (linear-)memory management generally have some way to inspect the stack to find and later update the GC roots on demand. These programs now want to compile to WebAssembly and want some way to inspect the stack on demand to find and update these roots (albeit in a safer manner per WebAssembly's security requirements).

Your proposal is to allocate a new stack at every point someone would want to perform such inspection. For these applications, nearly every stack frame has such a root and therefore such a need. So your proposal is essentially to have these applications allocate every stack frame on its own separate stack. @aardappel, as someone who has been particularly advocating for this feature, do you think this would serve your needs?

Now the second "special case" that was discussed was stack tracing. All the most-used garbage-collected languages have some means for collecting/examining the stack trace (sometimes even lazily using an interactive stack walk) during execution even in production mode (not just debug mode). Again, nearly every frame would need to provide stack-trace information, so again your proposal would have all these languages allocate every stack frame on its own separate stack.

Moreover, each resume declares a list of tags it handles.

Ah, missed this change. There's a bug though. cont.forward also needs to specify a (handler) label and a list of events.

Oh, and another (unrelated) bug: cont.throw needs to consider that the continuation might catch the thrown exception. So it has an output type, and it needs to specify a (handler) label and a list of events.

Consequently, a handler is only present "at the bottom" of each non-suspended stack, and there is no need to walk an actual stack. Moreover, each resume declares a list of tags it handles. So transferring control to a handler can e.g. be implemented by an indirect jump through a (thread-local) side table associating each event tag with its active handler.

This assumes a particular implementation strategy and infrastructure, essentially baking in decisions that would often be made at the application level (e.g. should this dynamically scoped variable be implemented using a stack walk or using thread-local storage). If we're going to assume there is always the infrastructure for thread-local storage, it might be a good idea to incorporate the feature directly into WebAssembly, since it's useful for a lot more than just continuations. Regardless, this means that in the examples above every call and return (in the surface language) would have to update thread-local storage. In general, every suspend/resume, forward, and throw will have to update thread-local storage, which common implementations of lightweight threads do not need.

When you presented this proposal six months ago, it was really great. You provided a lot of great ideas and garnered a lot of interest for an unconventional yet useful feature. That sparked a lot of productive conversations within the CG. Those conversations explored the design space, identifying weaknesses in the specifics you presented and then searching for alterations and extensions that would address those weaknesses. That is exactly what the effect of a good presentation should be. But this design has barely changed to incorporate those many related conversations within the group over the months since.

This design seems especially well suited to implement first class stack-full asymmetric co-routines, as it maps relative 1:1 on its implementation needs.

One place where it is more general than such a co-routine feature is the use of event tags, which essentially allow one to yield not just from the current co-routine, but from any of the parents, presumably suspending a chain of co-routines in the process. It is not quite clear to me when this is useful, and I could imagine it having an implementation cost. I guess this gives us the flexibility of continuations where each stack frame has its own lifetime.

The machinery also seems quite heavy-weight for typical uses of "generators", which tend to function more like fancy for-loops than anything else. There would have to be some kind of guarantee that it is easy to collapse this functionality (no dynamic allocations, re-use of the same stack etc etc) for this to be feasible. Leaving this up to the cleverness of the engine makes it useless.

That holds even more if this is to be useful for stack inspection like @RossTate mentions. Even if we can specify that in simple cases these instructions translate to something close to no-ops, it seems like it would generate a lot bigger binaries than the simpler stack inspection proposal. To me, the whole point of having something like a stack inspection feature is that it is more efficient than a shadow stack, so if there's even a small chance that it isn't (e.g. requires a dynamic allocation) then it will not be used.

So as it stands, this feature seem limited in its applicability, unless its generality can be subdivided in semantically more restrictive uses with different implementation requirements. That is counter-intuitive maybe: the more general a feature is, the more things it can represent. But potentially the less features it can represent efficiently, which is what matters the most for Wasm, I feel.

Another consideration is how to inspect (efficiency aside) the delimited continuations. For example, a garbage-collecting lightweight-thread manager needs to collect (and later update) the roots across all the lightweight threads. Given a particular cont representing a yielded lightweight thread, it's not clear how the thread manager can reasonably collect the cont's roots using this approach.

I think, for fundamental performance/implementation reasons, we should expect that there are two features we need:

  1. the ability for wasm to cheaply create, suspend and resume a lightweight stack (green thread)
  2. the ability for wasm to iterate over and inspect (and possibly mutate) opt-in locals of frames on the stack without having had to significantly change the representation of the stack

I think this proposal speaks to 1, whereas stack iteration proposals speak to 2. They should obviously be designed to compose, but as far as I can tell from an perf/impl POV, I don't think it's practical to assume that one proposal is going to satisfy all our use cases and achieve all of our goals here. I also think the relative priorities of these features are different: there are several high-priority use cases for 1 whereas at least one wasm engine has specifically voiced an intention to delay 2 until we have wasm GC in place.

Yeah, a common thread across feedback we got for #1340 was to prioritize first-class stacks over stack inspection. My presentation last week was to gauge whether stack inspection should still be a feature in the pipeline, and based on online and offline feedback the answer seems to be likely yes (provided it comes out after GC). This is important to know for designing first-class stacks because, just as stack inspection can be combined with unwinding to support two-phase exception handling, stack inspection can be combined with (restricted) detaching or redirecting to support (one-shot) delimited continuations. So if you want a "unifying view on stack-related mechanisms", you need to take stack inspection into account, which this proposal has not.

@RossTate:

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

Yes, and the same is true for every other suggestion that has been made for directly supporting these features.

No, it is not. It is one of the main innovation of effect handlers, and the reason why PL folks got excited about them, that they are a compositional mechanism for defining effects, including control effects.

Your proposal is to allocate a new stack at every point someone would want to perform such inspection.

I have not been proposing that. The only thing I have mentioned is that effect handlers _can_ express this form of stack inspection (and they immediately explain how that would interact with other control features). But obviously, they would only suitable as an actual strategy if we were able to reliably optimise the special case I mentioned -- which may or may not be the case in this setup. If not, there needs to be a more explicit alternative that can be.

There's a bug though. cont.forward also needs to specify a (handler) label and a list of events.

Why? I don't think it does.

Oh, and another (unrelated) bug: cont.throw needs to consider that the continuation might catch the thrown exception. So it has an output type

Ah thanks, you're right, I mixed up signatures. Fixed.

and it needs to specify a (handler) label and a list of events.

We could do that. But it's not clear whether it would be all that useful. The idea is that cont.throw is used to "destroy" a stack, when it is not supposed to be completed anymore. At least with the use cases listed, it shouldn't throw back to the handler anymore in that situation. But maybe there are some odd use cases? This is one of the details that needs figuring out.

This assumes a particular implementation strategy and infrastructure, essentially baking in decisions that would often be made at the application level

It doesn't assume, it just points out one possible implementation strategy. It may not even be the best one, as fast implementations have used other strategies.

When you presented this proposal six months ago, it was really great. You provided a lot of great ideas and garnered a lot of interest for an unconventional yet useful feature. That sparked a lot of productive conversations within the CG. Those conversations explored the design space, identifying weaknesses in the specifics you presented and then searching for alterations and extensions that would address those weaknesses. That is exactly what the effect of a good presentation should be. But this design has barely changed to incorporate those many related conversations within the group over the months since.

I honestly don't know what you mean. AFAICT, there hasn't been much discussion of this proposal. There have been various discussions about various ideas, and occasionally I pointed out the connection. Correct me if I'm wrong, but none of these recent ideas has been presented as a substitute for this one, so its utility still stands. Whether it is sufficient to also subsume some of the others (efficiently) is a separate discussion to be had.

So if you want a "unifying view on stack-related mechanisms", you need to take stack inspection into account, which this proposal has not.

Yes, well, the nice thing about this proposal is that it _can_ express it, even if not efficiently, but thereby at least suggesting a canonical semantics for integrating such a feature.

@aardappel:

This design seems especially well suited to implement first class stack-full asymmetric co-routines, as it maps relative 1:1 on its implementation needs.

Ah, it may look like it from first glance, but it's not at all specific to or optimised for co-routines. Its heritage is (algebraic) effect handlers, which are a principled way of expressing and composing almost any sort of state or control effect.

One place where it is more general than such a co-routine feature is the use of event tags, which essentially allow one to yield not just from the current co-routine, but from any of the parents, presumably suspending a chain of co-routines in the process. It is not quite clear to me when this is useful, and I could imagine it having an implementation cost. I guess this gives us the flexibility of continuations where each stack frame has its own lifetime.

Ah, the ability to nest resumption is what makes this mechanism composable. It is what allows each control abstraction to be implemented separately, e.g., you can run an async function in a green thread and still yield to the scheduler without worrying about intermediate handlers. That significantly simplifies implementations, as well as enabling the use of control abstractions in a heterogenous setting with multiple interacting languages.

The machinery also seems quite heavy-weight for typical uses of "generators", which tend to function more like fancy for-loops than anything else. There would have to be some kind of guarantee that it is easy to collapse this functionality (no dynamic allocations, re-use of the same stack etc etc) for this to be feasible. Leaving this up to the cleverness of the engine makes it useless.

It is much less heavyweight than you might think. We have reason to believe that implementing generators this way can be just as or even more efficient on average than the common approach of constructing a state machine, which often involves additional local control flow to reenter the previous state and requires copying live locals to the heap at every yield, or allocating them there in the first place. Plus, it immediately allows deep generators (yielding from a nested call) without any additional overhead.

The design outlined above is similar to that of effect handlers in Multicore OCaml. In practice, the overhead of implementing a range of abstractions from generators to lightweight threads to async/await on top of effect handlers in Multicore OCaml is close to zero.

Another consideration is how to inspect (efficiency aside) the delimited continuations. For example, a garbage-collecting lightweight-thread manager needs to collect (and later update) the roots across all the lightweight threads. Given a particular cont representing a yielded lightweight thread, it's not clear how the thread manager can reasonably collect the cont's roots using this approach.

There is a standard way of implementing this which Multicore OCaml does and so goes Go, GHC Haskell, Fibers in OpenJDK and other languages which use lightweight threads. Essentially, the continuations (blocked & ready to run goroutines in Go, blocked lightweight threads in GHC) become a distinguished object in the GC, and the GC knows how to scan the stack associated with the continuation. The suspended continuations are marked as regular objects would be. For any suspended continuation reachable from the GC roots, the GC scans the stack associated with the continuation in the same way it would scan the current program stack.

@kayceesrk That comment was meant to be in the context of modules that implement their own garbage collector in linear memory, though it's my fault for not making that clear. So my point was that a module-implemented garbage collector would not be able to perform the scans you describe.

It is one of the main innovation of effect handlers, and the reason why PL folks got excited about them, that they are a compositional mechanism for defining effects, including control effects.

Effect handlers are a surface-level language feature. Depending on specifics of the handlers, they can be compiled in a variety of ways. While continuations are always a viable option for any handler, they are generally by far the least efficient option. Even the body of work you are basing this on has significant discussion of analyses and optimizations for identity common special cases and eliminating, but many of those optimized forms are not expressible in this proposal. So what's problematic about this proposal is that it's based on high-level concepts but does not provide the corresponding low-level primitives that those concepts are often implemented with.

Correct me if I'm wrong, but none of these recent ideas has been presented as a substitute for this one, so its utility still stands. Whether it is sufficient to also subsume some of the others (efficiently) is a separate discussion to be had.

WebAssembly/exception-handling#105 was written up to describe these low-level primitives. You were explicitly informed here that it could express algebraic effects, but you never engaged in that conversation. #1340 was a Phase 0 proposal to explore those ideas. But in that conversation concerns were raised about security, prioritization, efficiency, and complexity. The presentation I gave last week was following up on these concerns regarding stack inspection, which we needed to gauge in order to determine how best to address the concerns regarding first-class stacks.

The only thing I have mentioned is that effect handlers can express this form of stack inspection (and they immediately explain how that would interact with other control features). But obviously, they would only suitable as an actual strategy if we were able to reliably optimise the special case I mentioned -- which may or may not be the case in this setup. If not, there needs to be a more explicit alternative that can be.

This more explicit alternative has already been discussed a fair amount. It is stack inspection, e.g. #1340 and #1356. Notice that your cont.suspend instruction has the same type as call_stack in #1356. The difference is that call_stack does not require an answer to perform stack allocation just to find out basic information or make basic updates to the current stack. But if you combine it with stack-allocation and stack-redirection primitives, an answer can choose to use continuations if it needs to in order to support the surface-level feature at hand, as we show in #1360. That is, cont.suspect breaks down into a bunch of smaller primitive operations and is not a primitive operation itself.

So this proposal is two iterations behind others that were heavily inspired by it, but which have since broken its concepts down into smaller parts that can be combined more flexibly, and which have already developed answers to many of the open questions in this proposal.

Will this proposal be compatible with the changes I proposed in WebAssembly/exception-handling#123, namely, removal of exnref, introduction of catch_br, and split of catch and unwind?

@aheejin From what I can tell, it should be compatible with those changes.

soil-initiative/stacks#10 now has two translations of this proposal in terms of #1360, one using stack inspection and one using thread-local storage (assuming such a thing exists in the host and is made accessible to WebAssembly). Each implementation strategy has its strengths and weaknesses, and different effects might be better expressed with one or with the other (the translations compose so that different effects can in fact make different choices even within the same application). One of the problems with the proposal here is that it does not let the application pick its own implementation strategy; instead it bakes in a high-level feature and leaves it to the engine to decide/guess which strategy to use, much like how the JVM bakes in interface-method dispatch.

The green threads scheduler example shown by @rossberg at the 2020-08-04 meeting appears with the addition of exception handling to catch exception from the cont.resume to be sufficient to implement a scheduler for Lumen (Erlang compiler/runtime).

On x86_64 where we have native stack switching we add yield points (here encoded as calls to @__lumen_builtin_yield()) when each Erlang process (green thread) has used up its time slice (encoded as @CURRENT_REDUCTION_COUNT reaching a static limit)

init.erl

-module(init).
-export([start/0]).
-import(erlang, [display/1]).

start() ->
  display(atom).

init.llvm.mlir

module @init {
  llvm.func @"erlang:display/1"(!llvm.i64) -> !llvm.i64
  llvm.func @__lumen_builtin_yield()
  llvm.mlir.global external local_exec @CURRENT_REDUCTION_COUNT() : !llvm.i32
  llvm.func @lumen_eh_personality(...) -> !llvm.i32
  llvm.func @"init:start/0"() -> !llvm.i64 attributes {personality = @lumen_eh_personality} {
    %0 = llvm.mlir.constant(20 : i32) : !llvm.i32
    %1 = llvm.mlir.addressof @CURRENT_REDUCTION_COUNT : !llvm<"i32*">
    %2 = llvm.load %1 : !llvm<"i32*">
    %3 = llvm.icmp "uge" %2, %0 : !llvm.i32
    llvm.cond_br %3, ^bb1, ^bb2
  ^bb1: // pred: ^bb0
    llvm.call @__lumen_builtin_yield() : () -> ()
    llvm.br ^bb2
  ^bb2: // 2 preds: ^bb0, ^bb1
    %4 = llvm.mlir.constant(562949953421378 : i64) : !llvm.i64
    %5 = llvm.mlir.addressof @CURRENT_REDUCTION_COUNT : !llvm<"i32*">
    %6 = llvm.mlir.constant(1 : i32) : !llvm.i32
    %7 = llvm.atomicrmw add %5, %6 monotonic  : !llvm.i32
    %8 = llvm.call @"erlang:display/1"(%4) {tail} : (!llvm.i64) -> !llvm.i64
    llvm.return %8 : !llvm.i64
  }
}

On the BEAM for Erlang, the schedulers can do work stealing. How continuations are stored was hand-waved in the presentation. @rossberg will the continuations be stealable/transferable between Web Workers to allow for a scheduler per Web Worker (and one on the main UI thread) to work steal?

Updated OP to match the version I presented today, removing separate evtref type and br_on_evt instruction. Also removed cont.forward instruction, since it may not be needed and its semantics seems to be confusing.

@kayceesrk Thanks for the talk! It was very clear and well presented.

In your talk, you showed some assembly code on how you implemented algebraic effects in Multicore OCaml. I feel like a goal of a WebAssembly design for stacks should be to enable WebAssembly programs to essentially write those assembly-level operations themselves, provided the functionality can be provided in a way that is similarly composable. Based on your experience with Multicore OCaml, what would be the key primitives for enabling that?

In the previous CG meeting it was suggested that these continuations would be used for effects that would have a handler in roughly every stack frame. The implementation strategy you presented involved a linear walk through effect handlers, which suggests that having frequent effect handlers would affect performance. Have you run experiments in the context of such effects? If not and you would like a concrete example to try, make a pass through the source code so that every function call has a handler indicating what file and line number the function call is on (or was on in the original source code) so that effect handlers can be used to implement stack tracing (turning off any optimizations you have to eliminate continuation allocations). I'd be interested to hear what the relative performance of the original and the modified programs are.

@rossberg Those were much appreciated improvements!

@rossberg will the continuations be stealable/transferable between Web Workers to allow for a scheduler per Web Worker (and one on the main UI thread) to work steal?

I'm guessing that continuations will not be accessible across threads in the first version, regardless of which stack switching proposal we go with.

Yeah, I suspect that we won't even be able to pass simple data references between threads for quite some time to come, let alone functions or stacks. The hurdle isn't so much semantics but implementations: most web engines will have a very tough time retrofitting any of that. Hope we get there eventually, though.

@KronicDeth I see that Lumen uses LLVM. Does it use the upstream LLVM backend for WebAssembly? If so, what kind of support would you expect the backend to provide for stack switching (independent of which proposal we use)? In other words, how would you expect to use the stack switching mechanism from LLVM? How does Lumen's stack switching work today, if at all?

@KronicDeth I see that Lumen uses LLVM. Does it use the upstream LLVM backend for WebAssembly?

We have fork of LLVM and we're keeping up-to-date with master with our changes here.

If so, what kind of support would you expect the backend to provide for stack switching (independent of which proposal we use)? In other words, how would you expect to use the stack switching mechanism from LLVM?

Something that can go in a similar place to where we call the __lumen_builtin_yield would be nicest as it means no divergence between the x86_64 and WASM switching code. @bitwalker do you have any other preferences for the stack switching in WASM in LLVM IR or LLVM MLIR?

How does Lumen's stack switching work today, if at all?

__lumen_builtin_yield calls a bunch of stuff to deal with the scheduler's queues ([1], [2]), but for the actual swapping it's inline assembly. This is all for the x86_64 target.

For the WASM target we don't have the compiler yet, we have the runtime, but the scheduler and switching is still in hand-written Rust.

Thanks for the great pointers, @KronicDeth!

If I'm reading your current implementation of __lumen_builtin_yield correctly, it seems like it works by maintaining a thread-local list of processes (which seem to be a store of the register states, including stack and code pointers), getting the next ready process from that list, and then switching the registers (including stack and code pointers) after some basic bookkeeping. Is that an accurate summary?

If it is, then that is precisely the implementation strategy we provide here for lightweight-thread managers. For example, in contrast to the proposal here, we guarantee that your __lumen_builtin_yield performs in constant time, regardless of how large the old or new stack is or how many effect handlers they have (ignoring the time it takes for the scheduler to determine which is the next ready process).

Edit: Oh actually, there is a slight difference. In our example, we switch to the manager's stack, which then determines the next process to run and switches to that stack. Your implementation finds the next process on the yielding thread's stack, rather than having the middle switch to the manager. Of course, it's easy to modify our example to employ that optimization. On the other hand, the proposal here will always require two stack switches (and a stack walk).

@kayceesrk I added a translation of this design to soil-initiative/stacks#10 using Multicore OCaml's implementation strategy that you implemented today. I'm interested to hear if you think it accurately captures what you described. I have no idea how fluent you are in WebAssembly, especially regarding the relevant young/pseudo-proposals, so I'd be happy to meet with you and walk through the translation with you.

Outside of that, one thing I should note is that your implementation strategy assumes it is always being possible to quickly determine the address of the root of a stack based on the address of its current leaf. This is a perfectly plausible assumption; I'm just making it explicit since it does impose an implementation constraint. Furthermore, this assumption has other useful implications on the design space. In particular, it enables a variant of stack inspection that is particularly useful for your implementation strategy, which is what I use in soil-initiative/stacks#10.

@rossberg great to see this moving. One minor thing: shouldn't it be

It must not be possible to create cyclic references to stacks, since that would require GC.

?

@b-studios, oops, yes, fixed. Thanks!

Thanks, @RossTate.

In the previous CG meeting it was suggested that these continuations would be used for effects that would have a handler in roughly every stack frame.

This would not be sensible. Effect handlers should not be used for stack tracing. Stack inspection and stack switching are orthogonal features. I am failing to understand why one _requires_ the other or why use one to _simulate_ the other. As I had presented in the talk, effect handler implementation does not inspect every stack frame. We directly fetch the parent stack in constant time. Compare this to exceptions. If I were to implement exceptions, unwinding one stack frame at a time to reach the exception handler would lead to a terrible implementation. My honest take on it is that stack inspection has nothing to do with stack switching and should be implemented as independent features.

Realistic examples of handlers that we've seen in practice are never very deep. There is usually a handler at the bottom of the driving the scheduler, and perhaps one or two handlers near the top for supporting generators and other local control inversions. Hence, linearly searching through the stack of handlers is not a problem. Moreover, the refined proposal presented yesterday removes instructions for tag test. Unlike OCaml which allows arbitrary pattern matching on payloads, the tag tests can be implemented directly in the engines. The engines can just maintain a stack of current tags and the labels to branch to to find the matching tag and branch to it. I understand that there is some reservation against advocating an implementation policy. But in our experience, for real-world examples, where the number of handlers is few, linear search will work just as well. In the few years, I've been working on delimited continuations/effect handlers, I am yet to see a _useful_ example where there are many instances of the same handler on the stack.

I feel like a goal of a WebAssembly design for stacks should be to enable WebAssembly programs to essentially write those assembly-level operations themselves, provided the functionality can be provided in a way that is similarly composable. Based on your experience with Multicore OCaml, what would be the key primitives for enabling that?

I should preface my response by saying that I am still developing the taste for the right level of abstraction in WebAssembly. You could possibly get the same dynamic semantics by introducing

  1. stack-local storage (to store the parent pointers, table of events and the corresponding labels to branch to),
  2. the ability to prepend computation to a suspended stack (so that the body of the matching handler can be evaluated on the parent stack, something like MLton.Thread.prepend [1]), and
  3. a mechanism to switch between stacks.

One advantage of doing this allows the handler function where (a) the continuation does not escape and (b) is resumed at the tail position can be evaluated on top of the current stack, which is performing the effect. If all of the events/effects associated with a handler have the properties (a) and (b), then we could avoid allocating a separate stack for running the computation. The benefit of tail resumption might outweigh the difficulty of identifying that (a) and (b) holds + the complexity of implementing this optimization; @daanx may have evidence for the benefits of tail resumption.

That said, I have two reservations against this proposal. Firstly, I don't know whether we can get the same static semantics as effect handlers (which is an interesting question nevertheless). I do care about static semantics of handlers as they are the least surprising among all of the delimited control operators. Given that there is interest in establishing formal foundations of WebAssembly [2], which includes both compilation to and of WebAssembly, having strong static semantics for delimited continuations give you the ability to reason about any number of things you might want to do with delimited continuations. @slindley, @daanx and @matijapretnar know a lot more about this than I do.

Secondly, effect handlers are known to be general enough to describe all the other control operators and useful high-level paradigms (@slindley and @matijapretnar have a paper on this). So what is the benefit of low-level primitives if every useful thing that you want to do can be expressed with effect handlers? Are there useful things that you can do with these low-level primitives that you cannot do with effect handlers. To answer my own question, stack-local storage may be useful to store thread-ids for example.

[1] http://www.mlton.org/MLtonThread
[2] https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=21012

You could possibly get the same dynamic semantics by introducing

  1. stack-local storage (to store the parent pointers, table of events and the corresponding labels to branch to),
  2. the ability to prepend computation to a suspended stack (so that the body of the matching handler can be evaluated on the parent stack, something like MLton.Thread.prepend [1]), and
  3. a mechanism to switch between stacks.

In addition, (1) would depend on the presence of both first-class labels and (2) on some form of closures. Furthermore, it's not clear how to make any of this type-safe.

For similar reasons, Wasm already encapsulates other forms of stack-related mechanisms in higher-than-assembly-level ways (functions, exceptions). It is only natural that a mechanism for stack switching would operate on a comparable level. The proposed mechanism does so with a minimal instruction set and surface semantics.

@kayceesrk Thanks for your thorough post! It has a lot of useful perspective and productive discussion points.

My honest take on it is that stack inspection has nothing to do with stack switching and should be implemented as independent features.

Agreed. However, their designs cannot be completely independent. Stack inspection generally wants to walk through the continuation chain. This raises a number of questions on how the two features interact, on how they might cause for each other, and on how they might facilitate each other. I can digress into various examples, but the high-level concern is that this proposal has not taken this into consideration.

I should preface my response by saying that I am still developing the taste for the right level of abstraction in WebAssembly.

I can totally relate to this, having been there myself (and even still getting this wrong sometimes despite much practice). It is, however, the crux of the issue with this proposal.

I don't think I've seen any concerns raised over the high-level concepts of this proposal. That is, I don't see any objections to supporting algebraic effects or to the composability properties of algebraic effects. The concerns are with the low-level design of the proposal.

Secondly, effect handlers are known to be general enough to describe all the other control operators and useful high-level paradigms (@slindley and @matijapretnar have a paper on this). So what is the benefit of low-level primitives if every useful thing that you want to do can be expressed with effect handlers?

Being able to express high-level features in this proposal does not make this proposal the right fit for each of those features. You can express OCaml in the JVM or .NET, but we both know that it's not a great fit. Their designs combine operations that OCaml would prefer were separated, and they do not provide operations that OCaml would like to have. This proposal prescribes a particular paradigm and thus has many of the same problems.

For example, stack.suspend is two operations combined into one: find the handler and switch the stack. As we've already discussed, stack tracing is expressible in this proposal, but it is not a good fit. However, if the "find the handler" operation were separated out of stack.suspend into its own primitive, i.e. stack inspection, then it can be implemented well. You also mentioned having a scheduler towards the root of the stack. This implementation technique for lightweight threads is known as a trampoline. Its downside is that it involves two stack switches (and a stack walk) to transfer control from one thread to the next. If you separate out the "switch the stack" part of stack.suspend, then a implementation of lightweight threads using a predetermined scheduler can have the yielding thread simply run the scheduler code itself to find the next thread and then switch directly to that thread's stack. This indeed seems to be how Lumen is implemented.

The benefit of tail resumption might outweigh the difficulty of identifying that (a) and (b) holds + the complexity of implementing this optimization; @daanx may have evidence for the benefits of tail resumption.

This presents another example of this proposal being at the wrong level of abstraction for WebAssembly. @daanx has no need to convince me of the benefits of this optimization that applies to many handlers; I've already read the relevant text. But this proposal does not actually support this optimization!

With WebAssembly, the expectation is that the engine does a fairly straightforward compilation of the instructions. It will generally perform some local optimizations, but it is not expected to do any large scale reasoning. In particular, a WebAssembly engine is certainly not going to check if (a) and (b) hold and then perform this optimization. Instead, the expectation is that the application will perform that reasoning and generate the WebAssembly instructions that directly execute that optimized code. The problem is that those optimized instructions are not expressible in this design. So if you or @daanx believe this optimization is useful for your purposes, then you should be made aware that this proposal does not actually suit your own purposes.

Are there useful things that you can do with these low-level primitives that you cannot do with effect handlers.

It is useful to give the application compiling to WebAssembly the freedom to specify its implementation in low-level terms. This enables the application to incorporate application-specific optimizations, and it helps the application (or compiler) be able to reasonably predict trade-offs in implementation strategies. It also alleviates WebAssembly from having to anticipate or add extensions for supporting the various optimizations that applications would be able to perform themselves had they had lower-level control. For example, the design in #1360 has been able to support each of the many implementation strategies that have been discussed for this proposal without change. It is even able to support the tail-resumption optimization discussed above. (It's just a minor variation on the header-based case study in soil-initiative/stacks#10.)

Of course, too much freedom can be problematic. But the design in #1360 is sound and has the compositionality properties of algebraic effects while also being lower level. It has those compositionality properties because its design actually started from this proposal (and then went through many iterations of lowering while preserving these properties).

stack inspection and effect handlers

Agreed. However, their designs cannot be completely independent. Stack inspection generally wants to walk through the continuation chain. This raises a number of questions on how the two features interact, on how they might cause for each other, and on how they might facilitate each other. I can digress into various examples, but the high-level concern is that this proposal has not taken this into consideration.

It is true that there is an interaction that needs to be sorted out, but the interaction itself is quite simple in my mind. Rather than your call stack being a single contiguous stack of frames, with effect handlers, it becomes a stack of stacks. The stack inspection facility can easily be informed of the new stack layout. Here are two pieces of evidence from Multicore OCaml.

For example, stock (upstream) OCaml uses the system stack for OCaml functions and the call stack contains both OCaml frames and C frames. OCaml has a single stack inspection facility which walks each of the OCaml frames to fetch the roots in that frame and skip over a sequence of C frames where roots are not found. In Multicore OCaml, we had to extend this stack inspection facility to handle stacks of stacks introduced due to effect handlers. All we had to do is to extend the stack inspection facility to say "once you reach the end of the stack during stack inspection, fetch the parent stack and continue inspecting the parent stack". This is a modular change and does not affect how the stack inspection on any given frame works.

In fact, we have evidence that you can a generic stack inspection facility such as libunwind can be informed of the new stack layout in Multicore OCaml thanks to the expressibility of DWARF. Our DWARF description for stacks describes how to find the parent stack and continue inspection. Here is an example of gdb using it to get the backtrace of a program with handlers: https://github.com/ocamllabs/ocaml-effects-tutorial#31-examining-effect-handlers-through-gdb.

The experience of rewriting the OCaml's own stack inspection facility to handle stacks of stacks and having described the stack layout in a generic setting like DWARF leads me to believe that any stack inspection proposal can be made aware of effect handlers with minimal effort.

@kayceesrk I added a translation of this design to soil-initiative/stacks#10 using Multicore OCaml's implementation strategy that you implemented today. I'm interested to hear if you think it accurately captures what you described. I have no idea how fluent you are in WebAssembly, especially regarding the relevant young/pseudo-proposals, so I'd be happy to meet with you and walk through the translation with you.

Thanks. This would be helpful for me to understand #1360 in detail. The rest of the week is a little busy for me. I'll go through the design early next week. I am not fluent in WebAssembly, and may need your help in digesting the encoding.

I thought @lukewagner had captured the essence of both #1359 and #1360 well in https://github.com/WebAssembly/design/issues/1359#issuecomment-666583483. If I am reading this right, this proposal keeps stack inspection orthogonal to stack switching, whereas #1360 makes it inter-dependent. Multicore OCaml experience tells us that the interaction between stack inspection and switching can be modularly described without having to significantly change the stack layout. Moreover, the low-level API that I had sketched out here https://github.com/WebAssembly/design/issues/1359#issuecomment-669203944, requires features such as first-class labels and closures, which are currently unavailable in WebAssembly today.

Making the high-priority feature such as stack switching depend on many other features (stack inspection, stack-local state, first-class labels, closures) whose scope is much wider than stack switching seems unwise to me. For example, I could use the same arguments against effect handlers to say that exceptions are too high-level. Rather than have the exceptions as they are today, can't we argue that exceptions should be implemented with other features such as stack inspection (to find where the exception handler is on the stack), stack-local state (to map the stack of exception tags in the dynamic scope to the target labels) and first-class labels (to build the table in stack-local state)?

I am not fluent in WebAssembly, and may need your help in digesting the encoding.

Happy to help. Feel free to shoot me an e-mail to coordinate a time.

It is true that there is an interaction that needs to be sorted out, but the interaction itself is quite simple in my mind. Rather than your call stack being a single contiguous stack of frames, with effect handlers, it becomes a stack of stacks. The stack inspection facility can easily be informed of the new stack layout.

Yes, this is the starting point. But this starting point has yet to consider ways in which the two features can cause problems for each other or can facilitate each other.

As for problems, various applications of stack inspection need to be able to execute application-specific code during the process, and it is important that this code semantically be considered to be executed at the leaf of the stack (chain) even if the stack inspection is currently examining a frame in a stack further up the chain. (This is also important for composability.) So what should happen if this code were to suspend the continuation, and what if the relevant handler is somewhere in the stack chain between its leaf and the current frame being inspected? Detaching the continuation at that point would be unsound. With a high-level design like in this proposal, we have to prescribe a universal policy for what to do in this situation. In a low-level design, the application is given enough control to determine its own policy.

As for facilitating, in your research y'all mention various optimizations showing how to eliminate allocation of continuations for certain common classes of handlers. If you codesign stack inspection and first-class stacks, then you can use stack inspection to find effect handlers. In the optimizable cases, a particular effect handler can be implemented as a simple stack-inspection "answer". In the unoptimizable cases, the stack-inspection "answer" can take care of detaching the (previously allocated/attached) continuation. (And using the eager variant of stack inspection described in soil-initiative/stacks#10, you can avoid the need to walk through the stack frame by frame on engines that have the ability to easily access the header of a stack from any leaf point.)

Making the high-priority feature such as stack switching depend on many other features (stack inspection, stack-local state, first-class labels, closures) whose scope is much wider than stack switching seems unwise to me.

Yes, which was why we had developed a minified version of stack inspection that would be just enough to support its role in first-class stacks while also forwards compatible with a future more advanced development of stack inspection. For context, we were in the middle of presenting this minified version to get feedback from the CG when it was claimed that stack inspection should be supported through a more unified way, and then this proposal was provided. That context might explain the initial reactions to this proposal, and why it was so helpful to focusing the discussion when you, an advocate of the proposal, stated that you did not believe it should be used for such purposes.

  • I'm not sure how cont.throw works. Is it used from within a continuation that should be stopped, such as cont.suspend, or from outside, such as cont.resume? I am assuming the latter, and if that's correct, what is the effect of the instruction? Is it considered as a throw, so unless we want to propagate the exception up to the top frame and crash, should we wrap it with try~catch, like
  cont.throw $some_continuation
catch
  do nothing, or do some wrap up
end

?
A code example using this instruction will be helpful.

  • Now that this proposal doesn't have evtref either, I was wondering if the EH proposal can be more like this in syntax, meaning, we don't use exnref, and like cont.resume, a single catch takes multiple events and corresponding labels to jump. This might not be the scope of this issue, but if you have any opinions, I'd appreciate that.

@aheejin, cont.throw is a sibling to cont.resume, but instead of resuming with a value it resumes with an exception at the point where the prior suspend occurred. If there is no exception handler on that stack that would catch this exception, then it will indeed propagate all the way back to the cont.throw and beyond.

Consider:

(event $evt)
(exception $exn)

(func $f1
  (cont.suspend $evt)
)

(func $main
  (local $c1 (cont))
  (local.set $c1 (cont.new (ref.func $f1)))
  (try
    (do
      (block $h (result (cont))
        (cont.resume (event $evt $h) (local.get $c1))
        (unreachable)  ;; we never get back here
      )
      (cont.throw $exn)  ;; cont is on stack; this throws
      (unreachable)  ;; we never get back here either
    )
    (catch (nop))  ;; instead, we get here
  )
)

Compare to:

(func $f2
  (try
    (do (cont.suspend $evt))
    (catch (return))  ;; swallow exception
  )
)

(func $main
  (local $c1 (cont))
  (local.set $c2 (cont.new (ref.func $f2)))
  (try
    (do
      (block $h (result (cont))
        (con.resume (event $evt $h) (local.get $c2))
        (unreachable)  ;; we never get back here
      )
      (cont.throw $exn)  ;; cont is on stack; continuation catches the exception before returning
      (unreachable)  ;; so we get back here
    )
    (catch (nop))  ;; but never here
  )
)

Now that this proposal doesn't have evtref either, I was wondering if the EH proposal can be more like this in syntax, meaning, we don't use exnref, and like cont.resume, a single catch takes multiple events and corresponding labels to jump. This might not be the scope of this issue, but if you have any opinions, I'd appreciate that.

Yes, I agree that if we go with this proposal then it would make sense to harmonise it with exceptions. In particular, catch could (optionally) be given a list of handlers similar to resume. One question, though, is how to model catch-all handlers and rethrow that way.

@kayceesrk @daanx I noticed that both your works use deep semantics, but this proposal uses shallow semantics. I know that deep semantics can be encoded through shallow semantics, but their direct implementation is more efficient than this encoding. Is there a reason why you do not want direct support for deep semantics? (Note that having such support does not preclude direct support for shallow semantics; a good low-level design should be able to directly support both of these through simpler primitives. #1390, for example, can directly support both while at the same time ensuring compositionality, e.g. one effect's semantics is deep and another's is shallow without requiring any coordination between the two.)

@RossTate, the proposed mechanism is more low-level than conventional effect handlers. In particular, stack creation is separated from installing a handler. That way, we believe it can encode both shallow and deep handlers efficiently, unlike either of the other two.

The straightforward low-level implementation of shallow handlers is to allocate the stack for the handlee/child without the handler code on the stack and then to specify the handling code in the current/parent stack. That's exactly what the current proposal does. To encode deep handlers with shallow handlers, you make sure to always run the continuation with the relevant handler around it, which again is exactly how the current proposal indirectly supports deep handlers. The current proposal isn't particularly lower-level; it's baked in specifically shallow handlers, but it happens to do so in WebAssembly.

On the other hand, direct support for deep handlers would permit the handling code be part of (the root of) the allocated stack.

My understanding regarding high-level effect handlers is the following, @slindley, @dhil and @kayceesrk please correct me if i'm wrong. See Daniel & Sam's paper for more details.

  • In their high-level form, shallow handlers and deep handlers have different performance trade-offs and neither approach beats the other -- deep is a fold, shallow is a case. Either one fares better on a certain set of abstractions.

  • Expressing deep handlers in terms of shallow handlers is fairly simple: you merely need to manually reenter the handler in a loop. The reason that can be expensive is mainly because it will naively create an additional stack for every resumption. -- But that is not the case in the low-level mechanism we propose, because we make stack creation a separate operation, such that the same stack can be reused. The only cost is for switching the stack & handler, which can be cheap, as @kayceesrk's numbers showed.

  • Expressing shallow handlers in terms of deep handlers is more involved, because you have to express a case in terms of a fold. So you have to jump through some hoops to break out of the inherent recursion. That would remain somewhat clumsy and expensive, even in a low-level mechanism like the one proposed.

  • With regard to the lower-level mechanism we propose, the difference between the two approaches only materialises in the fact that resume can specify a new handler, which is more flexible. Because stack creation is separated, it can express both high-level flavours without much overhead.

The only cost is for switching the stack & handler, which can be cheap, as @kayceesrk's numbers showed.

If Multicore OCaml used shallow handlers, then the only additional cost would be a single additional store to update the handler function slot in the stack at resumption. Multicore OCaml uses deep handlers since that is what you want almost all of the time for expressing concurrency (async i/o, generators). Choosing deep handlers leads to succinct high-level programs.

Okay, so the answer seems to be that you opted not to directly support deep handlers because you were willing to accept the overhead of the encoding into shallow handlers.

But that is not the case in the low-level mechanism we propose, because we make stack creation a separate operation, such that the same stack can be reused.

Yes, and there's a low-level mechanism for deep handlers that optimizes for deep handlers but enables efficient encoding of shallow handlers. It sounds like this is a better fit for common applications of effect handlers. Such an alternative is also a better fit for languages that allow the prompt/tag to be determined dynamically. It's also easier to extend with optimizations for tail resumption and continuation elimination.

Why was such an option not considered?

You seem to be making a lot of claims about an alternative design. Do you have any empirical evidence for any of them?

Well no one has any empirical evidence for any of these designs. But I can say that #1360 supports optimizations for tail-resumptive effect handlers—meaning that it can express the more optimized lowered code that tail-resumptive effect handlers permit—and as such it can eliminate the 226 stack switches in @kayceesrk's example for generators depending on how the generated values are consumed. In particular, generators are very often used by for each constructs, which are encodable as tail-resumptive effect handlers, and which consequently #1360 can implement without stack switching. In the less common case, though, where the generated values are enumerated through successive stateful function calls, #1360 can fall back on the less efficient implementation using a stack. Importantly, the functions generating values (say through yield) do not need to know which context they are executing within.

The trick is that #1360 breaks down algebraic effects into lower-level components, namely stack inspection and stack creation/switching (without a trampoline). The "perform operation" step translates to an (eager) stack inspection (call_stack), looking for the most immediate executable stack mark (answer) for that operation and invoking it, giving it direct access to the stack frame it marked. In an unoptimized implementation of deep handlers, this stack mark sits at the root of a stack the handler and handlee were run on, and it has a pointer to the parent stack. In this case, the body of the executable stack mark says to switch control to the parent stack, and then when it's switched back to with some value, it "answers" the stack inspection with whatever value it received (after storing the new parent). But in an optimized implementation of a tail-resumptive handler, there is no need to allocate a new stack and instead the current stack is marked, and the mark simply executes the handler code on the same stack and then "answers" with whatever the handler code would have resumed the continuation with—no stack allocation or stack switching necessary.

1360 also allows optimizations for more mixed patterns. For example, some effect handlers are tail-resumptive on some control-flow paths but not all. In these cases, #1360 lets you allocate a new stack in case the continuation escapes the handler. But #1360 also lets the stack mark at the root of the allocated stack execute the handler before any stack switch is performed. If control goes down a tail-resumptive branch, then the mark simply "answers" with whatever the handler provides. Only when control goes down a non-tail-resumptive branch is control switched to the parent stack (extended to execute the remaining computation).

The proposal here does not support these optimizations. Furthermore, notice that it's important, especially in the final case with mixed tail resumption, for the handler to be executed on the newly allocated stack rather than the parent stack, like in deep handlers. So the orientation of the current proposal around shallow handlers makes it difficult to add support for such optimizations in the future.

Note: I realized that we might be using the term "generator" differently, and after looking at the documentation for Multicore OCaml I indeed found that was the case. So I edited the above comment to give a more nuanced and detailed answer that bridges the difference in terminology.

The example of generators in todays presentation appeared to be precisely the "for each" pattern for which all stack allocation and switching can be eliminated (even with "deep" yields, i.e. no need to employ an automaton transformation). This proposal cannot express that optimized implementation strategy though.

Can you explain why you do not want the optimized implementation strategy to be expressible?

To discuss the finer points of this proposal and #1360, I think it would be helpful, if both could be formulated in terms of a "global model of the world" that includes a notion of suspended stacks, the currently executing stack, and how they relate to each other.

It looks to me like both proposals show formal semantics only from the "viewpoint" of a single thread. The consequence of that is that, for the most interesting instructions, we need to fall back to informal language, because the semantics can not be expressed completely in the single-threaded execution model.

@sabine The semantics of both proposals are expressible and implementable in a single-threaded execution model. Could you clarify why you believe they aren't? I feel like that will uproot a miscommunication that hopefully will provide the insight you're looking for.

@RossTate even if the semantics can be expressed single-threaded, should it be called out that it is not expected that the continations can transfer between threads as it makes it explicit that work stealing schedulers on different threads won't be able to grab a continuation? It was mentioned in the meetings, but it probably should be more explicit than just the meeting notes. It precludes the work-stealing that Erlang's BEAM does, so we won't be able to do that in Lumen.

Certainly the CG should be made aware that the applications of both proposals to multi-threading are still limited by the multi-threading model of WebAssembly and the embedder.

@RossTate certainly, the semantics must be implementable in a single-threaded execution model, and thus, it must be expressible in a single-threaded execution model.

I think what I'm having trouble with is the use of big-step semantics and the use of informal explanations to describe the core parts of stack switch semantics (and it's very likely that I also need to read some papers that you are all familiar with). For example,

  • here in #1359, cont.resume (event $evt $handler)* : [t1* (cont $ft)] -> [t2*] we have big-step semantics in the sense that evaluating a cont.resume instruction results in a new stack after the continuation has been executed completely. All the other effects of cont.resume are described informally (though, I think I get the general idea).
  • in #1360, we have stack.switch $event : [t* stackref] -> unreachable where event $event : [t* stackref], which, I assume, is what the stack switch looks like from the viewpoint of the current thread. In a global model with small-step semantics, however, we should see the stack we just switched to?

I can see that both proposals follow the regular WebAssembly specification conventions from https://webassembly.github.io/spec/core/exec/conventions.html, where it says "The execution rules also assume the presence of an implicit stack that is modified by pushing or popping values, labels, and frames.".

I have this feeling that stack switch semantics would need less informal explanations if we could use a formalism that has an explicit model of the stack (which we could then reuse to express what a suspended stack looks like).

Using small-step semantics with an explicit stack, we might be able to formulate mathematically what happens on suspend, resume, and the more expressible primitives from #1360.

I might be wrong, though, and expressing what happens in small-step semantics is not trivial or is not as helpful as I think it would be.

@sabine, the semantics of continuations can certainly be described in a small-step semantics. We actually had that written down for an earlier version of this proposal, but haven't adapted it yet to the current incarnation.

Roughly, a continuation/stack is represented by an evaluation context E, and there would be a new administrative instruction for representing handlers on the stack. It's very similar to how you model exceptions, with the continuation just capturing and reifying the evaluation context you have thrown across, and resume reentering it.

Not sure that would be too helpful to most people at the current stage, though.

One standard way of explicitly reifying stacks is using a CEK machine. @dhil has a WIP extension of the Wasm semantics with effect handlers which I assume is modelled as a CEK machine.

@rossberg is the earlier (outdated) version of small-step semantics available somewhere?

@sabine, you might be able to access it here, but beware, it's more than 2 years old, still modelling everything as an extension to exceptions, and it's in fancy markdown.

Ah, thanks for the clarification @sabine! The link @rossberg above should be useful for understanding the evaluation-context approach. @kayceesrk mentions the CEK machine, which is another approach. It makes the stack explicit part of the machine state, much like the heap. So if the stack of the machine state looks like (cons K K'), i.e. a stack K' with a pointer to another stack K on top, then a stack switch changes the stack of the machine state to look like (cons K' K). The stack.switch instruction then does something similar, but further incorporating exceptions to manage untyped communication. Hope that helps.

Saw this OOPSLA '20 paper from a retweet by @yaahc of @AndrewCMyers about _bidirectional_ effect handlers. I'm not good at reading the formal semantics in PL papers yet, so going off just the prose, these sections caught my interest related to WASM's proposed usage of effect handler:

Multicore OCAML's async/await + exception interaction

Prior encodings. The ability to encode promise-based async–await [Dolan et al. 2017; Leijen 1
2017a,b]speakstotheexpressivepowerofalgebraiceffects, butencodingsinexistinglanguage designs compromise on how they accommodate exceptional computations. Koka [Leijen 2017a,b] supports structured asynchrony via algebraic effects, but uses an either monad for possible exceptional outcomes of the await operation—but encoding exceptional outcomes into monadic values is a pattern that algebraic effects in Koka are intended to help avoid! Unlike Koka, Multicore OCaml [Dolan et al. 2017] does not check algebraic effects statically. To notify user programs about asynchronously raised exceptions, the language adds a special discontinue construct. It is our goal to treat async–await and asynchronous exceptions in a more unified way: both are statically checked algebraic effects.

The part about Multicore OCAML seems relevant since the WASM Stacks Group has been using Multicore OCAML as an example, working implementation, so will we also need a special discontinue construct to mesh together async / await and exceptions if we don't support something like the bidirectional effect handlers this paper proses?

Using normal effect handlers to emulate bidirectional

The paper also includes implementation test cases where if we think the effect handlers will need to also produce effects (such as exceptions from async/ await, that yes, they can be implemented with normal effect handlers and closures, but there will be a performance hit.

We performed two hand translations of the ping-pong communication example (Section 3.3, with ponger implemented as in Section 3.3.2) into ÎĽC++, with one of the two using callbacks. This program was chosen because it exercises high-frequency bidirectional control transfer. We ran the hand-translated code using the modified ÎĽC++ implementation with the extra run-time check disabled, and measured the running time on a 3.2GHz Intel Xeon Gold processor, averaging 500 runs. The translation relying on callbacks incurred a 2.1Ă— slowdown: 42.6 ms vs. 19.8 ms. This result argues for bidirectional handlers as a first-class language feature: obtaining bidirectionality via a desugaring into callbacks is less efficient. (In the other way around, compiling callbacks away into efficient bidirectionality would involve a complex interprocedural analysis.)
8:07

Accidental handling

Preventing accidental handling. Accidental handling of algebraic effects in the presence of effect polymorphism is a known problem. Tunneling (lexically scoped, lifetime-bounded handlers) as a way to avoid accidental handling of exceptions was introduced by Zhang et al. [2016]; follow-on work adapted it to explicit effect polymorphism and proved parametricity [Zhang and Myers 2019]. Brachthäuser et al. [2018, 2020] implement tunneled algebraic effects in a Scala library, called Effekt, that encodes lifetime effects through Scala’s intersection types and path-dependent types.

The paper proves a lot of goodness of bidirectionality because of the type system, so I'm sure it actually would apply to our since it seems to be that like the Exception proposal already, we will have some sort of tag system for effect handlers and won't actually be able to show all effects are handled to allow mixing different languages together.

The paper did make me wonder how we handle the effect handlers not swallowing the wrong effects without all the types the paper outlines, which also involves lifetimes like in Rust too.

Does this even apply to us?

Again, I don't normally read PL papers, so I could be completely misunderstanding how this paper relates to WASM's effect handlers, but I thought I should bring it up since it seemed on-topic.

Oh cool! Glad to see Yizhou has kept this line of work going at his new position. There's a lot to digest from this paper. To keep from fragmenting the discussion too much, I'll focus just on the second item you raised.

Interestingly, the measurements they provided are more relevant to Stack Inspection (#1356). The reason is that both evaluations are done on an implementation of tail-resumptive algebraic effects that does no stack switching, and whereas this proposal (#1359) does not support such an implementation strategy, #1356 does with its answer construct. An answer is two parts: a stack mark (which is akin to dynamic scoping), and a closure (over the stack frame) that is essentially an tail-resumptive effect handler. One way to implement an answer is to directly use the stack frame as its closure's environment, i.e. a stack-allocated closure. Another way is to allocate the closure environment in the heap and copy to and from the closure environment as necessary. What their measurement indicates is that just the cost of heap allocation rather than reusing the stack frame could make stack-inspection-intensive programs 2.1x slower. (For comparison, these are the same authors who found that eager stack tracing makes exception-intensive programs 6x slower.)

Following up, the second big thing in this paper is its use of lexical scoping. This proposal, as well as Stack Inspection, use dynamic scoping—one walks up the stack to find the handler for a given effectful operation. This has its advantages—in particular, it works well in untyped (in terms of effects) settings—but also its disadvantages—such as accidentally finding a different handler than what the programmer intended, as described in the paper. To implement lexically scoped handlers, you pass the handler to the effectful function as an explicit low-level argument. The value you are handing off is in general a closure. But the stuff they do with lifetimes ensures that the closure's lifetime never exceeds the lifetime of its stack frame. This enables them to use stack-allocated closures rather than heap-allocated closures. This implementation technique corresponds to the higher-order parameters discussed in WebAssembly/exception-handling#105.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

JimmyVV picture JimmyVV  Â·  4Comments

nikhedonia picture nikhedonia  Â·  7Comments

thysultan picture thysultan  Â·  4Comments

Artur-A picture Artur-A  Â·  3Comments

dpw picture dpw  Â·  3Comments