Zig: coroutine semantics are unsound. proposal to fix them

Created on 10 Aug 2018 · 14Comments · Source: ziglang/zig

Currently there is a race condition when using multithreaded coroutines. Consider the following scenario:

Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
Thread 2 is executing the coroutine, which reaches a suspend point, and then reads its atomic state and learns that it is scheduled to be canceled (destroyed). It runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.
Thread 1 does resume coroutine_handle which now points to invalid memory. Boom.

In order to fix this, we need to introduce new syntax and semantics. Current semantics are:

async creates a promise which must be consumed with cancel or await.
suspend must be consumed with a resume.

The problem here described above - resume and cancel racing with each other. When a suspended coroutine is canceled, the memory must not be destroyed until the suspend is canceled or resumed. Proposal for new semantics:

async creates a promise which must be consumed with cancelasync or await.
suspend must be consumed with a cancelsuspend or resume.

With these semantics, each coroutine essentially has 0, 1, or 2 references, and when the reference count reaches 0 it is destroyed. There is the "async reference", which is the main one, and the "suspend reference", which might never exist.

Therefore it is crucial that when a coroutine uses suspend - which is a low level feature intended mainly for low level library implementations of concurrency primitives - that it ensures it will be consumed with either resume or cancelsuspend.

defer and errdefer will both run when a coroutine's cancellation process begins. This means that the first cancelasync or cancelsuspend will run the defers of the coroutine. When the other reference drops, the memory will be destroyed.

accepted breaking bug proposal

Source

andrewrk

Most helpful comment

Yes, this is accurate. Reworking coroutines is the first major thing on my list. That, plus something like one bug fix + one documentation improvement per day. roadmap

andrewrk on 1 Oct 2018

👍2

All 14 comments

What you describe has the same race condition but between cancelsuspend and resume.

The way I see to fix this is to wrap the epoll_wait() with a read-write mutex+semaphore. The `epoll_wait() thread does:

acquire the mutex
wait till the semaphore in 0
epoll_wait()
resume co-routine
release mutex
goto 1

While the cancel thread does:

increment the semaphore. If the mutex is acquired goto 4, else goto 2.
deliver an event that will wake up the epoll_wait() thread <= prevents starvation
wait for the epoll thread to set the mutex to 0
cancel the co-routine
decrement the semaphore

But this introduces a deadlock where the epoll_wait() thread is trying to resume the thread that is trying to cancel itself. It can't drop this event (because it is using EPOLLET) so after it detects a dead-lock it has to let itsself be resumed (while holding the lock preventing epoll_wait() from being called) and cancel the co-routine on the next suspend.

But that introduces a stall in the main loop while the resume thread is running, so a second lock has to be introduced, individual to promise [1]. The cancelling thread:

detect deadlock, if not destroy thread and return
lock promise
remove fd from epoll
release epoll lock
suspend
on resume (from epoll thread) unlock promise
destroy thread

Dead locks are detected with two locks taken in opposite orders.

shawnl on 1 Sep 2018

My proposal can be done entirely in user-space, except to keep cancel thread-safe it needs a new suspendnotfinal keyword.

Speaking of, I don't really like cancel. Why can't it just be a method on the promise. .scheduleCancellation()?

shawnl on 2 Sep 2018

@shawnl check-out #1307

kristate on 2 Sep 2018

So, I'm guessing that one of the better things to do here is to keep the current keyword semantics, except to hold-off on defer & delete until a resume is called instead of doing it after a suspend.

The epoll loop will always call resume on its promises, so if we are in cancel state then we defer and delete and then return control flow back to the loop. If not in a cancel state, we simply resume.

And hey, I'm made some ASCII art:

        +-[PROMISE / INSIDE COROUTINE]--+                                      
        |'''''''''''''''''''''''''''''''|                                      
        |''''''+------------------------+--+--------------+                    
        |''''''v''''''''''''''''''''''''|  |              |                    
+-----+ |''+-------+''''+-----------+'''|  |              |                    
|async|-+->|suspend|-+->|  return   |'''|  |              |                    
+-----+ |''+-------+'|''+-----------+'''|  |              |                    
        |''''''''''''|''+-----------+'''|  |              |                    
        |''''''''''''+->|cancel(noop|'''|  |              |                    
        |''''''''''''|''+-----------+'''|  |              |                    
        +------------+------------------+  |              |                    
                     |  +---------------+  |              |                    
                     +->|     await     |--+              |                    
                     |  +---------------+   +------+      |                    
                     |  +---------------+   | set  |      |                    
                     +->|    cancel     |-->|cancel|      |NO                  
                     |  +---------------+   | bit  |  .-------.      +--------+
                     |  +---------------+   +------+ /  TO BE  \ YES | defer& |
                     +->|    resume     |---------->( CANCELED? )--->| delete |
                        +---------------+            `.       ,'     +--------+
                                                       `-----'

kristate on 2 Sep 2018

The epoll loop will always call resume on its promises,

No. If no events come it they never get resumed. The cancel part can be taken out of the promise type and become part of the loop: .scheduleCancellation(). (which works according to above locking semantics) Then before suspending a co-routine has to check if it was scheduled to be cancelled while it was running, and if so destroys itself.

I opened a LLVM bug on whether co-routines are allowed to destroy themselves (as zig currently does in suspend)

https://bugs.llvm.org/show_bug.cgi?id=38805

shawnl on 2 Sep 2018

I think the scenario I outlined above is not actually a problem, because a coroutine couldn't be in the epoll set and executing at the same time. But here's a problem scenario:

Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
Thread 2 does a cancel on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.
Thread 1 does resume coroutine_handle which now points to invalid memory. Boom.

With the proposed semantics:

Thread 1 does epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.
Thread 2 does a cancelasync on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. However (2) does not destroy the coroutine frame because the suspend reference is present. The memory is still there.
Thread 1 does resume coroutine_handle which finds out that the thread has been canceled. So instead of resuming, it frees the memory.

andrewrk on 4 Sep 2018

I think it would be best if Andrew were to implement his proposal and we could test it/ fuzz it.
I tried to think through this discussion and while Andrews explanation makes sense to me, I've made just enough concurrency bugs myself to know that this does not proof the correctness.

Testing the implementation does not give a proof either of course but I think its the best we can currently do until maybe zig attract some (other) domain expert that thinks it through.

I'd certainly be willing to run some test suit for an extensive time, compared to peoples time, hardware is cheap 😆

ghost on 5 Sep 2018

When I get some time later tonight, I will try to make another ASCII chart trying to clarify things. It's an area where we should be documenting.

kristate on 5 Sep 2018

rather than ascii chart we'd need a formal state diagram/ automata with a formal proof, but I've seen a professor attempting the same and still failing (false positive actually) so...brute force is the best we have although its still no good.

ghost on 5 Sep 2018

@monouser7dig yes, I agree. An ASCII chart is no replacement for a formal proof. Inch by inch.

kristate on 5 Sep 2018

so after 0.3.0 is this the first major thing on the list?
like get async working, because the self hosted compiler needs it, network needs it (sort of?) and thus

async (rewrite coroutines)
self hosted + comptime allocator/ bugs + tooling
docs
??? lots of things

@andrewrk is this roughly accurate or what is to be expected?

ghost on 1 Oct 2018

Yes, this is accurate. Reworking coroutines is the first major thing on my list. That, plus something like one bug fix + one documentation improvement per day. roadmap

andrewrk on 1 Oct 2018

👍2

Maybe related to this: I've written a few lines in Rust's futures repository why synchronous cancellation might not work out well for all use-cases: https://github.com/rust-lang-nursery/futures-rs/issues/1278

I guess similar things would apply to zig's coroutines as well, and it might in some cases be favorable to force them to run to completion (or at least to a safe exit point).