Currently there is a race condition when using multithreaded coroutines. Consider the following scenario:
epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.resume coroutine_handle which now points to invalid memory. Boom.In order to fix this, we need to introduce new syntax and semantics. Current semantics are:
async creates a promise which must be consumed with cancel or await.suspend must be consumed with a resume.The problem here described above - resume and cancel racing with each other. When a suspended coroutine is canceled, the memory must not be destroyed until the suspend is canceled or resumed. Proposal for new semantics:
async creates a promise which must be consumed with cancelasync or await.suspend must be consumed with a cancelsuspend or resume.With these semantics, each coroutine essentially has 0, 1, or 2 references, and when the reference count reaches 0 it is destroyed. There is the "async reference", which is the main one, and the "suspend reference", which might never exist.
Therefore it is crucial that when a coroutine uses suspend - which is a low level feature intended mainly for low level library implementations of concurrency primitives - that it ensures it will be consumed with either resume or cancelsuspend.
defer and errdefer will both run when a coroutine's cancellation process begins. This means that the first cancelasync or cancelsuspend will run the defers of the coroutine. When the other reference drops, the memory will be destroyed.
What you describe has the same race condition but between cancelsuspend and resume.
The way I see to fix this is to wrap the epoll_wait() with a read-write mutex+semaphore. The `epoll_wait() thread does:
epoll_wait()While the cancel thread does:
But this introduces a deadlock where the epoll_wait() thread is trying to resume the thread that is trying to cancel itself. It can't drop this event (because it is using EPOLLET) so after it detects a dead-lock it has to let itsself be resumed (while holding the lock preventing epoll_wait() from being called) and cancel the co-routine on the next suspend.
But that introduces a stall in the main loop while the resume thread is running, so a second lock has to be introduced, individual to promise [1]. The cancelling thread:
Dead locks are detected with two locks taken in opposite orders.
My proposal can be done entirely in user-space, except to keep cancel thread-safe it needs a new suspendnotfinal keyword.
Speaking of, I don't really like cancel. Why can't it just be a method on the promise. .scheduleCancellation()?
@shawnl check-out #1307
So, I'm guessing that one of the better things to do here is to keep the current keyword semantics, except to hold-off on defer & delete until a resume is called instead of doing it after a suspend.
The epoll loop will always call resume on its promises, so if we are in cancel state then we defer and delete and then return control flow back to the loop. If not in a cancel state, we simply resume.
And hey, I'm made some ASCII art:
+-[PROMISE / INSIDE COROUTINE]--+
|'''''''''''''''''''''''''''''''|
|''''''+------------------------+--+--------------+
|''''''v''''''''''''''''''''''''| | |
+-----+ |''+-------+''''+-----------+'''| | |
|async|-+->|suspend|-+->| return |'''| | |
+-----+ |''+-------+'|''+-----------+'''| | |
|''''''''''''|''+-----------+'''| | |
|''''''''''''+->|cancel(noop|'''| | |
|''''''''''''|''+-----------+'''| | |
+------------+------------------+ | |
| +---------------+ | |
+->| await |--+ |
| +---------------+ +------+ |
| +---------------+ | set | |
+->| cancel |-->|cancel| |NO
| +---------------+ | bit | .-------. +--------+
| +---------------+ +------+ / TO BE \ YES | defer& |
+->| resume |---------->( CANCELED? )--->| delete |
+---------------+ `. ,' +--------+
`-----'
The epoll loop will always call resume on its promises,
No. If no events come it they never get resumed. The cancel part can be taken out of the promise type and become part of the loop: .scheduleCancellation(). (which works according to above locking semantics) Then before suspending a co-routine has to check if it was scheduled to be cancelled while it was running, and if so destroys itself.
I opened a LLVM bug on whether co-routines are allowed to destroy themselves (as zig currently does in suspend)
I think the scenario I outlined above is not actually a problem, because a coroutine couldn't be in the epoll set and executing at the same time. But here's a problem scenario:
epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.cancel on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. (2) proceeds to destroy the coroutine frame. The memory is gone.resume coroutine_handle which now points to invalid memory. Boom.With the proposed semantics:
epoll_wait/kevent/GetQueuedCompletionStatus, blocking until some work can be done. The thread wakes up, and it is about to execute resume coroutine_handle.cancelasync on the coroutine. Since it's at a suspend point, it runs the defers and errdefers, which dutifully remove the coroutine from the epoll set. However because of (1) it isn't even in the epoll set/kqueue/iocp. (1) is about to resume the coroutine and there's nothing (2) can do to stop it. However (2) does not destroy the coroutine frame because the suspend reference is present. The memory is still there.resume coroutine_handle which finds out that the thread has been canceled. So instead of resuming, it frees the memory.I think it would be best if Andrew were to implement his proposal and we could test it/ fuzz it.
I tried to think through this discussion and while Andrews explanation makes sense to me, I've made just enough concurrency bugs myself to know that this does not proof the correctness.
Testing the implementation does not give a proof either of course but I think its the best we can currently do until maybe zig attract some (other) domain expert that thinks it through.
I'd certainly be willing to run some test suit for an extensive time, compared to peoples time, hardware is cheap 馃槅
When I get some time later tonight, I will try to make another ASCII chart trying to clarify things. It's an area where we should be documenting.
rather than ascii chart we'd need a formal state diagram/ automata with a formal proof, but I've seen a professor attempting the same and still failing (false positive actually) so...brute force is the best we have although its still no good.
@monouser7dig yes, I agree. An ASCII chart is no replacement for a formal proof. Inch by inch.
so after 0.3.0 is this the first major thing on the list?
like get async working, because the self hosted compiler needs it, network needs it (sort of?) and thus
@andrewrk is this roughly accurate or what is to be expected?
Yes, this is accurate. Reworking coroutines is the first major thing on my list. That, plus something like one bug fix + one documentation improvement per day. roadmap
Maybe related to this: I've written a few lines in Rust's futures repository why synchronous cancellation might not work out well for all use-cases: https://github.com/rust-lang-nursery/futures-rs/issues/1278
I guess similar things would apply to zig's coroutines as well, and it might in some cases be favorable to force them to run to completion (or at least to a safe exit point).
This issue is obsoleted with the merge of #3033.
Most helpful comment
Yes, this is accurate. Reworking coroutines is the first major thing on my list. That, plus something like one bug fix + one documentation improvement per day. roadmap