Kotlinx.coroutines: Abolish distinction between cancelled and failed Job/Deferred

Created on 5 Feb 2018  Â·  9Comments  Â·  Source: Kotlin/kotlinx.coroutines

Distinction between _cancelled_ and _failed_ states for Job and Deferred shall be abolished. The rationale follows.

  1. Public documentation for Job does not mention the _failed_ state at all, however using getCancellationException one can still indirectly figure out if the coroutine that was started with launch had completed normally or with exception, so hiding that from the Job state machine does not really help.

  2. When Job uses parallel decomposition of its work by launching child coroutines it accidentally leaks its internal implementation details. When a child coroutine crashes with exception it cancels its parent, thereby obviously revealing this failure with _cancelled_ state of the parent job. However, when the parent itself crashes, it still technically in _completed_ state (not _cancelled_), but see above.

  3. When the parent coroutine completes exceptionally it waits for its children before becoming _failed_. This is not an ideal behavior for a parallel decomposition of work. When the parent coroutine crashes, then we really want to cancel all the children asap, since the "work as a whole" was obviously not successful.

All in all, this makes distinction between _cancelled_ and _failed_ Job or Deferred really irrelevant.

The plan is as follows:

  • Deprecate Deferred.isCompletedExceptionally in favor of isCancelled.
  • Deprecate CompletableDeferred.completeExceptionally in favor of cancel.

Change the behavior for crashed coroutines. When coroutine crashes with exception it shall transition into _cancelling_ state, cancel all its children, wait for their completion, and transition into _cancelled_ state.

With this change the behavior of CoroutineExceptionHandler is going to be more consistent. When a standalone coroutine (the coroutine that is started with launch) crashes with exception or is cancelled it should transition to _cancelling_ state and invoke CoroutineExceptionHandler with the context of the parent coroutine. The default behavior is to cancel the parent. Since the crashed child is already in the _cancelling_ state it is never going cause a problem of cancelling the _failing_ child.

Documentation needs to be adjusted appropriately. In particular, the state machine for Deferred is going to be identical to a Job and its docs can be simpler and more focused on the Deferred/Job differences.

219 shall be implemented first, since CancellableContinuation must keep distinction between the state where it was "resumed with exception" and where it was "cancelled". This is often needed to for a proper resource-managent to avoid closing some external resource when it was already closed and the continuation was cancelled.

enhancement for 1.0 release

Most helpful comment

We're back on track. We've figured out how to preserve all our channel-related use-cases while abolishing cancelling/failing distinction. Job becomes conceptually so much simpler.

All 9 comments

Doesn't CoroutineExceptionHandler become CoroutineCancellationHandler then, since it will also be called on non-exceptional cancellations?

It _is_ now called on non-exceptional cancellation. The general approach (that is also implemented in the default handler) is to check exception is CancellationException and act accordingly.

I think there has been some confusion about CancellationException in the first place, people think it really is exceptional. I think renaming this would clear up some ambiguity, since according to your proposition, even a Exception is just another way to cancel a coroutine. Therefore we are handling a cancellation that can either be a user requested cancellation, or one caused by an exception. But it is ambiguous to say that it handles an exception, because in people's eyes, an exception isn't usually triggered by a voluntary cancellation.

The special role of CancellationException is indeed confusing. I'll think on how this could be addressed. Renaming exception handler to CancellationHandler might do the trick, especially in light of renaming "exceptional completion" to "cancellation".

Btw, I've tried hard to make it all consistent while still keeping "exceptional completion" distinct from "cancellation". It just does not click. It becomes a real mess in corner cases like "coroutine that was cancelled, but then crashes with some exception in its finally sections" and "coroutine that has crashed, but was cancelled while waiting for its children to complete". It becomes much easier to grasp when "exceptional completion" and "cancellation" are simply the same concept.

@elizarov will this change make isCancelled become true if a job/deferred cancels due to an exception. For example:

fun main(args: Array<String>) = runBlocking<Unit> {
    val job = launch {
        TODO("blabla")
    }

    delay(2000)

    println(job.isCancelled) // currently this will be false
}

As shown above, isCancelled is currently false for a "failed" job.

Some questions:

  1. If I am understanding the description of this issue correctly, the example above will print true, is that correct?
  2. In order to know if a job fails, will it be mandatory to set a CancellationHandler to it?
  3. In order to know if a deferred fails, will it be mandatory to either set a CancellationHandler to it or try/catch the await()?

Note:
With the current discussion/implementation it seems that it could also be done by unwrapping the cause of getCancellationException(), but that should be discouraged – or impeded – because it seems hacky.

Yes. That is the proposal. job.isCancelled would return true on a failed job.

To know if job fails you will have the following options:

  • Use job.invokeOnCompletion { cause -> ... }
  • Start a job with async { ... } instead of launch and use await() to get the corresponding exception.
  • Install CoroutineExceptionHandler into the coroutine's context.

The main problem with the current implementation is that it violated encapsulation of job's work. The job may decide to delegate parts of its work to its children and these details (that are supposed to be completely encapsulated in the job implementation) leak through the current api.

We've decided not go ahead with this change, but solve only a part of the problem (see #585). We've found channel-related use-cases where it is still important to distinguish whether the job was explicitly _cancelled_ or whether it had _failed_ by itself.

We're back on track. We've figured out how to preserve all our channel-related use-cases while abolishing cancelling/failing distinction. Job becomes conceptually so much simpler.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

qwwdfsad picture qwwdfsad  Â·  47Comments

elizarov picture elizarov  Â·  35Comments

PaulWoitaschek picture PaulWoitaschek  Â·  47Comments

elizarov picture elizarov  Â·  60Comments

elizarov picture elizarov  Â·  45Comments