Kotlinx.coroutines: Circular exception with async/await in coroutineScope

Created on 10 Jun 2019  路  8Comments  路  Source: Kotlin/kotlinx.coroutines

A circular JobCancellationException from coroutines keeps causing a StackOverflowException in logback which I use for logging.
IMO the recursion should be broken up at some point.

This is quite similar to #305.

Output:

java.lang.RuntimeException: 1
   suppressed: java.lang.RuntimeException: 2
      cause: kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=ScopeCoroutine{Cancelled}@257ed010
         cause: java.lang.RuntimeException: 1 // RECURSION!

Repro:

suspend fun main() {
   try {
      coroutineScope {
         val def1: Deferred<Unit> = async {
            delay(100)
            throw RuntimeException("1")
         }
         val def2: Deferred<Unit> = async {
            try {
               while (true)
                  delay(10)
            }
            catch (e: Exception) {
               throw RuntimeException("2", e)
            }
         }

         def1.await()
         def2.await()
      }
   }
   catch (throwable: Throwable) {
      debug(throwable)
   }
}


fun debug(throwable: Throwable, prefix: String = "", processed: MutableSet<Throwable> = hashSetOf()) {
   print(throwable)

   if (processed.contains(throwable)) {
      println(" // RECURSION!")
      return
   }
   println()

   processed += throwable

   throwable.cause?.let {
      print("$prefix   cause: ")
      debug(it, prefix = "$prefix   ", processed = processed)
   }
   throwable.suppressed?.firstOrNull()?.let {
      print("$prefix   suppressed: ")
      debug(it, prefix = "$prefix   ", processed = processed)
   }
}

Workaround is to catch CancellationException and rethrow it without wrapping it into another exception.

enhancement question

Most helpful comment

I understand your pain with circular exceptions and I would like to fix this problem as well.

What about changing the CE's cause dynamically depending on the coroutine context?
According to Throwable.getCause() documentation it is fine to override the method.

It is indeed okay to override getCause, but making it non-pure is not: it will confuse both developers and some frameworks (e.g. I am not sure that all logging frameworks handle exceptions with non-pure getCause without bugs).

What about changing the CE's cause dynamically depending on the coroutine context?

It is another layer of complications because coroutine may not have a context (e.g. just a Job() created without a coroutine). Additional complexity comes from the fact that some coroutine handle their own exceptions, while others don't etc.

What I think we can do is to use CE without a cause when coroutine has a parent, though I am not sure this is a valid change. I will investigate it further

All 8 comments

There is not much we can do about it.

Let me simplify your example and explain why does it happen.

  • If multiple children of the coroutine fail, the resulting exception will be the first one. But the rest of the exceptions cannot be lost, so they are added as suppressed exceptions to the winner.
  • The winner in your case is RE(1), thus RE(2) should be added to suppressed exceptions. It is clear and simple, but where this circularity comes from?

When a coroutine is cancelled (in that case, it is the second async), CancellationException is thrown from the suspension point. But this diagnostic may not be enough, because if you log it you'll see something like "Parent was cancelled" and it is not very informative. To add more contextual information, we add the original exception (RE(1)) to the cause of cancellation exception.

So CancellationException(cause = RE(1)) is thrown in the second await. You are catching this and rethrowing RE(2, cause = CE(RE(1))).

Now, when coroutineScope builds the final exception, it chooses RE(1) and adds RE(2) to suppressed. Having in mind the previous point, the resulting exception will be RE(1, suppressed = RE(2, cause = CE(RE(1))), creating a cycle.

Honestly, I don't see how to avoid it without losing some contextual debug information, so not wrapping cancellation exceptions seems the only reasonable way for me

Thank you for the detailed explanation. I understand the cause and the reasoning. Yet it ends up in a cycle and cycles don't make sense for exceptions.

Unfortunately that puts our case in kind of a deadlock between this issue with Kotlin coroutines and logback which doesn't support circular exceptions although the issue is open for 5 years now - resulting in a StackOverflowError. Bottom line is that we lose some stability in the back-end, which is not good. These issues are difficult to foresee while programming because CEs can potentially be thrown all over the place, so there would be a need for a lot of checking for CEs and a lack of context if we cannot wrap it.

What about changing the CE's cause dynamically depending on the coroutine context?

  • If the context is the same as where the CE was thrown then cause returns the cause (RE(1)) because otherwise that information would be inaccessible.
  • If the context is different then we're either in a parent or in a completely different one in which case cause then returns null. In the parent the cause RE(1) is still accessible because it's the root exception and we merely have broken the cycle.
  • There could be an additional property just for CEs which always returns the cause independent of the current context. That would allow for some edge cases where the developer still needs to access the cause in a context which is neither the cancelled coroutine nor a parent.

According to Throwable.getCause() documentation it is fine to override the method.

I understand your pain with circular exceptions and I would like to fix this problem as well.

What about changing the CE's cause dynamically depending on the coroutine context?
According to Throwable.getCause() documentation it is fine to override the method.

It is indeed okay to override getCause, but making it non-pure is not: it will confuse both developers and some frameworks (e.g. I am not sure that all logging frameworks handle exceptions with non-pure getCause without bugs).

What about changing the CE's cause dynamically depending on the coroutine context?

It is another layer of complications because coroutine may not have a context (e.g. just a Job() created without a coroutine). Additional complexity comes from the fact that some coroutine handle their own exceptions, while others don't etc.

What I think we can do is to use CE without a cause when coroutine has a parent, though I am not sure this is a valid change. I will investigate it further

I hope this could be prioritized, because this (combined with libraries that cannot process circular graphs in suppressed/cause chain) is a very unpleasant surprise for production systems.

I found https://github.com/apache/logging-log4j2/commit/aa0d2abaf66e107f66ea3ebbe410ddd440212a6b#diff-f86cfba96bcd45fd3d972f8ede9759e8 possibly adressing this issue.

If I want to transform these exceptions into something loggable with a summary message and simplified stacktrace, how would I go about it? Looking for a solution perhaps with an extension function on Exception, which returns a new summary exception I can use as cause when logging.

This is indeed very subtle and WTF-inducing issue. I've just run into similar problem using reflection: InvocationTargetException wraps JobCancellationException, and then cycle happens and logback panics. Re-throwing target of the InvocationTargetException seems to avoid the cycle. But it looks extremely fragile especially if you use libraries which may wrap JobCancellationExceptions beyond your control. Can anyone please suggest a workaround?

@haizz Libraries should not wrap JobCancellationException. Did you ever encounter one doing so?

@haizz Libraries should not wrap JobCancellationException. Did you ever encounter one doing so?

I'm using graphql-java with data fetcher functions returning CompletableFutures built using coroutines future builder. These coroutines run in parallel inside my coroutineScope. When one of the coroutines throws, all the others fail with JobCancellationException (which is expected). Now, graphql-java has default failure handler which receives failures from all the futures and logs them with logback. So what this library handler receives is all sorts of JobCancellation wrapping:
JobCancellationException unwrapped
InvocationTargetException -> JobCancellationException
CompletionException -> JobCancellationException
CompletionException -> InvocationTargetException -> JobCancellationException
...
When there is a cycle, it fails inside the library. It's not easy to reproduce because it seems to be dependent on coroutines race. Sometimes there's no cycle and no panic. Okay, when I remove InvocationTargetException from the chain by catching it and rethrowing JobCancellationException from inside, it seems to work but I can't prove it will work 100% of all cases. It's really hard to explain because it's super hard to understand what's going on in the first place.

What would prevent any third-party code from throwing this InvocationTargetException -> JobCancellationException combination?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Leftwitch picture Leftwitch  路  3Comments

ScottPierce picture ScottPierce  路  3Comments

ZakTaccardi picture ZakTaccardi  路  3Comments

elizarov picture elizarov  路  3Comments

mhernand40 picture mhernand40  路  3Comments