A circular JobCancellationException from coroutines keeps causing a StackOverflowException in logback which I use for logging.
IMO the recursion should be broken up at some point.
This is quite similar to #305.
Output:
java.lang.RuntimeException: 1
suppressed: java.lang.RuntimeException: 2
cause: kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=ScopeCoroutine{Cancelled}@257ed010
cause: java.lang.RuntimeException: 1 // RECURSION!
Repro:
suspend fun main() {
try {
coroutineScope {
val def1: Deferred<Unit> = async {
delay(100)
throw RuntimeException("1")
}
val def2: Deferred<Unit> = async {
try {
while (true)
delay(10)
}
catch (e: Exception) {
throw RuntimeException("2", e)
}
}
def1.await()
def2.await()
}
}
catch (throwable: Throwable) {
debug(throwable)
}
}
fun debug(throwable: Throwable, prefix: String = "", processed: MutableSet<Throwable> = hashSetOf()) {
print(throwable)
if (processed.contains(throwable)) {
println(" // RECURSION!")
return
}
println()
processed += throwable
throwable.cause?.let {
print("$prefix cause: ")
debug(it, prefix = "$prefix ", processed = processed)
}
throwable.suppressed?.firstOrNull()?.let {
print("$prefix suppressed: ")
debug(it, prefix = "$prefix ", processed = processed)
}
}
Workaround is to catch CancellationException and rethrow it without wrapping it into another exception.
There is not much we can do about it.
Let me simplify your example and explain why does it happen.
RE(1), thus RE(2) should be added to suppressed exceptions. It is clear and simple, but where this circularity comes from?When a coroutine is cancelled (in that case, it is the second async), CancellationException is thrown from the suspension point. But this diagnostic may not be enough, because if you log it you'll see something like "Parent was cancelled" and it is not very informative. To add more contextual information, we add the original exception (RE(1)) to the cause of cancellation exception.
So CancellationException(cause = RE(1)) is thrown in the second await. You are catching this and rethrowing RE(2, cause = CE(RE(1))).
Now, when coroutineScope builds the final exception, it chooses RE(1) and adds RE(2) to suppressed. Having in mind the previous point, the resulting exception will be RE(1, suppressed = RE(2, cause = CE(RE(1))), creating a cycle.
Honestly, I don't see how to avoid it without losing some contextual debug information, so not wrapping cancellation exceptions seems the only reasonable way for me
Thank you for the detailed explanation. I understand the cause and the reasoning. Yet it ends up in a cycle and cycles don't make sense for exceptions.
Unfortunately that puts our case in kind of a deadlock between this issue with Kotlin coroutines and logback which doesn't support circular exceptions although the issue is open for 5 years now - resulting in a StackOverflowError. Bottom line is that we lose some stability in the back-end, which is not good. These issues are difficult to foresee while programming because CEs can potentially be thrown all over the place, so there would be a need for a lot of checking for CEs and a lack of context if we cannot wrap it.
What about changing the CE's cause dynamically depending on the coroutine context?
cause returns the cause (RE(1)) because otherwise that information would be inaccessible.cause then returns null. In the parent the cause RE(1) is still accessible because it's the root exception and we merely have broken the cycle.According to Throwable.getCause() documentation it is fine to override the method.
I understand your pain with circular exceptions and I would like to fix this problem as well.
What about changing the CE's cause dynamically depending on the coroutine context?
According to Throwable.getCause() documentation it is fine to override the method.
It is indeed okay to override getCause, but making it non-pure is not: it will confuse both developers and some frameworks (e.g. I am not sure that all logging frameworks handle exceptions with non-pure getCause without bugs).
What about changing the CE's cause dynamically depending on the coroutine context?
It is another layer of complications because coroutine may not have a context (e.g. just a Job() created without a coroutine). Additional complexity comes from the fact that some coroutine handle their own exceptions, while others don't etc.
What I think we can do is to use CE without a cause when coroutine has a parent, though I am not sure this is a valid change. I will investigate it further
I hope this could be prioritized, because this (combined with libraries that cannot process circular graphs in suppressed/cause chain) is a very unpleasant surprise for production systems.
I found https://github.com/apache/logging-log4j2/commit/aa0d2abaf66e107f66ea3ebbe410ddd440212a6b#diff-f86cfba96bcd45fd3d972f8ede9759e8 possibly adressing this issue.
If I want to transform these exceptions into something loggable with a summary message and simplified stacktrace, how would I go about it? Looking for a solution perhaps with an extension function on Exception, which returns a new summary exception I can use as cause when logging.
This is indeed very subtle and WTF-inducing issue. I've just run into similar problem using reflection: InvocationTargetException wraps JobCancellationException, and then cycle happens and logback panics. Re-throwing target of the InvocationTargetException seems to avoid the cycle. But it looks extremely fragile especially if you use libraries which may wrap JobCancellationExceptions beyond your control. Can anyone please suggest a workaround?
@haizz Libraries should not wrap JobCancellationException. Did you ever encounter one doing so?
@haizz Libraries should not wrap
JobCancellationException. Did you ever encounter one doing so?
I'm using graphql-java with data fetcher functions returning CompletableFutures built using coroutines future builder. These coroutines run in parallel inside my coroutineScope. When one of the coroutines throws, all the others fail with JobCancellationException (which is expected). Now, graphql-java has default failure handler which receives failures from all the futures and logs them with logback. So what this library handler receives is all sorts of JobCancellation wrapping:
JobCancellationException unwrapped
InvocationTargetException -> JobCancellationException
CompletionException -> JobCancellationException
CompletionException -> InvocationTargetException -> JobCancellationException
...
When there is a cycle, it fails inside the library. It's not easy to reproduce because it seems to be dependent on coroutines race. Sometimes there's no cycle and no panic. Okay, when I remove InvocationTargetException from the chain by catching it and rethrowing JobCancellationException from inside, it seems to work but I can't prove it will work 100% of all cases. It's really hard to explain because it's super hard to understand what's going on in the first place.
What would prevent any third-party code from throwing this InvocationTargetException -> JobCancellationException combination?
Most helpful comment
I understand your pain with circular exceptions and I would like to fix this problem as well.
It is indeed okay to override
getCause, but making it non-pure is not: it will confuse both developers and some frameworks (e.g. I am not sure that all logging frameworks handle exceptions with non-puregetCausewithout bugs).It is another layer of complications because coroutine may not have a context (e.g. just a
Job()created without a coroutine). Additional complexity comes from the fact that some coroutine handle their own exceptions, while others don't etc.What I think we can do is to use
CEwithout a cause when coroutine has a parent, though I am not sure this is a valid change. I will investigate it further