As of v1.0-RC4, there's a possible non-deterministic interruption leak / deadlock bug under high concurrency. I've started seeing interrupt IOs that permanently block once every thousand calls in the context of a large working app.
I've managed to create a bare minimum, standalone reproducer:
import java.util.concurrent.atomic.LongAdder
import scalaz.zio.console._
import scalaz.zio.{Task, UIO, ZIO, ZSchedule}
import scalaz.zio.duration._
object ZioInterruptLeakOrDeadlockRepo extends scalaz.zio.App {
def run(args: List[String]): ZIO[Environment, Nothing, Int] = {
val work = UIO.succeed("whatever")
// Simulates an external "kill-switch"
val abort = Task.unit.delay(1.minute)
val completedCounter = new LongAdder
val pendingGauge = new LongAdder
val timeoutCounter = new LongAdder
val leakOrDeadlockTest = for {
_ <- UIO(pendingGauge.increment())
valueFib <- work.fork
abortFib <- (abort *> valueFib.interrupt).fork
_ <- valueFib.join
// This blocks / leaks once every 20,000 rounds or so
_ <- abortFib.interrupt
_ <- UIO(pendingGauge.decrement())
} yield ()
val main = for {
_ <- UIO(s"completed=${completedCounter.longValue()} pending=${pendingGauge.longValue()} timed-out=${timeoutCounter.longValue()}")
.flatMap(putStrLn)
.repeat(ZSchedule.fixed(1.second))
.fork
_ <- ZIO
.foreachPar(1 to Runtime.getRuntime.availableProcessors()) { _ =>
leakOrDeadlockTest
.timeout(5.seconds)
.repeat(ZSchedule.identity[Option[Unit]].logInput {
case Some(_) => UIO(completedCounter.increment())
case None => UIO(timeoutCounter.increment())
})
}
} yield 0
main
.catchAll(e => UIO(println(s"Died with $e")) *> UIO.succeed(1))
}
}
Expected:
Actual:
It's worth noting that by reducing the CPU load, for example with:
val work = UIO.succeed("whatever").delay(1.milli)
this issue would not be visible. So far I've only been able to reproduce this under high CPU utilization, which might indicate it's a scheduling fairness problem?
@nkpart Does this happen in RC3?
As far as I can tell from a first glance, the interruption of abortFib code should never block (it obviously itself blocks on valueFib but there are no finalizers there that could delay interruption indefinitely). Which means it's a bug and will be fixed. Thanks for your work minimizing!
@jdegoes I tested this back to RC1 and could still reproduce it.
Also interestingly if we fork abortFib.interrupt, i.e:
for {
...
_ <- (abortFib.interrupt *> UIO(pendingGauge.decrement())).fork
}
Then everything is fine. The interrupt happens asynchronously but also practically instantaneous. pendingGauge will always remain <= the number of threads even after an hour of 100% CPU load test, indicating there's no blocking or leaking whatsoever.
@jdegoes It's still reproducible as of v1.0.0-RC10-1, do you think #1081 will address this?
@nktpro It's possible, as the implementation of interrupt is changing completely.
If not, however, I'll try to fix this right after supervision!
Most helpful comment
@nktpro It's possible, as the implementation of
interruptis changing completely.If not, however, I'll try to fix this right after supervision!