Zio: Interrupt leak / deadlock bug under high concurrency

Created on 23 Apr 2019 · 4 Comments · Source: zio/zio

As of v1.0-RC4, there's a possible non-deterministic interruption leak / deadlock bug under high concurrency. I've started seeing interrupt IOs that permanently block once every thousand calls in the context of a large working app.

I've managed to create a bare minimum, standalone reproducer:

import java.util.concurrent.atomic.LongAdder
import scalaz.zio.console._
import scalaz.zio.{Task, UIO, ZIO, ZSchedule}
import scalaz.zio.duration._

object ZioInterruptLeakOrDeadlockRepo extends scalaz.zio.App {
  def run(args: List[String]): ZIO[Environment, Nothing, Int] = {
    val work = UIO.succeed("whatever")

    // Simulates an external "kill-switch"
    val abort = Task.unit.delay(1.minute) 

    val completedCounter = new LongAdder
    val pendingGauge = new LongAdder
    val timeoutCounter = new LongAdder

    val leakOrDeadlockTest = for {
      _ <- UIO(pendingGauge.increment())
      valueFib <- work.fork
      abortFib <- (abort *> valueFib.interrupt).fork
      _ <- valueFib.join

      // This blocks / leaks once every 20,000 rounds or so
      _ <- abortFib.interrupt 

      _ <- UIO(pendingGauge.decrement())
    } yield ()

    val main = for {
      _ <- UIO(s"completed=${completedCounter.longValue()} pending=${pendingGauge.longValue()} timed-out=${timeoutCounter.longValue()}")
        .flatMap(putStrLn)
        .repeat(ZSchedule.fixed(1.second))
        .fork

      _ <- ZIO
        .foreachPar(1 to Runtime.getRuntime.availableProcessors()) { _ =>
          leakOrDeadlockTest
            .timeout(5.seconds)
            .repeat(ZSchedule.identity[Option[Unit]].logInput {
              case Some(_) => UIO(completedCounter.increment())
              case None => UIO(timeoutCounter.increment())
            })
        }
    } yield 0

    main
      .catchAll(e => UIO(println(s"Died with $e")) *> UIO.succeed(1))
  }
}

Expected:

  • Steady, relatively constant pending count, ~= the number of system threads.
  • No task timeouts (a timeout here means a single round exceeded 5 seconds)

Actual:

  • Pending task count gradually increases, as does the number of timed-out tasks.

It's worth noting that by reducing the CPU load, for example with:

val work = UIO.succeed("whatever").delay(1.milli)

this issue would not be visible. So far I've only been able to reproduce this under high CPU utilization, which might indicate it's a scheduling fairness problem?

bug rts

All 4 comments

@nkpart Does this happen in RC3?

As far as I can tell at first glance, interrupting abortFib should never block (abortFib itself obviously blocks on valueFib, but there are no finalizers there that could delay interruption indefinitely). That means it's a bug and will be fixed. Thanks for your work minimizing it!

@jdegoes I tested this back to RC1 and could still reproduce it.

Also, interestingly, if we fork abortFib.interrupt, i.e.:

for {
  ...
  _ <- (abortFib.interrupt *> UIO(pendingGauge.decrement())).fork
}

Then everything is fine. The interrupt happens asynchronously but is also practically instantaneous. pendingGauge always remains <= the number of threads, even after an hour-long load test at 100% CPU, indicating there's no blocking or leaking whatsoever.
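
For reference, here is a minimal sketch of what the full round might look like with that change applied. It reuses the definitions from the reproducer and assumes the same scalaz.zio 1.0-RC4 API. The names ForkedInterruptSketch and forkedInterruptTest are made up for illustration. This only restates the workaround described above; it is not a fix for the underlying bug.

```scala
import java.util.concurrent.atomic.LongAdder
import scalaz.zio.{Task, UIO}
import scalaz.zio.duration._

// Hypothetical wrapper object, not part of the original reproducer
object ForkedInterruptSketch {
  val pendingGauge = new LongAdder

  val work = UIO.succeed("whatever")

  // Simulates an external "kill-switch", as in the reproducer
  val abort = Task.unit.delay(1.minute)

  // Same steps as leakOrDeadlockTest, except that the interrupt of abortFib
  // (and the gauge decrement) run on their own fiber instead of inline
  val forkedInterruptTest = for {
    _        <- UIO(pendingGauge.increment())
    valueFib <- work.fork
    abortFib <- (abort *> valueFib.interrupt).fork
    _        <- valueFib.join
    _        <- (abortFib.interrupt *> UIO(pendingGauge.decrement())).fork
  } yield ()
}
```

Note that with this variant the round completes without waiting for abortFib to finish interrupting, so the gauge is decremented slightly after the round itself returns.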

@jdegoes It's still reproducible as of v1.0.0-RC10-1, do you think #1081 will address this?

@nktpro It's possible, as the implementation of interrupt is changing completely.

If not, however, I'll try to fix this right after supervision!
