A lot of the method implementations in Validated are extremely inefficient, which isn't ideal for projects like circe that rely on it extensively and are aiming for reasonable performance. For example, given this simplified definition (where eqF is the current === and eqP is a lightly optimized version):
import cats.kernel.Eq
sealed abstract class Validated[+E, +A] extends Product with Serializable {
import Validated.{Invalid, Valid}
def fold[B](fe: E => B, fa: A => B): B = this match {
case Invalid(e) => fe(e)
case Valid(a) => fa(a)
}
def eqF[EE >: E, AA >: A](that: Validated[EE, AA])
(implicit EE: Eq[EE], AA: Eq[AA]): Boolean = fold(
a => that.fold(EE.eqv(a, _), _ => false),
b => that.fold(_ => false, AA.eqv(b, _))
)
def eqP[EE >: E, AA >: A](that: Validated[EE, AA])
(implicit EE: Eq[EE], AA: Eq[AA]): Boolean = this match {
case Invalid(e) => that match {
case Invalid(ee) => EE.eqv(e, ee)
case _ => false
}
case Valid(a) => that match {
case Valid(aa) => AA.eqv(a, aa)
case _ => false
}
}
}
object Validated {
final case class Valid[+A](a: A) extends Validated[Nothing, A]
final case class Invalid[+E](e: E) extends Validated[E, Nothing]
}
And a benchmark like this:
import cats.kernel.instances.all._
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
@State(Scope.Thread)
class ValidatedBenchmark {
val a: Validated[String, Int] = Validated.Valid(1)
val b: Validated[String, Int] = Validated.Valid(2)
val c: Validated[String, Int] = Validated.Invalid("bad")
val d: Validated[String, Int] = Validated.Invalid("worse")
@Benchmark def eqF: Boolean =
a.eqF(a) && !a.eqF(b) && !a.eqF(c) && c.eqF(c) && !c.eqF(d) && !c.eqF(a)
@Benchmark def eqP: Boolean =
a.eqP(a) && !a.eqP(b) && !a.eqP(c) && c.eqP(c) && !c.eqP(d) && !c.eqP(a)
}
The pattern matching implementation gets over four times the throughput on 2.12:
Benchmark Mode Cnt Score Error Units
ValidatedBenchmark.eqF thrpt 40 27131124.981 ± 164113.218 ops/s
ValidatedBenchmark.eqP thrpt 40 118711782.519 ± 1166236.166 ops/s
The situation is even worse on 2.11, where the difference is a little larger and the fold version also involves a lot of unnecessary allocations:
Benchmark Mode Cnt Score Error Units
ValidatedBenchmark.eqF:gc.alloc.rate.norm thrpt 5 144.000 ± 0.001 B/op
ValidatedBenchmark.eqP:gc.alloc.rate.norm thrpt 5 ≈ 10⁻⁵ B/op
For most use cases these differences are unlikely to matter at all, but it seems like a shame to impose this extra cost on everyone just for the sake of slightly more concise implementations.
(There are similar issues in IndexedStateT, etc., but Validated is the one piece I personally really care that much about, because of how it affects circe.)
👍 , it's the reasoning we took in https://github.com/typelevel/cats/pull/1289 as well
What if fold is an abstract method in Validated and implemented in Invalid and Valid? (we would trade pattern matching for dynamic dispatch)
+1
In general core library code in my view should have a nice API but then be
implemented as efficiently as possible. I’d love to see us use the highest
performance approach we can find.
On Thu, Oct 5, 2017 at 21:10 Julien Richard-Foy notifications@github.com
wrote:
What if fold is an abstract method in Validated and implemented in Invalid
and Valid? (we would trade pattern matching for dynamic dispatch)—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/typelevel/cats/issues/1951#issuecomment-334676701,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEJdkvM7HPVPALDnEoC2dUPjC-cRCTlks5spdJbgaJpZM4PwDNP
.>
P. Oscar Boykin, Ph.D. | http://twitter.com/posco | http://pobox.com/~boykin
It will be interesting to benchmark @julienrf 's abstract fold approach.
@julienrf @kailuowang I don't think pattern matching is the real culprit here—it's more the overhead of the functions. Making fold abstract does seem to help a bit on 2.12:
Benchmark Mode Cnt Score Error Units
ValidatedBenchmark.eqF thrpt 40 26938203.527 ± 189709.343 ops/s
ValidatedBenchmark.eqF2 thrpt 40 29927603.497 ± 199789.080 ops/s
ValidatedBenchmark.eqP thrpt 40 118540469.078 ± 754414.614 ops/s
…but there's no measurable difference on 2.11 and the allocation situation is the same on both.
functions, I've always heard, are very hard on the jit due to them being megamorphic. So, one of the best optimizations you can often do in scala is remove a function and replace with literal code that the JIT won't skip over.
I would love if this could be solved in the future with dotty linker, or changes to the JIT to better handle lambdas (maybe java will start to care about this now that they have lambdas).
How can we apply that in our context?
@julienrf I think just rewriting fold in all of the implementations to use pattern matching (as in my example above) is a reasonable way to avoid unnecessary functions.
Users can decide whether for their own cases which approach they want. For example I use Json#fold all the time in application code that depends on circe, but don't use it anywhere internally in circe itself.
Most helpful comment
+1
In general core library code in my view should have a nice API but then be
implemented as efficiently as possible. I’d love to see us use the highest
performance approach we can find.
On Thu, Oct 5, 2017 at 21:10 Julien Richard-Foy notifications@github.com
wrote:
P. Oscar Boykin, Ph.D. | http://twitter.com/posco | http://pobox.com/~boykin