Hello, I spotted a weird behaviour of partition method.
I include code examples from Java and Scala which I used to make the comparison.
Java
public class PartitionVavr {
private static final Map<String, Eat> fruitsBeingEaten = new HashMap<>();
public static void main(String[] args) {
final Set<String> fruitsToEat = HashSet.of("apple", "banana");
final var partition = fruitsToEat.partition(PartitionVavr::biteAndCheck);
System.out.println(partition);
}
private static boolean biteAndCheck(String name) {
final var eat = fruitsBeingEaten.getOrDefault(name, Eat.prepare(name)).bite();
fruitsBeingEaten.put(name, eat);
return eat.isEaten();
}
private static class Eat {
final int bitesLimit = 2;
final int bites;
final String name;
public static Eat prepare(String name) {
return new Eat(0, name);
}
private Eat(int bites, String name) {
this.bites = bites;
this.name = name;
}
public Eat bite() {
return new Eat(bites + 1, name);
}
public boolean isEaten() {
return bites >= bitesLimit;
}
}
}
Scala
object PartitionScala extends App {
private val fruitsBeingEaten: mutable.Map[String, Eat] = mutable.Map()
private val fruitsToEat = Set("apple", "banana")
private val partition = fruitsToEat.partition(name => biteAndCheck(name))
println(partition)
private def biteAndCheck(name: String): Boolean = {
val eat = fruitsBeingEaten.getOrElse(name, Eat.prepare(name)).bite
fruitsBeingEaten += (name -> eat)
eat.isEaten
}
private object Eat {
def prepare(name: String) = new Eat(0, name)
}
private class Eat private(val bites: Int, val name: String) {
final private val bitesLimit = 2
def bite = new Eat(bites + 1, name)
def isEaten: Boolean = bites >= bitesLimit
}
}
The Java output is:
(HashSet(), HashSet())
The Scala output is:
(Set(),Set(apple, banana))
I think the difference is because vavr traverses the collection and checks the predicate twice while scala does it once.
Is this a bug or expected behaviour?
I used vavr 0.10.2
You are right! We need to change that.
Generally, we focus on immutable objects. Your case is very special, because the predicate also mutates the object on every call. Normally, you would expect that a predicate reflects a state without changing the state.
But I understand your concern, our goal is to align to Scala.
The affected code lines in HashSet are:
@Override
default Tuple2<Iterator<T>, Iterator<T>> partition(Predicate<? super T> predicate) {
Objects.requireNonNull(predicate, "predicate is null");
if (!hasNext()) {
return Tuple.of(empty(), empty());
} else {
final Stream<T> that = Stream.ofAll(this);
final Iterator<T> first = that.iterator().filter(predicate);
final Iterator<T> second = that.iterator().filter(predicate.negate());
return Tuple.of(first, second);
}
}
Here we filter twice. We should do it in one turn.
@kefasb, thanks for reporting and @danieldietrich, thanks for you hints. I sent PR #2577 to fix the problem.
Thanks I will take a look!
By default, Scala's Iterable type, the base of all collections, provides a default implementation of partition that lazily iterates all elements _twice_. It is based on views:
def partition(p: A => Boolean): (C, C) = {
val first = new View.Filter(this, p, false)
val second = new View.Filter(this, p, true)
(fromSpecific(first), fromSpecific(second))
}
Scala has so-called _strict_ and _lazy_ collections. The docs state that strict collections provide their own implementation in order to ensure that elements are iterated only twice:
* The default implementation provided here needs to traverse the collection twice.
* Strict collections have an overridden version of `partition` in `StrictOptimizedIterableOps`,
* which requires only a single traversal.
Because Scala's Stream is already lazy, the default implementation of partition is overridden, based of filter and filterNot:
override def partition(p: A => Boolean): (Stream[A], Stream[A]) = (filter(p(_)), filterNot(p(_)))
As pointed out by you, we already do the same because we align to Scala:
@Override
public final Tuple2<Stream<T>, Stream<T>> partition(Predicate<? super T> predicate) {
Objects.requireNonNull(predicate, "predicate is null");
return Tuple.of(filter(predicate), filter(predicate.negate()));
}
Scala's Iterator is also special, because its element can be iterated only once. Therefore it lazily duplicates the elements and then filters both resulting iterators, very much like we saw in Stream:
def partition(p: A => Boolean): (Iterator[A], Iterator[A]) = {
val (a, b) = duplicate
(a filter p, b filterNot p)
}
Scala's Iterator.duplicate method isn't trivial.
I think it is sufficient to take the following actions:
Iterator and Stream for strictness in the way that they override the default Iterable.partition implementation. We should align to that behavior. Scala's LazyList is currently out of scope because we don't have it, yet.@danieldietrich , I'm a bit lost and I need your help on this issue. So far, I created some tests to compare the behavior between Scala 2.13.2 and VAVR 1.0.0-SNAPSHOT, but I don't know which direction to continue.
On Scala side, I created a Git repo and added a test Issue2559Spec.scala to demonstrate the behavior of different Scala classes in Scala 2.13.2. I never used Scala myself and maybe more test-cases need to be added. From that file, we can see that:
StrictOptimizedIterableOps, predicate is used exactly N times for partitioning the values, where N is the size of the values.Array, Steam, LazyList are not strict. In Array, predicate is used N times as well. In Stream and LazyList, predicate is used N * 2 times. Array uses N times because the partition method is overridden by ArrayOps.scala (source code).On VAVR side, I added a test in my PR as commit https://github.com/vavr-io/vavr/pull/2577/commits/42117bf323ca26376972ead5a3e243414c701280 to demonstrate the behavior of partition() on different final classes. Predicate is used exactly N times for all the classes, except Stream, where it is used N * 2 times. So the behavior is similar to Scala.
Several things confuse me here:
partition() in particular. In my PR, I changed the implementation of Iterator, but none of the tests failed, I don't know what is expected.partition? Should we create StrictOptimizedIterableOps? This is not trivial...Personally I would like to use the one-iteration implementation as the default partition strategy. And override this strategy in Stream (and LazyList later on). It means keeping my PR https://github.com/vavr-io/vavr/pull/2577 as it is. But the biggest concerns I have are the laziness and the partition-strategy alignment as stated above.
Thank you for your analysis! That's interesting. We do it as follows:
How to test the laziness?
We could create an infinite Stream of 0, 1, 0, 1, 0, 1, ... and partition it. If partition isn't lazy, the test would run forever, which is ok. We attribute that test with @Test(timeout=5000) or something similar.
Should we create StrictOptimizedIterableOps?
No, I see currently no use-case. We already have Traversable.isLazy(), which means: lazy := !strict. So we need no special interface in order to reflect the strictness property.
where should I put the optimized, one-iteration implementation of partition?
Sorry, I misunderstood the Scala implementation. There, the duplicate method does not solve the single iteration problem. It just ensures that the underlying Iterator can be iterated twice. We need that too, it will solve one task of #1945.
duplicate to Iteratorpublic interface Iterator<T> {
@Override
default Tuple2<Iterator<T>, Iterator<T>> partition(Predicate<? super T> predicate) {
Objects.requireNonNull(predicate, "predicate is null");
if (!hasNext()) {
return Tuple.of(empty(), empty());
} else {
final Tuple<Iterator<T>, Iterator<T>> dup = duplicate();
return Tuple.of(dup._1.filter(p), dup._2.filterNot(p));
}
}
}
interface IteratorModule {
static <T> Tuple2<Iterator<T>, Iterator<T>> duplicate(Iterator<T> iter) {
// see https://github.com/scala/scala/blob/v2.13.2/src/library/scala/collection/Iterator.scala#L839-L884
}
}
Hint: We keep the existing Stream.partition() method.
Many already have it, for example Array:
@Override
public Tuple2<Array<T>, Array<T>> partition(Predicate<? super T> predicate) {
Objects.requireNonNull(predicate, "predicate is null");
final java.util.List<T> left = new ArrayList<>(), right = new ArrayList<>();
for (T t : this) {
(predicate.test(t) ? left : right).add(t);
}
return Tuple.of(ofAll(left), ofAll(right));
}
I suggest to move that functionality to Collections.java:
// caution, this is from the top of my head
class Collections {
static <C> Tuple2<C, C> partition(Iterable<T> iterable, Predicate<? super T> predicate, Function<Iterable<T>, C> ofAll) {
Objects.requireNonNull(predicate, "predicate is null");
final java.util.List<T> left = new ArrayList<>(), right = new ArrayList<>();
for (T t : this) {
(predicate.test(t) ? left : right).add(t);
}
return Tuple.of(ofAll(left), ofAll(right));
}
}
Then Array and all other strict collections would be implemented like this:
class Array<T> {
@Override
public Tuple2<Array<T>, Array<T>> partition(Predicate<? super T> predicate) {
return Collections.partition(this, Array::ofAll, predicate);
}
}
I think this should also work for Maps:
class HashMap<K, V> {
@Override
public Tuple2<HashMap<K, V>, HashMap<K, V>> partition(Predicate<? super Tuple2<K, V>> predicate) {
return Collections.partition(this, predicate, this::createFromEntries);
}
}
If so, the partition method in the utility class Maps.java can be deleted.
Please don't hesitate to ask on any further questions!
Thanks for your help!
- Daniel
Thanks for your hints 馃憖 , Daniel! I will continue the work on it and ping you when I have progress. I'm not sure I can finish it today. If not, we may need to wait until next weekend.
@mincong-h thanks! you don't need to rush