Guava: BloomFilter - lightweight for ready for use in a m/r environment (spark, hadoop, stream)

Created on 9 May 2017 · 4Comments · Source: google/guava

BloomFilter should follow the same philosophy as the Java Collections FrameWork: lightweight - no built-in thread-safety. When building BloomFilters we should focus on building it through reducing:

(bf,value)-> { bf.put(value); return bf; }
(bfl,bfr) -> { bfl.putAll(bfr); return  bfl; }

During this reduction process, which happens in parallel, at no point in time we need the properties of thread-safety.

Nice opportunity to write a java.util.stream.Collector that helps us reducing this straight out of Java 8 streams.

For reference - please see http://stackoverflow.com/questions/31788714/reducing-with-a-bloom-filter

P3 platform=java8 type=addition type=enhancement

Source

jdesmet

All 4 comments

We just answered a feature request to make BloomFilters thread-safe in https://github.com/google/guava/pull/2761 , I think less than a week ago. We found those arguments compelling: it doesn't seem to incur significant overhead; it was a pretty straight conversion from a long[] to an AtomicLongArray.

It does seem reasonable that we should expose a Java 8 collector, though, which we can give the Collector.Characteristics.CONCURRENT property, which would allow it to benefit from concurrency when used with parallel streams. That makes tons of sense to me; I think we'd just have to figure out the API and method naming.

lowasser on 9 May 2017

👍1

Argh, ninja'd by @lowasser. :wink:

Anyhoo, here is what I was going to say:

Hi @jdesmet, I think it'll take a _lot_ of convincing towards the Guava team to drop thread-safety in BloomFilter, because they only just recently accepted a huge PR to allow BloomFilter to be thread-safe (https://github.com/google/guava/pull/2761).

I admit I don't understand why "reducing" is any better than BloomFilter's current way of doing things. Can you elaborate further? Also, do you have any numbers to prove that BloomFilter as it is is not fast enough?

jbduncan on 9 May 2017

👍1

Of course I did not have any concrete evidence that non-thred-safe would be faster. That was just assuming by the way the Collections Framework was laid out and documented, and lack of better researching myself. I trust the findings in #2761. Interesting topic thread.

On the other hand - there will be always some level of distrust, maybe also in the correctness as extra levels of complexity are introduced. After all I had this first reaction to the @ThreadSafe tagging.

"Reducing" was quoted as a method for building the BloomFilter, because it only requires the properties of a monoid, and does not require thread-safety of the underlying data structures being reduced. Otherwise, if no additional overhead, I believe that current way BloomFilter works can be properly wrapped in any reducer. All I need is the put and putAll.

jdesmet on 9 May 2017

👍1

The correctness is fudged because Guava's BloomFilter doesn't have a clear() method and, fundamentally, the data structure is probabilistic. That meant the concurrent version could be low overhead because races are benign.

Had a clear() method been in the API, then a more invasive rewrite would have been required to have satisfactory performance. In that case a Bloom-1 design would have been appropriate. But most likely the proposal of thread safety would have been rejected as too invasive and not driven by an internal need.