Hello,
I tried to poke around on the documentation for rough guidance on the effect of (size,fpp) on memory usage on the bloom filter, but couldn't find anything. Two main questions:
Mainly it seems to work so good, I can't get noticeable numbers on it :)
But this will be in a library, incorporated into a long-running process with lots and lots of requests, so it would be good to know, before I crash something with an OOME one day.
I did notice that, based on some other issues filed, it does look like there are basically two types of allocations:
Mainly I'm wondering if this is a conscious choice or something that will likely change.
Just to update, I did a rough review of the code, and I don't see anything re-allocating the BitArray to grow the filter if the FP rate shoots up or something similar. I've don't recall hearing of an implementation guava or elsewhere that does this, so unless there is some feedback that guava may do this in the future, I'll close this out myself in a little bit.
As far as the used memory, it doesn't look like anything tricky is going on here:
It just implements the formula from wikipedia:
http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives
Or:
(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
where p is the FP rate, and n is the expected number of items.
Ok, I went ahead and just grabbed the code above and plopped it in this silly program, because I was lazy to calculate estimates:
import com.google.common.math.LongMath;
import com.google.common.primitives.Ints;
import java.math.BigInteger;
import java.math.RoundingMode;
import java.util.Arrays;
import java.util.List;
/**
* Created by stuart.smith on 7/7/16.
*/
public class CalculateBloomBytes {
public static void main( String[] args ) {
List<Long> expectedSize = Arrays.asList(
1000L, 1000L*1000L, 1000L*1000L*1000L, 10L*1000L*1000L*1000L, 100L*1000L*1000L*1000L
);
List<Double> fpRate = Arrays.asList(
0.1,
0.01,
0.001,
0.0001,
0.00001
);
for( double rate : fpRate ) {
System.out.println("** FP Rate: " + rate * 100 + "% (" + rate + ")");
for( long size : expectedSize ) {
System.out.format("\t**Expected size: %,d\n", size);
long numBits = (long) (-size * Math.log(rate) / (Math.log(2) * Math.log(2)));
long longArraySize;
try {
longArraySize = Ints.checkedCast(
LongMath.divide(numBits, 64, RoundingMode.CEILING)
);
} catch( IllegalArgumentException e ) {
System.out.println("\t\tBloom Filter cannot handle these settings.");
continue;
}
BigInteger numBytes = new BigInteger(longArraySize+"").multiply(new BigInteger(Long.BYTES+""));
double numKilobytes = numBytes.doubleValue() / (1024.0 * 1024.0);
System.out.format("\t\tSpace allocated: %.3f MB, (%,d bytes)\n", numKilobytes, numBytes);
System.out.println("\t\tAssuming well-behaved clients, max number of FPs: " + rate * size);
}
}
}
Output below
So, some rough idea of the space taken:
** FP Rate: 10.0% (0.1)
**Expected size: 1,000
Space allocated: 0.001 MB, (600 bytes)
Assuming well-behaved clients, max number of FPs: 100.0
**Expected size: 1,000,000
Space allocated: 0.571 MB, (599,072 bytes)
Assuming well-behaved clients, max number of FPs: 100000.0
**Expected size: 1,000,000,000
Space allocated: 571.314 MB, (599,066,152 bytes)
Assuming well-behaved clients, max number of FPs: 1.0E8
**Expected size: 10,000,000,000
Space allocated: 5713.140 MB, (5,990,661,488 bytes)
Assuming well-behaved clients, max number of FPs: 1.0E9
**Expected size: 100,000,000,000
Bloom Filter cannot handle these settings.
** FP Rate: 1.0% (0.01)
**Expected size: 1,000
Space allocated: 0.001 MB, (1,200 bytes)
Assuming well-behaved clients, max number of FPs: 10.0
**Expected size: 1,000,000
Space allocated: 1.143 MB, (1,198,136 bytes)
Assuming well-behaved clients, max number of FPs: 10000.0
**Expected size: 1,000,000,000
Space allocated: 1142.628 MB, (1,198,132,304 bytes)
Assuming well-behaved clients, max number of FPs: 1.0E7
**Expected size: 10,000,000,000
Space allocated: 11426.280 MB, (11,981,322,976 bytes)
Assuming well-behaved clients, max number of FPs: 1.0E8
**Expected size: 100,000,000,000
Bloom Filter cannot handle these settings.
** FP Rate: 0.1% (0.001)
**Expected size: 1,000
Space allocated: 0.002 MB, (1,800 bytes)
Assuming well-behaved clients, max number of FPs: 1.0
**Expected size: 1,000,000
Space allocated: 1.714 MB, (1,797,200 bytes)
Assuming well-behaved clients, max number of FPs: 1000.0
**Expected size: 1,000,000,000
Space allocated: 1713.942 MB, (1,797,198,448 bytes)
Assuming well-behaved clients, max number of FPs: 1000000.0
**Expected size: 10,000,000,000
Bloom Filter cannot handle these settings.
**Expected size: 100,000,000,000
Bloom Filter cannot handle these settings.
** FP Rate: 0.01% (1.0E-4)
**Expected size: 1,000
Space allocated: 0.002 MB, (2,400 bytes)
Assuming well-behaved clients, max number of FPs: 0.1
**Expected size: 1,000,000
Space allocated: 2.285 MB, (2,396,272 bytes)
Assuming well-behaved clients, max number of FPs: 100.0
**Expected size: 1,000,000,000
Space allocated: 2285.256 MB, (2,396,264,600 bytes)
Assuming well-behaved clients, max number of FPs: 100000.0
**Expected size: 10,000,000,000
Bloom Filter cannot handle these settings.
**Expected size: 100,000,000,000
Bloom Filter cannot handle these settings.
** FP Rate: 0.001% (1.0E-5)
**Expected size: 1,000
Space allocated: 0.003 MB, (3,000 bytes)
Assuming well-behaved clients, max number of FPs: 0.01
**Expected size: 1,000,000
Space allocated: 2.857 MB, (2,995,336 bytes)
Assuming well-behaved clients, max number of FPs: 10.0
**Expected size: 1,000,000,000
Space allocated: 2856.570 MB, (2,995,330,744 bytes)
Assuming well-behaved clients, max number of FPs: 10000.0
**Expected size: 10,000,000,000
Bloom Filter cannot handle these settings.
**Expected size: 100,000,000,000
Bloom Filter cannot handle these settings.
Closed it out - although it would be nice to move this into the documentation proper.
Oh, and a graph :)
Most helpful comment
So, some rough idea of the space taken: