We want to store the HyperLogLog bitmap state itself, which makes secondary aggregation convenient. Presto already provides approx_distinct, which returns only the final estimate, so I am sharing this suggestion here.
NAME
agg_bitmap,sum_bitmap,merge_bitmap,value_bitmap
TYPE
**UDAF**
agg_bitmap(string) -> varbinary
sum_bitmap(varbinary) -> long
merge_bitmap(varbinary) -> varbinary
**UDF**
value_bitmap(varbinary) -> long
**example**
approx_distinct(string) -> long
NULL HANDLING
If input is NULL, output NULL
Note that if input is an empty string, its hash value is not NULL and is well-defined.
demo:
agg_bitmap:
@InputFunction
public static void input(@AggregationState HyperLogLogState state, @SqlType(StandardTypes.VARCHAR) Slice value)
{
    HyperLogLog hll = getOrCreateHyperLogLog(state, maxStandardError);
    // Subtract the old size, then add the new size, so memory accounting tracks growth.
    state.addMemoryUsage(-hll.estimatedInMemorySize());
    hll.add(value);
    state.addMemoryUsage(hll.estimatedInMemorySize());
}

@CombineFunction
public static void combine(@AggregationState HyperLogLogState state, @AggregationState HyperLogLogState otherState)
{
    HyperLogLog input = otherState.getHyperLogLog();
    HyperLogLog previous = state.getHyperLogLog();
    if (previous == null) {
        // Nothing accumulated yet: adopt the other partial state.
        state.setHyperLogLog(input);
        state.addMemoryUsage(input.estimatedInMemorySize());
    }
    else {
        state.addMemoryUsage(-previous.estimatedInMemorySize());
        previous.mergeWith(input);
        state.addMemoryUsage(previous.estimatedInMemorySize());
    }
}

@OutputFunction(VARBINARY)
public static void output(@AggregationState HyperLogLogState state, BlockBuilder out)
{
    final HyperLogLog bitMap = state.getHyperLogLog();
    if (bitMap == null) {
        out.appendNull();
    }
    else {
        // The declared output type is VARBINARY, so write with the matching type.
        VarbinaryType.VARBINARY.writeSlice(out, bitMap.serialize());
    }
}
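To illustrate why keeping the serialized state helps with secondary aggregation, here is a minimal, self-contained sketch. `ToyHll` is a hypothetical toy class written for this example, not the airlift `HyperLogLog` used in the demo above, and its hash function and byte layout are stand-ins chosen for brevity. It only shows the property that `combine` relies on: merging two stored states gives the same registers as sketching the union of the inputs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical toy sketch for illustration only; not Presto's implementation.
final class ToyHll
{
    private static final int INDEX_BITS = 10;
    private static final int BUCKETS = 1 << INDEX_BITS;
    private final byte[] registers = new byte[BUCKETS];

    // Stand-in 64-bit hash (FNV-1a plus a finalizer); well-defined for "" too.
    private static long hash64(String value)
    {
        long h = 0xcbf29ce484222325L;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    void add(String value)
    {
        long hash = hash64(value);
        int bucket = (int) (hash >>> (64 - INDEX_BITS));
        // Rank = position of the first 1 bit in the remaining 54 bits.
        int rank = Math.min(Long.numberOfLeadingZeros(hash << INDEX_BITS), 64 - INDEX_BITS) + 1;
        registers[bucket] = (byte) Math.max(registers[bucket], rank);
    }

    // Merging is a register-wise maximum, so it is order-independent.
    void mergeWith(ToyHll other)
    {
        for (int i = 0; i < BUCKETS; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    long cardinality()
    {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / BUCKETS);
        double estimate = alpha * BUCKETS * BUCKETS / sum;
        if (estimate <= 2.5 * BUCKETS && zeros > 0) {
            // Small-range correction (linear counting).
            estimate = BUCKETS * Math.log((double) BUCKETS / zeros);
        }
        return Math.round(estimate);
    }

    // The "bitmap state" that would be stored in a varbinary column.
    byte[] serialize()
    {
        return registers.clone();
    }

    static ToyHll deserialize(byte[] bytes)
    {
        ToyHll hll = new ToyHll();
        System.arraycopy(bytes, 0, hll.registers, 0, BUCKETS);
        return hll;
    }

    public static void main(String[] args)
    {
        ToyHll day1 = new ToyHll();
        ToyHll day2 = new ToyHll();
        for (int i = 0; i < 1000; i++) {
            day1.add("user-" + i);
        }
        for (int i = 500; i < 1500; i++) {
            day2.add("user-" + i);
        }
        // Secondary aggregation: merge the stored daily states, then estimate.
        ToyHll total = deserialize(day1.serialize());
        total.mergeWith(deserialize(day2.serialize()));
        System.out.println("approximate distinct users: " + total.cardinality());
    }
}
```

In terms of the proposed functions, `serialize()` plays the role of `agg_bitmap`, `mergeWith` the role of `merge_bitmap`, and `cardinality()` the role of `value_bitmap`.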
That's awesome, dude!
Presto already has functions to do this. See:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/HyperLogLogFunctions.java
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/HyperLogLogOperators.java
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/aggregation/MergeHyperLogLogAggregation.java
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/aggregation/ApproximateSetAggregation.java
We need to document these
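@martint @electrum Thank you both for following up on this. I have two small points.
1.
In Presto 0.189, writing the varbinary bitmap state through the JDBC sink has a bug. The cause is as follows: RecordPageSink sends every Slice-backed value to appendString, and JdbcRecordSink then decodes those bytes as a UTF-8 string, which is not safe for binary data.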
class RecordPageSink:
else if (type.getJavaType() == Slice.class) {
    recordSink.appendString(type.getSlice(block, position).getBytes());
}
class JdbcRecordSink:
@Override
public void appendString(byte[] value)
{
    try {
        statement.setString(next(), new String(value, UTF_8));
    }
    catch (SQLException e) {
        throw new PrestoException(JDBC_ERROR, e);
    }
}
2.
The bitmap state is not supported by other tools such as Hive, Spark, or MySQL.
I very much hope the HyperLogLog implementation you provide can be supported by more tools.
Here is my attempt at Hive compatibility:
https://github.com/ideal-hp/hive-udf/blob/master/hive-udfs/src/main/java/cn/ideal/hive/udf/agg/MyHyperLogLogAggBitMap.java
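Regarding the first point, the following standalone sketch shows why passing binary data through `new String(value, UTF_8)` is lossy. `Utf8RoundTripDemo` and `roundTripAsString` are hypothetical names for this example. The re-encoding step assumes the driver writes the string back as UTF-8, which depends on the actual JDBC driver and connection settings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Presto.
final class Utf8RoundTripDemo
{
    // Mimics the sink: decode the raw bytes as UTF-8 text, then encode the
    // resulting string again (assumed here to be UTF-8 on the driver side).
    static byte[] roundTripAsString(byte[] value)
    {
        return new String(value, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        // 0xFF and a lone 0x80 are not valid UTF-8, so decoding replaces them.
        byte[] binary = {(byte) 0xFF, (byte) 0x80, 0x01, 0x02};
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);

        System.out.println("binary survives: " + Arrays.equals(binary, roundTripAsString(binary)));
        System.out.println("text survives:   " + Arrays.equals(text, roundTripAsString(text)));
    }
}
```

Valid UTF-8 text survives the round trip, but the binary payload does not, because malformed byte sequences are replaced during decoding. A serialized HyperLogLog state is arbitrary binary data, so it would be affected in the same way.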
The JDBC issue should be fixed in later versions. We removed the RecordSink interface.