Presto: New UDAF: HyperLogLog bitmap state related

Created on 7 Jun 2018 · 7 Comments · Source: prestodb/presto

We want to store the HyperLogLog bitmap state so that it can be aggregated again later (secondary aggregation). Presto already provides approx_distinct, which is built on HyperLogLog, so I am sharing this suggestion here.

NAME
agg_bitmap, sum_bitmap, merge_bitmap, value_bitmap

TYPE

**UDAF**
agg_bitmap(string) -> varbinary
sum_bitmap(varbinary) -> long
merge_bitmap(varbinary) -> varbinary

**UDF**
value_bitmap(varbinary) -> long

**example**
approx_distinct(string) -> long

NULL HANDLING
If input is NULL, output NULL
Note that if input is an empty string, its hash value is not NULL and is well-defined.
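
The secondary-aggregation idea behind these functions can be sketched with a toy HyperLogLog. This is only an illustration under stated assumptions: `ToyHll`, its MD5-based hash, and the 2048-register size are hypothetical choices, not the airlift `HyperLogLog` that Presto uses. `add` plays the role of the agg_bitmap input step, `mergeWith` the role of merge_bitmap, and `cardinality` the role of value_bitmap. The point it shows is that merging two saved states gives the same registers as sketching the combined input directly.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Toy HyperLogLog for illustration only; not airlift's or Presto's implementation.
class ToyHll
{
    static final int INDEX_BITS = 11;              // 2^11 = 2048 registers (arbitrary choice)
    static final int BUCKETS = 1 << INDEX_BITS;
    final byte[] registers = new byte[BUCKETS];

    // 64-bit hash taken from the first 8 bytes of an MD5 digest (an assumption for this sketch).
    static long hash(String value)
    {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Fold one value into the state: the top bits pick a register,
    // the rest contribute the position of their first 1 bit.
    void add(String value)
    {
        long h = hash(value);
        int bucket = (int) (h >>> (64 - INDEX_BITS));
        long rest = h << INDEX_BITS;
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - INDEX_BITS + 1);
        registers[bucket] = (byte) Math.max(registers[bucket], rank);
    }

    // Merge another state into this one: register-wise maximum.
    void mergeWith(ToyHll other)
    {
        for (int i = 0; i < BUCKETS; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    // Turn the state into a cardinality estimate (standard HyperLogLog formula).
    long cardinality()
    {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / BUCKETS);
        double estimate = alpha * BUCKETS * BUCKETS / sum;
        if (estimate <= 2.5 * BUCKETS && zeros > 0) {
            // Linear counting correction for small sets
            estimate = BUCKETS * Math.log((double) BUCKETS / zeros);
        }
        return Math.round(estimate);
    }

    public static void main(String[] args)
    {
        ToyHll day1 = new ToyHll();
        ToyHll day2 = new ToyHll();
        ToyHll both = new ToyHll();
        for (int i = 0; i < 60000; i++) {
            day1.add("user" + i);
            both.add("user" + i);
        }
        for (int i = 40000; i < 100000; i++) {
            day2.add("user" + i);
            both.add("user" + i);
        }
        day1.mergeWith(day2);   // secondary aggregation over two saved states
        System.out.println("merged == direct: " + Arrays.equals(day1.registers, both.registers));
        System.out.println("estimate for 100000 distinct users: " + day1.cardinality());
    }
}
```

Because the merge is a register-wise maximum, overlapping values (users 40000 to 59999 here) are not double counted, which is what makes storing the state more useful than storing the final count.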

All 7 comments

demo:
agg_bitmap:

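    @InputFunction
    public static void input(@AggregationState HyperLogLogState state, @SqlType(StandardTypes.VARCHAR) Slice value)
    {
        // maxStandardError is not defined in this snippet; it is assumed to be configured elsewhere
        HyperLogLog hll = getOrCreateHyperLogLog(state, maxStandardError);
        // Subtract the old size and add the new one so memory accounting tracks the sketch as it grows
        state.addMemoryUsage(-hll.estimatedInMemorySize());
        hll.add(value);
        state.addMemoryUsage(hll.estimatedInMemorySize());
    }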

    @CombineFunction
    public static void combine(@AggregationState HyperLogLogState state, @AggregationState HyperLogLogState otherState)
    {
        HyperLogLog input = otherState.getHyperLogLog();
        HyperLogLog previous = state.getHyperLogLog();
        if (previous == null) {
            state.setHyperLogLog(input);
            state.addMemoryUsage(input.estimatedInMemorySize());
        }
        else {
            state.addMemoryUsage(-previous.estimatedInMemorySize());
            previous.mergeWith(input);
            state.addMemoryUsage(previous.estimatedInMemorySize());
        }
    }

    @OutputFunction(VARBINARY)
    public static void output(@AggregationState HyperLogLogState state, BlockBuilder out)
    {
        final HyperLogLog bitMap = state.getHyperLogLog();
        if (bitMap == null) {
            out.appendNull();
        }
        else {
            // Write with the same type declared in @OutputFunction (varbinary, not varchar)
            VarbinaryType.VARBINARY.writeSlice(out, bitMap.serialize());
        }
    }

That's awesome, dude!

We need to document these

@martint @electrum Thank you both for following this issue. I have two small points.

1.
In Presto 0.189, writing the bitmap state (varbinary) to a JDBC sink is buggy. The value is Slice-backed, so RecordPageSink routes it through appendString, and JdbcRecordSink then decodes the raw bytes as a UTF-8 string, which corrupts binary data:

class RecordPageSink:

        else if (type.getJavaType() == Slice.class) {
            recordSink.appendString(type.getSlice(block, position).getBytes());
        }

class JdbcRecordSink:

    @Override
    public void appendString(byte[] value)
    {
        try {
            statement.setString(next(), new String(value, UTF_8));
        }
        catch (SQLException e) {
            throw new PrestoException(JDBC_ERROR, e);
        }
    }
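
To see why decoding binary data as a string is lossy, here is a minimal, self-contained sketch. `Utf8RoundTripDemo` and `roundTripAsString` are hypothetical names that do not exist in Presto; the method only mimics the `new String(value, UTF_8)` step above and then re-encodes the text. What a real JDBC driver does with the string afterwards may differ, but the information is already lost at the decoding step.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Presto.
class Utf8RoundTripDemo
{
    // Mimics appendString: decode the bytes as UTF-8 text, then encode the text back to bytes.
    static byte[] roundTripAsString(byte[] value)
    {
        String text = new String(value, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        // Arbitrary binary payload; 0xFF and 0x80 are not valid UTF-8 on their own,
        // so the decoder replaces them with U+FFFD.
        byte[] binary = {(byte) 0x02, (byte) 0xFF, (byte) 0x80, (byte) 0x10};
        byte[] ascii = "abc".getBytes(StandardCharsets.UTF_8);

        System.out.println("binary survives: " + Arrays.equals(binary, roundTripAsString(binary))); // false
        System.out.println("ascii survives:  " + Arrays.equals(ascii, roundTripAsString(ascii)));   // true
    }
}
```

A serialized HyperLogLog state is arbitrary bytes, so it behaves like the `binary` case rather than the `ascii` one.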

2.
The bitmap state is not supported by other tools such as Hive, Spark, or MySQL.
I very much hope that the HyperLogLog algorithm you provide can be supported by more tools.

Here is my attempt at Hive compatibility. See:
https://github.com/ideal-hp/hive-udf/blob/master/hive-udfs/src/main/java/cn/ideal/hive/udf/agg/MyHyperLogLogAggBitMap.java

The JDBC issue should be fixed in later versions. We removed the RecordSink interface.
