I noticed that these quantile functions on a qdigest compute the percentile values by looping over the provided percentiles array argument and calling the underlying airlift qdigest method getQuantile (singular).
However, the airlift qdigest object has a plural version of this method getQuantiles which seems to be more efficient because it only traverses the qdigest tree once for a given array of quantiles.
I haven't had a chance to quantify this optimisation yet, but wanted to gauge whether there'd be any opposition to a pull request on this topic?
Thanks
CC: @mehrdad-honarkhah @tdcmeehan
Good idea!
considering that the size of the Q-Digest is a function of error approximation (and smaller errors require a larger tree; where the final space requirement is ~3k, with k being the compression factor), I think it can save computations slightly. A binary representation of the tree with default params can be 150KBs so post-order traversing it multiple times is less optimal than once.
However, the main part of the computation is calculation of Q-Digest itself. So in comparison, this is less of an speed boost to the overall operation. That being said and considering that Q-Digest data structure has recently been exposed as a data type, those tables that end up storing the digests for later extraction of quantiles would greatly take advantage of this speed increase as it is the main and only operation that is being performed in such situations.
Overall, as long as it is there, we should take advantage. Every optimization counts especially since the function already exists.
Thanks for the detailed reply! I'll have a look at implementing it this week and see what kind of real-world speed-up it produces
I agree with @mehrdad-honarkhah, and we'll be happy to review a pull request for this optimization.
Please be aware, the plural version of the function validates that the quantiles are sorted in ascending order. To preserve compatibility with approx_percentile and existing users of values_at_quantiles, we don't want to introduce that validation to the values_at_quantiles function. I think it would also be worthwhile to port this over to approx_percentile, as it does the same thing currently.
I did some perf testing on this and interestingly found no measurable difference in the two approaches. I tested on the TPCH data on several scale factors, evaluating 20 percentiles with the two approaches outlined in the issue summary. It seems the qdigest building is so dominant that extracting the percentiles has trivial cost, regardless of how it is done.
Most helpful comment
I did some perf testing on this and interestingly found no measurable difference in the two approaches. I tested on the TPCH data on several scale factors, evaluating 20 percentiles with the two approaches outlined in the issue summary. It seems the qdigest building is so dominant that extracting the percentiles has trivial cost, regardless of how it is done.