DataFrames.jl: describe on columns with e.g. Integers

Created on 26 May 2020 · 10 Comments · Source: JuliaData/DataFrames.jl

Right now, describe is documented with:

If a column's base type derives from Real, :nunique will return nothings.

The worry is that, in general, computing the number of distinct elements requires memory proportional to the length of the column.
There are perhaps a few ways we could tweak things:

Compute the number of unique values below some threshold
Above some threshold, use a technique like HyperLogLog to get an approximate number.

The drawback of thresholds is that it could be surprising to write code that works for a given data frame size and have it suddenly stop working when you load a bigger one. For approximate counting techniques (HyperLogLog), the drawback is that users might assume the count is always exact and allocate data structures sized to it, which could then fail when they try to insert all the distinct values.
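
As a rough illustration of the threshold option, here is a language-agnostic sketch in Python (it is not DataFrames.jl code). The function name `capped_nunique` and the `threshold` parameter are hypothetical, chosen only for this example. It counts distinct values exactly and gives up, returning `None`, once more than `threshold` distinct values have been seen, so memory stays bounded by the threshold rather than by the column length.

```python
from typing import Hashable, Iterable, Optional


def capped_nunique(values: Iterable[Hashable], threshold: int) -> Optional[int]:
    """Return the exact number of distinct values, or None if it exceeds `threshold`.

    Hypothetical helper: at most `threshold + 1` values are ever held in memory.
    """
    seen = set()
    for v in values:
        seen.add(v)
        if len(seen) > threshold:
            # Too many distinct values: stop early instead of storing them all.
            return None
    return len(seen)


# A low-cardinality integer column still gets an exact count.
few = capped_nunique([1, 2, 2, 3, 3, 3], threshold=10)

# A high-cardinality column is cut off once the threshold is passed.
many = capped_nunique(range(1_000_000), threshold=1_000)
```

Here `few` is an exact count and `many` is `None`. The same code therefore gives a number on a small data set and nothing on a larger one, which is exactly the surprise described above.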

Ref https://github.com/JuliaData/DataFrames.jl/pull/1435

breaking decision


KristofferC

👍1

Most helpful comment

Unless there is some other comment I will open a PR implementing the recommendation:

disable computing of :median and :nunique by default (leave them as opt-in)

(the issue has popped up again on Slack and newcomers can be really surprised by the slow response time for large data sets)

bkamins on 29 Jun 2020

👍3

All 10 comments

Another advantage of not special-casing Real is that the high cardinality problem can also happen with other types (e.g. dates and times) so it would be more robust. I think we could start with the simple solution of stopping after some threshold, and later we could implement HyperLogLog (see the old HyperLogLog.jl).

nalimilan on 26 May 2020

👍1

My recommendation would be:

handle everything in the same way when computing :nunique; I do not think presenting an approximate count is a good idea - we should do an exact calculation;
disable computing of :median and :nunique by default (leave them as opt-in)

This will dramatically improve the usability of describe. Currently, for non-toy data sets, describe is in practice very slow because it computes :median and :nunique. If someone really wants to see these statistics, they can opt in.

bkamins on 26 May 2020

👍1

Maybe relevant: HyperLogLog is implemented in OnlineStats.
In the same vein, OnlineStats also has the P-square algorithm for approximate quantiles.

piever on 26 May 2020

I have an implementation of HyperLogLog with the best new corrections to the estimator that I can contribute. I should probably put it up in a new package. Here are some additional details: https://github.com/joshday/OnlineStats.jl/issues/177. Probably best to integrate the improvements into OnlineStats though.

StefanKarpinski on 26 May 2020

My personal preference would be to leave that out of DataFrames.jl to keep the list of dependencies short. describe allows passing any function for aggregation, so in a sense "this is already available". The question, in my opinion, is only about what we show by default (and, as I have commented, I would prefer to limit this to the most basic statistics so that describe responds fast).

bkamins on 26 May 2020

I'm fine with not reporting the number of uniques by default.

Though even if we do that it would still be nice to be able to use it 1) without it taking ages even if you have a numeric column with many unique values and 2) if you want to know the number of unique values of e.g. an integer column with few unique values. That said, I don't have a perfect solution to that.

nalimilan on 26 May 2020

The solution I think is describe(df, :nunique => fun, other statistics ...) where fun is a function of your choice (exact count or approximate, from whatever package you like).

bkamins on 26 May 2020

Stata's summarize, the inspiration for this command, doesn't include it. So I'm fine with dropping it, as we aren't losing feature parity. Plus, I've found the output of describe a bit too wide sometimes. This will help with that.