DataFrames.jl: describe on columns with e.g. Integers

Created on 26 May 2020 · 10 Comments · Source: JuliaData/DataFrames.jl

Right now, describe is documented with:

If a column's base type derives from Real, :nunique will return nothings.

The worry is that, in general, computing the number of distinct elements requires memory proportional to the length of the column.
There are perhaps a few ways we could tweak things:

  • Compute the number of unique values only up to some threshold.
  • Above some threshold, use a technique like HyperLogLog to get an approximate number.
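To make the threshold idea concrete, here is a minimal sketch in plain Julia. The helper name `nunique_capped` and its `limit` keyword are hypothetical and not part of DataFrames.jl; the point is only that memory stays bounded by the threshold rather than by the column length.

```julia
# Hypothetical helper (not a DataFrames.jl function): count distinct values,
# but give up and return `nothing` once more than `limit` distinct values
# have been seen. The set never holds more than `limit + 1` elements.
function nunique_capped(col; limit::Integer = 1_000)
    seen = Set{eltype(col)}()
    for x in col
        push!(seen, x)
        length(seen) > limit && return nothing
    end
    return length(seen)
end

few  = nunique_capped([1, 1, 2, 3])           # small column: exact count
many = nunique_capped(1:10_000; limit = 100)  # too many distinct values: nothing
```

Returning `nothing` above the limit is one possible convention; it has exactly the drawback discussed in this issue, since the result changes type once the data grows past the threshold.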

The drawback of thresholds is that it could be surprising to write code that works for a given dataframe size and have it suddenly stop working when you load a bigger dataframe. For approximate counting techniques (HyperLogLog), the drawback is that a user might think the count is always exact and allocate data structures of a certain size based on it, which might then fail when one tries to put in all the distinct values.

Ref https://github.com/JuliaData/DataFrames.jl/pull/1435

breaking decision

All 10 comments

Another advantage of not special-casing Real is that the high cardinality problem can also happen with other types (e.g. dates and times) so it would be more robust. I think we could start with the simple solution of stopping after some threshold, and later we could implement HyperLogLog (see the old HyperLogLog.jl).
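For readers unfamiliar with HyperLogLog, the following is a rough, self-contained sketch of the basic estimator in plain Julia. It is not the code from HyperLogLog.jl or OnlineStats, and it omits the newer estimator corrections; the names `HLLSketch`, `add!` and `estimate` are made up for illustration. It assumes Julia's built-in `hash` mixes bits well enough for this purpose.

```julia
# Illustrative HyperLogLog sketch (hypothetical names, basic estimator only).
struct HLLSketch
    p::Int                     # number of index bits; there are 2^p registers
    registers::Vector{UInt8}
end

# Default of p = 12 gives 4096 one-byte registers. The alpha constant used
# in `estimate` assumes at least 128 registers (p >= 7).
HLLSketch(p::Integer = 12) = HLLSketch(Int(p), zeros(UInt8, 1 << p))

function add!(s::HLLSketch, x)
    h = hash(x)                          # 64-bit hash of the value
    idx = Int(h >> (64 - s.p)) + 1       # top p bits select a register
    rest = h << s.p                      # remaining bits, left-aligned
    rank = UInt8(min(leading_zeros(rest), 64 - s.p) + 1)
    s.registers[idx] = max(s.registers[idx], rank)
    return s
end

function estimate(s::HLLSketch)
    m = length(s.registers)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m^2 / sum(r -> 2.0^(-Int(r)), s.registers)
    empty_registers = count(==(0), s.registers)
    # Small-range correction: fall back to linear counting.
    if raw <= 2.5m && empty_registers > 0
        return m * log(m / empty_registers)
    end
    return raw
end

s = HLLSketch()
foreach(x -> add!(s, x), 1:100_000)
approx = estimate(s)   # close to 100_000, but not exact
```

The sketch uses a fixed 4 KiB of registers no matter how long the column is, which is the appeal. The result is a floating-point estimate, so it should not be used to size containers that must hold every distinct value.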

My recommendation would be:

  • handle everything in the same way when computing :nunique; I do not think presenting an approximate count is a good idea - we should do an exact calculation;
  • disable computing of :median and :nunique by default (leave them as opt-in)

This will dramatically improve the usability of describe. Currently, for non-toy data sets, describe is very slow in practice because it computes :median and :nunique. Anyone who really wants to see these statistics can opt in.
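As a small illustration of what opting in looks like, the sketch below requests statistics by name. It assumes a DataFrames.jl release in which describe accepts statistic names as symbols; the toy data and variable names are made up, and which statistics are computed by default depends on the release.

```julia
using DataFrames

# Toy data for illustration only.
df = DataFrame(name = ["a", "b", "a", "c"], value = [1.0, 2.0, 3.0, 4.0])

# Request only cheap statistics.
basic = describe(df, :mean, :min, :max)

# Explicitly opt in to the more expensive ones.
full = describe(df, :mean, :min, :max, :median, :nunique)
```

With the behavior quoted at the top of this issue, the :nunique entry for the Float64 column is `nothing`, while the string column gets an exact count.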

Maybe relevant: HyperLogLog is implemented in OnlineStats.
In the same vein, OnlineStats also has the P-square algorithm for approximate quantiles.

I have an implementation of HyperLogLog with the best new corrections to the estimator that I can contribute. I should probably put it up in a new package. Here are some additional details: https://github.com/joshday/OnlineStats.jl/issues/177. Probably best to integrate the improvements into OnlineStats though.

My personal preference would be to leave that out of DataFrames.jl to keep the list of dependencies short. describe allows passing any function for aggregation, so in a sense "this is already available". The question, in my opinion, is only about what we show by default (and, as I have commented, I would prefer to limit this to the most basic statistics so that describe responds fast).

I'm fine with not reporting the number of uniques by default.

Though even if we do that it would still be nice to be able to use it 1) without it taking ages even if you have a numeric column with many unique values and 2) if you want to know the number of unique values of e.g. an integer column with few unique values. That said, I don't have a perfect solution to that.

The solution, I think, is describe(df, :nunique => fun, other statistics ...), where fun is a function of your choice (exact count or approximate, from whatever package you like).
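A minimal sketch of that approach follows. The helper `count_distinct` and the column name `:ndistinct` are hypothetical. Note the pair order: recent DataFrames.jl releases take `function => name`, whereas releases from around the time of this thread took `name => function` as written above, so adjust to the version you have installed.

```julia
using DataFrames

# Hypothetical helper: exact distinct count, ignoring missing values.
count_distinct(col) = length(Set(skipmissing(col)))

df = DataFrame(id = 1:6, group = [1, 1, 2, 2, 3, 3])

# Pair order assumed here is `function => name` (recent releases).
stats = describe(df, :min, :max, count_distinct => :ndistinct)
```

Because the function is supplied by the caller, it works for integer columns as well, and an approximate counter from another package could be swapped in the same way.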

Stata's summarize, the inspiration for this command, doesn't include it. So I'm fine dropping it, as we aren't losing feature parity. Plus, I've found describe a bit too wide sometimes. This will help with that.

Unless there is some other comment I will open a PR implementing the recommendation:

disable computing of :median and :nunique by default (leave them as opt-in)

(the issue has popped up again on Slack and newcomers can be really surprised by the slow response time for large data sets)

See #2339
