Dataframes.jl: Add column length checks to expensive operations

Created on 10 Jun 2019  路  15Comments  路  Source: JuliaData/DataFrames.jl

From: https://github.com/JuliaData/DataFrames.jl/issues/1844#issuecomment-500577183

... we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).

In short: checking column lengths is cheap,
We should do that a bunch of operations,
even of we do think we have patched all holes that let a user resize columns out of sync,
we can still catch bugs in internal methods via this (defensive programming)

Most helpful comment

I think maing nrow take a columns argument would be confusing to the user.
It is supposed to be impossible for columns to have different lengths.
And that looks like it is saying that they can, and that I am querying different columns,
And that make it look like this is a feature.

All 15 comments

This is my point 馃槃. Thank you!
(I will get to it after I am finished with broadcasting and setindex! cleanups) - unless someone else will by then (which I am happy to review then)

Here is a benchmark of checking cost for 20,000 columns (which I think is a typically reasonable maximum in practice):

julia> df = DataFrame(rand(10, 20_000));

julia> function samelength(df)
           rows = nrow(df)
           for col in _columns(df)
               length(col) == rows || return false
           end
           return true
       end
samelength (generic function with 1 method)

julia> @benchmark samelength($df)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     492.100 渭s (0.00% GC)
  median time:      497.400 渭s (0.00% GC)
  mean time:        627.434 渭s (0.00% GC)
  maximum time:     2.256 ms (0.00% GC)
  --------------
  samples:          7955
  evals/sample:     1

So given this the question is for what functions we should add this check.

20,000 columns is definately an upper bound.
E.g. basically all spreedsheet programs crash out at over 16,384 columns.

I think it should be called if you do

  • [ ] show / print
  • [ ] first (when called with the nrows argument)

Which will catch things during in interactive use.

  • [ ] join
  • [ ] groupby

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

I would also add:

  • [ ] reshape operations
  • [ ] filtering
  • [ ] removing duplicates
  • [ ] removing missings
  • [ ] eachcol/eachrow upon creation of the object

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

If we do, then we should maybe define: nrows(check_column_constancy=true)
And then use nrows(false) in the hotloops.

Agreed. Mostly this other form will be needed internally only so this should be OK (the key place where we will probably use it is creation of DataFrameRow from eachrow as doing the check on each iteration would be very slow)

I feel like @inbounds should supress this check.
We can opt into that by wrapping the check in @boundscheck.
then if @inbounds is active (and things are wrapped in @propergate_inbounds, as required)
the check would be supressed.

idk though it might be too magic -- people are used to @inbounds applying to indexing operations,
and this is only indirectly indexing.

Doing checks from nrow might be a bit too much. People could expect it to be very cheap, like length on arrays.

The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

As for nrow I think that it does not hurt to add an argument telling which columns should be checked with the default to check the first column. A signature like:

nrow(::AbstractDataFrame, cols=nothing)

I think maing nrow take a columns argument would be confusing to the user.
It is supposed to be impossible for columns to have different lengths.
And that looks like it is saying that they can, and that I am querying different columns,
And that make it look like this is a feature.

Good point. So we should have such function internally anyway.

The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

Sounds reasonable. Though I'm not sure we need a very clear rule, since throwing an error if a data frame is corrupt will always be OK, and missing a check would be OK too. What matters it that we add as many checks we can without affecting performance.

So here is my list of functions that should do the checks. Please add/remove from it and then we can make a PR:

  • groupby, aggregate
  • join
  • melt, stack, unstack, stackdf, meltdf
  • categorical!
  • copy
  • describe, show
  • allowmissing!, completecases, disallowmissing!, dropmissing, dropmissing!
  • nonunique, unique!
  • eachrow, eachcol
  • filter, filter!
  • hcat, vcat
  • mapcols
  • repeat
  • sort, sort!

In particular I left out: getindex, setindex!, view, select, select!, append! and push! (we might have an opinion to add a check sometimes for them also).

Fixed by #1887

Was this page helpful?
0 / 5 - 0 ratings

Related issues

abieler picture abieler  路  7Comments

CameronBieganek picture CameronBieganek  路  6Comments

ahalwright picture ahalwright  路  3Comments

mattBrzezinski picture mattBrzezinski  路  5Comments

bkamins picture bkamins  路  7Comments