Dataframes.jl: Add column length checks to expensive operations

Created on 10 Jun 2019 · 15Comments · Source: JuliaData/DataFrames.jl

From: https://github.com/JuliaData/DataFrames.jl/issues/1844#issuecomment-500577183

... we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).

In short: checking column lengths is cheap,
We should do that a bunch of operations,
even of we do think we have patched all holes that let a user resize columns out of sync,
we can still catch bugs in internal methods via this (defensive programming)

Source

oxinabox

Most helpful comment

I think maing nrow take a columns argument would be confusing to the user.
It is supposed to be impossible for columns to have different lengths.
And that looks like it is saying that they can, and that I am querying different columns,
And that make it look like this is a feature.

oxinabox on 12 Jun 2019

👍3

All 15 comments

This is my point 😄. Thank you!
(I will get to it after I am finished with broadcasting and setindex! cleanups) - unless someone else will by then (which I am happy to review then)

bkamins on 10 Jun 2019

Here is a benchmark of checking cost for 20,000 columns (which I think is a typically reasonable maximum in practice):

julia> df = DataFrame(rand(10, 20_000));

julia> function samelength(df)
           rows = nrow(df)
           for col in _columns(df)
               length(col) == rows || return false
           end
           return true
       end
samelength (generic function with 1 method)

julia> @benchmark samelength($df)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     492.100 μs (0.00% GC)
  median time:      497.400 μs (0.00% GC)
  mean time:        627.434 μs (0.00% GC)
  maximum time:     2.256 ms (0.00% GC)
  --------------
  samples:          7955
  evals/sample:     1

So given this the question is for what functions we should add this check.

bkamins on 11 Jun 2019

20,000 columns is definately an upper bound.
E.g. basically all spreedsheet programs crash out at over 16,384 columns.

I think it should be called if you do

[ ] show / print
[ ] first (when called with the nrows argument)

Which will catch things during in interactive use.

[ ] join
[ ] groupby

oxinabox on 11 Jun 2019

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

I would also add:

[ ] reshape operations
[ ] filtering
[ ] removing duplicates
[ ] removing missings
[ ] eachcol/eachrow upon creation of the object

bkamins on 11 Jun 2019

👍1

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

If we do, then we should maybe define: nrows(check_column_constancy=true)
And then use nrows(false) in the hotloops.

oxinabox on 11 Jun 2019

Agreed. Mostly this other form will be needed internally only so this should be OK (the key place where we will probably use it is creation of DataFrameRow from eachrow as doing the check on each iteration would be very slow)

bkamins on 11 Jun 2019

I feel like @inbounds should supress this check.
We can opt into that by wrapping the check in @boundscheck.
then if @inbounds is active (and things are wrapped in @propergate_inbounds, as required)
the check would be supressed.

idk though it might be too magic -- people are used to @inbounds applying to indexing operations,
and this is only indirectly indexing.

oxinabox on 12 Jun 2019

Doing checks from nrow might be a bit too much. People could expect it to be very cheap, like length on arrays.

nalimilan on 12 Jun 2019

The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

As for nrow I think that it does not hurt to add an argument telling which columns should be checked with the default to check the first column. A signature like:

nrow(::AbstractDataFrame, cols=nothing)

bkamins on 12 Jun 2019

oxinabox on 12 Jun 2019

👍3

Good point. So we should have such function internally anyway.

bkamins on 12 Jun 2019

The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

Sounds reasonable. Though I'm not sure we need a very clear rule, since throwing an error if a data frame is corrupt will always be OK, and missing a check would be OK too. What matters it that we add as many checks we can without affecting performance.

nalimilan on 12 Jun 2019

So here is my list of functions that should do the checks. Please add/remove from it and then we can make a PR:

groupby, aggregate
join
melt, stack, unstack, stackdf, meltdf
categorical!
copy
describe, show
allowmissing!, completecases, disallowmissing!, dropmissing, dropmissing!
nonunique, unique!
eachrow, eachcol
filter, filter!
hcat, vcat
mapcols
repeat
sort, sort!

In particular I left out: getindex, setindex!, view, select, select!, append! and push! (we might have an opinion to add a check sometimes for them also).