From: https://github.com/JuliaData/DataFrames.jl/issues/1844#issuecomment-500577183
... we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).
In short: checking column lengths is cheap,
We should do that a bunch of operations,
even of we do think we have patched all holes that let a user resize columns out of sync,
we can still catch bugs in internal methods via this (defensive programming)
This is my point 馃槃. Thank you!
(I will get to it after I am finished with broadcasting and setindex! cleanups) - unless someone else will by then (which I am happy to review then)
Here is a benchmark of checking cost for 20,000 columns (which I think is a typically reasonable maximum in practice):
julia> df = DataFrame(rand(10, 20_000));
julia> function samelength(df)
rows = nrow(df)
for col in _columns(df)
length(col) == rows || return false
end
return true
end
samelength (generic function with 1 method)
julia> @benchmark samelength($df)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 492.100 渭s (0.00% GC)
median time: 497.400 渭s (0.00% GC)
mean time: 627.434 渭s (0.00% GC)
maximum time: 2.256 ms (0.00% GC)
--------------
samples: 7955
evals/sample: 1
So given this the question is for what functions we should add this check.
20,000 columns is definately an upper bound.
E.g. basically all spreedsheet programs crash out at over 16,384 columns.
I think it should be called if you do
first (when called with the nrows argument)Which will catch things during in interactive use.
It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.
I would also add:
It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.
If we do, then we should maybe define: nrows(check_column_constancy=true)
And then use nrows(false) in the hotloops.
Agreed. Mostly this other form will be needed internally only so this should be OK (the key place where we will probably use it is creation of DataFrameRow from eachrow as doing the check on each iteration would be very slow)
I feel like @inbounds should supress this check.
We can opt into that by wrapping the check in @boundscheck.
then if @inbounds is active (and things are wrapped in @propergate_inbounds, as required)
the check would be supressed.
idk though it might be too magic -- people are used to @inbounds applying to indexing operations,
and this is only indirectly indexing.
Doing checks from nrow might be a bit too much. People could expect it to be very cheap, like length on arrays.
The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.
As for nrow I think that it does not hurt to add an argument telling which columns should be checked with the default to check the first column. A signature like:
nrow(::AbstractDataFrame, cols=nothing)
I think maing nrow take a columns argument would be confusing to the user.
It is supposed to be impossible for columns to have different lengths.
And that looks like it is saying that they can, and that I am querying different columns,
And that make it look like this is a feature.
Good point. So we should have such function internally anyway.
The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.
Sounds reasonable. Though I'm not sure we need a very clear rule, since throwing an error if a data frame is corrupt will always be OK, and missing a check would be OK too. What matters it that we add as many checks we can without affecting performance.
So here is my list of functions that should do the checks. Please add/remove from it and then we can make a PR:
groupby, aggregatejoinmelt, stack, unstack, stackdf, meltdfcategorical!copydescribe, showallowmissing!, completecases, disallowmissing!, dropmissing, dropmissing!nonunique, unique!eachrow, eachcolfilter, filter!hcat, vcatmapcolsrepeatsort, sort!In particular I left out: getindex, setindex!, view, select, select!, append! and push! (we might have an opinion to add a check sometimes for them also).
Add append! to the list, see https://github.com/JuliaData/DataFrames.jl/issues/1885.
Fixed by #1887
Most helpful comment
I think maing
nrowtake a columns argument would be confusing to the user.It is supposed to be impossible for columns to have different lengths.
And that looks like it is saying that they can, and that I am querying different columns,
And that make it look like this is a feature.