Related to #1200.
Compare:
julia> arr = Array{Any}((0,2))
0脳2 Array{Any,2}
julia> isempty(arr)
true
with
julia> df = DataFrame(col1=[], col2=[])
0脳2 DataFrames.DataFrame
julia> isempty(df)
false
It seems clear to me that a DataFrame with zero rows should be considered empty.
FWIW, @quinnj's PR from a few days ago https://github.com/JuliaData/DataFrames.jl/pull/1224 fixes this as well, but this particular issue seems less controversial than how we define length.
I think it's got to be a wholesale switch from viewing DataFrames as a columnar datastore to more of a "bag of tuples" definition (without needing to change the underlying actual representation obviously). The current view I think grew out of viewing a DataFrame as a "data-smart" Matrix (Array{T, 2}), which is column-oriented.
I'd argue that the current behaviour is still broken even from the "data-smart" Matrix perspective though.
Yes, what justifies the current behavior is rather the definition of DataFrame as a vector of columns (which also justifies the behavior of df[i] and of length). That's clearly not very natural, the only issue with getting rid of this representation is that df[i] is quite convenient compared to df[:, i]. Maybe we could keep it even if we fix isempty and length.
@nalimilan I don't want to derail this issue, but when (or how often) do you want to get a column based in its integer index? I don't think I've ever wanted to do that, but that could just be my use cases.
FWIW, I do column integer-indexing all the time, but that's because in my workflows, I code towards integer indexing instead of symbol indexing; in my mind it's faster because I can avoid the extra indirection lookup of symbol=>integer, but that extra cost is probably negligible in production. Anyway, I'd re-iterate again though that I think we need to commit to either a _column-oriented_ or _row-oriented_ representation, regardless of the internal implementation. Currently things are (mostly) consistent for a _column-orientation_, but there's obviously some desire to switch that. For example, if we switch to a row-orientation, I would definitely expect df[1] to give me the first row instead of the first column.
Either way, isn't it clearer to write df[:,i] and df[i,:] so it's immediately obvious what you're asking for?
@nalimilan I don't want to derail this issue, but when (or how often) do you want to get a column based in its integer index? I don't think I've ever wanted to do that, but that could just be my use cases.
@rofinn i wasn't necessarily an integer index in my example, it could have been a symbol too. I think the problem is the same.
@ararslan I agree df[:, i] is clearer than df[i], but it's less convenient to type, which is annoying since that's sometime you need to type all the time. Maybe with things like Query it shouldn't be as common as in R, though. If we get field overloading, we could use df.i instead (which I think is the reason why column names are required to be valid identifiers).
Anyway, I'd re-iterate again though that I think we need to commit to either a column-oriented or row-oriented representation, regardless of the internal implementation. Currently things are (mostly) consistent for a column-orientation, but there's obviously some desire to switch that. For example, if we switch to a row-orientation, I would definitely expect df[1] to give me the first row instead of the first column.
Seeing how people disagree on what's the most natural orientation, I'd rather make DataFrame orientation-agnostic and require people to be explicit about what they want, e.g. using for r in eachrow(df) / for r in eachcol(df) or df[i, :]/df[i, :].
Discussion of notation aside, I think a 0-row DataFrame should be isempty regardless of whether DataFramess are column- or row-oriented. It seems a natural definition of emptiness, even if it doesn't correspond directly to length(df) == 0 (though when that's true we'd also necessarily have isempty).
I could put together a separate PR to address this in the morning, if there's interest.
Oops, I missed your comment @spurll.
Hey, I would have done it, but I'm chairing a board meeting right now.
Most helpful comment
Seeing how people disagree on what's the most natural orientation, I'd rather make
DataFrameorientation-agnostic and require people to be explicit about what they want, e.g. usingfor r in eachrow(df)/for r in eachcol(df)ordf[i, :]/df[i, :].