DataFrames.jl: `isempty` checks the number of columns rather than the number of rows

Created on 11 Sep 2017 · 12 Comments · Source: JuliaData/DataFrames.jl

Related to #1200.

Compare:

julia> arr = Array{Any}((0,2))
0×2 Array{Any,2}

julia> isempty(arr)
true

with

julia> df = DataFrame(col1=[], col2=[])
0×2 DataFrames.DataFrame


julia> isempty(df)
false

It seems clear to me that a DataFrame with zero rows should be considered empty.

FWIW, @quinnj's PR from a few days ago https://github.com/JuliaData/DataFrames.jl/pull/1224 fixes this as well, but this particular issue seems less controversial than how we define length.

spurll

👍4

Most helpful comment

Anyway, I'd re-iterate again though that I think we need to commit to either a column-oriented or row-oriented representation, regardless of the internal implementation. Currently things are (mostly) consistent for a column-orientation, but there's obviously some desire to switch that. For example, if we switch to a row-orientation, I would definitely expect df[1] to give me the first row instead of the first column.

Seeing how people disagree on what's the most natural orientation, I'd rather make DataFrame orientation-agnostic and require people to be explicit about what they want, e.g. using for r in eachrow(df) / for c in eachcol(df) or df[i, :] / df[:, i].

nalimilan on 12 Sep 2017

👍5

All 12 comments

I think it's got to be a wholesale switch from viewing DataFrames as a columnar datastore to more of a "bag of tuples" definition (without needing to change the underlying actual representation obviously). The current view I think grew out of viewing a DataFrame as a "data-smart" Matrix (Array{T, 2}), which is column-oriented.

quinnj on 11 Sep 2017

👍2

I'd argue that the current behaviour is still broken even from the "data-smart" Matrix perspective though.

rofinn on 11 Sep 2017

Yes, what justifies the current behavior is rather the definition of DataFrame as a vector of columns (which also justifies the behavior of df[i] and of length). That's clearly not very natural; the only issue with getting rid of this representation is that df[i] is quite convenient compared to df[:, i]. Maybe we could keep it even if we fix isempty and length.

nalimilan on 11 Sep 2017
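
Editor's illustration (not part of the original thread): the "vector of columns" view described in the comment above can be sketched with a hypothetical ToyFrame type. The type and its field names are assumptions made for illustration only and are not the DataFrames.jl implementation. When length and isempty are defined over the columns, a frame with two columns and zero rows reports itself as non-empty, which mirrors the behaviour reported in the issue.

```julia
# Hypothetical toy type for illustration; NOT the DataFrames.jl implementation.
struct ToyFrame
    names::Vector{Symbol}
    columns::Vector{Vector{Any}}
end

# Column-oriented definitions: the frame behaves like a vector of columns.
Base.length(tf::ToyFrame) = length(tf.columns)
Base.getindex(tf::ToyFrame, i::Integer) = tf.columns[i]
Base.isempty(tf::ToyFrame) = length(tf) == 0

# Two columns, zero rows, like DataFrame(col1=[], col2=[]) in the issue.
tf = ToyFrame([:col1, :col2], [Any[], Any[]])

println(length(tf))    # number of columns
println(isempty(tf))   # false under the column-oriented definition
```

Under this model, tf[1] returns the first column, which is the convenience the comment above would like to keep.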

@nalimilan I don't want to derail this issue, but when (or how often) do you want to get a column based on its integer index? I don't think I've ever wanted to do that, but that could just be my use cases.

rofinn on 11 Sep 2017

👍1

FWIW, I do column integer-indexing all the time, but that's because in my workflows, I code towards integer indexing instead of symbol indexing; in my mind it's faster because I can avoid the extra indirection lookup of symbol=>integer, but that extra cost is probably negligible in production. Anyway, I'd re-iterate again though that I think we need to commit to either a column-oriented or row-oriented representation, regardless of the internal implementation. Currently things are (mostly) consistent for a column-orientation, but there's obviously some desire to switch that. For example, if we switch to a row-orientation, I would definitely expect df[1] to give me the first row instead of the first column.

quinnj on 11 Sep 2017

Either way, isn't it clearer to write df[:,i] and df[i,:] so it's immediately obvious what you're asking for?

ararslan on 11 Sep 2017

👍1

@nalimilan I don't want to derail this issue, but when (or how often) do you want to get a column based on its integer index? I don't think I've ever wanted to do that, but that could just be my use cases.

@rofinn In my example, i wasn't necessarily an integer index; it could have been a symbol too. I think the problem is the same.

@ararslan I agree df[:, i] is clearer than df[i], but it's less convenient to type, which is annoying since that's something you need to type all the time. Maybe with things like Query it shouldn't be as common as in R, though. If we get field overloading, we could use df.i instead (which I think is the reason why column names are required to be valid identifiers).

nalimilan on 11 Sep 2017

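Anyway, I'd re-iterate again though that I think we need to commit to either a column-oriented or row-oriented representation, regardless of the internal implementation. Currently things are (mostly) consistent for a column-orientation, but there's obviously some desire to switch that. For example, if we switch to a row-orientation, I would definitely expect df[1] to give me the first row instead of the first column.

Seeing how people disagree on what's the most natural orientation, I'd rather make DataFrame orientation-agnostic and require people to be explicit about what they want, e.g. using for r in eachrow(df) / for c in eachcol(df) or df[i, :] / df[:, i].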

nalimilan on 12 Sep 2017

👍5

Discussion of notation aside, I think a 0-row DataFrame should be isempty regardless of whether DataFrames are column- or row-oriented. It seems a natural definition of emptiness, even if it doesn't correspond directly to length(df) == 0 (though when that's true we'd also necessarily have isempty).

ararslan on 12 Sep 2017

👍3
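
Editor's illustration (not part of the original thread): a minimal sketch of the emptiness rule proposed in the comment above, written against a hypothetical ToyTable type. The type and the helper names nrows and ncols are assumptions for illustration, not DataFrames.jl API. The table counts as empty when it has zero rows or zero columns, whatever length is chosen to mean.

```julia
# Hypothetical toy type for illustration; NOT the DataFrames.jl implementation.
struct ToyTable
    columns::Vector{Vector{Any}}
end

# Hypothetical helpers: column count, and row count taken from the first column.
ncols(t::ToyTable) = length(t.columns)
nrows(t::ToyTable) = ncols(t) == 0 ? 0 : length(first(t.columns))

# Empty when there are no rows or no columns, independent of how length is defined.
Base.isempty(t::ToyTable) = nrows(t) == 0 || ncols(t) == 0

zero_rows  = ToyTable([Any[], Any[]])              # 0 rows, 2 columns
with_rows  = ToyTable([Any[1, 2], Any["a", "b"]])  # 2 rows, 2 columns
no_columns = ToyTable(Vector{Vector{Any}}())       # no columns at all

println(isempty(zero_rows))    # true
println(isempty(with_rows))    # false
println(isempty(no_columns))   # true
```

With this rule, a column-count length of zero still implies isempty, as the comment notes, while the zero-row case is covered as well.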

I could put together a separate PR to address this in the morning, if there's interest.

spurll on 12 Sep 2017

👍3

Oops, I missed your comment @spurll.

rofinn on 12 Sep 2017

😄2

Hey, I would have done it, but I'm chairing a board meeting right now.

spurll on 12 Sep 2017

❤2
