DataFrames.jl: add missing columns when push!ing?

Created on 11 Mar 2020 · 10 Comments · Source: JuliaData/DataFrames.jl

I'm trying to build up a data frame one row at a time, except that not every row has all the columns. I'd like to be able to push! the rows one at a time, filling in the absent columns with missing. Using cols=:subset gets me halfway there:

julia> df = DataFrame()
0×0 DataFrame

julia> push!(df, (a=1, b=2))
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> allowmissing!(df)
1×2 DataFrame
│ Row │ a      │ b      │
│     │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┤
│ 1   │ 1      │ 2      │

julia> push!(df, (a=1, c=3), cols=:subset)
2×2 DataFrame
│ Row │ a      │ b       │
│     │ Int64⍰ │ Int64⍰  │
├─────┼────────┼─────────┤
│ 1   │ 1      │ 2       │
│ 2   │ 1      │ missing │

But notice that no column for c is added. Having to call allowmissing! is also potentially annoying, since I'd have to call it for every row, or else keep track, in the loop over rows, of which columns already allow missing values. But I think that's more related to #1716.

I'd like either an additional argument like fill, or another value for cols which opts into adding columns as they are encountered.

decision non-breaking

kleinschmidt

Most helpful comment

I think this is a duplicate (actually a subset) of https://github.com/JuliaData/DataFrames.jl/issues/2032. Could you please have a look there? If so, this can be closed.

Just as a comment: things are not easy here, as you have to decide on a widening rule. If you have a column of Ints and want to push a float to it, should the column then be converted to floats?

In particular, maybe you want something simpler, which could be implemented before https://github.com/JuliaData/DataFrames.jl/issues/2032. Something like cols=:union where, as you proposed, missing columns would be added and filled with missing in the existing rows, with the new entry keeping the type it has.
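As a hedged illustration of what such a cols=:union option looks like in practice: to my knowledge, later DataFrames.jl releases (0.21 and newer) do accept cols=:union in push!, with column types promoted by default. The sketch below assumes such a version is installed; it will not run on the release this thread was written against.

```julia
using DataFrames

# Start empty and push rows whose column sets differ.
df = DataFrame()
push!(df, (a = 1, b = 2))

# Assumption: a DataFrames.jl version where push! supports cols = :union.
# Column c is created and filled with missing for the earlier row, and
# column b gets missing for this row (its eltype is widened as needed).
push!(df, (a = 1, c = 3); cols = :union)

println(df)
```

With this option there is no need to call allowmissing! beforehand, since the affected columns are widened to allow missing when the row is pushed.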

Having to call allowmissing! is a separate issue. Potentially another kwarg, such as widen, could be added, but again it is tricky. Do we want to allow promotion:

only to a Union with Missing
using promote_type (could lead to conversion of existing data)
using typejoin (will produce abstract types)
using a Union of the encountered types (this is not what is normally done)
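To make the four options concrete, here is a small sketch using only Base Julia. It assumes the third option refers to Base's typejoin, and the variable names are purely illustrative; it shows what element type each rule would pick when a Float64 is pushed into an Int column.

```julia
# Option 1: widen only to a Union with Missing; existing values are untouched.
only_missing = Union{Int, Missing}

# Option 2: promote_type picks a common concrete type, so existing Int data
# would have to be converted to Float64.
promoted = promote_type(Int, Float64)

# Option 3: typejoin picks the closest common supertype, which is abstract.
joined = typejoin(Int, Float64)

# Option 4: a Union of exactly the types encountered so far.
unioned = Union{Int, Float64}

println((only_missing, promoted, joined, unioned))

# What option 2 would mean for an existing column: the data is converted.
col = [1, 2]
widened = convert(Vector{promoted}, col)
push!(widened, 2.5)
println(widened)
```

The trade-off is visible in the results: promote_type keeps the column concrete but rewrites existing data, while typejoin and a plain Union leave the data alone at the cost of a less specific element type.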

In short, we know about the problem, but deciding what a robust and future-proof design looks like here is hard :sob:, so for now we disallowed this.

If you have an idea for a minimal API extension that would do what is expected most of the time, that would be great (my fear is that a fully flexible API here would be overkill for a regular user, but maybe I am wrong).

bkamins on 11 Mar 2020

All 10 comments

We could expand vcat to work with anything that is property accessible.

pdeffebach on 13 Mar 2020

The problem is how we should determine the return type. Would you want to dispatch on the first argument to vcat (i.e. if the first argument is an AbstractDataFrame then we return a DataFrame)? This is probably doable (I have just checked that there should be no dispatch ambiguities, which was a high risk given that vcat in Base is pretty flexible). Would you then want anything that is Tables.jl compliant to be accepted?

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

bkamins on 13 Mar 2020

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

Yes, I think it is best to keep vcat for DataFrame to DataFrame, given that making a DataFrame is cheap.
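A minimal sketch of that DataFrame-to-DataFrame route, assuming a DataFrames.jl version whose vcat accepts the cols keyword with the :union value: each row is wrapped in a one-row DataFrame first, and vcat then fills in the columns that a given row lacks.

```julia
using DataFrames

rows = [(a = 1, b = 2), (a = 3, b = 4, c = 5)]

# Wrap each NamedTuple in a one-row DataFrame (cheap relative to vcat).
frames = [DataFrame([row]) for row in rows]

# cols = :union keeps every column seen in any input and fills the gaps
# with missing.
combined = vcat(frames...; cols = :union)

println(combined)
```

This is fine for a handful of rows; for long loops, building many one-row data frames is likely slower than pushing rows directly.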

This could also be handled at the Tables.jl level, which could provide a function to fill in columns and expand types as needed. For example, at present:

julia> t = [(a = 1, b = 2), (a = 3, b = 4, c = 5)]
2-element Array{NamedTuple,1}:
 (a = 1, b = 2)
 (a = 3, b = 4, c = 5)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

pdeffebach on 13 Mar 2020

I did not know about this. So Tables.columns effectively uses the :subset strategy of push!, which is also visible here:

julia> t = [(a = 1, b = 2), (b=3,a=4)]
2-element Array{NamedTuple{names,Tuple{Int64,Int64}} where names,1}:
 (a = 1, b = 2)
 (b = 3, a = 4)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 4     │ 3     │

@quinnj - is this intended, or is this just a consequence of performance considerations?

bkamins on 13 Mar 2020

For any Tables.jl source without a well-defined schema (i.e. one that returns nothing from Tables.schema(tbl)), which in this case includes Arrays of non-homogeneous NamedTuples, it uses the Tables.columnnames of the first row. So I don't think that's necessarily :subset, because I think it's actually an error if a subsequent row doesn't have one of the column names from the first row.

quinnj on 3 Apr 2020

Actually what the example at https://github.com/JuliaData/DataFrames.jl/issues/2150#issuecomment-598866058 shows is that new columns that appear after the first row are silently ignored. Is that intended?

nalimilan on 3 Apr 2020

Ah, right, so it is then the :intersect option with promote=true (so that we promote column types if needed, which is added in PR #2152), right?

bkamins on 3 Apr 2020

Actually what the example at #2150 (comment) shows is that new columns that appear after the first row are silently ignored. Is that intended?

Yes, extra columns _not_ in the first row will be ignored; what I meant in my comment is that if the _first_ row has extra columns that aren't present in subsequent rows, _that_ would be an error (since it would be trying to get a column that didn't exist on later rows).

quinnj on 6 Apr 2020

Yes - this is exactly :intersect.

bkamins on 6 Apr 2020
