Dataframes.jl: add missing columns when push! ing?

Created on 11 Mar 2020  ยท  10Comments  ยท  Source: JuliaData/DataFrames.jl

I'm trying to build up a data frame one row at a time, except that not every row has all the columns. I'd like to be able to push! the rows on one at a time, filling in the missing columns with missing. using cols=:subset gets me halfway there:

julia> df = DataFrame()
0ร—0 DataFrame

julia> push!(df, (a=1, b=2))
1ร—2 DataFrame
โ”‚ Row โ”‚ a     โ”‚ b     โ”‚
โ”‚     โ”‚ Int64 โ”‚ Int64 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ 2     โ”‚

julia> allowmissing!(df)
1ร—2 DataFrame
โ”‚ Row โ”‚ a      โ”‚ b      โ”‚
โ”‚     โ”‚ Int64โฐ โ”‚ Int64โฐ โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1      โ”‚ 2      โ”‚
julia> push!(df, (a=1, c=3), cols=:subset)
2ร—2 DataFrame
โ”‚ Row โ”‚ a      โ”‚ b       โ”‚
โ”‚     โ”‚ Int64โฐ โ”‚ Int64โฐ  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1      โ”‚ 2       โ”‚
โ”‚ 2   โ”‚ 1      โ”‚ missing โ”‚

But notice that no column for c is added. Having to call allowmissing! is also potentially annoying since I'll have to call that for every row or have to do the bookkeeping of keeping track of which columns have already missings in them in the loop over rows. But I think that's more related to #1716 .

I'd like either an additional argument like fill, or another value for cols which opts into adding columns as they are encountered.

decision non-breaking

Most helpful comment

I think it is a duplicate (actually a subset) of https://github.com/JuliaData/DataFrames.jl/issues/2032. Could you please have a look there and if yes this can be closed.

Just as a comment - things are not easy here as you have to decide about widening rule. If you have a column of Ints and want to push a float to it - should the column be converted to floats then?

In particular - maybe you want something simpler, which could be implemented before https://github.com/JuliaData/DataFrames.jl/issues/2032. Something like cols=:union where, as you proposed missing columns would be added filled with missing in existing rows and the new entry given type it has.

Now having to call allowmissing! is another issue, and potentially another kwarg could be added, like widen, again it is tricky, do we want to allow promotion:

  • only to Missing union
  • using promote_type (could lead to conversion of existing data)
  • using typeunion (will produce abstract types)
  • using Union of encountered types (this is not what normally is done)

In short - we know about the problem, but the decision what is a robust and future proof design here is hard :sob:, so for now we disallowed this.

If you would have an idea of a minimal API extension that would do what is expected most of the time then it would be great (my fear is that having fully flexible API here would be an overkill for a regular user, but maybe I am wrong).

All 10 comments

I think it is a duplicate (actually a subset) of https://github.com/JuliaData/DataFrames.jl/issues/2032. Could you please have a look there and if yes this can be closed.

Just as a comment - things are not easy here as you have to decide about widening rule. If you have a column of Ints and want to push a float to it - should the column be converted to floats then?

In particular - maybe you want something simpler, which could be implemented before https://github.com/JuliaData/DataFrames.jl/issues/2032. Something like cols=:union where, as you proposed missing columns would be added filled with missing in existing rows and the new entry given type it has.

Now having to call allowmissing! is another issue, and potentially another kwarg could be added, like widen, again it is tricky, do we want to allow promotion:

  • only to Missing union
  • using promote_type (could lead to conversion of existing data)
  • using typeunion (will produce abstract types)
  • using Union of encountered types (this is not what normally is done)

In short - we know about the problem, but the decision what is a robust and future proof design here is hard :sob:, so for now we disallowed this.

If you would have an idea of a minimal API extension that would do what is expected most of the time then it would be great (my fear is that having fully flexible API here would be an overkill for a regular user, but maybe I am wrong).

We could expand vcat to work with anything that is property accessible.

The problem is how we should determine the return type? Would you want to dispatch on a first argument to vcat (i.e. if the first argument is a AbstractDataFrame then we return a DataFrame)? This is probably doable (I have just checked that there should not be dispatch ambiguities - which was a high risk given vcat in Base is pretty flexible). Then would you want to make anything that is Tables.jl compliant to be accepted?

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

Also I think converting via DataFrame! before vcat-ting should be cheap relative to vcat cost so it is not crucial to have it. But maybe I am missing something (for row-oriented storage anyway probably it is better to convert to column-oriented before vcat anyway)

Yes I think it is best to keep vcat for DataFrame to DataFrame, given that any making a DataFrame is cheap.

This could also be handled at the Tables.jl level, which could provide a function to fill in columns and expand types as needed. For example

julia> t = [(a = 1, b = 2), (a = 3, b = 4, c = 5)]
2-element Array{NamedTuple,1}:
 (a = 1, b = 2)
 (a = 3, b = 4, c = 5)

julia> DataFrame(t)
2ร—2 DataFrame
โ”‚ Row โ”‚ a     โ”‚ b     โ”‚
โ”‚     โ”‚ Int64 โ”‚ Int64 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ 2     โ”‚
โ”‚ 2   โ”‚ 3     โ”‚ 4     โ”‚

I did not know about this. So Tables.columns effectively uses the :subset strategy of push!, which is also visible here:

julia> t = [(a = 1, b = 2), (b=3,a=4)]
2-element Array{NamedTuple{names,Tuple{Int64,Int64}} where names,1}:
 (a = 1, b = 2)
 (b = 3, a = 4)

julia> DataFrame(t)
2ร—2 DataFrame
โ”‚ Row โ”‚ a     โ”‚ b     โ”‚
โ”‚     โ”‚ Int64 โ”‚ Int64 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 1     โ”‚ 2     โ”‚
โ”‚ 2   โ”‚ 4     โ”‚ 3     โ”‚

@quinnj - is this intended, or this is just a consequence of performance considerations?

For any Tables.jl source without a well-defined schema (i.e. returns nothing from Tables.schema(tbl)), which in this case, includes Arrays of non-homogenous NamedTuples, it uses the Tables.columnnames of the first row. So I don't think that's necessarily :subset because I think it's actually an error if a subsequent row doesn't have one of the column names from the first row.

Actually what the example at https://github.com/JuliaData/DataFrames.jl/issues/2150#issuecomment-598866058 shows is that new columns that appear after the first row are silently ignored. Is that intended?

Ah - right, so it is :intersect option then with promote=true (so that we promote column types if needed, which is added in #2152 PR) - right?

Actually what the example at #2150 (comment) shows is that new columns that appear after the first row are silently ignored. Is that intended?

Yes, extra columns _not_ in the first row will be ignored; what I meant in my comment is that if the _first_ row has extra columns that aren't present in subsequent rows, _that_ would be an error (since it would be trying to get a column that didn't exist on later rows).

Yes - this is exactly :intersect.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

garborg picture garborg  ยท  8Comments

yakir12 picture yakir12  ยท  6Comments

cossio picture cossio  ยท  5Comments

CameronBieganek picture CameronBieganek  ยท  6Comments

davidanthoff picture davidanthoff  ยท  4Comments