I'm trying to build up a data frame one row at a time, except that not every row has all the columns. I'd like to be able to push! the rows on one at a time, filling in the missing columns with missing. Using cols=:subset gets me halfway there:
julia> df = DataFrame()
0×0 DataFrame
julia> push!(df, (a=1, b=2))
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
julia> allowmissing!(df)
1×2 DataFrame
│ Row │ a      │ b      │
│     │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┤
│ 1   │ 1      │ 2      │
julia> push!(df, (a=1, c=3), cols=:subset)
2×2 DataFrame
│ Row │ a      │ b       │
│     │ Int64⍰ │ Int64⍰  │
├─────┼────────┼─────────┤
│ 1   │ 1      │ 2       │
│ 2   │ 1      │ missing │
But notice that no column for c is added. Having to call allowmissing! is also potentially annoying, since I'll either have to call it for every row or keep track, in the loop over rows, of which columns already allow missing values. But I think that's more related to #1716.
I'd like either an additional argument like fill, or another value for cols which opts into adding columns as they are encountered.
I think it is a duplicate (actually a subset) of https://github.com/JuliaData/DataFrames.jl/issues/2032. Could you please have a look there? If it is, this can be closed.
Just as a comment: things are not easy here, as you have to decide on a widening rule. If you have a column of Ints and want to push a float to it, should the column then be converted to floats?
In particular, maybe you want something simpler, which could be implemented before https://github.com/JuliaData/DataFrames.jl/issues/2032. Something like cols=:union where, as you proposed, columns not yet present would be added, filled with missing in the existing rows, and the new entry would keep the type it has.
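The cols=:union behavior described above can be approximated with calls already shown in this thread. The following is only a sketch: `push_union!` is a hypothetical helper, not part of DataFrames.jl, and it assumes each row is a NamedTuple.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function): emulate "union" semantics
# by adding unseen columns first, then pushing with cols=:subset.
function push_union!(df::DataFrame, row::NamedTuple)
    for name in keys(row)
        if !hasproperty(df, name)
            # New column: missing in all existing rows, typed after the new entry.
            T = Union{Missing, typeof(row[name])}
            df[!, name] = Vector{T}(missing, nrow(df))
        end
    end
    # Every column must accept missing, because the row may omit some of them.
    allowmissing!(df)
    # Note: this does not decide how to widen a column when the value types differ.
    push!(df, row; cols=:subset)
    return df
end

df = DataFrame()
push_union!(df, (a=1, b=2))
push_union!(df, (a=1, c=3))
```

Calling allowmissing! on every push is the bookkeeping-free but wasteful route mentioned in the original question; the sketch deliberately leaves the widening question open.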
Having to call allowmissing! is a separate issue. Potentially another kwarg could be added, like widen, but again it is tricky: which kind of promotion do we want to allow?
- Missing union
- promote_type (could lead to conversion of existing data)
- typeunion (will produce abstract types)
- Union of encountered types (this is not what is normally done)

In short: we know about the problem, but deciding on a robust and future-proof design here is hard :sob:, so for now we disallowed this.
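To make the promotion options listed above concrete, here is a small sketch using only Base Julia. It reads the "typeunion" option as Base's `typejoin`, which is an assumption on my part; the variable names are illustrative.

```julia
# Candidate widening rules when an Int column meets a Float64 value.
T_old, T_new = Int, Float64

promoted  = promote_type(T_old, T_new)  # concrete type; existing data must be converted
joined    = typejoin(T_old, T_new)      # closest common supertype, which is abstract here
unioned   = Union{T_old, T_new}         # Union of the encountered types
with_miss = Union{T_old, Missing}       # only allow missing, no other widening

col = [1, 2]

# promote_type route: the existing Ints are converted to Float64.
col_promoted = convert(Vector{promoted}, col)
push!(col_promoted, 2.5)

# Union route: stored values keep their original types.
col_unioned = convert(Vector{unioned}, col)
push!(col_unioned, 2.5)
```

The two resulting vectors hold equal values but differ in element type, which is the trade-off between converting existing data and carrying a wider column type.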
If you have an idea for a minimal API extension that would do what is expected most of the time, that would be great (my fear is that a fully flexible API here would be overkill for a regular user, but maybe I am wrong).
We could expand vcat to work with anything that is property accessible.
The problem is how we should determine the return type. Would you want to dispatch on the first argument to vcat (i.e. if the first argument is an AbstractDataFrame then we return a DataFrame)? This is probably doable (I have just checked that there should be no dispatch ambiguities, which was a high risk given that vcat in Base is pretty flexible). Then would you want anything that is Tables.jl compliant to be accepted?
Also, I think converting via DataFrame! before vcat-ting should be cheap relative to the cost of vcat itself, so this is not crucial to have. But maybe I am missing something (for row-oriented storage it is probably better to convert to column-oriented before vcat anyway).
Yes, I think it is best to keep vcat for DataFrame to DataFrame, given that making a DataFrame is cheap.
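A sketch of the convert-then-vcat route: each row is turned into a one-row DataFrame and the pieces are concatenated. This assumes a DataFrames.jl version in which vcat accepts cols=:union; the variable names are illustrative.

```julia
using DataFrames

rows = [(a = 1, b = 2), (a = 3, c = 5)]

# Wrap each NamedTuple in a one-element vector so it is read as a one-row table.
parts = [DataFrame([r]) for r in rows]

# Assumes vcat supports cols=:union, filling absent columns with missing.
df = vcat(parts...; cols = :union)
```

Building one DataFrame per row is convenient for a handful of rows; for many rows it allocates a lot of small tables, so it is a sketch of the semantics rather than a recommendation.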
This could also be handled at the Tables.jl level, which could provide a function to fill in columns and expand types as needed. For example
julia> t = [(a = 1, b = 2), (a = 3, b = 4, c = 5)]
2-element Array{NamedTuple,1}:
(a = 1, b = 2)
(a = 3, b = 4, c = 5)
julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │
I did not know about this. So Tables.columns effectively uses the :subset strategy of push!, which is also visible here:
julia> t = [(a = 1, b = 2), (b=3,a=4)]
2-element Array{NamedTuple{names,Tuple{Int64,Int64}} where names,1}:
(a = 1, b = 2)
(b = 3, a = 4)
julia> DataFrame(t)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 4     │ 3     │
@quinnj - is this intended, or is this just a consequence of performance considerations?
For any Tables.jl source without a well-defined schema (i.e. one that returns nothing from Tables.schema(tbl)), which in this case includes Arrays of non-homogeneous NamedTuples, it uses the Tables.columnnames of the first row. So I don't think that's necessarily :subset, because I think it's actually an error if a subsequent row doesn't have one of the column names from the first row.
Actually what the example at https://github.com/JuliaData/DataFrames.jl/issues/2150#issuecomment-598866058 shows is that new columns that appear after the first row are silently ignored. Is that intended?
Ah, right, so it is the :intersect option then, with promote=true (so that we promote column types if needed, which is added in PR #2152), right?
Actually what the example at #2150 (comment) shows is that new columns that appear after the first row are silently ignored. Is that intended?
Yes, extra columns _not_ in the first row will be ignored; what I meant in my comment is that if the _first_ row has extra columns that aren't present in subsequent rows, _that_ would be an error (since it would be trying to get a column that didn't exist on later rows).
Yes - this is exactly :intersect.
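Given the first-row behavior discussed above, one way to get the union of columns today is to pad the rows before handing them to DataFrame. This is only a sketch: `pad_rows` is a hypothetical helper, not a Tables.jl or DataFrames.jl function, and it assumes the rows are NamedTuples.

```julia
using DataFrames

# Hypothetical helper: give every row the same set of names,
# filling absent entries with missing.
function pad_rows(rows)
    allnames = Symbol[]
    for r in rows, k in keys(r)
        k in allnames || push!(allnames, k)   # keep first-seen order
    end
    cols = Tuple(allnames)
    return [NamedTuple{cols}(Tuple(get(r, k, missing) for k in cols)) for r in rows]
end

t = [(a = 1, b = 2), (a = 3, b = 4, c = 5)]
padded = pad_rows(t)

# All rows now share the first row's names, so no column is dropped.
df = DataFrame(padded)
```

The padding makes two passes over the rows, so it suits in-memory collections rather than streaming sources.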