DataFrames.jl: `similar(::DataFrame, 0)` changes column types

Created on 27 Nov 2017 · 14 Comments · Source: JuliaData/DataFrames.jl

Using DataFrames v0.11.0:

julia> row = DataFrame(a="foo")
1×1 DataFrames.DataFrame
│ Row │ a   │
├─────┼─────┤
│ 1   │ foo │

julia> df = similar(row, 0)
0×1 DataFrames.DataFrame


julia> append!(df, row)
ERROR: Column eltypes do not match
Stacktrace:
 [1] append!(::DataFrames.DataFrame, ::DataFrames.DataFrame) at /tmp/julia/v0.6/DataFrames/src/dataframe/dataframe.jl:765

julia> DataFrames.columns(row)
1-element Array{Any,1}:
 String["foo"]

julia> DataFrames.columns(df)
1-element Array{Any,1}:
 Union{Missings.Missing, String}[]

omus

Most helpful comment

We definitely need to relax this eltypes(df1) == eltypes(df2) || error("Column eltypes do not match") check. At the very least it should allow Union{T,Missing}, checking that there are no missing values in df2 columns for which df1 does not allow for missing values.

We could even go one step further and simply rely on the append! methods on column vectors to do the conversion, but if a failure happens some columns may have been mutated already. Maybe calling resize! on all columns in case of failure would be enough.
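The rollback idea in the paragraph above can be sketched on plain vectors. This is only an illustration under stated assumptions, not how DataFrames.jl implements append!: append_columns! is a hypothetical helper, the columns are ordinary Julia vectors standing in for data frame columns, and Julia 1.x is assumed (so Missing lives in Base).

```julia
# Hypothetical helper, not part of DataFrames.jl: append each source column to
# the matching destination column, and restore the original lengths on failure.
function append_columns!(dest, src)
    length(dest) == length(src) || throw(ArgumentError("column counts differ"))
    old_lengths = map(length, dest)      # remember lengths for a possible rollback
    try
        for (d, s) in zip(dest, src)
            append!(d, s)                # converts to eltype(d); throws if it cannot
        end
    catch
        for (d, n) in zip(dest, old_lengths)
            resize!(d, n)                # drop anything that was partially appended
        end
        rethrow()
    end
    return dest
end

# A String column can be appended to a Union{Missing, String} column.
names = Union{Missing, String}["foo"]
counts = Int[1]
append_columns!([names, counts], [["bar"], [2]])
```

If a later column fails to convert, the earlier columns have already grown, which is why the catch branch resizes every column rather than only the one that threw.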

nalimilan on 17 Jun 2018

👍2

All 14 comments

Maybe we should update the append! element check to use subtype checks:

all(issubtype(el2, el1) for (el1, el2) in zip(eltypes(df1), eltypes(df2)))
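As a rough illustration of what this check accepts, here is a sketch on plain vectors of types standing in for eltypes(df1) and eltypes(df2). The name compatible is hypothetical, `<:` is the operator form of issubtype, and Julia 1.x is assumed (Missing in Base rather than Missings.Missing).

```julia
# Hypothetical helper: every source eltype must be a subtype of the
# corresponding destination eltype.
compatible(dest_types, src_types) =
    length(dest_types) == length(src_types) &&
    all(s <: d for (d, s) in zip(dest_types, src_types))

# Destination allows missing, source does not: accepted.
wide_dest = compatible([Union{Missing, String}], [String])

# Destination is strict, source may hold missing: rejected.
strict_dest = compatible([String], [Union{Missing, String}])
```

Note that the check is directional: it only looks at the declared element types, so a source column typed Union{Missing, String} is rejected even when it happens to contain no missing values.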

omus on 27 Nov 2017

👍1

I guess we should change similar to respect column eltypes.

Using subtype checks in append! also makes sense, but note that in your example it wouldn't help since it's the first eltype which is a subtype of the second one. A more general solution would be to attempt a conversion (i.e. call copy! and see what happens), but it will be tricky to handle the case where conversion fails midway.

nalimilan on 27 Nov 2017

@omus The behavior of similar should now be correct on master, thank you for the report. I'll leave this open until we address append!

cjprybol on 8 Dec 2017

👍1

When loading a file with CSV.jl, it happens quite often that the columns of the data frame are a union with missing.
This makes it difficult to append a new row.

How am I expected to create a data frame whose columns are Union{Float64, Missings.Missing}?

julia> eltypes(DataFrame(c = 0.1::Union{Float64, Missings.Missing}))
1-element Array{Type,1}:
 Float64

It seems I'll have to copy a row from the data frame, change its values, and then maybe it will append...

It would be much better to be able to append a data frame with Float64 columns to one whose columns are Union{Float64, Missings.Missing}.

JonWel on 17 Jun 2018

@JonWel if you pass in a vector you can set the element type appropriately:

julia> eltypes(DataFrame(c = Union{Float64, Missings.Missing}[0.1]))
1-element Array{Type,1}:
 Union{Float64, Missings.Missing}
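The difference between the two attempts can be reproduced without DataFrames. The sketch below assumes Julia 1.x, where Missing is in Base; the variable names are only illustrative.

```julia
# `value::T` is a type assertion: it checks the value and returns it unchanged,
# so x is still a plain Float64.
x = 0.1::Union{Float64, Missing}
x_type = typeof(x)

# A typed array literal fixes the element type of the vector itself.
v = Union{Float64, Missing}[0.1]
v_eltype = eltype(v)

# Because the eltype admits missing, this push! succeeds.
push!(v, missing)
```

This is why the scalar form ends up as a Float64 column, while the vector form keeps the union element type.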

omus on 17 Jun 2018

@omus Thanks, I think I'm almost there.

One oddity remains:

julia> eltypes(df1)
Any[Union{Float64, Missings.Missing}[280.12, 285.12], Union{CategoricalArrays.CategoricalString{UInt32}, Missings.Missing}["B", "B"], Union{Float64, Missings.Missing}[0.0, 0.06], Union{Float64, Missings.Missing}[201900.0, 239700.0], Union{Float64, Missings.Missing}[2.0481e5, 3.12964e5]]
julia> eltypes(df2)
Any[Union{Float64, Missings.Missing}[280.12], CategoricalArrays.CategoricalString{UInt32}["B"], Union{Float64, Missings.Missing}[0.0], Union{Float64, Missings.Missing}[NaN], Union{Float64, Missings.Missing}[NaN]]

I tried something like this, but it fails at the append step:

julia> eltype(Union{CategoricalArrays.CategoricalString{UInt32}, Missings.Missing}[df1[2,3]])
Union{CategoricalArrays.CategoricalString{UInt32}, Missings.Missing}

julia> eltypes(df1)
Any[Union{Float64, Missings.Missing}[280.12, 285.12], Union{CategoricalArrays.CategoricalString{UInt32}, Missings.Missing}["B", "B"], Union{Float64, Missings.Missing}[0.0, 0.06], Union{Float64, Missings.Missing}[201900.0, 239700.0], Union{Float64, Missings.Missing}[2.0481e5, 3.12964e5]]
julia> eltypes(df2)
Any[Union{Float64, Missings.Missing}[280.12], Union{CategoricalArrays.CategoricalString{UInt32}, Missings.Missing}["B"], Union{Float64, Missings.Missing}[0.0], Union{Float64, Missings.Missing}[NaN], Union{Float64, Missings.Missing}[NaN]]

julia> append!(df1,df2)
ERROR: LoadError: MethodError: no method matching append!(::CategoricalArrays.CategoricalArray{Union{Missings.Missing, String},1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Missings.Missing}, ::Array{Union{CategoricalArrays.CategoricalString{UInt32}, Missings.Missing},1})

I worked around the issue by imposing a much simpler type when reading the CSV file: CSV.read(file, types=Dict("colx"=>String)). I can do that because that column never has missing values.

JonWel on 17 Jun 2018

nalimilan on 17 Jun 2018

👍2

I've filed PR https://github.com/JuliaData/DataFrames.jl/pull/1432.

nalimilan on 18 Jun 2018

This can be closed, right? I get this now:

julia> row = DataFrame(a="foo")
1×1 DataFrame
│ Row │ a      │
│     │ String │
├─────┼────────┤
│ 1   │ foo    │

julia> df = similar(row, 0)
0×1 DataFrame


julia> eltypes(df)
1-element Array{Type,1}:
 String

(BTW, it would make sense to print the column types even when there are no rows.)

nalimilan on 27 Sep 2018

The issue has been addressed. I would suggest adding an explicit test with similar as the #1432 PR doesn't actually do that.

omus on 27 Sep 2018

#1432 didn't change similar, but I don't remember which PR fixed it.

nalimilan on 27 Sep 2018

@nalimilan is this fixed now?