Hello,
After asking on SO (https://stackoverflow.com/questions/56684447/convert-a-julia-dataframe-column-with-string-to-one-with-int-and-missing-values/56685891?noredirect=1#comment99940131_56685891), I think this should in fact be discussed here.
I need to convert the following DataFrame
julia> df = DataFrame(:A=>["", "2", "3"], :B=>[1.1, 2.2, 3.3])
which looks like
3×2 DataFrame
│ Row │ A      │ B       │
│     │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1   │        │ 1.1     │
│ 2   │ 2      │ 2.2     │
│ 3   │ 3      │ 3.3     │
I would like to convert column A from Array{String,1} to an array of Int with missing values.
I tried
julia> df.A = tryparse.(Int, df.A)
3-element Array{Union{Nothing, Int64},1}:
nothing
2
3
julia> df
3×2 DataFrame
│ Row │ A      │ B       │
│     │ Union… │ Float64 │
├─────┼────────┼─────────┤
│ 1   │        │ 1.1     │
│ 2   │ 2      │ 2.2     │
│ 3   │ 3      │ 3.3     │
julia> eltype(df.A)
Union{Nothing, Int64}
but I get column A with elements of type Union{Nothing, Int64}.
nothing (of type Nothing) and missing (of type Missing) seem to be two different kinds of values.
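To make the difference concrete, here is a minimal sketch using only Base (no DataFrames needed). The variable name `threw` is just for illustration.

```julia
# `nothing` and `missing` are singleton values of two unrelated types.
@assert typeof(nothing) === Nothing
@assert typeof(missing) === Missing

# `missing` propagates through arithmetic and comparisons...
@assert ismissing(missing + 1)
@assert ismissing(missing == 1)

# ...whereas `nothing` has no arithmetic methods, so using it this way throws.
threw = try
    nothing + 1
    false
catch err
    err isa MethodError
end
@assert threw
```

This is why a column of Union{Nothing, Int64} does not behave like a column with missing values in downstream computations.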
After asking on SO, it seems that a solution could be
julia> df.A = map(x->begin val = tryparse(Int, x)
                           ifelse(typeof(val) == Nothing, missing, val)
                       end, df.A)
3-element Array{Union{Missing, Int64},1}:
missing
2
3
Although it perfectly answered my question, I don't think this is something we can expect DataFrames users to write.
Maybe we should have a function which replaces nothing with missing, or maybe another approach could be to have another definition of the tryparse function (one which could output missing).
What is your opinion?
Kind regards
I noticed that replacing nothing by missing can be done using:
df.A = replace(df.A, nothing=>missing)
Maybe the docs should provide such a DataFrame example (with values stored as String, tryparse to parse them as Int, and replace to turn nothing into missing).
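A rough sketch of what such an example could look like, using a plain vector in place of df.A so that it runs without DataFrames. The names `A`, `parsed` and `cleaned` are made up for the illustration.

```julia
# Stand-in for the String column df.A from the question.
A = ["", "2", "3"]

# Step 1: tryparse returns `nothing` for strings that are not valid integers.
parsed = tryparse.(Int, A)

# Step 2: swap every `nothing` for `missing`.
cleaned = replace(parsed, nothing => missing)

# The unparsable entry is now `missing`; the others are Ints.
@assert ismissing(cleaned[1])
@assert cleaned[2] == 2 && cleaned[3] == 3
```

With a DataFrame the same two steps would be applied to the column, e.g. assigning the result back to df.A as in the snippet above.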
Having tryparse able to return missing directly when a String can't be parsed would simplify this: https://github.com/JuliaLang/julia/issues/32378
How about:
tryparsem(T, str) = something(tryparse(T, str), missing)
df.A = tryparsem.(Int, df.A)
I didn't know about the something function. Thanks for the idea.
Should tryparsem be included in Base or in DataFrames.jl or in user code?
This idea is too clever not to be part of a package or the language itself :wink:
I think the reason it isn't included is because of how simple it is. As long as you're aware of the something function (and its missing counterpart coalesce), there are some really quick ways to switch between things.
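For anyone landing here, a small sketch of how something and coalesce can be used to switch between the two conventions. It uses a plain vector rather than a DataFrame column, and the names `parsed`, `with_missing` and `filled` are only illustrative.

```julia
# Result of a tryparse broadcast: `nothing` marks the entry that failed to parse.
parsed = tryparse.(Int, ["", "2", "3"])

# something returns its first argument that is not `nothing`,
# so the fallback `missing` is used wherever parsing failed.
with_missing = something.(parsed, missing)
@assert ismissing(with_missing[1])
@assert with_missing[2] == 2

# coalesce is the counterpart for `missing`: it returns its first
# argument that is not `missing`. Here 0 is an arbitrary fill value.
filled = coalesce.(with_missing, 0)
@assert filled == [0, 2, 3]
```

Both are in Base, so no extra package is needed for this kind of conversion.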
@scls19fr - can this be closed given the solution given by @quinnj?
I'm still wondering what should be done here, and I definitely think that simply closing this is not the best action.
At the very least, the docs should be improved to mention this idea.
But I still don't know why we couldn't / shouldn't add such a function (even if it's so simple).
Providing such a function in Base or in DataFrames would encourage developers to use the same function name, which is (imho) good practice for code readability.
I am asking because this functionality is not related to DataFrames.jl. It should live in Base or Missings.jl (you can probably discuss it in Missings.jl first, as this is the place where experimental missing-related functionality is implemented before it is introduced in Base).
OK, it seems you opened a quite similar issue: https://github.com/JuliaData/Missings.jl/issues/61
Yes - but we were not sure what was the best way to do it.