The manual shows how to replace values in multiple columns, e.g.
df2 = ifelse.(df .== 999, missing, df)
That's a neat trick, but it would be convenient if we had replace and replace! methods for data frames. Something like the following:
df2 = replace(df, :, 999 => missing)
df3 = replace(df, Between(:a, :c), 999 => missing)
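To make the proposal concrete, here is a minimal sketch of how such a method could behave. The name `replace_cols` is hypothetical and is not part of DataFrames.jl; it is a standalone helper so that nothing in Base or DataFrames is overloaded. It only assumes that `names(df, cols)` accepts the usual column selectors.

```julia
using DataFrames

# Hypothetical helper (NOT a DataFrames.jl function): apply Base.replace
# to every selected column and return a new data frame.
function replace_cols(df::AbstractDataFrame, cols, old_new::Pair...)
    out = copy(df)                       # copies the columns, so df is untouched
    for name in names(out, cols)         # resolve the selector to column names
        # Base.replace allocates a new vector and widens the eltype if needed
        out[!, name] = replace(out[!, name], old_new...)
    end
    return out
end

df = DataFrame(a = [1, 999], b = [999, 2], c = [3, 4], d = [999, 5])
df2 = replace_cols(df, :, 999 => missing)                  # all columns
df3 = replace_cols(df, Between(:a, :c), 999 => missing)    # columns a through c only
```

Because the helper goes through the non-mutating `replace`, columns that held plain `Int64` values come back as `Union{Missing, Int64}`, matching what Base does for vectors.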
Of course the eltype conversion behavior would mirror the behavior for Base.replace:
julia> y = [2, 5, 999, 7];
julia> replace(y, 999 => missing)
4-element Array{Union{Missing, Int64},1}:
2
5
missing
7
julia> replace!(y, 999 => missing)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
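The in-place call fails only because `y` has eltype `Int64` and cannot hold `missing`. As a small illustration in plain Base Julia (the variable name `ym` is just for this example), widening the eltype first lets `replace!` succeed:

```julia
y = [2, 5, 999, 7]

# Converting to a wider eltype makes a copy that is allowed to hold `missing`.
ym = convert(Vector{Union{Missing, Int}}, y)

# Now the in-place replacement works; `y` itself is left unchanged.
replace!(ym, 999 => missing)
```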
Why is
replace!.(eachcol(df[!, cols]), 999 => missing)
not enough for you? (This would be in-place.)
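One caveat with the in-place route is that the selected columns must already allow `missing`, for the same reason as the `replace!` error on a plain vector. A rough sketch, assuming a small made-up data frame and using `allowmissing!` from DataFrames.jl to widen the columns first (the `foreach` loop is just an explicit spelling of the broadcast over `eachcol`):

```julia
using DataFrames

df = DataFrame(a = [1, 999], b = [999, 2], c = ["x", "y"])
cols = [:a, :b]

# Widen the selected columns to Union{Missing, Int64} so they can store `missing`.
allowmissing!(df, cols)

# df[!, cols] shares its column vectors with df, so mutating them updates df.
foreach(col -> replace!(col, 999 => missing), eachcol(df[!, cols]))
```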
The issue with replace and replace! is that they would treat a data frame as a matrix, whereas we tend to define functions that treat it as a collection of rows. It would not be the end of the world, but still...
So let us wait and see what others think.
Hmm, well I think my suggested API for the replace data frame methods is intuitive at least. And maybe I haven't been following closely enough, but parts of the current API feel more column oriented to me. For example:
julia> df = DataFrame(a = 2:3);
julia> transform(df, :a => (x -> log.(x)) => :log_a)
2×2 DataFrame
│ Row │ a     │ log_a    │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.693147 │
│ 2   │ 3     │ 1.09861  │
julia> transform(df, :a => ByRow(log) => :log_a)
2×2 DataFrame
│ Row │ a     │ log_a    │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.693147 │
│ 2   │ 3     │ 1.09861  │
That feels to me like transform is treating :a as a column. You have to explicitly use ByRow() if you want to get by-row behavior.
But it could be my bias coming from R/dplyr where table manipulation is usually column based.
A nice side benefit would be that replace!(::DataFrame, ...) would probably return the modified data frame, rather than the modified columns. I currently have this in one of my functions:
function foo(df)
    replace!(df.x, 9 => missing)  # returns only the column vector df.x
    df
end
which would reduce to this under the new syntax:
function foo(df)
    replace!(df, :x, 9 => missing)
end
It is true that select/transform/combine work differently, and we could add replace to this group. I was just noting a general trend we want to follow (though I agree that what is intuitive and useful should be taken into consideration). Let us see what other people think and then decide.
I think that replace!.(eachcol(df[!, cols]), nothing => missing) is "sufficient".
But I'm in favor of replace!(::AbstractDataFrame, cols, ...). It's a logical function call to those that don't necessarily know/want to know how a DataFrame is implemented.
The ifelse syntax is not all that memorable/intuitive, and replace! already exists.
It'd remove a small pain point for new users, I think.
Another option is just:
select!(df, cols .=> x -> replace!(x, nothing => missing), renamecols=false)
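A variant of the same idea, sketched here with made-up data, uses the non-mutating `replace` inside `transform`, so the columns do not need to allow `missing` beforehand. The `renamecols` keyword is assumed to be available, as in the call above.

```julia
using DataFrames

df = DataFrame(a = [1, 999], b = [999, 2])
cols = [:a, :b]

# Each selected column is passed to the function as a whole vector;
# renamecols = false keeps the original column names in the result.
df2 = transform(df, cols .=> (x -> replace(x, 999 => missing)), renamecols = false)
```

Here `df` is left untouched and `df2` gets columns of eltype `Union{Missing, Int64}`.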