DataFrames.jl: Add replace!(::AbstractDataFrame, cols, ...) method

Created on 14 May 2020 · 6 Comments · Source: JuliaData/DataFrames.jl

The manual shows how to replace values in multiple columns, e.g.

df2 = ifelse.(df .== 999, missing, df)

That's a neat trick, but it would be convenient if we had replace and replace! methods for data frames. Something like the following:

df2 = replace(df, :, 999 => missing)

df3 = replace(df, Between(:a, :c), 999 => missing)

Of course the eltype conversion behavior would mirror the behavior for Base.replace:

julia> y = [2, 5, 999, 7];

julia> replace(y, 999 => missing)
4-element Array{Union{Missing, Int64},1}:
 2
 5
  missing
 7

julia> replace!(y, 999 => missing)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
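
To make the difference between the two calls concrete, here is a small sketch using only Base. The name y_wide is just illustrative; the point is that replace allocates a new vector and can widen the eltype, while replace! can only succeed once the container already admits missing.

```julia
y = [2, 5, 999, 7]                       # Vector{Int64}

# Non-mutating: a new vector is allocated, so its eltype can widen.
z = replace(y, 999 => missing)           # eltype is Union{Missing, Int64}

# In-place: widen the container first, then replace! has room for `missing`.
y_wide = convert(Vector{Union{Missing, Int}}, y)
replace!(y_wide, 999 => missing)
```

A data frame method would presumably need to make the same distinction per column.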

Labels: decision, non-breaking

All 6 comments

Why is

replace!.(eachcol(df[!, cols]), 999 => missing)

not enough for you? (This would be in-place.)
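
One caveat with the in-place form: as with Base.replace!, each selected column must already be able to hold missing. A minimal sketch, assuming integer columns and an illustrative selection cols = [:a, :b] (the example data is made up); allowmissing! is the DataFrames.jl function that widens column eltypes:

```julia
using DataFrames

df = DataFrame(a = [1, 999], b = [999, 4], c = ["x", "y"])
cols = [:a, :b]                       # illustrative column selection

# Widen the selected columns to Union{Missing, Int64} so `missing` can be stored.
allowmissing!(df, cols)

# df[!, cols] shares its columns with df, so replace! mutates df itself.
foreach(col -> replace!(col, 999 => missing), eachcol(df[!, cols]))
```

Without the allowmissing! step, the replace! call would throw the same MethodError as on a plain Vector{Int64}.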

The issue with replace and replace! is that they would treat a data frame as a matrix, and we tend to define functions that treat it as a collection of rows. It would not be the end of the world, but still ...

So let us wait and see what others think.

Hmm, well I think my suggested API for the replace data frame methods is intuitive at least. And maybe I haven't been following closely enough, but parts of the current API feel more column oriented to me. For example:

julia> df = DataFrame(a = 2:3);

julia> transform(df, :a => (x -> log.(x)) => :log_a)
2×2 DataFrame
│ Row │ a     │ log_a    │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.693147 │
│ 2   │ 3     │ 1.09861  │

julia> transform(df, :a => ByRow(log) => :log_a)
2×2 DataFrame
│ Row │ a     │ log_a    │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.693147 │
│ 2   │ 3     │ 1.09861  │

That feels to me like transform is treating :a as a column. You have to explicitly use ByRow() if you want to get by-row behavior.

But it could be my bias coming from R/dplyr where table manipulation is usually column based.

A nice side benefit would be that replace!(::DataFrame, ...) would probably return the modified data frame, rather than the modified columns. I currently have this in one of my functions:

function foo(df)
    replace!(df.x, 9 => missing)  # only returns the array :x
    df
end

which would reduce to this under the new syntax:

function foo(df)
    replace!(df, :x, 9 => missing)
end
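
For discussion, the proposed behavior can be sketched with two small helpers. The names replace_cols and replace_cols! are hypothetical and not part of DataFrames.jl; the sketch only assumes that names(df, cols) resolves a column selector to column names.

```julia
using DataFrames

# Hypothetical helpers for illustration; these names are not part of DataFrames.jl.

# Non-mutating: builds new columns, so eltypes may widen as with Base.replace.
function replace_cols(df::AbstractDataFrame, cols, old_new::Pair...)
    out = copy(df)
    for name in names(df, cols)       # resolves :, Between, vectors, single names
        out[!, name] = replace(df[!, name], old_new...)
    end
    return out
end

# Mutating: updates the columns in place and returns the data frame itself.
# Like Base.replace!, it throws if a column's eltype cannot hold the new value.
function replace_cols!(df::AbstractDataFrame, cols, old_new::Pair...)
    for name in names(df, cols)
        replace!(df[!, name], old_new...)
    end
    return df
end

df = DataFrame(a = [1, 999], b = [999, 4], c = [999, 6])
df2 = replace_cols(df, Between(:a, :b), 999 => missing)   # df is left unchanged
replace_cols!(df, :c, 999 => 0)                            # returns df
```

Returning df from the mutating helper is what lets foo above collapse to a single call.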

It is true that select/transform/combine work differently, and we could add replace to this group. I was just noting a general trend we want to follow (but I still agree that what is intuitive and useful should be taken into consideration). Let us see what other people think and then decide.

I think that replace!.(eachcol(df[!, cols]), nothing => missing) is "sufficient".

But I'm in favor of replace!(::AbstractDataFrame, cols, ...). It's a logical function call to those that don't necessarily know/want to know how a DataFrame is implemented.

The ifelse syntax is not all that memorable/intuitive, and replace! already exists.

It'd remove a small pain point for new users, I think.

Another option is just:

transform!(df, cols .=> (x -> replace!(x, nothing => missing)), renamecols=false)