Dataframes.jl: Merge two DataFrames only to missing values

Created on 11 May 2020 · 22Comments · Source: JuliaData/DataFrames.jl

Suppose I have an dataframe called df_missing:

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ missing │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ missing │ 3     │

and I also have another dataset, called df_completion:

2x2 DataFrame
│ Row │ id    │ val   |
│     │ Int64 │ Int64 │ 
├─────┼───────┼───────┤
│ 1   │ 2     │ 5     │ 
│ 2   │ 8     │ 13    │

So my suggestion is: couldn't there be a more straightforward way to replace the missing values using the common id of the two dataframes, thus creating df_full?

6×3 DataFrame
│ Row │ id    │ val     │ other |
│     │ Int64 │ Int64?  │ Int64 │
├─────┼───────┼─────────┼───────┤
│ 1   │ 5     │ 1       │ 4     │
│ 2   │ 2     │ 5       │ 4     │
│ 3   │ 1     │ 3       │ 3     │
│ 4   │ 4     │ 8       │ 4     │
│ 5   │ 6     │ 2       │ 4     │
│ 6   │ 8     │ 13      │ 3     │

The two current ways that seem to be the best ones are:
julia> df_missing[in(df_completion.id).(df_missing.id), :val] = df_completion.val
or

julia> df_full = leftjoin(df_missing, df_completion, on = :id, makeunique = true);

julia> df_full.val = map(df_full.val, df_full.val_1) do a, b
       ismissing(a) ? b : a
       end;

julia> select!(df_full, Not(:val_1))

besides, of course, creating a loop. Maybe there could be a method for merging two DataFrames in this way? I think it's a very common problem.

feature

Source

chedieck

Most helpful comment

Some kind of functionality like this, in general an "update" would be really nice to add. We just need to find a general design that is flexible and future proof (i.e. to have a swiss army knife function like e.g. combine not just function that does only one specific thing).

As proposed above maybe we can add to leftjoin an update kwarg which takes a:

pair (:left => :right)
single value as a shorthand for :val => :val
vector of the above (if more updates are required in one operation)

bkamins on 15 May 2020

👍3

All 22 comments

I do not think you can do any better than joining and then combining the results. In particular in general (if if were to become a standard library function) you would have to define what should happen if either of the data frames has a duplicate entry in :id, what should happen if the right data frame has some entry, and the left data frame also contains it but it is not missing etc.

I general I believe that there are so many possible options that it is hard to define an API for such a function, simply an appropriate join should be used. Note then that the last operation you mention can be written as df_full.val .= coalesce.(df_full.val, df_full_val_1) which will be fast and update df_full.val in place.

bkamins on 11 May 2020

Stata has a good API for this via update

update: update missing values of same-named variables in master with values
from using

I think Bogumil's idea for this is elegant, and is good for 1.0. But I think following Stata on this could solve the coordination problem for what a good option is.

My proposal would be

leftjoin(df1, df2, on = :id, update = :val)

This would perform the coalesce automatically. update could be a column identifier or a vector thereof. Such columns would be excempt from makeunique = true.

A quick google search shows that data.table has something similar to this but dplyr does not .

pdeffebach on 11 May 2020

What does Stata do when:

duplicate entries in df2 are present? Throw an error?
duplicate entries in df1 key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe the on column must be unique in general also in df1 in this case)?

bkamins on 11 May 2020

duplicate entries in df2 are present? Throw an error?

Interestingly, this will add both observations

. use df_completion_dups, clear

. list

     +----------+
     | id   val |
     |----------|
  1. |  2   220 |
  2. |  2   240 |
     +----------+

. use df_missing_normal.dta, clear

. qui merge 1:m id using df_completion_dups, update

. list

     +--------------------------------+
     | id   val                _merge |
     |--------------------------------|
  1. |  1   100       master only (1) |
  2. |  2   220   missing updated (4) |
  3. |  2   240   missing updated (4) |
     +--------------------------------+

duplicate entries in df1 key are present and for one of them the value is non-missing, while for the other the value is missing? Is this also an error? (or maybe the on column must be unique in general also in df1 in this case)?

You can specify whether or not the matches need to be unique. merge 1:1 will throw an error if there are any duplicates, merge m:1 (many to 1) will throw an error if the RHS (called using) has duplicates etc.

The default behavior in this case is also to add new rows. See below.

. use df_completion_dups, clear

. list

     +----------+
     | id   val |
     |----------|
  1. |  2   220 |
  2. |  2   240 |
     +----------+

. use df_missing_normal.dta, clear

. qui merge 1:m id using df_completion_dups, update

. list

     +--------------------------------+
     | id   val                _merge |
     |--------------------------------|
  1. |  1   100       master only (1) |
  2. |  2   220   missing updated (4) |
  3. |  2   240   missing updated (4) |
     +--------------------------------+

I will have to think more about the mapping of Stata's 1:m syntax to our joins and whether or not the behavior I propose would be intuitive.

pdeffebach on 11 May 2020

OK - let us try to find out a consistent design requirements (like we did select/transform/combine combo) and then we can implement it.

(and I will be commenting from a "defensive position" :smile:)

In particular I think that this operation is most intuitive if also allowed to be done "in place" (but we need API for this).

bkamins on 11 May 2020

Thanks. I will try and work on a proposal that is similar to stata but matches our current infrastructure. This is yet another opportunity to combine the best of stata and dplyr.

pdeffebach on 11 May 2020

❤1

I guess I am not that late to the party. Here would be the approach from SQL

UPDATE A
SET a = B.x, b = B.y
FROM B
WHERE A.id = B.id

Nosferican on 15 May 2020

As proposed above maybe we can add to leftjoin an update kwarg which takes a:

pair (:left => :right)
single value as a shorthand for :val => :val
vector of the above (if more updates are required in one operation)

bkamins on 15 May 2020

👍3

As proposed above maybe we can add to leftjoin an update kwarg which takes a:
• pair ( :left => :right )
• single value as a shorthand for :val => :val
• vector of the above (if more updates are required in one operation)

This works as long as you're not supporting various options on the update itself. Adding kwargs that only modify one other kwarg seems like a bad idea and cries out for splitting.

This is not a feature I've ever needed to use, so I don't have strong opinions on it, but the fact that it's found in stata and SQL makes a compelling case to add it as a separate function that could have its own options.

(i.e. to have a swiss army knife function like e.g. combine not just function that does only one specific thing).

Not sure if you've ever owned a Swiss army knife, but the downside of being able to do everything is that you don't do any of them super well. Not sure how much to torture the metaphor, but as a kid I always wanted the big fat ones with tons of different tools, because they seemed so cool, but my (Swiss) grandfather's with 2 blades and a bottle opener was almost always a better choice for actually doing anything with.

kescobo on 26 May 2020

As discussed on Slack - we have a tension here. We can either leave joins "as is" and add a new method doing updates of existing columns and working in place (something like update!) or we can extend existing functions with these functionalities. I am not really sure what would be the best approach.

Adding a new function is probably easier to understand. On the other hand it would do almost the same as leftjoin "on steroids" so I am not really sure what would be best.

bkamins on 26 May 2020

I think another function might be best. Since I was also going to bring up that not only does Stata have an update option, it also has a replace option.

So adding that would risk bloating the function even further. I would propose update and overwrite as added functions.

pdeffebach on 26 May 2020

👍1

I would propose update and overwrite as added functions.

What would be the exact difference between them? I was thinking of an API of the kind:

update!(df1, df2, on=..., args) and then args would be things that would specify the operations you want (so you could mix different operations you want: just add a column, replace, fill missings, whatever other ideas we have)

bkamins on 26 May 2020

One only replaces if missing, the other overwrites the whole columns. Maybe they could be the same function, though.

pdeffebach on 26 May 2020

I would prefer the same function and just have some nice way to specify the operation you want to do, I guess there are 4 natural operations:

fill missings
replace where second data frame has a value
fully replace (if second data frame does not have a value fill with missing)
add a column

bkamins on 26 May 2020

Actually, I've just thought of a problem with this approach. Lets say you have df1 and df2 where df2 has many new columns you are interested :age, :marital_status but also compete information for some columns in df1. So df1 and df2 both have the varriable :income but :income has missing values in df1

When you do

innerjoin(df1, df2, on = :id, makeunique = true)

you get :income and :income_1. A separate update function wouldn't help with this

No need to use update after the join happens: you can clean it up yourself easily
Doing update before the join happens will still result in either an error or :income_1
You can't do just the update since then you would lose the demographic variables of interest.

This kind of scenario is an argument for an update argument in innerjoin.

pdeffebach on 26 May 2020

I do not think it would be a problem, as adding demographic columns, would be a type 4 of the operation (just add columns).

So in a sense update would be a kind of select from df2 into df1. If you just pass a column - it behaves in the same way as join (except that by default nothing is added - join is a kind of transform in this sense), but you are allowed to do 3 different operations (update existing column in modes: only missing, everywhere where present and just replace old column).

bkamins on 26 May 2020

In Stata terms this is the keepusing keyword argument, where you choose to keep certain variables from the using (RHS) data set.

I'm not sure I totally agree with the approach above because generally the number of columns I want to be added and not overridden is an order of magnitude larger than the columns I do want overridden. I would inevitably do update(df1, df2, on = :id, overwrite = [:income], add = :) because I don't want to specify all of the demographic variables to tack on individually.

An additional problem: We have 5 different join functions, and all of them, save maybe antijoin, could use this kind of feature. I'm worried that we will have to re-invent complicated rules for the creation of new columns.

pdeffebach on 26 May 2020

Yeah, as usual - given the Stata experience - is it possible to put some proposal to start with?

bkamins on 26 May 2020

The SQL syntax I find it easier (example of in-place),

    t1
SET 
    t1.c1 = COALESCE(t1.c1, t2.c1)
FROM 
    t1
    INNER JOIN t2
    ON t1.id = t2.id
;

Nosferican on 26 May 2020

@Nosferican this is mostly what I have imagined ideally but without metaprogramming it is hard to come up with a concise syntax, i.e. it could be something like:

update!(fun, df1, df2, on=:id)

where fun would get two data frames as an argument having only matching rows in df1 and df2 and would follow the same rules of processing output as in combine(fun, df). Then what it produces would be appended to df1 and if the produced result would have columns that match the existing column then the column would be updated in matching places.

So your example would read:

update!(t1, t2, on=:id) do (t1, t2)
    return (c1 = coalesce.(t1.c1, t2.c2),)
end

This would be almost fully flexible (it would not allow full column replacement, but it is easy enough to almost achieve it using leftjoin).

Still I am not fully happy with this (nor any other syntax I could come up with). But maybe we will find the update! I propose "good enough"?

bkamins on 26 May 2020

OK. I have thought about it and here are some conclusions. Joins in DataFrames.jl work in three steps (I omit implementation details):

ROW MATCHING: appropriately resize df1 and df2 in terms of rows, drop from df2 joining columns, create two vectors indicating if given row was originally present in df1 and df2
TABLE MERGING: merge resized df1 and df2
POST PROCESSING: optionally add a column which is a transformation of indicator column

What we discuss here is essentially replacing TABLE MERGING step by a custom function. This means (I give na example on leftjoin but it applies to all other joins) that:

leftjoin(df1, df2, on=cols)

can be seen as calling

leftjoin(default_joiner, df1, df2, on=cols)

where default_joiner is a function taking 4 arguments (resized df1, resized df2, vector of row indicators for df1, vector of row indicators for df2) and performs a merge of these two data frames (by default just hcat-ing them; note that default_joiner mutates the first argument passed in place).

And actually we could expose this leftjoin(default_joiner, df1, df2, on=cols) as a public API and allow for users to provide custom joiners (where all things that are asked for can be implemented, most probably using do block).

Now if we went this way it is also natural how leftjoin! should be defined. Currently leftjoin passes to default_joiner a copy of df1. Now leftjoin! would manipulate df1 in place before passing it to default_joiner and by this the effect would be that leftjoin! would mutate df1 in place as needed.

The only problem is that leftjoin(custom_joiner, df1, df2, on=cols) is a very low level API, but maybe it is good enough for practical purposes?

bkamins on 27 May 2020

I also found this feature would be much helpful in our work.

the update! indicates the contents of df1 will be refreshed by the corresponding contents of df2, which match the leftjoin, while merge! means the content of df1 and df2 should be combined, which match the outerjoin. so I think the join APIs are appropriate to include this feature.

the problems here are how to handle content conflicts. current implementation will throw error or if makeunique=true leave conflicts to user.

but for some common use cases, for example, to update/leftjoin df1, if conflicts occurs, we usually want to overwrite content of df1 with the corresponding content of df2. From a user perspective, keywords makeunique=false, overwrite=true will be clear enough to express the intentions. of course, the content here would be single cell in a spreadsheet, not the whole column. if missing is to filled as much as possible, means if df1 have missing and df2 not, missing is replaced, but if df1 not and df2 have missing, data will not be evacuated by missing, another keywords writemissing=false should be ok.

for flexibility, a relatively low-level API @bkamins mentioned may be needed, but to the most users, I think join would be a good target to incorporate this feature.