Data.table: indicator for full outer join: left match, right match

Created on 23 Jul 2020 · 3Comments · Source: Rdatatable/data.table

I hope to have a merge indicator feature in data.table.merge() function, like merge(x,y, by='a', all=TRUE, indicator=TRUE). When enabled, it generates _merge column after outer join that indicates the merge result with left_only or right_only or both as below example. Currently this feature is supported in pandas.

key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)

> df1
   key v1
1:   a  1
2:   b  2
3:   c  3
4:   d NA
5:   e  5

> df2
   key v2
1:   a  4
2:   b  5
3:   d  6
4:   f  7

# merge result I'd like to have

   key v1 v2     _merge
1:   a  1  4       both
2:   b  2  5       both
3:   c  3 NA  left_only
4:   d NA  6       both # <- not right_only, both
5:   e  5 NA  left_only
6:   f NA  7 right_only

feature request joins

Source

matthewgson

👍1

Most helpful comment

Just for completeness, as of now to get desire results, the easiest way is to create extra column with any non-NA value, can easily be a constant 1L. Then after making full outer join, based on those extra columns one can know if there was a match. NA means a non-match, while the original value means a match.

Note that there is alternative approach suggested for full outer join in #4370 PR. Feature you requested is not there.

Also be aware it is better not to name your data.table column key, because it overlaps with data.table() arguments and a function in package.

I am providing working example below, could eventually be re-used in PR to resolve this FR, or eventually left as is and FR closed as "not in scope". Looking forward for feedback from others.

library(data.table)
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.table(id1=key1, v1)
df2 = data.table(id1=key2, v2)

merge2 = function(x, y, ..., indicator=FALSE) {
  stopifnot(is.data.table(x), is.data.table(y), is.logical(indicator), !is.na(indicator))
  if (indicator) {
    x[, "__tmp_x" := 1L]
    y[, "__tmp_y" := 1L]
  }
  ans = merge.data.table(x, y, ...)
  if (indicator) {
    ans[!is.na(`__tmp_x`) & !is.na(`__tmp_y`), "_merge" := "both"]
    ans[!is.na(`__tmp_x`) & is.na(`__tmp_y`), "_merge" := "left_only"]
    ans[is.na(`__tmp_x`) & !is.na(`__tmp_y`), "_merge" := "right_only"]
    ans[, c("__tmp_x","__tmp_y") := NULL]
  }
  ans[]
}
merge2(df1, df2, by='id1', all=TRUE, indicator=TRUE)
#   id1 v1 v2     _merge
#1:   a  1  4       both
#2:   b  2  5       both
#3:   c  3 NA  left_only
#4:   d NA  6       both
#5:   e  5 NA  left_only
#6:   f NA  7 right_only

jangorecki on 26 Aug 2020

👍2

All 3 comments

Note that there is alternative approach suggested for full outer join in #4370 PR. Feature you requested is not there.

Also be aware it is better not to name your data.table column key, because it overlaps with data.table() arguments and a function in package.

I am providing working example below, could eventually be re-used in PR to resolve this FR, or eventually left as is and FR closed as "not in scope". Looking forward for feedback from others.

library(data.table)
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.table(id1=key1, v1)
df2 = data.table(id1=key2, v2)

merge2 = function(x, y, ..., indicator=FALSE) {
  stopifnot(is.data.table(x), is.data.table(y), is.logical(indicator), !is.na(indicator))
  if (indicator) {
    x[, "__tmp_x" := 1L]
    y[, "__tmp_y" := 1L]
  }
  ans = merge.data.table(x, y, ...)
  if (indicator) {
    ans[!is.na(`__tmp_x`) & !is.na(`__tmp_y`), "_merge" := "both"]
    ans[!is.na(`__tmp_x`) & is.na(`__tmp_y`), "_merge" := "left_only"]
    ans[is.na(`__tmp_x`) & !is.na(`__tmp_y`), "_merge" := "right_only"]
    ans[, c("__tmp_x","__tmp_y") := NULL]
  }
  ans[]
}
merge2(df1, df2, by='id1', all=TRUE, indicator=TRUE)
#   id1 v1 v2     _merge
#1:   a  1  4       both
#2:   b  2  5       both
#3:   c  3 NA  left_only
#4:   d NA  6       both
#5:   e  5 NA  left_only
#6:   f NA  7 right_only

jangorecki on 26 Aug 2020

👍2

The implementation looks very clean! The original x and y would likely need cleanup too so they do not have the additional indicator column. E.g.

ans = merge.data.table(x, y, ...)
if (indicator) {
  ...
  x[, __tmp_x := NULL]
  y[,  __tmp_y := NULL]
}

This seems in scope, probably because the implementation is relatively straight forward. I'll +1 yours and feel free to mark this as resolved or delete. I just didn't want to edit yours willy nilly.