I hope to have a merge indicator feature in data.table.merge() function, like merge(x,y, by='a', all=TRUE, indicator=TRUE). When enabled, it generates _merge column after outer join that indicates the merge result with left_only or right_only or both as below example. Currently this feature is supported in pandas.
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
> df1
key v1
1: a 1
2: b 2
3: c 3
4: d NA
5: e 5
> df2
key v2
1: a 4
2: b 5
3: d 6
4: f 7
# merge result I'd like to have
key v1 v2 _merge
1: a 1 4 both
2: b 2 5 both
3: c 3 NA left_only
4: d NA 6 both # <- not right_only, both
5: e 5 NA left_only
6: f NA 7 right_only
Just for completeness, as of now to get desire results, the easiest way is to create extra column with any non-NA value, can easily be a constant 1L. Then after making full outer join, based on those extra columns one can know if there was a match. NA means a non-match, while the original value means a match.
Note that there is alternative approach suggested for full outer join in #4370 PR. Feature you requested is not there.
Also be aware it is better not to name your data.table column key, because it overlaps with data.table() arguments and a function in package.
I am providing working example below, could eventually be re-used in PR to resolve this FR, or eventually left as is and FR closed as "not in scope". Looking forward for feedback from others.
library(data.table)
key1 = c('a','b','c','d','e')
v1 = c(1,2,3, NA, 5)
key2 = c('a','b','d','f')
v2 = c(4,5,6,7)
df1 = data.table(id1=key1, v1)
df2 = data.table(id1=key2, v2)
merge2 = function(x, y, ..., indicator=FALSE) {
stopifnot(is.data.table(x), is.data.table(y), is.logical(indicator), !is.na(indicator))
if (indicator) {
x[, "__tmp_x" := 1L]
y[, "__tmp_y" := 1L]
}
ans = merge.data.table(x, y, ...)
if (indicator) {
ans[!is.na(`__tmp_x`) & !is.na(`__tmp_y`), "_merge" := "both"]
ans[!is.na(`__tmp_x`) & is.na(`__tmp_y`), "_merge" := "left_only"]
ans[is.na(`__tmp_x`) & !is.na(`__tmp_y`), "_merge" := "right_only"]
ans[, c("__tmp_x","__tmp_y") := NULL]
}
ans[]
}
merge2(df1, df2, by='id1', all=TRUE, indicator=TRUE)
# id1 v1 v2 _merge
#1: a 1 4 both
#2: b 2 5 both
#3: c 3 NA left_only
#4: d NA 6 both
#5: e 5 NA left_only
#6: f NA 7 right_only
The implementation looks very clean! The original x and y would likely need cleanup too so they do not have the additional indicator column. E.g.
ans = merge.data.table(x, y, ...)
if (indicator) {
...
x[, __tmp_x := NULL]
y[, __tmp_y := NULL]
}
This seems in scope, probably because the implementation is relatively straight forward. I'll +1 yours and feel free to mark this as resolved or delete. I just didn't want to edit yours willy nilly.
Better just do shallow copy at start
Most helpful comment
Just for completeness, as of now to get desire results, the easiest way is to create extra column with any non-NA value, can easily be a constant
1L. Then after making full outer join, based on those extra columns one can know if there was a match. NA means a non-match, while the original value means a match.Note that there is alternative approach suggested for full outer join in #4370 PR. Feature you requested is not there.
Also be aware it is better not to name your data.table column
key, because it overlaps withdata.table()arguments and a function in package.I am providing working example below, could eventually be re-used in PR to resolve this FR, or eventually left as is and FR closed as "not in scope". Looking forward for feedback from others.