Dplyr: feature request : add `merge` indicator after a merge in dplyr

Created on 18 Oct 2016  Â·  21Comments  Â·  Source: tidyverse/dplyr

Hello there,

Congrats for the great work here! I have a suggestion to make.

Is there a way to get the equivalent of a _merge indicator variable after a merge in Dplyr?

Something similar to Pandas' indicator = True option that essentially tells you how the merge went (how many matches from each dataset, etc).

Here is an example in Pandas

import pandas as pd

df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})

match = df1.merge(df2, how = 'left', indicator = True)

Here, after a left join between df1 and df2, you want to immediately know how many rows in df1 found a match in df2 and how many of them did not

Out[53]: 
  key1  v1   v2     _merge
0    a   1  4.0       both
1    b   2  5.0       both
2    c   3  NaN  left_only

and I can tabulate this merge variable:

match._merge.value_counts()
Out[52]: 
both          2
left_only     1
right_only    0
Name: _merge, dtype: int64

I don't see any option available after a, say, left join in Dplyr

key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)

> left_join(df1,df2, by = c('key1' = 'key2'))
  key1 v1 v2
1    a  1  4
2    b  2  5
3    c  3 NA

Am I missing something here?

Originally posted on SO http://stackoverflow.com/questions/40110644/dplyr-is-there-a-merge-indicator-available-after-a-merge.

The solution provided by the user is nice, but I believe this should be be an option implemented directly in dplyr merge. Having control over how the match went is fundamental.

What do you think?
Thanks!

feature verbs

Most helpful comment

Stata similarly creates a new variable _merge when doing any type of merge or join. It works nearly identically to the Pandas example above. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it, especially when doing basic data exploration and cleaning.

I'd also suggest considering an option that simply prints the merge results after a merge is done. For example, ___ cases in the newly merged data are from only the left df; ___ cases are from only the right, and ___ matched. Often times I just want to see that and don't always need the actual variable _merge

For the last few months I've been using basic functions I wrote that simply embellish the dplyr joins. There are probably more efficient ways of doing this, but here is an example of one that embellishes full_join. If you set the option .merge = T you'll get a variable similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.)

``` r function, evaluate = FALSE

full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
.merge = FALSE, ...){

# Checking to make sure used variable names are not already in use
if(".x_tracker" %in% names(x)){
    message("Warning: variable .x_tracker in left data was dropped")
}
if(".y_tracker" %in% names(y)){
    message("Warning: variable .y_tracker in right data was dropped")
}
if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
    stop("Variable .merge already exists; change name before proceeding")
}

# Adding simple merge tracker variables to data frames
x[, ".x_tracker"] <- 1
y[, ".y_tracker"] <- 1

# Doing full join
joined <- full_join(x, y, by = by, suffix = suffix,  ...)

# Calculating merge diagnoses 
matched <- joined %>%
    filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()
unmatched_x <- joined %>%
    filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
    NROW()
unmatched_y <- joined %>%
    filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()

# Print merge diagnoses
message(
    unmatched_x, " Rows ONLY from left data frame", "\n",
    unmatched_y, " Rows ONLY from right data frame", "\n",
    matched, " Rows matched"
)

# Create .merge variable if specified
if(.merge){
    joined <- joined %>%
        mutate(.merge = 
                   case_when(
                       !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                       is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                       TRUE ~ "matched"
                       )
               )
}

# Dropping tracker variables and returning data frame
joined <- joined %>%
    select(-.x_tracker, -.y_tracker)
return(joined)

}
```

All 21 comments

Stata similarly creates a new variable _merge when doing any type of merge or join. It works nearly identically to the Pandas example above. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it, especially when doing basic data exploration and cleaning.

I'd also suggest considering an option that simply prints the merge results after a merge is done. For example, ___ cases in the newly merged data are from only the left df; ___ cases are from only the right, and ___ matched. Often times I just want to see that and don't always need the actual variable _merge

For the last few months I've been using basic functions I wrote that simply embellish the dplyr joins. There are probably more efficient ways of doing this, but here is an example of one that embellishes full_join. If you set the option .merge = T you'll get a variable similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.)

``` r function, evaluate = FALSE

full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
.merge = FALSE, ...){

# Checking to make sure used variable names are not already in use
if(".x_tracker" %in% names(x)){
    message("Warning: variable .x_tracker in left data was dropped")
}
if(".y_tracker" %in% names(y)){
    message("Warning: variable .y_tracker in right data was dropped")
}
if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
    stop("Variable .merge already exists; change name before proceeding")
}

# Adding simple merge tracker variables to data frames
x[, ".x_tracker"] <- 1
y[, ".y_tracker"] <- 1

# Doing full join
joined <- full_join(x, y, by = by, suffix = suffix,  ...)

# Calculating merge diagnoses 
matched <- joined %>%
    filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()
unmatched_x <- joined %>%
    filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
    NROW()
unmatched_y <- joined %>%
    filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()

# Print merge diagnoses
message(
    unmatched_x, " Rows ONLY from left data frame", "\n",
    unmatched_y, " Rows ONLY from right data frame", "\n",
    matched, " Rows matched"
)

# Create .merge variable if specified
if(.merge){
    joined <- joined %>%
        mutate(.merge = 
                   case_when(
                       !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                       is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                       TRUE ~ "matched"
                       )
               )
}

# Dropping tracker variables and returning data frame
joined <- joined %>%
    select(-.x_tracker, -.y_tracker)
return(joined)

}
```

hello everyone! just a follow up, is anyone interested in coding that up one day? I think this would be a really useful feature

thanks!!

I'm still wondering how to implement this for SQL backends.

Hi @krlmlr , I dont know about SQL backends, but providing this info after a regular merge with a regular dataframe in RStudio would be, for itself, a great first step. In my humble opinion, having control over the matching process is probably one of the most important things when analysing data.

Thanks!

@adamMaier: Actually, after looking more carefully at your implementation, I think it can be changed so that it also works for SQL backends. In some cases, computing number of matched or unmatched rows might benefit from a compute() call for SQL backends.

@hadley: Are tracked joins a good fit for dplyr, or perhaps tidyr? For data frames, we could certainly be more efficient than adding tracker columns.

@krlmlr it seems like a good idea to me.

Thanks, all. Glad to see you're considering this.

@krlmlr : Yes, you can certainly implement a more efficient approach than what I wrote with additional tracker columns. My approach was just a simple fill-in until I could figure out a smarter way to do this. (I'm coming from a longtime Stata background, so don't know much about SQL backends.) Happy to help in any way I can.

@adamMaier: Actually, I don't think it's much more than to use mutate() instead of [<-(), and perhaps count(joined, ...) to compute matched, unmatched_x and unmatched_y.

Related: #1619.

Hello there @krlmlr @hadley ! I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on dplyr ? Cannot wait to see how my merge are doing :D

I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on dplyr ? Cannot wait to see how my merges are doing :D

In the meantime, one approach is something like this:

library(dplyr, warn.conflicts = FALSE)

df_x <- memdb_frame(a = 1:2600L, b = rep(letters, 100))

df_y <- memdb_frame(a = 1501:2500L, c = rep(LETTERS[1:10], 100))

merged <- df_x %>%
    mutate(on_x = 1L) %>%
    left_join(
        df_y %>% 
            mutate(on_y = 1L)) %>%
    mutate(on_x = coalesce(on_x, 0L),
           on_y = coalesce(on_y, 0L))
#> Joining, by = "a"

merged %>%
    count(on_x, on_y)
#> Source:   query [?? x 3]
#> Database: sqlite 3.11.1 [:memory:]
#> Groups: on_x
#> 
#>    on_x  on_y     n
#>   <int> <int> <int>
#> 1     1     0  1600
#> 2     1     1  1000

Or, using PostgreSQL (I think this would work with tibbles too).

library(dplyr, warn.conflicts = FALSE)

pg <- src_postgres()

df_x <- 
    tibble(a = 1:2600L, b = rep(letters, 100)) 

df_x <- copy_to(pg, df_x)

df_y <- tibble(a = 2001:3000L, c = rep(LETTERS[1:10], 100))

df_y <- copy_to(pg, df_y)

merged <- df_x %>%
    mutate(on_x = TRUE) %>%
    full_join(
        df_y %>% 
            mutate(on_y = TRUE)) %>%
    mutate(on_x = coalesce(on_x, FALSE),
           on_y = coalesce(on_y, FALSE))
#> Joining, by = "a"

merged %>%
    count(on_x, on_y)
#> Source:   query [?? x 3]
#> Database: postgres 9.6.2 [igow@hostname:5432/crsp]
#> Groups: on_x
#> 
#>    on_x  on_y     n
#>   <lgl> <lgl> <dbl>
#> 1 FALSE  TRUE   400
#> 2  TRUE FALSE  2000
#> 3  TRUE  TRUE   600

@iangow @hadley

thanks for the suggestion. I think that works but, again, it is a pain to create fake vectors of 1s in each dataframe every time one does a merge. The fact that @hadley removed this from to_do means you are dropping this feature from future versions of dplyr?

Thanks!

No, it just means it didn't make it for this release.

This would be a very useful feature indeed. Can't wait to see it implemented.

@DaniloImparato: I ended up putting a function to print diagnostics after implementing a dplyr join (as well as functions to print helpful information after other dplyr functions, like filter, mutate, and summarize) in a package, here. There is help documentation and a vignette in the package in case you're interested: reviewr vignette.pdf

I mostly created it for my own use and for data analysts with whom I work as I find having noisier functions helpful in the early data wrangling stages. I'm using it as a stand-in until something like this becomes a feature of dplyr.

@adamMaier Sounds cool. I'll give it a try :)

For the past week I've been flirting with matthieugomez' statar::join as it additionally sports some nice key integrity checking beforehand. It bugs me that such features aren't prioritized.

Possibly of interest: https://github.com/elbersb/tidylog
And especially this issue: https://github.com/elbersb/tidylog/issues/25
Feedback welcome!

Given that tidylog has solved this problem so wonderfully, I no longer think this needs to be in dplyr — thanks @elbersb !

While I appreciate the ability of tidylog to give summary information about how a merge went, the original request was for an option that would create a new variable that tells you if a given row was merged. I think that would still be really useful! For instance, I might like to subsequently subset the data to only the rows that merged for some operations. I was looking to see if this feature existed and found this thread.

@JackLandry I thought the same thing. It might be worth asking @elbersb if the wrapper functions in tidylog could be used to that end. The code above gives a stop-gap solution.

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Was this page helpful?
0 / 5 - 0 ratings