Hello there,
Congrats for the great work here! I have a suggestion to make.
Is there a way to get the equivalent of a _merge
indicator variable after a merge in Dplyr
?
Something similar to Pandas' indicator = True
option that essentially tells you how the merge went (how many matches from each dataset, etc).
Here is an example in Pandas
import pandas as pd
df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})
match = df1.merge(df2, how = 'left', indicator = True)
Here, after a left join
between df1
and df2
, you want to immediately know how many rows in df1
found a match in df2
and how many of them did not
Out[53]:
key1 v1 v2 _merge
0 a 1 4.0 both
1 b 2 5.0 both
2 c 3 NaN left_only
and I can tabulate this merge
variable:
match._merge.value_counts()
Out[52]:
both 2
left_only 1
right_only 0
Name: _merge, dtype: int64
I don't see any option available after a, say, left join in Dplyr
key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)
> left_join(df1,df2, by = c('key1' = 'key2'))
key1 v1 v2
1 a 1 4
2 b 2 5
3 c 3 NA
Am I missing something here?
Originally posted on SO http://stackoverflow.com/questions/40110644/dplyr-is-there-a-merge-indicator-available-after-a-merge.
The solution provided by the user is nice, but I believe this should be be an option implemented directly in dplyr
merge. Having control over how the match went is fundamental.
What do you think?
Thanks!
Stata similarly creates a new variable _merge
when doing any type of merge or join. It works nearly identically to the Pandas
example above. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it, especially when doing basic data exploration and cleaning.
I'd also suggest considering an option that simply prints the merge results after a merge is done. For example, ___ cases in the newly merged data are from only the left df; ___ cases are from only the right, and ___ matched. Often times I just want to see that and don't always need the actual variable _merge
For the last few months I've been using basic functions I wrote that simply embellish the dplyr
joins. There are probably more efficient ways of doing this, but here is an example of one that embellishes full_join
. If you set the option .merge = T
you'll get a variable similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.)
``` r function, evaluate = FALSE
full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
.merge = FALSE, ...){
# Checking to make sure used variable names are not already in use
if(".x_tracker" %in% names(x)){
message("Warning: variable .x_tracker in left data was dropped")
}
if(".y_tracker" %in% names(y)){
message("Warning: variable .y_tracker in right data was dropped")
}
if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
stop("Variable .merge already exists; change name before proceeding")
}
# Adding simple merge tracker variables to data frames
x[, ".x_tracker"] <- 1
y[, ".y_tracker"] <- 1
# Doing full join
joined <- full_join(x, y, by = by, suffix = suffix, ...)
# Calculating merge diagnoses
matched <- joined %>%
filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
NROW()
unmatched_x <- joined %>%
filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
NROW()
unmatched_y <- joined %>%
filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
NROW()
# Print merge diagnoses
message(
unmatched_x, " Rows ONLY from left data frame", "\n",
unmatched_y, " Rows ONLY from right data frame", "\n",
matched, " Rows matched"
)
# Create .merge variable if specified
if(.merge){
joined <- joined %>%
mutate(.merge =
case_when(
!is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
TRUE ~ "matched"
)
)
}
# Dropping tracker variables and returning data frame
joined <- joined %>%
select(-.x_tracker, -.y_tracker)
return(joined)
}
```
hello everyone! just a follow up, is anyone interested in coding that up one day? I think this would be a really useful feature
thanks!!
I'm still wondering how to implement this for SQL backends.
Hi @krlmlr , I dont know about SQL backends, but providing this info after a regular merge with a regular dataframe in RStudio
would be, for itself, a great first step. In my humble opinion, having control over the matching process is probably one of the most important things when analysing data.
Thanks!
@adamMaier: Actually, after looking more carefully at your implementation, I think it can be changed so that it also works for SQL backends. In some cases, computing number of matched or unmatched rows might benefit from a compute() call for SQL backends.
@hadley: Are tracked joins a good fit for dplyr, or perhaps tidyr? For data frames, we could certainly be more efficient than adding tracker columns.
@krlmlr it seems like a good idea to me.
Thanks, all. Glad to see you're considering this.
@krlmlr : Yes, you can certainly implement a more efficient approach than what I wrote with additional tracker columns. My approach was just a simple fill-in until I could figure out a smarter way to do this. (I'm coming from a longtime Stata background, so don't know much about SQL backends.) Happy to help in any way I can.
@adamMaier: Actually, I don't think it's much more than to use mutate()
instead of [<-()
, and perhaps count(joined, ...)
to compute matched, unmatched_x and unmatched_y.
Related: #1619.
Hello there @krlmlr @hadley ! I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on dplyr
? Cannot wait to see how my merge
are doing :D
I am proud that my feature request makes sense ;-) Do you guys have an idea when it will be available on
dplyr
? Cannot wait to see how my merges are doing :D
In the meantime, one approach is something like this:
library(dplyr, warn.conflicts = FALSE)
df_x <- memdb_frame(a = 1:2600L, b = rep(letters, 100))
df_y <- memdb_frame(a = 1501:2500L, c = rep(LETTERS[1:10], 100))
merged <- df_x %>%
mutate(on_x = 1L) %>%
left_join(
df_y %>%
mutate(on_y = 1L)) %>%
mutate(on_x = coalesce(on_x, 0L),
on_y = coalesce(on_y, 0L))
#> Joining, by = "a"
merged %>%
count(on_x, on_y)
#> Source: query [?? x 3]
#> Database: sqlite 3.11.1 [:memory:]
#> Groups: on_x
#>
#> on_x on_y n
#> <int> <int> <int>
#> 1 1 0 1600
#> 2 1 1 1000
Or, using PostgreSQL (I think this would work with tibbles too).
library(dplyr, warn.conflicts = FALSE)
pg <- src_postgres()
df_x <-
tibble(a = 1:2600L, b = rep(letters, 100))
df_x <- copy_to(pg, df_x)
df_y <- tibble(a = 2001:3000L, c = rep(LETTERS[1:10], 100))
df_y <- copy_to(pg, df_y)
merged <- df_x %>%
mutate(on_x = TRUE) %>%
full_join(
df_y %>%
mutate(on_y = TRUE)) %>%
mutate(on_x = coalesce(on_x, FALSE),
on_y = coalesce(on_y, FALSE))
#> Joining, by = "a"
merged %>%
count(on_x, on_y)
#> Source: query [?? x 3]
#> Database: postgres 9.6.2 [igow@hostname:5432/crsp]
#> Groups: on_x
#>
#> on_x on_y n
#> <lgl> <lgl> <dbl>
#> 1 FALSE TRUE 400
#> 2 TRUE FALSE 2000
#> 3 TRUE TRUE 600
@iangow @hadley
thanks for the suggestion. I think that works but, again, it is a pain to create fake vectors of 1s in each dataframe every time one does a merge. The fact that @hadley removed this from to_do means you are dropping this feature from future versions of dplyr
?
Thanks!
No, it just means it didn't make it for this release.
This would be a very useful feature indeed. Can't wait to see it implemented.
@DaniloImparato: I ended up putting a function to print diagnostics after implementing a dplyr join (as well as functions to print helpful information after other dplyr functions, like filter, mutate, and summarize) in a package, here. There is help documentation and a vignette in the package in case you're interested: reviewr vignette.pdf
I mostly created it for my own use and for data analysts with whom I work as I find having noisier functions helpful in the early data wrangling stages. I'm using it as a stand-in until something like this becomes a feature of dplyr
.
@adamMaier Sounds cool. I'll give it a try :)
For the past week I've been flirting with matthieugomez' statar::join as it additionally sports some nice key integrity checking beforehand. It bugs me that such features aren't prioritized.
Possibly of interest: https://github.com/elbersb/tidylog
And especially this issue: https://github.com/elbersb/tidylog/issues/25
Feedback welcome!
Given that tidylog has solved this problem so wonderfully, I no longer think this needs to be in dplyr — thanks @elbersb !
While I appreciate the ability of tidylog to give summary information about how a merge went, the original request was for an option that would create a new variable that tells you if a given row was merged. I think that would still be really useful! For instance, I might like to subsequently subset the data to only the rows that merged for some operations. I was looking to see if this feature existed and found this thread.
@JackLandry I thought the same thing. It might be worth asking @elbersb if the wrapper functions in tidylog
could be used to that end. The code above gives a stop-gap solution.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
Most helpful comment
Stata similarly creates a new variable
_merge
when doing any type of merge or join. It works nearly identically to thePandas
example above. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it, especially when doing basic data exploration and cleaning.I'd also suggest considering an option that simply prints the merge results after a merge is done. For example, ___ cases in the newly merged data are from only the left df; ___ cases are from only the right, and ___ matched. Often times I just want to see that and don't always need the actual variable
_merge
For the last few months I've been using basic functions I wrote that simply embellish the
dplyr
joins. There are probably more efficient ways of doing this, but here is an example of one that embellishesfull_join
. If you set the option.merge = T
you'll get a variable similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.)``` r function, evaluate = FALSE
full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
.merge = FALSE, ...){
}
```