Dplyr: Feature Request: Preferred Column Values After Merge

Created on 29 Dec 2017  路  7Comments  路  Source: tidyverse/dplyr

I often want to join two data.frames and then select the "best" result from the output columns.

This happens when I may have two sources of information with partially overlapping information. One source may be more reliable than the other, so I would prefer to use source 1 if it has a value. If source 1 doesn't have a value, I'd prefer to use the other source.

The function below does what I'm looking for, and I think it would fit in well in dplyr. If of interest, I can generate a pull request with this and tests.

#' Following a join, select non-missing values from one source
#' removing the other and combining duplicated column names.
#'
#' @param x The previously-joined data.frame (or similar)
#' @param suffix_order The column name suffixes to combine in preference
#'   order.  Values with the first suffix that are not missing will be
#'   kept.  Values missing with the first suffix will be replaced with
#'   values from the second suffix.
#' @param missing_values Either a vector or a list.  If a vector, all
#'   columns will be compared to the vector.  If a list, then column
#'   names will be matched to list names (and if not all column names.
#' @param duplicated_columns Column names (without the suffix) that are
#'   duplicated.  If not provided, they will be auto-detected.
#' @return A data.frame (or similar; input class is not changed) with
#'   \code{duplicated_columns} combined into one based on preference
#'   order.
#' @examples
#' # Missing values are substituted
#' full_join(data.frame(A=1, B=2, C=3),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer()
#'
#' # Non-missing values (C==7) are kept
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer()
#'
#' # Not all columns must be operated on
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(duplicated_columns="B")
#'
#' # Nonstandard missing values may be used
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=7)
#'
#' # Multiple missing values may be used
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=c(NA, 7))
#'
#' # Missing values can be specific to a column
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=list(B=NA))
#'
#' # Nonstandard missing values can be specific to a column
#' full_join(data.frame(A=c(1, 4), B=c(2, NA), C=c(3, 7)),
#'           data.frame(A=4, B=5, C=6), by="A") %>%
#'   join_prefer(missing_values=list(B=5))
#' @export
join_prefer <- function(x, suffix_order=c(".x", ".y"),
                        missing_values=NA,
                        duplicated_columns=NULL) {
  trim_suffix <- function(nm, suffix) {
    mask_match <- grepl(suffix, nm, fixed=TRUE)
    if (any(mask_match)) {
      gsub(suffix, "", nm[mask_match], fixed=TRUE)
    } else {
      character(0)
    }
  }
  if (is.null(duplicated_columns)) {
    if (is.list(missing_values)) {
      duplicated_columns <- names(missing_values)
    } else {
      trimmed_columns_1 <- trim_suffix(names(x), suffix_order[1])
      trimmed_columns_2 <- trim_suffix(names(x), suffix_order[2])
      duplicated_columns <- intersect(trimmed_columns_1, trimmed_columns_2)
    }
  }
  expected_columns <- c(paste0(duplicated_columns, suffix_order[1]),
                        paste0(duplicated_columns, suffix_order[2]))
  if (any(mask_missing_columns <- !(expected_columns %in% names(x)))) {
    stop("All duplicated_columns with suffix must be names of columns of x.  The following are not found: ",
         paste(expected_columns[mask_missing_columns], collapse=", "))
  }
  if (is.list(missing_values)) {
    if (!all(mask_not_found <- names(missing_values) %in% duplicated_columns)) {
      stop("Some names of missing_values are not duplicated_columns: ",
           paste(names(missing_values)[mask_not_found], collapse=", "))
    } else if (!all(mask_not_found <- duplicated_columns %in% names(missing_values))) {
      stop("Some duplicated_columns are not in the names of missing_values: ",
           paste(duplicated_columns[mask_not_found], collapse=", "))
    }
  }
  if (length(duplicated_columns)) {
    for (nm in duplicated_columns) {
      colname_1 <- paste0(nm, suffix_order[1])
      colname_2 <- paste0(nm, suffix_order[2])
      x[[nm]] <- x[[colname_1]]
      current_missing <-
        if (is.list(missing_values)) {
          missing_values[[nm]]
        } else {
          missing_values
        }
      mask_missing <- x[[colname_1]] %in% current_missing
      x[[nm]][mask_missing] <- x[[colname_2]][mask_missing]
      # Drop the duplicated column names
      x[[colname_1]] <- x[[colname_2]] <- NULL
    }
  } else {
    message("No duplicated columns found.")
  }
  x
}
feature verbs

Most helpful comment

The proposed operations are some join combined with a vectorized reduction operator, by default coalesce().

I'd like to keep an initial implementation as simple as possible, no "new" columns should be allowed on the RHS, and all observations from the LHS are kept. We may want to be able to specify which side takes precedence, this seems to give two separately useful operations (LHS wins: partial dictionary update, RHS wins: upsert). Would that solve your most common use cases?

A standalone verb that operates on an already expanded table might be better suited in tidyr. Can you outline a use case?

Some of this has been covered in more detail in https://github.com/tidyverse/tidyr/issues/183, and earlier in #2075. I now think performance considerations make it worthwhile to consider an implementation in dplyr, a tidyr interface might add more bells and whistles.

All 7 comments

Thanks. This looks like an useful feature, I would use it for applying patch tables: partial dictionaries that contain replacement values for only a subset of the data. Maybe we can create new verbs for this operation, because it only seems useful in full joins?

update_join <- function(x, y, by) ...
combine_join <- function(x, y, by) ...

(Not sure what a good naming choice would be here.)

The operations can be implemented much faster when combining values right away during the join. Also, SQL backends could do this very efficiently. We'd have to decide what to do with incompatible data types, I'd say we should be strict and throw an error. What do you think?

Perhaps stay near SQL naming and use the term coalesce()?

I'm glad that you like the idea, @krlmlr!

I see it as useful in many scenarios (not just in the context of a join). An example for me is that I often work with meta-analysis databases (data are extracted from many journal articles in the scientific literature), and I have similar sets of information from different articles which I'd want to choose the best from. For example, the number of individuals with a measurement may come from a column specific to "number of individuals with a measurement" or it may come from "number of individuals in the study". (And that sounds like your partial dictionary example, with the difference that it originates in a single data set rather than from a join.)

Within the context of joins, I see value in full_join and in a variant of semi_join where I would not want to expand x (as full_join may).

My reason for giving the example is: I like the function existing separately so that I can use it at times when I may not just be joining. But, I do often need it with a join, and I can see that it would be much faster performed during the join than post-processing when part of a join.

For naming... shrug I don't have a strong preference. Of the choices you list, I like "update" rather than "combine" because that sounds more like the operation occurring to me. (Edit just before sending, I like @JohnMount 's suggestion of coalesce.)

For incompatible data types there are two questions:

  • Incompatible in missing_values, which I think should be an error if given as a list and an attempt at coercion if given as a vector. My method was that if %in% could compare, that seemed sufficient.
  • Incompatible in the potentially duplicated columns, which I think makes sense to be an error (after trying some simple coercion).

The proposed operations are some join combined with a vectorized reduction operator, by default coalesce().

I'd like to keep an initial implementation as simple as possible, no "new" columns should be allowed on the RHS, and all observations from the LHS are kept. We may want to be able to specify which side takes precedence, this seems to give two separately useful operations (LHS wins: partial dictionary update, RHS wins: upsert). Would that solve your most common use cases?

A standalone verb that operates on an already expanded table might be better suited in tidyr. Can you outline a use case?

Some of this has been covered in more detail in https://github.com/tidyverse/tidyr/issues/183, and earlier in #2075. I now think performance considerations make it worthwhile to consider an implementation in dplyr, a tidyr interface might add more bells and whistles.

@krlmlr, I agree that a simpler version makes sense within dplyr and a more bells-and-whistles interface makes sense for tidyr. I'll add into the tidyr discussion. I'm not the best to help with a SQL-friendly implementation, unfortunately.

Sorry, I think this is out of scope for dplyr, and would be best in tidyr.

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Was this page helpful?
0 / 5 - 0 ratings