Tidyr: NA handling in unite

Created on 13 Jun 2016 · 30 Comments · Source: tidyverse/tidyr

Consider the following df:

ID   d1   d2   
1    G    G
2    A    G
3    A    A
4    G    A
5    NA   NA
6    G    G

When uniting d1 and d2:

tidyr::unite(df, new, d1, d2, remove = FALSE, sep = "")

Row 5 gives NANA instead of the expected NA

  ID  new   d1   d2
1  1   GG    G    G
2  2   AG    A    G
3  3   AA    A    A
4  4   GA    G    A
5  5 NANA <NA> <NA>
6  6   GG    G    G

feature strings

voxnonecho

👍8

All 30 comments

unite() is just following the standard paste rules:

paste(NA, NA)
#> [1] "NA NA"

hadley on 13 Jun 2016

I was thinking of a pre-processing treatment similar to: with(df, ifelse(is.na(d1) | is.na(d2), NA, paste0(d1, d2))).

voxnonecho on 13 Jun 2016

I think you need a compelling argument as to why unite() should work differently to paste()

hadley on 13 Jun 2016

Well, I think unite() _should_ work like paste() but could maybe provide an additional argument to handle NAs, à la na.rm = TRUE

voxnonecho on 13 Jun 2016

👍19

I think in some cases the omit NA option could be useful. My df has many columns that contain mostly NA, as a result of multiple rounds of join.

recipe  potato  tomato  cucumber    rock
A       potato  NA      cucumber    NA
B       NA      NA      NA          rock
C       NA      tomato  NA          NA
...

So I was trying to combine the columns into one and remove the NA to see things better.

recipe  ingredients
A       potato,cucumber
B       rock
C       tomato
...

The solution is not hard, just not quite as tidy.
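
A minimal base R sketch of one such solution, assuming the wide table above is held in a data frame called `recipes` (a hypothetical name) whose ingredient columns are character vectors. It drops the NAs in each row before collapsing what is left with a comma:

```r
# Hypothetical reconstruction of the wide table shown above.
recipes <- data.frame(
  recipe   = c("A", "B", "C"),
  potato   = c("potato", NA, NA),
  tomato   = c(NA, NA, "tomato"),
  cucumber = c("cucumber", NA, NA),
  rock     = c(NA, "rock", NA),
  stringsAsFactors = FALSE
)

# Every column except the key is treated as an ingredient column.
ingredient_cols <- setdiff(names(recipes), "recipe")

# For each row, keep the non-missing values and join them with a comma.
recipes$ingredients <- unname(apply(
  recipes[ingredient_cols], 1,
  function(row) paste(row[!is.na(row)], collapse = ",")
))

result <- recipes[c("recipe", "ingredients")]
print(result)
```

A row in which every ingredient is NA would yield an empty string rather than NA with this approach.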

danrlu on 9 Sep 2016

👍10

I just ran into this issue and also suggest adding an option to handle NA in unite. In fact, I'd suggest that the following expressions (though perhaps with an extra param to omit NAs in unite) should produce output identical to its input:

> df <- data.frame(x = c("a", "a b", "a b c", NA))
> df
      x
1     a
2   a b
3 a b c
4  <NA>
> df %>% separate(x, c("a", "b"), extra = "merge") %>% unite(x, a, b, sep=" ")
      x
1  a NA
2   a b
3 a b c
4 NA NA
Warning message:
Too few values at 1 locations: 1

In other words, if separate and unite are complements one should be able to use one of them to reverse the operation of the other.

tjohnson250 on 26 Oct 2016

👍5

This is not the requested solution, but a clean way to get the desired result is:

library(tidyverse)
df <- tribble(
  ~ID, ~d1, ~d2,   
    1, "G", "G",
    2, "A", "G",
    3, "A", "A",
    4, "G", "A",
    5,  NA,  NA,
    6, "G", "G")
df %>% 
  replace_na(list(d1 = "", d2 = "")) %>% 
  unite(new, d1, d2, remove = FALSE, sep = "")
#> # A tibble: 6 × 4
#>      ID   new    d1    d2
#> * <dbl> <chr> <chr> <chr>
#> 1     1    GG     G     G
#> 2     2    AG     A     G
#> 3     3    AA     A     A
#> 4     4    GA     G     A
#> 5     5                  
#> 6     6    GG     G     G

jennybc on 27 Oct 2016

👍16 ❤5

I would also suggest adding an na.rm = TRUE option to handle NAs. Although @jennybc's alternative solution works for this particular problem, it leaves blanks and stray separators when the separator is not "".

My problem is the same as @danrlu's. Is there a better, neater way to ignore the NAs? Currently I just unite all columns and then use str_replace_all to replace NAs and adjacent separators with empty strings.
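
To illustrate the stray-separator problem and the clean-up workaround described here, the following base R sketch uses made-up vectors `x` and `y`, and `gsub()` in place of `str_replace_all()`. It assumes the separator never occurs inside the values themselves, otherwise the clean-up would alter real data:

```r
# Illustrative inputs (not from the thread).
x <- c("a", "a", NA, NA)
y <- c("b", NA, "b", NA)

# Replacing NA with "" before pasting leaves dangling separators.
x_filled <- ifelse(is.na(x), "", x)
y_filled <- ifelse(is.na(y), "", y)
joined <- paste(x_filled, y_filled, sep = "_")

# Clean-up: collapse runs of separators, then trim leading/trailing ones.
cleaned <- gsub("^_|_$", "", gsub("_+", "_", joined))

print(joined)
print(cleaned)
```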

e-clin on 20 Jun 2017

👍7

Here is a generalization of unite that allows you to create a new column from an arbitrary function applied to columns selected as you would with unite. The function needs to return a character vector because it relies on pmap_chr, but swap in pmap_* to taste.

```{r}
library(tidyverse)

df <- tribble(
  ~ID, ~d1, ~d2, ~d3,
    1, "G", "G", "C",
    2,  NA, "G", "T",
    3, "A",  NA, "G",
    4, "G", "A", "A",
    5,  NA,  NA,  NA,
    6, "G", "G", "G")

#' Semi-general \code{unite} to vectorize a function across columns of dataframe
#'
#' Accepts columns from a dataframe and vectorizes/parallel maps a function
#' across them, returning the result in a new column. Function must return a
#' character vector because \code{purrr::pmap_chr} enforces type-safety.
#'
#' @param df A dataframe
#' @param col Bare (unquoted) name of results column
#' @param ... Bare (unquoted) names of argument columns
#' @param fun A function that accepts as many arguments as provided argument
#'   columns. Gets passed to \code{purrr::pmap_chr} so formula-style lambda
#'   specification also works.
#' @param remove Whether or not to remove the argument columns (defaults
#'   to \code{TRUE})
#'
#' @return Dataframe with new column generated by applying \code{fun} to
#'   argument columns element-wise
combine <- function(df, col, ..., fun, remove = TRUE) {
  to_merge <- quos(...)
  new_col <- quo_name(enquo(col))
  merge_cols <- map_chr(to_merge, quo_name)

  df <- mutate(df, !!new_col := pmap_chr(list(!!!to_merge), fun))
  if (remove) df <- select(df, -one_of(merge_cols))
  df
}

combine(df, new, d1, d2, d3, fun = paste0)
#> # A tibble: 6 x 2
#>      ID    new
#>   <dbl>  <chr>
#> 1     1    GGC
#> 2     2   NAGT
#> 3     3   ANAG
#> 4     4    GAA
#> 5     5 NANANA
#> 6     6    GGG
```

I'm not a huge fan of this interface, but I have found it more useful than unite a couple of times. I've also found myself doing stupid things with this function and feel like it's some sort of anti-pattern.

If there are any changes coming to the unite interface, I'd enjoy something in the vein of facilitating vectorized operations for non-vectorized functions across dataframe columns.

alexpghayes on 23 Jun 2017

I'm not convinced that unite _should_ work like paste, as it's a very rare situation when a user would actually want to turn NA values into strings. More concerningly, in terms of API consistency, separate will introduce NAs in a way that unite can't reverse:

library(tidyr)

example <- tibble::data_frame(x = c('foo', 'foo bar', 'foo bar baz'))

example %>% separate(x, c('foo', 'bar', 'baz'), fill = 'right')    # without `fill = 'right'` same result with a message 
#> # A tibble: 3 x 3
#>     foo   bar   baz
#> * <chr> <chr> <chr>
#> 1   foo  <NA>  <NA>
#> 2   foo   bar  <NA>
#> 3   foo   bar   baz

example %>% 
    separate(x, c('foo', 'bar', 'baz'), fill = 'right') %>% 
    unite(x, foo:baz, sep = ' ')
#> # A tibble: 3 x 1
#>             x
#> *       <chr>
#> 1   foo NA NA
#> 2  foo bar NA
#> 3 foo bar baz

If NAs are in the middle of columns that get united and then separated then paste-like behavior would allow the NA location to be saved (at the cost of requiring them to be converted from strings to actual NA again), but most of the time the NA handling keeps the functions from being inverses. Making na.rm = TRUE the default would be a breaking change, but probably not one that would break much code.
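
For comparison, here is a sketch of the same round trip, assuming a tidyr release that includes the `na.rm` argument to `unite()` introduced later in this thread. Under that assumption the trailing NAs are dropped and the original strings should come back unchanged:

```r
library(tidyr)

example <- tibble::tibble(x = c("foo", "foo bar", "foo bar baz"))

# Split on spaces, padding short rows with NA on the right.
separated <- separate(example, x, c("foo", "bar", "baz"),
                      sep = " ", fill = "right")

# Re-join, dropping NAs instead of pasting them as the string "NA".
reunited <- unite(separated, x, foo:baz, sep = " ", na.rm = TRUE)

# Check whether the round trip reproduced the input column.
print(identical(reunited$x, example$x))
```

As the comment above notes, this only inverts `separate()` when the NAs are trailing; an NA in the middle of a row would lose its position.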

alistaire47 on 11 Sep 2017

👍12

There are actually two feature requests in this thread:

1. Make NAs infectious, so that if any input is NA, the output is NA.
2. Provide an easy way to drop NAs.

Option 2 seems like the more useful one, so I will implement that.
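
For readers who want the "infectious" behaviour of the first request, a minimal base R sketch follows. It is not part of tidyr; the data frame `df` and the vector `cols` are illustrative names, and the idea is simply to paste everything and then blank out any row that had a missing input:

```r
# Illustrative data covering all four NA combinations.
df <- data.frame(
  x = c("a", "a", NA, NA),
  y = c("b", NA, "b", NA),
  stringsAsFactors = FALSE
)
cols <- c("x", "y")

# Paste the selected columns row-wise, as unite() does by default.
pasted <- do.call(paste, c(df[cols], sep = "_"))

# Make NA infectious: any row with a missing input becomes NA.
pasted[!complete.cases(df[cols])] <- NA_character_

print(pasted)
```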

@alexpghayes the plan is to extract a general helper for turning the vectorised functions that power many tidyr functions into a tibblicious version

hadley on 16 Nov 2017

👍4

It'll be way easier (and faster) if this can be implemented at the stringi level, so I'm going to put this aside until https://github.com/gagolews/stringi/issues/289 is resolved.

hadley on 16 Nov 2017

👍1

https://github.com/gagolews/stringi/issues/289 is closed :)

moodymudskipper on 10 Oct 2018

@moodymudskipper it's closed but not implemented in stri_paste().

hadley on 7 Mar 2019

Minimal reprex

library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
unite(df, z, c("x", "y"), remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <chr>
#> 1 a_b   a     b    
#> 2 a_NA  a     <NA> 
#> 3 NA_b  <NA>  b    
#> 4 NA_NA <NA>  <NA>

Created on 2019-03-07 by the reprex package (v0.2.1.9000)

hadley on 7 Mar 2019

Note that you'll need na.rm = TRUE (I left the default as is to preserve backward compatibility, since it seems like many people have probably worked around the previous behaviour in various ways):

library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <chr>
#> 1 a_b   a     b    
#> 2 a     a     <NA> 
#> 3 b     <NA>  b    
#> 4 ""    <NA>  <NA>

Created on 2019-03-07 by the reprex package (v0.2.1.9000)

hadley on 7 Mar 2019

🎉8

Hi @hadley ,

I am having trouble getting na.rm = TRUE to work within the unite() function.

I tried the following:

1. Update R from 3.5.1 to 3.5.3
2. Delete the old tidyverse and tidyr packages
3. Install a fresh tidyverse package
4. Run the following code:

> library("tidyr")
> df <- expand.grid(x = c("a", NA), y = c("b", NA))
> df
     x    y
1    a    b
2 <NA>    b
3    a <NA>
4 <NA> <NA>
> df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace

Which gives me this error:

Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace

Backtracing error:

> rlang::last_error()
<error>
message: `TRUE` must evaluate to column positions or names, not a logical vector
class:   `rlang_error`
backtrace:
  1. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
 10. tidyselect::vars_select(colnames(data), ...)
 11. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n       not { first_type }")
 12. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
 13. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
Call `rlang::last_trace()` to see the full backtrace

> rlang::last_trace()
     x
  1. \-df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
  2.   +-base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  3.   \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
  4.     \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
  5.       \-global::`_fseq`(`_lhs`)
  6.         \-magrittr::freduce(value, `_function_list`)
  7.           +-base::withVisible(function_list[[k]](value))
  8.           \-function_list[[k]](value)
  9.             +-tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
 10.             \-tidyr:::unite.data.frame(., "z", x:y, na.rm = TRUE, remove = FALSE)
 11.               \-tidyselect::vars_select(colnames(data), ...)
 12.                 \-tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n       not { first_type }")
 13.                   \-tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)

kasperav on 28 Mar 2019

@kasperav you probably have not installed the development version of tidyr.
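
One quick way to check this is to look for `na.rm` among the formal arguments of the installed `unite()`. This is only a sketch: when the argument is absent, `na.rm = TRUE` is presumably swallowed by `...` and treated as a column selection, which would be consistent with the error shown above.

```r
# Check whether the installed tidyr exposes `na.rm` on unite().
if (requireNamespace("tidyr", quietly = TRUE)) {
  has_na_rm <- "na.rm" %in% names(formals(tidyr::unite))
  message("tidyr ", as.character(utils::packageVersion("tidyr")),
          if (has_na_rm) " supports" else " does not support",
          " unite(na.rm = )")
} else {
  # tidyr is not installed at all, so the answer is unknown.
  has_na_rm <- NA
  message("tidyr is not installed")
}
```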

hadley on 28 Mar 2019

👍1

@hadley you are right! I have no luck with installing the dev version, so I'll wait for this to be implemented in a CRAN version of tidyr :)

kasperav on 28 Mar 2019

FWIW, I found the behavior where unite takes two NA values and produces an empty string to be very confusing and unexpected. Seems clear to me that uniting two NA values should produce an NA value.

I'm guessing this is clearer to people who have used paste a lot :) Simple to fix up with a na_if("") (but one has to hope that the empty string wasn't a meaningful value distinct from NA_character_ in the original columns!)
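
A sketch of that fix-up, reusing the data from the reprex earlier in the thread and assuming a tidyr release with `na.rm` support. Base R subsetting stands in for `na_if("")` here so that only tidyr is needed:

```r
library(tidyr)

df <- expand_grid(x = c("a", NA), y = c("b", NA))

# Drop NAs while uniting; the all-NA row becomes an empty string.
out <- unite(df, "z", x:y, na.rm = TRUE, remove = FALSE)

# Turn the empty strings back into missing values.
out$z[out$z == ""] <- NA_character_

print(out)
```

The caveat in the comment still applies: any empty string that was a real value in the inputs also ends up as NA.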

jameshowison on 27 Dec 2019

👍1

I have a use case where I need to use na.rm = TRUE and unite for 8 columns. One of the columns is all NA. Using na.rm = T with unite seems to have different behavior when one of the columns is all NA. Is that expected behavior? Should I just ignore columns that are all NA before using unite?

library(tidyr)
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)

# A tibble: 4 x 3
  z     x     y    
  <chr> <chr> <lgl>
1 a_NA  a     NA   
2 a_NA  a     NA   
3 NA    NA    NA   
4 NA    NA    NA

lindsayplatt on 25 Feb 2020

What version are you using? That's not the result I get (on 1.0.2.9000)

suppressPackageStartupMessages(require(tidyverse))
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = T, remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <lgl>
#> 1 "a"   a     NA   
#> 2 "a"   a     NA   
#> 3 ""    <NA>  NA   
#> 4 ""    <NA>  NA

Created on 2020-02-25 by the reprex package (v0.3.0)

jzadra on 25 Feb 2020

I am using a newer version.

packageVersion("tidyverse")
[1] ‘1.3.0’

lindsayplatt on 25 Feb 2020

tidyverse is different from tidyr; it is a collection of other packages put together for easy loading. So it will have a different version than all the packages within it. Check your tidyr version.
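
A small sketch for checking the individual package versions side by side. The package list is just the set discussed in this thread, and packages that are not installed are reported as NA:

```r
# Packages whose versions matter for this discussion.
pkgs <- c("tidyverse", "tidyr", "tidyselect")

# Look up each installed version; NA when a package is missing.
versions <- vapply(pkgs, function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}, character(1))

print(versions)
```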

jzadra on 25 Feb 2020

Oh, sorry I saw that you were loading tidyverse so I assumed that was the version you were referring to. I always assumed that updating tidyverse would update the packages within it so I normally just update that one. I guess that is an inappropriate assumption!

Even with updating tidyr using the GitHub version, I still have that issue. Maybe it is another out-of-date package?

packageVersion("tidyr")
[1] ‘1.0.2.9000’
> library(tidyr)
> df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
> df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
# A tibble: 4 x 3
  z     x     y    
  <chr> <chr> <lgl>
1 a_NA  a     NA   
2 a_NA  a     NA   
3 NA    NA    NA   
4 NA    NA    NA

lindsayplatt on 25 Feb 2020

Interesting. I'm not sure why we are getting different results.

Regardless, it looks to me as if your NAs aren't being removed despite na.rm = TRUE.

Yes, I would try updating your other packages and see if that solves it. But since both expand_grid and unite are from tidyr, I'm not sure why that would be the case.

jzadra on 25 Feb 2020

It appears that my version of tidyselect was quite out-of-date (<1.0). I updated that and now it is functioning as expected.

packageVersion("tidyr")
[1] ‘1.0.2.9000’

packageVersion("tidyselect")
[1] ‘1.0.0’

library(tidyr)
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)

# A tibble: 4 x 3
  z     x     y    
  <chr> <chr> <lgl>
1 "a"   a     NA   
2 "a"   a     NA   
3 ""    NA    NA   
4 ""    NA    NA

lindsayplatt on 25 Feb 2020

Hello,

I've updated to the latest versions of all the packages (tidyr 1.0.2.900, tidyselect 1.0.0) and I'm still getting the same error. I tried Lindsay's df_notwork and get the same result she had prior to the updates. Any help would be appreciated!

anjaollodart on 1 Apr 2020

@anjaollodart - perhaps you can try updating additional packages that tidyr depends on. It's just a guess, but the need to separately update tidyselect from tidyr was surprising to me, so maybe there is another package dependency that has the same issue.

lindsayplatt on 2 Apr 2020

Dear Lindsay,

I solved this issue on my own when I used my own data frame (not one that
was in the example). And as soon as I did it, in 10-15 minutes, I deleted
the comment because the issue was not about this function.
It is strange that GitHub still put this comment through.

Thank you,

Julia
