Consider the following df:
ID d1 d2
1 G G
2 A G
3 A A
4 G A
5 NA NA
6 G G
When uniting d1 and d2:
tidyr::unite(df, new, d1, d2, remove = FALSE, sep = "")
Row 5 gives NANA instead of the expected NA
ID new d1 d2
1 1 GG G G
2 2 AG A G
3 3 AA A A
4 4 GA G A
5 5 NANA <NA> <NA>
6 6 GG G G
unite() is just following the standard paste rules:
paste(NA, NA)
#> [1] "NA NA"
I was thinking of a pre-processing treatment similar to: with(df, ifelse(is.na(d1)|is.na(d2), NA, paste0(d1, d2))).
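A minimal sketch of that pre-processing idea in base R. The data.frame construction below is an assumption (it rebuilds the df from the top of the thread); only the ifelse() expression comes from the comment above.

```r
# Rebuild the example data (assumed layout: ID, d1, d2 as in the first post)
df <- data.frame(
  ID = 1:6,
  d1 = c("G", "A", "A", "G", NA, "G"),
  d2 = c("G", "G", "A", "A", NA, "G"),
  stringsAsFactors = FALSE
)

# NA if either input is NA, otherwise the plain concatenation
df$new <- with(df, ifelse(is.na(d1) | is.na(d2), NA_character_, paste0(d1, d2)))
df$new
```

This gives NA as soon as any input is NA, which is a different request from dropping the NAs and pasting whatever is left.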
I think you need a compelling argument as to why unite() should work differently to paste()
Well, I think unite() _should_ work like paste() but could maybe provide an additional argument to handle NAs, à la na.rm = TRUE
I think in some cases the omit NA option could be useful. My df has many columns that contain mostly NA, as a result of multiple rounds of joins.
recipe potato tomato cucumber rock
A potato NA cucumber NA
B NA NA NA rock
C NA tomato NA NA
...
So I was trying to combine the columns into one and remove the NA to see things better.
recipe ingredients
A potato,cucumber
B rock
C tomato
...
The solution is not hard, just not quite as tidy.
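For reference, one such not-quite-tidy solution might look like the base R sketch below. The data frame is reconstructed from the table above, and the name ingredient_cols is hypothetical.

```r
# Recipe table rebuilt from the example above (assumed to be character columns)
df <- data.frame(
  recipe   = c("A", "B", "C"),
  potato   = c("potato", NA, NA),
  tomato   = c(NA, NA, "tomato"),
  cucumber = c("cucumber", NA, NA),
  rock     = c(NA, "rock", NA),
  stringsAsFactors = FALSE
)

# Every column except the key is treated as an ingredient column
ingredient_cols <- setdiff(names(df), "recipe")

# For each row, keep the non-NA entries and collapse them with a comma
df$ingredients <- apply(df[ingredient_cols], 1, function(row) {
  paste(row[!is.na(row)], collapse = ",")
})

result <- df[c("recipe", "ingredients")]
result
```

A row where every ingredient is NA would come out as an empty string here, not NA.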
I just ran into this issue and also suggest adding an option to handle NA in unite. In fact, I'd suggest that the following expressions (though perhaps with an extra param to omit NAs in unite) should produce output identical to its input:
df <- data.frame(x = c("a", "a b", "a b c", NA))
df
x
1 a
2 a b
3 a b c
4 <NA>
df %>% separate(x, c("a", "b"), extra = "merge") %>% unite(x, a, b, sep=" ")
x
1 a NA
2 a b
3 a b c
4 NA NA
Warning message:
Too few values at 1 locations: 1
In other words, if separate and unite are complements one should be able to use one of them to reverse the operation of the other.
This is not the requested solution, but a clean way to get the desired result is:
library(tidyverse)
df <- tribble(
~ID, ~d1, ~d2,
1, "G", "G",
2, "A", "G",
3, "A", "A",
4, "G", "A",
5, NA, NA,
6, "G", "G")
df %>%
replace_na(list(d1 = "", d2 = "")) %>%
unite(new, d1, d2, remove = FALSE, sep = "")
#> # A tibble: 6 × 4
#> ID new d1 d2
#> * <dbl> <chr> <chr> <chr>
#> 1 1 GG G G
#> 2 2 AG A G
#> 3 3 AA A A
#> 4 4 GA G A
#> 5 5
#> 6 6 GG G G
I would also suggest adding an na.rm = TRUE option to handle NAs. Although @jennybc's alternative solution works for this particular problem, it leaves blanks and stray separators in the output when the separator is not "".
My problem is the same as @danrlu's. Is there a neater way to ignore the NAs? Currently I just unite all the columns and then use str_replace_all to replace NAs and adjacent separators with empty strings.
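A rough base R sketch of that unite-then-clean-up workaround, using gsub() in place of str_replace_all. The helper name drop_na_tokens is hypothetical. It assumes that no cell contains the literal string "NA" and that the separator has no regex metacharacters.

```r
# Remove "NA" tokens (and the separator in front of them) from already-pasted strings.
# Assumption: sep is used unescaped inside the regex, so it must be a plain character.
drop_na_tokens <- function(x, sep = "_") {
  # Match an NA token that fills a whole field: preceded by start or sep,
  # followed by sep or end of string
  pattern <- paste0("(^|", sep, ")NA(?=", sep, "|$)")
  x <- gsub(pattern, "", x, perl = TRUE)
  # A leading NA leaves a separator at the start; strip it
  sub(paste0("^", sep), "", x)
}

cleaned <- drop_na_tokens(c("a_b", "a_NA", "NA_b", "NA_NA", "a_NA_c"))
cleaned
```

Because it works on the pasted strings, this approach cannot tell a genuine "NA" value from a missing one, which is part of why an argument on unite() itself is attractive.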
Here is a generalization of unite that allows you to create a new column from an arbitrary function applied to columns selected as you would with unite. The function needs to return a character vector because it relies on pmap_chr, but swap in pmap_* to taste.
```{r}
library(tidyverse)
df <- tribble(
~ID, ~d1, ~d2, ~d3,
1, "G", "G", "C",
2, NA, "G", "T",
3, "A", NA, "G",
4, "G", "A", "A",
5, NA, NA, NA,
6, "G", "G", "G")
combine <- function(df, col, ..., fun, remove = TRUE) {
to_merge <- quos(...)
new_col <- quo_name(enquo(col))
merge_cols <- map_chr(to_merge, quo_name)
df <- mutate(df, !!new_col := pmap_chr(list(!!!to_merge), fun))
if (remove) df <- select(df, -one_of(merge_cols))
df
}
combine(df, new, d1, d2, d3, fun = paste0)
```
I'm not a huge fan of this interface, but I have found it more useful than unite a couple of times. I've also found myself doing stupid things with this function and feel like it's some sort of anti-pattern.
If there are any changes coming to the unite interface, I'd enjoy something in the vein of facilitating vectorized operations for non-vectorized functions across dataframe columns.
I'm not convinced that unite _should_ work like paste, as it's a very rare situation when a user would actually want to turn NA values into strings. More concerningly, in terms of API consistency separate will introduce NAs in a way that unite can't reverse:
library(tidyr)
example <- tibble::data_frame(x = c('foo', 'foo bar', 'foo bar baz'))
example %>% separate(x, c('foo', 'bar', 'baz'), fill = 'right') # without `fill = 'right'` same result with a message
#> # A tibble: 3 x 3
#> foo bar baz
#> * <chr> <chr> <chr>
#> 1 foo <NA> <NA>
#> 2 foo bar <NA>
#> 3 foo bar baz
example %>%
separate(x, c('foo', 'bar', 'baz'), fill = 'right') %>%
unite(x, foo:baz, sep = ' ')
#> # A tibble: 3 x 1
#> x
#> * <chr>
#> 1 foo NA NA
#> 2 foo bar NA
#> 3 foo bar baz
If NAs are in the middle of columns that get united and then separated then paste-like behavior would allow the NA location to be saved (at the cost of requiring them to be converted from strings to actual NA again), but most of the time the NA handling keeps the functions from being inverses. Making na.rm = TRUE the default would be a breaking change, but probably not one that would break much code.
There are actually two feature requests in this thread:
1. If any input is NA, then the output is NA.
2. Remove NAs.
Option 2 seems like the more useful option, so I will implement that.
@alexpghayes the plan is to extract out a general helper for turning the vectorised functions that power many tidyr functions into a tibblicious version
It'll be way easier (and faster) if this can be implemented at the stringi level, so I'm going to put this aside until https://github.com/gagolews/stringi/issues/289 is resolved.
https://github.com/gagolews/stringi/issues/289 is closed :)
@moodymudskipper it's closed but not implemented in stri_paste().
Minimal reprex
library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
unite(df, z, c("x", "y"), remove = FALSE)
#> # A tibble: 4 x 3
#> z x y
#> <chr> <chr> <chr>
#> 1 a_b a b
#> 2 a_NA a <NA>
#> 3 NA_b <NA> b
#> 4 NA_NA <NA> <NA>
Created on 2019-03-07 by the reprex package (v0.2.1.9000)
Note that you'll need na.rm = TRUE (I left the default as is to preserve backward compatibility since it seems like many people have probably worked around the previous behaviour in various ways).
library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
#> # A tibble: 4 x 3
#> z x y
#> <chr> <chr> <chr>
#> 1 a_b a b
#> 2 a a <NA>
#> 3 b <NA> b
#> 4 "" <NA> <NA>
Created on 2019-03-07 by the reprex package (v0.2.1.9000)
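With na.rm = TRUE available, the separate()/unite() round trip raised earlier in the thread can be checked directly. This is a sketch that assumes a tidyr version whose unite() has the na.rm argument (1.0.0 or later); the object names pieces and roundtrip are illustrative.

```r
library(tidyr)  # assumes unite() supports na.rm (tidyr 1.0.0 or later)

example <- data.frame(
  x = c("foo", "foo bar", "foo bar baz"),
  stringsAsFactors = FALSE
)

# Split into three columns, padding short rows with NA on the right
pieces <- separate(example, x, c("foo", "bar", "baz"), fill = "right")

# Paste the columns back together, dropping the padding NAs
roundtrip <- unite(pieces, x, foo:baz, sep = " ", na.rm = TRUE)

identical(roundtrip$x, example$x)
```

The round trip only holds when the NAs are trailing padding. An NA in a middle column is dropped as well, so its position is lost on the way back.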
Hi @hadley ,
I am having trouble getting na.rm = TRUE to work within the unite() function.
I tried the following:
> library("tidyr")
> df <- expand.grid(x = c("a", NA), y = c("b", NA))
> df
x y
1 a b
2 <NA> b
3 a <NA>
4 <NA> <NA>
> df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace
Which gives me this error:
Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace
Backtracing error:
> rlang::last_error()
<error>
message: `TRUE` must evaluate to column positions or names, not a logical vector
class: `rlang_error`
backtrace:
1. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
10. tidyselect::vars_select(colnames(data), ...)
11. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }")
12. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
13. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
Call `rlang::last_trace()` to see the full backtrace
> rlang::last_trace()
x
1. \-df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
2. +-base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
4. \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
5. \-global::`_fseq`(`_lhs`)
6. \-magrittr::freduce(value, `_function_list`)
7. +-base::withVisible(function_list[[k]](value))
8. \-function_list[[k]](value)
9. +-tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
10. \-tidyr:::unite.data.frame(., "z", x:y, na.rm = TRUE, remove = FALSE)
11. \-tidyselect::vars_select(colnames(data), ...)
12. \-tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }")
13. \-tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
@kasperav you probably have not installed the development version of tidyr.
@hadley you are right! I have no luck with installing the dev version, so I'll wait for this to be implemented in a CRAN version of tidyr :)
FWIW, I found the behavior where unite takes two NA values and produces an empty string to be very confusing and unexpected. Seems clear to me that uniting two NA values should produce an NA value.
I'm guessing this is clearer to people who have used paste a lot :) Simple to fix up with a na_if("") (but one has to hope that empty string wasn't a meaningful value distinct from NA_character_ in the original columns!)
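A sketch of that na_if("") fix-up, assuming dplyr is installed alongside a tidyr version that supports na.rm in unite():

```r
library(tidyr)
library(dplyr)

df <- expand_grid(x = c("a", NA), y = c("b", NA))

# unite() with na.rm = TRUE turns an all-NA row into an empty string
out <- unite(df, "z", x:y, na.rm = TRUE, remove = FALSE)

# Convert those empty strings back to NA
out <- mutate(out, z = na_if(z, ""))
out$z
```

As noted above, this also turns any empty string that was genuinely present in the data into NA.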
I have a use case where I need to use na.rm = TRUE and unite for 8 columns. One of the columns is all NA. Using na.rm = T with unite seems to have different behavior when one of the columns is all NA. Is that expected behavior? Should I just ignore columns that are all NA before using unite?
library(tidyr)
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
# A tibble: 4 x 3
z x y
<chr> <chr> <lgl>
1 a_NA a NA
2 a_NA a NA
3 NA NA NA
4 NA NA NA
What version are you using? That's not the result I get (on 1.0.2.9000)
suppressPackageStartupMessages(require(tidyverse))
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = T, remove = FALSE)
#> # A tibble: 4 x 3
#> z x y
#> <chr> <chr> <lgl>
#> 1 "a" a NA
#> 2 "a" a NA
#> 3 "" <NA> NA
#> 4 "" <NA> NA
Created on 2020-02-25 by the reprex package (v0.3.0)
I am using a newer version.
packageVersion("tidyverse")
[1] ‘1.3.0’
tidyverse is different from tidyr; it is a collection of other packages put together for easy loading. So it will have a different version than all the packages within it. Check your tidyr version.
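A small sketch for checking the versions of the individual packages in one go. The helper name pkg_versions is hypothetical; it relies only on utils::packageVersion() and requireNamespace().

```r
# Report the installed version of each package, or NA if it is not installed
pkg_versions <- function(pkgs) {
  vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
}

pkg_versions(c("tidyr", "tidyselect"))
```

Checking the dependencies alongside tidyr itself is useful because a package can be current while something it relies on is not.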
Oh, sorry I saw that you were loading tidyverse so I assumed that was the version you were referring to. I always assumed that updating tidyverse would update the packages within it so I normally just update that one. I guess that is an inappropriate assumption!
Even with updating tidyr using the GitHub version, I still have that issue. Maybe it is another out-of-date package?
packageVersion("tidyr")
[1] ‘1.0.2.9000’
> library(tidyr)
> df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
> df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
# A tibble: 4 x 3
z x y
<chr> <chr> <lgl>
1 a_NA a NA
2 a_NA a NA
3 NA NA NA
4 NA NA NA
Interesting. I'm not sure why we are getting different results.
Regardless, it looks to me as if your NAs aren't being removed despite na.rm = TRUE.
Yes, I would try updating your other packages and see if that solves it. But since both expand_grid and unite are from tidyr I'm not sure why that would be the case.
It appears that my version of tidyselect was quite out-of-date (<1.0). I updated that and now it is functioning as expected.
packageVersion("tidyr")
[1] ‘1.0.2.9000’
packageVersion("tidyselect")
[1] ‘1.0.0’
library(tidyr)
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
# A tibble: 4 x 3
z x y
<chr> <chr> <lgl>
1 "a" a NA
2 "a" a NA
3 "" NA NA
4 "" NA NA
Hello,
I've updated to all the latest versions of the packages (tidyr 1.0.2.900, tidyselect 1.0.0) and I'm still getting the same error. I tried Lindsay's df_notwork, and get the same result as she had prior to the updates. Any help would be appreciated!
@anjaollodart - perhaps you can try updating additional packages that tidyr depends on. It's just a guess, but the need to separately update tidyselect from tidyr was surprising to me, so maybe there is another package dependency that has the same issue.
Dear Lindsay,
I solved this issue on my own when I used my own data frame (not one that
was in the example). And as soon as I did it, in 10-15 minutes, I deleted
the comment because the issue was not about this function.
It is strange that GitHub still put this comment through.
Thank you,
Julia