Tidyr: unite(na.rm = TRUE) only removes character NAs

Created on 27 Sep 2019 · 4 Comments · Source: tidyverse/tidyr

The NA values from all-NA columns get concatenated in as "NA" strings, as if na.rm = FALSE. If this is intended, then that's not documented.

library(tidyverse)

data <- tribble(
  ~Name,    ~Postalcode, ~Parent,  ~Parent2, ~Parent3, 
  "Paul",   "4732",      "Mother", NA,       NA,       
  "Edward", "9045",      NA,       NA,       NA, 
  "Mary",   "3476",      "Mother", NA,       NA,       
  NA,       NA,          NA,       NA,       NA,      
  NA,       "2468",      NA,       NA,       NA
)

# The NAs from both Parent2 and Parent3 are pasted in as strings, while the NAs
# from Parent are properly removed
data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full 
#>   <chr>  <chr>      <chr>       
#> 1 Paul   4732       Mother|NA|NA
#> 2 Edward 9045       NA|NA       
#> 3 Mary   3476       Mother|NA|NA
#> 4 <NA>   <NA>       NA|NA       
#> 5 <NA>   2468       NA|NA

# Add a value anywhere in Parent3, and all its NAs get removed, but Parent2's
# NAs are still pasted into the middle
data[[2, "Parent3"]] <- "Uncle"
data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full
#>   <chr>  <chr>      <chr>      
#> 1 Paul   4732       Mother|NA  
#> 2 Edward 9045       NA|Uncle   
#> 3 Mary   3476       Mother|NA  
#> 4 <NA>   <NA>       NA         
#> 5 <NA>   2468       NA

# Add a value to Parent2, and now there are no all-NA columns, so no NAs are
# pasted in (also, concatenating all-missing values results in "" instead of an NA)
data[[1, "Parent2"]] <- "Aunt"
data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full
#>   <chr>  <chr>      <chr>      
#> 1 Paul   4732       Mother|Aunt
#> 2 Edward 9045       Uncle      
#> 3 Mary   3476       Mother     
#> 4 <NA>   <NA>       ""         
#> 5 <NA>   2468       ""

Created on 2019-09-28 by the reprex package (v0.3.0)

Labels: bug, strings

Most helpful comment

TL;DR: This occurs for all logical columns. When all values in a column are NA, the column is parsed as logical.

Here's a reprex of your example above, with intermediate results shown. The all-NA columns differ from the partly-NA ones in that they're parsed as logical.

So we could generalize this to say that NAs are not removed when you unite a logical column containing only NAs. (The second reprex shows that if Parent2 and Parent3 are converted to character, na.rm works as expected.)
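As a quick base R check of why the all-NA columns end up logical (a sketch only, independent of tidyr): a bare NA is a logical constant, and it only becomes a character NA once it is combined with a string. That would also explain why a column's type changes as soon as a value such as "Uncle" is assigned into it.

```r
# A bare NA is logical, so a vector built only from NA stays logical.
na_only <- c(NA, NA, NA)
class(na_only)

# Combining NA with a string coerces the whole vector to character,
# which turns the missing value into NA_character_.
na_with_string <- c(NA, "Uncle")
class(na_with_string)

# Typed missing values keep their own type.
class(NA_character_)
class(NA_real_)
```

Writing NA_character_ instead of NA in the tribble() call should therefore give character columns from the start.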

library(tidyverse)

data <- tribble(
  ~Name,    ~Postalcode, ~Parent,  ~Parent2, ~Parent3, 
  "Paul",   "4732",      "Mother", NA,       NA,       
  "Edward", "9045",      NA,       NA,       NA, 
  "Mary",   "3476",      "Mother", NA,       NA,       
  NA,       NA,          NA,       NA,       NA,      
  NA,       "2468",      NA,       NA,       NA
)

glimpse(data)
#> Observations: 5
#> Variables: 5
#> $ Name       <chr> "Paul", "Edward", "Mary", NA, NA
#> $ Postalcode <chr> "4732", "9045", "3476", NA, "2468"
#> $ Parent     <chr> "Mother", NA, "Mother", NA, NA
#> $ Parent2    <lgl> NA, NA, NA, NA, NA
#> $ Parent3    <lgl> NA, NA, NA, NA, NA

# The NAs from both Parent2 and Parent3 are pasted in as strings, while the NAs
# from Parent are properly removed
data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full 
#>   <chr>  <chr>      <chr>       
#> 1 Paul   4732       Mother|NA|NA
#> 2 Edward 9045       NA|NA       
#> 3 Mary   3476       Mother|NA|NA
#> 4 <NA>   <NA>       NA|NA       
#> 5 <NA>   2468       NA|NA

# Add a value anywhere in Parent3, and all its NAs get removed, but Parent2's
# NAs are still pasted into the middle
data[[2, "Parent3"]] <- "Uncle"

data
#> # A tibble: 5 x 5
#>   Name   Postalcode Parent Parent2 Parent3
#>   <chr>  <chr>      <chr>  <lgl>   <chr>  
#> 1 Paul   4732       Mother NA      <NA>   
#> 2 Edward 9045       <NA>   NA      Uncle  
#> 3 Mary   3476       Mother NA      <NA>   
#> 4 <NA>   <NA>       <NA>   NA      <NA>   
#> 5 <NA>   2468       <NA>   NA      <NA>

data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full
#>   <chr>  <chr>      <chr>      
#> 1 Paul   4732       Mother|NA  
#> 2 Edward 9045       NA|Uncle   
#> 3 Mary   3476       Mother|NA  
#> 4 <NA>   <NA>       NA         
#> 5 <NA>   2468       NA

# Add a value to Parent2, and now there are no all-NA columns, so no NAs are
# pasted in
data[[1, "Parent2"]] <- "Aunt"

data
#> # A tibble: 5 x 5
#>   Name   Postalcode Parent Parent2 Parent3
#>   <chr>  <chr>      <chr>  <chr>   <chr>  
#> 1 Paul   4732       Mother Aunt    <NA>   
#> 2 Edward 9045       <NA>   <NA>    Uncle  
#> 3 Mary   3476       Mother <NA>    <NA>   
#> 4 <NA>   <NA>       <NA>   <NA>    <NA>   
#> 5 <NA>   2468       <NA>   <NA>    <NA>

data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full
#>   <chr>  <chr>      <chr>      
#> 1 Paul   4732       Mother|Aunt
#> 2 Edward 9045       Uncle      
#> 3 Mary   3476       Mother     
#> 4 <NA>   <NA>       ""         
#> 5 <NA>   2468       ""

Created on 2019-09-27 by the reprex package (v0.3.0)

library(tidyverse)

data <- tribble(
  ~Name,    ~Postalcode, ~Parent,  ~Parent2, ~Parent3, 
  "Paul",   "4732",      "Mother", NA,       NA,       
  "Edward", "9045",      NA,       NA,       NA, 
  "Mary",   "3476",      "Mother", NA,       NA,       
  NA,       NA,          NA,       NA,       NA,      
  NA,       "2468",      NA,       NA,       NA
)

data <- data %>%
  mutate(Parent2 = as.character(Parent2),
         Parent3 = as.character(Parent3))

data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full
#>   <chr>  <chr>      <chr>      
#> 1 Paul   4732       Mother     
#> 2 Edward 9045       ""         
#> 3 Mary   3476       Mother     
#> 4 <NA>   <NA>       ""         
#> 5 <NA>   2468       ""

Created on 2019-09-27 by the reprex package (v0.3.0)

Edit:

This also occurs if the column is logical and _not_ all NA.

data <- tribble(
  ~Name,    ~Postalcode, ~Parent,  ~Parent2, ~Parent3, 
  "Paul",   "4732",      "Mother", TRUE,     NA,       
  "Edward", "9045",      NA,       NA,       TRUE, 
  "Mary",   "3476",      "Mother", NA,       NA,       
  NA,       NA,          NA,       FALSE,    NA,      
  NA,       "2468",      NA,       NA,       NA
)


data %>% unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)
#> # A tibble: 5 x 3
#>   Name   Postalcode Parent_full   
#>   <chr>  <chr>      <chr>         
#> 1 Paul   4732       Mother|TRUE|NA
#> 2 Edward 9045       NA|TRUE       
#> 3 Mary   3476       Mother|NA|NA  
#> 4 <NA>   <NA>       FALSE|NA      
#> 5 <NA>   2468       NA|NA
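
Until this is fixed, one possible workaround (a sketch, not an official recommendation) is to cast every column being united to character first, as in the second reprex but for all selected columns at once. It assumes dplyr >= 1.0.0 for across(); on older versions mutate_at(vars(Parent:Parent3), as.character) should be the equivalent.

```r
library(dplyr)
library(tidyr)

data <- tribble(
  ~Name,    ~Postalcode, ~Parent,  ~Parent2, ~Parent3,
  "Paul",   "4732",      "Mother", TRUE,     NA,
  "Edward", "9045",      NA,       NA,       TRUE,
  "Mary",   "3476",      "Mother", NA,       NA,
  NA,       NA,          NA,       FALSE,    NA,
  NA,       "2468",      NA,       NA,       NA
)

# Cast the columns to character before uniting, so every missing value is
# NA_character_ by the time unite() sees it.
result <- data %>%
  mutate(across(Parent:Parent3, as.character)) %>%
  unite(Parent_full, Parent:Parent3, sep = "|", na.rm = TRUE)

result$Parent_full
```

Note that logical values are united as the strings "TRUE" and "FALSE", and rows where everything is missing still come out as "" rather than NA.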

All 4 comments

Maybe this is related:

library(tidyverse)
unite_dbl <- tribble(
  ~Date,        ~First, ~Second,
  "2019-01-07", 2.75,   NA,
  "2019-01-07", NA,     2.5,
  "2019-01-08", 0.25,   NA,
  "2019-01-08", NA,     4.5
)
glimpse(unite_dbl)
#> Observations: 4
#> Variables: 3
#> $ Date   <chr> "2019-01-07", "2019-01-07", "2019-01-08", "2019-01-08"
#> $ First  <dbl> 2.75, NA, 0.25, NA
#> $ Second <dbl> NA, 2.5, NA, 4.5

unite_dbl %>% unite(col = tmp, 2:3, na.rm = TRUE)
#> # A tibble: 4 x 2
#>   Date       tmp    
#>   <chr>      <chr>  
#> 1 2019-01-07 2.75_NA
#> 2 2019-01-07 NA_2.5 
#> 3 2019-01-08 0.25_NA
#> 4 2019-01-08 NA_4.5 

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2019-11-17                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.3)
#>  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
#>  broom         0.5.2   2019-04-07 [1] CRAN (R 3.5.3)
#>  callr         3.3.2   2019-09-22 [1] CRAN (R 3.6.1)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.5.0)
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.3)
#>  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.5.3)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)
#>  devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
#>  digest        0.6.22  2019-10-21 [1] CRAN (R 3.6.1)
#>  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.6.1)
#>  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.5.3)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
#>  generics      0.0.2   2018-11-29 [1] CRAN (R 3.5.2)
#>  ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.6.1)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.3)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.5.3)
#>  haven         2.2.0   2019-11-08 [1] CRAN (R 3.6.1)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.5.3)
#>  hms           0.5.2   2019-10-30 [1] CRAN (R 3.6.1)
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)
#>  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.1)
#>  jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.2)
#>  knitr         1.26    2019-11-12 [1] CRAN (R 3.6.1)
#>  lattice       0.20-38 2018-11-04 [2] CRAN (R 3.6.1)
#>  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.5.3)
#>  lifecycle     0.1.0   2019-08-01 [1] CRAN (R 3.6.1)
#>  lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
#>  modelr        0.1.5   2019-08-08 [1] CRAN (R 3.6.1)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.0)
#>  nlme          3.1-142 2019-11-07 [1] CRAN (R 3.6.1)
#>  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.6.1)
#>  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.1)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.1)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.2)
#>  purrr       * 0.3.3   2019-10-18 [1] CRAN (R 3.6.1)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
#>  Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.1)
#>  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.5.2)
#>  readxl        1.3.1   2019-03-13 [1] CRAN (R 3.5.3)
#>  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
#>  rlang         0.4.1   2019-10-24 [1] CRAN (R 3.6.1)
#>  rmarkdown     1.17    2019-11-13 [1] CRAN (R 3.6.1)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
#>  rvest         0.3.5   2019-11-08 [1] CRAN (R 3.6.1)
#>  scales        1.0.0   2018-08-09 [1] CRAN (R 3.5.1)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
#>  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.3)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.5.3)
#>  testthat      2.3.0   2019-11-05 [1] CRAN (R 3.6.1)
#>  tibble      * 2.1.3   2019-06-06 [1] CRAN (R 3.6.0)
#>  tidyr       * 1.0.0   2019-09-11 [1] CRAN (R 3.6.1)
#>  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
#>  tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.5.0)
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
#>  vctrs         0.2.0   2019-07-05 [1] CRAN (R 3.6.1)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
#>  xfun          0.11    2019-11-12 [1] CRAN (R 3.6.1)
#>  xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.1)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)
#>  zeallot       0.1.0   2018-01-28 [1] CRAN (R 3.5.0)

Did I miss something here?

Minimal reprex:

library(tidyr)

df <- tibble(
  x = "x",
  lgl = NA,
  dbl = NA_real_,
  chr = NA_character_
)

df %>% unite(out, c("x", "lgl"), na.rm = TRUE) %>% .$out
#> [1] "x_NA"
df %>% unite(out, c("x", "dbl"), na.rm = TRUE) %>% .$out
#> [1] "x_NA"
df %>% unite(out, c("x", "chr"), na.rm = TRUE) %>% .$out
#> [1] "x"

Created on 2019-11-24 by the reprex package (v0.3.0)
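
For comparison, here is a base R sketch of the behaviour the reprex seems to expect: missing values dropped regardless of column type. paste_non_missing() is a hypothetical helper written for this illustration, not part of tidyr, and it makes no claim about how unite() is implemented.

```r
# Hypothetical helper (not a tidyr function): paste the non-missing values of
# each row together, whatever the column types are.
paste_non_missing <- function(df, cols, sep = "_") {
  # Convert each selected column to character so that NA, NA_real_ and
  # NA_character_ all become NA_character_.
  chr_cols <- lapply(df[cols], as.character)
  vapply(
    seq_len(nrow(df)),
    function(i) {
      # Collect the i-th value of every column, then drop the missing ones.
      row <- vapply(chr_cols, function(col) col[[i]], character(1))
      paste(row[!is.na(row)], collapse = sep)
    },
    character(1)
  )
}

df <- data.frame(
  x = "x",
  lgl = NA,
  dbl = NA_real_,
  chr = NA_character_,
  stringsAsFactors = FALSE
)

out_lgl <- paste_non_missing(df, c("x", "lgl"))
out_dbl <- paste_non_missing(df, c("x", "dbl"))
out_chr <- paste_non_missing(df, c("x", "chr"))
c(out_lgl, out_dbl, out_chr)
```

All three calls give the same result here because the type conversion happens before the missing values are filtered out.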
