Tidyr: unnest() drops empty columns

Created on 22 Jun 2019 · 6 Comments · Source: tidyverse/tidyr

Hello,

I have found a possible bug when calling unnest() after mutate().
At first I thought it was the bug reported in #483, but then I realized the two are different: this one only appears when I call mutate() before unnest().

With bug: mutate() called before unnest():

library(magrittr)

df_a <- tibble::tribble(~x, ~z,
                            "a", 10,
                            "b", 20)

df_b <- df_a %>% 
        dplyr::filter(!z %in% c(10, 20)) %>%
        dplyr::mutate(z = stringr::str_extract_all(x, pattern = "\\d{2}")) %>%
        tidyr::unnest(z) %>%
        dplyr::select(z,
                      x)
#> Error in .f(.x[[i]], ...): object 'z' not found

Created on 2019-06-22 by the reprex package (v0.3.0)

No bug: mutate() call commented out:

library(magrittr)

df_a <- tibble::tribble(~x, ~z,
                            "a", 10,
                            "b", 20)

df_b <- df_a %>% 
        dplyr::filter(!z %in% c(10, 20)) %>%
        # dplyr::mutate(z = stringr::str_extract_all(x, pattern = "\\d{2}")) %>%
        tidyr::unnest(z) %>%
        dplyr::select(z,
                      x)

Created on 2019-06-22 by the reprex package (v0.3.0)

R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 
[2] LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils    
[5] datasets  methods   base     

other attached packages:
[1] magrittr_1.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       tidyr_0.8.3     
 [3] packrat_0.5.0    crayon_1.3.4    
 [5] dplyr_0.8.1      assertthat_0.2.1
 [7] R6_2.4.0         pillar_1.4.1    
 [9] stringi_1.4.3    rlang_0.3.4     
[11] rstudioapi_0.10  tools_3.6.0     
[13] stringr_1.4.0    glue_1.3.1      
[15] purrr_0.3.2      compiler_3.6.0  
[17] pkgconfig_2.0.2  tidyselect_0.2.5
[19] tibble_2.1.3

bug rectangling

All 6 comments

I don't think this has anything to do with mutate() directly. str_extract_all() changes the type of the column from a double vector to a list, so you go from an empty double vector z to an empty list. That is why you see the issue only after using mutate().

library(magrittr)

df_a <- tibble::tribble(~x, ~z,
                        "a", 10,
                        "b", 20)

df_a %>% 
  dplyr::filter(!z %in% c(10, 20)) %>%
  dplyr::glimpse()
#> Observations: 0
#> Variables: 2
#> $ x <chr> 
#> $ z <dbl>

df_a %>% 
  dplyr::filter(!z %in% c(10, 20)) %>%
  dplyr::mutate(z = stringr::str_extract_all(x, pattern = "\\d{2}")) %>%
  dplyr::glimpse()
#> Observations: 0
#> Variables: 2
#> $ x <chr> 
#> $ z <list> []

Created on 2019-06-23 by the reprex package (v0.3.0.9000)

I think this is related to #483: with the latest development version the issue is still present, and column a is not kept when the table is empty.

library(tidyr)

df <- tibble::tibble(a = list(1), y = 1) 

df %>% .[0, ] %>% unnest(a) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    0 obs. of  1 variable:
#>  $ y: num

df %>% unnest(a) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>  $ a: num 1
#>  $ y: num 1

Latest development version of tidyr:

sessioninfo::package_info('tidyr')
#>  package    * version     date       lib source
#>  assertthat   0.2.1       2019-03-21 [1] CRAN (R 3.5.3)
#>  backports    1.1.4       2019-04-10 [1] CRAN (R 3.5.3)
#>  BH           1.69.0-1    2019-01-07 [1] CRAN (R 3.5.2)
#>  cli          1.1.0       2019-03-19 [1] CRAN (R 3.5.3)
#>  crayon       1.3.4       2017-09-16 [1] CRAN (R 3.5.3)
#>  digest       0.6.19      2019-05-20 [1] CRAN (R 3.5.3)
#>  dplyr        0.8.1.9000  2019-06-23 [1] Github (tidyverse/dplyr@3471814)
#>  ellipsis     0.1.0.9000  2019-06-12 [1] Github (r-lib/ellipsis@d8bf8a3)
#>  fansi        0.4.0       2018-10-05 [1] CRAN (R 3.5.3)
#>  glue         1.3.1.9000  2019-05-24 [1] Github (tidyverse/glue@ea0edcb)
#>  magrittr     1.5.0.9000  2019-01-06 [1] Github (tidyverse/magrittr@4104d6b)
#>  pillar       1.4.1.9000  2019-06-12 [1] Github (r-lib/pillar@c017f20)
#>  pkgconfig    2.0.2       2018-08-16 [1] CRAN (R 3.5.3)
#>  plogr        0.2.0       2018-03-25 [1] CRAN (R 3.5.3)
#>  purrr        0.3.2.9000  2019-06-12 [1] Github (tidyverse/purrr@e4d5539)
#>  R6           2.4.0       2019-02-14 [1] CRAN (R 3.5.3)
#>  Rcpp         1.0.1.3     2019-05-25 [1] Github (RcppCore/Rcpp@6062d56)
#>  rlang        0.3.99.9003 2019-06-23 [1] Github (r-lib/rlang@96a69a2)
#>  stringi      1.4.3       2019-03-12 [1] CRAN (R 3.5.3)
#>  tibble       2.1.3.9000  2019-06-23 [1] Github (tidyverse/tibble@abc5390)
#>  tidyr      * 0.8.3.9000  2019-06-13 [1] Github (tidyverse/tidyr@7a2b843)
#>  tidyselect   0.2.5       2018-10-11 [1] CRAN (R 3.5.3)
#>  utf8         1.1.4       2018-05-24 [1] CRAN (R 3.5.3)
#>  vctrs        0.1.0.9004  2019-06-23 [1] Github (r-lib/vctrs@0abd575)
#>  zeallot      0.1.0       2018-01-28 [1] CRAN (R 3.5.3)
#> 
#> [1] C:/Users/chris/Documents/R-dev
#> [2] C:/Users/chris/Documents/R/win-library/3.5
#> [3] C:/Program Files/R/R-3.5.3/library

Created on 2019-06-23 by the reprex package (v0.3.0.9000)

Even more minimal reprex:

library(tidyr)

df <- tibble(x = list(), y = integer()) 
df %>% unnest(y) %>% names()
#> [1] "x"

Created on 2019-07-24 by the reprex package (v0.3.0)

The problem arises because unnest() does three things:

  • Turns each column into a list-col of data frames
  • unchop() each column
  • unpack() each column

So

df1 <- tibble(x = list(1), y = 1) 
df1 %>% unnest(x)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1

is equivalent to

df2 <- tibble(x = data.frame(x = 1), y = 1)
df2 %>% unpack(x)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
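
For readers who want to see the three steps spelled out, here is a rough sketch that performs them by hand on a small non-empty tibble. It is only an illustration of the description above, not tidyr's actual internals; the names step1, step2 and step3 are made up for this example, and it assumes a tidyr version that exports unchop() and unpack() (1.0.0 or later).

```r
library(tidyr)

df <- tibble(x = list(1:2, 3L), y = c("a", "b"))

# Step 1: wrap each element of the list-column in a one-column data frame.
step1 <- df
step1$x <- lapply(df$x, function(el) tibble(x = el))

# Step 2: unchop() the list of data frames into a single data frame column,
# repeating the other columns as needed.
step2 <- unchop(step1, x)

# Step 3: unpack() the data frame column into ordinary columns.
step3 <- unpack(step2, x)

# Compare with the one-step version.
direct <- unnest(df, x)
all.equal(as.data.frame(step3), as.data.frame(direct))
```

With an empty list-column there are no elements to wrap in step 1, which is where the column name gets lost.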

The problem is that when x is empty it doesn't get "data frame-d", so when it is unpack()ed it disappears. I think that means the fix is for unnest() to handle empty columns specially. I think it'll have to create a column of vctrs::unspecified() type, since that can be coerced to anything else.

Oops, that's not quite right - it needs to be a length-0 list_of(.ptype = tibble(x = unspecified()))
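
As a small sketch of what such an object looks like, the snippet below builds a length-0 list_of whose prototype is a data frame with a single column x. It only shows the construction with vctrs; how unnest() would use it internally is not shown here, and the variable name empty_col is just for illustration. It also assumes the prototype is stored in the "ptype" attribute, which is how vctrs implements list_of to my knowledge.

```r
library(vctrs)
library(tibble)

# A list_of with no elements, but with a prototype that records the
# column name each element would have had.
empty_col <- list_of(.ptype = tibble(x = unspecified()))

length(empty_col)
class(empty_col)

# The prototype still carries the column name `x`, which a plain list() cannot.
names(attr(empty_col, "ptype"))
```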

Apologies for jumping in on a closed issue, but I'm still having trouble with this in tidyr 1.0.0.
Recursive application of unnest_wider is the core workhorse of a package I'm working on, but it trips over list columns with gaps in the first record, and it appears that the useful-looking keep_empty is no longer supported.

Is there any other fix than supplying .ptype for each column?
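
For context, supplying a prototype for one column looks roughly like the sketch below. It uses the ptype argument of unnest() with a named list of prototypes; the choice of integer() for x is an assumption made for this example, and the result may differ between tidyr versions (the 1.0.0 release discussed here may still raise an error).

```r
library(tidyr)

df <- tibble(x = list(), y = integer())

# Assumption for this example: the elements of `x` would have been integers.
out <- unnest(df, x, ptype = list(x = integer()))

names(out)
nrow(out)
```

The drawback raised in the question still applies: every list-column needs its own entry in the ptype list.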

Update: the current master of tidyr seems to have fixed my problem (but not the example here). I need to investigate further.

library(tidyr)
df <- tibble(x = list(), y = integer())
df %>% unnest(y) %>% names()
#> Error: Input must be list of vectors

Session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0    stringr_1.4.0    dplyr_0.8.3      purrr_0.3.3.9000 readr_1.3.1      tidyr_1.0.0      tibble_2.1.3    
 [8] ggplot2_3.2.1    tidyverse_1.3.0  ruODK_0.6.6.9005

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3           cellranger_1.1.0     pillar_1.4.2         compiler_3.6.1       dbplyr_1.4.2         tools_3.6.1         
 [7] packrat_0.5.0        lubridate_1.7.4      jsonlite_1.6         lifecycle_0.1.0.9000 nlme_3.1-141         gtable_0.3.0        
[13] lattice_0.20-38      pkgconfig_2.0.3      rlang_0.4.2.9000     reprex_0.3.0         cli_1.1.0            DBI_1.0.0           
[19] rstudioapi_0.10      haven_2.2.0          withr_2.1.2          xml2_1.2.2           httr_1.4.1           hms_0.5.2           
[25] generics_0.0.2       fs_1.3.1             vctrs_0.2.0.9007     grid_3.6.1           tidyselect_0.2.5     glue_1.3.1          
[31] R6_2.4.1             fansi_0.4.0          readxl_1.3.1         modelr_0.1.5         magrittr_1.5         backports_1.1.5     
[37] scales_1.1.0         usethis_1.5.1.9000   rvest_0.3.5          assertthat_0.2.1     colorspace_1.4-1     utf8_1.1.4          
[43] stringi_1.4.3        lazyeval_0.2.2       munsell_0.5.0        broom_0.5.2          crayon_1.3.4        

System info

Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan

@florianm please open a new issue with a reprex created using the reprex package. No need to include session info unless it is explicitly asked for.
