Tidyr: unnest crashes R on nested data.frame

Created on 17 Apr 2017 · 11 Comments · Source: tidyverse/tidyr

Running unnest on this data.frame, which has a nested data.frame in a list-column, crashes R, while running unnest directly on the inner nested data.frame (correctly?) gives an error.

library(tidyr)
probdf <-
structure(list(ceref = "AUF080",  
    jsoninfo = list(structure(list(regulatedActivity = structure(list(
        status = c("R", "R", "R", "R", "A", "A", "A", "A"), actType = c(1L, 2L, 
        2L, 2L, 1L, 2L, 4L, 5L)), .Names = c("status", "actType"
    ), class = "data.frame", row.names = c(NA, 8L)), effectivePeriodList = list(structure(list(endDate = "Apr 25, 2013 12:00:00 AM", 
        effectiveDate = "Dec 1, 2009 12:00:00 AM"), .Names = c("endDate", 
    "effectiveDate"), class = "data.frame", row.names = 1L), 
        structure(list(endDate = "Apr 25, 2013 12:00:00 AM", 
            effectiveDate = "Dec 1, 2009 12:00:00 AM"), .Names = c("endDate", 
        "effectiveDate"), class = "data.frame", row.names = 1L), 
        structure(list(endDate = "Apr 23, 2013 12:00:00 AM", 
            effectiveDate = "Feb 4, 2010 12:00:00 AM"), .Names = c("endDate", 
        "effectiveDate"), class = "data.frame", row.names = 1L), 
        structure(list(endDate = "Feb 19, 2010 12:00:00 AM", 
            effectiveDate = "Dec 1, 2009 12:00:00 AM"), .Names = c("endDate", 
        "effectiveDate"), class = "data.frame", row.names = 1L), 
        structure(list(endDate = NA, effectiveDate = "Apr 25, 2013 12:00:00 AM"), .Names = c("endDate", 
        "effectiveDate"), class = "data.frame", row.names = 1L), 
        structure(list(endDate = NA, effectiveDate = "Apr 25, 2013 12:00:00 AM"), .Names = c("endDate", 
        "effectiveDate"), class = "data.frame", row.names = 1L), 
        structure(list(endDate = NA, effectiveDate = "Mar 17, 2015 12:00:00 AM"), .Names = c("endDate", 
        "effectiveDate"), class = "data.frame", row.names = 1L), 
        structure(list(endDate = NA, effectiveDate = "Mar 17, 2015 12:00:00 AM"), .Names = c("endDate", 
        "effectiveDate"), class = "data.frame", row.names = 1L))), .Names = c("regulatedActivity", "effectivePeriodList"), class = "data.frame", row.names = c(NA, 
    8L)))), .Names = c("ceref", "jsoninfo"), row.names = 5L, class = "data.frame")
unnest(probdf$jsoninfo[[1]])
#> Error: Each variable must be a 1d atomic vector or list.
#> Problem variables: 'regulatedActivity'
### NOT RUN
unnest(probdf)

Session info

devtools::session_info()
#> Session info--------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.3.3 (2017-03-06)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       Asia/Hong_Kong
#> Packages------------------------------------------------------------------
#>  package    * version date       source        
#>  assertthat   0.1     2013-12-06 CRAN (R 3.3.1)
#>  backports    1.0.5   2017-01-18 CRAN (R 3.3.2)
#>  DBI          0.6     2017-03-09 CRAN (R 3.3.3)
#>  devtools     1.6.1   2014-10-07 CRAN (R 3.1.1)
#>  digest       0.6.10  2016-08-02 CRAN (R 3.3.1)
#>  dplyr        0.5.0   2016-06-24 CRAN (R 3.3.1)
#>  evaluate     0.10    2016-10-11 CRAN (R 3.3.3)
#>  htmltools    0.3.5   2016-03-21 CRAN (R 3.3.1)
#>  knitr        1.15.1  2016-11-22 CRAN (R 3.3.3)
#>  lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.1)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.3.1)
#>  R6           2.1.3   2016-08-19 CRAN (R 3.3.1)
#>  Rcpp         0.12.7  2016-09-05 CRAN (R 3.3.1)
#>  rmarkdown    1.3     2016-12-21 CRAN (R 3.3.2)
#>  rprojroot    1.2     2017-01-16 CRAN (R 3.3.2)
#>  rstudioapi   0.6     2016-06-27 CRAN (R 3.3.3)
#>  stringi      1.1.2   2016-10-01 CRAN (R 3.3.2)
#>  stringr      1.1.0   2016-08-19 CRAN (R 3.3.1)
#>  tibble       1.2     2016-08-26 CRAN (R 3.3.1)
#>  tidyr      * 0.6.1   2017-01-10 CRAN (R 3.3.3)
#>  yaml         2.1.13  2014-06-12 CRAN (R 3.1.1)

bug

All 11 comments

Thanks for the report. Could you turn this into a minimal reprex please? dput() dumps are not the most readable form for an example.

It's hard to minimise it much further. I found a slightly simpler structure than the convoluted one I gave above that still crashes unnest (the nested data.frame now consists of just two data.frames). For context, the data.frame comes from jsonlite::fromJSON. Below is the simplest JSON I could find that causes the crash, along with a demonstration that even a small change avoids it:

library(tidyr)
library(jsonlite)
probdf <- fromJSON('[{"ceref":"AUF080","jsoninfo":[{"regulatedActivity":{"status":"R","actType":2},"effectivePeriodList":{"endDate":"Apr 25, 2013","effectiveDate":"Dec 1, 2009"}},{"regulatedActivity":{"status":"R","actType":2},"effectivePeriodList":{"endDate":"Apr 23, 2013","effectiveDate":"Feb 4, 2010"}},{"regulatedActivity":{"status":"R","actType":2},"effectivePeriodList":{"endDate":"Feb 19, 2010","effectiveDate":"Dec 1, 2009"}}]}]')
str(probdf)
#> 'data.frame':    1 obs. of  2 variables:
#>  $ ceref   : chr "AUF080"
#>  $ jsoninfo:List of 1
#>   ..$ :'data.frame': 3 obs. of  2 variables:
#>   .. ..$ regulatedActivity  :'data.frame':   3 obs. of  2 variables:
#>   .. .. ..$ status : chr  "R" "R" "R"
#>   .. .. ..$ actType: int  2 2 2
#>   .. ..$ effectivePeriodList:'data.frame':   3 obs. of  2 variables:
#>   .. .. ..$ endDate      : chr  "Apr 25, 2013" "Apr 23, 2013" "Feb 19, 2010"
#>   .. .. ..$ effectiveDate: chr  "Dec 1, 2009" "Feb 4, 2010" "Dec 1, 2009"
# just deleting one row from nested data.frame avoids a crash
probdf12 <- probdf
probdf13 <- probdf
probdf23 <- probdf
probdf12$jsoninfo[[1]] <- probdf12$jsoninfo[[1]][c(1,2),]
probdf13$jsoninfo[[1]] <- probdf13$jsoninfo[[1]][c(1,3),]
probdf23$jsoninfo[[1]] <- probdf23$jsoninfo[[1]][c(2,3),]
str(probdf12)
#> 'data.frame':    1 obs. of  2 variables:
#>  $ ceref   : chr "AUF080"
#>  $ jsoninfo:List of 1
#>   ..$ :'data.frame': 2 obs. of  2 variables:
#>   .. ..$ regulatedActivity  :'data.frame':   2 obs. of  2 variables:
#>   .. .. ..$ status : chr  "R" "R"
#>   .. .. ..$ actType: int  2 2
#>   .. ..$ effectivePeriodList:'data.frame':   2 obs. of  2 variables:
#>   .. .. ..$ endDate      : chr  "Apr 25, 2013" "Apr 23, 2013"
#>   .. .. ..$ effectiveDate: chr  "Dec 1, 2009" "Feb 4, 2010"
unnest(probdf12)
#>    ceref regulatedActivity        effectivePeriodList
#> 1 AUF080              R, R Apr 25, 2013, Apr 23, 2013
#> 2 AUF080              2, 2   Dec 1, 2009, Feb 4, 2010
unnest(probdf13)
#>    ceref regulatedActivity        effectivePeriodList
#> 1 AUF080              R, R Apr 25, 2013, Feb 19, 2010
#> 2 AUF080              2, 2   Dec 1, 2009, Dec 1, 2009
unnest(probdf23)
#>    ceref regulatedActivity        effectivePeriodList
#> 1 AUF080              R, R Apr 23, 2013, Feb 19, 2010
#> 2 AUF080              2, 2   Feb 4, 2010, Dec 1, 2009
### NOT RUN
unnest(probdf) # crashes R
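
For anyone triaging similar data, here is a minimal base R sketch for spotting which columns inside each list-column element are themselves data frames, since those are the columns the error messages in this thread complain about. The helper name df_cols and the hand-built probdf_like object are hypothetical stand-ins for illustration, not part of tidyr or jsonlite:

```r
# Hypothetical helper (not a tidyr function): names of the columns of `d`
# that are themselves data frames.
df_cols <- function(d) names(d)[vapply(d, is.data.frame, logical(1))]

# Hand-built stand-in for the fromJSON() result above: a data.frame whose
# column `regulatedActivity` is itself a data.frame.
inner <- data.frame(row = 1:3)
inner$regulatedActivity <- data.frame(
  status = c("R", "R", "R"),
  actType = c(2L, 2L, 2L),
  stringsAsFactors = FALSE
)

probdf_like <- data.frame(ceref = "AUF080", stringsAsFactors = FALSE)
probdf_like$jsoninfo <- list(inner)

# One character vector per element of the list-column.
nested_cols <- lapply(probdf_like$jsoninfo, df_cols)
print(nested_cols)
```

Running the check before calling unnest makes it easy to see whether the input has a second level of nesting.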

I have a similar problem with some tweet analysis.


I put the offending file here for download https://storage.googleapis.com/mark-edmondson-public-files/tidyr_bug.rds

Reproduce via:

download.file("https://storage.googleapis.com/mark-edmondson-public-files/tidyr_bug.rds", "tidyr_bug.rds")

problem <- readRDS("tidyr_bug.rds")
problem

## A tibble: 5 × 5
#           status_id        nlp sentiment_mag sentiment_score             entities
#*              <chr>     <list>         <dbl>           <dbl>               <list>
#1 861213541776449540 <list [5]>           0.1            -0.1 <data.frame [5 × 6]>
#2 861213390349496320 <list [5]>           0.3             0.3 <data.frame [5 × 6]>
#3 861211688015732736 <list [5]>           0.0             0.0 <data.frame [5 × 6]>
#4 861211516426608640 <list [5]>           0.0             0.0 <data.frame [4 × 6]>
#5 861211458419527680 <list [5]>           0.1             0.1 <data.frame [5 × 6]>

library(tidyr)
unnest(problem, entities)

# RStudio crashes
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.4

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_0.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9            XML_3.98-1.4           assertthat_0.1         digest_0.6.12         
 [5] withr_1.0.2            mime_0.5               bitops_1.0-6           R6_2.2.0              
 [9] xtable_1.8-2           magrittr_1.5           httr_1.2.1.9000        googleAuthR_0.5.1.9000
[13] devtools_1.12.0.9000   RJSONIO_1.3-0          tools_3.3.2            RSelenium_1.4.2       
[17] RCurl_1.95-4.8         shiny_1.0.0            httpuv_1.3.3           pkgload_0.0.0.9000    
[21] pkgbuild_0.0.0.9000    caTools_1.17.1         memoise_1.0.0          htmltools_0.3.5       
[25] tibble_1.2  

I spotted the same crash. In my case, tibbles within the list-column have identical column names but are of different column types. The following code crashes R:

WARNING ⚠️ CODE BELOW CRASHES R

library(tibble)
library(magrittr)

tib_one <- tibble(a = runif(10), b = runif(10))
tib_two <- tibble(a = runif(10),  b = LETTERS[1:10])

tib_lc <- tibble(lc = list(tib_one, tib_two))

tib_lc %>% tidyr::unnest(lc)
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_0.2.2.2 tibble_1.3.1  magrittr_1.5 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11           influxdbr_0.11.12      munsell_0.4.3          colorspace_1.3-2       lattice_0.20-35        R6_2.2.1              
 [7] rlang_0.1.1            httr_1.2.1             plyr_1.8.4             dplyr_0.5.0            tools_3.4.0            xts_0.9-7             
[13] grid_3.4.0             gtable_0.2.0           DBI_0.6-1              yaml_2.1.14            lazyeval_0.2.0         assertthat_0.2.0      
[19] crayon_1.3.2           ggplot2_2.2.1          tidyr_0.6.3            microbenchmark_1.4-2.1 testthat_1.0.2         curl_2.6              
[25] compiler_3.4.0         scales_0.4.1           jsonlite_1.4           zoo_1.8-0

edit: executing the code above on an rstudio-server instance results in an error (instead of crashing):

Error in bind_rows_(x, .id) : 
Can not automatically convert from numeric to character in column "b". 

So maybe it's OS specific?
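
As a possible workaround for the type mismatch (a sketch only, not something proposed by the maintainers in this thread), the conflicting column can be coerced to a common type in every nested tibble before unnesting. The choice of character as the common type for column b is an assumption made for this example:

```r
library(tibble)
library(tidyr)

tib_one <- tibble(a = runif(10), b = runif(10))
tib_two <- tibble(a = runif(10), b = LETTERS[1:10])
tib_lc  <- tibble(lc = list(tib_one, tib_two))

# Coerce column `b` to character in each nested tibble so that all
# elements of the list-column agree on column types.
tib_lc$lc <- lapply(tib_lc$lc, function(d) {
  d$b <- as.character(d$b)
  d
})

# With matching column types, the nested tibbles can be row-bound.
out <- unnest(tib_lc, lc)
str(out)
```

This sidesteps the numeric-to-character conversion that the error message refers to, at the cost of turning the numeric values in b into strings.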

That narrows it down a bit for me: I also had duplicate names, but within the nest, e.g. data.frame(entities = nlp) where names(nlp) is "sentiment", "entities", etc. So perhaps renaming one column will stop the crash.

I can't replicate any of the crashes with dplyr 0.7.1. Can you please confirm?

Both Mac OS and Debian with dplyr 0.7.1 now give an error (as expected). Thanks!

Error in bind_rows_(x, .id) : 
Column `b` can't be converted from numeric to character

I can confirm I get an error instead of a crash now:

> library(tidyr)
> library(jsonlite)
> probdf <- fromJSON('[{"ceref":"AUF080","jsoninfo":[{"regulatedActivity":{"status":"R","actType":2},"effectivePeriodList":{"endDate":"Apr 25, 2013","effectiveDate":"Dec 1, 2009"}},{"regulatedActivity":{"status":"R","actType":2},"effectivePeriodList":{"endDate":"Apr 23, 2013","effectiveDate":"Feb 4, 2010"}},{"regulatedActivity":{"status":"R","actType":2},"effectivePeriodList":{"endDate":"Feb 19, 2010","effectiveDate":"Dec 1, 2009"}}]}]')
> unnest(probdf)
Error in bind_rows_(x, .id) : 
  Argument 1 can't be a list containing data frames

However, data frames that could be unnested before now cannot:

> probdf12 <- probdf
> probdf12$jsoninfo[[1]] <- probdf12$jsoninfo[[1]][c(1,2),]
> unnest(probdf12)
Error in bind_rows_(x, .id) : 
  Argument 1 can't be a list containing data frames

In conclusion, the bug I filed is fixed, with some regression.

@slygent: Would you mind filing a new issue with some more context? What is the expected output?

I think I have a similar issue to the one presented in the previous comment.

df <- structure(list(location = "SPARTAN - CITEDEF", measurements = list(
  structure(list(parameter = "pm25", value = 18.1, lastUpdated = "2015-04-15T00:00:00.000Z",
                 unit = "µg/m³", sourceName = "Spartan", averagingPeriod = structure(list(
                   unit = "hours", value = 1L), .Names = c("unit", "value"
                   ), class = "data.frame", row.names = 1L)), .Names = c("parameter",
                                                                         "value", "lastUpdated", "unit", "sourceName", "averagingPeriod"
                   ), class = "data.frame", row.names = 1L))), .Names = c("location",
                                                                          "measurements"), row.names = c(NA, -1L), class = c("tbl_df",
                                                                                                                             "tbl", "data.frame"))
df
#>            location
#> 1 SPARTAN - CITEDEF
#>                                                     measurements
#> 1 pm25, 18.1, 2015-04-15T00:00:00.000Z, µg/m³, Spartan, hours, 1
class(df$measurements[[1]])
#> [1] "data.frame"
class(df$measurements[[1]]$averagingPeriod)
#> [1] "data.frame"
tidyr::unnest_(df, "measurements")
#> Error in bind_rows_(x, .id): Argument 6 can't be a list containing data frames

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                                      
#>  version  R version 3.4.0 Patched (2017-05-10 r72669)
#>  system   x86_64, mingw32                            
#>  ui       RTerm                                      
#>  language (EN)                                       
#>  collate  English_United States.1252                 
#>  tz       Europe/Paris                               
#>  date     2017-07-23
#> Packages -----------------------------------------------------------------
#>  package    * version date       source        
#>  assertthat   0.2.0   2017-04-11 CRAN (R 3.4.0)
#>  backports    1.0.5   2017-01-18 CRAN (R 3.4.0)
#>  base       * 3.4.0   2017-05-13 local         
#>  bindr        0.1     2016-11-13 CRAN (R 3.4.1)
#>  bindrcpp     0.2     2017-06-17 CRAN (R 3.4.1)
#>  compiler     3.4.0   2017-05-13 local         
#>  datasets   * 3.4.0   2017-05-13 local         
#>  devtools     1.13.2  2017-06-02 CRAN (R 3.4.1)
#>  digest       0.6.12  2017-01-27 CRAN (R 3.4.0)
#>  dplyr        0.7.1   2017-06-22 CRAN (R 3.4.1)
#>  evaluate     0.10    2016-10-11 CRAN (R 3.4.0)
#>  glue         1.1.1   2017-06-21 CRAN (R 3.4.1)
#>  graphics   * 3.4.0   2017-05-13 local         
#>  grDevices  * 3.4.0   2017-05-13 local         
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.4.0)
#>  knitr        1.16    2017-05-18 CRAN (R 3.4.1)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.4.0)
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.4.0)
#>  methods    * 3.4.0   2017-05-13 local         
#>  pkgconfig    2.0.1   2017-03-21 CRAN (R 3.4.0)
#>  R6           2.2.1   2017-05-10 CRAN (R 3.4.0)
#>  Rcpp         0.12.11 2017-05-22 CRAN (R 3.4.0)
#>  rlang        0.1.1   2017-05-18 CRAN (R 3.4.0)
#>  rmarkdown    1.6     2017-06-15 CRAN (R 3.4.1)
#>  rprojroot    1.2     2017-01-16 CRAN (R 3.4.0)
#>  stats      * 3.4.0   2017-05-13 local         
#>  stringi      1.1.5   2017-04-07 CRAN (R 3.4.0)
#>  stringr      1.2.0   2017-02-18 CRAN (R 3.4.0)
#>  tibble       1.3.3   2017-05-28 CRAN (R 3.4.0)
#>  tidyr        0.6.3   2017-05-15 CRAN (R 3.4.0)
#>  tools        3.4.0   2017-05-13 local         
#>  utils      * 3.4.0   2017-05-13 local         
#>  withr        1.0.2   2016-06-20 CRAN (R 3.4.0)
#>  yaml         2.1.14  2016-11-12 CRAN (R 3.4.0)

I get an error because I try to unnest a data.frame with 2 levels of nested data.frames: measurements is a list of data.frames inside df and inside measurements there's a column with nested data.frames.

I'd expect the unnest_(df, "measurements") to give me a data.frame df2 with a list-column averagingPeriod that contains data.frames, and then I'd apply unnest_(df2, "averagingPeriod") on it. Or maybe I'm thinking about it the wrong way. :-)
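
One way around the two-level nesting (a sketch under assumptions, not the behaviour requested above) is to flatten the inner data.frame columns with jsonlite::flatten before unnesting. The example rebuilds a cut-down version of df by hand; the reduced set of columns is an assumption made to keep it short:

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Cut-down stand-in for one element of `measurements`: a data.frame with a
# data.frame column `averagingPeriod`.
inner <- data.frame(parameter = "pm25", value = 18.1, stringsAsFactors = FALSE)
inner$averagingPeriod <- data.frame(unit = "hours", value = 1L,
                                    stringsAsFactors = FALSE)

df <- tibble(location = "SPARTAN - CITEDEF", measurements = list(inner))

# jsonlite::flatten() turns data.frame columns into ordinary columns,
# prefixing their names with the parent column name. The namespace is
# spelled out because purrr also exports a function called flatten().
df$measurements <- lapply(df$measurements, jsonlite::flatten)

# After flattening there is only one level of nesting left to unnest.
flat <- unnest(df, measurements)
names(flat)
```

The result has one row per measurement, with the averaging period spread over prefixed columns rather than kept as a list-column of data.frames. When the data comes straight from JSON, fromJSON(txt, flatten = TRUE) applies the same flattening at parse time.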

I'm no longer seeing the crash, so I'm going to close this issue.
