Does this error mean that the dataset is too big? Is there a way to make it work by setting options(expressions=), and if so, how?
> library(sparklyr)
> spark_disconnect(sc)
> sc <- spark_connect("local", version = "2.0.0")
>
> mat_1000x1000 <- matrix(runif(1*(10^6)), ncol = 1000, nrow = 1000)
> mat_1000x1001 <- cbind(mat_1000x1000, 1)
> colnames(mat_1000x1001) <- paste0("topic", 1:1001)
> mat_1000x1001 <- as.data.frame(mat_1000x1001)
> mat_1000x1001_tbl <- copy_to(sc, mat_1000x1001, "mat_1000x1001", overwrite = TRUE)
>
> library(dplyr)
> mat_1000x1001_tbl %>%
+ mutate_each(funs(norm = . / topic1001)) %>%
+ select(contains("norm")) %>%
+ ml_kmeans(centers=3) -> km_model
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?
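The error message itself points at options(expressions=): base R caps the depth of nested expression evaluation at 5000 by default, and that cap can be raised. A minimal sketch of doing so, noting that this is a workaround at best and may only postpone the failure (presumably the limit is hit because dplyr builds a deeply nested expression across all ~1000 columns):

```r
# R's default nesting limit is 5000; the documented maximum is 500000.
old <- options(expressions = 500000)   # raise the evaluation-depth cap
print(getOption("expressions"))        # confirm the new limit
# ... re-run the offending pipeline here ...
options(old)                           # restore the previous setting
```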
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=pl_PL.UTF-8 LC_NUMERIC=C LC_TIME=pl_PL.UTF-8 LC_COLLATE=pl_PL.UTF-8
[5] LC_MONETARY=pl_PL.UTF-8 LC_MESSAGES=pl_PL.UTF-8 LC_PAPER=pl_PL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 ggplot2_2.1.0 sparklyr_0.3.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.6 withr_1.0.2 digest_0.6.9 rprojroot_1.0-2 assertthat_0.1 rappdirs_0.3.1
[7] grid_3.3.1 R6_2.1.2 plyr_1.8.4 gtable_0.2.0 DBI_0.4-1 magrittr_1.5
[13] scales_0.4.0 lazyeval_0.2.0 labeling_0.3 config_0.1.0 tools_3.3.1 munsell_0.4.3
[19] parallel_3.3.1 yaml_2.1.13 colorspace_1.2-6 tibble_1.1
The error here is occurring due to the use of mutate_each; at least, evaluating just this part reproduces the error for me:
mat_1000x1001_tbl %>%
mutate_each(funs(norm = . / topic1001))
This makes munging large data sets (the kind one needs Spark for) hard to deal with.
It does seem related to the number of columns: the problem does not arise with a small number of columns.
testdf <- data.frame(a1 = rnorm(1e5), a2 = rnorm(1e5))
testdf_tbl <- copy_to(sc, testdf, overwrite = TRUE)
testdf_tbl %>% mutate_all(funs(sign(.))) %>% head()
Source: query [?? x 2]
Database: spark connection master=local[24] app=sparklyr local=TRUE
a1 a2
<dbl> <dbl>
1 1 1
2 1 -1
3 1 1
4 1 -1
5 1 1
6 1 -1
This works, but a table with 10 rows and 1000 columns fails:
testmat <- matrix(runif(10 * 1000), ncol = 1000)
testdf <- as.data.frame(testmat)
testdf_tbl <- copy_to(sc, testdf, overwrite = TRUE)
testdf_tbl %>% mutate_all(funs(sign(.))) %>% head()
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?
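One hedged workaround sketch for the deep-nesting failure: instead of routing ~1000 mutate expressions through dplyr's SQL translator, assemble a single Spark SQL SELECT by hand. The table and column names below match the mat_1000x1001 example above; the commented-out DBI::dbGetQuery() call assumes a live sparklyr connection sc.

```r
# Build "topicK / topic1001 AS topicK_norm" for each of the 1000 columns.
cols  <- paste0("topic", 1:1000)
exprs <- sprintf("%s / topic1001 AS %s_norm", cols, cols)
sql   <- paste("SELECT", paste(exprs, collapse = ", "),
               "FROM mat_1000x1001")
# normalized <- DBI::dbGetQuery(sc, sql)  # requires a live Spark connection
cat(substr(sql, 1, 70), "...\n")
```

This sidesteps dplyr's per-column expression nesting entirely, at the cost of giving up the lazy tbl interface for this one step.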
This is still the case.
This seems finally fixed in Spark 2.2.0:
> mat_1000x1001_tbl %>%
+ mutate_each(funs(norm = . / topic1001)) %>%
+ select(contains("norm")) %>%
+ ml_kmeans(centers=3) -> km_model
`mutate_each()` is deprecated.
Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
To map `funs` over all variables, use `mutate_all()`
* No rows dropped by 'na.omit' call
> km_model
K-means clustering with 3 clusters
Cluster centers:
topic1_norm topic2_norm topic3_norm ... topic49_norm ...
(wide cluster-center output truncated; only the wrapped column headers are shown here)
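Since mutate_each() is deprecated, the scoped replacement for the pipeline above would use mutate_at() to normalize every column except the divisor. A small local sketch (assuming the dplyr 0.5-era funs() syntax used in the session above); the same call should also work on a Spark tbl:

```r
library(dplyr)

# Toy stand-in for mat_1000x1001: two topic columns plus the divisor column.
df <- data.frame(topic1 = c(2, 4), topic2 = c(6, 8), topic1001 = c(2, 2))

# Normalize all columns except topic1001; a named funs() entry appends
# "_norm" to each new column name when several columns are selected.
out <- df %>% mutate_at(vars(-topic1001), funs(norm = . / topic1001))
out %>% select(contains("norm"))
```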