Unable to get ml_pipeline to test multiple formulas:
dt <- as.data.frame(rbind(c(667,1,0,1) ,
c(39,1,1,1) ,
c(644,0,1,0) ,
c(334,0,0,0) ,
c(678,1,0,1) ,
c(900,0,0,1) ,
c(613,1,1,1) ,
c(787,1,1,0) ,
c(736,1,0,1) ,
c(363,0,0,1) ,
c(636,1,0,1) ,
c(215,0,1,0) ,
c(443,1,1,0) ,
c(428,1,1,1) ,
c(421,0,0,1) ,
c(842,1,1,1) ,
c(858,1,1,1) ,
c(936,0,1,1) ,
c(127,0,0,1) ,
c(895,1,0,1)
))
colnames(dt) <- c('Cost','Female','X','rand_assignment')
dt
library(sparklyr)
library(dplyr)
spark_version <- "2.1.0"
sc <- spark_connect(master = "local", version = spark_version)
dt_spark <- copy_to(sc, dt)
rm(dt)
param_grid2 <- list(
formula_stage = list(
ft_r_formula = list('Cost ~ Female + X + rand_assignment',
'Cost ~ X + rand_assignment',
'Cost ~ rand_assignment',
'Cost ~ X')
)
)
rf_pipeline2 <- ml_pipeline(sc) %>%
ft_r_formula(uid = 'formula_stage') %>%
ml_random_forest_regressor( feature_subset_strategy = "onethird",
seed = sample(1:10000, 1),
subsampling_rate = .75)
# Error in stop(simpleError(sprintf(fmt, ...), if (call.) sys.call(sys.parent()))) :
#   bad error message
rf_cv2 <- ml_cross_validator(sc,
estimator = rf_pipeline2,
estimator_param_maps = param_grid2,
evaluator = ml_regression_evaluator(sc, metric_name = "rmse"),
num_folds = 2)
cv_model_formula <- rf_cv2 %>% ml_fit(dt_spark)
A couple of things here.
The parameter in the grid should be named formula, not ft_r_formula:
param_grid2 <- list(
formula_stage = list(
formula = list('Cost ~ Female + X + rand_assignment',
'Cost ~ X + rand_assignment',
'Cost ~ rand_assignment',
'Cost ~ X')
)
)
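If the list of candidate formulas gets longer, the same nested structure can be generated instead of typed by hand. This is only a sketch in base R: `make_formula_grid` and `predictor_sets` are hypothetical names I made up for illustration, not part of sparklyr. The outer name has to match the uid given to the formula stage.

```r
# Hypothetical helper (not part of sparklyr): build the nested list that is
# passed as estimator_param_maps in the example above.
make_formula_grid <- function(response, predictor_sets, stage_uid = "formula_stage") {
  # One formula string per predictor set, e.g. "Cost ~ X + rand_assignment"
  formulas <- lapply(predictor_sets, function(preds) {
    paste(response, "~", paste(preds, collapse = " + "))
  })
  # Outer name = uid of the pipeline stage, inner name = parameter being tuned
  grid <- list(list(formula = formulas))
  names(grid) <- stage_uid
  grid
}

# The four predictor sets used in this thread
predictor_sets <- list(
  c("Female", "X", "rand_assignment"),
  c("X", "rand_assignment"),
  "rand_assignment",
  "X"
)

param_grid2 <- make_formula_grid("Cost", predictor_sets)
str(param_grid2)
```

The result should have the same shape as the hand-written param_grid2 above, so it can be passed to ml_cross_validator() the same way.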
ft_r_formula() requires that formula be specified, which makes sense most of the time, except when you want to tune it in a grid. I'm tracking this here: https://github.com/rstudio/sparklyr/issues/1513. In the meantime, just pass a dummy formula and it won't complain:
rf_pipeline2 <- ml_pipeline(sc) %>%
ft_r_formula("foo ~ bar", uid = 'formula_stage') %>%
ml_random_forest_regressor( feature_subset_strategy = "onethird",
seed = sample(1:10000, 1),
subsampling_rate = .75)
rf_cv2 <- ml_cross_validator(sc,
estimator = rf_pipeline2,
estimator_param_maps = param_grid2,
evaluator = ml_regression_evaluator(sc, metric_name = "rmse"),
num_folds = 2)
cv_model_formula <- rf_cv2 %>% ml_fit(dt_spark)
ml_validation_metrics(cv_model_formula)
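To pick the winning formula from those metrics, something like the following may help. It is a sketch under two assumptions: that ml_validation_metrics() gives back a data frame with one row per grid entry, and that the metric column is named after the metric ("rmse" here). Check names() on your own result first. `best_by_metric` is a hypothetical helper, not a sparklyr function.

```r
# Hypothetical helper (not part of sparklyr): return the row(s) of a metrics
# data frame with the smallest value in the given metric column.
# Smaller is better for rmse; ties are all kept.
best_by_metric <- function(metrics, metric_col = "rmse") {
  stopifnot(is.data.frame(metrics), metric_col %in% names(metrics))
  values <- metrics[[metric_col]]
  metrics[values == min(values), , drop = FALSE]
}

# Intended use with the fitted model from above (needs the Spark session):
# best_by_metric(ml_validation_metrics(cv_model_formula), "rmse")
```

For a metric where larger is better (r2, for example), swap min() for max().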
Thanks for trying out this feature. The documentation on this is currently very sparse, and we'll be adding more.