Sparklyr: using ft_r_formula with ml_pipelines

Created on 22 May 2018 · 1 Comment · Source: sparklyr/sparklyr

Unable to get ml_pipeline() to test multiple formulas:

library(sparklyr)
library(dplyr)

dt <- as.data.frame(rbind(
  c(667, 1, 0, 1),
  c(39, 1, 1, 1),
  c(644, 0, 1, 0),
  c(334, 0, 0, 0),
  c(678, 1, 0, 1),
  c(900, 0, 0, 1),
  c(613, 1, 1, 1),
  c(787, 1, 1, 0),
  c(736, 1, 0, 1),
  c(363, 0, 0, 1),
  c(636, 1, 0, 1),
  c(215, 0, 1, 0),
  c(443, 1, 1, 0),
  c(428, 1, 1, 1),
  c(421, 0, 0, 1),
  c(842, 1, 1, 1),
  c(858, 1, 1, 1),
  c(936, 0, 1, 1),
  c(127, 0, 0, 1),
  c(895, 1, 0, 1)
))

colnames(dt) <- c('Cost','Female','X','rand_assignment')  

dt

spark_version <- "2.1.0"
sc <- spark_connect(master = "local", version = spark_version)

dt_spark <- copy_to(sc, dt)
rm(dt)

param_grid2 <-   list(
  formula_stage = list(
    ft_r_formula  = list('Cost ~ Female + X + rand_assignment',
                      'Cost ~ X + rand_assignment',
                      'Cost ~ rand_assignment',
                      'Cost ~ X')
  )
)    

rf_pipeline2 <- ml_pipeline(sc) %>%
  ft_r_formula(uid = 'formula_stage') %>%
  ml_random_forest_regressor( feature_subset_strategy = "onethird",
                              seed = sample(1:10000, 1),
                              subsampling_rate = .75)

Error in stop(simpleError(sprintf(fmt, ...), if (call.) sys.call(sys.parent()))) :
bad error message

rf_cv2 <- ml_cross_validator(sc, 
                            estimator = rf_pipeline2, 
                            estimator_param_maps = param_grid2,
                            evaluator = ml_regression_evaluator(sc, metric_name = "rmse"),
                            num_folds = 2)

cv_model_formula <- rf_cv2 %>% ml_fit(dt_spark)

ml question


jkylearmstrong

Most helpful comment

Couple things here.

When specifying the param grid, you need to specify the name of the parameter you're tuning, which in this case is formula:

param_grid2 <-   list(
  formula_stage = list(
    formula = list('Cost ~ Female + X + rand_assignment',
                   'Cost ~ X + rand_assignment',
                   'Cost ~ rand_assignment',
                   'Cost ~ X')
  )
)
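
The nesting here is stage uid, then parameter name, then the candidate values. If you have many predictor subsets, the same list can be built programmatically. A minimal sketch in base R, assuming the stage uid is still 'formula_stage'; make_formula_grid is a hypothetical helper written for this example, not part of sparklyr:

```r
# Hypothetical helper (not part of sparklyr): build the nested list that is
# passed to estimator_param_maps, from a response and sets of predictors.
make_formula_grid <- function(response, predictor_sets,
                              stage_uid = "formula_stage") {
  # One formula string per predictor set, e.g. "Cost ~ X + rand_assignment"
  formulas <- lapply(predictor_sets, function(p) {
    paste(response, "~", paste(p, collapse = " + "))
  })
  # Outer name = stage uid, inner name = the parameter being tuned ("formula")
  grid <- list(list(formula = formulas))
  names(grid) <- stage_uid
  grid
}

param_grid2 <- make_formula_grid(
  "Cost",
  list(c("Female", "X", "rand_assignment"),
       c("X", "rand_assignment"),
       "rand_assignment",
       "X")
)
str(param_grid2)
```

This should produce the same list as the hand-written grid, so it can be passed to ml_cross_validator() unchanged.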

Currently, ft_r_formula() requires that formula be specified, which makes sense most of the time, except when you want to tune it in a grid. I'm tracking this here https://github.com/rstudio/sparklyr/issues/1513. In the meantime, just pass a dummy formula and it won't complain; the formulas in the param grid are the ones used when the cross-validator fits each candidate:

rf_pipeline2 <- ml_pipeline(sc) %>%
  ft_r_formula("foo ~ bar", uid = 'formula_stage') %>%
  ml_random_forest_regressor( feature_subset_strategy = "onethird",
                              seed = sample(1:10000, 1),
                              subsampling_rate = .75)

rf_cv2 <- ml_cross_validator(sc, 
                             estimator = rf_pipeline2, 
                             estimator_param_maps = param_grid2,
                             evaluator = ml_regression_evaluator(sc, metric_name = "rmse"),
                             num_folds = 2)
cv_model_formula <- rf_cv2 %>% ml_fit(dt_spark) 
ml_validation_metrics(cv_model_formula)
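
To pick out the best-scoring formula from that output, something like the following may help. It is only a sketch: best_by_metric is a hypothetical helper, and it assumes the metrics data frame has a column named after the evaluator's metric_name ("rmse" here), which you should confirm with names() on your own output first.

```r
# Hypothetical helper (not part of sparklyr): return the row of a metrics
# data frame with the lowest value in `metric_col`. Lower is better for RMSE;
# for a metric such as r2 you would want which.max() instead.
best_by_metric <- function(metrics, metric_col = "rmse") {
  stopifnot(is.data.frame(metrics), metric_col %in% names(metrics))
  metrics[which.min(metrics[[metric_col]]), , drop = FALSE]
}

# With a fitted cross-validator model and a live Spark connection, assuming
# the column name matches:
# best_by_metric(ml_validation_metrics(cv_model_formula))
```

With only 20 rows and 2 folds the scores will be noisy, so treat the ranking as illustrative.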

Thanks for trying out this feature. The documentation on this is currently very sparse, and we'll be adding more.

kevinykuo on 22 May 2018

