Sparklyr: using ft_r_formula with ml_pipelines

Created on 22 May 2018 · 1 Comment · Source: sparklyr/sparklyr

Unable to get ml_pipeline to test multiple formulas:

library(sparklyr)
library(dplyr)  # provides the copy_to() generic used below

dt <- as.data.frame(rbind(c(667,1,0,1)  ,
c(39,1,1,1) ,
c(644,0,1,0)    ,
c(334,0,0,0)    ,
c(678,1,0,1)    ,
c(900,0,0,1)    ,
c(613,1,1,1)    ,
c(787,1,1,0)    ,
c(736,1,0,1)    ,
c(363,0,0,1)    ,
c(636,1,0,1)    ,
c(215,0,1,0)    ,
c(443,1,1,0)    ,
c(428,1,1,1)    ,
c(421,0,0,1)    ,
c(842,1,1,1)    ,
c(858,1,1,1)    ,
c(936,0,1,1)    ,
c(127,0,0,1)    ,
c(895,1,0,1)    
))  

colnames(dt) <- c('Cost','Female','X','rand_assignment')  

dt

spark_version <- "2.1.0"
sc <- spark_connect(master = "local", version = spark_version)

dt_spark <- copy_to(sc, dt)
rm(dt)

param_grid2 <-   list(
  formula_stage = list(
    ft_r_formula  = list('Cost ~ Female + X + rand_assignment',
                      'Cost ~ X + rand_assignment',
                      'Cost ~ rand_assignment',
                      'Cost ~ X')
  )
)    

rf_pipeline2 <- ml_pipeline(sc) %>%
  ft_r_formula(uid = 'formula_stage') %>%
  ml_random_forest_regressor( feature_subset_strategy = "onethird",
                              seed = sample(1:10000, 1),
                              subsampling_rate = .75)

Error in stop(simpleError(sprintf(fmt, ...), if (call.) sys.call(sys.parent()))) :
bad error message

rf_cv2 <- ml_cross_validator(sc, 
                            estimator = rf_pipeline2, 
                            estimator_param_maps = param_grid2,
                            evaluator = ml_regression_evaluator(sc, metric_name = "rmse"),
                            num_folds = 2)

cv_model_formula <- rf_cv2 %>% ml_fit(dt_spark) 

Most helpful comment

Couple things here.

  1. When specifying the param grid, you need to use the name of the parameter you're tuning, which in this case is formula, rather than the name of the function (ft_r_formula):
param_grid2 <-   list(
  formula_stage = list(
    formula  = list('Cost ~ Female + X + rand_assignment',
                         'Cost ~ X + rand_assignment',
                         'Cost ~ rand_assignment',
                         'Cost ~ X')
  )
)    
  2. Currently, ft_r_formula() requires that formula be specified, which makes sense most of the time, except when you want to tune it in a grid. I'm tracking this here: https://github.com/rstudio/sparklyr/issues/1513. In the meantime, just pass a dummy formula and it won't complain; the grid values replace it during cross-validation:
rf_pipeline2 <- ml_pipeline(sc) %>%
  ft_r_formula("foo ~ bar", uid = 'formula_stage') %>%
  ml_random_forest_regressor( feature_subset_strategy = "onethird",
                              seed = sample(1:10000, 1),
                              subsampling_rate = .75)

rf_cv2 <- ml_cross_validator(sc, 
                             estimator = rf_pipeline2, 
                             estimator_param_maps = param_grid2,
                             evaluator = ml_regression_evaluator(sc, metric_name = "rmse"),
                             num_folds = 2)
cv_model_formula <- rf_cv2 %>% ml_fit(dt_spark) 
ml_validation_metrics(cv_model_formula)
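
To make the grid layout from point 1 easier to check before fitting, here is a minimal base R sketch that summarises a param grid as stage uid, parameter name, and number of candidate values. describe_param_grid is a hypothetical helper written for illustration, not part of sparklyr, and it only inspects the nested list; it does not validate anything against Spark.

```r
# Hypothetical helper (not part of sparklyr): summarise a param grid.
# Assumes the layout used above: the top-level names are stage uids and the
# second-level names are the parameters being tuned.
describe_param_grid <- function(grid) {
  rows <- lapply(names(grid), function(stage_uid) {
    params <- grid[[stage_uid]]
    data.frame(
      stage_uid    = stage_uid,
      param        = names(params),
      n_candidates = unname(vapply(params, length, integer(1))),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

param_grid2 <- list(
  formula_stage = list(
    formula = list('Cost ~ Female + X + rand_assignment',
                   'Cost ~ X + rand_assignment',
                   'Cost ~ rand_assignment',
                   'Cost ~ X')
  )
)

grid_summary <- describe_param_grid(param_grid2)
print(grid_summary)
```

If the param column shows ft_r_formula instead of formula, the grid is keyed by the function name rather than the parameter name, which is the problem in the original post.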

Thanks for trying out this feature. The documentation on this is currently very sparse, and we'll be adding more.
