Xgboost: [jvm-packages] eval_set for xgboost4j-spark

Created on 9 Apr 2018 · 8 comments · Source: dmlc/xgboost

There is no way to set a custom evaluation set for ml.dmlc.xgboost4j.scala.spark.XGBoost#trainDistributed. The code internally uses the private ml.dmlc.xgboost4j.scala.spark.Watches class, which simply splits the training data with the predefined trainTestRatio and does not accept a custom eval set through params.
Is there a particular reason for this limitation, or is it just a stub that could be extended, for example with a DMatrix passed through params? Are there complications caused by the fact that this is distributed XGBoost? How should such a dataset be stored in params then: as a DMatrix, an RDD, or something else?
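
For comparison, the single-machine xgboost4j Scala API already accepts named evaluation sets through the watches argument of XGBoost.train. A minimal sketch (the file names and parameter values below are made up for illustration only):

        import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

        // Hypothetical LibSVM files, used only for this example.
        val trainMat = new DMatrix("train.libsvm")
        val devMat = new DMatrix("dev.libsvm")

        val params = Map(
            "eta" -> 0.1,
            "max_depth" -> 5,
            "objective" -> "binary:logistic")

        // The fourth argument maps a name to each evaluation DMatrix;
        // the eval metric for every entry is reported after each boosting round.
        val booster = XGBoost.train(trainMat, params, 100,
            Map("train" -> trainMat, "dev" -> devMat))

The question is essentially how to expose something equivalent through the Spark layer, where the data lives in DataFrames or RDDs rather than in local DMatrix objects.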

feature-request

Most helpful comment

Hi @CodingCat,

I need to define a separate validation set for cross-validation, using xgboost4j on Spark. I tried the approach described here, but it does not look like setting "eval_sets" -> Map("dev" -> dev_df) makes any difference. Should I expect the following setup to work the way cross-validation does (using TrainValidationSplit)?

        val params = scala.collection.mutable.Map(
            "eta" -> 0.1,
            "objective" -> "binary:logistic",
            "eval_sets" -> Map("dev" -> dev_df))
        val booster = new XGBoostClassifier(params.toMap)
        booster.setFeaturesCol("features")
        booster.setLabelCol("label")
        booster.setMaxDepth(5)
        booster.setNumRound(150)
        booster.setNumWorkers(4)
        val xgb_model = booster.fit(train_df)

All 8 comments

I think there was a comment about this when the code was brought in: https://github.com/dmlc/xgboost/pull/2710#discussion_r141479583

Would you like to give this requirement a shot?

All feature requests are now consolidated to #3439. This issue should be re-opened if someone decides to actively work on implementing this feature.

I will work on the eval set this week.

@CodingCat There is work in progress to implement a watchlist in the XGBoost4J Scala wrapper: #3544. Can we take advantage of this to implement a watchlist in XGBoost4J-Spark?

Spark's problem is that you have to find some way to pass in and join (or zip) multiple DataFrames, hand a slice of each of them to each Spark task, create a DMatrix from each slice, and use each of those DMatrix objects in each Spark task as a watch dataset.

That part is kind of complicated and requires refactoring the current Watches code; I think we can do it in the next version.
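
To make the comment above more concrete, here is a rough sketch of the per-task idea: line up partitions of the training and evaluation data, build one DMatrix per dataset inside each task, and pass the extra DMatrix as a watch. This is only a conceptual illustration under the assumption that both RDDs contain xgboost4j LabeledPoints and already have the same number of partitions; it ignores the Rabit tracker setup that actually synchronizes the workers, as well as missing-value handling and failure recovery, which the real implementation has to deal with.

        import ml.dmlc.xgboost4j.LabeledPoint
        import ml.dmlc.xgboost4j.scala.{Booster, DMatrix, XGBoost}
        import org.apache.spark.rdd.RDD

        // Conceptual sketch only: trainRdd and devRdd are assumed to be
        // RDD[LabeledPoint] repartitioned to the same number of partitions
        // (one per worker).
        def trainWithWatches(
            trainRdd: RDD[LabeledPoint],
            devRdd: RDD[LabeledPoint],
            params: Map[String, Any],
            numRounds: Int): Array[Booster] = {
          trainRdd.zipPartitions(devRdd) { (trainIter, devIter) =>
            // Each task materializes its slice of every dataset as a DMatrix...
            val trainMat = new DMatrix(trainIter)
            val devMat = new DMatrix(devIter)
            // ...and hands the extra DMatrix to the booster as a watch dataset.
            val booster = XGBoost.train(trainMat, params, numRounds,
                Map("train" -> trainMat, "dev" -> devMat))
            Iterator(booster)
          }.collect()
        }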

Consolidating to the feature request tracker #3439. Feel free to re-open this issue when someone starts working on it.

The feature is implemented in https://github.com/dmlc/xgboost/pull/3910
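
For anyone landing here later, usage after that PR looks roughly like the following. This is a sketch under the assumption that the evaluation DataFrames are attached with a setEvalSets(Map[String, DataFrame]) setter rather than through the params map (check the API of the release you are on); train_df and dev_df are the DataFrames from the comment quoted at the top of this page.

        import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

        val classifier = new XGBoostClassifier(Map(
              "eta" -> 0.1,
              "objective" -> "binary:logistic"))
          .setFeaturesCol("features")
          .setLabelCol("label")
          .setMaxDepth(5)
          .setNumRound(150)
          .setNumWorkers(4)
          // Named evaluation DataFrames go through a dedicated setter,
          // not through the params map.
          .setEvalSets(Map("dev" -> dev_df))

        val xgb_model = classifier.fit(train_df)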
