For distributed training, it would be useful to specify a set of examples to be evaluated during training so that performance on a validation set can be monitored. Single-node training already offers this through the watchlist.
Metrics collection in Spark is severely limited. At present I would need to modify the XGBoost source code to add my own accumulators; it would be helpful to have a structure in place that lets me define them in the training method.
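For reference, this is roughly what the single-node watchlist looks like with the xgboost4j Scala API (the file paths and parameter values here are illustrative, not from the thread):

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// Load training and validation data (LibSVM paths are illustrative).
val trainMat = new DMatrix("train.libsvm")
val validMat = new DMatrix("valid.libsvm")

val params = Map(
  "objective" -> "binary:logistic",
  "eval_metric" -> "auc"
)

// `watches` asks XGBoost to report the eval metric on each named
// DMatrix after every boosting round -- this is the "watchlist"
// that xgboost4j-spark currently lacks.
val booster = XGBoost.train(trainMat, params, round = 50,
  watches = Map("train" -> trainMat, "valid" -> validMat))
```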
I suspect this is not feasible. If so, let's document that here.
Since the model as of each iteration is synchronized to every distributed worker, it is possible to compute metrics on an evaluation set on one of the workers, as long as the set is small enough that evaluation doesn't take too long.
I'm actually also interested in this, and I'll look into it and see if I can make a PR for this feature.
@mydpy I'll start working on this feature this week. Still contemplating the best strategy to run evaluation. I can either let the user pass in an RDD of the evaluation set, so that each worker performs its own evaluation on a portion of the validation set; or use a broadcast to send the whole evaluation set to the workers, but let only one of them perform the evaluation. Will experiment with some proofs of concept and post updates as I make progress.
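A minimal sketch of the second (broadcast) strategy, assuming the booster is serializable to the workers; all names and the helper itself are hypothetical, not part of any existing API:

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}

// Hypothetical sketch: broadcast a small validation set to the
// cluster, but run the evaluation in a single one-partition task.
def evalOnOneWorker(spark: SparkSession,
                    validRows: Array[(Array[Float], Float)],
                    numFeatures: Int,
                    booster: Booster): Unit = {
  val bcValid = spark.sparkContext.broadcast(validRows)
  spark.sparkContext
    .parallelize(0 until 1, numSlices = 1)  // single task
    .foreach { _ =>
      val rows = bcValid.value
      val data = rows.flatMap(_._1)      // dense row-major features
      val labels = rows.map(_._2)
      val dval = new DMatrix(data, rows.length, numFeatures)
      dval.setLabel(labels)
      // evalSet returns the configured eval_metric as a string
      println(booster.evalSet(Array(dval), Array("valid"), 0))
      dval.delete()
    }
}
```

The per-partition alternative (each worker evaluating its own slice of an evaluation RDD) would instead need the partial metrics to be combined, e.g. via rabit allreduce or accumulators, as discussed below.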
I'm also having trouble with the lack of a watchlist in xgboost-spark. Is there any news?
@xydrolase I am thinking about either using the original rabit to allreduce the metrics, or using your akka-based rabit tracker (with some application-layer messages added) plus accumulators to achieve this...
Is there any news now on adding a "WatchList" to xgboost4j-spark?
Any update on the watchlist usage in xgboost4j-spark?
There's an API to subsample the input DataFrame into train and test sets, see #2711.
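For reference, the standard Spark way to split a DataFrame into train and test subsets looks like this (the input path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-example").getOrCreate()
val df = spark.read.parquet("data.parquet")  // illustrative path

// randomSplit returns weighted, non-overlapping subsets;
// fixing the seed makes the split reproducible.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
```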
@superbobry
Hi, I want to add a "WatchList" to xgboost4j-spark, so I'm using the latest master version.
When I run it, this error occurs:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.scheduler.LiveListenerBus.listeners()Ljava/util/concurrent/CopyOnWriteArrayList;
at org.apache.spark.SparkParallelismTracker.safeExecute(SparkParallelismTracker.scala:75)
at org.apache.spark.SparkParallelismTracker.execute(SparkParallelismTracker.scala:102)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:350)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:139)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:36)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:199)
My Spark version is 2.2.0, and I use spark-core_2.11.
This looks like a classpath issue. Are you sure you don't have any other Spark version in the classpath when running your job?
I have set SPARK_HOME and PATH when running the job.
export SPARK_HOME=/usr/local/spark-2.1
export PATH=$SPARK_HOME/bin:$PATH