For distributed training, it would be useful to specify a set of examples to be evaluated during training so that performance on a validation set can be monitored. Single-node training already offers this through the watchlist.
Metrics collection in Spark is severely limited. At present I would need to modify the XGBoost source code to add my own accumulators; it would be helpful to have a structure in place that lets me define them in the training method.
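For reference, this is roughly what the single-node watchlist looks like with the xgboost4j Scala API (the file paths and parameter values here are illustrative, not from the thread):

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// Load training and validation data (LibSVM paths are illustrative).
val trainMat = new DMatrix("train.libsvm")
val validMat = new DMatrix("valid.libsvm")

val params = Map(
  "objective" -> "binary:logistic",
  "eval_metric" -> "auc"
)

// `watches` asks XGBoost to report the eval metric on each named
// DMatrix after every boosting round -- this is the "watchlist"
// that xgboost4j-spark currently lacks.
val booster = XGBoost.train(trainMat, params, round = 50,
  watches = Map("train" -> trainMat, "valid" -> validMat))
```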
I suspect this is not feasible. If so, let's document that here.
Since the model as of each iteration is synchronized to every distributed worker, it is possible to compute metrics on an evaluation set on one of the workers, as long as the set is small enough that evaluation doesn't take too long.
I'm actually also interested in this, and I'll look into it and see if I can make a PR for this feature.
@mydpy I'll start working on this feature this week. Still contemplating the best strategy to run evaluation. I can either let the user pass in an RDD of the evaluation set, so that each worker performs its own evaluation on a portion of the validation set; or use a broadcast to send the whole evaluation set to the workers, but let only one of them perform the evaluation. Will experiment with some proofs of concept and post updates as I make progress.
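A minimal sketch of the second (broadcast) strategy, assuming the booster is serializable to the workers; all names and the helper itself are hypothetical, not part of any existing API:

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}

// Hypothetical sketch: broadcast a small validation set to the
// cluster, but run the evaluation in a single one-partition task.
def evalOnOneWorker(spark: SparkSession,
                    validRows: Array[(Array[Float], Float)],
                    numFeatures: Int,
                    booster: Booster): Unit = {
  val bcValid = spark.sparkContext.broadcast(validRows)
  spark.sparkContext
    .parallelize(0 until 1, numSlices = 1)  // single task
    .foreach { _ =>
      val rows = bcValid.value
      val data = rows.flatMap(_._1)      // dense row-major features
      val labels = rows.map(_._2)
      val dval = new DMatrix(data, rows.length, numFeatures)
      dval.setLabel(labels)
      // evalSet returns the configured eval_metric as a string
      println(booster.evalSet(Array(dval), Array("valid"), 0))
      dval.delete()
    }
}
```

The per-partition alternative (each worker evaluating its own slice of an evaluation RDD) would instead need the partial metrics to be combined, e.g. via rabit allreduce or accumulators, as discussed below.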
I'm also having trouble with the lack of a watchlist in xgboost-spark. Is there any news?
@xydrolase I am thinking about either using the original rabit to allreduce the metrics, or using your akka-based rabit tracker (with some application-layer messages added) plus accumulators to achieve this...
Is there any news now on adding a "WatchList" to xgboost4j-spark?
Any update on the watchlist usage in xgboost4j-spark?
There's an API to subsample the input DataFrame into train and test sets, see #2711.
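For reference, the standard Spark way to split a DataFrame into train and test subsets looks like this (the input path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-example").getOrCreate()
val df = spark.read.parquet("data.parquet")  // illustrative path

// randomSplit returns weighted, non-overlapping subsets;
// fixing the seed makes the split reproducible.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
```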
@superbobry
Hi, I want to add a "WatchList" to xgboost4j-spark, so I'm using the latest master version.
When I run it, this error occurs:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.scheduler.LiveListenerBus.listeners()Ljava/util/concurrent/CopyOnWriteArrayList;
at org.apache.spark.SparkParallelismTracker.safeExecute(SparkParallelismTracker.scala:75)
at org.apache.spark.SparkParallelismTracker.execute(SparkParallelismTracker.scala:102)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:350)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:139)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:36)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:199)
My Spark version is 2.2.0, and I use spark-core_2.11.
This looks like a classpath issue. Are you sure you don't have any other Spark version in the classpath when running your job?
I have set SPARK_HOME and PATH when running the job.
export SPARK_HOME=/usr/local/spark-2.1
export PATH=$SPARK_HOME/bin:$PATH