Environment: Amazon emr-5.2.1 (with Spark 2.0.2)
Package: jvm, xgboost4j-spark
xgboost version used: 0.7
Commit hash (git rev-parse HEAD): d7406e07f3eec09654b17f7f08c1aa8623d96497
I tried to replicate the basic logistic regression classification example from
https://spark.apache.org/docs/2.0.2/ml-classification-regression.html#logistic-regression
using XGBoostEstimator instead of LogisticRegression.
Fitting the xgboost pipeline to the training set works as expected, but I cannot get the CrossValidator to work. Please see the code below:
import org.apache.spark.ml.Pipeline
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row
import scala.collection.mutable
// xgboost parameters
def get_param(): mutable.HashMap[String, Any] = {
val params = new mutable.HashMap[String, Any]()
params += "eta" -> 0.1
params += "max_depth" -> 4
params += "gamma" -> 0.0
params += "colsample_bylevel" -> 1
params += "objective" -> "binary:logistic"
params += "booster" -> "gbtree"
params += "num_rounds" -> 20
return params
}
// Prepare training data from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0),
(4L, "b spark who", 1.0),
(5L, "g d a y", 0.0),
(6L, "spark fly", 1.0),
(7L, "was mapreduce", 0.0),
(8L, "e spark program", 1.0),
(9L, "a e c l", 0.0),
(10L, "spark compile", 1.0),
(11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and xgb.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val xgb = new XGBoostEstimator(get_param().toMap)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, xgb))
// test pipeline .fit (works as expected)
val xgbModel = pipeline.fit(training)
xgbModel.transform(test).select("text", "probabilities", "prediction").show()
// +---------------+--------------------+----------+
// | text| probabilities|prediction|
// +---------------+--------------------+----------+
// | spark i j k|[0.47003597021102...| 1.0|
// | l m n|[0.52996402978897...| 0.0|
// |mapreduce spark|[0.47003597021102...| 1.0|
// | apache hadoop|[0.52996402978897...| 0.0|
// +---------------+--------------------+----------+
// grid
val paramGrid = new ParamGridBuilder()
.addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
.addGrid(xgb.round, Array(10, 20))
.build()
// cv
val evaluator = new BinaryClassificationEvaluator()
.setRawPredictionCol("probabilities")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(2) // Use 3+ in practice
// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)
// all predictions are 0
cvModel.transform(test).select("text", "probabilities", "prediction").show()
// +---------------+--------------------+----------+
// | text| probabilities|prediction|
// +---------------+--------------------+----------+
// | spark i j k|[0.72372663021087...| 0.0|
// | l m n|[0.72372663021087...| 0.0|
// |mapreduce spark|[0.72372663021087...| 0.0|
// | apache hadoop|[0.72372663021087...| 0.0|
// +---------------+--------------------+----------+
// all AUC equal to 0.5...
cvModel.avgMetrics
// res3: Array[Double] = Array(0.5, 0.5, 0.5, 0.5, 0.5, 0.5)
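For reference, an AUC of exactly 0.5 is what you get when the scores carry no ranking signal at all, e.g. when every row receives the same probability, which matches the constant predictions above. A plain-Scala sketch (pairwiseAuc is a hypothetical helper written for illustration, not Spark's evaluator; Spark's BinaryClassificationEvaluator derives the same quantity from the ROC curve):

```scala
// AUC as the probability that a randomly chosen positive is scored above a
// randomly chosen negative, with ties counting 0.5, over all pos/neg pairs.
def pairwiseAuc(scores: Seq[Double], labels: Seq[Int]): Double = {
  val pos = scores.zip(labels).collect { case (s, 1) => s }
  val neg = scores.zip(labels).collect { case (s, 0) => s }
  val wins = for (p <- pos; n <- neg)
    yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (pos.size * neg.size)
}

pairwiseAuc(Seq(0.72, 0.72, 0.72, 0.72), Seq(1, 0, 1, 0)) // 0.5: constant scores, no signal
pairwiseAuc(Seq(0.9, 0.1, 0.8, 0.2), Seq(1, 0, 1, 0))     // 1.0: perfect separation
```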
Any idea why this is happening?
I don't see any error here; different parameter settings can generate the same predictions, especially when you take into account that you have only 12 instances in the training set.
Hi! Thanks for the swift response @CodingCat
I am fairly certain this is an error, not least because I've seen the same result on my actual data set with a few hundred million rows, but I understand that you are not convinced by the example above, so let me provide another one. To be clear, I'm not concerned that the predictions are the same for both parameter settings; I'm concerned because none of the models performs better than random.
import scala.util.Random
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
import scala.collection.mutable
// classifier parameters
def get_param(): mutable.HashMap[String, Any] = {
val params = new mutable.HashMap[String, Any]()
params += "eta" -> 0.1
params += "max_depth" -> 4
params += "gamma" -> 0.0
params += "colsample_bylevel" -> 1
params += "objective" -> "binary:logistic"
params += "booster" -> "gbtree"
params += "num_rounds" -> 20
return params
}
// Create training and test data with identical feature and label columns.
val r = new Random(0)
val training = spark.createDataFrame(
Seq.fill(10000)(r.nextInt(2)).map(i => (i, i))
).toDF("feature", "label")
val test = spark.createDataFrame(
Seq.fill(10000)(r.nextInt(2)).map(i => (i, i))
).toDF("feature", "label")
training.show(5)
// +-------+-----+
// |feature|label|
// +-------+-----+
// | 1| 1|
// | 1| 1|
// | 0| 0|
// | 1| 1|
// | 1| 1|
// +-------+-----+
// only showing top 5 rows
test.show(5)
// +-------+-----+
// |feature|label|
// +-------+-----+
// | 1| 1|
// | 1| 1|
// | 1| 1|
// | 1| 1|
// | 0| 0|
// +-------+-----+
// only showing top 5 rows
// create pipeline
val assembler = new VectorAssembler()
.setInputCols(Array("feature"))
.setOutputCol("features")
val xgb = new XGBoostEstimator(get_param().toMap)
.setFeaturesCol("features")
val pipeline = new Pipeline()
.setStages(Array(assembler, xgb))
// grid
val paramGrid = new ParamGridBuilder()
.addGrid(xgb.round, Array(10, 20))
.build()
// cv
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("probabilities")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(5)
.setSeed(0)
// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)
cvModel.avgMetrics
// res0: Array[Double] = Array(0.5, 0.5)
cvModel.transform(test).groupBy("label", "prediction").count().show()
// +-----+----------+-----+
// |label|prediction|count|
// +-----+----------+-----+
// | 1| 0.0| 5017|
// | 0| 0.0| 4983|
// +-----+----------+-----+
evaluator.evaluate(cvModel.transform(test))
// res2: Double = 0.5
val xgbModel = pipeline.fit(training)
xgbModel.transform(test).groupBy("label", "prediction").count().show()
// +-----+----------+-----+
// |label|prediction|count|
// +-----+----------+-----+
// | 0| 0.0| 4983|
// | 1| 1.0| 5017|
// +-----+----------+-----+
evaluator.evaluate(xgbModel.transform(test))
// res4: Double = 1.0
Maybe it's worth mentioning that the same CV works as expected if I replace the XGBoostEstimator with LogisticRegression. What do you think?
I will look into this.
@CodingCat I spent some time looking into this. It looks to me like the issue is the .fit overload that takes a parameter grid:
val paramGrid = new ParamGridBuilder()
.addGrid(xgb.round, Array(20))
.build()
val xgb_model_0 = pipeline.fit(training)
val xgb_model_1 = pipeline.fit(training, paramGrid)
xgb_model_0.transform(test).groupBy("prediction", "label").count().show()
// +----------+-----+-----+
// |prediction|label|count|
// +----------+-----+-----+
// | 0.0| 0| 4983|
// | 1.0| 1| 5017|
// +----------+-----+-----+
xgb_model_1(0).transform(test).groupBy("prediction", "label").count().show()
// +----------+-----+-----+
// |prediction|label|count|
// +----------+-----+-----+
// | 0.0| 0| 4983|
// | 0.0| 1| 5017|
// +----------+-----+-----+
My first impression is that the copy() implementation may have a bug; I will investigate it further in the coming days.
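The suspected failure mode can be reproduced outside Spark: CrossValidator effectively calls est.copy(paramMap).fit(data) for each grid point, so a copy() that drops the supplied params makes every candidate train with defaults. A minimal plain-Scala sketch (MiniEstimator is a hypothetical stand-in, not the real XGBoostEstimator):

```scala
// Hypothetical minimal estimator illustrating why a broken copy() makes
// grid search useless: if copy() ignores the extra param map, the grid
// values never reach the learner and every fold trains on defaults.
case class MiniEstimator(params: Map[String, Any] = Map("num_rounds" -> 20)) {
  def copyCorrect(extra: Map[String, Any]): MiniEstimator =
    MiniEstimator(params ++ extra)   // merge grid params over the defaults
  def copyBuggy(extra: Map[String, Any]): MiniEstimator =
    MiniEstimator()                  // ignores `extra`: defaults always win
}

val grid = Map[String, Any]("num_rounds" -> 10)
MiniEstimator().copyCorrect(grid).params("num_rounds") // 10, as requested
MiniEstimator().copyBuggy(grid).params("num_rounds")   // 20: grid silently lost
```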
@CodingCat Have you had any time to look into this issue?
Not yet; if you can contribute a fix, that would be great.
Got a similar issue here.
Thank you @CodingCat !
@abdullahektor Just wondering, is this issue fixed? I downloaded the latest xgboost4j and ran the sample code by @abdullahektor, but still got AUC 0.5 everywhere and all predictions 0. Thanks!
@CodingCat When I used cvModel.write.overwrite.save("/tmp/cvModel") I got the following error. Another problem is that cvModel.bestModel.asInstanceOf[PipelineModel].stages(1).extractParamMap does not show the optimal parameters.
java.lang.UnsupportedOperationException: Pipeline write will fail on this Pipeline because it contains a stage which does not implement Writable. Non-Writable stage: XGBoostEstimator_2a415364841e of type class ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:231)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:228)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.validateStages(Pipeline.scala:228)
at org.apache.spark.ml.Pipeline$PipelineWriter.
at org.apache.spark.ml.Pipeline.write(Pipeline.scala:188)
at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:154)
at org.apache.spark.ml.Pipeline.save(Pipeline.scala:96)
at org.apache.spark.ml.tuning.ValidatorParams$.saveImpl(ValidatorParams.scala:148)
at org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelWriter.saveImpl(CrossValidator.scala:256)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:111)
... 52 elided
@labrook Yes, this issue is fixed. Please ignore the first example and run the second one, since the first could give all-0 predictions and AUC 0.5 purely by chance on such a small data set.
@Frank111 This seems completely unrelated. Also, this issue is closed.
@abdullahektor I ran the second example but still got AUC 0.5 and all predictions 0. I am certain I downloaded the latest package this Wednesday, but is there any way to verify that I have the package containing this fix? Thanks!
@labrook me too, I got all predictions 0, and I have no idea.
@YCG09 I ended up writing my own cross validation codes for xgb models. So I guess probably that's the best way as of now.
@labrook see if you fall into the same issue here...I keep using fixed cross validation model...everything goes fine
@CodingCat Thanks for checking. Last time I tried was on March 16 (see my comment above), which was supposed to be after the bug was fixed. However, the results showed otherwise. I wrote a CV module myself and have been using it since. Also, I notice that you have a talk in the coming Spark Summit. Look forward to meeting you and having more discussion about xgboost.
@labrook Sorry, I forgot the link: https://github.com/dmlc/xgboost/issues/2297. I mean, check whether you fall into the same issue as @YCG09.
Fixed, see https://github.com/dmlc/xgboost/pull/2043 and https://github.com/dmlc/xgboost/pull/2042