Environment: Amazon emr-5.2.1 (with Spark 2.0.2)
Package: jvm, xgboost4j-spark
xgboost version used: 0.7
Commit hash (git rev-parse HEAD): d7406e07f3eec09654b17f7f08c1aa8623d96497
I tried to replicate the basic logistic regression classification example from
https://spark.apache.org/docs/2.0.2/ml-classification-regression.html#logistic-regression
using XGBoostEstimator instead of LogisticRegression.
Fitting the xgboost pipeline to the training set works as expected, but I cannot get the CrossValidator to work. Please see the code below:
import org.apache.spark.ml.Pipeline
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row
import scala.collection.mutable
// xgboost parameters
def get_param(): mutable.HashMap[String, Any] = {
val params = new mutable.HashMap[String, Any]()
params += "eta" -> 0.1
params += "max_depth" -> 4
params += "gamma" -> 0.0
params += "colsample_bylevel" -> 1
params += "objective" -> "binary:logistic"
params += "booster" -> "gbtree"
params += "num_rounds" -> 20
return params
}
// Prepare training data from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0),
(4L, "b spark who", 1.0),
(5L, "g d a y", 0.0),
(6L, "spark fly", 1.0),
(7L, "was mapreduce", 0.0),
(8L, "e spark program", 1.0),
(9L, "a e c l", 0.0),
(10L, "spark compile", 1.0),
(11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and xgb.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val xgb = new XGBoostEstimator(get_param().toMap)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, xgb))
// test pipeline .fit (works as expected)
val xgbModel = pipeline.fit(training)
xgbModel.transform(test).select("text", "probabilities", "prediction").show()
// +---------------+--------------------+----------+
// | text| probabilities|prediction|
// +---------------+--------------------+----------+
// | spark i j k|[0.47003597021102...| 1.0|
// | l m n|[0.52996402978897...| 0.0|
// |mapreduce spark|[0.47003597021102...| 1.0|
// | apache hadoop|[0.52996402978897...| 0.0|
// +---------------+--------------------+----------+
// grid
val paramGrid = new ParamGridBuilder()
.addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
.addGrid(xgb.round, Array(10, 20))
.build()
// cv
val evaluator = new BinaryClassificationEvaluator()
.setRawPredictionCol("probabilities")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(2) // Use 3+ in practice
// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)
// all predictions are 0
cvModel.transform(test).select("text", "probabilities", "prediction").show()
// +---------------+--------------------+----------+
// | text| probabilities|prediction|
// +---------------+--------------------+----------+
// | spark i j k|[0.72372663021087...| 0.0|
// | l m n|[0.72372663021087...| 0.0|
// |mapreduce spark|[0.72372663021087...| 0.0|
// | apache hadoop|[0.72372663021087...| 0.0|
// +---------------+--------------------+----------+
// all AUC equal to 0.5...
cvModel.avgMetrics
// res3: Array[Double] = Array(0.5, 0.5, 0.5, 0.5, 0.5, 0.5)
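For reference, an AUC of exactly 0.5 is what you get when the scores carry no ranking signal at all, e.g. when every row receives the same probability, which matches the constant predictions above. A plain-Scala sketch (pairwiseAuc is a hypothetical helper written for illustration, not Spark's evaluator; Spark's BinaryClassificationEvaluator derives the same quantity from the ROC curve):

```scala
// AUC as the probability that a randomly chosen positive is scored above a
// randomly chosen negative, with ties counting 0.5, over all pos/neg pairs.
def pairwiseAuc(scores: Seq[Double], labels: Seq[Int]): Double = {
  val pos = scores.zip(labels).collect { case (s, 1) => s }
  val neg = scores.zip(labels).collect { case (s, 0) => s }
  val wins = for (p <- pos; n <- neg)
    yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (pos.size * neg.size)
}

pairwiseAuc(Seq(0.72, 0.72, 0.72, 0.72), Seq(1, 0, 1, 0)) // 0.5: constant scores, no signal
pairwiseAuc(Seq(0.9, 0.1, 0.8, 0.2), Seq(1, 0, 1, 0))     // 1.0: perfect separation
```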
Any idea why this is happening?
I don't see any error here; different parameter settings can generate the same predictions, especially when you take into account that you have only 12 instances in the training set.
Hi! Thanks for the swift response @CodingCat
I am fairly certain this is an error, not least because I've seen the same result on my actual data set with a few hundred million rows, but I understand that you are not convinced by the example above, so let me provide another one. To be clear, I'm not concerned that the predictions are the same for both parameter settings; I'm concerned because none of the models performs better than random.
import scala.util.Random
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
import scala.collection.mutable
// classifier parameters
def get_param(): mutable.HashMap[String, Any] = {
val params = new mutable.HashMap[String, Any]()
params += "eta" -> 0.1
params += "max_depth" -> 4
params += "gamma" -> 0.0
params += "colsample_bylevel" -> 1
params += "objective" -> "binary:logistic"
params += "booster" -> "gbtree"
params += "num_rounds" -> 20
return params
}
// Create training and test data with identical feature and label columns.
val r = new Random(0)
val training = spark.createDataFrame(
Seq.fill(10000)(r.nextInt(2)).map(i => (i, i))
).toDF("feature", "label")
val test = spark.createDataFrame(
Seq.fill(10000)(r.nextInt(2)).map(i => (i, i))
).toDF("feature", "label")
training.show(5)
// +-------+-----+
// |feature|label|
// +-------+-----+
// | 1| 1|
// | 1| 1|
// | 0| 0|
// | 1| 1|
// | 1| 1|
// +-------+-----+
// only showing top 5 rows
test.show(5)
// +-------+-----+
// |feature|label|
// +-------+-----+
// | 1| 1|
// | 1| 1|
// | 1| 1|
// | 1| 1|
// | 0| 0|
// +-------+-----+
// only showing top 5 rows
// create pipeline
val assembler = new VectorAssembler()
.setInputCols(Array("feature"))
.setOutputCol("features")
val xgb = new XGBoostEstimator(get_param().toMap)
.setFeaturesCol("features")
val pipeline = new Pipeline()
.setStages(Array(assembler, xgb))
// grid
val paramGrid = new ParamGridBuilder()
.addGrid(xgb.round, Array(10, 20))
.build()
// cv
val evaluator = new BinaryClassificationEvaluator().setRawPredictionCol("probabilities")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(5)
.setSeed(0)
// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)
cvModel.avgMetrics
// res0: Array[Double] = Array(0.5, 0.5)
cvModel.transform(test).groupBy("label", "prediction").count().show()
// +-----+----------+-----+
// |label|prediction|count|
// +-----+----------+-----+
// | 1| 0.0| 5017|
// | 0| 0.0| 4983|
// +-----+----------+-----+
evaluator.evaluate(cvModel.transform(test))
// res2: Double = 0.5
val xgbModel = pipeline.fit(training)
xgbModel.transform(test).groupBy("label", "prediction").count().show()
// +-----+----------+-----+
// |label|prediction|count|
// +-----+----------+-----+
// | 0| 0.0| 4983|
// | 1| 1.0| 5017|
// +-----+----------+-----+
evaluator.evaluate(xgbModel.transform(test))
// res4: Double = 1.0
Maybe it's worth mentioning that the same CV works as expected if I replace the XGBoostEstimator with LogisticRegression. What do you think?
I will look into this.
@CodingCat I spent some time looking into this. It looks to me like the issue is the .fit overload that takes a parameter grid:
val paramGrid = new ParamGridBuilder()
.addGrid(xgb.round, Array(20))
.build()
val xgb_model_0 = pipeline.fit(training)
val xgb_model_1 = pipeline.fit(training, paramGrid)
xgb_model_0.transform(test).groupBy("prediction", "label").count().show()
// +----------+-----+-----+
// |prediction|label|count|
// +----------+-----+-----+
// | 0.0| 0| 4983|
// | 1.0| 1| 5017|
// +----------+-----+-----+
xgb_model_1(0).transform(test).groupBy("prediction", "label").count().show()
// +----------+-----+-----+
// |prediction|label|count|
// +----------+-----+-----+
// | 0.0| 0| 4983|
// | 0.0| 1| 5017|
// +----------+-----+-----+
My first impression is that the copy() implementation may have a bug; I will investigate it further in the coming days.
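The suspected failure mode can be reproduced outside Spark: CrossValidator effectively calls est.copy(paramMap).fit(data) for each grid point, so a copy() that drops the supplied params makes every candidate train with defaults. A minimal plain-Scala sketch (MiniEstimator is a hypothetical stand-in, not the real XGBoostEstimator):

```scala
// Hypothetical minimal estimator illustrating why a broken copy() makes
// grid search useless: if copy() ignores the extra param map, the grid
// values never reach the learner and every fold trains on defaults.
case class MiniEstimator(params: Map[String, Any] = Map("num_rounds" -> 20)) {
  def copyCorrect(extra: Map[String, Any]): MiniEstimator =
    MiniEstimator(params ++ extra)   // merge grid params over the defaults
  def copyBuggy(extra: Map[String, Any]): MiniEstimator =
    MiniEstimator()                  // ignores `extra`: defaults always win
}

val grid = Map[String, Any]("num_rounds" -> 10)
MiniEstimator().copyCorrect(grid).params("num_rounds") // 10, as requested
MiniEstimator().copyBuggy(grid).params("num_rounds")   // 20: grid silently lost
```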
@CodingCat Have you had any time to look into this issue?
Not yet; if you can contribute a fix, that would be great.
Got a similar issue here.
Thank you @CodingCat !
@abdullahektor Just wondering, is this issue fixed? I downloaded the latest xgboost4j and ran the sample code by @abdullahektor, but still got AUC 0.5 everywhere and all predictions 0. Thanks!
@CodingCat When I used cvModel.write.overwrite.save("/tmp/cvModel") I got the following error. Another problem is that cvModel.bestModel.asInstanceOf[PipelineModel].stages(1).extractParamMap does not show the optimal parameters.
java.lang.UnsupportedOperationException: Pipeline write will fail on this Pipeline because it contains a stage which does not implement Writable. Non-Writable stage: XGBoostEstimator_2a415364841e of type class ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:231)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:228)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.validateStages(Pipeline.scala:228)
at org.apache.spark.ml.Pipeline$PipelineWriter.
at org.apache.spark.ml.Pipeline.write(Pipeline.scala:188)
at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:154)
at org.apache.spark.ml.Pipeline.save(Pipeline.scala:96)
at org.apache.spark.ml.tuning.ValidatorParams$.saveImpl(ValidatorParams.scala:148)
at org.apache.spark.ml.tuning.CrossValidatorModel$CrossValidatorModelWriter.saveImpl(CrossValidator.scala:256)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:111)
... 52 elided
@labrook Yes, this issue is fixed. Please ignore the first example and run the second one, since the first could give all-0 predictions and AUC 0.5 purely by chance on such a small data set.
@Frank111 This seems completely unrelated. Also, this issue is closed.
@abdullahektor I ran the second example but still got AUC 0.5 and all predictions 0. I am certain I downloaded the latest package this Wednesday, but is there any way to verify that I have the package containing this fix? Thanks!
@labrook me too, I got all predictions 0, and I have no idea.
@YCG09 I ended up writing my own cross validation codes for xgb models. So I guess probably that's the best way as of now.
@labrook see if you fall into the same issue here...I keep using fixed cross validation model...everything goes fine
@CodingCat Thanks for checking. Last time I tried was on March 16 (see my comment above), which was supposed to be after the bug was fixed. However, the results showed otherwise. I wrote a CV module myself and have been using it since. Also, I notice that you have a talk in the coming Spark Summit. Look forward to meeting you and having more discussion about xgboost.
@labrook Sorry, I forgot the link: https://github.com/dmlc/xgboost/issues/2297. I mean, check whether you fall into the same issue as @YCG09.
Fixed, see https://github.com/dmlc/xgboost/pull/2043 and https://github.com/dmlc/xgboost/pull/2042