XGBoost: Spark XGBoost prediction accuracy and AUC are much lower than in the training log

Created on 24 Jul 2017 · 4 Comments · Source: dmlc/xgboost

I am training a Spark XGBoost model. The train-error in the log is about 0.28. I saved the model, then loaded it to evaluate on the test set, and got very poor AUC and accuracy (AUC = 0.65, acc = 0.55). I expected an accuracy of about 0.72 and an AUC well above 0.72.
I also tried the training set and got the same result as on the test set. So I am confused: why is the accuracy different from the log?

1. My trainModel code

```scala
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.linalg.{DenseVector => MLDenseVector}
import org.apache.spark.ml.feature.{LabeledPoint => MLLabeledPoint}
import org.apache.spark.sql.SparkSession
import DataUtils._

object trainModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("xiajizhong").getOrCreate()
    val sc = spark.sparkContext

    //val inputTrainPath = "/user/gulfstream/zhenpeng/xgboost/demo/data/agaricus.txt.train"
    //val inputTrainPath = "/user/bigdata_driver_ecosys_test/xiajizhong/base/201706_space"
    //val inputTestPath = "/user/gulfstream/zhenpeng/xgboost/demo/data/agaricus.txt.test"
    val inputTrainPath = args(0)

    // loadLibSVMFile expects 1-based feature indices in the file and converts them to 0-based.
    // Each row is then turned into a dense ml LabeledPoint.
    val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).map(lp =>
      MLLabeledPoint(lp.label, new MLDenseVector(lp.features.toArray)))
    //val testSet = MLUtils.loadLibSVMFile(sc, inputTestPath).collect().map(
    //     lp => new MLDenseVector(lp.features.toArray)).iterator
    val paramMap = List(
      "eta" -> 0.5f,
      "max_depth" -> 6,
      "objective" -> "binary:logistic",
      "booster" -> "gbtree",
      "tree_method" -> "exact").toMap

    // Train for 100 rounds on 10 workers
    val xgboostModel = XGBoost.train(trainRDD, paramMap, round = 100, nWorkers = 10, useExternalMemory = true)
    //xgboostModel.booster.predict(new DMatrix(testSet))
    //val outputModelPath = "/user/bigdata_driver_ecosys_test/xiajizhong/base/xgb_06.model"
    val outputModelPath = args(1)
    xgboostModel.saveModelAsHadoopFile(outputModelPath)(sc)
  }
}
```

The log is:

```
2017-07-24 12:50:11,391-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:50:11,390 INFO [93]    train-error:0.287650
2017-07-24 12:50:40,961-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:50:40,960 INFO [94]    train-error:0.287634
2017-07-24 12:51:10,258-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:51:10,258 INFO [95]    train-error:0.287631
2017-07-24 12:51:39,403-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:51:39,403 INFO [96]    train-error:0.287623
2017-07-24 12:52:09,241-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:52:09,241 INFO [97]    train-error:0.287612
2017-07-24 12:52:38,593-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:52:38,592 INFO [98]    train-error:0.287607
2017-07-24 12:53:07,767-[TS] INFO Thread-50 ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - 2017-07-24 12:53:07,767 INFO [99]    train-error:0.287586
```
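
For reference, the train-error in the log is a classification error rate, so the matching training accuracy is 1 minus that value. A minimal sketch of that arithmetic on the last log line; the regular expression and the helper name `parse_train_error` are illustrative assumptions, not part of xgboost:

```python
import re

# Matches the "[round]    train-error:value" tail of a tracker log line.
# "[TS]" earlier in the line is skipped because it contains no digits.
LOG_PATTERN = re.compile(r"\[(\d+)\]\s+train-error:([0-9.]+)")


def parse_train_error(line):
    """Return (round, train_error) from one log line, or None if absent."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


last_line = (
    "2017-07-24 12:53:07,767-[TS] INFO Thread-50 "
    "ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - "
    "2017-07-24 12:53:07,767 INFO [99]    train-error:0.287586"
)

round_index, train_error = parse_train_error(last_line)
train_accuracy = 1.0 - train_error  # accuracy implied by the error rate
print(round_index, train_accuracy)
```

This is where the expected accuracy of roughly 0.71 to 0.72 on the training set comes from.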

2. My test model code

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

object validation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("xiajizhong").getOrCreate()
    val sc = spark.sparkContext

    val inputTestPath = args(0)
    val modelPath = args(1)
    val outputPredictPath = args(2)

    // Read the LibSVM test data as a DataFrame with "label" and "features" columns
    //val test = MLUtils.loadLibSVMFile(sc, inputTestPath).toDF("label", "features")
    val test = spark.read.format("libsvm").load(inputTestPath).toDF("label", "features")

    // Load the saved model and score the test set
    val xgbModel = XGBoost.loadModelFromHadoopFile(modelPath)(sc)
    val predict = xgbModel.transform(test)

    // Write the predictions out as text for later evaluation
    predict.rdd.saveAsTextFile(outputPredictPath)
  }
}
```

3. Then I load the saved predictions with pyspark to calculate the AUC (using from pyspark.mllib.evaluation import BinaryClassificationMetrics). The AUC (areaUnderROC) is only 0.65, and when I tried again on the training set I got the same result!

GeorgeXia1828

👍1
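
One way to cross-check the numbers in the question without Spark is to recompute accuracy and ROC AUC directly from (score, label) pairs. The sketch below is plain Python; the function names and the toy scores are illustrative assumptions, not data from this issue. It thresholds at 0.5, which is how xgboost's error metric separates positives from negatives for binary:logistic, and computes AUC as the fraction of positive/negative pairs that are ranked correctly, with ties counted as half.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of rows where (score > threshold) agrees with the 0/1 label."""
    correct = sum(
        1 for score, label in zip(scores, labels)
        if (score > threshold) == (label == 1)
    )
    return correct / len(labels)


def roc_auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC; quadratic time, fine for a small sample."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy values for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2]
toy_labels = [1, 1, 0, 0, 0, 1]

print(accuracy(toy_scores, toy_labels))
print(roc_auc(toy_scores, toy_labels))
```

If the model and the feature columns line up, running the same functions on a sample of the saved training-set predictions should give an accuracy close to 1 minus the logged train-error.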

All 4 comments

I ran into this problem too. Have you solved it?

anddelu on 23 Aug 2017

Have you solved this problem?

hzliang on 19 Jan 2018

@hzliang It is just an index error: you should train with feature indices starting from 1, and test with the Python model using indices starting from 0. You can find this described on the GitHub sparkxgboost page.

GeorgeXia1828 on 29 Jan 2018

👍2

The train error in the training log is indeed correct.

Save the model with xgbModel.booster.saveModel("/local/path"), then you can use it with the xgboost Python API.

When predicting with the Python module, the features should be transformed from ...
1:xx_1,2:xx_2,3:xx_3 to 0:xx_1,1:xx_2,2:xx_3,3:0
Maybe this will solve the problem of moving a Spark xgboost model to Python.

GeorgeXia1828 on 29 Jan 2018

👍1
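
The index shift described in the last comment could be scripted roughly as follows. This is a sketch under assumptions: input lines are in standard LibSVM form (`label index:value ...`, space separated) with 1-based indices, and `shift_libsvm_line` and `pad_index` are hypothetical names introduced here. The optional padding entry mirrors the trailing `3:0` in the comment's example; whether it is needed depends on how the Python side infers the number of columns.

```python
def shift_libsvm_line(line, pad_index=None):
    """Shift 1-based feature indices in one LibSVM line to 0-based.

    If pad_index is given, append an explicit "pad_index:0" entry,
    as in the example from the comment above.
    """
    parts = line.split()
    label, features = parts[0], parts[1:]
    shifted = []
    for item in features:
        index, value = item.split(":", 1)
        new_index = int(index) - 1
        if new_index < 0:
            raise ValueError("input indices must be 1-based: " + item)
        shifted.append("{}:{}".format(new_index, value))
    if pad_index is not None:
        shifted.append("{}:0".format(pad_index))
    return " ".join([label] + shifted)


# Toy row for illustration only.
example = shift_libsvm_line("1 1:0.5 2:1.2 3:7", pad_index=3)
print(example)
```

Applying this to every line of the test file before loading it on the Python side keeps the feature columns aligned with what the Spark-trained model saw, since Spark's LibSVM loaders convert the file's 1-based indices to 0-based internally.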
