Xgboost: [jvm-package] discussion on evaluation of xgboost4j-spark

Created on 8 Dec 2016  Â·  17Comments  Â·  Source: dmlc/xgboost

Hi, all. Recently I am reading the source code of xgboost4j-spark. I found that the output of predict methods is kind of un-usable.

  def predict(testSet: RDD[MLDenseVector], missingValue: Float): RDD[Array[Array[Float]]] = {
    val broadcastBooster = testSet.sparkContext.broadcast(_booster)
    testSet.mapPartitions { testSamples =>
      val sampleArray = testSamples.toList
      val numRows = sampleArray.size
      val numColumns = sampleArray.head.size
      if (numRows == 0) {
        Iterator()
      } else {
        val rabitEnv = Map("DMLC_TASK_ID" -> TaskContext.getPartitionId().toString)
        Rabit.init(rabitEnv.asJava)
        // translate to required format
        val flatSampleArray = new Array[Float](numRows * numColumns)
        for (i <- flatSampleArray.indices) {
          flatSampleArray(i) = sampleArray(i / numColumns).values(i % numColumns).toFloat
        }
        val dMatrix = new DMatrix(flatSampleArray, numRows, numColumns, missingValue)
        Rabit.shutdown()
        Iterator(broadcastBooster.value.predict(dMatrix))
      }
    }
  }

The output is a RDD of a matrix that each row represents the prediction result (I print the result it seems to be just one Float). OK, you now just get the each partitions' prediction result and you do not know the corresponding relation with the origin data. I don't know what I can do with such a result. I am wondering whether should we delete these methods or replace them with others. I suggest to offer two methods. One is to return input and prediction, the other is to return the label and prediction (enough for most evaluation).

BTW I can offer my implementation.

Most helpful comment

@fromradio I found this java library https://github.com/komiya-atsushi/xgboost-predictor-java maybe this is interesting to you.

All 17 comments

are you sure you have label in test data?

for input case, use zip method in RDD to produce the expected format...

reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel

I see. I parse the predict method in spark's GeneralizedLinearModel below

  /**
   * Predict the result given a data point and the weights learned.
   *
   * @param dataMatrix Row vector containing the features for this data point
   * @param weightMatrix Column vector containing the weights of the model
   * @param intercept Intercept of the model.
   */
  protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double

  /**
   * Predict values for the given data set using the model trained.
   *
   * @param testData RDD representing data points to be predicted
   * @return RDD[Double] where each entry contains the corresponding prediction
   *
   */
  @Since("1.0.0")
  def predict(testData: RDD[Vector]): RDD[Double] = {
    // A small optimization to avoid serializing the entire model. Only the weightsMatrix
    // and intercept is needed.
    val localWeights = weights
    val bcWeights = testData.context.broadcast(localWeights)
    val localIntercept = intercept
    testData.mapPartitions { iter =>
      val w = bcWeights.value
      iter.map(v => predictPoint(v, w, localIntercept))
    }
  }

So it seems that we just need to unfold the Array[Array[Float]] to Iterator and the corresponding is kept?

Here is how they use predict in spark

object LogisticRegressionWithLBFGSExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogisticRegressionWithLBFGSExample")
    val sc = new SparkContext(conf)

    // $example on$
    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)
      .run(training)

    // Compute raw scores on the test set.
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      val prediction = model.predict(features)
      (prediction, label)
    }

    // Get evaluation metrics.
    val metrics = new MulticlassMetrics(predictionAndLabels)
    val accuracy = metrics.accuracy
    println(s"Accuracy = $accuracy")

    // Save and load model
    model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    val sameModel = LogisticRegressionModel.load(sc,
      "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    // $example off$

    sc.stop()
  }
}

I think that we should provide a predict method similar with spark which takes a vector as input not a matrix. And the later process will be easier. If you think this is necessary I will try to do this~

XGBoost does not support predict for a single instance efficiently...so predict(v: Vector) does not make sense

I see. So how about implement a method with flatten RDD like spark does. For instance:

def predict(testSet: RDD[Vector], useExternalCache: Boolean = false): RDD[Float] = {
    import DataUtils._
    val broadcastBooster = testSet.sparkContext.broadcast(_booster)
    val appName = testSet.context.appName
    testSet.mapPartitions { testSamples =>
      if (testSamples.hasNext) {
        val rabitEnv = Array("DMLC_TASK_ID" -> TaskContext.getPartitionId().toString).toMap
        Rabit.init(rabitEnv.asJava)
        val cacheFileName = {
          if (useExternalCache) {
            s"$appName-dtest_cache-${TaskContext.getPartitionId()}"
          } else {
            null
          }
        }
        val dMatrix = new DMatrix(new JDMatrix(testSamples, cacheFileName))
        val res = broadcastBooster.value.predict(dMatrix)
        require(res(0).length == 1, "The prediction of the tree should be scalar")
        Rabit.shutdown()
        res.map { arr => arr(0) }.iterator
      } else {
        Iterator()
      }
    }
  }

Now we can use zip to combine input and output

That's on todo list but no chance to do it

You can deprecate the current one and write a new one

I think that we should provide a predict method similar with spark which takes a vector as input not a matrix. And the later process will be easier. If you think this is necessary I will try to do this~

@fromradio You will need to implement the GBT logic completely in Java/Scala without calling any JNI functions.
It certainly is doable, but typically you only need prediction for single vectors in production environment, in which case you don't necessarily need or want to ship a fat jar with all the components of XGBoost.

@xydrolase I actually is trying to build a project containing various cluster-based machine learning algorithms and I found the api of xgboost4j-spark is not consistent with spark. Currently I have not much time re-write the GBT logic in scala and I will try when I'm free.

@CodingCat I have finished this on xgboost-0.6, I will pull a request on 0.7 after I finish~

@fromradio I found this java library https://github.com/komiya-atsushi/xgboost-predictor-java maybe this is interesting to you.

@geoHeil Thanks a lot. It seems to be what I want! I will make a test on prediction speed with xgboost4j-spark I modified.

Cool. I would be interested to know about your results.
Ruimin Wang notifications@github.com schrieb am Di. 13. Dez. 2016 um
11:18:

@geoHeil https://github.com/geoHeil Thanks a lot. It seems to be what I
want! I will make a test on prediction speed with xgboost4j-spark I
modified.

—
You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub
https://github.com/dmlc/xgboost/issues/1849#issuecomment-266700269, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABnc9Jea5dUaPUOAvlSvfUs5-MyYWUCPks5rHnDWgaJpZM4LHmIL
.

@geoHeil I have tested on a dataset (containing 200,000 data) on spark. The xgboost4j-spark cost 1775736 milliseconds containing implicit data transformations. xgboost-predictor-java cost 4620104 milliseconds containing data transformations and 2907550 milliseconds without transformations. I think xgboost4j's prediction on a batch is faster and I will keep using xgboost4j.

Thanks. Did you test predicting a single record as well?
Ruimin Wang notifications@github.com schrieb am Di. 13. Dez. 2016 um
12:32:

@geoHeil https://github.com/geoHeil I have tested on a dataset
(containing 200,000 data) on spark. The xgboost4j-spark cost 1775736
milliseconds containing implicit data transformations.
xgboost-predictor-java cost 4620104 milliseconds containing data
transformations and 2907550 milliseconds without transformations. I think
xgboost4j's prediction on a batch is faster and I will keep using xgboost4j.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dmlc/xgboost/issues/1849#issuecomment-266715580, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABnc9MVMNvszPjLXIMZFAeSi0qYLTmUhks5rHoImgaJpZM4LHmIL
.

@geoHeil Not yet. xgboost4j does not offer such an api. As discussed before, this is because there is no fast solution yet. I think batch predict can handle most situations.

I met this problem too.After one week test ,I found that if you use this API, you should make sure the input RDD is persist. so the partition will not changed. and you can zip the result and input labels!!!!
If your input data is loaded from hdfs, you should do nothing. Otherwise, you should make the train_data persisted or cached

Was this page helpful?
0 / 5 - 0 ratings

Related issues

vkuznet picture vkuznet  Â·  3Comments

choushishi picture choushishi  Â·  3Comments

RanaivosonHerimanitra picture RanaivosonHerimanitra  Â·  3Comments

FabHan picture FabHan  Â·  4Comments

tqchen picture tqchen  Â·  4Comments