Catboost: Predict leaf indices

Created on 1 Feb 2018 · 4 Comments · Source: catboost/catboost

Both XGBoost and LightGBM have the option to predict leaf indices instead of scores/classes/probabilities.

E.g., https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py

Can this option be added to CatBoost? It seems simple to do, since CatBoost already creates binary vectors that correspond to leaf indices during inference.

in progress


sosmond

Most helpful comment

Leaf index calculation is now available in the Python package. To try this feature, please build CatBoost from source (instructions for Linux) or wait for our next PyPI release.

Two methods for leaf index calculation have been added to the CatBoost class:

calc_leaf_indexes - returns a two-dimensional numpy.ndarray of indexes.
iterate_leaf_indexes - returns a generator of per-object leaf indexes.

See the help on these methods for more details.

Andrew-Angrew on 13 May 2019


All 4 comments

Any input on this? This is an extremely useful feature transformation, as demonstrated by Facebook's paper: https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf

And CatBoost's fast inference makes it the perfect candidate for online learning applications.

sosmond on 3 Feb 2018

Hello, do you want to help us with this feature? If so, that would be great, and we can help you. If not, we can plan this small feature for the next release :)

kizill on 3 Feb 2018

I have some time in the short term and am interested in taking a stab at it, but I would need help. The change is not as simple as I first thought, since it involves returning predictions in a different format.

The code that generates the predictions appears to be in formula_evaluator.cpp:

TCalcerIndexType index = 0; 
for (int depth = 0; depth < curTreeSize; ++depth) {
    const ui8 borderVal = (ui8)(treeSplitsCurPtr[depth].SplitIdx);
    const ui32 featureIndex = (treeSplitsCurPtr[depth].FeatureIndex);
    if (NeedXorMask) {
        const ui8 xorMask = (ui8)(treeSplitsCurPtr[depth].XorMask);
        index |= ((binFeatures[featureIndex] ^ xorMask) >= borderVal) << depth;
    } else {
        index |= (binFeatures[featureIndex] >= borderVal) << depth;
    }
}
auto treeLeafPtr = model.ObliviousTrees.LeafValues[treeId].data();
if (IsSingleClassModel) { // single class model
    result += treeLeafPtr[index];
} else { // multiclass model
    auto leafValuePtr = treeLeafPtr + index * model.ObliviousTrees.ApproxDimension;
    for (int classId = 0; classId < model.ObliviousTrees.ApproxDimension; ++classId) {
        results[classId] += leafValuePtr[classId];
    }
}

where index is the leaf index and treeLeafPtr[index] is the score at that leaf node. The prediction for a particular test data point is built by summing the individual scores from each tree.
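To make the difference between the two outputs concrete, here is a minimal pure-Python sketch of the same bit-building loop. It is an illustration only, not CatBoost's implementation: the names leaf_index and predict_and_leaves, the (feature_index, border) split layout, and the toy model are all assumptions made for this example, and the XorMask branch is left out.

```python
from typing import List, Sequence, Tuple

# Assumed layout: one (feature_index, border) pair per depth level of an oblivious tree.
Split = Tuple[int, int]


def leaf_index(bin_features: Sequence[int], splits: Sequence[Split]) -> int:
    """Return the leaf index of one oblivious tree for one object."""
    index = 0
    for depth, (feature_index, border) in enumerate(splits):
        # Same idea as the C++ loop: the comparison result becomes bit `depth`.
        index |= int(bin_features[feature_index] >= border) << depth
    return index


def predict_and_leaves(
    bin_features: Sequence[int],
    trees: Sequence[Sequence[Split]],
    leaf_values: Sequence[Sequence[float]],
) -> Tuple[float, List[int]]:
    """Return both the summed score and the per-tree leaf indices."""
    indices = [leaf_index(bin_features, splits) for splits in trees]
    score = sum(values[i] for values, i in zip(leaf_values, indices))
    return score, indices


# Hypothetical toy model: two depth-2 trees over three binarized features.
trees = [[(0, 1), (1, 2)], [(2, 3), (0, 1)]]
leaf_values = [[0.0, 0.25, 0.5, 0.75], [-1.0, -0.5, 0.5, 1.0]]
score, indices = predict_and_leaves([1, 0, 2], trees, leaf_values)
print(score, indices)
```

The score collapses all trees into one number, while the index list keeps one entry per tree, which is the output this issue asks for.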

However, for this feature transformation I don't want one prediction per data point. Instead I want a vector of leaf indices with one entry for each tree.

For example, suppose the GBM grows 3 trees and, for a particular test data point, the leaf indices are 1, 1, 2. Then the prediction would be 1[1] + 2[1] + 3[2], where I[J] is the score in leaf node J of tree I, but what I want is the feature vector [1, 1, 2]. The idea, as mentioned in the Facebook paper linked above, is to feed these features into a second classifier, which has been shown to improve prediction performance significantly in some cases.
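A second classifier usually cannot consume raw leaf indices directly, because they are categorical rather than ordinal. One common approach, and the one the Facebook paper describes, is to treat each tree as a categorical feature and one-hot encode it. The sketch below shows that encoding for the example above; the helper name one_hot_leaves and the choice of 4 leaves per tree are assumptions for illustration.

```python
from typing import List, Sequence


def one_hot_leaves(leaf_indices: Sequence[int], leaves_per_tree: Sequence[int]) -> List[int]:
    """Concatenate one one-hot block per tree; exactly one bit is set in each block."""
    features: List[int] = []
    for index, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= index < n_leaves:
            raise ValueError(f"leaf index {index} is out of range for {n_leaves} leaves")
        block = [0] * n_leaves
        block[index] = 1
        features.extend(block)
    return features


# Three trees with leaf indices [1, 1, 2]; 4 leaves per tree (depth 2) is assumed.
encoded = one_hot_leaves([1, 1, 2], [4, 4, 4])
print(encoded)
```

The resulting binary vector has one block per tree and can be passed to a linear model such as logistic regression.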

After thinking about this for a bit, it's still unclear to me what the best way to implement this would be. Any guidance would be great.

sosmond on 17 Feb 2018

Leaf index calculation is now available in the Python package. To try this feature, please build CatBoost from source (instructions for Linux) or wait for our next PyPI release.

Two methods for leaf index calculation have been added to the CatBoost class: