Both XGBoost and LightGBM have the option to predict leaf indices instead of scores/classes/probabilities.
E.g. https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py
Can this option be added to CatBoost? It seems simple to do, since CatBoost already creates binary vectors that correspond to leaf indices during inference.
Any input on this? This is an extremely useful feature transformation, as demonstrated by Facebook's paper: https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf
And CatBoost's fast inference makes it the perfect candidate for online learning applications.
Hello, do you want to help us with this feature? If so, that would be great, and we can help you. If not, we can plan this small feature for the next release :)
I have some time in the short term and am interested in taking a stab at it but I would need help. The change is not as simple as I first thought since it involves returning a different format of predictions.
The code that generates the predictions appears to be in formula_evaluator.cpp:
TCalcerIndexType index = 0;
for (int depth = 0; depth < curTreeSize; ++depth) {
    const ui8 borderVal = (ui8)(treeSplitsCurPtr[depth].SplitIdx);
    const ui32 featureIndex = (treeSplitsCurPtr[depth].FeatureIndex);
    if (NeedXorMask) {
        const ui8 xorMask = (ui8)(treeSplitsCurPtr[depth].XorMask);
        index |= ((binFeatures[featureIndex] ^ xorMask) >= borderVal) << depth;
    } else {
        index |= (binFeatures[featureIndex] >= borderVal) << depth;
    }
}
auto treeLeafPtr = model.ObliviousTrees.LeafValues[treeId].data();
if (IsSingleClassModel) { // single class model
    result += treeLeafPtr[index];
} else { // multiclass model
    auto leafValuePtr = treeLeafPtr + index * model.ObliviousTrees.ApproxDimension;
    for (int classId = 0; classId < model.ObliviousTrees.ApproxDimension; ++classId) {
        results[classId] += leafValuePtr[classId];
    }
}
where index is the leaf index and treeLeafPtr[index] is the score at that leaf node. The prediction for a particular test data point is built by summing the individual scores from each tree.
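To make the loop above easier to follow, here is a rough Python sketch of the same idea. It is not CatBoost code: the names `leaf_index` and `predict`, the `(feature_index, border)` split layout, and all the numbers are made up for illustration, and the XOR-mask branch is left out. Each tree level contributes one bit to the leaf index, and the single-class prediction is the sum of the selected leaf values.

```python
from typing import List, Sequence, Tuple

# Hypothetical split layout: (feature_index, border) per tree level.
Split = Tuple[int, int]


def leaf_index(splits: Sequence[Split], bin_features: Sequence[int]) -> int:
    """Return the leaf index of one oblivious tree for one binarized sample."""
    index = 0
    for depth, (feature_index, border) in enumerate(splits):
        # Level `depth` sets bit `depth` when the binarized feature reaches the border.
        index |= int(bin_features[feature_index] >= border) << depth
    return index


def predict(trees: Sequence[Sequence[Split]],
            leaf_values: Sequence[Sequence[float]],
            bin_features: Sequence[int]) -> float:
    """Single-class prediction: sum the value of the selected leaf in every tree."""
    return sum(values[leaf_index(splits, bin_features)]
               for splits, values in zip(trees, leaf_values))


# Toy model with two depth-2 trees (4 leaves each); all values are invented.
trees: List[List[Split]] = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]
leaf_values = [[0.0, 0.25, 0.5, 0.75], [1.0, 2.0, 3.0, 4.0]]
sample = [1, 2, 0]  # binarized feature values for one data point

print([leaf_index(t, sample) for t in trees])
print(predict(trees, leaf_values, sample))
```

The first `print` shows one leaf index per tree, which is exactly the intermediate value that the summation normally discards.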
However, for this feature transformation I don't want one prediction per data point. Instead I want a vector of leaf indices with one entry for each tree.
For example, suppose the GBM grows 3 trees and, for a particular test data point, the leaf indices are 1, 1, 2. Then the prediction would be 1[1] + 2[1] + 3[2], where _I_[_J_] is the score in leaf node _J_ of tree _I_, but what I want is the feature vector [1, 1, 2]. The idea, as mentioned in the Facebook paper linked above, is to take these features as inputs into a second classifier, an approach that has been shown to improve prediction performance significantly in some cases.
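As a rough illustration of how a leaf-index vector such as [1, 1, 2] could be fed to a second classifier, the sketch below one-hot encodes each tree's leaf index and concatenates the results, which is how I read the approach in the Facebook paper. The helper name `one_hot_leaves`, the assumption of 4 leaves per tree, and the zero-based leaf numbering are mine, not part of any library.

```python
from typing import List, Sequence


def one_hot_leaves(leaf_indices: Sequence[int],
                   leaves_per_tree: Sequence[int]) -> List[int]:
    """Concatenate one one-hot block per tree, with a 1 at the leaf that was hit."""
    features: List[int] = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError(f"leaf index {leaf} out of range for {n_leaves} leaves")
        block = [0] * n_leaves
        block[leaf] = 1
        features.extend(block)
    return features


# Assumption: 3 trees of depth 2, so 4 leaves each, with zero-based leaf indices.
encoded = one_hot_leaves([1, 1, 2], [4, 4, 4])
print(encoded)
```

The resulting binary vector has one active entry per tree, so a linear model trained on it learns one weight per leaf.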
After thinking about this for a bit it's still unclear to me what the best way to implement this would be. Any guidance would be great.
Leaf index calculation is now available in the Python package. To try this feature, please build CatBoost from source (instruction for Linux) or wait for our next PyPI release.
Two methods for leaf index calculation have been added to the CatBoost class; see the help for these methods for more details.