Lightgbm: Support multi-output regression/classification

Created on 17 May 2017 · 22 comments · Source: microsoft/LightGBM

Currently, LightGBM only supports 1-output problems. It would be interesting if LightGBM could support multi-output tasks (multi-output regression, multi-label classification, etc.) like those in multitask lasso.

I've seen a similar request on xgboost, but it hasn't been implemented yet.

Labels: feature request, help wanted

Most helpful comment

Any update on multi-label classification?

All 22 comments

I think it would require rewriting the whole algorithm from scratch for LightGBM, as it was optimized for the single-output case.

The same is true for xgboost: it would require rewriting the whole algorithm from scratch, which is not possible in the current state unless someone is ready to work on it.

Is there any introductory website or paper about multi-output tasks, other than splitting them into multiple binary/regression tasks?

@wxchan multi-output tasks require an objective function that handles multiple outputs.

It also includes/requires multi-split support for decision trees (multiple cutting points instead of one cutting point).

I think you can check this as starting point, it's explained very simply: http://users.math.yale.edu/users/gw289/CpSc-445-545/Slides/CPSC445%20-%20Topic%2005%20-%20Classification%20&%20Decision%20Trees.pdf

@wxchan I believe GBDT can adapt from multi-class to multi-label classification (where the labels aren't mutually exclusive) without too much additional computational cost.

In multi-label classification, the target y is an n_samples × n_labels matrix, and each column is a binary vector.

Traditionally, at the leaf node of a classification tree, the prediction is generated by an average of one-hot class probability vectors (which represent the classes of the samples belonging to the leaf). Ex. mean([0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], ...). When we use it for multi-label classification, the probability vectors will be different. Ex. mean([0, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1], ...).

(I know gradient boosting involves more complex maths, but that's the basic idea.)
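The leaf-prediction idea above can be sketched in a few lines: the leaf value is the element-wise mean of the label vectors of the samples reaching that leaf, and the only change from multi-class to multi-label is that those vectors become multi-hot. This is an illustration of the idea, not LightGBM internals; the function name is made up.

```python
# Sketch: a leaf's prediction is the element-wise mean of the 0/1 label
# vectors of the samples that fall into it. For multi-class the vectors
# are one-hot; for multi-label they may contain several 1s.

def leaf_prediction(label_vectors):
    """Average a list of equal-length 0/1 label vectors element-wise."""
    n = len(label_vectors)
    n_labels = len(label_vectors[0])
    return [sum(v[l] for v in label_vectors) / n for l in range(n_labels)]

# Multi-class case: one-hot vectors.
print(leaf_prediction([[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]))
# Multi-label case: multi-hot vectors, same computation.
print(leaf_prediction([[0, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1]]))
```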

For GBDT, some other modifications may be required:

  1. The impurity function: for multi-label classification, the impurity functions mentioned in this document would change. A multi-class impurity function over n classes would be replaced by the sum of n two-class impurity functions, one for each label.

e.g.
original entropy function: $-\sum_{c \in \mathcal{C}}{p(c)\log p(c)}$
new entropy function for multi-label classification: $-\sum_{l \in \mathcal{L}}{\left[p(l)\log p(l) + (1-p(l))\log (1-p(l))\right]}$

  2. The objective function for gradient boosting: not certain yet, since metrics like cross entropy also apply to multi-label problems. This may be something interesting to explore.
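The impurity change in point 1 can be made concrete with a short, stdlib-only sketch: multi-class entropy over a single class distribution versus a sum of independent binary entropies, one per label. Function names are illustrative.

```python
import math

# Sketch of the impurity change: multi-class entropy over one class
# distribution vs. a per-label sum of binary entropies.

def multiclass_entropy(p):
    """Entropy -sum(p_c * log p_c) of a class-probability vector p."""
    return -sum(pc * math.log(pc) for pc in p if pc > 0)

def multilabel_entropy(p):
    """Sum of independent binary entropies, one per label probability."""
    def h(q):
        if q in (0.0, 1.0):
            return 0.0  # a pure label contributes zero impurity
        return -(q * math.log(q) + (1 - q) * math.log(1 - q))
    return sum(h(pl) for pl in p)
```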

@Laurae2 I couldn't see the necessity of multi-split support. Theoretically, any split of 1 parent node into more than 2 child nodes can be equivalently represented by a sequence of binary splits. (Did I understand you correctly?)

The class_weight parameter will still be useful. A label with a larger weight would thus be considered more important during the evaluation of each split.

I suppose, for multi-label classification, an implementation within a single lightgbm model could train more efficiently and consume less memory. Relatively speaking, multi-output regression might not be as useful.

@TennielMiao multi-output classification is doable in xgboost/LightGBM; it is actually what is being done in multiclass problems, but not in an efficient manner. Also, it returns everything, while you might be interested in a specific number of outputs (especially for classification). This is why we have softprob and softmax separate in xgboost (one gives you the raw values for your needs, while the other processes them before giving them to you).

It requires modification of the objective/loss function like you described. For instance, if you were to optimize F1 or F2 score, then you would have to put in the metric part an optimizer which finds the best threshold for each class for each iteration. For the loss function, you would have to find a proxy which is continuous and a local statistic (unlike F1/F2 Score requiring discrete inputs over a global statistic).
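The per-class threshold search described above can be sketched with the standard library: for one class, scan candidate cutoffs over the predicted probabilities and keep the one that maximizes F1. In practice this would run per class, per iteration, inside a custom metric; the function name here is made up for illustration.

```python
# Hedged sketch: find the probability cutoff maximizing F1 for one class.

def best_f1_threshold(probs, labels):
    """Scan each distinct predicted probability as a candidate threshold."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 1)
        fp = sum(1 for y, yh in zip(labels, preds) if y == 0 and yh == 1)
        fn = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 0)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Note that F1 itself is discrete and global, which is exactly why a continuous local proxy is needed for the loss, as the comment above points out.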

For proper multi-output classification, allowing multi-way splits instead of binary splits would let trees use a lower depth and require fewer splits overall. Since the sum of losses from a sequence of binary splits is only an approximation of the loss of a single multi-way split (mathematically, if you consider a graph with chained losses), the representation of a multi-split is not always identical to the representation of multiple binary splits (the more you split, the higher the odds of ending up with something different).

As for the speed, there are two major cases:

  • O(n) for n classes: using n models for n classes/outputs is the easiest to implement. If you have 10,000 classes, then you have 10,000 models to train.
  • O(log(n)) for n classes: using 1 model for n classes/outputs is harder to implement and not trivial. It would also mean 10,000 classes would train 2,500x faster (theoretically) than a one-vs-all or one-vs-one classifier/regressor.
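The O(n) route in the first bullet is simple enough to sketch: fit n independent binary models, one per output. The per-label "model" below is a deliberately trivial decision stump on one feature, standing in for a real booster such as LGBMClassifier; only the structure (n fits, n predictions) is the point, and all names are illustrative.

```python
# Sketch of one-vs-all decomposition: one binary model per label.

def fit_stump(X, y):
    """Toy model: pick (threshold, sign) on feature 0 maximizing accuracy."""
    best, best_acc = (0.0, 0), -1.0
    for t in [x[0] for x in X]:
        for sign in (0, 1):
            preds = [1 if (x[0] >= t) == bool(sign) else 0 for x in X]
            acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
            if acc > best_acc:
                best, best_acc = (t, sign), acc
    return best

def fit_one_vs_all(X, Y):
    """Y: rows of n_labels 0/1 targets -> one stump per label (n fits)."""
    n_labels = len(Y[0])
    return [fit_stump(X, [row[l] for row in Y]) for l in range(n_labels)]

def predict(models, x):
    return [1 if (x[0] >= t) == bool(s) else 0 for (t, s) in models]
```

With 10,000 labels this loop fits 10,000 models, which is exactly the linear cost the comment contrasts with a single shared-structure model.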

For the class_weight you mentioned: @guolinke did you implement the class_weight-like parameter in a previous PR? https://github.com/Microsoft/LightGBM/pull/314

@TennielMiao The main bottleneck for implementing multi-output classification/regression is memory. It would be highly inefficient, since we need to maintain all the residual values of all samples over all features.

Just found a related paper at ICML 2017: "Gradient Boosted Decision Trees for High Dimensional Sparse Output".

@huanzhang12 is one of the authors.

The abstract:

In this paper, we study the gradient boosted decision trees (GBDT) when the output space is high dimensional and sparse. For example, in multilabel classification, the output space is a L-dimensional 0/1 vector, where L is number of labels that can grow to millions and beyond in many modern applications. We show that GBDT can easily run out of memory or encounter near-forever running time in this regime, and propose a new GBDT variant, called GBDT-SPARSE to resolve this problem by employing L_0 regularization. We then discuss in detail how to utilize this sparsity to conduct GBDT training, including splitting the nodes, computing the sparse residual, and predicting in sublinear time. Finally, we apply our algorithm to extreme multilabel classification problems, and show that the proposed GBDT-SPARSE achieves an order of magnitude improvements in model size and prediction time over existing methods, while yielding similar performance.

That method seems way faster than O(n) so it will be interesting to see what is used in GBDT-SPARSE to reach that speed.
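The core trick the abstract describes, keeping only a small sparse residual instead of a dense L-dimensional one, can be illustrated in a few lines. This is a hedged sketch of the sparsification idea only (an L0-style budget of k entries), not the paper's actual algorithm, and the function name is made up.

```python
import heapq

# Sketch: keep only the k largest-magnitude entries of a per-sample
# residual vector, storing them as a sparse {index: value} map.

def sparsify_residual(residual, k):
    """Return {index: value} for the k largest-magnitude residual entries."""
    top = heapq.nlargest(k, range(len(residual)),
                         key=lambda i: abs(residual[i]))
    return {i: residual[i] for i in top}

# A sample with L = 4 labels keeps only its k = 2 dominant residuals:
print(sparsify_residual([0.1, -2.0, 0.0, 1.5], 2))
```

Memory per sample then scales with k rather than L, which is how the dense-residual bottleneck mentioned earlier in the thread is avoided.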

@marugari Thanks for posting the link here!
Let me know if you have any questions regarding our paper.

@huanzhang12 Do the trees for the top-k labels share the same (feature_id, threshold)?
Why does the following algorithm fail?

Q <- index set of top-k p_s
for q in Q do
  for j = 1...D do
    for i = 1...N do
      (evaluate split of feature j at sample i for label q)
  best[q] <- (f_best, t_best)

If this discussion is not suitable for this issue, I will send an e-mail.

@marugari Yes, please send me an email if you have specific questions on our paper.

@marugari
Thanks very much. It seems to be a solution that trains on the top-k worst classes?

@marugari Sorry for my late reply. I will look into this.

@guolinke
Yes. This algorithm focuses on residuals rather than gains, unlike the paper.

@huanzhang12
It seems that the classes whose gain improves easily often don't match the classes that have large loss.
This causes poor prediction accuracy in my implementation.

Hi all, I have a question and I am not sure if it is related to this discussion. When I train a multi-class model, from the log I can see that the number of trained trees is usually equal to the number of classes times the number of iterations. Does that mean LGB is training something like a one-vs-all classifier? If so, can we not take the output for each class and somehow achieve multi-label classification? Please correct me if I am wrong. Thanks.

@albertauyeung
LGB uses softmax (or softprob) and minimizes cross-entropy.
However, the number of trees per iteration equals the number of classes, because the TreeLearner doesn't support multi-output.
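The answer above can be sketched numerically: the per-class raw scores produced by the num_class trees of one iteration are turned into probabilities via softmax, and training minimizes the cross-entropy of the true class. A minimal stdlib-only illustration (not LightGBM's actual implementation):

```python
import math

# Sketch: softmax over per-class raw scores, plus the cross-entropy
# of the true class that multiclass training minimizes.

def softmax(scores):
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(scores, true_class):
    return -math.log(softmax(scores)[true_class])
```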

Multiclass classification in TensorFlow Boosted Trees.
Does this work well? (I don't believe so.)
https://arxiv.org/abs/1710.11547v1

Any update on multi-label classification?

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.
