Scikit-learn: add post-pruning for decision trees

Created on 16 Mar 2016 · 64Comments · Source: scikit-learn/scikit-learn

I frequently get asked about post-pruning. Often using single trees is important for interpretability, and post-pruning can help both interpretability and generalization performance.
[I'm surprised there is no open issue on this, but I couldn't find one]

Moderate New Feature help wanted

Source

amueller

👍23

Most helpful comment

Hi,all ,
I have implemented:
CCP(Cost Complexity Pruning) Algorithm on
sklearn-CART-Classification-model in Python,

ECP(Error Complexity Pruning) Algorithm on
sklearn-CART-Regression-model in Python,

here's the link:
https://github.com/appleyuchi/Decision_Tree_Prune
you may like it.

appleyuchi on 8 Dec 2018

👍3

All 64 comments

I can give this a shot :)

nelson-liu on 17 Mar 2016

The title of this issue is related https://github.com/scikit-learn/scikit-learn/issues/4630 .

lesshaste on 18 Mar 2016

This issue #941 also seems to be proposing prune method to the tree but the code has considerably changed in the meantime I suppose.

maniteja123 on 18 Mar 2016

On the topic of interpretability, there has also been published work on creating single decision trees from random forests. As counter-intuitive as this sounds, due to the non-optimality of standard decision tree algorithms, this can apparently give single decisions tree that are both interpretable and also better classifiers/regressors than you would get otherwise.

lesshaste on 18 Mar 2016

👍2

@amueller Could you please explain this issue a little bit? I want to work on this.

qmaruf on 5 Apr 2016

actually maybe the first thing to do would be to add a stopping criterion based on mutual information...

amueller on 21 May 2016

@lesshaste can you give a citation? (I had been thinking about that recently, damn, obviously it has been done ^^)

amueller on 21 May 2016

@amueller It has been a long time since I looked at this but:

Section 2.2 of
"Rule Extraction from Random Forest: the RF+HC Methods" by Morteza Mashayekhi and Robin Gras has all the relevant citations I know. Please let me know if you don't have access to this paper.

On an only slightly related note, I remember looking at http://blog.datadive.net/interpreting-random-forests/ with some interest.

lesshaste on 21 May 2016

👍1

actually maybe the first thing to do would be to add a stopping criterion based on mutual information...

An easy addition would be to stop the construction when p(t) i(t, s*) < beta, i.e. when the weighted impurity p(t) i(t, s*) for the best split s* becomes less than some user-defined threshold beta.

glouppe on 6 Jun 2016

I frequently get asked about post-pruning. Often using single trees is important for interpretability, and post-pruning can help both interpretability and generalization performance.

I think this can also be beneficial for some ensemble algorithms like (gradient) boosting although it's typically not increasing the predictive accuracy for bagging-style ensembles (e.g. random forests).

Pruning might also be useful to save some memory and decrease prediction times a bit.

ogrisel on 6 Jun 2016

@amueller / others: @jmschrei and i met to discuss the issue of post-pruning a few weeks ago, and we were unsure of how it would fit in the current scikit-learn API. Generally, post-pruning needs a validation set, but this doesn't seem to fit nicely with how the library is currently organized (namely, issues like creation / origin of the validation set and whether this would be an argument to fit or a separate method come to mind). How were you thinking of this being implemented from an API point of view?
For now, I'm working on looking through the splitter.pyx code and adding the early stopping criteria based on weighted impurity for GSoC while thinking about MAE.

nelson-liu on 30 Jun 2016

👍1

Seems like there was some discussion about API issues in https://github.com/scikit-learn/scikit-learn/pull/941, but the tree module has changed quite a bit and perhaps we should have the discussion again. For one, we have parameters like max_leaf_nodes and max_depth to control growth of the tree a bit.

nelson-liu on 30 Jun 2016

I would be inclined to set up pruning as a separate method. In fact, a separate function that takes a tree and returns the pruned version would be ok to begin with. This is the approach which is used in https://github.com/ajtulloch/sklearn-compiledtrees to generate optimized code for trees.

@nelson-liu With regard to the regularization options, they sometimes lead to subtrees with all nodes belonging to the same class. While it is not pruning in the sense of regularization, it would be nice to have a function to get rid of these extra nodes that only add computational cost during prediction.

nlgranger on 30 Jun 2016

👍1

When would you ever get a subtree with all nodes predicting the same class? You wouldn't make a split if there wasn't a gain in your criterion.

jmschrei on 30 Jun 2016

In essence, what I'd like is a 'warm start' like method for building a tree, like we have a warm start for building a random forest. You should have a method to add a single node to a tree, but still get a valid estimator before and after adding the node. This would allow users to evaluate the trees performance on the evaluation set as they build it, just like you can add trees to a random forest, evaluating its performance on the validation set each time. This shouldn't be terribly difficult functionally to add, the biggest issue is just the API for this. @ogrisel @glouppe do you have any thoughts?

jmschrei on 30 Jun 2016

@jmschrei
I meant that a split of 90:10 into 80:0 & 10:10 won't change the performance if the tree stops there.

nlgranger on 30 Jun 2016

@pixelou another important thing to consider is whether we want this to be useable with GridSearchCV...a separate method would break that i think.

nelson-liu on 1 Jul 2016

@nelson-liu Sorry I wasn't clear: I meant that most of the hard work is to write the pruning procedures themselve. Integration shouldn't be too hard (but I'm not the one doing it obviously :-) )

One the top of my head, I see several options for integration:

options given to the tree constructor are then taken into account by .fit
options are given to .fit directly
a separate .prune or .post_prune method has to be called explicitely
a separate prune_tree or post_prune_tree function takes the tree and returns another pruned tree

It's up to you to decide on this, but I think one can write a separate private (class?)method for pruning and make it available to the API as one of the above solutions.

Note that I have just given the points above without further thinking ;-). 2 and 4 are clearly not the sklearn way of doing things, and you just mentioned how 3 can be a problem.

nlgranger on 1 Jul 2016

The first option of those seems the most consistent with sklearn.

jmschrei on 3 Jul 2016

👍1

@ameuller @glouppe @GaelVaroquaux any opinion on the api for post pruning? I see these three options, the first of which seems the best to me. Just wanted to get some clarification before I start to think about working on this:

options given to the tree constructor are then taken into account by .fit
a separate .prune or .post_prune method has to be called after fitting
a separate prune_tree or post_prune_tree function takes the tree and returns another pruned tree

nelson-liu on 9 Aug 2016

👍1

Yes, the first option is certainly the one that fits best with our API.

glouppe on 9 Aug 2016

Isn't it pre pruning if you do it at fit? Aren't we supposed to check the complexity of the tree and use post pruning to reduce it?

raghavrv on 9 Aug 2016

Start with the third option, then decide whether the others are appropriate...

jnothman on 9 Aug 2016

+1 for that

raghavrv on 9 Aug 2016

(by which I mean, the example code might help decide)

jnothman on 9 Aug 2016

👍1

It's not pre-pruning if you do it at fit, if you build the full tree and then go backwards and remove nodes. I agree with @glouppe that the first one is the best option, but I also agree with @jnothman that since the code will rely on a prune_tree method ~anyway~ it may be better to create a standalone thing during the development stages.

jmschrei on 9 Aug 2016

It's not pre-pruning if you do it at fit, if you build the full tree and then go backwards and remove nodes.

I assumed that the user would want to select whether to post pruning or not based on the built tree...

raghavrv on 9 Aug 2016

For instance this is how rpart does it - by a separate prune method which goes backward and trims the tree...

raghavrv on 9 Aug 2016

Presumably you'd want validation set performance to guide if/how to prune the tree.

jmschrei on 9 Aug 2016

My concern was - if users want a peek at the tree between the fit and post-prune operations, they will have to fit again without post-pruning enabled...

But as you say maybe that is not necessary if the need for post-pruning is based on the validation set performance and not on user's decision...

raghavrv on 9 Aug 2016

@nelson-liu Can I work on this if you are not?

dalmia on 26 Feb 2017

sure go ahead, it could be a good way to get familiar with the tree code. It's definitely not an easy task though, so feel free to ask if you need assistance...

nelson-liu on 26 Feb 2017

Thank you @nelson-liu.

dalmia on 26 Feb 2017

👍1

So is this been implemented or still remaining?

datavinci on 14 May 2017

@datavinci it is still remaining.

nelson-liu on 14 May 2017

If nobody is working on this feature, I would like to do this on myself or with @dalmia. OK?

avidale on 19 Jul 2017

It seems that this is stalled, so go ahead.

On Wed, Jul 19, 2017 at 2:32 PM, avidale notifications@github.com wrote:

If nobody is working on this feature, I would like to do this on myself or
with @dalmia https://github.com/dalmia. OK?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scikit-learn/scikit-learn/issues/6557#issuecomment-316524038,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADvEEOVYL_gQbMzWFixoHO1D0YfvuItCks5sPnXlgaJpZM4HybHH
.

jmschrei on 20 Jul 2017

I don't think @dalmia has availability these days.

On 20 July 2017 at 08:01, Jacob Schreiber notifications@github.com wrote:

It seems that this is stalled, so go ahead.

On Wed, Jul 19, 2017 at 2:32 PM, avidale notifications@github.com wrote:

If nobody is working on this feature, I would like to do this on myself
or
with @dalmia https://github.com/dalmia. OK?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
6557#issuecomment-316524038>,
or mute the thread
gQbMzWFixoHO1D0YfvuItCks5sPnXlgaJpZM4HybHH>
.

>

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scikit-learn/scikit-learn/issues/6557#issuecomment-316531460,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz64HQd8rQr3u2hD0QFR-RxktyYZoaks5sPnzTgaJpZM4HybHH
.

jnothman on 20 Jul 2017

I'm also seeing large subtrees predicting the same class. Even naive pruning would improve interpretability in cases like these (and it would have no effect on the predictions).

ccmaymay on 15 Dec 2017

👍3

@cjmay it would change the predicted probabilities. But yes, adding any sort of pruning would be nice ;)

amueller on 15 Dec 2017

👍1

It would be nice to see benchmarks of post-pruning vs pre-pruning. I think "min_impurity_decrease" is a relatively reasonable stopping criterion. Or you could grid-search n_leafs... do people usually grid-search the pruning parameter? (I would imagine so). It would be interesting to see how often that gives different outcomes.

amueller on 15 Dec 2017

@amueller oh, I see.

ccmaymay on 15 Dec 2017

Look forward to it!

bifeng on 8 Mar 2018

For some of the ensemble methods this could be worked in naturally. If there is bagging then the OOB sample could be used for validation and pruning. This would be like how oob_improvement_ is calculated in GradientBoostingClassifier

DrEhrfurchtgebietend on 14 Mar 2018

Is anyone working on this right now?

ghost on 4 May 2018

👍2

Not to my knowledge

jnothman on 6 May 2018

😕1

is there anybody done it ???

wsy1991 on 12 Jun 2018

is anyone have done it that could share your code......

wsy1991 on 12 Jun 2018

cost complexity pruning based on sklearn.decisiontreeclassifier ????

wsy1991 on 13 Jun 2018

Can someone do cost complexity pruning based on sklearn.decisiontreeclassifier()？ If not, I'll ask again .I'm seriously.......

wsy1991 on 13 Jun 2018

Found this:
https://github.com/shenwanxiang/sklearn-post-prune-tree

Not sure if it is exactly cost complexity pruning?

zanderbraam on 13 Jun 2018

@zanderbraam it's not CCP

appleyuchi on 3 Dec 2018

Hi,all ,
I have implemented:
CCP(Cost Complexity Pruning) Algorithm on
sklearn-CART-Classification-model in Python,

ECP(Error Complexity Pruning) Algorithm on
sklearn-CART-Regression-model in Python,

here's the link:
https://github.com/appleyuchi/Decision_Tree_Prune
you may like it.

appleyuchi on 8 Dec 2018

👍3

@appleyuchi thanks!
I find it a bit hard to follow the structure of the code, in particular given that file names and comments are in Chinese. There also seems to be a lot of duplicate code.

amueller on 8 Dec 2018

👍2

I don't work with DT anymore, but has anyone had a look at https://github.com/beedotkiran/randomforestpruning-ismm-2017 ? It seems relevant to this issue.

nlgranger on 13 Dec 2018

A tree is just a very small forest. Can't this implementation scale down to trees?

nlgranger on 13 Dec 2018

@appleyuchi I'm not sure if I follow what you're saying but we will not adopt an implementation based on going to JSON. Any implementation in scikit-learn would have to work directly with the scikit-learn data structures.

amueller on 13 Dec 2018

👍1

@appleyuchi thanks for your efforts. Regarding how tricky it may be to implement this feature, we do recognize that touching the tree code base is not necessarily a trivial task. There have been efforts to change that and have a more readable implementation. Besides, this issue isn't labeled as "Easy" for that exact reason.

I hope you find other issues that you may be interested in and you keep the good work on them :)

adrinjalali on 14 Dec 2018

I am working on this issue with a cost complexity pruning (CPP) algorithm. I see several tests that can be used to check tree pruning:

Increasing alpha (in CPP) should result in smaller or equal number of nodes.
Make sure the pruned tree is actually a subtree of the original tree.

What other tests would be appropriate for tree pruning?

thomasjpfan on 23 Dec 2018

@appleyuchi Thanks for sharing! My concern is that, even as a Chinese, it's still pretty hard for me to follow the code. I guess it is better to have the code more modularized so that others can apply your implementation on any data sets.

zhenyu-zhou on 26 Jul 2019

@zhenyu-zhou
Because almost all of you did NOT ever read the book《classification and regression trees》carefully.

The first author of this book has already passed away so you can not contact him for questions.

The defect of this book is discussed in
http://www.dcc.fc.up.pt/~ltorgo/PhD/th4.pdf
or
http://www.doc88.com/p-6445227043649.html

which point out that CCP/ECP algorithm-cross validation will fail for unbalanced and small datasets,
you should understand the above academic materials before you implement it.

I analyzed and summarized the defect in:

https://blog.csdn.net/appleyuchi/article/details/84957220

The Github link I provide for CCP/ECP is just "application style",NOT from sklearn's "bottom variable style",(the latter will be much more efficient and faster)
so they reject.

even as a Chinese, it's still pretty hard for me to follow the code. I guess it is better to have the code more modularized so that others can apply your implementation on any data sets.

It can be applied on many datasets I have tested,I guess you even have NOT clicked in and read the instruction in the Github link

appleyuchi on 26 Jul 2019

@appleyuchi

But it can be applied on any datasets you want,I'm sure you even have NOT clicked in and read the instruction in the link

You are ABSOLUTELY WRONG. To make it short, I just have one question for you: do you provide a clean API, like sklearn did?

You shouldn't expect everyone to read every detail of your code before using it, and blame others for that, if you treat your code as a library as sklearn. That's one of the reasons they reject the code. Consider it when you are using other libraries, take sklearn as an example, if it only has a bunch of self-contained code for experiments, which requires a certain input format, without a general framework. You need to carefully examine the library code to determine how to split the main logic out and apply to your dataset. Will you use it? I acknowledge that the experiment is interesting, but it is not a library. I just want to kindly post sth in my mind to improve the code but it is too hard.

zhenyu-zhou on 26 Jul 2019

@zhenyu-zhou

You shouldn't expect everyone to read every detail of your code before using it,and blame others for that,

you misunderstand it ,what I said is material,not code.
Note that it refers to a book,NOT the code I have written.

I mean《classification and regression trees》is a famous book,not the code I wrote ,
so that's not blame.and "you" refers to new contributors,NOT the existing member of sklearn.

I just want to kindly post sth in my mind to improve the code but it is too hard.

again,you should read the book carefully before you implement it.

@zhenyu-zhou

, if it only has a bunch of self-contained code for experiments, which requires a certain input format, without a general framework. You need to carefully examine the library code to determine how to split the main logic out and apply to your dataset.

The API style has been dicussed several months ago when you have NOT been here,
again,
what I have implemented is"application style",NOT from"sklearn bottom data structure"

Good Luck.

ps:
Notification from this issue has been cancelled,because I'm busy.
@me will NOT take effect any longer.
If you have any question ,contact me via Email please.

appleyuchi on 26 Jul 2019

@ameuller @glouppe @GaelVaroquaux any opinion on the api for post pruning? I see these three options, the first of which seems the best to me. Just wanted to get some clarification before I start to think about working on this:

options given to the tree constructor are then taken into account by .fit

a separate .prune or .post_prune method has to be called after fitting

a separate prune_tree or post_prune_tree function takes the tree and returns another pruned tree
do you have any code for this?
Thanks