I've been using the min_weight_fraction_leaf parameter of DecisionTreeClassifier and RandomForestClassifier incorrectly, and I think it's likely other people are making the same mistake.
For example, the documentation for min_weight_fraction_leaf in DecisionTreeClassifier says:
The minimum weighted fraction of the input samples required to be at a leaf node.
It was really unclear to me what the docs meant by "weighted fraction of the input samples". Initially I thought it was a weighting based on the size of the classes or the values given by class_weight. I think a slight change in the parameter description could clear up this confusion. Perhaps something like:
The minimum weighted fraction of the input samples required to be at a leaf node, where weights are determined by sample_weight in the fit() method.
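To make the proposed wording concrete, here is a minimal plain-Python sketch of what "weighted fraction" means when the weights come from sample_weight. The helper names leaf_weight_fraction and leaf_is_allowed are hypothetical and only illustrate the arithmetic; this is not scikit-learn's implementation.

```python
def leaf_weight_fraction(leaf_indices, sample_weight):
    """Share of the total sample weight that falls into one leaf."""
    total_weight = sum(sample_weight)
    leaf_weight = sum(sample_weight[i] for i in leaf_indices)
    return leaf_weight / total_weight


def leaf_is_allowed(leaf_indices, sample_weight, min_weight_fraction_leaf):
    """True if the leaf holds at least the required fraction of total weight."""
    fraction = leaf_weight_fraction(leaf_indices, sample_weight)
    return fraction >= min_weight_fraction_leaf


# Five samples; the last one carries half of the total weight (4 out of 8).
sample_weight = [1.0, 1.0, 1.0, 1.0, 4.0]

heavy_leaf = [4]     # one sample, weighted fraction 4/8
light_leaf = [0, 1]  # two samples, weighted fraction 2/8

print(leaf_is_allowed(heavy_leaf, sample_weight, 0.3))
print(leaf_is_allowed(light_leaf, sample_weight, 0.3))
```

Under these assumed weights, a leaf with a single heavy sample can satisfy the constraint while a leaf with two light samples does not, which is why the source of the weights matters for reading the docs correctly.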
Furthermore, it appears min_weight_fraction_leaf only applies if sample_weight is provided in the call to fit(). If sample_weight is not provided in the call to fit(), min_weight_fraction_leaf is silently ignored. Here, I think min_weight_fraction_leaf should still apply under the assumption that all samples are equally weighted, OR a warning should be given that min_weight_fraction_leaf will not be used since sample_weight was not provided.
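A rough sketch of the first option (assume equal weights when sample_weight is missing) could look like the following. The function resolve_min_weight_leaf is a hypothetical helper written for illustration, not existing scikit-learn code, and it only shows how the absolute weight threshold for a leaf might be derived.

```python
def resolve_min_weight_leaf(min_weight_fraction_leaf, n_samples, sample_weight=None):
    """Turn the fraction into an absolute weight threshold for a leaf.

    Assumption for this sketch: when no sample_weight is given, every
    sample is treated as having weight 1.0 instead of ignoring the
    parameter.
    """
    if sample_weight is None:
        sample_weight = [1.0] * n_samples
    return min_weight_fraction_leaf * sum(sample_weight)


# No weights given: the threshold is a fraction of the sample count.
uniform_threshold = resolve_min_weight_leaf(0.25, n_samples=8)

# Weights given: the threshold is a fraction of the total weight.
weighted_threshold = resolve_min_weight_leaf(
    0.25, n_samples=4, sample_weight=[1.0, 1.0, 2.0, 4.0]
)

print(uniform_threshold, weighted_threshold)
```

With equal weights the threshold reduces to a fraction of the number of samples, so the parameter keeps a sensible meaning even when fit() is called without sample_weight.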
Darwin-15.5.0-x86_64-i386-64bit
Python 3.5.1 |Continuum Analytics, Inc.| (default, Dec 7 2015, 11:24:55)
[GCC 4.2.1 (Apple Inc. build 5577)]
NumPy 1.11.0
SciPy 0.17.1
Scikit-Learn 0.17.1
Also, I would love to make the changes I suggested (if they're deemed worthy), but I have little experience contributing to open-source libraries. Might need a bit of hand-holding if someone would be willing to help me out.
Please submit a PR
On 29 June 2016 at 06:09, Ben wrote (issue text as above): https://github.com/scikit-learn/scikit-learn/issues/6945
I think if min_weight_fraction_leaf is set and no sample_weight is provided, it should either raise an error or assume uniform weights. In that case it's a bit redundant with min_samples_leaf, but I think assuming uniform weights would still be better.
I think this is similar to min_samples_leaf. Instead of requiring an absolute number of samples in each leaf node, min_weight_fraction_leaf provides the option to require a fraction of samples (or weights) in each leaf. Whether the model is using weights for samples depends on the class_weight.