The documentation of default values in many classes is either missing, inconsistently written, or out of date. I would like to gather a few people to work on the default value documentation for every class, as these issues exist in a great many classes. I have been told that default values should be documented as `default=<'value'>`, so I am creating this issue under that assumption.
Here are a few things I have seen in the parameter documentation that should be changed:
If a few people work on a few classes each, then this should be done in no time! These should all be fairly simple fixes.
#### Examples
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
The link above is an example where some parameters say "optional" without indicating a default value, and where the parameters that do indicate default values document them inconsistently.
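To make the target convention concrete, here is a minimal sketch of the desired numpydoc style (the class and parameters are made up for illustration, not taken from scikit-learn):

```python
class Example:
    """A made-up estimator used only to illustrate the documentation style.

    Parameters
    ----------
    n_clusters : int, default=2
        The number of clusters to find.

    linkage : {'ward', 'complete', 'average', 'single'}, default='ward'
        Which linkage criterion to use.
    """

    def __init__(self, n_clusters=2, linkage='ward'):
        self.n_clusters = n_clusters
        self.linkage = linkage
```

The point is that every parameter with a default states it as `default=<value>` and the word "optional" disappears.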
Hello @cgsavard, I would like to work on this. Can I start looking at the AgglomerativeClustering class?
@vachanda Go for it! We can continue posting here which ones we work on so that others know.
Thanks for coordinating this @cgsavard
Note to contributors: please follow the guidelines under: https://scikit-learn.org/stable/developers/contributing.html#guidelines-for-writing-documentation
@cgsavard, is there a list of classes that have discrepancies, or do we have to go through each of them and update them?
@vachanda I do not have a list, unfortunately. I have just been going through the files and seeing what needs to be updated.
I am working on AffinityPropagation, SpectralCoclustering, SpectralBiclustering, and Birch.
I am working on FeatureAgglomeration, KMeans, and MiniBatchKMeans.
Logically speaking, if a param is optional, shouldn't the default be `None` always? Having a parameter with a default value other than `None` suggests it should be required.
If there is a default, this usually means that the literature has found it to be a sensible value, which also suggests that the parameter has an impact on performance and thus shouldn't be optional; the docs should just mention what the default is. Those seem closer to required parameters by definition: we just happened to make a sensible choice for the user, so they can change it or not.
Or more practically speaking, are there currently any optional parameters that we've found which have numeric default values, but for which specifying `None` will raise an exception? That would also suggest that the parameter is actually required, but that a sensible default has been chosen based on literature/research.

Or maybe I've been confusing the meaning of `required` and `optional` all these years? Lol. Would definitely love to help on this either way!
@jmwoloso We were really inconsistent regarding the usage of `optional`, and therefore we recently decided to remove it.
I want to contribute as well. Can I go ahead with this?
@glemaitre ok, that definitely makes sense. So then we're removing the `optional` verbiage altogether, right, while also noting default values in the docstrings?

Should each of these that we find be opened as a separate issue, or how are we staging all of this work, since multiple people are working on multiple things relating to this single issue?
@cyrus303 @jmwoloso You can take a class (a module at maximum) and correct it. The idea is to remove the "optional" and add a default when there is one (there usually is). Since we are touching the documentation, we should make sure that the style of each line follows our new style guide: https://scikit-learn.org/dev/developers/contributing.html#guidelines-for-writing-documentation

You can mention which class/module you are working on and open a PR, to avoid duplicated effort :). Looking forward to reviewing it.
Hey! I will work on the `tree` classes (`tree.DecisionTreeClassifier`, `tree.DecisionTreeRegressor`, `tree.ExtraTreeClassifier` and `tree.ExtraTreeRegressor`).
I will also fix this issue for the `neighbors` module.
I'll take the `ensemble` module.
@glemaitre any preference on `bool` vs. `boolean`? Seeing a mix of both in `ensemble`, even in the same class. Might as well get those in shape while I'm doing defaults.

EDIT: ditto for `int` vs. `integer`. I'm assuming `int` on that one, but wanted to confirm.
EDIT (again): I'm also seeing docstrings with values inconsistent with the `__init__` signature of the class. E.g. for `min_impurity_split` in `RandomForestClassifier`, the `__init__` signature has `min_impurity_split=None` while the docstring says `min_impurity_split : float, (default=0)`. I assume we update the docstring to match the signature, since we'd want to keep the behavior of the class consistent (i.e. we want the same defaults passed in upon instantiation)?
@jmwoloso Could you refer to https://scikit-learn.org/stable/developers/contributing.html#guidelines-for-writing-documentation. Basically, you should default to the Python type name (`bool`, `str`, `int`, `float`).
> the `__init__` signature has `min_impurity_split=None` while the docstring says `min_impurity_split : float, (default=0)`. I would assume update the docstring to match the signature since we'd want to keep the behavior of the class consistent (i.e. we want the same defaults passed in upon instantiation)?
We should match the parameter in the function signature. The default value of this parameter changed and the docstring was not updated.
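As a quick way to see what the docstring should say, the signature default can be read with `inspect` (a sketch with a stand-in function, not the real `RandomForestClassifier.__init__`):

```python
import inspect

def make_forest(min_impurity_split=None):
    """Stand-in for an estimator ``__init__``.

    Parameters
    ----------
    min_impurity_split : float, default=None
        Threshold for early stopping in tree growth.
    """
    return min_impurity_split

# The value documented in the docstring must match the signature default.
default = inspect.signature(make_forest).parameters["min_impurity_split"].default
print(default)  # None
```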
Hi @cgsavard, I'd like to contribute, but this is going to be my first time so I'll need some hand-holding. I'm quite familiar with Python, somewhat handy with text editors, and recently went through the fork -> clone -> edit -> PR workflow tutorial here. Please advise on the next step... Thank you!
Hi @cgsavard ,
Can I please work on Imputer?
Hi @cgsavard, I want to work on the `linear_model` classes.
I am also working on the Neural Network, Decomposition, Feature Extraction, Metrics and Preprocessing classes.
Can someone please check my PR #15964 and see why codecov is failing? This is my first time contributing. Please guide.
Ignore codecov. This is a false positive since we don't touch code. I will review the PR soon.
I just made my first contribution #15988
I will take the `naive_bayes` module.
I just made my first contribution #16019
Hi all, working on `sklearn/neighbors`, thank you.
Contributed to `sklearn/semi_supervised`. Thanks!
Hi @cgsavard, I'd like to contribute too; I will take the `sklearn/svm` module. Thanks
Are there some further edits needed on PR #16042?
@glemaitre in #16105, I had to dig a little deep into the constructs to fetch default values; the docstrings seemed inaccurate and outdated at times.

Also, I tried to use a less ambiguous, more concise and mathematically rigorous way of defining the ranges of parameters. For example, I changed `positive float` to `float in (0, inf]`, or `0 <= shrinkage <= 1` to `float in (0, 1)`. Long story short, I did the best I could to be concise and accurate, but please pay 5% more attention when reviewing this PR. Thanks.
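For example, the interval style could look like this in a docstring (a made-up function and parameter, shown only to illustrate the notation):

```python
def shrink(shrinkage=0.5):
    """Apply a made-up shrinkage transform.

    Parameters
    ----------
    shrinkage : float in (0, 1), default=0.5
        Shrinkage coefficient, strictly between 0 and 1.
    """
    if not 0.0 < shrinkage < 1.0:
        raise ValueError("shrinkage must be in (0, 1)")
    return shrinkage
```

Writing `float in (0, 1)` instead of `positive float` tells the reader at a glance that the endpoints 0 and 1 themselves are rejected.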
@cgsavard, this is a very nice issue for a sprint! If you are ok with that, I'm planning to add it to our sprint list. I have summarized the classes that have already been addressed by a PR, together with the corresponding PRs, here.

Do you mind linking the gist in the issue description? This will make the information available from the beginning. May I also ask you to clarify in the description that each PR should address one file (at most one module) at a time, as explained here? This will really help contributors and reviewers! Thanks a lot!
For those interested in this issue, the command `git grep "optional.*default"` will output the files still affected by this problem (thanks @ogrisel! :) ).
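The same pattern can also be checked from Python, e.g. to test a docstring line you are editing (just a sketch; the `git grep` command above is the authoritative way to list the affected files):

```python
import re

# Regex equivalent to the git grep pattern above
pattern = re.compile(r"optional.*default")

old_style = "alpha : float, optional (default=1.0)"
new_style = "alpha : float, default=1.0"

print(bool(pattern.search(old_style)))  # True: still uses the old style
print(bool(pattern.search(new_style)))  # False: already converted
```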
@cgsavard Hello, I would like to work on `model_selection` @WiMLDS

@lopusz and I want to work on `random_projection.py`. Have fun, everyone!
@adrinjalali @noatamir @WiMLDS

@ETay203 and I would like to work on `mean_shift` at the @WiMLDS_Berlin sprint.
@magda-zielinska and I want to work on pipeline.py
@adrinjalali @noatamir @WiMLDS

@lopusz, @magda-zielinska and I want to work on `kernel_approximation.py`.
I'm going to tackle the _optics.py now
Reopening: was closed by "Fixes" keyword in #16216.
Reopening: was closed by "Fixes" keyword in #16207
I'm going to tackle the sklearn/linear_model/_coordinate_descent.py now
I cleaned base.py and submitted a PR.
I cleaned discriminant_analysis.py and submitted a PR
I will look now at sklearn/gaussian_process/*.py
There is already a long PR for the GPs @lopusz :)
@lopusz my apologies, that PR was touching other issues of the GP module, you can go ahead and work on it if you don't mind :)
@adrinjalali Thank you for keeping an eye on it!
Indeed, I had not scanned the open PRs well enough, so the fact that the GPs are not taken is more of an accident ;)

I will make sure to keep track of what is PRed.

And yes, a PR for the GPs is coming ;)
Is there anything else to be done here?
I am working on sklearn/decomposition/_dict_learning.py
what is left to do? I'm open to help. . .
Figuring out what's left is probably a good place to start helping :)
Hi, I've been looking through to see what's left; I think there are still some updates to make in some of the modules looked at previously.

I was going to work through these, starting with the cluster module, and could raise a PR for each module as I go along?
This is my first contribution so please let me know if I'm not following the process correctly etc.
Thanks!
This is the list of functions, classes and modules left to fix:
sklearn.feature_selection.SelectorMixin
sklearn.config_context
sklearn.set_config
sklearn.calibration.CalibratedClassifierCV
sklearn.cluster.OPTICS
sklearn.cluster.SpectralClustering
sklearn.cluster.affinity_propagation
sklearn.cluster.cluster_optics_dbscan
sklearn.cluster.cluster_optics_xi
sklearn.cluster.compute_optics_graph
sklearn.cluster.mean_shift
sklearn.cluster.spectral_clustering
sklearn.cluster.ward_tree
sklearn.cross_decomposition.CCA
sklearn.cross_decomposition.PLSCanonical
sklearn.cross_decomposition.PLSRegression
sklearn.cross_decomposition.PLSSVD
sklearn.datasets
sklearn.decomposition
sklearn.dummy
sklearn.ensemble.HistGradientBoostingClassifier (experimental)
sklearn.ensemble.HistGradientBoostingRegressor (experimental)
sklearn.feature_extraction.image.grid_to_graph
sklearn.feature_extraction.image.img_to_graph
sklearn.feature_extraction.text.CountVectorizer
sklearn.feature_extraction.text.HashingVectorizer
sklearn.feature_selection
sklearn.impute
sklearn.inspection.partial_dependence
sklearn.inspection.permutation_importance
sklearn.inspection.PartialDependenceDisplay
sklearn.inspection.plot_partial_dependence
sklearn.isotonic.IsotonicRegression
sklearn.isotonic.check_increasing
sklearn.isotonic.isotonic_regression
sklearn.kernel_approximation
sklearn.kernel_ridge
sklearn.linear_model.PassiveAggressiveClassifier
sklearn.linear_model.LassoLars
sklearn.linear_model.OrthogonalMatchingPursuit
sklearn.linear_model.HuberRegressor
sklearn.linear_model.RANSACRegressor
sklearn.linear_model.TheilSenRegressor
sklearn.linear_model.PassiveAggressiveRegressor
sklearn.linear_model.orthogonal_mp
sklearn.linear_model.orthogonal_mp_gram
sklearn.manifold
sklearn.metrics (except sklearn.metrics.confusion_matrix, sklearn.metrics.roc_auc_score, sklearn.metrics.max_error, sklearn.metrics.mean_poisson_deviance, sklearn.metrics.mean_gamma_deviance, sklearn.metrics.mean_tweedie_deviance, sklearn.metrics.plot_confusion_matrix, sklearn.metrics.plot_precision_recall_curve)
sklearn.mixture
sklearn.model_selection.GridSearchCV
sklearn.model_selection.ParameterGrid
sklearn.model_selection.ParameterSampler
sklearn.model_selection.RandomizedSearchCV
sklearn.model_selection.fit_grid_point
sklearn.multiclass
sklearn.multioutput
sklearn.neural_network
sklearn.preprocessing
sklearn.random_projection
sklearn.tree.export_graphviz
sklearn.tree.export_text
sklearn.tree.plot_tree
sklearn.utils
Hope I am not missing anything.
Hi. I'll try to make a pass over the feature_selection documentation.
We take the sklearn.mixture part
Taking cross_decomposition part
For the 2020 Scikit-Learn Sprint, @icoder18 and I are taking the sklearn.random_projection part
@adrinjalali we completed sklearn/mixture
Working on the sklearn.linear_model for the sprint with @genvalen
Take sklearn.calibration.CalibratedClassifierCV
Working on this for sklearn.utils.validation
Next we'll be tackling sklearn.utils.random
working on sklearn.impute
Working on sklearn.tree.plot_tree
Table 14 will take sklearn.neural_network
Take sklearn.kernel_approximation
Taking sklearn.inspection
Table 14 will take sklearn.preprocessing
Taking datasets
Taking sklearn.mixture #17509
List updated.
Thank you all!
Taking sklearn.metrics for sprint
Taking model_selection module
@glemaitre Can we update the description of this issue to note that it would be best to submit one file at a time?
Hello, I would like to contribute. It is my first time though... and it is not clear to me how I can find out on which modules there is still work to be done? Thanks!
https://github.com/scikit-learn/scikit-learn/issues/15761#issuecomment-639461778 contains the list of modules left to fix.
Thanks. Take sklearn.decomposition then.
I am working on 'sklearn.isotonic.isotonic_regression'
I am working on 'sklearn.multiclass.py'
Hi, may I try to take what remains of `sklearn.tree`? This would be my first time contributing as well.
Thanks for checking in, great to have your help! Please proceed; I think all of our sprint updates have been closed out.
Hi, I would like to make my first contribution. Can I take sklearn.multioutput?
I'll continue with sklearn.utils, starting with _encode.py
I am working on sklearn/decomposition/_dict_learning.py
I'm working on sklearn.kernel_ridge
in the sprint
Hi, I will start working on sklearn.feature_extraction.image.img_to_graph
I am working on sklearn.feature_extraction.text.CountVectorizer
I am working on sklearn.kernel_ridge
I am working on sklearn.ensemble.HistGradientBoostingRegressor
> "I am working on this"

Working on what, @Hoda1394?

@TahiriNadia corrected.
@cgsavard Hey, Can I work on this? I'm a first-timer
I'll work on the files in `sklearn.datasets`.
Can I work on `sklearn.linear_model._least_angle.py`?

@glemaitre I'm working on `sklearn.linear_model._least_angle.py` and I found an inconsistency in the use of `method='lar'`: sometimes it is written `lars`, sometimes `lar`. This inconsistency is also in the code (not only in the documentation). I can see that `lars` is the right one; could you confirm it, and I will make a PR.
Working on `sklearn/ensemble/_hist_gradient_boosting/binning.py`.

Files that need changes:
sklearn/_config.py
sklearn/dummy.py
sklearn/multioutput.py
sklearn/linear_model/_huber.py
sklearn/linear_model/_theil_sen.py
sklearn/linear_model/_ridge.py
sklearn/linear_model/_omp.py
sklearn/linear_model/_sag.py
sklearn/externals/_lobpcg.py
sklearn/utils/extmath.py
sklearn/utils/__init__.py
sklearn/utils/graph.py
sklearn/utils/_mocking.py
sklearn/utils/sparsefuncs.py
sklearn/neighbors/_base.py
sklearn/gaussian_process/_gpc.py
sklearn/gaussian_process/kernels.py
sklearn/model_selection/_validation.py
~sklearn/decomposition/_fastica.py~
~sklearn/decomposition/_dict_learning.py~
~sklearn/decomposition/_factor_analysis.py~
~sklearn/decomposition/_incremental_pca.py~
~sklearn/decomposition/_lda.py~
~sklearn/decomposition/_pca.py~
~sklearn/decomposition/_truncated_svd.py~
~sklearn/decomposition/_sparse_pca.py~
~sklearn/decomposition/_nmf.py~
sklearn/manifold/_mds.py
sklearn/manifold/_spectral_embedding.py
sklearn/manifold/_t_sne.py
sklearn/ensemble/_hist_gradient_boosting/grower.py
sklearn/ensemble/_hist_gradient_boosting/binning.py
sklearn/metrics/_ranking.py
sklearn/tree/_classes.py
sklearn/preprocessing/_discretization.py
sklearn/preprocessing/_encoders.py line 620
sklearn/neural_network/_multilayer_perceptron.py line 1054
sklearn/covariance/_robust_covariance.py
Please check whether someone is already working, or has worked, on the file you choose.
@sadakmed, for all the "decomposition files", there is an ongoing pull request #17739.
Working on `gaussian_process.GaussianProcessRegressor` and `neighbors._base.py`.
Hi, I am new and I would like to start contributing. Do you still need help on this issue? Is there any file you still need help with?
Hey @boricles!
Have a look at https://github.com/scikit-learn/scikit-learn/issues/15761#issuecomment-639461778 for a list of the modules still to be fixed.
@alfaro96 thanks. I did a quick look just now. I will select a module tonight, and work on it.
I am working on sklearn/config_context
Hey, thought I'd see if I could help with the docs.
@alfaro96 I'd like to work on `sklearn.feature_extraction.text.CountVectorizer`, if it hasn't already been taken, especially because I've personally encountered some pitfalls when working with Vectorizers in the past.

Also, I noticed that although `sklearn.model_selection.learning_curve` was updated, there's an out-of-date tutorial using the old documentation; should I leave it be, or is it worth updating?
Hi @alfaro96,

After edits: I see `sklearn.config_context` and `sklearn.set_config` from `sklearn/_config.py` were fixed, so they can be checked off the task list.

I would like to work on `sklearn.utils`. I saw only one instance of parameter documentation where 'optional' is used. That means that I need to fix only that instance, correct? It is in `sklearn.utils._mocking.py`.
> Hey, thought I'd see if I could help with the docs.

Hey @madprogramer,

> @alfaro96 I'd like to work on `sklearn.feature_extraction.text.CountVectorizer`, if it hasn't already been taken, especially because I've personally encountered some pitfalls when working with Vectorizers in the past.

~I have had a look at the checklist and the `sklearn.feature_extraction.text.CountVectorizer` reference, and it does not seem to be fixed. A PR would be welcome.~

Edit: `sklearn.feature_extraction.text.CountVectorizer` is already fixed.

> Also, I noticed that although `sklearn.model_selection.learning_curve` was updated, there's an out-of-date tutorial using the old documentation; should I leave it be, or is it worth updating?

It is worth updating, although this should be done in a separate PR.

Thank you!
Hi @alfaro96,
Hey @haiatn,
after edits:
I seesklearn.config_context
andsklearn.set_config
fromsklearn.config_config.py
were fixed so it can be checked out from the task list.
I have updated the checklist.
I would like to work on
sklearn.utils
. I saw only once instance of parameter documentation where 'optional' is used in. That means that I need to fix only that instance, correct? It is insklearn.utils._mocking.py
That is the idea, although the classes in the sklearn.utils._mocking.py
file are not part of the public API, so I do not think that it is worthy to update them.
Nevertheless, it would be nice if you could work in any of the other functions, classes and modules that are pending to be fixed.
Thank you!
I looked at the checklist. From what I saw, the following can be checked off the checklist:

- `sklearn.feature_extraction.image.img_to_graph`
- `sklearn.isotonic.IsotonicRegression`
- `sklearn.isotonic.check_increasing`

I did not find the file for `sklearn.ensemble.HistGradientBoostingRegressor`, but all of `sklearn.ensemble` is OK.

Can I work on `sklearn.manifold._spectral_embedding` and `sklearn.feature_extraction.text.HashingVectorizer`? I will do them in separate PRs. I think they are the only files left that need fixing (assuming `sklearn.feature_extraction.text.CountVectorizer` is taken).
> I looked at the checklist. From what I saw, the following can be checked off the checklist:
> sklearn.feature_extraction.image.img_to_graph
> sklearn.isotonic.IsotonicRegression
> sklearn.isotonic.check_increasing

Thank you @haiatn, I have updated the checklist.

> I did not find the file for sklearn.ensemble.HistGradientBoostingRegressor, but all of sklearn.ensemble is OK.

`sklearn.ensemble.HistGradientBoostingClassifier` and `sklearn.ensemble.HistGradientBoostingRegressor` are in this file: `scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py`. However, they have already been fixed.

> Can I work on sklearn.manifold._spectral_embedding and sklearn.feature_extraction.text.HashingVectorizer? I will do them in separate PRs. I think they are the only files left that need fixing (assuming sklearn.feature_extraction.text.CountVectorizer is taken).

I have had a look at the `sklearn.manifold` module and `sklearn.feature_extraction.text.HashingVectorizer`, and they have already been fixed (I have updated the checklist accordingly).

Nevertheless, there are several functions in the `sklearn.utils` module that should still be fixed.

Thank you @haiatn, we really appreciate your help!
I will now work on `sklearn.utils._estimator_html_repr`, `sklearn.utils.deprecation` and `sklearn.utils._testing`.

That will finish sklearn.utils; those are the only 3 files I found that need fixing.
Hey @alfaro96, could you review my open pull requests? I think they are the last ones.

Hey @haiatn!

I have already had a look at your open PRs.

Thank you!
Now that we merged what's left of sklearn.utils, and it was the last item on the checklist, did we finish?
There is one last open pull request #18025, then this issue could be eventually closed.
Hello,
I want to start contributing. Is there any class pending for fixing doc of default values? If any then I can take it up.
Hey, I'm new to open source and looking forward to fixing docs. Is there by any chance something left that needs to be fixed?