Scikit-learn: Fix documentation of default values in all classes

Created on 2 Dec 2019 · 118 comments · Source: scikit-learn/scikit-learn

Description

The documentation of default values in many classes is missing, inconsistently written, or out of date. I would like to gather a few people to work on the default-value documentation for every class, as there are a ton of classes where these issues exist. I have been told that default values should be documented as "default=<'value'>", so I am creating this issue under that assumption.

Solution

Here are a few things that I have seen for the parameters which should be changed:

  • Parameters that do not mention a default at all should be checked against the code, as a few are missing this entirely
  • "optional" should be changed to "default=<'value'>"
  • Make sure default values are documented consistently within each class, i.e. change everything to the format "default=<'value'>"
  • Modify a single file per PR

If a few people work on a few classes each, then this should be done in no time! These should all be fairly simple fixes.

#### Examples
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
The link above is an example where some parameters say "optional" without indicating the default, and the parameters that do indicate default values document them inconsistently.
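To make the target format concrete, here is a hypothetical numpydoc-style docstring in the requested convention. The class and parameter names below are made up for illustration; they are not an actual scikit-learn class.

```python
# Old style to avoid:  n_clusters : int, optional (default=2)
# New style:           n_clusters : int, default=2
class ToyEstimator:
    """A made-up estimator illustrating the default-value convention.

    Parameters
    ----------
    n_clusters : int, default=2
        The number of clusters to find.
    linkage : {'ward', 'complete', 'average', 'single'}, default='ward'
        Which linkage criterion to use.
    connectivity : array-like or callable, default=None
        Connectivity constraints; None means no structure is imposed.
    """

    def __init__(self, n_clusters=2, linkage='ward', connectivity=None):
        self.n_clusters = n_clusters
        self.linkage = linkage
        self.connectivity = connectivity
```

The key point is that the documented default matches the `__init__` signature exactly, and the word "optional" no longer appears.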

Sprint good first issue


All 118 comments

Hello @cgsavard, I would like to work on this. Can I start looking at the AgglomerativeClustering class?

@vachanda Go for it! We can continue posting here which ones we work on so that others know.

Thanks for coordinating this @cgsavard

Note to contributors: please follow the guidelines under: https://scikit-learn.org/stable/developers/contributing.html#guidelines-for-writing-documentation

@cgsavard, Is there a list of classes that have discrepancies or do we have to go through each of them and update them?

@vachanda I do not have a list, unfortunately. I have just been going through the files and seeing what needs to be updated.

I am working on AffinityPropagation, SpectralCoclustering, SpectralBiclustering, and Birch.

I am working on FeatureAgglomeration, KMeans, and MiniBatchKMeans.

Logically speaking, if a param is optional, shouldn't the default be None always? Having a parameter with a default value other than None suggests it should be required.

If there is a default, this usually means that the literature has found this to be a sensible default value which also suggests that this parameter has an impact on performance and thus it shouldn't be optional, but should just mention what the default is. Those seem closer to required parameters by definition, we just happened to make a sensible choice for the user so they can change it or not.

Or more practically speaking, are there currently any optional parameters that we've found which have numeric default values, but for which specifying None will raise an exception? That would also suggest that the parameter is actually required, but that a sensible default has been chosen based on literature/research.

Or maybe I've been confusing the meaning of required and optional all these years? Lol. Would definitely love to help on this either way!

@jmwoloso We were really inconsistent regarding the usage of optional and therefore we recently decided to remove it.

I want to contribute as well. Can I go ahead with this?

@glemaitre ok, that definitely makes sense. So then we're removing the optional verbiage altogether, right, while also noting default values in the docstrings?

Should each of these that we find be opened as a separate issue, or how are we staging all of this work, since multiple people are working on multiple things related to this single issue?

@cyrus303 @jmwoloso You can take a class (a module at most) and correct it. The idea is to remove the optional and add a default when there is one (there usually is one). Since we are touching the documentation, we should make sure that the style of each line follows our new style guide: https://scikit-learn.org/dev/developers/contributing.html#guidelines-for-writing-documentation

You can mention which class/module you are working on and link to a PR, to avoid duplicate effort :). Looking forward to reviewing it.

Hey! I will work on tree classes (tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier and tree.ExtraTreeRegressor).

I will also fix this issue for the neighbors module.

I'll take the ensemble module.

@glemaitre any preference on bool vs. boolean? seeing a mix of both in ensemble, even in the same class. might as well get those in shape while i'm doing defaults.

EDIT:

ditto for int vs integer. I'm assuming int on that one, but wanted to confirm.

EDIT (again):

also seeing docstrings with inconsistent values relative to the __init__ signature for that class, e.g.:

min_impurity_split for RandomForestClassifier

the __init__ signature has min_impurity_split=None while the docstring says min_impurity_split : float, (default=0). I would assume we update the docstring to match the signature, since we'd want to keep the behavior of the class consistent (i.e. we want the same defaults passed in upon instantiation)?

@jmwoloso Could you refer to https://scikit-learn.org/stable/developers/contributing.html#guidelines-for-writing-documentation. Basically, you should default to the Python type name (bool, str, int, float).

the __init__ signature has min_impurity_split=None while the docstring says min_impurity_split : float, (default=0). I would assume we update the docstring to match the signature, since we'd want to keep the behavior of the class consistent (i.e. we want the same defaults passed in upon instantiation)?

We should match the parameter in the function signature. This parameter's default value has changed and the docstring was not updated.
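One way to spot such signature/docstring mismatches is to compare each parameter's default in the `__init__` signature against the `default=` annotation parsed out of the docstring. This is a rough stdlib sketch, not scikit-learn tooling: the helper names and the regex are my own, and the `Example` class below is a toy with a deliberate mismatch.

```python
import inspect
import re


def signature_defaults(cls):
    """Map each __init__ parameter to its actual default value."""
    sig = inspect.signature(cls.__init__)
    return {name: p.default
            for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty}


def docstring_defaults(cls):
    """Map each documented parameter to the default claimed in its docstring."""
    pattern = re.compile(r"^\s*(\w+)\s*:\s*.*default=([^\s,)]+)", re.MULTILINE)
    return dict(pattern.findall(cls.__doc__ or ""))


class Example:
    """Toy class with one deliberate mismatch.

    Parameters
    ----------
    alpha : float, default=0.5
        Regularization strength.
    beta : float, default=0
        Documented default disagrees with the signature below.
    """

    def __init__(self, alpha=0.5, beta=None):
        self.alpha, self.beta = alpha, beta


# Report parameters whose documented default disagrees with the signature.
for name, actual in signature_defaults(Example).items():
    claimed = docstring_defaults(Example).get(name)
    if claimed is not None and claimed not in (repr(actual), str(actual)):
        print(f"{name}: docstring says default={claimed}, signature has {actual!r}")
# prints: beta: docstring says default=0, signature has None
```

The string comparison is deliberately loose (it compares against both `repr` and `str` of the actual default), so it only flags obvious disagreements; a real audit would still need a human to confirm each hit.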

Hi @cgsavard , I'd like to contribute but this is going to be my first time so need some hand holding. I'm quite familiar with python, somewhat handy with text editors and recently went through the fork -> clone -> edit -> PR workflow tutorial here. Please advise next step... Thank you!

Hi @cgsavard ,
Can I please work on Imputer?

Hi @cgsavard , I want to work on linear_model class.

I am also working on Neural Network, Decomposition, Feature Extraction, Metrics and Preprocess classes.

Can someone please check my PR #15964 and see why codecov is failing? This is my first contribution. Please guide me.

Ignore codecov. This is a false positive since we don't touch code. I will review the PR soon.


I just made my first contribution #15988

I will take the naive_bayes module.

I just made my first contribution #16019

Hi All, working on sklearn/neighbors, thank you.

Contributed to sklearn/semi_supervised. Thanks!

Hi @cgsavard , I'd like to contribute too, i will take sklearn/svm module. Thanks

Contributed to sklearn/semi_supervised. Thanks!
Are there some further edits needed on PR #16042?

@glemaitre in #16105, I had to dig a little deep into constructs to fetch default values, docstrings seemed inaccurate and outdated at times.

Also, I tried to use a less ambiguous, concise, and mathematically rigorous way of defining parameter ranges. For example, I changed positive float to float in (0, inf], or 0 <= shrinkage <= 1 to float in (0, 1). Long story short, I did the best I could to be concise and accurate, but please pay 5% more attention when reviewing this PR. Thanks.

@cgsavard, this is a very nice issue for a sprint! If you are ok with that, I'm planning to add it to our sprint list. I have summarized the classes that have already been addressed by a PR, and their corresponding PRs, here.
Do you mind linking the gist in the issue description? This will make the information available from the beginning. May I also ask you to clarify in the description that each PR should address one file (at most one module) at a time, as explained here? This will really help contributors and reviewers! Thanks a lot!

For those interested in this issue, the command

git grep "optional.*default"

will output the files still affected by this problem (thanks @ogrisel! :) ).
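For those who prefer staying in Python, the same search can be done with the stdlib. This is just a sketch: the default root path and helper name are assumptions, and it searches whole file contents rather than docstrings specifically, exactly like the `git grep` above.

```python
import re
from pathlib import Path

# Same pattern as the git grep: lines still using the old "optional ... default" wording.
PATTERN = re.compile(r"optional.*default")


def affected_files(root="sklearn"):
    """Yield .py files under `root` that still match the old wording."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if PATTERN.search(text):
            yield path


# Usage (from a scikit-learn checkout):
#     for p in affected_files():
#         print(p)
```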

@cgsavard Hello, I would like to work on model_selection @WiMLDS

@lopusz and I want to work on random_projection.py

Have fun, everyone!

@adrinjalali @noatamir @WiMLDS

@ETay203 and I would like to work on mean_shift @WiMLDS_Berlin sprint.

@magda-zielinska and I want to work on pipeline.py

@adrinjalali @noatamir @WiMLDS

@lopusz and @magda-zielinska and I want to work on kernel_approximation.py

I'm going to tackle the _optics.py now

Reopening: was closed by "Fixes" keyword in #16216.

Reopening: was closed by "Fixes" keyword in #16207

I'm going to tackle the sklearn/linear_model/_coordinate_descent.py now

I cleaned base.py and submitted a PR

I cleaned discriminant_analysis.py and submitted a PR

I will look now at sklearn/gaussian_process/*.py

There is already a long PR for the GPs @lopusz :)

@lopusz my apologies, that PR was touching other issues of the GP module, you can go ahead and work on it if you don't mind :)

@adrinjalali Thank you for keeping an eye on it!

Indeed, I have not scanned the open PRs well enough, so the fact that GPs are not taken is more of an accident ;)

I will make sure to keep track of what is PRed.

And yes, a PR for the GPs is coming ;)

Is there anything else to be done here?

I am working on sklearn/decomposition/_dict_learning.py

What is left to do? I'm open to help...

Figuring out what's left is probably a good place to start helping :)

Hi, I've been looking through to see what's left, I think there are still some updates to make in some of the modules looked at previously.
I was going to work through these, starting with the cluster module and could raise a PR for each module as I go along?
This is my first contribution so please let me know if I'm not following the process correctly etc.
Thanks!

This is the list of functions, classes and modules left to fix:

  • [x] sklearn.feature_selection.SelectorMixin
  • [x] sklearn.config_context
  • [x] sklearn.set_config
  • [x] sklearn.calibration.CalibratedClassifierCV
  • [x] sklearn.cluster.OPTICS
  • [x] sklearn.cluster.SpectralClustering
  • [x] sklearn.cluster.affinity_propagation
  • [x] sklearn.cluster.cluster_optics_dbscan
  • [x] sklearn.cluster.cluster_optics_xi
  • [x] sklearn.cluster.compute_optics_graph
  • [x] sklearn.cluster.mean_shift
  • [x] sklearn.cluster.spectral_clustering
  • [x] sklearn.cluster.ward_tree
  • [x] sklearn.cross_decomposition.CCA
  • [x] sklearn.cross_decomposition.PLSCanonical
  • [x] sklearn.cross_decomposition.PLSRegression
  • [x] sklearn.cross_decomposition.PLSSVD
  • [x] sklearn.datasets
  • [x] sklearn.decomposition
  • [x] sklearn.dummy
  • [x] sklearn.ensemble.HistGradientBoostingClassifier (experimental)
  • [x] sklearn.ensemble.HistGradientBoostingRegressor (experimental)
  • [x] sklearn.feature_extraction.image.grid_to_graph
  • [x] sklearn.feature_extraction.image.img_to_graph
  • [x] sklearn.feature_extraction.text.CountVectorizer
  • [x] sklearn.feature_extraction.text.HashingVectorizer
  • [x] sklearn.feature_selection
  • [x] sklearn.impute
  • [x] sklearn.inspection.partial_dependence
  • [x] sklearn.inspection.permutation_importance
  • [x] sklearn.inspection.PartialDependenceDisplay
  • [x] sklearn.inspection.plot_partial_dependence
  • [x] sklearn.isotonic.IsotonicRegression
  • [x] sklearn.isotonic.check_increasing
  • [x] sklearn.isotonic.isotonic_regression
  • [x] sklearn.kernel_approximation
  • [x] sklearn.kernel_ridge
  • [x] sklearn.linear_model.PassiveAggressiveClassifier
  • [x] sklearn.linear_model.LassoLars
  • [x] sklearn.linear_model.OrthogonalMatchingPursuit
  • [x] sklearn.linear_model.HuberRegressor
  • [x] sklearn.linear_model.RANSACRegressor
  • [x] sklearn.linear_model.TheilSenRegressor
  • [x] sklearn.linear_model.PassiveAggressiveRegressor
  • [x] sklearn.linear_model.orthogonal_mp
  • [x] sklearn.linear_model.orthogonal_mp_gram
  • [x] sklearn.manifold
  • [x] sklearn.metrics (except sklearn.metrics.confusion_matrix, sklearn.metrics.roc_auc_score, sklearn.metrics.max_error sklearn.metrics.mean_poisson_deviance, sklearn.metrics.mean_gamma_deviance, sklearn.metrics.mean_tweedie_deviance, sklearn.metrics.plot_confusion_matrix, sklearn.metrics.plot_precision_recall_curve)
  • [x] sklearn.mixture
  • [x] sklearn.model_selection.GridSearchCV
  • [x] sklearn.model_selection.ParameterGrid
  • [x] sklearn.model_selection.ParameterSampler
  • [x] sklearn.model_selection.RandomizedSearchCV
  • [x] sklearn.model_selection.fit_grid_point
  • [x] sklearn.multiclass
  • [x] sklearn.multioutput
  • [x] sklearn.neural_network
  • [x] sklearn.preprocessing
  • [x] sklearn.random_projection
  • [x] sklearn.tree.export_graphviz
  • [x] sklearn.tree.export_text
  • [x] sklearn.tree.plot_tree
  • [x] sklearn.utils

Hope I am not missing anything.

Hi. I'll go try make a pass in the feature_selection documentation

We take the sklearn.mixture part

Taking cross_decomposition part

For the 2020 Scikit-Learn Sprint, @icoder18 and I are taking the sklearn.random_projection part

@adrinjalali we completed sklearn/mixture

Working on the sklearn.linear_model for the sprint with @genvalen

Take sklearn.calibration.CalibratedClassifierCV

Working on this for sklearn.utils.validation

Next we'll be tackling sklearn.utils.random

working on sklearn.impute

Working on sklearn.tree.plot_tree

Table 14 will take sklearn.neural_network

Take sklearn.kernel_approximation

Taking sklearn.inspection

Table 14 will take sklearn.preprocessing

Taking datasets

Taking sklearn.mixture #17509

List updated.

Thank you all!

Taking sklearn.metrics for sprint

Taking model_selection module

@glemaitre Can we update the description of this issue to note that it would be best to submit one file at a time?

Hello, I would like to contribute. It is my first time though, and it is not clear to me how I can know which modules still have work to be done. Thanks!

https://github.com/scikit-learn/scikit-learn/issues/15761#issuecomment-639461778 contains the list of modules left to fix.

Thanks. Take sklearn.decomposition then.

I am working on 'sklearn.isotonic.isotonic_regression'

I am working on 'sklearn.multiclass.py'

Hi, may I try to take the remaining work on sklearn.tree? This would be my first time contributing as well.

Thanks for checking in, great to have your help! Please proceed; I think all of our sprint updates have been closed out.


Hi, I would like to make my first contribution. Can I take sklearn.multioutput?

I'll continue with sklearn.utils, starting with _encode.py

I am working on sklearn/decomposition/_dict_learning.py

I'm working on sklearn.kernel_ridge in the sprint

Hi, I will start working on sklearn.feature_extraction.image.img_to_graph

I am working on sklearn.feature_extraction.text.CountVectorizer

I am working on sklearn.kernel_ridge

I am working on sklearn.ensemble.HistGradientBoostingRegressor

"I am working on this"

on this? @Hoda1394

"I am working on this"

on this? @Hoda1394

@TahiriNadia corrected.

@cgsavard Hey, Can I work on this? I'm a first-timer

I'll work on the files in sklearn.datasets.

Can I work on sklearn.linear_model._least_angle.py?

@glemaitre I'm working on sklearn.linear_model._least_angle.py and I found an inconsistency in the use of method='lar': sometimes it is written lars and sometimes lar. This inconsistency is also in the code (not only in the documentation). I can see that lars is the right one; could you confirm it, and I will make a PR.

working on 'sklearn/ensemble/_hist_gradient_boosting/binning.py'

Files that need changes:

sklearn/_config.py
sklearn/dummy.py
sklearn/multioutput.py
sklearn/linear_model/_huber.py
sklearn/linear_model/_theil_sen.py
sklearn/linear_model/_ridge.py
sklearn/linear_model/_omp.py
sklearn/linear_model/_sag.py
sklearn/externals/_lobpcg.py
sklearn/utils/extmath.py
sklearn/utils/__init__.py
sklearn/utils/graph.py
sklearn/utils/_mocking.py
sklearn/utils/sparsefuncs.py
sklearn/neighbors/_base.py
sklearn/gaussian_process/_gpc.py
sklearn/gaussian_process/kernels.py
sklearn/model_selection/_validation.py
~sklearn/decomposition/_fastica.py~
~sklearn/decomposition/_dict_learning.py~
~sklearn/decomposition/_factor_analysis.py~
~sklearn/decomposition/_incremental_pca.py~
~sklearn/decomposition/_lda.py~
~sklearn/decomposition/_pca.py~
~sklearn/decomposition/_truncated_svd.py~
~sklearn/decomposition/_sparse_pca.py~
~sklearn/decomposition/_nmf.py~
sklearn/manifold/_mds.py
sklearn/manifold/_spectral_embedding.py
sklearn/manifold/_t_sne.py
sklearn/ensemble/_hist_gradient_boosting/grower.py
sklearn/ensemble/_hist_gradient_boosting/binning.py
sklearn/metrics/_ranking.py
sklearn/tree/_classes.py
sklearn/preprocessing/_discretization.py
sklearn/preprocessing/_encoders.py line 620
sklearn/neural_network/_multilayer_perceptron.py line 1054
sklearn/covariance/_robust_covariance.py

Please check whether someone is already working on, or has already worked on, the file you chose.

@sadakmed, for all the "decomposition files", there is an ongoing pull request #17739.

working on "gaussian_process.GaussianProcessRegressor" and "neighbors._base.py"

Hi, I am new, and I would like to start contributing. Do you still need some help on this issue? is there any file you still need help with?

Hey @boricles!

Have a look to https://github.com/scikit-learn/scikit-learn/issues/15761#issuecomment-639461778 for a list with the modules still to be fixed.

@alfaro96 thanks. I did a quick look just now. I will select a module tonight, and work on it.

I am working on sklearn/config_context

Hey, thought I'd see if I could help with the docs.

@alfaro96 I'd like to work on sklearn.feature_extraction.text.CountVectorizer, if it hasn't already been taken, especially because I've personally encountered some pitfalls when working with Vectorizers in the past.

Also, I noticed that although sklearn.model_selection.learning_curve was updated, there's an out-of-date tutorial using the old documentation, should I leave it be? Or is it worth updating?

Hi @alfaro96,

after edits:
I see sklearn.config_context and sklearn.set_config from sklearn/_config.py were fixed, so they can be checked off the task list.

I would like to work on sklearn.utils. I saw only one instance of parameter documentation where 'optional' is used. That means I need to fix only that instance, correct? It is in sklearn.utils._mocking.py

Hey, thought I'd see if I could help with the docs.

Hey @madprogramer,

@alfaro96 I'd like to work on sklearn.feature_extraction.text.CountVectorizer, if it hasn't already been taken, especially because I've personally encountered some pitfalls when working with Vectorizers in the past.

~I have had a look at the checklist and the sklearn.feature_extraction.text.CountVectorizer reference, and it does not seem to be fixed. A PR would be welcome.~

Edit: The sklearn.feature_extraction.text.CountVectorizer is already fixed.

Also, I noticed that although sklearn.model_selection.learning_curve was updated, there's an out-of-date tutorial using the old documentation, should I leave it be? Or is it worth updating?

It is worth updating, although this should be done in a separate PR.

Thank you!

Hi @alfaro96,

Hey @haiatn,

after edits:
I see sklearn.config_context and sklearn.set_config from sklearn/_config.py were fixed, so they can be checked off the task list.

I have updated the checklist.

I would like to work on sklearn.utils. I saw only one instance of parameter documentation where 'optional' is used. That means I need to fix only that instance, correct? It is in sklearn.utils._mocking.py

That is the idea, although the classes in the sklearn.utils._mocking.py file are not part of the public API, so I do not think it is worth updating them.

Nevertheless, it would be nice if you could work on any of the other functions, classes, and modules that are still pending.

Thank you!

I looked at the checklist. From what I saw, the following can be checked off:

  • sklearn.feature_extraction.image.img_to_graph
  • sklearn.isotonic.IsotonicRegression
  • sklearn.isotonic.check_increasing
  • I did not find the file sklearn.ensemble.HistGradientBoostingRegressor but all of sklearn.ensemble is OK

Can I work on sklearn.manifold._spectral_embedding and sklearn.feature_extraction.text.HashingVectorizer? I will do them in separate PRs. I think they are the only files left that need fixing (assuming sklearn.feature_extraction.text.CountVectorizer is taken).

I looked at the checklist. From what I saw the following can be checked from the checklist:

  • sklearn.feature_extraction.image.img_to_graph
  • sklearn.isotonic.IsotonicRegression
  • sklearn.isotonic.check_increasing

Thank you @haiatn, I have updated the checklist.

  • I did not find the file sklearn.ensemble.HistGradientBoostingRegressor but all of sklearn.ensemble is OK

The sklearn.ensemble.HistGradientBoostingClassifier and sklearn.ensemble.HistGradientBoostingRegressor are in this file: scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py. However, they have been already fixed.

Can I work on sklearn.manifold._spectral_embedding and sklearn.feature_extraction.text.HashingVectorizer? I will do them in separate PRs. I think they are the only files left that need fixing (assuming sklearn.feature_extraction.text.CountVectorizer is taken).

I have had a look at the sklearn.manifold module and sklearn.feature_extraction.text.HashingVectorizer, and they have already been fixed (I have updated the checklist accordingly).

Nevertheless, there are several functions in the sklearn.utils module that should be still fixed.

Thank you @haiatn, we really appreciate your help!

I will now work on sklearn.utils._estimator_html_repr, sklearn.utils.deprecation and sklearn.utils._testing

I will finish sklearn.utils. There are only 3 files I found that need fixing.

hey @alfaro96 ,
could you review my open pull requests? I think they are the last ones.

#18360 #18385 #18386

Hey @haiatn!

I have already had a look at your open PRs.

Thank you!

Now that we merged what's left of sklearn.utils, and it was the last item on the checklist, are we finished?

There is one last open pull request, #18025; then this issue can eventually be closed.

Hello,
I want to start contributing. Is there any class whose default-value documentation still needs fixing? If so, I can take it up.

Hey, I'm new to open source and looking forward to fixing docs. Is anything left that still needs to be fixed?
