We recently added a Glossary to our documentation, which, among other things, describes common parameters. We should now replace the descriptions of random_state
parameters to make them more concise and informative (see #10415). For example, instead of:
    random_state : int, RandomState instance or None, optional, default: None
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

in both KMeans and MiniBatchKMeans, we might have:

KMeans:

    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.

MiniBatchKMeans:

    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization and
        random reassignment.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.
The description should therefore focus on the impact random_state has on the algorithm.
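To make that concrete, here is a minimal sketch (the helper `_init_centroids_sketch` is made up for illustration and is not the actual KMeans code) of how random_state typically reaches an algorithm through `sklearn.utils.check_random_state`; the randomised step it drives is what the docstring should name:

```python
# Minimal sketch: random_state is normalised to a RandomState object and
# then drives one specific randomised step, here centroid initialization.
import numpy as np
from sklearn.utils import check_random_state


def _init_centroids_sketch(X, n_clusters, random_state=None):
    """Pick n_clusters rows of X at random as initial centroids."""
    rng = check_random_state(random_state)  # int, RandomState instance or None
    indices = rng.permutation(X.shape[0])[:n_clusters]
    return X[indices]


X = np.arange(20, dtype=float).reshape(10, 2)
print(_init_centroids_sketch(X, n_clusters=3, random_state=0))
```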
Contributors interested in making this change should initially take on one module at a time.
The list of estimators to be modified is the following:
List of files to modify, found using kwinata's script:
[x] [sklearn/ensemble/_hist_gradient_boosting/binning.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/binning.py) - 37, 112
[x] [sklearn/ensemble/_bagging.py](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_bagging.py) - 503, 902
Hi @jnothman, Can I take this issue? Thanks
Claim a module/subpackage and have a go...
@jnothman I am sorry for being naive, but can you elaborate on what you mean by module/subpackage? Are you referring to a sub-package like KMeans, for instance?
I think what @jnothman means is to just start with one file, for example sklearn/cluster/k_means_.py, update the random_state docstring as in the top post, and open a PR. Roughly, the edited parameter entry would look like the sketch below.
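For reference, an abridged sketch of where that parameter description lands in the class docstring (the rest of the class and its other parameters are elided):

```python
class KMeans:
    """K-Means clustering.

    Parameters
    ----------
    random_state : int, RandomState instance, default=None
        Determines random number generation for centroid initialization.
        Pass an int for reproducible results across multiple function calls.
        See :term:`Glossary <random_state>`.
    """
```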
a subpackage is something like sklearn.cluster
Thanks. Will do that and open a PR.
Hi @jnothman!
Would you also like to replace the following descriptions, as seen in grid_search.py? They have an extra line compared to the one you shared.
    random_state : int, RandomState instance or None, optional (default=None)
        Pseudo random number generator state used for random uniform sampling
        from lists of possible values instead of scipy.stats distributions.
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
I can take grid_search.py and k_means.py (KMeans).
Leave grid_search.py alone; it is deprecated. The idea is to minimise the content that is repeated and available in the glossary, so that we can give users the most informative description of random_state's role in the particular estimator.
Thanks @jnothman. Will I need to understand these algorithms before I can replace this random_state information?
You will need to understand the algorithms broadly, but not every detail of
their implementation. You will need to be able to find where random_state
is used, if the randomisation in the algorithm is not completely obvious.
In some cases, it may be appropriate to not even give much more detail than
just linking to the glossary; we'll have to see how it goes.
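As a rough starting point, one way to spot where random_state actually feeds into an estimator before rewriting its docstring is to dump the module source and keep the lines that mention it (just an exploration helper, not part of the proposed change; KMeans is only used as an example here):

```python
# Print every line of the module defining KMeans that mentions random_state,
# to see which randomised steps the docstring should describe.
import inspect
from sklearn.cluster import KMeans

source = inspect.getsource(inspect.getmodule(KMeans))
for lineno, line in enumerate(source.splitlines(), start=1):
    if "random_state" in line:
        print(f"{lineno:5d}: {line.strip()}")
```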
Okay, thank you. I will start going through the algorithms slowly.
I have opened a pull request #10614
Since @aby0 has not claimed the sklearn.cluster module yet, I would like to claim the whole module. Please let me know if I can work on it or whether I should work on something else.
Any update, guys? It is a long holiday for us, so let me know if I can pick this up.
I'll take the datasets module since I'm already poking around in the docstrings there for #10731.
I'm claiming the linear_model module. Will raise a PR soon. #11900 raised.
Claiming decomposition module next.
Checklist of modules where this needs to be done:
We had some trouble reaching consensus on how to strike the right balance here, IIRC, so do pay attention to the prior PRs merged above.
@jnothman thanks! Will update the PRs to mention the reproducibility when passing an int.
willing to take up all the other modules in another PR, once these have been reviewed...
I'm claiming covariance.
@BlackTeaAndCoffee please be aware that the docstring format is not yet finalised; discussions have been happening on the other PRs listed here, so you might want to have a look at those too.
I am claiming feature_extraction
@jnothman, @NicolasHug, I just discovered #15222 and a number of PRs related to it that I haven't taken into account in summarizing this one... some of them have never been reviewed... :(
In order to make things clear for the sprints, I'm wondering if we can close one of those two issues: if yes, which one? That way I can avoid duplicating information. Thanks for your collaboration.
I wasn't aware of this issue (should have checked better), I'm happy to close https://github.com/scikit-learn/scikit-learn/issues/15222 in favor of this one
Following @jnothman's comment, maybe this issue deserves a 'Moderate' label?
@mojc and I want to work on ensemble/_hist_gradient_boosting/binning.
@anaisabeldhero and I want to work on manifold/*
#wimlds #SciKitLearnSprint
@daphn3k and I will work on sklearn/gaussian_process/
We want to work on sklearn/preprocessing/_data.py - 2178, 2607
@rachelcjordan and @fabi-cast
@Malesche and I want to take sklearn/inspection/_permutation_importance.py
Claiming the sklearn/metrics/cluster/_unsupervised.py file! #wimlds
@daphn3k and I are also taking covariance/* and neighbors/*. #wimlds
claim:
sklearn/dummy.py - 59
sklearn/multioutput.py - 578, 738
sklearn/kernel_approximation.py - 41, 143, 470
sklearn/multiclass.py - 687
sklearn/random_projection.py - 178, 245, 464, 586
PSA: please use the original sentence:

    Pass an int for reproducible results across multiple function calls.

instead of what I'm seeing in PRs at the moment:

    Use an int to make the randomness deterministic

which isn't correct, since the RNG is always deterministic regardless of what is passed.
CC @adrinjalali since I think you're at the sprint
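To illustrate the distinction, a minimal sketch using `sklearn.utils.check_random_state` (not tied to any particular estimator): the generator is deterministic in both cases, but only an explicit int seed reproduces the same draws across separate calls.

```python
from sklearn.utils import check_random_state

# Seeded: the same int always yields the same stream of draws.
print(check_random_state(0).rand(3))
print(check_random_state(0).rand(3))     # identical to the line above

# Unseeded: each call draws from the global np.random state, so the numbers
# differ between calls even though the generator itself is deterministic
# given its internal state.
print(check_random_state(None).rand(3))
print(check_random_state(None).rand(3))  # almost surely different
```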
Working on the neural network and mixture modules.
Hi @NicolasHug, this was meant to be a comment on a PR, I suppose... which one? :)
going to work on scikit-learn/sklearn/model_selection/_validation.py
@cmarmo That was a general comment for all PRs. I saw one and commented there, then saw a second one and figured out it was a pattern that would be better addressed at the source
Sorry @NicolasHug, my bad, I didn't find the comment easy to trace.
@NicolasHug The original sentence has been corrected in the commits from @anaisabeldhero and me.
@Olks and I claim sklearn/utils/extmath.py - 185, 297
Claim sklearn/ensemble/_iforest.py - 109
Claim sklearn/neural_network/_multilayer_perceptron.py - 782, 1174
Claim sklearn/ensemble/_weight_boosting.py - 188, 324, 479, 900, 1022
Claim sklearn/multioutput.py - 578, 738
Claim :
sklearn/mixture/_bayesian_mixture.py - 166
sklearn/mixture/_base.py - 139
sklearn/mixture/_gaussian_mixture.py - 504
Claim sklearn/ensemble/_gb.py - 887, 1360
Claim sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py - 736, 918
Claim sklearn/neural_network/_rbm.py - 59
Claim :
sklearn/svm/_classes.py - 90, 312, 546, 752
sklearn/svm/_base.py - 853
Claim:
sklearn/feature_selection/_mutual_info.py - 226, 335, 414
sklearn/metrics/cluster/_unsupervised.py - 80
sklearn/utils/_testing.py - 521
sklearn/utils/__init__.py - 478, 623
Claim :
sklearn/dummy.py - 59
sklearn/random_projection.py - 178, 245, 464, 586
@DatenBiene @GregoireMialon Thanks for all your contributions during the last sprint. There are only 3 modules left unchecked!
Would you be interested / have the time / have the motivation to tackle those (no pressure!)?
Hi Jérémie! I'll try to have a look at it soon.
Hi @jeremiedbb! I will try to finish the 3 remaining modules today 😃
Claim:
sklearn/kernel_approximation.py - 41, 143, 470
sklearn/multiclass.py - 687
sklearn/ensemble/_base.py - 52
Hi @jnothman and @jeremiedbb, looks like all the files were modified. I would be happy to help if you find any remaining issues.
Thanks a lot @DatenBiene and all the contributors who worked to close this issue!
I think we can close this huge one!
Feel free to open new specific issues if something is still missing from the random_state description.