Just found that the scikit-learn iris dataset is different from the one in R's MASS datasets package.
I found it via this Stats Exchange answer, which refers to here, which in turn says:
This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick '@' espeedaz.net).
The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature.
The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
Note that the data in the UCI repo and in the scikit-learn dataset are the same at the cited lines.
Comparing the iris data between the 3 sources, we have the following (bold marks the differences):

| row # | source | sepal_length | sepal_width | petal_length | petal_width |
|-------|--------|--------------|-------------|--------------|-------------|
| 35 | sklearn | 4.9 | 3.1 | 1.5 | **0.1** |
| 35 | Fisher | 4.9 | 3.1 | 1.5 | **0.2** |
| 35 | R MASS | 4.9 | 3.1 | 1.5 | **0.2** |
| 38 | sklearn | 4.9 | **3.1** | **1.5** | 0.1 |
| 38 | Fisher | 4.9 | **3.6** | **1.4** | 0.1 |
| 38 | R MASS | 4.9 | **3.6** | **1.4** | 0.1 |
http://rcs.chemometrics.ru/Tutorials/classification/Fisher.pdf
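For anyone who wants to check locally, here is a minimal sketch (not part of the original report; the "expected" values are simply the Fisher / R MASS rows quoted in the table above) that prints the two disputed samples from scikit-learn's copy:

```python
# Minimal check of the two disputed rows; samples 35 and 38 are 0-indexed
# rows 34 and 37 in load_iris().
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sklearn_rows = iris.data[[34, 37]]
# Values as given by Fisher / R MASS (taken from the table above):
expected = np.array([[4.9, 3.1, 1.5, 0.2],
                     [4.9, 3.6, 1.4, 0.1]])
print(sklearn_rows)
print(expected - sklearn_rows)  # non-zero entries are the discrepancies
```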
If it's a bug, make the scikit-learn iris dataset equal to the others; otherwise, explain why the difference exists. I would guess it's the former.
Sounds like a bug. :\
I'm not sure if others are aware of this, or where our Iris came from.
Not sure if we should aspire to backwards compatibility when fixing this either...
Yeah, I thought about that too. That's why I opened the issue instead of just PR'ing the correct values :laughing:.
I think the iris dataset in scikit-learn was obtained from the UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml/datasets/Iris
Relevant commit: https://github.com/scikit-learn/scikit-learn/commit/def073bedb0cd71fbdc1e06d5fed1acf07960388
I'm really curious why such a common dataset has multiple versions.
As I said, their data doesn't agree with the original paper, so I guessed that was the cause of this "bug".
Oops, I think @paulochf is right. According to the doc and the previous commit, scikit-learn obtained the data from the UCI Machine Learning Repository. UCI itself has acknowledged that its data is not right (see the Data Set Information section at http://archive.ics.uci.edu/ml/datasets/Iris). So it seems reasonable to correct the data in scikit-learn. WDYT? @jnothman
If noting the error is sufficient for UCI, perhaps it is for us?
@jnothman A note is acceptable from my side, but I might prefer to correct the data. Some (or maybe many?) materials use the word 'error' to describe the difference. See e.g.
http://archive.ics.uci.edu/ml/datasets/Iris
http://pdfs.semanticscholar.org/1c27/e59992d483892274fd27ba1d8e19bbfb5d46.pdf
Since iris is widely used, correcting the data might influence some examples and many people's code, but maybe it's somehow worthwhile in the long run?
Adding to @qinhanmin2014: sometimes people write code using iris in different languages and compare the results. Fixing it would give some consistency and reproducibility.
I think you might be right that fixing it up is preferable, under the assumption that all people have done with this is illustrations that are unlikely to change...
I'd be curious to know what @gaelvaroquaux thinks
Well, the cost is that many tutorials / stackoverflow Q/A answers will change. So, that's a drawback.
On the other hand, the change is quite small, so I would guess that if a tutorial or example depends on this, it's bad anyhow :).
I am +0 on this.
+0 is about how I feel too. So far no one is agitating to keep it, so we could just change and move on!
Could I try to do this?
@rotuna Feel free to have a try :) You also need to take care of relevant docs/tests/examples after correcting the dataset.
@qinhanmin2014 could you elaborate on the relevant docs/tests/examples? Actually, I am totally new to the open source community.
I think what @qinhanmin2014 meant was that when you change the iris dataset values (as this issue asks), the output of quite a few docs, tests and examples (those that use the iris dataset) will likely change or fail. For example, the corrected iris data will have a slightly different standard deviation, so fitting a classifier on it may give (slightly) different output.
To reflect those changes, you'll just need to update the docs/tests/examples that fail.
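As a rough illustration (a sketch, not from the issue itself; the corrected rows are the ones quoted in the table earlier), even this two-cell change shifts simple statistics that tests or docs may hard-code:

```python
# Sketch: patch the two disputed samples and compare a summary statistic,
# to show why hard-coded expected values in tests/docs may need updating.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_fixed = X.copy()
X_fixed[34] = [4.9, 3.1, 1.5, 0.2]   # sample 35 per Fisher / R MASS
X_fixed[37] = [4.9, 3.6, 1.4, 0.1]   # sample 38 per Fisher / R MASS

print(X[:50].std(axis=0))        # setosa feature std with the current data
print(X_fixed[:50].std(axis=0))  # slightly different after the fix
```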
@rotuna @PranayMukherjee See above comments. You need to take care of every place which uses the iris dataset (one possible way is to use something like git grep). @PranayMukherjee since @rotuna has taken the issue, please wait for a couple of days or ask whether he's still on the issue.
I'd say the easy way is to make the change, then look at the test logs to work out what needs fixing up...
Hi, I'm just trying to figure out how to do this. @PranayMukherjee I'll reach out to you in case I'm not able to do this and let you take over. I'm very new to this too, so I'll need some time to try :)
Please correct me if I'm wrong: @rotuna, I guess @jnothman said you have to change the values in the iris data, then run the test suite (`pytest sklearn`) to see which parts use the data.
My suggestion was more that the uses of Iris are many... Sometimes it is used to assert general properties of an algorithm, and the specific values produced by that algorithm run over Iris are not tested. These tests do not need to be changed. Only those where specific data-dependent values are tested will need to be changed. Looking through every use of Iris is a waste of time. Looking through the test failures after changing Iris is straightforward.
I changed the values and found that so far only the graph_lasso part has failed. I'm now trying to find the actual values from R so that the test can compare them with sklearn's values (which is what I think the test was intended for).
I'm also running the scripts from the examples which have "load_iris" in them to check if something is changing. I'm guessing this would work for the docs too?
A lot of the failures could be in doctests, so make sure pytest is configured to run them. If you submit a PR (with [doc build] in the commit message) you can check the error log and the rendered examples quite easily.
I am also +0 on this. It's a lot of changes, work, review time for little gain.
Update: I've got the data from R, and I'll send in a PR on Friday or Saturday to find out where the other issues are, as @jnothman said. I've been having an unusually hectic time at work. Thankfully the work week ends tomorrow and I can get some time to focus on this properly.
Sorry for the delay
@rotuna, I've already done that. I posted the wrong values in my issue.
@paulochf I meant the cov and icov values for iris from R glasso package. These are in test_graph_lasso.py under the test_graph_lasso_iris function.
Oh, never mind! :sweat_smile:
The cov and icov values for iris from R don't match with sklearn's values after updating iris.
The code I used for finding the R values is here
I used `nosetests -v sklearn` to run the tests.
I tried to understand both functions over the week to figure out where I was wrong, but I couldn't figure out anything beyond the fact that the output of R's cov function and sklearn's empirical_covariance did not match. Please help.
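One thing worth checking here (an assumption on my part, not something established in this thread): R's cov() uses the unbiased n-1 normalisation, while sklearn's empirical_covariance divides by n, so the two will not match exactly even on identical data. A quick sketch:

```python
# Sketch comparing the two normalisations on the same data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance

X = load_iris().data
emp = empirical_covariance(X)        # divides by n
unbiased = np.cov(X, rowvar=False)   # divides by n - 1, like R's cov()

print(np.allclose(emp, unbiased))                          # False
print(np.allclose(emp * len(X) / (len(X) - 1), unbiased))  # True
```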
@rotuna Are you still stuck on this? Can you involve me in solving this issue?
I'm not sure what's going on here. A pull request may help make it concrete. Those arrays differ only on the diagonal, and there they are offset by 1.
@jnothman @paulochf @qinhanmin2014
I sent a PR, and the same test is failing here for the same reason with the same values (apart from one coding-convention error in the comments; I'll fix that in the actual MRG PR). Could you please help me fix this?
Note: test_graph_lasso_iris is the one I am talking about. The other test (test_graphical_lasso_iris) also fails, but I have not changed its values yet; in the PR it still uses the values computed from the original sklearn iris. (Comparing it manually against the values from R, it fails too, though.)
Well, it says:
# Hard-coded solution from R glasso package for alpha=1.0
# The iris datasets in R and scikit-learn do not match in a few places,
# these values are for the scikit-learn version.
so presumably you should try and recompute glasso's output with Iris in R, and put those in the test. Is that something you are able to do?
@jnothman
Yes, the test fails on the newly computed values from R. The code I used to get the values is here
Right, sorry for not keeping track of this appropriately. Now, the error you show seems to suggest that the difference between R and us is systematic for `w`: the R diagonal is 1 greater than ours, and otherwise they appear identical, yeah? I've not looked into why there might be such a consistent difference, but it might be expected.
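A sketch of how one might test that hypothesis (assumptions on my part: the graphical_lasso name from sklearn >= 0.20 and alpha=1.0 as in the test; `r_cov` is a hypothetical placeholder for the matrix computed in R and is not reproduced here). If R penalises the diagonal while scikit-learn does not, the two covariance estimates would differ by exactly alpha on the diagonal:

```python
# Sketch: recompute the scikit-learn side of test_graph(ical)_lasso_iris and
# check the "diagonal offset by alpha" hypothesis against the R glasso output.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance, graphical_lasso

alpha = 1.0
X = load_iris().data
emp_cov = empirical_covariance(X)
cov, icov = graphical_lasso(emp_cov, alpha=alpha)

# r_cov = np.array([...])  # hypothetical: the covariance hard-coded from R glasso
# diff = r_cov - cov
# print(np.allclose(diff - np.diag(np.diag(diff)), 0))  # off-diagonals match?
# print(np.allclose(np.diag(diff), alpha))              # diagonal offset == alpha?
```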