Just found that the scikit-learn iris dataset is different from the one in R's MASS datasets package.
I found it via this Stats Exchange answer, which refers to here, which in turn says:
This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick '@' espeedaz.net).
The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature.
The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
Note that the data in the UCI repo and in the scikit-learn dataset are the same at the cited lines.
Comparing the iris data between the 3 sources, we have the following (bold marks the differences):

| row # | source | sepal_length | sepal_width | petal_length | petal_width |
|-------|--------|--------------|-------------|--------------|-------------|
| 35 | sklearn | 4.9 | 3.1 | 1.5 | **0.1** |
| 35 | Fisher | 4.9 | 3.1 | 1.5 | **0.2** |
| 35 | R MASS | 4.9 | 3.1 | 1.5 | **0.2** |
| 38 | sklearn | 4.9 | **3.1** | **1.5** | 0.1 |
| 38 | Fisher | 4.9 | **3.6** | **1.4** | 0.1 |
| 38 | R MASS | 4.9 | **3.6** | **1.4** | 0.1 |
http://rcs.chemometrics.ru/Tutorials/classification/Fisher.pdf
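For anyone who wants to check locally, here is a minimal sketch (not part of the original report; the "expected" values are simply the Fisher / R MASS rows quoted in the table above) that prints the two disputed samples from scikit-learn's copy:

```python
# Minimal check of the two disputed rows; samples 35 and 38 are 0-indexed
# rows 34 and 37 in load_iris().
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
sklearn_rows = iris.data[[34, 37]]
# Values as given by Fisher / R MASS (taken from the table above):
expected = np.array([[4.9, 3.1, 1.5, 0.2],
                     [4.9, 3.6, 1.4, 0.1]])
print(sklearn_rows)
print(expected - sklearn_rows)  # non-zero entries are the discrepancies
```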
If it's a bug, make the scikit-learn iris dataset equal to the others; otherwise, explain why the difference exists. I would guess it's the former.
Sounds like a bug. :\
I'm not sure if others are aware of this, or where our Iris came from.
Not sure if we should aspire to backwards compatibility when fixing this either...
Yeah, I thought about that too. That's why I opened the issue instead of just PR'ing the correct values :laughing:.
I think the iris dataset in scikit-learn was obtained from the UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml/datasets/Iris
Relevant commit: https://github.com/scikit-learn/scikit-learn/commit/def073bedb0cd71fbdc1e06d5fed1acf07960388
I'm really curious why such a common dataset has multiple versions.
As I said, their data doesn't agree with the original paper, so I guessed that was the cause of this "bug".
Oops, I think @paulochf is right. According to the doc and the previous commit, scikit-learn obtained the data from the UCI Machine Learning Repository. UCI itself has acknowledged that its data is not right (see the Data Set Information section at http://archive.ics.uci.edu/ml/datasets/Iris). So it seems reasonable to correct the data in scikit-learn. WDYT? @jnothman
If noting the error is sufficient for UCI, perhaps it is for us?
@jnothman A note is acceptable from my side, but I might prefer to correct the data. Some (or maybe many?) materials use the word 'error' to describe the difference. See e.g.
http://archive.ics.uci.edu/ml/datasets/Iris
http://pdfs.semanticscholar.org/1c27/e59992d483892274fd27ba1d8e19bbfb5d46.pdf
Since iris is widely used, correcting the data might influence some examples and many people's code, but maybe it's somehow worthwhile in the long run?
Adding to @qinhanmin2014: sometimes people write code using iris in different languages and compare the results. Fixing it would give some consistency and reproducibility.
I think you might be right that fixing it up is preferable, under the assumption that all people have done with this is illustrations that are unlikely to change...
I'd be curious to know what @gaelvaroquaux thinks
Well, the cost is that many tutorials / stackoverflow Q/A answers will change. So, that's a drawback.
On the other hand, the change is quite small, so I would guess that if a tutorial or example depends on this, it's bad anyhow :).
I am +0 on this.
+0 is about how I feel too. So far no one is agitating to keep it, so we could just change and move on!
Could I try to do this?
@rotuna Feel free to have a try :) You also need to take care of relevant docs/tests/examples after correcting the dataset.
@qinhanmin2014 could you elaborate on the relevant docs/tests/examples? Actually, I am totally new to the open source community.
I think what @qinhanmin2014 meant was that when you change the iris dataset values (as this issue asks), the output of quite a few docs, tests and examples (those that use the iris dataset) will likely change or fail. For example, the corrected iris data will have a slightly different standard deviation, so fitting a classifier on it may give (slightly) different output.
To reflect those changes, you'll just need to update the docs/tests/examples that fail.
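As a rough illustration (a sketch, not from the issue itself; the corrected rows are the ones quoted in the table earlier), even this two-cell change shifts simple statistics that tests or docs may hard-code:

```python
# Sketch: patch the two disputed samples and compare a summary statistic,
# to show why hard-coded expected values in tests/docs may need updating.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_fixed = X.copy()
X_fixed[34] = [4.9, 3.1, 1.5, 0.2]   # sample 35 per Fisher / R MASS
X_fixed[37] = [4.9, 3.6, 1.4, 0.1]   # sample 38 per Fisher / R MASS

print(X[:50].std(axis=0))        # setosa feature std with the current data
print(X_fixed[:50].std(axis=0))  # slightly different after the fix
```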
@rotuna @PranayMukherjee See above comments. You need to take care of every place which uses the iris dataset (one possible way is to use something like git grep). @PranayMukherjee since @rotuna has taken the issue, please wait for a couple of days or ask whether he's still on the issue.
I'd say the easy way is to make the change, then look at the test logs to work out what needs fixing up...
Hi, I'm just trying to figure out how to do this. @PranayMukherjee I'll reach out to you in case I'm not able to do this and let you take over. I'm very new to this too, so I'll need some time to try :)
Please correct me if I'm wrong: @rotuna, I guess @jnothman said you have to change the values in the iris data, then run the test suite (`pytest sklearn`) to see which parts use the data.
My suggestion was more that the uses of Iris are many... Sometimes it is used to assert general properties of an algorithm, and the specific values produced by that algorithm run over Iris are not tested. These tests do not need to be changed. Only those where specific data-dependent values are tested will need to be changed. Looking through every use of Iris is a waste of time. Looking through the test failures after changing Iris is straightforward.
I changed the values and found that so far only the graph_lasso part has failed. I'm now trying to find the actual values from R so that the test can compare them with sklearn's values (which is what I think the test was intended for).
I'm also running the scripts from the examples which have "load_iris" in them to check if something is changing. I'm guessing this would work for the docs too?
A lot of the failures could be in doctests, so make sure pytest is configured to run them. If you submit a PR (with [doc build] in the commit message) you can check the error log and the rendered examples quite easily.
I am also +0 on this. It's a lot of changes, work, review time for little gain.
Update: I've got the data from R, and I'll send in a PR on Friday or Saturday to find out where the other issues are, as @jnothman said. I've been having an unusually hectic time at work. Thankfully the work week ends tomorrow and I can get some time to focus on this properly.
Sorry for the delay
@rotuna, I've already done that. I posted the wrong values in my issue.
@paulochf I meant the cov and icov values for iris from R glasso package. These are in test_graph_lasso.py under the test_graph_lasso_iris function.
Oh, never mind! :sweat_smile:
The cov and icov values for iris from R don't match with sklearn's values after updating iris.
The code I used for finding the R values is here
I used `nosetests -v sklearn` to run the tests.
I tried to understand both functions over the week to figure out where I was wrong, but I couldn't figure out anything beyond the fact that the output of R's cov function and sklearn's empirical_covariance did not match. Please help.
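One thing worth checking here (an assumption on my part, not something established in this thread): R's cov() uses the unbiased n-1 normalisation, while sklearn's empirical_covariance divides by n, so the two will not match exactly even on identical data. A quick sketch:

```python
# Sketch comparing the two normalisations on the same data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance

X = load_iris().data
emp = empirical_covariance(X)        # divides by n
unbiased = np.cov(X, rowvar=False)   # divides by n - 1, like R's cov()

print(np.allclose(emp, unbiased))                          # False
print(np.allclose(emp * len(X) / (len(X) - 1), unbiased))  # True
```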
@rotuna Are you still stuck on this? Can you involve me in solving this issue?
I'm not sure what's going on here. A pull request may help make it concrete. Those arrays differ only on the diagonal, and there they are offset by 1.
@jnothman @paulochf @qinhanmin2014
I sent a PR, and the same test is failing here for the same reason with the same values (apart from one coding-convention error in the comments; I'll fix that in the actual MRG PR). Could you please help me fix this?
Note: test_graph_lasso_iris is the one I am talking about. The other test (test_graphical_lasso_iris) also fails, but I have not changed its values yet; in the PR it still uses the values computed from the original sklearn iris. (Comparing it manually against the values from R, it fails too, though.)
Well, it says:
# Hard-coded solution from R glasso package for alpha=1.0
# The iris datasets in R and scikit-learn do not match in a few places,
# these values are for the scikit-learn version.
so presumably you should try and recompute glasso's output with Iris in R, and put those in the test. Is that something you are able to do?
@jnothman
Yes, the test fails on the newly computed values from R. The code I used to get the values is here
Right, sorry for not keeping track of this appropriately. Now, the error you show seems to suggest that the difference between R and us is systematic for `w`: the R diagonal is 1 greater than ours, and otherwise they appear identical, yeah? I've not looked into why there might be such a consistent difference, but it might be expected.
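A sketch of how one might test that hypothesis (assumptions on my part: the graphical_lasso name from sklearn >= 0.20 and alpha=1.0 as in the test; `r_cov` is a hypothetical placeholder for the matrix computed in R and is not reproduced here). If R penalises the diagonal while scikit-learn does not, the two covariance estimates would differ by exactly alpha on the diagonal:

```python
# Sketch: recompute the scikit-learn side of test_graph(ical)_lasso_iris and
# check the "diagonal offset by alpha" hypothesis against the R glasso output.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance, graphical_lasso

alpha = 1.0
X = load_iris().data
emp_cov = empirical_covariance(X)
cov, icov = graphical_lasso(emp_cov, alpha=alpha)

# r_cov = np.array([...])  # hypothetical: the covariance hard-coded from R glasso
# diff = r_cov - cov
# print(np.allclose(diff - np.diag(np.diag(diff)), 0))  # off-diagonals match?
# print(np.allclose(np.diag(diff), alpha))              # diagonal offset == alpha?
```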