Scipy: Overflow in stats.kruskal for large samples

Created on 20 Aug 2017 · 4 Comments · Source: scipy/scipy

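When the total number of samples is large, stats.kruskal() emits an overflow warning and returns nonsensical results. The product totaln * (totaln + 1) in the H statistic appears to overflow when totaln is held in a 32-bit integer. It's easy to fix by changing 'totaln = np.sum(n)' to 'totaln = np.sum(n, dtype='uint64')' in the stats.kruskal() definition. I'll submit a fix shortly.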
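
To see the wraparound in isolation, here is a minimal sketch. It assumes the total is held in a signed 32-bit integer, which is forced explicitly with dtype=np.int32 so the behaviour does not depend on the platform; the variable names are illustrative and not taken from stats.py, apart from totaln.

```python
import numpy as np

# Group sizes from the failing two-sample report: 25000 + 25000.
n = np.array([25000, 25000])

# Assumption: force a 32-bit total to mimic a platform where NumPy's
# default integer is 32 bits wide.
totaln_32 = np.sum(n, dtype=np.int32)
with np.errstate(over="ignore"):  # silence the overflow warning for the demo
    wrapped = totaln_32 * (totaln_32 + np.int32(1))

# The proposed fix: accumulate the total as an unsigned 64-bit integer.
totaln_64 = np.sum(n, dtype="uint64")
exact = totaln_64 * (totaln_64 + np.uint64(1))

print(int(wrapped))  # negative: the product wrapped around
print(int(exact))    # 2500050000
```

With the 64-bit total the product stays exact, so the H statistic computed from it is no longer corrupted.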

Reproducing code examples:

This produces the overflow warning and prints a wrong p-value (50000 samples in total):

import numpy as np
from scipy import stats
[statistic, pvalue]  = stats.kruskal(np.random.randn(25000),np.random.randn(25000))
print(pvalue)

This produces no warning (40000 samples in total):

import numpy as np
from scipy import stats
[statistic, pvalue]  = stats.kruskal(np.random.randn(20000),np.random.randn(20000))
print(pvalue)

This produces the overflow warning and prints a wrong p-value (50000 samples in total, three groups):

import numpy as np
from scipy import stats
[statistic, pvalue]  = stats.kruskal(np.random.randn(20000),np.random.randn(15000),np.random.randn(15000))
print(pvalue)

This produces no warning (45000 samples in total, three groups):

import numpy as np
from scipy import stats
[statistic, pvalue]  = stats.kruskal(np.random.randn(15000),np.random.randn(15000),np.random.randn(15000))
print(pvalue)

Warning message:

stats\stats.py:5056: RuntimeWarning: overflow encountered in long_scalars
  h = 12.0 / (totaln * (totaln + 1)) * ssbn - 3 * (totaln + 1)
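
The four cases above are consistent with the product totaln * (totaln + 1) exceeding the signed 32-bit range. A quick check in plain Python integers (which never overflow) is sketched below; the helper names are hypothetical and the 32-bit limit is an assumption based on the warning, not something read from the SciPy source.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def product_overflows_int32(group_sizes):
    """Return True if totaln * (totaln + 1) exceeds the signed 32-bit range."""
    totaln = sum(group_sizes)
    return totaln * (totaln + 1) > INT32_MAX


def largest_safe_total():
    """Largest totaln whose product totaln * (totaln + 1) still fits."""
    totaln = 0
    while (totaln + 1) * (totaln + 2) <= INT32_MAX:
        totaln += 1
    return totaln


# Group sizes from the four reproducing examples, in order.
reported_cases = [
    (25000, 25000),         # warning reported
    (20000, 20000),         # no warning reported
    (20000, 15000, 15000),  # warning reported
    (15000, 15000, 15000),  # no warning reported
]
for sizes in reported_cases:
    print(sizes, product_overflows_int32(sizes))

print(largest_safe_total())  # 46340
```

Under this assumption, any call whose combined sample count exceeds 46340 would hit the overflow, regardless of how the samples are split across groups.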

Scipy/Numpy/Python version information:

<<Output from 'import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info)'>>

0.19.0 1.12.1 sys.version_info(major=3, minor=6, micro=1, releaselevel='final', serial=0)

Labels: defect, scipy.stats

All 4 comments

Is there a fix for this issue? It is two years old now...

Yes @avihaleva, there was a fix in #7763, but for some reason it has not been merged. In the meantime you can patch your local copy: edit stats.py and replace totaln = np.sum(n) with totaln = np.sum(n, dtype='uint64') on line 5931 (the exact line number varies between versions).

This issue still exists. I just reinstalled the latest version of SciPy, and the unpatched line is still there (now on line 5878):
totaln = np.sum(n)

@petereliason 1.4.0 is not released yet
