Pandas: IndexError: tuple index out of range after upgrade to 0.25

Created on 6 Aug 2019 · 13 Comments · Source: pandas-dev/pandas

Root cause (in both cases using df = pd.DataFrame({'a': [1, 2, 3]})):

In [71]: pd.__version__  
Out[71]: '0.25.0'

In [73]: df.index[:, None]
Out[73]: Int64Index([0, 1, 2], dtype='int64')

In [74]: df.index[:, None].shape
Out[74]: (3,)

In [10]: pd.__version__  
Out[10]: '0.24.2'

In [13]: df.index[:, None] 
Out[13]: Int64Index([0, 1, 2], dtype='int64')

In [14]: df.index[:, None].shape
Out[14]: (3, 1)

So before 0.25, indexing with [:, None] (the numpy idiom for adding a dimension to get a 2D array) actually resulted in an Index with ndim of 2 (which is, of course, an inconsistent state for an Index object).
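For readers less familiar with the idiom, here is a minimal sketch of what [:, None] does on a plain numpy array. It uses only numpy and the variable names are purely illustrative:

~~~python
import numpy as np

arr = np.array([0, 1, 2])

# None in an index is an alias for np.newaxis: it inserts a new axis of length 1.
col = arr[:, None]

print(arr.shape)  # (3,)
print(col.shape)  # (3, 1)
print(col.ndim)   # 2
~~~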

Matplotlib relied on this fact when an Index is passed to plt.plot, as reported in https://github.com/matplotlib/matplotlib/issues/14992

I have explained the issue here and here in detail. Basically, after upgrading to version 0.25 I got the error:

IndexError: tuple index out of range

while attempting to plot a CSV file.

Compat Regression

Foadsf

All 13 comments

pls update the top section with a reproducible example; links to additional material are fine, but the source material and versions should be here

jreback on 6 Aug 2019

👍1

@jreback I have actually downgraded Pandas from 0.25 to 0.24 so I'm not sure if there are other dependencies which might have also been downgraded. Right now the result of pd.show_versions() is:

~~~

INSTALLED VERSIONS

commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.0
pytest: None
pip: 19.2.1
setuptools: 41.0.1
Cython: None
numpy: 1.17.0
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.8
lxml.etree: 4.4.0
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
~~~

The reproducible example is actually very simple:

~~~python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

headers = ['fx', 'fy', 'fz', 'tx', 'ty', 'tz', 'currentr',
'time', 'theta', 'omegay', 'currenty', 'pr', 'Dc', 'Fr', 'Fl']
df = pd.read_csv('data.csv', names=headers)

fig3 = plt.figure()
plt.plot(df.index, df['time'])
plt.show()
~~~

Nothing particularly specific. More details, including the CSV file, are here.

Please let me know if this is satisfactory. Thanks in advance for your support.

Foadsf on 6 Aug 2019

pls try to reduce this to a copy pastable example w/o any external links
the likelihood of response will be higher

jreback on 6 Aug 2019

👍1

Dear @jreback ,

@anntzer has provided a small example showing the difference between 0.25 and 0.24 here, so I'm just gonna quote her/him:

~~~python
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.index.shape, df.index[:, None].shape)
~~~
This now prints (3,) (3,), but with pandas 0.24 used to print (3,) (3, 1) which we relied on to convert input to 2D.

Foadsf on 6 Aug 2019

@Foadsf I updated the top post with that example

jorisvandenbossche on 6 Aug 2019

👍1

So the root cause is that we don't handle a 2D indexer on an Index class well.
We basically simply ignore the fact that df.index[:, None] is a 2D indexer.

The source of Index.__getitem__ actually mentions that for such a case, a plain ndarray should be returned:

https://github.com/pandas-dev/pandas/blob/640d9e1f5fe8ab64d1f6496b8216c28185e53225/pandas/core/indexes/base.py#L4241-L4242

but that clearly does not happen (anymore).

jorisvandenbossche on 6 Aug 2019

Though I don't think returning an ndarray is appropriate, right? I'd be surprised to have __getitem__ change the type to a different container class.

What's the best path forward? IMO raising is the most correct thing to do. But is it worth changing?
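To make the "raise" option concrete, here is a rough sketch using a toy container. StrictIndex is a hypothetical class invented for illustration; it is not pandas code and the error message is made up:

~~~python
import numpy as np


class StrictIndex:
    """Hypothetical 1D container that refuses multi-dimensional indexing."""

    def __init__(self, values):
        values = np.asarray(values)
        if values.ndim != 1:
            raise ValueError("StrictIndex must be 1-dimensional")
        self._values = values

    def __getitem__(self, key):
        result = self._values[key]
        if np.ndim(result) == 0:
            # Scalar lookup: return the element itself.
            return result
        if result.ndim > 1:
            # A key such as [:, None] would produce 2D values, so raise instead.
            raise ValueError("multi-dimensional indexing is not supported")
        return StrictIndex(result)


idx = StrictIndex([0, 1, 2])

error_message = None
try:
    idx[:, None]
except ValueError as exc:
    error_message = str(exc)

print(error_message)
~~~

The trade-off is the one raised in this thread: the container never ends up in an inconsistent state, but callers that relied on [:, None] have to convert to an ndarray themselves first.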

TomAugspurger on 6 Aug 2019

This was "caused" by https://github.com/pandas-dev/pandas/pull/27384, which optimized Index.shape to return (len(self),) instead of self.values.shape.

But of course bottom line is still that an Index with 2D values is an invalid index object:

In [13]: idx = pd.Index([1, 2, 3])[:, None]                                                                                                                   

In [14]: idx.values                                                                                                                                           
Out[14]: 
array([[1],
       [2],
       [3]])

In [15]: idx.shape                                                                                                                                            
Out[15]: (3,)
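The mismatch can be mimicked without pandas. The sketch below uses a hypothetical ToyIndex class (not the pandas implementation) that wraps whatever numpy returns from indexing and exposes both ways of computing the shape side by side:

~~~python
import numpy as np


class ToyIndex:
    """Hypothetical stand-in for an index; for illustration only."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        # Wrap the numpy result as-is, even when the key makes it 2D.
        return ToyIndex(self._values[key])

    @property
    def shape_from_values(self):
        # Shape taken from the underlying array.
        return self._values.shape

    @property
    def shape_from_len(self):
        # Shape computed from the length only; always 1D.
        return (len(self),)


idx = ToyIndex([1, 2, 3])[:, None]

print(idx.shape_from_values)  # (3, 1)
print(idx.shape_from_len)     # (3,)
~~~

Both definitions agree for 1D values; they only diverge once the object holds 2D values, which is the invalid state described above.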

jorisvandenbossche on 6 Aug 2019

I think short term, the easiest option is to revert the Index.shape change (but we could keep it for MultiIndex, to keep the performance improvement). That would at least solve the regression with matplotlib.

But longer term this is not really a good solution.
Raising an error certainly sounds like a valid option, but that will require changes in matplotlib.

I suppose the reason that it returned a 2D array before might have been that it was an ndarray subclass, and in general it might be useful to see the Index as an array-like that behaves well in code that expects a numpy-like array.

BTW, Series actually does this:

In [16]: pd.Series([1, 2, 3])[:, None]                                                                                                                        
Out[16]: 
array([[1],
       [2],
       [3]])

jorisvandenbossche on 6 Aug 2019

The Series case only works for actual numpy dtypes. Eg for categorical it returns a Series but goes wrong in all kinds of ways:

In [32]: s = pd.Series(pd.Categorical(['a', 'b']))[:, None]                                                                                                   

In [33]: type(s)                                                                                                                                              
Out[33]: pandas.core.series.Series

In [34]: s                                                                                                                                                    
Out[34]:
...
TypeError: unsupported format string passed to numpy.ndarray.__format__

In [35]: s._data                                                                                                                                              
Out[35]: 
SingleBlockManager
Items: Int64Index([[0], [1]], dtype='int64')
CategoricalBlock: 1 dtype: category

In [36]: s.index                                                                                                                                              
Out[36]: Int64Index([[0], [1]], dtype='int64')

In [37]: s.values                                                                                                                                             
Out[37]: 
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [38]: s.cat.codes                                                                                                                                          
...
ValueError: Length of passed values is 1, index implies 2

jorisvandenbossche on 6 Aug 2019

From Matplotlib's point of view, returning a numpy array is just fine (as we are trying to duck-type Series and Index as numpy arrays anyway). If we have gotten to the point where we are doing [:, None] we probably think it is close enough to a numpy array, so maybe we just need to cast to numpy a bit more vigorously?
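As a sketch of what casting "more vigorously" could look like on the caller's side: convert to an ndarray first, then add the axis. The helper name as_2d_column is hypothetical and this is not Matplotlib's actual code:

~~~python
import numpy as np


def as_2d_column(x):
    """Hypothetical helper: return x as an ndarray with at least 2 dimensions.

    1D input of length n becomes shape (n, 1); 2D input is returned unchanged.
    """
    arr = np.asarray(x)  # cast to a plain ndarray before any indexing
    if arr.ndim == 1:
        arr = arr[:, None]
    return arr


col = as_2d_column(range(3))
print(type(col).__name__, col.shape)  # ndarray (3, 1)
~~~

Because the indexing happens on the ndarray rather than on the original object, the result does not depend on how that object implements __getitem__. A pandas Index or Series passed in goes through the same np.asarray call.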

tacaswell on 7 Aug 2019

This is also related to https://github.com/pandas-dev/pandas/issues/27125 (the fact that we can create an Index with >1 dimensional array).

For a 0.25.1 bugfix release, I would propose to again start returning the 2D shape.

jorisvandenbossche on 8 Aug 2019

I opened a PR for what I proposed above: https://github.com/pandas-dev/pandas/pull/27818

I think for pandas it is fine to output an "invalid" (2D) shape as long as we allow constructing "invalid" Index objects. We should fix that second issue though, for which there is https://github.com/pandas-dev/pandas/issues/27125

jorisvandenbossche on 8 Aug 2019
