Root cause (in both cases using df = pd.DataFrame({'a': [1, 2, 3]})):
In [71]: pd.__version__
Out[71]: '0.25.0'
In [73]: df.index[:, None]
Out[73]: Int64Index([0, 1, 2], dtype='int64')
In [74]: df.index[:, None].shape
Out[74]: (3,)
vs
In [10]: pd.__version__
Out[10]: '0.24.2'
In [13]: df.index[:, None]
Out[13]: Int64Index([0, 1, 2], dtype='int64')
In [14]: df.index[:, None].shape
Out[14]: (3, 1)
So before, indexing with [:, None] (in numpy, a way to add a dimension and get a 2D array) actually resulted in an Index with an ndim of 2 (which is of course an inconsistent state for the Index object).
Matplotlib relied on this fact when an Index is passed to plt.plot, as reported in https://github.com/matplotlib/matplotlib/issues/14992
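For reference, a minimal sketch of what [:, None] does on a plain numpy array, which is the behavior the 0.24 Index happened to mirror. The variable names are only illustrative.

```python
import numpy as np

x = np.array([0, 1, 2])

# None in an index is an alias for np.newaxis: it inserts a length-1 axis.
col = x[:, None]

print(x.shape, col.shape)  # (3,) (3, 1)
```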
I have explained the issue here and here in detail. Basically, after upgrading to version 0.25 I got the error:
IndexError: tuple index out of range
while attempting to plot a CSV file.
pls update the top section with a reproducible example; links to additional material are fine, but the source material and versions should be here
@jreback I have actually downgraded Pandas from 0.25 to 0.24 so I'm not sure if there are other dependencies which might have also been downgraded. Right now the result of pd.show_versions() is:
~~~
commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.0
pytest: None
pip: 19.2.1
setuptools: 41.0.1
Cython: None
numpy: 1.17.0
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.8
lxml.etree: 4.4.0
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
~~~
The reproducible example is actually very simple:
~~~python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
headers = ['fx', 'fy', 'fz', 'tx', 'ty', 'tz', 'currentr',
'time', 'theta', 'omegay', 'currenty', 'pr', 'Dc', 'Fr', 'Fl']
df = pd.read_csv('data.csv', names=headers)
fig3 = plt.figure()
plt.plot(df.index, df['time'])
plt.show()
~~~
Nothing particularly specific. More details, including the CSV file, here.
Please let me know if this is satisfactory. Thanks in advance for your support.
pls try to reduce this to a copy pastable example w/o any external links
the likelihood of response will be higher
Dear @jreback ,
@anntzer has provided a small example showing the difference between 0.25 and 0.24 here, so I'm just gonna quote her/him:
~~~python
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.index.shape, df.index[:, None].shape)
~~~
This now prints (3,) (3,), but with pandas 0.24 used to print (3,) (3, 1), which we relied on to convert input to 2D.
@Foadsf I updated the top post with that example
So the root cause is that we don't handle well a 2D indexer on an Index class.
We basically simply ignore the fact that df.index[:, None] is a 2D indexer.
The source of Index.__getitem__ actually mentions that for such a case, a plain ndarray should be returned:
but that clearly does not happen (anymore).
Though I don't think returning an ndarray is appropriate, right? I'd be surprised to have __getitem__ change the type to a different container class.
What's the best path forward? IMO raising is the most correct thing to do. But is it worth changing?
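To make the "raising" option concrete, here is a rough sketch of what such a check could look like. The helper name getitem_1d_only and the choice of ValueError are assumptions for illustration only; this is not pandas code.

```python
import numpy as np


def getitem_1d_only(values, key):
    """Index `values` with `key`, refusing any result with more than one dimension.

    Hypothetical helper, only meant to illustrate the "raise" option.
    """
    result = np.asarray(values)[key]
    if np.ndim(result) > 1:
        # A key such as (slice(None), None) would add a second axis.
        raise ValueError("multi-dimensional indexing is not supported")
    return result


print(getitem_1d_only([1, 2, 3], slice(None, 2)))  # ordinary 1D slicing still works

try:
    getitem_1d_only([1, 2, 3], (slice(None), None))  # equivalent to values[:, None]
except ValueError as exc:
    print("raised:", exc)
```

Callers that need a 2D array would then have to convert to a numpy array themselves before adding the axis.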
This was "caused" by https://github.com/pandas-dev/pandas/pull/27384, which optimized Index.shape to return (len(self),) instead of self.values.shape.
But of course bottom line is still that an Index with 2D values is an invalid index object:
In [13]: idx = pd.Index([1, 2, 3])[:, None]
In [14]: idx.values
Out[14]:
array([[1],
       [2],
       [3]])
In [15]: idx.shape
Out[15]: (3,)
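The mismatch can be reproduced without pandas. Below is a toy sketch: ToyIndex and its two shape properties are hypothetical names, not the actual pandas implementation. It shows how a shape derived from the length and a shape derived from the values diverge once __getitem__ lets a 2D indexer through.

```python
import numpy as np


class ToyIndex:
    """Hypothetical stand-in for an index-like container (not pandas code)."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, key):
        # The key is not validated, so [:, None] produces 2D values.
        return ToyIndex(self.values[key])

    @property
    def shape_from_len(self):
        # The cheaper variant: never touches the underlying values.
        return (len(self),)

    @property
    def shape_from_values(self):
        # The older variant: reports whatever the values look like.
        return self.values.shape


idx = ToyIndex([1, 2, 3])[:, None]
print(idx.shape_from_len)     # (3,)
print(idx.shape_from_values)  # (3, 1)
```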
I think short term, the easiest option is to revert the Index.shape change (but we could keep it for MultiIndex, to keep the performance improvement). That would at least solve the regression with matplotlib.
But longer term this is not really a good solution.
Raising an error certainly sounds like a valid option, but that will require changes in matplotlib.
I suppose the reason it returned a 2D array before might have been that it was an ndarray subclass, and in general it might be useful to see the Index as an array-like that behaves well in code that expects a numpy-like array.
BTW, Series actually does this:
In [16]: pd.Series([1, 2, 3])[:, None]
Out[16]:
array([[1],
       [2],
       [3]])
The Series case only works for actual numpy dtypes. E.g. for categorical it returns a Series, but goes wrong in all kinds of ways:
In [32]: s = pd.Series(pd.Categorical(['a', 'b']))[:, None]
In [33]: type(s)
Out[33]: pandas.core.series.Series
In [34]: s
Out[34]:
...
TypeError: unsupported format string passed to numpy.ndarray.__format__
In [35]: s._data
Out[35]:
SingleBlockManager
Items: Int64Index([[0], [1]], dtype='int64')
CategoricalBlock: 1 dtype: category
In [36]: s.index
Out[36]: Int64Index([[0], [1]], dtype='int64')
In [37]: s.values
Out[37]:
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
In [38]: s.cat.codes
...
ValueError: Length of passed values is 1, index implies 2
From Matplotlib's point of view, returning a numpy array is just fine (as we are trying to duck-type Series and Index as numpy arrays anyway). If we have gotten to the point where we are doing [:, None], we probably think it is close enough to a numpy array, so maybe we just need to cast to numpy a bit more vigorously?
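A sketch of what "casting more vigorously" could mean on the consumer side: convert to a numpy array first, then add the axis. FlatOnly and to_2d_column are hypothetical names used only for illustration; FlatOnly merely imitates a container that keeps reporting a 1D shape after [:, None], and none of this is Matplotlib's actual code.

```python
import numpy as np


class FlatOnly:
    """Hypothetical array-like whose [:, None] result still reports a 1D shape."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __len__(self):
        return len(self._data)

    @property
    def shape(self):
        return (len(self),)

    def __getitem__(self, key):
        # Imitate the observable behaviour: the extra axis is dropped again.
        return FlatOnly(np.ravel(self._data[key]))

    def __array__(self, dtype=None, copy=None):
        # Lets np.asarray() recover the underlying numpy array.
        return np.asarray(self._data, dtype=dtype)


def to_2d_column(x):
    # Cast to a real ndarray first, so [:, None] has numpy semantics.
    return np.asarray(x)[:, None]


obj = FlatOnly([0, 1, 2])
print(obj[:, None].shape)       # (3,)
print(to_2d_column(obj).shape)  # (3, 1)
```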
This is also related to https://github.com/pandas-dev/pandas/issues/27125 (the fact that we can create an Index with >1 dimensional array).
For a 0.25.1 bugfix release, I would propose to again start returning the 2D shape.
I opened a PR for what I proposed above: https://github.com/pandas-dev/pandas/pull/27818
I think for pandas it is fine to output an "invalid" (2D) shape as long as we allow constructing "invalid" Index objects. We should fix that second issue though, for which there is https://github.com/pandas-dev/pandas/issues/27125