Setup:
import pandas as pd

index = pd.Index(['PCE']*4, name='Variable')
data = [
    pd.Period('2018Q2'),
    pd.Period('2021', freq='5A-Dec'),
    pd.Period('2026', freq='10A-Dec'),
    pd.Period('2017Q2')
]
ser = pd.Series(data, index=index, name='Period')
In the real-life version of this issue, 'Period' is a column in a DataFrame and I need to append it as a new level to the index. The snippets here show the problem(s) in both py2 and py3, but for reasons unknown df.set_index('Period', append=True) goes through fine in py2.
The large majority of Period values are quarterly-frequency.
py2
>>> pd.__version__
'0.20.2'
>>> ser.sort_values()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1710, in sort_values
argsorted = _try_kind_sort(arr[good])
File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
return arr.argsort(kind=kind)
File "pandas/_libs/period.pyx", line 725, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11842)
pandas._libs.period.IncompatibleFrequency: Input has different freq=10A-DEC from Period(freq=Q-DEC)
>>> ser.to_frame()
Period
Variable
PCE 2018Q2
PCE 2021
PGDP 2026
PGDP 2017Q2
>>> ser.to_frame().set_index('Period', append=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2836, in set_index
index = MultiIndex.from_arrays(arrays, names=names)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/multi.py", line 1100, in from_arrays
labels, levels = _factorize_from_iterables(arrays)
File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2193, in _factorize_from_iterables
return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2165, in _factorize_from_iterable
cat = Categorical(values, ordered=True)
File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 310, in __init__
raise NotImplementedError("> 1 ndim Categorical are not "
NotImplementedError: > 1 ndim Categorical are not supported at this time
No idea why it thinks Categorical is relevant here. That doesn't happen in py3.
For the purposes of sort_values, refusing to sort might make sense. But when all I care about is set_index, I'm pretty indifferent to the ordering.
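If ordering really doesn't matter, one possible stopgap is to turn the periods into plain string labels before calling set_index, so nothing ever has to compare a quarterly Period against a non-quarterly one. This is only a sketch under the assumption that string labels are acceptable in the index; it is not what pandas does internally, and the monthly period below is a stand-in for the mismatched frequencies in the real data.

```python
import pandas as pd

index = pd.Index(["PCE"] * 3, name="Variable")
data = [
    pd.Period("2018Q2"),             # quarterly
    pd.Period("2017-06", freq="M"),  # monthly, deliberately mismatched
    pd.Period("2017Q2"),             # quarterly
]
df = pd.DataFrame({"Period": data}, index=index)

# Assumption: string labels are good enough for the new index level.
# str() never compares two Periods, so mixed frequencies cannot raise here.
labelled = df.assign(Period=[str(p) for p in df["Period"]])
result = labelled.set_index("Period", append=True)

print(result.index.nlevels)
print(result.index.get_level_values("Period").tolist())
```

The obvious cost is that the new level holds strings rather than Period objects, so period arithmetic on that level is lost.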
py3
>>> pd.__version__
'0.20.2'
>>> ser.sort_values()
pandas._libs.period.IncompatibleFrequency: Input has different freq=Q-DEC from Period(freq=5A-DEC)
During handling of the above exception, another exception occurred:
SystemError: <built-in function isinstance> returned a result with an error set
[...]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1710, in sort_values
argsorted = _try_kind_sort(arr[good])
File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
return arr.argsort(kind=kind)
File "pandas/_libs/period.pyx", line 723, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11713)
File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 375, in __ne__
return not self == other
File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 364, in __eq__
if isinstance(other, compat.string_types):
SystemError: <built-in function isinstance> returned a result with an error set
>>> ser.to_frame().set_index('Period', append=True)
pandas._libs.period.IncompatibleFrequency: Input has different freq=Q-DEC from Period(freq=5A-DEC)
During handling of the above exception, another exception occurred:
SystemError: <built-in function isinstance> returned a result with an error set
[...]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2836, in set_index
index = MultiIndex.from_arrays(arrays, names=names)
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 1100, in from_arrays
labels, levels = _factorize_from_iterables(arrays)
File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 2193, in _factorize_from_iterables
return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 2193, in <listcomp>
return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 2165, in _factorize_from_iterable
cat = Categorical(values, ordered=True)
File "/usr/local/lib/python3.5/site-packages/pandas/core/categorical.py", line 298, in __init__
codes, categories = factorize(values, sort=True)
File "/usr/local/lib/python3.5/site-packages/pandas/core/algorithms.py", line 567, in factorize
assume_unique=True)
File "/usr/local/lib/python3.5/site-packages/pandas/core/algorithms.py", line 486, in safe_sort
sorter = values.argsort()
File "pandas/_libs/period.pyx", line 723, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11713)
File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 375, in __ne__
return not self == other
File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 364, in __eq__
if isinstance(other, compat.string_types):
SystemError: <built-in function isinstance> returned a result with an error set
I have no idea what to make of this.
A problem that I have not been able to replicate with a copy/pasteable subset of the data:
>>> mi = pd.MultiIndex.from_arrays([period.index, period])
>>> mi
[... prints roughly what we'd expect...]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 800, in shape
return self._values.shape
File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 860, in _values
return self.values
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 667, in values
self._tuples = lib.fast_zip(values)
File "pandas/_libs/lib.pyx", line 549, in pandas._libs.lib.fast_zip (pandas/_libs/lib.c:10513)
ValueError: all arrays must be same length
>>> mi.names
FrozenList(['Variable', None])
>>> mi[0]
('CPROF', 'Period')
>>> mi[1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 1377, in __getitem__
if lab[key] == -1:
IndexError: index 1 is out of bounds for axis 0 with size 1
AFAICT it took the _name_ 'Period' and made that the only value in the new level of the MultiIndex. Really no idea what's going on here.
Yikes! I can't really follow your last code snippet (not sure what period is). That being said, it does appear that all of your issues are stemming from a frequency incompatibility one way or the other.
No idea why it thinks Categorical is relevant here. That doesn't happen in py3.
Yeah...that does look a little weird. Can you try first upgrading to 0.20.3 and see if anything changes on that end? If not, then we most certainly should improve the error message.
(not sure what period is)
That's the real-life column (2770 rows). Looks a lot like ser from the snippet, but I haven't figured out a snippet that demonstrates the problem. It looks like it's caused by passing a single-column DataFrame to from_arrays:
index = pd.Index(['CPROF', 'HOUSING', 'INDPROD', 'NGDP', 'PGDP'])
data = [pd.Period('1968Q4')]*5
df = pd.DataFrame(data, index=index, columns=['Period'])
mi = pd.MultiIndex.from_arrays([df.index, df])
>>> mi
MultiIndex(levels=[['CPROF', 'HOUSING', 'INDPROD', 'NGDP', 'PGDP'], ['Period']],
labels=[[0, 1, 2, 3, 4], [0]])
>>> mi.shape
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 800, in shape
return self._values.shape
File "/usr/local/lib/python3.5/site-packages/pandas/core/base.py", line 860, in _values
return self.values
File "/usr/local/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 667, in values
self._tuples = lib.fast_zip(values)
File "pandas/_libs/lib.pyx", line 549, in pandas._libs.lib.fast_zip (pandas/_libs/lib.c:10513)
ValueError: all arrays must be same length
On the plus side, it's clearly a user error (this guy). Ideally it'd be caught in __init__ though.
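For anyone hitting the same thing, here is a rough sketch of the corrected call: pass the column itself (a Series) rather than the one-column DataFrame. The check_arrays_1d helper is hypothetical, just an illustration of the kind of up-front check that could catch this; it is not an existing pandas function.

```python
import pandas as pd

index = pd.Index(["CPROF", "HOUSING", "INDPROD", "NGDP", "PGDP"])
df = pd.DataFrame({"Period": [pd.Period("1968Q4")] * 5}, index=index)


def check_arrays_1d(arrays):
    """Hypothetical guard: reject any input that is not one-dimensional."""
    for i, arr in enumerate(arrays):
        ndim = getattr(arr, "ndim", 1)  # plain lists are treated as 1-D
        if ndim != 1:
            raise ValueError(
                "array {} has ndim={}; expected 1-D input".format(i, ndim)
            )
    return arrays


# df["Period"] is a Series (1-D); df itself is 2-D and would be rejected.
mi = pd.MultiIndex.from_arrays(check_arrays_1d([df.index, df["Period"]]))
print(mi.nlevels, len(mi))
```

With the DataFrame passed in place of the Series, the guard raises a ValueError immediately instead of building a MultiIndex whose levels have mismatched lengths.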
Can you try first upgrading to 0.20.3 and see if anything changes on that end?
Unchanged.
I still don't get why MultiIndex.from_arrays needs to go through Categorical, but a partial fix can be made in Categorical.__init__:
if categories is None:
    try:
        codes, categories = factorize(values, sort=True)
    except TypeError:
        codes, categories = factorize(values, sort=False)
        if ordered:
            # raise, as we don't have a sortable data structure and so
            # the user should give us one by specifying categories
            raise TypeError("'values' is not ordered, please "
                            "explicitly specify the categories order "
                            "by passing in a categories argument.")
    except ValueError:
        # FIXME
        raise NotImplementedError("> 1 ndim Categorical are not "
                                  "supported at this time")
Especially in py3, we expect unsortable errors to be TypeErrors, but the error that gets raised when trying to compare Periods with different frequencies is _libs.period.IncompatibleFrequency, which subclasses ValueError.
Having the except TypeError: above also catch IncompatibleFrequency gets us one step closer to correctness. But then it raises immediately because ordered is True here. Any idea why MultiIndex.from_arrays is requiring an ordered Categorical?
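To make the except-clause point concrete, here is a toy, pandas-free sketch. FakePeriod and IncompatibleFrequencyError are hypothetical stand-ins, not the real pandas classes; the only property they share with the real ones is that the comparison error subclasses ValueError, as it does in 0.20.

```python
class IncompatibleFrequencyError(ValueError):
    """Stand-in for pandas' IncompatibleFrequency (hypothetical name)."""


class FakePeriod:
    """Toy period: an ordinal plus a frequency string. Not pandas' Period."""

    def __init__(self, ordinal, freq):
        self.ordinal = ordinal
        self.freq = freq

    def __lt__(self, other):
        if self.freq != other.freq:
            raise IncompatibleFrequencyError(
                "different freq={} from {}".format(other.freq, self.freq)
            )
        return self.ordinal < other.ordinal


def classify_sort_failure(values, unsortable=(TypeError,)):
    """Mimic the try/except shape of the Categorical.__init__ snippet."""
    try:
        sorted(values)
    except unsortable:
        return "fallback"      # the unsorted-factorize branch
    except ValueError:
        return "misreported"   # the '> 1 ndim' branch
    return "sorted"


mixed = [FakePeriod(1, "Q"), FakePeriod(2, "A")]
print(classify_sort_failure(mixed))
print(classify_sort_failure(mixed, (TypeError, IncompatibleFrequencyError)))
```

With the default clause the mixed-frequency failure lands in the ValueError branch, which matches the misleading "> 1 ndim" message above. Widening the first clause routes it to the fallback branch instead.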
¯\_(ツ)_/¯
Feel free to experiment and see what happens when you loosen that restriction 😄
There are more effective approaches than trial and error. Someone somewhere knows why this decision was made in the first place.
Someone somewhere knows why this decision was made in the first place.
Perhaps, but I'm assuming worst case in that we don't remember anymore why that is the case.
I still don't get why MultiIndex.from_arrays needs to go through Categorical, but a partial fix can be made in Categorical.__init__:
well, you need to factorize things when you construct a MI.
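For readers wondering what "factorize" buys you here, a small sketch using the public pd.factorize: each array is split into integer codes and a set of unique values, which appear to be the "labels" and "levels" seen in the from_arrays traceback above. With sort=True the uniques have to be sorted, and that sort is where the comparison between values happens. The sample values are made up for illustration.

```python
import pandas as pd

variables = pd.Index(["PGDP", "PCE", "PGDP", "PCE"])

# codes index into uniques; sort=True asks for the uniques in sorted order.
codes, uniques = pd.factorize(variables, sort=True)
print(codes.tolist())
print(list(uniques))

# The original values can be rebuilt from the two pieces.
rebuilt = [uniques[c] for c in codes]
print(rebuilt)
```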
Not really sure what this issue is about, it has gone off on tangents. Can you provide a narrow, clear example?
@gfyoung don't tag things until it is clear what they are.
Not really sure what this issue is about, it has gone off on tangents. Can you provide a narrow, clear example?
1) Period.__richcmp__ currently raises IncompatibleFrequency when trying to compare periods with unequal frequencies. This breaks things that do sorting under the hood. It should be changed.
Setup:
index = pd.Index(['PCE']*4, name='Variable')
data = [
    pd.Period('2018Q2'),
    pd.Period('2021', freq='5A-Dec'),
    pd.Period('2026', freq='10A-Dec'),
    pd.Period('2017Q2')
]
ser = pd.Series(data, index=index, name='Period')
Clear Error:
>>> ser.sort_values()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1710, in sort_values
argsorted = _try_kind_sort(arr[good])
File "/usr/local/lib/python2.7/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
return arr.argsort(kind=kind)
File "pandas/_libs/period.pyx", line 725, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11842)
pandas._libs.period.IncompatibleFrequency: Input has different freq=10A-DEC from Period(freq=Q-DEC)
Incorrect Error Message (I think because IncompatibleFrequency subclasses ValueError and not TypeError)
>>> ser.to_frame().set_index('Period', append=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2836, in set_index
index = MultiIndex.from_arrays(arrays, names=names)
File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/multi.py", line 1100, in from_arrays
labels, levels = _factorize_from_iterables(arrays)
File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2193, in _factorize_from_iterables
return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 2165, in _factorize_from_iterable
cat = Categorical(values, ordered=True)
File "/usr/local/lib/python2.7/site-packages/pandas/core/categorical.py", line 310, in __init__
raise NotImplementedError("> 1 ndim Categorical are not "
NotImplementedError: > 1 ndim Categorical are not supported at this time
py3
>>> ser.sort_values()
pandas._libs.period.IncompatibleFrequency: Input has different freq=Q-DEC from Period(freq=5A-DEC)
During handling of the above exception, another exception occurred:
SystemError: <built-in function isinstance> returned a result with an error set
[...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1710, in sort_values
argsorted = _try_kind_sort(arr[good])
File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 1696, in _try_kind_sort
return arr.argsort(kind=kind)
File "pandas/_libs/period.pyx", line 723, in pandas._libs.period._Period.__richcmp__ (pandas/_libs/period.c:11713)
File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 375, in __ne__
return not self == other
File "/usr/local/lib/python3.5/site-packages/pandas/tseries/offsets.py", line 364, in __eq__
if isinstance(other, compat.string_types):
SystemError: <built-in function isinstance> returned a result with an error set
This results in the same error:
In [2]: pd.Index([pd.Timestamp('2000-01-03 00:00:00', freq='B'),
pd.Period('2000-01-03', 'B'),
pd.Period('2000-01-03', 'B')]).sort_values()
[...]
SystemError: <built-in function isinstance> returned a result with an error set