Pandas: Can't use dt.ceil, dt.floor, dt.round with coarser target frequencies

Created on 4 Feb 2017 · 5Comments · Source: pandas-dev/pandas

Code Sample

This works fine:

In [16]: hours = pd.Series(pd.date_range('2017-01-01', '2017-01-31', freq='H'))

In [17]: hours.dt.ceil('D')
Out[17]: 
0     2017-01-01
1     2017-01-02
2     2017-01-02
3     2017-01-02
4     2017-01-02
...

But this doesn't:

In [18]: days = pd.Series(pd.date_range('2017-01-01', '2017-01-31', freq='D'))

In [19]: days.dt.ceil('M')
...
/home/andrew/git/pandas-rosnfeld-py3/pandas/tseries/offsets.py in nanos(self)
    510     @property
    511     def nanos(self):
--> 512         raise ValueError("{0} is a non-fixed frequency".format(self))
    513 
    514 

ValueError: <MonthEnd> is a non-fixed frequency

Problem description

I haven't done an exhaustive test, but it seems that one can't use coarser frequencies like 'M', 'Q', 'A' in the dt.ceil/floor/round set of methods.

Expected Output

It would seem reasonable to be able to take a series of scattered dates and be able to "coarsen" them into month-ends, quarter-ends, year-ends, etc.

Output of `pd.show_versions()`

(admittedly some pretty old stuff in this environment but I also got the same behavior on 0.19.2 with latest of everything else)

INSTALLED VERSIONS

commit: 74e20a0c4146919a82312c1027dcb57b48b613e3
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-62-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.0+300.g74e20a0.dirty
nose: 1.3.7
pip: 7.1.2
setuptools: 19.1.1
Cython: 0.23.4
numpy: 1.10.2
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.2
lxml: 3.5.0
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: 0.6.7.None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_datareader: None

Enhancement Frequency Timeseries

Source

rosnfeld

👍18

Most helpful comment

Here is an alternative and probably more correct.

In [3]: s.dt.to_period('M').dt.to_timestamp()
Out[3]: 
DatetimeIndex(['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
               ...
               '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01'],
              dtype='datetime64[ns]', length=721, freq=None)

The issue is rounding is pretty tricky and I would argue unexpected for Months. You can easily have round up on the 15th for some months and down for others. You could argue that .ceil/.floor are unambiguous though.

If you want to implement something reasonable would be fine. (you could prob just do the .to_period transform under the hood).

jreback on 4 Feb 2017

👍3

All 5 comments

Here is an alternative and probably more correct.

In [3]: s.dt.to_period('M').dt.to_timestamp()
Out[3]: 
DatetimeIndex(['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
               ...
               '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01'],
              dtype='datetime64[ns]', length=721, freq=None)

If you want to implement something reasonable would be fine. (you could prob just do the .to_period transform under the hood).

jreback on 4 Feb 2017

👍3

Taking a look at this: if the date is '2017-01-23 12:34:56', what should floor and ceil return for a frequency like 'M' (MonthEnd)? If you're doing .dt.to_period('M').dt.to_timestamp(), then you get '2017-01-01' - but that's not a MonthEnd value. Similarly if you took the end_time of that period, you'd get '2017-01-31 23:59:59.999999999', also not a MonthEnd. I think you actually want '2016-12-31 00:00:00' (for floor) and '2017-01-31 00:00:00' (for ceil) - right? So I am not sure that to_period is the key here.

While I wrote this issue for Series, upon playing with this further I've noticed that it also affects DatetimeIndex and individual Timestamp values. I have a fix for individual values (e.g. pd.to_datetime('2017-01-23 12:34:56').ceil('M') now behaves as I think is right in the paragraph above), but I'm actually not sure how to fix the Series/DatetimeIndex case.

My patch currently affects the Timestamp._round() code in tslib.pyx, and goes something like the following, using the offset rollforward/rollback to stay on-offset for at least the floor and ceil cases:

        from pandas.tseries.offsets import Tick
        offset = to_offset(freq)

        if not isinstance(offset, Tick):
            offset.normalize = True

            if rounder == np.ceil:
                return offset.rollforward(self)

            if rounder == np.floor:
                return offset.rollback(self)

            raise ValueError('Can only round to Tick offsets')

There is a very similar TimelikeOps._round() method in datetimelike.py, but I can't do the same change there, as the offset rollforward/rollback methods only work on Timestamp. Suggestions are welcome - not sure how one normally "scales up" more complicated individual operations like this to Series in a performant way. Also not sure if there is a cleaner/more obvious way to do what I've done in the single Timestamp case.

rosnfeld on 22 Jul 2017

@rosnfeld

This is how you work with dti/Series; its very similar.

In [10]: i = pd.to_datetime(['2017-01-23 12:34:56'])

In [11]: i
Out[11]: DatetimeIndex(['2017-01-23 12:34:56'], dtype='datetime64[ns]', freq=None)

In [12]: i.normalize() + pd.offsets.MonthBegin()
Out[12]: DatetimeIndex(['2017-02-01'], dtype='datetime64[ns]', freq=None)

In [13]: i.normalize() + pd.offsets.MonthEnd()
Out[13]: DatetimeIndex(['2017-01-31'], dtype='datetime64[ns]', freq=None)

jreback on 6 Sep 2017

👍2

(got motivated to come back to this issue as I hit it again...)

Thanks @jreback - I am very close now. Only issue is that adding the offset isn't idempotent/doesn't handle the exact boundary properly. That is, in the single Timestamp case, timestamp.ceil('M').ceil('M') the second application correctly does nothing, but in the series case series.dt.ceil('M').dt.ceil('M') rolls the month again, since it adds another offset.

offset.rollforward(timestamp) is smart enough to call offset.onOffset(timestamp) to catch this case. I guess I need a series equivalent that somehow incorporates this kind of logic... but may be tricky to do in an efficient way.

rosnfeld on 19 Nov 2017

If you just want to roll forward then i.normalize() + 0*pd.offsets.MonthBegin() is an idempotent operation.