I'm using resample()
to aggregate data over a time frame.
>>> s = pandas.Series((3, 4), (pandas.Timestamp("2014-1-1"), pandas.Timestamp("2015-1-1")))
>>> x = s.resample("1s")
>>> len(x)
31536001
When doing such a call, resample
fills all the (31536001 - 2) nonexistent values with NaN, which ends up creating millions of points and makes Python use 500M+ RAM. The thing is that I don't care about the NaN points, so I would like to _not_ fill them in the Series
and avoid using so much memory. AFAICS resample
does not offer such a fill_method
.
this is trivially done with
s.resample('1s').dropna()
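For reference, in later pandas versions resample() returns a Resampler object, so an explicit asfreq() is needed before dropna(); note this still materializes every empty bin first. A small sketch, shrunk to a one-minute span so it stays cheap:

```python
import pandas as pd

s = pd.Series(
    [3, 4],
    index=[pd.Timestamp("2014-01-01 00:00:00"), pd.Timestamp("2014-01-01 00:01:00")],
)

# asfreq() materializes all 61 one-second bins (mostly NaN),
# then dropna() throws the empty ones away again.
result = s.resample("1s").asfreq().dropna()
print(len(result))  # 2
```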
@jreback Sure, but that still makes pandas use 500M+ RAM.
Things are actually worse for something like this:
>>> s = pandas.Series(range(1000), pandas.date_range('2014-1-1', periods=1000))
>>> s.resample('1s', how='median')
The result is pretty obvious and should take less than one second to compute. But since pandas resamples with millions of points filled with NaN, it takes 10+ minutes to compute and uses 1.5G+ RAM.
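For comparison, grouping only on the observed timestamps (floored to the bin boundary) computes the same per-bin medians without materializing the empty bins. A sketch using DatetimeIndex.floor, which was added in later pandas versions:

```python
import pandas as pd

s = pd.Series(range(1000), index=pd.date_range("2014-01-01", periods=1000))

# Only the 1000 observed points are grouped; no empty 1-second bins are created.
sparse_median = s.groupby(s.index.floor("1s")).median()
print(len(sparse_median))  # 1000
```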
you need to show what you are going to do with this
why are you resampling such a large period? what's the point
I don't understand why the reason is important here.
I treat statistics over a year with fine grained resolution up to the second.
a use case/example of what you are doing would be helpful to understand the goal. e.g. are you merely snapping things to a close freq? or are you doing an actual resample but with sparse data?
the _point_ of resample IS to give you a full-rank set for every point in time.
@jd does snap cover your needs?
@jreback Yeah, I'm doing a resampling (aggregation) using sparse data, but I'm not only snapping – there might be several values in a time bucket.
@TomAugspurger It does not, though I was looking for such a function anyway so thanks for the pointer :)
@jd
so the API for this could be one of the following:
Series.resample(...., fill_method='sparse') (which could return a sparse Series)
Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
SparseSeries(....).resample(....) (which would by definition return a sparse Series)
none of these are implemented, but potentially useful.
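For what it's worth, the fill_method='drop' semantics can be sketched today with a small helper; the name resample_drop and the use of DatetimeIndex.floor are my own choices here, not an existing pandas API:

```python
import pandas as pd

def resample_drop(s, freq, how="sum"):
    # Hypothetical helper approximating fill_method='drop': aggregate into
    # freq-sized bins, but keep only bins containing at least one value.
    grouped = s.groupby(s.index.floor(freq))
    return getattr(grouped, how)()

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(["2014-01-01", "2014-01-01", "2014-06-01"]))
print(resample_drop(s, "1D"))  # only the two observed days appear
```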
want to take a stab at this?
@jreback Hmm, I have trouble figuring out the difference between sparse and drop, but I'm sure I would need one of those.
I don't think I know enough of pandas internals to implement that right now, but I'd be happy to test it and report back if you write some code. :)
Thanks a lot!
Here's a way to do a sparse resample. A resample is just a groupby. So we are grouping by a combination of the date, and a second (e.g. to give us a freq of 's'). This only groups by the representative elements in the input index (e.g. your sparse points), so this at most generates n groups where n is the length of the set.
In [24]: pd.options.display.max_rows=12
In [25]: np.random.seed(1234)
In [26]: i = date_range('20140101',periods=100000000,freq='s')
In [27]: s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))
In [28]: s
Out[28]:
2014-07-19 19:23:27 0
2014-09-23 09:00:37 1
2014-01-12 07:42:04 2
2016-07-01 18:35:32 3
2014-07-24 05:43:52 4
2016-11-27 19:22:33 5
...
2015-01-04 08:11:55 994
2016-09-26 02:24:18 995
2016-03-01 18:48:41 996
2014-03-15 01:18:10 997
2015-07-14 06:45:46 998
2016-03-09 08:47:23 999
dtype: int64
In [29]: s.groupby([s.index.date,s.index.second]).sum()
Out[29]:
2014-01-03 9 52
50 414
2014-01-05 31 535
2014-01-07 16 937
2014-01-08 28 164
2014-01-12 4 2
...
2017-02-27 39 174
2017-03-01 5 536
26 760
31 208
51 488
2017-03-03 7 810
dtype: int64
That really looks like a good way of approaching the solution. I probably lack knowledge about pandas usage to understand how to map the groupby
result to something closer to the output of resample
, but it does look like that indeed. I see the result has an index and 2 columns; not sure what the first column is for. Also the index has only dates, not timestamps. I guess it's just a matter of accessing and presenting the data in the same way as resample afterwards.
Also the by
argument for groupby is really not obvious to me; does it make it possible to group by e.g. 3 minutes?
yes this is a multi-groupby. It's a bit tricky as the default way of doing this will create all of the groups, but here you only want _some_. So this could be built into resample
as I proposed above. The idea is that you map your index (even using a function is enough).
You are essentially rounding the value to whatever interval you want, so here's sort of a trivial way to do this:
xref #4314 (e.g. we should simply define this on a Timestamp
)
In [25]: def round(t, freq):
   ....:     # round a Timestamp down to a multiple of the specified freq
   ....:     # (note the integer division, needed on Python 3)
   ....:     return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
   ....:
In [26]: from functools import partial
In [27]: s.groupby(partial(round,freq=pd.offsets.Minute(3))).sum()
Out[27]:
2014-01-03 02:12:00 52
2014-01-03 11:00:00 414
2014-01-05 20:39:00 535
2014-01-07 07:03:00 937
2014-01-08 03:09:00 164
2014-01-12 07:42:00 2
...
2017-02-27 20:27:00 214
2017-03-01 04:51:00 488
2017-03-01 12:06:00 536
2017-03-01 15:45:00 208
2017-03-01 19:06:00 760
2017-03-03 00:06:00 810
dtype: int64
In [28]: s.groupby(partial(round,freq=pd.offsets.Minute(5))).sum()
Out[28]:
2014-01-03 02:10:00 52
2014-01-03 11:00:00 414
2014-01-05 20:40:00 535
2014-01-07 07:00:00 937
2014-01-08 03:10:00 164
2014-01-12 07:40:00 2
...
2017-02-27 20:25:00 214
2017-03-01 04:50:00 488
2017-03-01 12:05:00 536
2017-03-01 15:45:00 208
2017-03-01 19:05:00 760
2017-03-03 00:05:00 810
dtype: int64
actually if you would like to add this to the timeseries.rst docs under the resample section that would be great
can add it as an actual method at some point later
+1 for fill_method='drop'
Any updates here?
Still open if you want to take a shot at fixing it!
You can also try this:
def resample(df, column, resolution, how='last'):
    # make sure the time column is the index, then floor it to the resolution
    if not isinstance(df.index, pd.DatetimeIndex):
        df.set_index(column, inplace=True)
    df.index = df.index.floor(resolution)
    # group on the (named) index and apply the requested aggregation
    return getattr(df.groupby(column), how)()
Examples:
resample (balances, 'timestamp', '1s', 'last')
resample (trades[['timestamp', 'amount']], 'timestamp', '1s', 'sum')
It is better to have fill_method='drop'
, because missing data should not be filled with zero by default. This caused a huge problem in my project.
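To illustrate the behavior being described: with a plain .sum(), empty bins become 0, while min_count=1 turns them into NaN instead (small sketch):

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(["2019-01-01", "2019-01-05"]))

# Default: the three empty 1-day bins between the two points become 0.
print(s.resample("1D").sum())

# With min_count=1, bins with fewer than one value become NaN instead.
print(s.resample("1D").sum(min_count=1))
```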
@jreback You suggested (a long time ago in this thread):
s.resample('1s').dropna()
How would you recommend doing this nowadays? The way you suggested does not seem to work with the latest pandas version.
I would like to have a series resampled/grouped by '2D'
and then .sum()
the results, ending up with only the dates for which there was some data in the original series (i.e., I would rather have NaN
if there is no data to sum, so that I can drop those values later).
An example:
>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0, 1, 2, 3])
>>> series
2019-01-01 0
2019-01-05 1
2019-01-06 2
2019-01-09 3
dtype: int64
>>> series.resample('2D').sum()
2019-01-01 0
2019-01-03 0
2019-01-05 3
2019-01-07 0
2019-01-09 3
Freq: 2D, dtype: int64
Instead of the result I got, I was looking for:
2019-01-01 0
2019-01-05 3
2019-01-09 3
Freq: 2D, dtype: int64
Or at least (I don't mind if the full resampled index gets filled in memory):
2019-01-01 0
2019-01-03 NaN
2019-01-05 3
2019-01-07 NaN
2019-01-09 3
Freq: 2D, dtype: int64
So I came up with this, using min_count:
>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
It still fills the empty spaces with NaNs, but it gets the job done for short series which easily fit in memory.
@TomAugspurger @jreback
Still open if you want to take a shot at fixing it!
Could you give me a few pointers how to start? I would like to try implementing
Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
Hmm, right now I don't really see how to do this cleanly. The Resampler is created in
https://github.com/pandas-dev/pandas/blob/37f29e5bf967d76d76a05e84b5c9b14c9bc66f23/pandas/core/generic.py#L8404
However, the fill_method
parameter only enters in the following function call.
https://github.com/pandas-dev/pandas/blob/37f29e5bf967d76d76a05e84b5c9b14c9bc66f23/pandas/core/generic.py#L8418
Furthermore it is deprecated in favor of
.resample().method().fillmethod()
Making the Resampler dependent on fill_method would only reintroduce the coupling into the code. And fill_method = 'drop'
sounds like a contradiction anyways.
I think it might be cleaner to bypass resample
and call round()
and groupby
as proposed above. It would however be nice to have a reference to this in resample.fillna()
.
Do you share this opinion? Is there somewhere else where this might be implemented? Wouldn't some way to group by approximate values, i.e. .groupby(round = '1s')
, be useful in other cases as well?
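For what it's worth, something close to that hypothetical .groupby(round='1s') can be written today by rounding the index itself and grouping on the result (a sketch):

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime(
        ["2014-01-01 00:00:00.2", "2014-01-01 00:00:00.4", "2014-01-01 00:00:05.0"]
    ),
)

# Group by the index rounded to whole seconds; only observed bins appear.
print(s.groupby(s.index.round("1s")).sum())
```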
Sorry, I'm not familiar with this section of the code. I don't know ahead of time what the best approach is.
So I tried this out by simply grouping by the rounded value. This works perfectly fine without intermediate objects. Below is a comparison with the workaround of @Peque. I would like to mention this (or a similar) example in the documentation of resample. I think I am not the only one who was misled into thinking that it is the correct function to group sparse data.
import numpy as np
import pandas as pd
from pandas import Series
np.random.seed(41)
i = pd.date_range('20140101',periods=100000000,freq='s')
s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))
%timeit s.resample('1H').sum(min_count=1).dropna()
2.4 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s.groupby(s.index.floor('1H')).sum()
1.67 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s.groupby(s.index.round('1H')).sum()
1.72 ms ± 9.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since resample
no longer has a fill_method
, I don't think there's much left to implement directly here. If there's a related need using ffill
or bfill
, a new issue can be opened regarding that. Closing.