I'm using resample()
to aggregate data over a time frame.
>>> s = pandas.Series((3, 4), (pandas.Timestamp("2014-1-1"), pandas.Timestamp("2015-1-1")))
>>> x = s.resample("1s")
>>> len(x)
31536001
When doing such a call, resample
fills all the (31536001 - 2) nonexistent values with NaN, which ends up creating millions of points and makes Python use 500M+ RAM. The thing is that I don't care about the NaN points, so I would like to _not_ fill them in the Series
and avoid using so much memory. AFAICS resample
does not offer such a fill_method
.
this is trivially done with
s.resample('1s').dropna()
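For reference, in later pandas versions resample() returns a Resampler object, so an explicit asfreq() is needed before dropna(); note this still materializes every empty bin first. A small sketch, shrunk to a one-minute span so it stays cheap:

```python
import pandas as pd

s = pd.Series(
    [3, 4],
    index=[pd.Timestamp("2014-01-01 00:00:00"), pd.Timestamp("2014-01-01 00:01:00")],
)

# asfreq() materializes all 61 one-second bins (mostly NaN),
# then dropna() throws the empty ones away again.
result = s.resample("1s").asfreq().dropna()
print(len(result))  # 2
```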
@jreback Sure, but that still makes pandas use 500M+ RAM.
Things are actually worse for something like this:
>>> s = pandas.Series(range(1000), pandas.date_range('2014-1-1', periods=1000))
>>> s.resample('1s', how='median')
The result is pretty obvious and should take less than one second to compute. But since pandas resamples with millions of points filled with NaN, it takes 10+ minutes to compute and uses 1.5G+ RAM.
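For comparison, grouping only on the observed timestamps (floored to the bin boundary) computes the same per-bin medians without materializing the empty bins. A sketch using DatetimeIndex.floor, which was added in later pandas versions:

```python
import pandas as pd

s = pd.Series(range(1000), index=pd.date_range("2014-01-01", periods=1000))

# Only the 1000 observed points are grouped; no empty 1-second bins are created.
sparse_median = s.groupby(s.index.floor("1s")).median()
print(len(sparse_median))  # 1000
```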
you need to show what you are going to do with this
why are you resampling such a large period? what's the point
I don't understand why the reason is important here.
I treat statistics over a year with fine grained resolution up to the second.
a use case/example of what you are doing would be helpful to understand the goal. e.g. are you merely snapping things to a close freq? or are you doing an actual resample but with sparse data?
the _point_ of resample IS to give you a full-rank set for every point in time.
@jd does snap cover your needs?
@jreback Yeah, I'm doing a resampling (aggregation) using sparse data, but I'm not only snapping – there might be several values in a time bucket.
@TomAugspurger It does not, though I was looking for such a function anyway so thanks for the pointer :)
@jd
so the API for this could be one of the following:
Series.resample(...., fill_method='sparse') (which could return a sparse Series)
Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
SparseSeries(....).resample(....) (which would by definition return a sparse Series)
none of these are implemented, but potentially useful.
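For what it's worth, the fill_method='drop' semantics can be sketched today with a small helper; the name resample_drop and the use of DatetimeIndex.floor are my own choices here, not an existing pandas API:

```python
import pandas as pd

def resample_drop(s, freq, how="sum"):
    # Hypothetical helper approximating fill_method='drop': aggregate into
    # freq-sized bins, but keep only bins containing at least one value.
    grouped = s.groupby(s.index.floor(freq))
    return getattr(grouped, how)()

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(["2014-01-01", "2014-01-01", "2014-06-01"]))
print(resample_drop(s, "1D"))  # only the two observed days appear
```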
want to take a stab at this?
@jreback Hmm, I have trouble figuring out the difference between sparse and drop, but I'm sure I would need one of those.
I don't think I know enough of pandas internals to implement that right now, but I'd be happy to test it and report back if you write some code. :)
Thanks a lot!
Here's a way to do a sparse resample. A resample is just a groupby. So we are grouping by a combination of the date, and a second (e.g. to give us a freq of 's'). This only groups by the representative elements in the input index (e.g. your sparse points), so this at most generates n groups where n is the length of the set.
In [24]: pd.options.display.max_rows=12
In [25]: np.random.seed(1234)
In [26]: i = date_range('20140101',periods=100000000,freq='s')
In [27]: s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))
In [28]: s
Out[28]:
2014-07-19 19:23:27 0
2014-09-23 09:00:37 1
2014-01-12 07:42:04 2
2016-07-01 18:35:32 3
2014-07-24 05:43:52 4
2016-11-27 19:22:33 5
...
2015-01-04 08:11:55 994
2016-09-26 02:24:18 995
2016-03-01 18:48:41 996
2014-03-15 01:18:10 997
2015-07-14 06:45:46 998
2016-03-09 08:47:23 999
dtype: int64
In [29]: s.groupby([s.index.date,s.index.second]).sum()
Out[29]:
2014-01-03 9 52
50 414
2014-01-05 31 535
2014-01-07 16 937
2014-01-08 28 164
2014-01-12 4 2
...
2017-02-27 39 174
2017-03-01 5 536
26 760
31 208
51 488
2017-03-03 7 810
dtype: int64
That really looks like a good way of approaching the solution. I probably lack knowledge about pandas usage to understand how to map the groupby
result to something closer to the output of resample
, but it does look like that indeed. I see the result has an index and 2 columns; not sure what the first column is for. Also the index has only dates, not timestamps. I guess it's just a matter of accessing and presenting the data in the same way as resample afterwards.
Also the by
argument for groupby is really not obvious to me; does it make it possible to group by e.g. 3 minutes?
yes this is a multi-groupby. It's a bit tricky as the default way of doing this will create all of the groups, but here you only want _some_. So this could be built into resample
as I proposed above. The idea is that you map your index (even using a function is enough).
You are essentially rounding the value to whatever interval you want, so here's sort of a trivial way to do this:
xref #4314 (e.g. we should simply define this on a Timestamp
)
In [25]: def round(t, freq):
   ....:     # round a Timestamp down to a multiple of the specified freq
   ....:     # (note the integer division, needed on Python 3)
   ....:     return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
   ....:
In [26]: from functools import partial
In [27]: s.groupby(partial(round,freq=pd.offsets.Minute(3))).sum()
Out[27]:
2014-01-03 02:12:00 52
2014-01-03 11:00:00 414
2014-01-05 20:39:00 535
2014-01-07 07:03:00 937
2014-01-08 03:09:00 164
2014-01-12 07:42:00 2
...
2017-02-27 20:27:00 214
2017-03-01 04:51:00 488
2017-03-01 12:06:00 536
2017-03-01 15:45:00 208
2017-03-01 19:06:00 760
2017-03-03 00:06:00 810
dtype: int64
In [28]: s.groupby(partial(round,freq=pd.offsets.Minute(5))).sum()
Out[28]:
2014-01-03 02:10:00 52
2014-01-03 11:00:00 414
2014-01-05 20:40:00 535
2014-01-07 07:00:00 937
2014-01-08 03:10:00 164
2014-01-12 07:40:00 2
...
2017-02-27 20:25:00 214
2017-03-01 04:50:00 488
2017-03-01 12:05:00 536
2017-03-01 15:45:00 208
2017-03-01 19:05:00 760
2017-03-03 00:05:00 810
dtype: int64
actually if you would like to add this to the timeseries.rst docs under the resample section that would be great
can add it as an actual method at some point later
+1 for fill_method='drop'
Any updates here?
Still open if you want to take a shot at fixing it!
You can also try this:
def resample(df, column, resolution, how='last'):
    # make sure the time column is the index, then floor it to the resolution
    if not isinstance(df.index, pd.DatetimeIndex):
        df.set_index(column, inplace=True)
    df.index = df.index.floor(resolution)
    # group on the (named) index and apply the requested aggregation
    return getattr(df.groupby(column), how)()
Examples:
resample (balances, 'timestamp', '1s', 'last')
resample (trades[['timestamp', 'amount']], 'timestamp', '1s', 'sum')
It is better to have fill_method='drop'
, because missing data should not be filled with zero by default. This caused a huge problem in my project.
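To illustrate the behavior being described: with a plain .sum(), empty bins become 0, while min_count=1 turns them into NaN instead (small sketch):

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(["2019-01-01", "2019-01-05"]))

# Default: the three empty 1-day bins between the two points become 0.
print(s.resample("1D").sum())

# With min_count=1, bins with fewer than one value become NaN instead.
print(s.resample("1D").sum(min_count=1))
```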
@jreback You suggested (a long time ago in this thread):
s.resample('1s').dropna()
How would you recommend doing this nowadays? The way you suggested does not seem to work with the latest pandas version.
I would like to have a series resampled/grouped by '2D'
and then .sum()
the results, ending up with only the dates for which there was some data in the original series (i.e., I would rather have NaN
if there is no data to sum, so that I can drop those values later).
An example:
>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0, 1, 2, 3])
>>> series
2019-01-01 0
2019-01-05 1
2019-01-06 2
2019-01-09 3
dtype: int64
>>> series.resample('2D').sum()
2019-01-01 0
2019-01-03 0
2019-01-05 3
2019-01-07 0
2019-01-09 3
Freq: 2D, dtype: int64
Instead of the result I got, I was looking for:
2019-01-01 0
2019-01-05 3
2019-01-09 3
Freq: 2D, dtype: int64
Or at least (I don't mind if the full resampled index gets filled in memory):
2019-01-01 0
2019-01-03 NaN
2019-01-05 3
2019-01-07 NaN
2019-01-09 3
Freq: 2D, dtype: int64
So I came up with this, using min_count:
>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01 0.0
2019-01-05 3.0
2019-01-09 3.0
dtype: float64
It still fills the empty spaces with NaNs, but it gets the job done for short series which easily fit in memory.
@TomAugspurger @jreback
Still open if you want to take a shot at fixing it!
Could you give me a few pointers how to start? I would like to try implementing
Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
Hmm, right now I don't really see how to do this cleanly. The Resampler is created in
https://github.com/pandas-dev/pandas/blob/37f29e5bf967d76d76a05e84b5c9b14c9bc66f23/pandas/core/generic.py#L8404
However, the fill_method
parameter only enters in the following function call.
https://github.com/pandas-dev/pandas/blob/37f29e5bf967d76d76a05e84b5c9b14c9bc66f23/pandas/core/generic.py#L8418
Furthermore it is deprecated in favor of
.resample().method().fillmethod()
Making the Resampler dependent on fill_method would only reintroduce the coupling into the code. And fill_method = 'drop'
sounds like a contradiction anyways.
I think it might be cleaner to bypass resample
and call round()
and groupby
as proposed above. It would however be nice to have a reference to this in resample.fillna()
.
Do you share this opinion? Is there somewhere else where this might be implemented? Wouldn't some way to group by approximate values, i.e. .groupby(round = '1s')
, be useful in other cases as well?
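For what it's worth, something close to that hypothetical .groupby(round='1s') can be written today by rounding the index itself and grouping on the result (a sketch):

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime(
        ["2014-01-01 00:00:00.2", "2014-01-01 00:00:00.4", "2014-01-01 00:00:05.0"]
    ),
)

# Group by the index rounded to whole seconds; only observed bins appear.
print(s.groupby(s.index.round("1s")).sum())
```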
Sorry, I'm not familiar with this section of the code. I don't know ahead of time what the best approach is.
So I tried this out by simply grouping by the rounded value. This works perfectly fine without intermediate objects. Below is a comparison with the workaround of @Peque. I would like to mention this (or a similar) example in the documentation of resample. I think I am not the only one who was misled into thinking that it is the correct function to group sparse data.
import numpy as np
import pandas as pd
from pandas import Series
np.random.seed(41)
i = pd.date_range('20140101',periods=100000000,freq='s')
s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))
%timeit s.resample('1H').sum(min_count=1).dropna()
2.4 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s.groupby(s.index.floor('1H')).sum()
1.67 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s.groupby(s.index.round('1H')).sum()
1.72 ms ± 9.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since resample
no longer has a fill_method
, I don't think there's much left to implement directly here. If there's a related need using ffill
or bfill
, a new issue can be opened regarding that. Closing.