Pandas: PERF: `dt` accessor has massive performance drain when accessing date, time

Created on 16 Nov 2016 · 9 comments · Source: pandas-dev/pandas

A small, complete example of the issue

import pandas as pd
import numpy as np

df = pd.DataFrame({'tstamp' : pd.date_range('2016-11-15', periods=4*365*60, freq='T'),
                   'data' : np.random.rand(4*365*60)})
len(df)
## 87600

%%timeit
__ = df['tstamp'].dt.date
## 10 loops, best of 3: 128 ms per loop

%%timeit
__ = df['tstamp'].dt.time
## 10 loops, best of 3: 132 ms per loop

%%timeit
__ = df['tstamp'].dt.dayofyear
## 100 loops, best of 3: 3.04 ms per loop

%%timeit
__ = df['tstamp'].dt.day
## 100 loops, best of 3: 2.83 ms per loop

As demonstrated above, accessing date and time takes a very long time to compute. I do not know what is causing the bottleneck, but any speed-up would definitely be appreciated.

Also, accessing date and time requires more than double the memory of the DataFrame itself. I don't have a memory profiler working, but I can attest that my computer, with 30 GB of RAM available after OS use, can load a massive csv that consumes 10.2 GB in memory as a DataFrame. However, trying to access date from that DataFrame raises MemoryError: it fills up the remaining 19.8 GB of RAM trying to compute the date from a timestamp column. The DataFrame in question has 113,587,339 rows, with 5 columns of numeric data, one column of strings, and a datetime stamp column similar to the example above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Labels: Performance, Usage Question


All 9 comments

The date and time attributes return python datetime.date and datetime.time objects, while the other attributes return integers. This is certainly a reason why it is both slower and takes more memory.

Example of memory usage (in MB):

In [144]: df.memory_usage(deep=True, index=False) / (1024**2)
Out[144]: 
data      0.668335
tstamp    0.668335
dtype: float64

In [145]: df['tstamp'].dt.date.memory_usage(deep=True, index=False) / (1024**2)
Out[145]: 3.3416748046875

In [146]: df['tstamp'].dt.time.memory_usage(deep=True, index=False) / (1024**2)
Out[146]: 4.010009765625

In [147]: df['tstamp'].dt.dayofyear.memory_usage(deep=True, index=False) / (1024**2)
Out[147]: 0.6683349609375

So you can see that the datetime.date/time objects take a lot more memory compared to the numpy integers or floats (up to almost 6 times as much memory). This easily explains the MemoryError you see, since you only have 5 columns.

Speed-wise, the creation of the datetime.date/time objects is probably a reason for the slowdown as well, although I am not sure all of it can be attributed to this. You are welcome to look into it.

Anyway, if speed and memory are a concern, you are better off avoiding date and time, since you also cannot perform vectorized operations on those object columns anymore (see the dtype check below). Depending on your application, all information about the date and time is also available in the other attributes (day, month, hour, etc.).
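For illustration, a quick dtype check (a minimal sketch using the df from the original example, not output from the thread) makes the difference visible: the date/time accessors produce object-dtype Series holding python instances, while the integer attributes stay as plain numpy int64.

df['tstamp'].dt.date.dtype       # dtype('O') -- each element is a python datetime.date
df['tstamp'].dt.time.dtype       # dtype('O') -- each element is a python datetime.time
df['tstamp'].dt.dayofyear.dtype  # dtype('int64') -- plain numpy integers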

Further to @jorisvandenbossche's comments: one of the strengths of pandas is that data can be represented by a specific implementation with fast-path access for most operations, while retaining compatibility with python objects if needed. This is true for datetime, timedelta, period, and category dtypes (other dtypes map the interface directly to the implementation).

What this means is that a user must be aware of the performance/memory tradeoffs when asking for python objects.

@jorisvandenbossche Then how is it that storing the full datetime consumes just as much memory as storing the day of year? If a datetime can be represented as ints and floats, yet still be displayed as YYYY-MM-DD HH:MM:SS, then similarly dates, if not also times, could be stored as ints and floats and displayed correctly.

@jreback How much of an effort would it be to build pandas equivalents of the Python date and time types?

@dragonator4 you are missing the point. This already exists as Timestamp (which is actually a sub-class of datetime). It is only instantiated when needed. If you are actually using date and time directly as python objects then you are doing it wrong.

Timestamp is a replacement that is feature-full like datetime.datetime and performant like np.datetime64.
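As a minimal illustration (a sketch, not from the thread), a Timestamp passes isinstance checks against datetime.datetime while still exposing the fast datetime64-backed attributes:

import datetime
import pandas as pd

ts = pd.Timestamp('2016-11-15 08:30')
isinstance(ts, datetime.datetime)  # True -- Timestamp subclasses datetime.datetime
ts.date()                          # datetime.date(2016, 11, 15), created only on demand
ts.dayofyear                       # 320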

@jreback I have timestamped sensor data, and I need to group the data by day and sensor ID and fit models on each group. Since the data spans more than a year, I need something robust to form the groups on. Date is perfect, but it has the performance and memory drawbacks we discussed. As a workaround, I can form groups on year and day of year, though that requires an extra step of recovering the date from them afterwards (a sketch of that step follows the code below). Can you recommend a better way?

Ideally, I would like to do this:

df['date'] = df['tstamp'].dt.date
res = df.groupby(['sens_id', 'date']).apply(modeler)

Now, I would do this (for lack of something better):

df['year'], df['doy'] = df['tstamp'].dt.year, df['tstamp'].dt.dayofyear
res = df.groupby(['sens_id', 'year', 'doy']).apply(modeler)

Since the gap in performance is so huge, the latter is actually faster than the former, despite the extra computations.
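For reference, one way to do that recovery step (a sketch assuming the group keys are plain (year, doy) integers; the helper name is made up for illustration):

import pandas as pd

def date_from_year_doy(year, doy):
    # rebuild the calendar date from a year and day-of-year pair
    return pd.Timestamp(year=year, month=1, day=1) + pd.Timedelta(days=doy - 1)

date_from_year_doy(2016, 320)  # Timestamp('2016-11-15 00:00:00')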

@dragonator4 looks like you could round your tstamp: df['date'] = df['tstamp'].dt.round('d'), and then group by that and the sensor ID.

It'd be quite a bit of work to support date and time, and in many cases (like this) a suitably rounded datetime does the job.

EDIT: mmm, you need to round down though. Pretty sure there's a way to do this. Yeah, df['date'] = df['tstamp'].dt.floor('d')

@TomAugspurger Thanks for that trick. I'll perhaps use floor; round may round to the next day if the time is after noon..?
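A quick check of the difference (a minimal sketch with a single afternoon timestamp, assuming pandas >= 0.18 where the rounding methods exist):

import pandas as pd

s = pd.Series(pd.to_datetime(['2016-11-15 13:00']))
s.dt.round('d')  # 2016-11-16 -- past noon, so round goes up to the next day
s.dt.floor('d')  # 2016-11-15 -- floor always truncates to midnight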

There is also normalize to get only the date part (not sure which would be better performance-wise, normalize or floor; you can test, e.g. with the timing sketch after the example):

In [14]: s = pd.Series(pd.date_range('2012-01-01', freq='8h', periods=10))

In [15]: s
Out[15]: 
0   2012-01-01 00:00:00
1   2012-01-01 08:00:00
2   2012-01-01 16:00:00
3   2012-01-02 00:00:00
4   2012-01-02 08:00:00
5   2012-01-02 16:00:00
6   2012-01-03 00:00:00
7   2012-01-03 08:00:00
8   2012-01-03 16:00:00
9   2012-01-04 00:00:00
dtype: datetime64[ns]

In [17]: s.dt.normalize()
Out[17]: 
0   2012-01-01
1   2012-01-01
2   2012-01-01
3   2012-01-02
4   2012-01-02
5   2012-01-02
6   2012-01-03
7   2012-01-03
8   2012-01-03
9   2012-01-04
dtype: datetime64[ns]
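One way to run that test (a sketch; the series is made larger so the timings are meaningful, and no numbers are quoted since they depend on machine and pandas version):

import pandas as pd

s = pd.Series(pd.date_range('2012-01-01', freq='8h', periods=100000))

%timeit s.dt.normalize()
%timeit s.dt.floor('d')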

I seriously doubt the _grouping_ part is actually slowing you down. For sure it's the .apply, which IS executed in python space.

In any event you can simply do

df.groupby(['sens_id', pd.Grouper(key='date', freq='D')]).apply(modeler)

for proper semantics; or, as @TomAugspurger and @jorisvandenbossche suggest, this is de facto equivalent to rounding.
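A self-contained sketch of that Grouper-based groupby (the data and modeler here are hypothetical placeholders, only meant to show the shape of the call; note that pd.Grouper can bin the raw timestamps by day directly, so no separate date column is needed):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tstamp': pd.date_range('2016-11-15', periods=10000, freq='T'),
    'sens_id': np.random.randint(0, 5, size=10000),
    'data': np.random.rand(10000),
})

def modeler(group):
    # placeholder for the real per-(sensor, day) model fit
    return group['data'].mean()

res = df.groupby(['sens_id', pd.Grouper(key='tstamp', freq='D')]).apply(modeler)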
