Pandas: BUG: resample seems to convert hours to 00:00

Created on 17 Jun 2020  Â·  8Comments  Â·  Source: pandas-dev/pandas

  • [x] I have checked that this issue has not already been reported. (As far as I can see by using the search)

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here
import datetime as dt
import pandas as pd

start_time = dt.datetime(year=2018, month=6, day=7, hour=18)
end_time = dt.datetime(year=2018, month=6, day=27, hour=18)
time_index_df = pd.date_range(
    start_time,
    end_time,
    freq="12H",
    name="datetime",
).to_frame(index=False)

time_index_df["test"] = 1
time_index_df["name"] = "Name"

time_index_df = time_index_df.resample("2d", convention="end", on="datetime").mean().reset_index()

print(time_index_df["datetime"].values)

time_index_df.to_csv(f"test_resample_{pd.__version__}")

Problem description

On pandas 0.23 the behaviour of resample will keep the correct datetime (18:00 in this case) after resample.
But starting from 0.24, after resample, the datetime is now converted to (00:00).
[this should explain why the current behaviour is a problem and why the expected output is a better solution]

Expected Output

On. 0.23
['2018-06-07T18:00:00.000000000' '2018-06-09T18:00:00.000000000'
'2018-06-11T18:00:00.000000000' '2018-06-13T18:00:00.000000000'
'2018-06-15T18:00:00.000000000' '2018-06-17T18:00:00.000000000'
'2018-06-19T18:00:00.000000000' '2018-06-21T18:00:00.000000000'
'2018-06-23T18:00:00.000000000' '2018-06-25T18:00:00.000000000'
'2018-06-27T18:00:00.000000000']

But Starting from 0.24 it is giving me
['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

Output of pd.show_versions()

0.23

INSTALLED VERSIONS

commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.23.0
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

0.24

INSTALLED VERSIONS

commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.24.2
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None

Bug Resample

Most helpful comment

@liverpool1026 The behavior has been to follow start_day since 2012 as you can see in this commit: https://github.com/pandas-dev/pandas/commit/31ca168faba43d761f5b53326b18250804ccd6ef. The behavior you are facing is actually a bug that has been fixed in v0.2.4 by #24159. Basically your current code is relying on a bug :confused: ...

The behavior of start_day of origin has been preserved as the default value in #31809 to avoid creating breaking changes.

The proof that your code is relying on a bug fixed in v0.2.4:

import datetime as dt
import pandas as pd

def print_resample(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1
    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()
    print(time_index_df)


print_resample(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):

EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 00:00:00     1
1 2018-06-07 12:00:00    12
2 2018-06-08 00:00:00    12
3 2018-06-08 12:00:00    12
4 2018-06-09 00:00:00    12
5 2018-06-09 12:00:00    12
6 2018-06-10 00:00:00    12

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 12:00:00    11
1 2018-06-08 00:00:00    12
2 2018-06-08 12:00:00    12
3 2018-06-09 00:00:00    12
4 2018-06-09 12:00:00    12
5 2018-06-10 00:00:00    12
  • Example 1 is the buggy one since it uses a frequency bigger than 1D.
  • Example 3 is showing you that the behavior is not to start at midnight of the day of the first timestamp, but we are using the start of the day to align the bins. The argument origin of #31809 is just giving more choice to the user that want to specify on how to align the bins of the resampled data.

Now, that being said... What should you do to align from the start of your timeseries with those constraint before the version 1.1.0? Well first, I would advise to wait a few months the release of 1.1.0... But you could hack a bit around by converting this temporally into Timedeltas and it should work:

import datetime as dt
import pandas as pd

def print_resample_simulate_origin_start(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1

    # hack: transform datetime into timestamps to resample on start of the timeseries
    time_index_df["datetime"] -= pd.Timestamp(0)

    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()

    # hack: transform datetime back into timestamps
    time_index_df["datetime"] = pd.to_datetime(time_index_df["datetime"])
    print(time_index_df)


print_resample_simulate_origin_start(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample_simulate_origin_start(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample_simulate_origin_start(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):


EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25
2 2018-06-11 11:00:00     0

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 11:00:00    12
1 2018-06-07 23:00:00    12
2 2018-06-08 11:00:00    12
3 2018-06-08 23:00:00    12
4 2018-06-09 11:00:00    12
5 2018-06-09 23:00:00    12
6 2018-06-10 11:00:00     1

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 13:00:00    12
1 2018-06-08 01:00:00    12
2 2018-06-08 13:00:00    12
3 2018-06-09 01:00:00    12
4 2018-06-09 13:00:00    12
5 2018-06-10 01:00:00    11

I know this solution is not ideal... But again, this has been fixed in #31809 and it will be in the upcoming 1.1.0. I hope I have answered to your questions/issues @liverpool1026 and @dsandeep0138.

All 8 comments

you need to try in a much newer version
1.04 and also on master where resample has changed a bit

Can confirm the same behavior on 1.04 and master

1.04

['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

master

['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

INSTALLED VERSIONS

commit : 5fdd6f50ac5ac2de939ef5b849e4c41e27e623a3
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
Version : #53~18.04.1-Ubuntu SMP Thu Jun 4 14:58:26 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.1.0.dev0+1887.g5fdd6f50a
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.5.8
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.2.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 0.9.4
xlwt : None
numba : None

This is because the default value of "origin" in resample is "start_day". Please see below:
https://github.com/pandas-dev/pandas/blob/4a267c693ae6609664f117058d33e19df35878d5/pandas/core/generic.py#L7754-L7768

The following are the definitions for the 'start' and 'start_day'. Since the default is 'start_day', the first day at midnight is considered. Hence hours you have mentioned is disregarded.

‘start’: origin is the first value of the timeseries
‘start_day’: origin is the first day at midnight of the timeseries

Looks like there was some issue due to which this is changed, and is described here: https://github.com/pandas-dev/pandas/pull/31809

The default of "start" is more accurate in my opinion. @jreback @hasB4K @mroeschke Can you please comment further on this?

@dsandeep0138 pls have a read on the issue and the number of issues this patches

this change is much more consistent with respect to the frequency rather than just happens to work

@jreback Thanks for the comment. Yes, I understand the change is awesome, and deals with the inconsistencies in frequencies and fixes so many issues :)

Can you please comment on this particular bug if it is expected then? Should we document it, if it is so to avoid confusion? Thanks.

@dsandeep0138 I'll reply a bit later to explain everything, no worries :wink:

@deepandas11 That makes sense on master, by setting origin="start" fixes the issue.

However, it will still be an issue in 0.24 and 1.04 given that the origin kwarg was not introduced then.

If I am using 0.24, what's the way to preserve the hours?

@liverpool1026 The behavior has been to follow start_day since 2012 as you can see in this commit: https://github.com/pandas-dev/pandas/commit/31ca168faba43d761f5b53326b18250804ccd6ef. The behavior you are facing is actually a bug that has been fixed in v0.2.4 by #24159. Basically your current code is relying on a bug :confused: ...

The behavior of start_day of origin has been preserved as the default value in #31809 to avoid creating breaking changes.

The proof that your code is relying on a bug fixed in v0.2.4:

import datetime as dt
import pandas as pd

def print_resample(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1
    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()
    print(time_index_df)


print_resample(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):

EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 00:00:00     1
1 2018-06-07 12:00:00    12
2 2018-06-08 00:00:00    12
3 2018-06-08 12:00:00    12
4 2018-06-09 00:00:00    12
5 2018-06-09 12:00:00    12
6 2018-06-10 00:00:00    12

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 12:00:00    11
1 2018-06-08 00:00:00    12
2 2018-06-08 12:00:00    12
3 2018-06-09 00:00:00    12
4 2018-06-09 12:00:00    12
5 2018-06-10 00:00:00    12
  • Example 1 is the buggy one since it uses a frequency bigger than 1D.
  • Example 3 is showing you that the behavior is not to start at midnight of the day of the first timestamp, but we are using the start of the day to align the bins. The argument origin of #31809 is just giving more choice to the user that want to specify on how to align the bins of the resampled data.

Now, that being said... What should you do to align from the start of your timeseries with those constraint before the version 1.1.0? Well first, I would advise to wait a few months the release of 1.1.0... But you could hack a bit around by converting this temporally into Timedeltas and it should work:

import datetime as dt
import pandas as pd

def print_resample_simulate_origin_start(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1

    # hack: transform datetime into timestamps to resample on start of the timeseries
    time_index_df["datetime"] -= pd.Timestamp(0)

    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()

    # hack: transform datetime back into timestamps
    time_index_df["datetime"] = pd.to_datetime(time_index_df["datetime"])
    print(time_index_df)


print_resample_simulate_origin_start(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample_simulate_origin_start(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample_simulate_origin_start(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):


EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25
2 2018-06-11 11:00:00     0

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 11:00:00    12
1 2018-06-07 23:00:00    12
2 2018-06-08 11:00:00    12
3 2018-06-08 23:00:00    12
4 2018-06-09 11:00:00    12
5 2018-06-09 23:00:00    12
6 2018-06-10 11:00:00     1

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 13:00:00    12
1 2018-06-08 01:00:00    12
2 2018-06-08 13:00:00    12
3 2018-06-09 01:00:00    12
4 2018-06-09 13:00:00    12
5 2018-06-10 01:00:00    11

I know this solution is not ideal... But again, this has been fixed in #31809 and it will be in the upcoming 1.1.0. I hope I have answered to your questions/issues @liverpool1026 and @dsandeep0138.

Was this page helpful?
0 / 5 - 0 ratings