[x] I have checked that this issue has not already been reported. (As far as I can see by using the search)
[x] I have confirmed this bug exists on the latest version of pandas.
[x] (optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
# Your code here
import datetime as dt
import pandas as pd
start_time = dt.datetime(year=2018, month=6, day=7, hour=18)
end_time = dt.datetime(year=2018, month=6, day=27, hour=18)
time_index_df = pd.date_range(
start_time,
end_time,
freq="12H",
name="datetime",
).to_frame(index=False)
time_index_df["test"] = 1
time_index_df["name"] = "Name"
time_index_df = time_index_df.resample("2d", convention="end", on="datetime").mean().reset_index()
print(time_index_df["datetime"].values)
time_index_df.to_csv(f"test_resample_{pd.__version__}")
On pandas 0.23 the behaviour of resample will keep the correct datetime (18:00 in this case) after resample.
But starting from 0.24, after resample, the datetime is now converted to (00:00).
[this should explain why the current behaviour is a problem and why the expected output is a better solution]
On. 0.23
['2018-06-07T18:00:00.000000000' '2018-06-09T18:00:00.000000000'
'2018-06-11T18:00:00.000000000' '2018-06-13T18:00:00.000000000'
'2018-06-15T18:00:00.000000000' '2018-06-17T18:00:00.000000000'
'2018-06-19T18:00:00.000000000' '2018-06-21T18:00:00.000000000'
'2018-06-23T18:00:00.000000000' '2018-06-25T18:00:00.000000000'
'2018-06-27T18:00:00.000000000']
But Starting from 0.24 it is giving me
['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']
pd.show_versions()0.23
commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8
pandas: 0.23.0
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None
0.24
commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8
pandas: 0.24.2
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None
you need to try in a much newer version
1.04 and also on master where resample has changed a bit
Can confirm the same behavior on 1.04 and master
['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']
commit : 5fdd6f50ac5ac2de939ef5b849e4c41e27e623a3
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
Version : #53~18.04.1-Ubuntu SMP Thu Jun 4 14:58:26 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 1.1.0.dev0+1887.g5fdd6f50a
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.5.8
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.2.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 0.9.4
xlwt : None
numba : None
This is because the default value of "origin" in resample is "start_day". Please see below:
https://github.com/pandas-dev/pandas/blob/4a267c693ae6609664f117058d33e19df35878d5/pandas/core/generic.py#L7754-L7768
The following are the definitions for the 'start' and 'start_day'. Since the default is 'start_day', the first day at midnight is considered. Hence hours you have mentioned is disregarded.
‘start’: origin is the first value of the timeseries
‘start_day’: origin is the first day at midnight of the timeseries
Looks like there was some issue due to which this is changed, and is described here: https://github.com/pandas-dev/pandas/pull/31809
The default of "start" is more accurate in my opinion. @jreback @hasB4K @mroeschke Can you please comment further on this?
@dsandeep0138 pls have a read on the issue and the number of issues this patches
this change is much more consistent with respect to the frequency rather than just happens to work
@jreback Thanks for the comment. Yes, I understand the change is awesome, and deals with the inconsistencies in frequencies and fixes so many issues :)
Can you please comment on this particular bug if it is expected then? Should we document it, if it is so to avoid confusion? Thanks.
@dsandeep0138 I'll reply a bit later to explain everything, no worries :wink:
@deepandas11 That makes sense on master, by setting origin="start" fixes the issue.
However, it will still be an issue in 0.24 and 1.04 given that the origin kwarg was not introduced then.
If I am using 0.24, what's the way to preserve the hours?
@liverpool1026 The behavior has been to follow start_day since 2012 as you can see in this commit: https://github.com/pandas-dev/pandas/commit/31ca168faba43d761f5b53326b18250804ccd6ef. The behavior you are facing is actually a bug that has been fixed in v0.2.4 by #24159. Basically your current code is relying on a bug :confused: ...
The behavior of start_day of origin has been preserved as the default value in #31809 to avoid creating breaking changes.
The proof that your code is relying on a bug fixed in v0.2.4:
import datetime as dt
import pandas as pd
def print_resample(example_nb, start, end, resample_freq):
print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
time_index_df["test"] = 1
time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()
print(time_index_df)
print_resample(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")
Outputs (v0.2.3):
EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
datetime test
0 2018-06-07 11:00:00 48
1 2018-06-09 11:00:00 25
EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
datetime test
0 2018-06-07 00:00:00 1
1 2018-06-07 12:00:00 12
2 2018-06-08 00:00:00 12
3 2018-06-08 12:00:00 12
4 2018-06-09 00:00:00 12
5 2018-06-09 12:00:00 12
6 2018-06-10 00:00:00 12
EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
datetime test
0 2018-06-07 12:00:00 11
1 2018-06-08 00:00:00 12
2 2018-06-08 12:00:00 12
3 2018-06-09 00:00:00 12
4 2018-06-09 12:00:00 12
5 2018-06-10 00:00:00 12
origin of #31809 is just giving more choice to the user that want to specify on how to align the bins of the resampled data.Now, that being said... What should you do to align from the start of your timeseries with those constraint before the version 1.1.0? Well first, I would advise to wait a few months the release of 1.1.0... But you could hack a bit around by converting this temporally into Timedeltas and it should work:
import datetime as dt
import pandas as pd
def print_resample_simulate_origin_start(example_nb, start, end, resample_freq):
print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
time_index_df["test"] = 1
# hack: transform datetime into timestamps to resample on start of the timeseries
time_index_df["datetime"] -= pd.Timestamp(0)
time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()
# hack: transform datetime back into timestamps
time_index_df["datetime"] = pd.to_datetime(time_index_df["datetime"])
print(time_index_df)
print_resample_simulate_origin_start(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample_simulate_origin_start(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample_simulate_origin_start(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")
Outputs (v0.2.3):
EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
datetime test
0 2018-06-07 11:00:00 48
1 2018-06-09 11:00:00 25
2 2018-06-11 11:00:00 0
EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
datetime test
0 2018-06-07 11:00:00 12
1 2018-06-07 23:00:00 12
2 2018-06-08 11:00:00 12
3 2018-06-08 23:00:00 12
4 2018-06-09 11:00:00 12
5 2018-06-09 23:00:00 12
6 2018-06-10 11:00:00 1
EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
datetime test
0 2018-06-07 13:00:00 12
1 2018-06-08 01:00:00 12
2 2018-06-08 13:00:00 12
3 2018-06-09 01:00:00 12
4 2018-06-09 13:00:00 12
5 2018-06-10 01:00:00 11
I know this solution is not ideal... But again, this has been fixed in #31809 and it will be in the upcoming 1.1.0. I hope I have answered to your questions/issues @liverpool1026 and @dsandeep0138.
Most helpful comment
@liverpool1026 The behavior has been to follow
start_daysince 2012 as you can see in this commit: https://github.com/pandas-dev/pandas/commit/31ca168faba43d761f5b53326b18250804ccd6ef. The behavior you are facing is actually a bug that has been fixed in v0.2.4 by #24159. Basically your current code is relying on a bug :confused: ...The behavior of
start_dayoforiginhas been preserved as the default value in #31809 to avoid creating breaking changes.The proof that your code is relying on a bug fixed in v0.2.4:
Outputs (v0.2.3):
originof #31809 is just giving more choice to the user that want to specify on how to align the bins of the resampled data.Now, that being said... What should you do to align from the start of your timeseries with those constraint before the version 1.1.0? Well first, I would advise to wait a few months the release of 1.1.0... But you could hack a bit around by converting this temporally into Timedeltas and it should work:
Outputs (v0.2.3):
I know this solution is not ideal... But again, this has been fixed in #31809 and it will be in the upcoming 1.1.0. I hope I have answered to your questions/issues @liverpool1026 and @dsandeep0138.