When parsing a timezone-aware datetime in a csv file with pd.read_csv + parse_dates, it returns naive timestampes converted to UTC, and it was a surprise for me.
Consider we are reading the following data. Let's say its name is pandas_read_csv_bug.csv.
It is a simple timeseries data with timezone (UTC+09:00) specified.
dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
2018-01-04 09:03:00+09:00,23400
2018-01-04 09:04:00+09:00,23400
2018-01-04 09:05:00+09:00,23400
I want to read it with pd.read_csv using parse_dates keyword argument activated.
If working properly, this seems to be the most elegant solution.
import pandas as pd
df = pd.read_csv('pandas_read_csv_bug.csv', parse_dates=['dt'])
However, the result is a data frame df with strange timestamps.
聽 | dt | val
-- | -- | --
0| 2018-01-04 00:01:00 | 23350
1| 2018-01-04 00:02:00 | 23400
2| 2018-01-04 00:03:00 | 23400
3| 2018-01-04 00:04:00 | 23400
4| 2018-01-04 00:05:00 | 23400
My surprise was,
df['dt'].iloc[0].tz is None == TrueMy first impression was that it shouldn't be the best possible behavior.
However, as an UTC offset does not uniquely corresponds to a single timezone, this could be the safest/most reasonable behavior.
In that case, the documentation should mention this behavior.
pd.show_versions()commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.3.1
pip: 9.0.3
setuptools: 38.5.1
Cython: None
numpy: 1.15.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
I agree with your first impression. Patch and PR are welcome!
Can you try this on master? @mroeschke had a PR recently that I think should have fixed this.
@jbrockmendel : Unfortunately, no luck. I can reproduce this on master (475e391e).
On master, if date_parser=pd.to_datetime is specifically passed, then the dates are not automatically converted to UTC:
In [6]: s = StringIO("""dt,val
...: 2018-01-04 09:01:00+09:00,23350
...: 2018-01-04 09:02:00+09:00,23400
...: 2018-01-04 09:03:00+09:00,23400
...: 2018-01-04 09:04:00+09:00,23400
...: 2018-01-04 09:05:00+09:00,23400""")
In [7]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=pd.to_datetime)
In [8]: df
Out[8]:
dt val
0 2018-01-04 09:01:00+09:00 23350
1 2018-01-04 09:02:00+09:00 23400
2 2018-01-04 09:03:00+09:00 23400
3 2018-01-04 09:04:00+09:00 23400
4 2018-01-04 09:05:00+09:00 23400
In [10]: df['dt']
Out[10]:
0 2018-01-04 09:01:00+09:00
1 2018-01-04 09:02:00+09:00
2 2018-01-04 09:03:00+09:00
3 2018-01-04 09:04:00+09:00
4 2018-01-04 09:05:00+09:00
Name: dt, dtype: datetime64[ns, pytz.FixedOffset(540)]
I think we use to_datetime under the hood of read_csv to parse dates, so there may be a keyword argument or extra function that is coercing to UTC without timezone.
On that note, what should be the expected behavior?
In [7]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=pd.to_datetime)
# Keep the offsets (as FixedOffset)
In [8]: df
Out[8]:
dt val
0 2018-01-04 09:01:00+09:00 23350
1 2018-01-04 09:02:00+09:00 23400
2 2018-01-04 09:03:00+09:00 23400
3 2018-01-04 09:04:00+09:00 23400
4 2018-01-04 09:05:00+09:00 23400
In [14]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=partial(pd.to_datetime, utc=True))
# More in line with the previous behavior but correctly keeps the UTC timezone
In [15]: df
Out[15]:
dt val
0 2018-01-04 00:01:00+00:00 23350
1 2018-01-04 00:02:00+00:00 23400
2 2018-01-04 00:03:00+00:00 23400
3 2018-01-04 00:04:00+00:00 23400
4 2018-01-04 00:05:00+00:00 23400
This change fixed this issue specifically but not sure how it will affect other tests.
diff --git a/pandas/io/parsers.py b/pandas/io/parsers.py
index 4b3fa08e5..86f3b6812 100755
--- a/pandas/io/parsers.py
+++ b/pandas/io/parsers.py
@@ -3033,7 +3033,7 @@ def _make_date_converter(date_parser=None, dayfirst=False,
return tools.to_datetime(
ensure_object(strs),
utc=None,
- box=False,
+ box=True,
dayfirst=dayfirst,
errors='ignore',
infer_datetime_format=infer_datetime_format
@mroeschke Thanks for pointing box out. Well, this breaks quite a lot of unit tests.
My first attempt was to keep box=False, and update pandas/core/tools/datetime.py:_convert_listlike_datetimes.
However, I realized that we can't fix this issue with box=False, because what is returned is a Numpy array of datetime64, and it cannot contain the timezone information.
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html#changes-with-numpy-1-11
So, I will try to fix the errors caused by setting box=True.
@gfyoung There is a Pandas sprint at PYCON KR in Seoul, the Republic of Korea on Aug 15th, and I am participating. (It is organized by @scari )
I will continue to work on this issue at the sprint.
@swyoon I look forward to seeing you! ;)
Most helpful comment
@mroeschke Thanks for pointing
boxout. Well, this breaks quite a lot of unit tests.My first attempt was to keep
box=False, and updatepandas/core/tools/datetime.py:_convert_listlike_datetimes.However, I realized that we can't fix this issue with
box=False, because what is returned is a Numpy array of datetime64, and it cannot contain the timezone information.https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html#changes-with-numpy-1-11
So, I will try to fix the errors caused by setting
box=True.@gfyoung There is a Pandas sprint at PYCON KR in Seoul, the Republic of Korea on Aug 15th, and I am participating. (It is organized by @scari )
I will continue to work on this issue at the sprint.