Pandas: BUG or DOC: pd.read_csv with parse_dates does not recognize timezone

Created on 9 Aug 2018 · 7Comments · Source: pandas-dev/pandas

When parsing a timezone-aware datetime in a csv file with pd.read_csv + parse_dates, it returns naive timestampes converted to UTC, and it was a surprise for me.

Example

Consider we are reading the following data. Let's say its name is pandas_read_csv_bug.csv.
It is a simple timeseries data with timezone (UTC+09:00) specified.

dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
2018-01-04 09:03:00+09:00,23400
2018-01-04 09:04:00+09:00,23400
2018-01-04 09:05:00+09:00,23400

I want to read it with pd.read_csv using parse_dates keyword argument activated.
If working properly, this seems to be the most elegant solution.

import pandas as pd
df = pd.read_csv('pandas_read_csv_bug.csv', parse_dates=['dt'])

However, the result is a data frame df with strange timestamps.

| dt | val
-- | -- | --
0| 2018-01-04 00:01:00 | 23350
1| 2018-01-04 00:02:00 | 23400
2| 2018-01-04 00:03:00 | 23400
3| 2018-01-04 00:04:00 | 23400
4| 2018-01-04 00:05:00 | 23400

Problem description

My surprise was,

The parsed datetimes are timezone-naive. df['dt'].iloc[0].tz is None == True
The timestampe is automatically converted to UTC.

My first impression was that it shouldn't be the best possible behavior.
However, as an UTC offset does not uniquely corresponds to a single timezone, this could be the safest/most reasonable behavior.
In that case, the documentation should mention this behavior.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.1
pip: 9.0.3
setuptools: 38.5.1
Cython: None
numpy: 1.15.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug IO CSV Timeseries

Source

swyoon

👍1

Most helpful comment

@mroeschke Thanks for pointing box out. Well, this breaks quite a lot of unit tests.
My first attempt was to keep box=False, and update pandas/core/tools/datetime.py:_convert_listlike_datetimes.
However, I realized that we can't fix this issue with box=False, because what is returned is a Numpy array of datetime64, and it cannot contain the timezone information.
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html#changes-with-numpy-1-11
So, I will try to fix the errors caused by setting box=True.
@gfyoung There is a Pandas sprint at PYCON KR in Seoul, the Republic of Korea on Aug 15th, and I am participating. (It is organized by @scari )
I will continue to work on this issue at the sprint.

swyoon on 12 Aug 2018

👍2

All 7 comments

I agree with your first impression. Patch and PR are welcome!

gfyoung on 9 Aug 2018

Can you try this on master? @mroeschke had a PR recently that I think should have fixed this.

jbrockmendel on 9 Aug 2018

@jbrockmendel : Unfortunately, no luck. I can reproduce this on master (475e391e).

gfyoung on 10 Aug 2018

On master, if date_parser=pd.to_datetime is specifically passed, then the dates are not automatically converted to UTC:

In [6]: s = StringIO("""dt,val
   ...: 2018-01-04 09:01:00+09:00,23350
   ...: 2018-01-04 09:02:00+09:00,23400
   ...: 2018-01-04 09:03:00+09:00,23400
   ...: 2018-01-04 09:04:00+09:00,23400
   ...: 2018-01-04 09:05:00+09:00,23400""")

In [7]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=pd.to_datetime)

In [8]: df
Out[8]:
                         dt    val
0 2018-01-04 09:01:00+09:00  23350
1 2018-01-04 09:02:00+09:00  23400
2 2018-01-04 09:03:00+09:00  23400
3 2018-01-04 09:04:00+09:00  23400
4 2018-01-04 09:05:00+09:00  23400

In [10]: df['dt']
Out[10]:
0   2018-01-04 09:01:00+09:00
1   2018-01-04 09:02:00+09:00
2   2018-01-04 09:03:00+09:00
3   2018-01-04 09:04:00+09:00
4   2018-01-04 09:05:00+09:00
Name: dt, dtype: datetime64[ns, pytz.FixedOffset(540)]

I think we use to_datetime under the hood of read_csv to parse dates, so there may be a keyword argument or extra function that is coercing to UTC without timezone.

On that note, what should be the expected behavior?

In [7]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=pd.to_datetime)

# Keep the offsets (as FixedOffset)
In [8]: df
Out[8]:
                         dt    val
0 2018-01-04 09:01:00+09:00  23350
1 2018-01-04 09:02:00+09:00  23400
2 2018-01-04 09:03:00+09:00  23400
3 2018-01-04 09:04:00+09:00  23400
4 2018-01-04 09:05:00+09:00  23400

In [14]: df = pd.read_csv(s, parse_dates=['dt'], date_parser=partial(pd.to_datetime, utc=True))

# More in line with the previous behavior but correctly keeps the UTC timezone
In [15]: df
Out[15]:
                         dt    val
0 2018-01-04 00:01:00+00:00  23350
1 2018-01-04 00:02:00+00:00  23400
2 2018-01-04 00:03:00+00:00  23400
3 2018-01-04 00:04:00+00:00  23400
4 2018-01-04 00:05:00+00:00  23400

mroeschke on 10 Aug 2018

This change fixed this issue specifically but not sure how it will affect other tests.

diff --git a/pandas/io/parsers.py b/pandas/io/parsers.py
index 4b3fa08e5..86f3b6812 100755
--- a/pandas/io/parsers.py
+++ b/pandas/io/parsers.py
@@ -3033,7 +3033,7 @@ def _make_date_converter(date_parser=None, dayfirst=False,
                 return tools.to_datetime(
                     ensure_object(strs),
                     utc=None,
-                    box=False,
+                    box=True,
                     dayfirst=dayfirst,
                     errors='ignore',
                     infer_datetime_format=infer_datetime_format

mroeschke on 11 Aug 2018