Pandas: 0.23.4 changed read_csv parsing for mixed-timezone datetimes

Created on 28 Jan 2019 · 15 comments · Source: pandas-dev/pandas

Previously, a column in a CSV with mixed timezones would (I think) convert each value to UTC and discard the tzinfo.

import pandas as pd
import io

content = """\
Sat, 22 Apr 2017 15:11:58 -0500
Fri, 21 Apr 2017 14:20:57 -0500
Thu, 9 Mar 2017 11:15:30 -0600"""

df = pd.read_csv(io.StringIO(content), parse_dates=True, header=None, names=['day', 'datetime'], index_col='datetime')

On 0.23.4 that's

In [7]: df.index
Out[7]:
DatetimeIndex(['2017-04-22 20:11:58', '2017-04-21 19:20:57',
               '2017-03-09 17:15:30'],
              dtype='datetime64[ns]', name='datetime', freq=None)

On 0.24 that's

In [7]: df.index
Out[7]:
Index([2017-04-22 15:11:58-05:00, 2017-04-21 14:20:57-05:00,
       2017-03-09 11:15:30-06:00],
      dtype='object', name='datetime')

I'm not sure what the expected behavior is here, but I think the old behavior is as good as any.

I haven't verified, but https://github.com/pandas-dev/pandas/pull/22380 seems like a likely candidate for introducing the change.

IO CSV Timezones

All 15 comments

cc @jbrockmendel, @mroeschke, @gfyoung, @swyoon if you have thoughts here.

Appears this was an intentional change as described by the original issue: https://github.com/pandas-dev/pandas/issues/22256#issue-349033392

Also, AFAICT the old coercion-to-UTC behavior was not documented in read_csv(?). In general, I am keen on preserving the distinct UTC offsets instead of coercing to UTC, as it leads to fewer surprises.

I don't see anything in the original issue https://github.com/pandas-dev/pandas/issues/22256#issue-349033392 about mixed timezones.

So to clarify, what's the expected output of

In [8]: pd.read_csv(io.StringIO('a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'), parse_dates=['a']).a

In 0.23.4, that's

Out[4]:
0   1999-12-31 18:00:00
1   1999-12-31 19:00:00
Name: a, dtype: datetime64[ns]

and in 0.24.0 that's

Out[8]:
0    2000-01-01 00:00:00+06:00
1    2000-01-01 00:00:00+05:00
Name: a, dtype: object

If we want the 0.24.0 behavior, do we have an alternative recommendation for "parse this mess into a column of datetimes"? Something like

pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, utc=True))

If that's the recommendation going forward, I can submit a PR updating the release note for 0.24 to note the breaking change, and an example with mixed timezones.

FWIW I see two upsides to not-converting to UTC. First, conversion is lossy; mixed timezones is a pretty weird special case and if it is intentional, we preserve it. Second, I'm pretty sure this is consistent with to_datetime's treatment of strings, which is a plus.

Actually, does date_parser operate row-wise? We probably don't want to recommend that. The alternative would be something like

date_columns = ['a']
df = pd.read_csv(...)
df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

so read it in as strings, and then convert later?
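
That two-step approach can be sketched end to end, using the mixed-offset sample from earlier in the thread:

```python
import io

import pandas as pd

content = "a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00"

# Step 1: read the column in as plain strings (no parse_dates)
df = pd.read_csv(io.StringIO(content))

# Step 2: convert the date columns to a single tz-aware (UTC) dtype afterwards
date_columns = ['a']
df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

print(df['a'].dtype)  # datetime64[ns, UTC]
```

This keeps read_csv itself lossless and makes the coercion to UTC an explicit, visible step.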

mixed timezones is a pretty weird special case and if it is intentional, we preserve it.

I would distinguish mixed timezones in a storage format like CSV from our internal representation. IMO, it's important to support easily ingesting this kind of data.

It does appear to be consistent with to_datetime

In [6]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'])
Out[6]: Index([2000-01-01 00:00:00+06:00, 2000-01-01 00:00:00+05:00], dtype='object')

In [7]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'], utc=True)
Out[7]: DatetimeIndex(['1999-12-31 18:00:00+00:00', '1999-12-31 19:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

Yeah I'd recommend post processing via apply (or to_datetime directly if it's just one column).

The to_datetime example above was recently fixed in 0.24.0 as well. http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets

Thanks. I'll link to that.


I'm a little late to the discussion, but I do agree with the general sentiment here. I think a documentation change will suffice.

Actually, does date_parser operate row-wise

@TomAugspurger : The behavior of date_parser varies depending on how parse_dates is passed in. However, IIUC it does not operate "row-wise" as I believe you are implying.

Thanks @gfyoung, that's the recommendation I went with in #24989.

Sorry, too late to the discussion, but we've just updated to 0.24.2 from 0.23.4... It breaks a lot of our code, which has relied on reading back the 'UTC' version for years.
This is the sanest approach; shouldn't date_parser=lambda x: pd.to_datetime(x, utc=True) be the default, rather than giving back a set of object strings?

Having the fixed offset, which is most likely wrong due to DST transitions, serves no purpose unless you are localising to a known timezone using an IANA timezone name.

This is the sanest approach; shouldn't date_parser=lambda x: pd.to_datetime(x, utc=True) be the default, rather than giving back a set of object strings?

My understanding is that ATM it gives back an array of Timestamp objects, not strings. Is it actually giving an array of strings?

This was changed because the all-utc version drops information. If a user has a mixed-timezone array, the default is to assume it is intentional.

On 0.24 that's

In [7]: df.index
Out[7]:
Index([2017-04-22 15:11:58-05:00, 2017-04-21 14:20:57-05:00,
       2017-03-09 11:15:30-06:00],
      dtype='object', name='datetime')

from @TomAugspurger's example above: the dtype is object, but yeah, the individual members of the index are of type datetime.datetime with different offsets... not Timestamp objects per se... and not strings as I implied earlier... my bad...

But my point is that "mixed timezones", or mixed timezone offsets, are very common for any location with DST transitions... half of the year they are going to be different from the other half, but that is normal... What it was doing before was ok, except that the UTC timezone was not set... My point is that it is a better default behaviour, and read_csv already does it for +00:00 offsets... if I have +00:00 for all the timestamps it slaps the UTC timezone on the index... And that is probably the right default, especially given that it did the same before without slapping a timezone on...
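
That uniform-offset case can be checked directly; with a consistent +00:00 offset on every value, read_csv does hand back a tz-aware index (a minimal sketch, with a made-up two-row CSV):

```python
import io

import pandas as pd

# All rows share the same +00:00 offset, so no mixed-offset fallback kicks in
content = "a\n2000-01-01T00:00:00+00:00\n2000-01-02T00:00:00+00:00"

df = pd.read_csv(io.StringIO(content), parse_dates=['a'], index_col='a')

# The index comes back tz-aware rather than as an object Index
print(df.index.tz)
```

The contrast with the mixed-offset examples above is exactly the asymmetry being complained about here.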

Also, an IANA timezone string is what I think of when we talk about timezones, as it is unambiguous... yet a timezone offset does not give me that information anyway, only how to convert to UTC.

I could have two IANA timezones producing exactly the same offset strings, and hence timestamps, for half the year and be wrong the other half of the year, or have the DST transition fall on different days.

But my point is that "mixed timezones" or mixed timezone offsets is very common for any location with DST transitions

That's true, but coercion to UTC imposes an unnecessary transformation on datetime objects.
In addition, the behavior was not documented. From a design standpoint, we aim to give the end user their CSV data in as unadulterated a form as possible; after all, the function is called read_csv. Nothing in the name implies that we perform transformations on your data.

Now straddling the DST transition point has historically been a tricky spot for datetime indices on our end, so making a unilateral decision on how to address them would have been unusual, to say the least.

So from those standpoints, I stand by our decision to change the behavior as we did for 0.24.0.

I guess my issue is that I no longer get a DatetimeIndex if I have mixed timezone offsets....

on a side note there's an amazing difference in speed...

buf = io.StringIO()
pd.DataFrame(
    index=pd.date_range(
        start='2015-03-10T00:00',
        end='2020-03-12T00:00',
        tz='America/Havana',
        freq='1H'
    )
).to_csv(buf)


%timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], infer_datetime_format=True)
4.78 s ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], date_parser=lambda x: pd.to_datetime(x, utc=True))
750 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so it seems using pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, utc=True)) is beneficial for performance...

This difference is only present for mixed timezone offsets. When the timezone offset is +00:00 everywhere, I didn't observe any difference between the different modes, and it is around 160 ms for the same size of data...
