Pandas: tricky timestamp conversion

Created on 6 Mar 2019 · 20 comments · Source: pandas-dev/pandas

Hello there, it's me the bug hunter again :)

I have this massive 200-million-row dataset, and I encountered some very annoying behavior. I wonder if this is a bug.

I load my csv using

mylog = pd.read_csv('/mydata.csv',
                    names = ['mydatetime',  'var2', 'var3', 'var4'],
                    dtype = {'mydatetime' : str},
                    skiprows = 1)

and the datetime column really looks like regular timestamps (tz-aware)

mylog.mydatetime.head()
Out[22]: 
0    2019-03-03T20:58:38.000-0500
1    2019-03-03T20:58:38.000-0500
2    2019-03-03T20:58:38.000-0500
3    2019-03-03T20:58:38.000-0500
4    2019-03-03T20:58:38.000-0500
Name: mydatetime, dtype: object

Now, I take extra care in converting these strings into proper timestamps:

mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'] ,errors = 'coerce', format = '%Y-%m-%dT%H:%M:%S.%f%z', infer_datetime_format = True, cache = True)

That takes a looong time to process, but seems OK. The output is

mylog.mydatetime.head()
Out[23]: 
0    2019-03-03 20:58:38-05:00
1    2019-03-03 20:58:38-05:00
2    2019-03-03 20:58:38-05:00
3    2019-03-03 20:58:38-05:00
4    2019-03-03 20:58:38-05:00
Name: mydatetime, dtype: object

What is puzzling is that so far I thought I had full control of my dtypes. However, running the simple

mylog['myday'] = pd.to_datetime(mylog['mydatetime'].dt.date, errors = 'coerce')

  File "pandas/_libs/tslib.pyx", line 537, in pandas._libs.tslib.array_to_datetime

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

The only way I was able to go past this error was by running

mylog['myday'] = pd.to_datetime(mylog['mydatetime'].apply(lambda x: x.date()))

Is this a bug? Before upgrading to 0.24.1 I was not getting the tz error above. What do you think? I can't share the data but I am happy to try some things to help you out!

Thanks!
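For context, a minimal sketch of how this error can arise. This is a hypothetical two-row sample, assuming the column holds tz-aware datetimes with more than one UTC offset (which is what later turns out to be the case in this thread):

from datetime import datetime, timedelta, timezone

import pandas as pd

# Two tz-aware datetimes with different fixed offsets (EST and EDT),
# mimicking the -0500/-0400 mix found later in this thread.
est = timezone(timedelta(hours=-5))
edt = timezone(timedelta(hours=-4))
s = pd.Series([datetime(2019, 3, 3, 20, 58, 38, tzinfo=est),
               datetime(2019, 3, 31, 20, 58, 38, tzinfo=edt)])

try:
    pd.to_datetime(s)  # mixed offsets, no utc=True
except ValueError as err:
    print(err)  # on 0.24-era pandas this is the ValueError quoted above

# The .date() workaround passes because .date() drops the tzinfo along with
# the time of day, so to_datetime only ever sees naive dates.
print(pd.to_datetime(s.apply(lambda x: x.date())).dtype)  # datetime64[ns]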

Needs Info Timeseries

Most helpful comment

When I addressed this timezone parsing, my rationale was that if %z or %Z were passed, the user would want to preserve these timezones, so this error was intentional.

For your use case, if you leave out the format argument and keep utc=True, you should get your dates in UTC with datetime64[ns, UTC] dtype

All 20 comments

It doesn't look like your first usage of to_datetime is actually casting to datetime. A minimal code sample to reproduce would, as always, be very beneficial.

FYI you shouldn't need to pass format and infer_datetime_format together

@WillAyd thanks! Do you actually mean I should NOT use format and infer_datetime_format together? What might happen is that there are some malformed lines in the data. However, using errors = 'coerce' should protect me from that, so I am still a bit puzzled...

Documentation points to infer_datetime_format only being valid if format is None, so it should essentially be a no-op here. The errors argument should just be setting invalid input to NaT, which should still give you the desired dtype instead of the object you are getting.

As mentioned in any case a minimally reproducible example would be very helpful
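As a sketch of what errors='coerce' is expected to do (hypothetical malformed row; format omitted and utc=True used so the result has a uniform dtype):

import pandas as pd

# One well-formed string plus one malformed one.
s = pd.Series(['2019-03-03T20:58:38.000-0500', 'not-a-date'])

out = pd.to_datetime(s, errors='coerce', utc=True)
print(out.iloc[1])  # NaT -- the bad row is coerced
print(out.dtype)    # datetime64[ns, UTC] -- the dtype is still preserved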

@WillAyd yes but the issue is that obviously I cannot share the large data. Is there any pandas function I can try in order to understand what's going on? Can we identify the possibly bogus timestamps?

thanks!!

Tough to say. Maybe don't coerce errors and see where it raises; that could indicate where the issue arises.
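A complementary approach, if coercion is kept: parse with errors='coerce', then look at the raw strings wherever NaT came back. A sketch, with a hypothetical sample standing in for mylog['mydatetime']:

import pandas as pd

raw = pd.Series(['2019-03-03T20:58:38.000-0500',
                 'garbled row',
                 '2019-03-03T20:58:39.000-0500'])

parsed = pd.to_datetime(raw, errors='coerce', utc=True)
print(raw[parsed.isna()])  # prints only the rows that failed to parse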

We have some issues where a combination of errors and infer_datetime_format and format can return odd results. https://github.com/pandas-dev/pandas/issues/25143 https://github.com/pandas-dev/pandas/issues/25512

Nonetheless, @WillAyd's suggestion is a good way to see which date is not converting correctly, and subsequently why that combination of arguments is returning that result.

Thanks guys, trying that shortly and reporting back to base ASAP :)

@mroeschke the problem is that I don't see an easy fix to #25143. Maybe just getting rid of infer_datetime_format?

infer_datetime_format would probably be a tough sell to remove as it can provide a performance boost in parsing if all the dates conform to that format.

I imagine the error handling around this argument is not entirely robust.

OK @WillAyd @mroeschke this is getting even weirder. I reloaded my data, this time cleaning the (character) timestamps (I noticed some extra spaces in them)

mylog['mydatetime'] = mylog['mydatetime'].str.strip().str.replace('"', '')

and I was able to run pd.to_datetime() without the errors argument! Unfortunately, the output is a mere object.

mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'] , format = '%Y-%m-%dT%H:%M:%S.%f%z',cache = True)

mylog.mydatetime.head()
Out[17]: 
0    2019-03-03 20:58:38-05:00
1    2019-03-03 20:58:38-05:00
2    2019-03-03 20:58:38-05:00
3    2019-03-03 20:58:38-05:00
4    2019-03-03 20:58:38-05:00
Name: mydatetime, dtype: object

and using

mylog['myday'] = pd.to_datetime(mylog['mydatetime'].dt.date)

will still return

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

This is very annoying... is there a workaround to make sure the processing is correct? At the end of the day, this specific ISO format seems to be causing a few errors at the moment...

The puzzling part is that it is not raising any errors, yet the output is not datetime.

So I think object dtype might actually be the expected behavior here if mylog['mydatetime'] contained multiple UTC offsets and the parsing returned multiple fixed timezones (i.e. the datetime64[ns] dtype cannot represent more than one of these)

How many distinct tz offsets are there in mylog['mydatetime']?

@mroeschke HA! That is an interesting idea. Normally I should only have one tz offset, but with 200M rows, who knows. How can I tabulate the offsets to check that your theory is correct?

Might be a little slow but mylog['mydatetime'].apply(lambda x: x.tzinfo).unique()
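If counts per offset are also useful, value_counts does the tabulation. A self-contained sketch, with a hypothetical sample standing in for mylog['mydatetime']:

from datetime import datetime, timedelta, timezone

import pandas as pd

est, edt = timezone(timedelta(hours=-5)), timezone(timedelta(hours=-4))
s = pd.Series([datetime(2019, 3, 3, 20, tzinfo=est),
               datetime(2019, 3, 31, 20, tzinfo=edt),
               datetime(2019, 3, 31, 21, tzinfo=edt)])

# Count how many rows carry each distinct offset.
print(s.apply(lambda x: str(x.tzinfo)).value_counts())
# UTC-04:00    2
# UTC-05:00    1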

Hmm... I get

mylog['mydatetime'].apply(lambda x: x.tzinfo).unique()
Out[20]: array([pytz.FixedOffset(-300), pytz.FixedOffset(-240)], dtype=object)

does that mean there is just one timezone offset (as expected)?

This means that the object dtype is expected. Since your string data contained more than one timezone offset, it's not possible to cast this data to one datetime64[ns, tz] dtype since there are multiple tzs in your data.

See https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets
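A condensed sketch of the behavior described in that whatsnew entry, using a hypothetical two-row sample with mixed offsets:

import pandas as pd

s = pd.Series(['2019-03-03T20:58:38.000-0500',
               '2019-03-31T20:58:38.000-0400'])

# Without utc=True (on 0.24-era pandas) each row keeps its own fixed offset
# and the only common container is object dtype, so .dt and other
# datetime64 operations are lost; newer pandas raises here instead.
mixed = pd.to_datetime(s)
print(mixed.dtype)  # object

# With utc=True everything is converted to a single timezone and the Series
# gets a proper datetime64[ns, UTC] dtype.
utc = pd.to_datetime(s, utc=True)
print(utc.dtype)  # datetime64[ns, UTC]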

@mroeschke we're close but it's not over yet... Looking at the link you sent, I was hopeful that utc=True would be the magic solution, but unfortunately I now get another error... I think that was supposed to work according to the docs. What do you think?

Below I reloaded the data so they are strings again.

mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'] , format = '%Y-%m-%dT%H:%M:%S.%f%z',\
     cache = True, utc = True)
ValueError: Cannot pass a tz argument when parsing strings with timezone information.

Thanks!

When I addressed this timezone parsing, my rationale was that if %z or %Z were passed, the user would want to preserve these timezones, so this error was intentional.

For your use case, if you leave out the format argument and keep utc=True, you should get your dates in UTC with datetime64[ns, UTC] dtype
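Applied to the strings in this thread, that suggestion would look something like the following sketch (assuming the column has been reloaded as raw strings):

import pandas as pd

s = pd.Series(['2019-03-03T20:58:38.000-0500',
               '2019-03-31T20:58:38.000-0400'])

# No explicit format, utc=True: both offsets are normalized to UTC and the
# result has the uniform datetime64[ns, UTC] dtype, so .dt works again.
converted = pd.to_datetime(s, utc=True, errors='coerce')
days = pd.to_datetime(converted.dt.date)  # no more tz-aware ValueError
print(converted.dtype, days.dtype)  # datetime64[ns, UTC] datetime64[ns]

One caveat: after utc=True the dates are UTC dates, so rows near local midnight can land on a different calendar day than they would in their original offset.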

I think the behavior in this issue is expected and is more of a usage question. Closing.

Even though the issue has been closed, I just want to point out that if the tz-aware format bothers you and gives you the above error, you can simply convert it to naive form using Series.dt.tz_localize(None), which removes the offset tail and keeps the datetime part the same. Then everything is good to go.
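A sketch of that suggestion, using the .dt accessor spelling and assuming a tz-aware series:

import pandas as pd

# Start from a tz-aware series (UTC here, for a uniform dtype).
s = pd.to_datetime(pd.Series(['2019-03-03T20:58:38.000-0500']), utc=True)
print(s.dtype)  # datetime64[ns, UTC]

# tz_localize(None) strips the offset and keeps the wall-clock time as-is.
naive = s.dt.tz_localize(None)
print(naive.dtype)    # datetime64[ns]
print(naive.iloc[0])  # 2019-03-04 01:58:38 (the UTC wall time, offset dropped)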
