Pandas: to_datetime function not working with %Y.%m.%d %H format

Created on 11 Jun 2018  路  13Comments  路  Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import datetime as dt
print(dt.datetime.strptime('2012.01.01 1', '%Y.%m.%d %H')) 
# -> 2012-01-01 01:00:00
print(pd.to_datetime('2012.01.01 1', format='%Y.%m.%d %H')) 
# -> ValueError: time data '2012.01.01 1' doesn't match format specified

Problem description

When using to_datetime to parse a date that only includes an hour component, but not minutes and seconds, with a format that is otherwise similar to ISO8601 (such as the format '%Y.%m.%d %H'), a ValueError is raised (see above). This behavior is unexpected as strptime can parse the same date without any problem, using the same format string (see above).

I suspect that the problem is in the _format_is_iso function of pandas._libs.tslibs.parsing, where it is just checked if the ISO format starts with the format given - so this format is recognized as being ISO-like. In this case, the format passed to to_datetime is ignored and tslib.array_to_datetime function is used to parse the date instead, which doesn't seem to be able to handle this kind of format.

My current workaround is to modify the dates to also have a minutes component (append ':00' to every string) so that they can be parsed.

Expected Output

2012-01-01 01:00:00 (same as when using strptime)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug Timeseries

Most helpful comment

To me it's only a question of whether we consider "2016.05.06 1" an ISO 8601 (loosely as we use it) formatted string, i.e. parses even if no format is passed.

Even if the answer to that is no, we need to match python if an explicit format spec is passed, so some change is necessary either way. It's a bug for pd.to_datetime(..., format='format') not to match datetime.strptime(..., 'format')

All 13 comments

Thanks for the report - issue seems to be here - raising an error if we reach the end of the string and the hour wasn't 2 digits, which looks unnecessary.

https://github.com/pandas-dev/pandas/blob/7000b899038e9d6559ce80d3c018ec0ad5412efe/pandas/_libs/src/datetime/np_datetime_strings.c#L257

PR to fix welcome - to start with I think could just delete this conditional and see if any tests break (as well as adding new tests)

@chris-b1 deleting this conditional does solve @johan12345 's issue. But fails multiple tests. I will see if can modify those tests.

@jbrockmendel brings up a good point in questioning this in the associated PR. While I understand that the datetime module is parsing that single digit hour for the user, I'm not sure that can be guaranteed across all platforms. In fact, the Python documentation on "%H" from C89 states that it represents a "Hour (24-hour clock) as a zero-padded decimal number", which the example is not.

I think if pandas tries to make guarantees about how this gets parsed for non-standard directives that we'd be opening up a can of worms, so I'd vote for no action here.

Source:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

It appears in POSIX std, strptime will handle missing leading 0s for %H - we already do the same for %m, %d, etc, so I think it's reasonable to expand our parser.

http://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html

I really only worry about ambigous cases, is there is any risk "2016.05.06 1" is ambiguous with something other than pd.Timestamp('2016-05-05 01:00:00')? (honestly asking)

FWIW this would be breaking with both numpy and dateutil.

we broke with numpy a while back, and its parser is limited so not worried about that, but the dateutil one is more interesting, didn't know that.

from dateutil.parser import parser

parser().parse("2016.05.06 1")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-f30f9e318c90> in <module>()
----> 1 parser().parse("2016.05.06 1")

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\dateutil\parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    557 
    558         if res is None:
--> 559             raise ValueError("Unknown string format")
    560 
    561         if len(res) == 0:

ValueError: Unknown string format

parser().parse("2016.05.06 01")
Out[11]: datetime.datetime(2016, 5, 6, 1, 0)

@chris-b1 good info. Any concerns with Windows? Not sure where to find that information

Windows being windows doesn't actually have a strptime in its c-runtime, but looks like std::get_time which it would have follows roughly the same spec with respect to leadings 0s.
http://en.cppreference.com/w/cpp/io/manip/get_time

@chris-b1 Is it really worth messing with present parser? As @WillAyd mentioned, the bugs that may follow can be tricky to tackle with considering following a different path from numpy and dateutil parser (which isn't that helpful either, I reckon)

To me it's only a question of whether we consider "2016.05.06 1" an ISO 8601 (loosely as we use it) formatted string, i.e. parses even if no format is passed.

Even if the answer to that is no, we need to match python if an explicit format spec is passed, so some change is necessary either way. It's a bug for pd.to_datetime(..., format='format') not to match datetime.strptime(..., 'format')

@chris-b1 Right - It is okay that the date is not parseable without specifying an explicit format, but when specifying the actual format, the behavior should be the same as with strptime.

Thanks everyone for the quick reactions, even before I had time to look into it in detail myself!

Hi there.
I need your help guys, I have a series of dates in '%Y-%m-%d' and I need them in '%Y%m'.
like 2018-12-04 to 201804.
I will appreciate any assistance.
Thanks

If x is date time object
pandas.datetime.strftime(x,'%Y%m')

Was this page helpful?
0 / 5 - 0 ratings