pandas.to_datetime() does not respect exact format string

Created on 17 Mar 2016 · 24 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
s = pd.Series(['2012-01-01 10:00', '2012-02-10 23:20', '2012-03-22 12:40', '2012-05-02 02:00', '2012-06-11 15:20', '2012-07-22 04:40','2012-08-31 18:00', '2012-10-11 07:20', '2012-11-20 20:40', '2012-12-31 10:00'])
print(s)
0    2012-01-01 10:00
1    2012-02-10 23:20
2    2012-03-22 12:40
3    2012-05-02 02:00
4    2012-06-11 15:20
5    2012-07-22 04:40
6    2012-08-31 18:00
7    2012-10-11 07:20
8    2012-11-20 20:40
9    2012-12-31 10:00
dtype: object
print(pd.to_datetime(s, format='%Y-%m-%d', errors='raise', infer_datetime_format=False, exact=True))
0   2012-01-01 10:00:00
1   2012-02-10 23:20:00
2   2012-03-22 12:40:00
3   2012-05-02 02:00:00
4   2012-06-11 15:20:00
5   2012-07-22 04:40:00
6   2012-08-31 18:00:00
7   2012-10-11 07:20:00
8   2012-11-20 20:40:00
9   2012-12-31 10:00:00
dtype: datetime64[ns]

Expected Output

A ValueError, because the string '2012-01-01 10:00' (and all the others in this series) does not comply with the format string '%Y-%m-%d' and exact conversion is required.
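For comparison, the standard library's datetime.strptime enforces exactly this kind of strict match; a minimal sketch of the expected behaviour:

from datetime import datetime

# strptime raises when the format does not cover the whole string
datetime.strptime('2012-01-01 10:00', '%Y-%m-%d')
# ValueError: unconverted data remains:  10:00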

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None

pandas: 0.17.1
nose: 1.3.6
pip: 8.0.3
setuptools: 18.0
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.16.0
statsmodels: None
IPython: 3.0.0
sphinx: None
patsy: None
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: 2.2.2
xlrd: 0.9.3
xlwt: 0.8.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: None
html5lib: None
httplib2: 0.9.1
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None

Labels: API Design, Docs, Timeseries

Most helpful comment

Just adding my two cents because I was running into the exact same problem. My use case: I have a program that uses read_csv to load a csv file whose index is either just a date, or a date and time. If it has a time I want to make it UTC; if not, no time zone. The Pythonic way, I thought, would be to call read_csv with a parser using to_datetime that tries one format and, if that raises an exception, tries the other format. But because no exception is raised when there is not an exact match, that approach does not work.

I would be in favor of an exact='strict' or similar to truly enforce the exact match.

All 24 comments

The problem is this part in pd.to_datetime():

        if format is not None:
            # There is a special fast-path for iso8601 formatted
            # datetime strings, so in those cases don't use the inferred
            # format because this path makes process slower in this
            # special case
            format_is_iso8601 = _format_is_iso(format)
            if format_is_iso8601:
                require_iso8601 = not infer_datetime_format
                format = None

When I change the condition to if format is not None and infer_datetime_format, it works as expected. Is this something that could be changed in pandas? I find it quite confusing that, despite explicitly setting infer_datetime_format=False, there's still some internal magic happening in this function. Others may feel the same, not sure...
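For reference, a sketch of what that proposed change would look like in context (not a tested patch):

        if format is not None and infer_datetime_format:
            # proposed: only drop the explicit format and take the
            # ISO8601 fast path when format inference is allowed
            format_is_iso8601 = _format_is_iso(format)
            if format_is_iso8601:
                require_iso8601 = not infer_datetime_format
                format = None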

Nothing to do with that (well, you can disable it if you set infer_datetime_format=False). It's here: https://docs.python.org/3/library/re.html#search-vs-match

If a format is specified and exact=True (the default), it's a .match-like search, meaning the format must match at the beginning of the string (as opposed to anywhere). So this is a valid match.

I suppose we could be more strict on exact, e.g. maybe make it exact='match'|'search'|False or something.

But why is this a problem? (As an aside, you have an ISO8601 string, so you don't need to do anything to parse it.)
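To illustrate the match-vs-search distinction with the standard re module (the regex below is only a stand-in for the pattern pandas builds from the format string):

import re

pattern = re.compile(r'\d{4}-\d{2}-\d{2}')  # rough analogue of '%Y-%m-%d'
s = '2012-01-01 10:00'

print(bool(pattern.match(s)))      # True: matches at the start, trailing text allowed
print(bool(pattern.search(s)))     # True: matches anywhere in the string
print(bool(pattern.fullmatch(s)))  # False: the whole string would have to match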

I'm trying out a few date formats and want to determine the correct one. If I try '%Y-%m-%d' and '%Y-%m-%d %H:%M:%S', they both match, which isn't quite true.

The format parameter is definitely set to None in the code I posted above, so I think it has something to do with the snippet... I find this counter-intuitive given the parameters I passed into this function.

Hmm, actually I take that back. It IS removing the format BECAUSE it's ISO8601. I am not sure it makes sense to change this, as there is a perf penalty for parsing ISO strings with an actual format string (via regex). And we expect (and users expect) that the perf will be the same whether they pass a compatible format or not.

Further, if it actually does match ISO, how is this a problem?

In my case, I can easily fix this in the application by not checking for multiple ISO format strings. I am trying to detect date format ambiguities, i.e. more than one date format matching, which is a generalisation of discerning the %m/%d vs. %d/%m case.

However, I was surprised to get this kind of behaviour from this function, as I thought I had disabled all auto-magic behaviour (infer_datetime_format=False, exact=True). I wasn't concerned about possible performance hits.

Perhaps a comment in the documentation would suffice to clarify this behaviour if it cannot be turned off through parameters.

Nothing to do with that (well, you can disable it if you set infer_datetime_format=False). It's here: https://docs.python.org/3/library/re.html#search-vs-match

Your intuition seems to have been the same as mine... but infer_datetime_format=False doesn't work like that for this ISO date case.

@ssche I don't see any ambiguity in an ISO8601 string. That's the point: it cannot be anything but dayfirst=False, yearfirst=False by definition.

        if format is not None:
            # There is a special fast-path for iso8601 formatted
            # datetime strings, so in those cases don't use the inferred
            # format because this path makes process slower in this
            # special case
            format_is_iso8601 = _format_is_iso(format)
            if format_is_iso8601:
                require_iso8601 = not infer_datetime_format
                format = None

The code says it all. If you would like to update the doc-string, by all means.

The ambiguity detection is something I do in my application; I only provided it as background because you asked.
The point here is that even with infer_datetime_format=False and exact=True there is still some auto-magic behaviour, which I find unexpected.

@ssche it's not unexpected: you gave it a format string which is ISO8601. I don't see any problem here. If you give it a non-ISO8601 format then it would indeed raise.

In [16]: pd.to_datetime(s, format='%Y-%m-%d%H', infer_datetime_format=False, exact=True, errors='raise')
ValueError: time data '2012-01-01 10:00' does not match format '%Y-%m-%d%H' (match)

# you don't need any arguments actually, the default will be to raise if it is not a matched format.
In [17]: pd.to_datetime(s, format='%Y-%m-%d%H')
ValueError: time data '2012-01-01 10:00' does not match format '%Y-%m-%d%H' (match)

So you don't find it confusing that the behaviour of this function is date-format-specific, i.e. that given another date format an exception is raised?

@ssche I am not sure what you mean. Of course it's date-format-specific. It's just special-cased to ISO8601 to match the general format.

In the ISO-format case, the date format doesn't need to be exact, but for all other formats an exception will be raised when the format doesn't match. That's what I find confusing.

And so? Again, why is this a problem? By definition they are ISO8601, which matches a variety of non-ambiguous formats (where the time is optional), and separators can be optional (pandas actually allows a lot more separators than the standard).

So we tend to parse lots of formats easily. I still don't understand why this is a problem. It parses correctly, as it should, even though you gave an underspecified format.

So we tend to parse lots of formats easily. I still don't understand why this is a problem. It parses correctly, as it should, even though you gave an underspecified format.

That's exactly my problem. I have a very specific use case which requires raising an exception when the date format doesn't match exactly. Neither '%Y-%m-%d' nor '%Y-%m-%d %H:%M:%S' raises an exception when parsing '2012-01-01 10:00', so the application is unable to detect the proper (exact: neither under- nor over-specified) date format, which is all I care about.

I tried to make this more strict through the parameters I have quoted a number of times already, but they didn't make the date conversion more strict. I don't know how else I can make this function accept only the date formats I specify and raise an exception when they don't match.
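A workaround for this detection use case is to validate with the standard library's strict strptime before handing the data to pandas (a sketch; the helper name is made up):

from datetime import datetime

def matching_formats(sample, candidates):
    # return only the candidate formats that parse the sample exactly
    matches = []
    for fmt in candidates:
        try:
            datetime.strptime(sample, fmt)
            matches.append(fmt)
        except ValueError:
            pass
    return matches

print(matching_formats('2012-01-01 10:00',
                       ['%Y-%m-%d', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M']))
# ['%Y-%m-%d %H:%M']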

Well, you can try to make a change to the code to see if you can make it very strict, but I suspect this will be quite tricky (maybe add exact='strict' or something).

Alternatively, in your code it's easy enough to check the lengths of the input strings.

So the if format is not None and infer_datetime_format "fix" I proposed wouldn't be acceptable?

No, that defeats the purpose of ISO format detection.

You could allow infer_datetime_format='strict' (or maybe exact='strict'), which, when a format is specified, could bypass the inference check (and then would fail because it's not an exact match).

Just adding my two cents because I was running into the exact same problem. My use case: I have a program that uses read_csv to load a csv file whose index is either just a date, or a date and time. If it has a time I want to make it UTC; if not, no time zone. The Pythonic way, I thought, would be to call read_csv with a parser using to_datetime that tries one format and, if that raises an exception, tries the other format. But because no exception is raised when there is not an exact match, that approach does not work.

I would be in favor of an exact='strict' or similar to truly enforce the exact match.
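A sketch of that two-step parser, probing the first value with the strict stdlib parser instead of relying on to_datetime to raise (the file name and formats are hypothetical):

from datetime import datetime
import pandas as pd

def parse_index(values):
    # probe the first value to decide which format the column uses
    try:
        datetime.strptime(values[0], '%Y-%m-%d %H:%M:%S')
        return pd.to_datetime(values, format='%Y-%m-%d %H:%M:%S').tz_localize('UTC')
    except ValueError:
        return pd.to_datetime(values, format='%Y-%m-%d')  # date only: no time zone

df = pd.read_csv('data.csv', index_col=0, date_parser=parse_index)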

This breaks a lot of tests for us as well. Very similar use case to @rsheftel's. A strict mode that raises exceptions would work for us too.

We're having the same problem.
I'd like to be able to add some argument(s) to the following code to make it either raise or return NaT:

>>> to_datetime(0.0, format="%Y-%m-%d", exact=True)
Timestamp('1970-01-01 00:00:00')

I like the sound of exact="strict".
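One defensive workaround until something like that exists is to reject non-string input before calling to_datetime (the wrapper name is made up):

import pandas as pd

def strict_to_datetime(value, fmt='%Y-%m-%d'):
    # refuse non-string input so a float like 0.0 raises here
    # instead of being interpreted as an epoch offset
    if not isinstance(value, str):
        raise TypeError('expected a date string, got %r' % (value,))
    return pd.to_datetime(value, format=fmt, exact=True)

strict_to_datetime(0.0)  # TypeError: expected a date string, got 0.0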

We ran into the same problem. I already raised a question on Stack Overflow:
Pandas to_datetime not setting NaT on wrong format
Is there a decision yet on how to proceed with this issue?

Hello all,

Not sure if my issue is the same but posting here for suggestions.

Core problem: the pd.to_datetime function does not respect original values even if I set errors='ignore'.

See the example below. The column of values is stored as UTC timestamps in milliseconds. When I try to convert it to a datetime object, pandas automatically fills in the NaN values with NaT, even though I specifically told it to ignore errors. Any suggestions on how to fix this? Thanks

[screenshot omitted]
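If the goal is to keep the original NaN cells untouched, one option is to convert only the non-null entries and leave the rest alone (a sketch; the sample values are made up):

import numpy as np
import pandas as pd

s = pd.Series([1331856000000.0, np.nan, 1331942400000.0])

converted = s.astype(object)  # object dtype can hold Timestamps and NaN side by side
mask = s.notna()
converted[mask] = pd.to_datetime(s[mask], unit='ms', utc=True)
print(converted)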

