Pandas: read_csv: European numbers do not work with dates

Created on 22 Aug 2016 · 10Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

Using the following reader leads to omission of the dates resulting in no index:

file_, index_col=0,
                   header=4, sep=';', quoting=1, parse_dates=True,
                   encoding='latin-1', dayfirst=True, 
#                   decimal=',', thousands='.'

Expected Output

dataframe with pyhon numeric data and datecol as index

output of `pd.show_versions()`


pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-31-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

Bug IO CSV

Source

dacoex

Most helpful comment

I tested with the v0.19 RC and it works with the original data. Thanks again!

dacoex on 13 Sep 2016

👍2

All 10 comments

Can you post some example data and the actual read_csv, and the output you're getting? You can use the StringIO module https://docs.python.org/3/library/io.html#io.StringIO to paste the data in. Thanks.

TomAugspurger on 22 Aug 2016

@TomAugspurger thanks for the response.

Here goes the example:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd


TESTDATA=StringIO("""
 ; 
 ; 
 ; 
 ; 
 ; 
 ; 
day;transaction value (EUR) anon
04.08.2016;33043,5217026458
03.08.2016;52542,924322369
01.08.2016;2470,86389648009
01.08.2016;124462,590366738
01.08.2016;82283,2571462275
01.08.2016;5508278,98359258
01.08.2016;154264,547309073
01.08.2016;38068,8061252655
01.08.2016;10444,4806476402
01.08.2016;13958,4753522405
01.08.2016;4834682,43989406
29.07.2016;-1247752,81935378
29.07.2016;157975,468518703
    """)

df = pd.read_csv(TESTDATA, index_col=0,
                   header=4, sep=';', quoting=1, parse_dates=True,
                   encoding='latin-1', dayfirst=True, 
                   decimal=',',
                   thousands='.',
                   )

dacoex on 23 Aug 2016

A couple of things:

1) from pandas.compat import StringIO takes care of the Python 2/3 thing FYI.

2) I believe you have the header parameter wrong. I think it should be header=6.

However, even with that parameter fixed, the reason you're seeing the index_col not being parsed as a date is because your thousands='.' is confusing the parser into parsing the '.' as a numeric value.

That is a bug, so thank you for pointing it out!

The issue actually has nothing to do with European date formats. You can see the bug surfaced here with a much more simplified example:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a\n04.15.2016'
>>> read_csv(StringIO(data), index_col=0, parse_dates=True, thousands='.')
Empty DataFrame
Columns: []
Index: [4152016]  # WRONG
>>>
>>> read_csv(StringIO(data), index_col=0, parse_dates=True)
Empty DataFrame
Columns: []
Index: [2016-04-15 00:00:00]  # RIGHT

Note that this bug does not affect non-index columns:

>>> read_csv(StringIO(data), parse_dates=['a'], thousands='.')
           a
0 2016-04-15

Similar observations can be made with the Python parser.

gfyoung on 24 Aug 2016

@gfyoung thanks for confirmation.

minor clarification:

The issue actually has nothing to do with European date formats.

I was referring to _European data_ as under: Quoting, Compression, and File Format

No idea how to go on from here but looks like the processing priorities need to be changed in the parser.

dacoex on 24 Aug 2016

@dacoex : Ah, okay. Good to know that my minimal example is capturing the issue you were seeing!

gfyoung on 24 Aug 2016

Big thanks to @jorisvandenbossche & @gfyoung FOSS is great!

dacoex on 27 Aug 2016

BTW, to we need to add this to:
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#v0-19-0-august-2016

dacoex on 27 Aug 2016

https://github.com/pydata/pandas/commit/9d10b76fa22382b6017741a1344922c627c881d3

jreback on 27 Aug 2016

it's takes a while to actually generate those docs

jreback on 27 Aug 2016

I tested with the v0.19 RC and it works with the original data. Thanks again!

dacoex on 13 Sep 2016

👍2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Hexbin plots does not display x label and xtick labels

BDannowitz · 3Comments

Interpolate (upsample) non-equispaced timeseries into equispaced 18.0rc1

marcelnem · 3Comments

Suffixes ignored on second merge

MatzeB · 3Comments

Can't read csv using python pandas

Ashutosh-Srivastav · 3Comments

Cannot use apply on Series with Timestamp values

nathanielatom · 3Comments