Pandas: Empty cells make pandas use float, even if read_csv(dtype={'FOO': str}) is used

Created on 7 Oct 2017 · 10 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pandas as pd

csv_path = 'test.csv'
df = pd.read_csv(csv_path, delimiter=';', quotechar='"',
                 decimal=',', encoding="ISO-8859-1", dtype={'FOO': str})
# fails on the empty cell: it is read back as float NaN, which has no .zfill
df.FOO = df.FOO.map(lambda n: n.zfill(6))
print(df)

test.csv:

FOO;BAR
01,23;4,56
1,23;45,6
;987

Problem description

When I use dtype={'FOO': str}, I expect pandas to treat the column as a string. This seems to work, but when an empty cell is present, pandas seems to switch to float.

Expected Output

      FOO     BAR
0  001,23    4.56
1  001,23   45.60
2  000000  987.00

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.2
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.13.3
scipy: 0.19.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.14
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels: Dtypes, IO CSV, Usage Question

Most helpful comment

I don't directly find another related issue, apart from https://github.com/pandas-dev/pandas/issues/1450. You can actually do that here as well: add na_values=[], keep_default_na=False to read_csv if you want to prevent empty strings from being parsed as NaN.

All 10 comments

I believe this is expected behavior.

From the read_csv documentation:

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

Maybe the converters arg to read_csv is what you're after:
converters={'FOO': lambda x: str(x)}
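
A minimal sketch of that approach (using io.StringIO to stand in for the file so it runs standalone):

import pandas as pd
from io import StringIO

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

# converters run on each raw cell before NA detection, so the empty
# field comes through as '' rather than NaN
df = pd.read_csv(StringIO(data), delimiter=';', decimal=',',
                 converters={'FOO': str})
print(df.FOO.values)                            # ['01,23' '1,23' '']
print(df.FOO.map(lambda n: n.zfill(6)).values)  # ['001,23' '001,23' '000000']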

@MartinThoma If you look at the values of the column, you will see pandas correctly preserved the data as strings (as you specified with dtype={'FOO': str}):

In [20]: df.FOO.values
Out[20]: array(['01,23', '1,23', nan], dtype=object)

The only 'gotcha' is that empty strings are still seen as missing values (and thus converted to NaN), and not kept as an empty string.

So your solution of filling the missing values with an empty string (df.FOO.fillna(value="")) is actually fine.
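
A sketch of that route, again reading the sample data from a string buffer (an illustration, not the original script):

import pandas as pd
from io import StringIO

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

df = pd.read_csv(StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str})
df.FOO = df.FOO.fillna('')                 # NaN -> '' before any str methods
df.FOO = df.FOO.map(lambda n: n.zfill(6))  # safe now: no floats left
print(df.FOO.values)                       # ['001,23' '001,23' '000000']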

The solution of using the converters arg (converters={'FOO': str}) is also fine (although I think it will be slower if you have a lot of data, but not sure).

I seem to recall this issue coming up before. Would be helpful to link to prior discussions.

I don't directly find another related issue, apart from https://github.com/pandas-dev/pandas/issues/1450. You can actually do that here as well: add na_values=[], keep_default_na=False to read_csv if you want to prevent empty strings from being parsed as NaN.
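
For illustration, that call looks like this: keep_default_na=False turns off the built-in set of NA strings, and na_values=[] adds no new ones, so the empty cell survives as an ordinary empty string.

import pandas as pd
from io import StringIO

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

df = pd.read_csv(StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str},
                 na_values=[], keep_default_na=False)
print(df.FOO.values)  # ['01,23' '1,23' ''] -- the empty cell stays a string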

Why is na_values=[] required? What would happen without it?

Is it just me, or are type inference and missing-data handling an idiosyncratic part of reading data into pandas DataFrames? Anyway, thanks for all the advice.

Seems like this is the intended behavior, which is documented in read_csv. Going to close, but happy to reopen if there are any suggestions to improve the documentation.

Take a look at this:

pd.read_csv('csv_file.csv', dtype={'special_id': int})

That code throws this error:
ValueError: Integer column has NA values in column 0

This is because the given column has empty cells, which I expected to be treated as NaN. Without the dtype argument I get the values as floats.
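
A side note on the integer case: newer pandas versions offer a nullable integer extension dtype that avoids this error. A minimal sketch, assuming pandas 0.24 or later (the column names and data here are made up for illustration):

import pandas as pd
from io import StringIO

data = 'special_id,value\n1,a\n,b\n3,c\n'

# capital-I 'Int64' is the nullable extension dtype: the column stays
# integer and the empty cell becomes a missing value instead of
# raising ValueError
df = pd.read_csv(StringIO(data), dtype={'special_id': 'Int64'})
print(df.special_id.dtype)  # Int64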
