Pandas: Empty cells make pandas use float, even if read_csv(dtype={'FOO': str}) is used

Created on 7 Oct 2017 · 10 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pandas as pd

csv_path = 'test.csv'
df = pd.read_csv(csv_path, delimiter=';', quotechar='"',
                 decimal=',', encoding="ISO-8859-1", dtype={'FOO': str})
# fails on the empty cell: it is read back as float NaN, which has no .zfill
df.FOO = df.FOO.map(lambda n: n.zfill(6))
print(df)

test.csv:

FOO;BAR
01,23;4,56
1,23;45,6
;987

Problem description

When I use dtype={'FOO': str}, I expect pandas to treat the column as a string. This seems to work, but when an empty cell is present, pandas seems to switch to float.

Expected Output

      FOO     BAR
0  001,23    4.56
1  001,23   45.60
2  000000  987.00

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.2
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.13.3
scipy: 0.19.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.14
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels: Dtypes, IO CSV, Usage Question

Most helpful comment

I don't directly find another related issue, apart from https://github.com/pandas-dev/pandas/issues/1450. You can actually do that here as well: add na_values=[], keep_default_na=False to read_csv if you want to prevent empty strings from being parsed as NaN.

All 10 comments

I believe this is expected behavior.

From the read_csv documentation:

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

Maybe the converters arg to read_csv is what you're after:
converters={'FOO': lambda x: str(x)}
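
A minimal sketch of that approach (using io.StringIO to stand in for the file so it runs standalone):

import pandas as pd
from io import StringIO

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

# converters run on each raw cell before NA detection, so the empty
# field comes through as '' rather than NaN
df = pd.read_csv(StringIO(data), delimiter=';', decimal=',',
                 converters={'FOO': str})
print(df.FOO.values)                            # ['01,23' '1,23' '']
print(df.FOO.map(lambda n: n.zfill(6)).values)  # ['001,23' '001,23' '000000']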

@MartinThoma If you look at the values of the column, you will see pandas correctly preserved the data as strings (as you specified with dtype={'FOO': str}):

In [20]: df.FOO.values
Out[20]: array(['01,23', '1,23', nan], dtype=object)

The only 'gotcha' is that empty strings are still seen as missing values (and thus converted to NaN), and not kept as an empty string.

So your solution of filling the missing values with an empty string (df.FOO.fillna(value="")) is actually fine.
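
A sketch of that route, again reading the sample data from a string buffer (an illustration, not the original script):

import pandas as pd
from io import StringIO

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

df = pd.read_csv(StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str})
df.FOO = df.FOO.fillna('')                 # NaN -> '' before any str methods
df.FOO = df.FOO.map(lambda n: n.zfill(6))  # safe now: no floats left
print(df.FOO.values)                       # ['001,23' '001,23' '000000']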

The solution of using the converters arg (converters={'FOO': str}) is also fine (although I think it will be slower if you have a lot of data, but not sure).

I seem to recall this issue coming up before. Would be helpful to link to prior discussions.

I don't directly find another related issue, apart from https://github.com/pandas-dev/pandas/issues/1450. You can actually do that here as well: add na_values=[], keep_default_na=False to read_csv if you want to prevent empty strings from being parsed as NaN.
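
For illustration, that call looks like this: keep_default_na=False turns off the built-in set of NA strings, and na_values=[] adds no new ones, so the empty cell survives as an ordinary empty string.

import pandas as pd
from io import StringIO

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

df = pd.read_csv(StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str},
                 na_values=[], keep_default_na=False)
print(df.FOO.values)  # ['01,23' '1,23' ''] -- the empty cell stays a string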

Why is na_values=[] required? What would happen without it?

Is it just me, or are type inference and missing-data handling an idiosyncratic part of reading data into pandas DataFrames? Anyway, thanks for all the advice.

Seems like this is the intended behavior, which is documented in read_csv. Going to close, but happy to reopen if there are any suggestions to improve the documentation.

Take a look at this:

pd.read_csv('csv_file.csv', dtype={'special_id': int})

That code throws this error:
ValueError: Integer column has NA values in column 0

This is because the given column has empty cells, which I expected to be treated as NaN. Without the dtype argument I get the values as floats.
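
A side note on the integer case: newer pandas versions offer a nullable integer extension dtype that avoids this error. A minimal sketch, assuming pandas 0.24 or later (the column names and data here are made up for illustration):

import pandas as pd
from io import StringIO

data = 'special_id,value\n1,a\n,b\n3,c\n'

# capital-I 'Int64' is the nullable extension dtype: the column stays
# integer and the empty cell becomes a missing value instead of
# raising ValueError
df = pd.read_csv(StringIO(data), dtype={'special_id': 'Int64'})
print(df.special_id.dtype)  # Int64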
