Pandas: read_csv() force numeric dtype on column and set unparseable entry as missing (NaN)

Created on 20 Dec 2012 · 17 comments · Source: pandas-dev/pandas

How can I force a dtype on a column and ensure that any unparseable data entries are filled in as NaN? This is important in cases where there are unpredictable data-entry errors in CSVs or database streams that cannot be mapped to missing values a priori.

E.g.: below I want column 'a' parsed as np.float, but the erroneous 'Dog' entry causes an exception. Is there a way to tell read_csv() to force parsing column 'a' as np.float and fill all unparseable entries with NaN?

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
df.dtypes

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-cd8b6f868aec> in <module>()
      1 data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
----> 2 df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
      3 df.dtypes

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    389                     buffer_lines=buffer_lines)
    390 
--> 391         return _read(filepath_or_buffer, kwds)
    392 
    393     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    205         return parser
    206 
--> 207     return parser.read()
    208 
    209 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    622             #     self._engine.set_error_bad_lines(False)
    623 
--> 624         ret = self._engine.read(nrows)
    625 
    626         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    943 
    944         try:
--> 945             data = self._reader.read(nrows)
    946         except StopIteration:
    947             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5785)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6002)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6870)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7919)()

AttributeError: 'NoneType' object has no attribute 'dtype'
Labels: Bug, Dtypes, Enhancement, IO CSV

All 17 comments

This would be a nice enhancement sometime. I don't know when I or someone else will get to it though

same as #2779 ?

Yes this is the same enhancement request.

ok...why don't you close the other one then....

Comments from closed issue #2779:

I have data for which I know what the data type should be for each column, e.g. float. Occasionally these columns will also have spurious non-float string values due to erroneous data processing/pivoting upstream. I would like read_csv to help me ignore this (unpredictable) spurious data by parsing it as np.nan (missing values). This rules out the read_csv(na_values=) parameter, because I can't predict the bad values beforehand.

I wanted to use read_csv's converter functions to return an np.float wherever the value can be parsed, but this method is prohibitively slow: 4-10x slower in some examples I have tried.

I would love read_csv's quick parser to have a "force_dtype=True" option: no matter what string is in the column, the specified dtype is always returned, with np.nan filled in wherever a value cannot be parsed (i.e. treated as missing).

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\n1.1,2.2,3.3\n2.2,garb,4.4\n3.3,4.4,5.5'
data = data+data[5:]*3000

def converter(num):
    try:
        return np.float(num)
    except (TypeError, ValueError):
        return np.nan

%time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
df.dtypes

%time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b':converter})
df2.dtypes

%time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
df3.dtypes

In [7]: %time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
CPU times: user 0.12 s, sys: 0.00 s, total: 0.12 s
Wall time: 0.12 s <-- Quick native CSV parser

In [8]: df.dtypes <-- Column 'b' comes back as object (a mix of floats and strings), with no way to force a single type.
Out[8]:
a    float64
b     object
c    float64

In [9]: %time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b':converter})
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.42 s <-- Using a converter function slows parsing down by 4x (and this is with a converter on only one column)

In [10]: df2.dtypes <-- This time only np.float values are returned; np.nan stands in for any spurious values in column 'b'.
Out[10]:
a    float64
b    float64
c    float64

In [11]: %time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
AttributeError: 'NoneType' object has no attribute 'dtype' <-- Trying to force np.float on the columns raises an exception on the unparseable values in column 'b' :(

Just wanted to add my 2 cents that the feature requested by @dragoljub is what SAS does by default when creating a data set. I have a similar desire for this functionality. In SAS, if I specify a column as numeric on import and a character field is parsed, it will return a NULL value.

I'll have a look at this (unless someone beats me to it) because it's a somewhat straightforward addition and makes sense when dtypes are explicitly specified.

@dragoljub quite straightforward after reading; I guess this is a request to push this down into read_csv (de facto when you specify a dtype)

In [5]: df = read_csv(StringIO(data))

In [6]: df
Out[6]: 
     a  b    c
0  1.1  2  3.0
1  Dog  5  6.0
2  7.7  8  9.5

In [8]: df['a'].convert_objects(convert_numeric='force')
Out[8]: 
0    1.1
1    NaN
2    7.7
dtype: float64

@dragoljub maybe best to add this as a cookbook entry / or in docs (see my example at the end)

This is a good lesson to have in the cookbook. BTW, does convert_objects() work on DataFrames too?

I guess some explicit pandas dtype post-processing is not so bad, given how fast the CSV parser reads all the data. :v:

yep...convert_objects works on all NDFrames

want to do a quick PR for the cookbook?

If dtype is specified ahead of time it would be nice if conversion were forced for date types as well. I'd be willing to give this a shot...

@lminer that would be excellent!

ideally implement for both the c and python parsers. lmk if you need help!

Apparently there is already code that does that

df.convert_objects(convert_numeric=True)

So it is only a matter of calling convert_objects when reading, right?

@jreback : I thought I'd revisit this given our recent discussion on .infer_objects. I can imagine just adding a coerce_dtype parameter to the signature that would replace objects that fail conversion with NaN (or the equivalent for other dtypes).

That said, I'm concerned about adding more bloat to an already massive signature. Do you still think that implementing this internally (compared to, say, using .infer_objects after the call) is preferable?

if u can find a way to fit this in with the existing dtype option, that would be preferable

maybe

dtype = {'foo' : (float, 'coerce')}

or we introduce a helper function

dtype = {'foo' : parser.coerce(float)}

where coerce just returns an instance

@jreback : I like the first option, seems more user-friendly IMO. I can investigate that and see where it takes me in the process.
