Pandas: read_csv() force numeric dtype on column and set unparseable entry as missing (NaN)

Created on 20 Dec 2012 · 17 comments · Source: pandas-dev/pandas

How can I force a dtype on a column and ensure that any unparseable data entries are filled in as NaN? This is important in cases where there are unpredictable data-entry errors in CSVs or database streams that cannot be mapped to missing values a priori.

E.g.: below I want column 'a' parsed as np.float, but the erroneous 'Dog' entry causes an exception. Is there a way to tell read_csv() to force parsing column 'a' as np.float and fill all unparseable entries with NaN?

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
df.dtypes

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-cd8b6f868aec> in <module>()
      1 data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
----> 2 df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
      3 df.dtypes

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    389                     buffer_lines=buffer_lines)
    390 
--> 391         return _read(filepath_or_buffer, kwds)
    392 
    393     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    205         return parser
    206 
--> 207     return parser.read()
    208 
    209 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    622             #     self._engine.set_error_bad_lines(False)
    623 
--> 624         ret = self._engine.read(nrows)
    625 
    626         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    943 
    944         try:
--> 945             data = self._reader.read(nrows)
    946         except StopIteration:
    947             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5785)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6002)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6870)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7919)()

AttributeError: 'NoneType' object has no attribute 'dtype'
Labels: Bug, Dtypes, Enhancement, IO CSV

All 17 comments

This would be a nice enhancement sometime. I don't know when I or someone else will get to it though

same as #2779 ?

Yes this is the same enhancement request.

ok...why don't you close the other one then....

Comments from closed issue #2779:

I have data for which I know what the data type should be for each column, e.g. float. Occasionally these columns will also have spurious non-float string values due to erroneous data processing/pivoting upstream. I would like read_csv to help me ignore this (unpredictable) spurious data by parsing it as np.nan (missing values). This rules out the read_csv(na_values=) parameter, because I can't predict the bad values beforehand.

I wanted to use read_csv's converter functions to return an np.float wherever the value can be parsed, but this method is prohibitively slow: 4-10x slower in some examples I have tried.

I would love read_csv's quick parser to have a "force_dtype=True" option: no matter what string is in the column, the specified dtype is always returned, with np.nan filled in wherever a value cannot be parsed (i.e. treated as missing).

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\n1.1,2.2,3.3\n2.2,garb,4.4\n3.3,4.4,5.5'
data = data+data[5:]*3000

def converter(num):
    try:
        return np.float(num)
    except (TypeError, ValueError):
        return np.nan

%time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
df.dtypes

%time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b':converter})
df2.dtypes

%time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
df3.dtypes

In [7]: %time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
CPU times: user 0.12 s, sys: 0.00 s, total: 0.12 s
Wall time: 0.12 s <-- Quick native CSV parser

In [8]: df.dtypes <-- Column 'b' comes back as object (a mix of floats and strings), with no way to force a single type.
Out[8]:
a    float64
b     object
c    float64

In [9]: %time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b':converter})
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.42 s <-- Using a converter function slows parsing down by 4x (and this is with a converter on only one column)

In [10]: df2.dtypes <-- This time only np.float values are returned; np.nan stands in for any spurious values in column 'b'.
Out[10]:
a    float64
b    float64
c    float64

In [11]: %time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
AttributeError: 'NoneType' object has no attribute 'dtype' <-- Trying to force np.float on the columns raises an exception on the unparseable values in column 'b' :(

Just wanted to add my 2 cents that the feature requested by @dragoljub is what SAS does by default when creating a data set. I have a similar desire for this functionality. In SAS, if I specify a column as numeric on import and a character field is parsed, it will return a NULL value.

I'll have a look at this (unless someone beats me to it) because it's a somewhat straightforward addition and makes sense when dtypes are explicitly specified.

@dragoljub quite straightforward after reading; I guess this is a request to push this down into read_csv (de facto when you specify a dtype)

In [5]: df = read_csv(StringIO(data))

In [6]: df
Out[6]: 
     a  b    c
0  1.1  2  3.0
1  Dog  5  6.0
2  7.7  8  9.5

In [8]: df['a'].convert_objects(convert_numeric='force')
Out[8]: 
0    1.1
1    NaN
2    7.7
dtype: float64

@dragoljub maybe best to add this as a cookbook entry / or in docs (see my example at the end)

This is a good lesson to have in the cookbook. BTW, does convert_objects() work on DataFrames too?

I guess some explicit pandas dtype post-processing is not so bad, given how fast the CSV parser reads all the data. :v:

yep...convert_objects works on all NDFrames

want to do a quick PR for the cookbook?

If dtype is specified ahead of time it would be nice if conversion were forced for date types as well. I'd be willing to give this a shot...

@lminer that would be excellent!

ideally implement for both the c and python parsers. lmk if you need help!

Apparently there is already code that does that

df.convert_objects(convert_numeric=True)

So it is only a matter of calling convert_objects when reading, right?

@jreback : I thought I'd revisit this given our recent discussion on .infer_objects. I can imagine just adding a coerce_dtype parameter to the signature that would replace objects that fail conversion with NaN (or the equivalent for other dtypes).

That said, I'm concerned about adding more bloat to an already massive signature. Do you still think that implementing this internally (compared to, say, using .infer_objects after the call) is preferable?

if u can find a way to fit this in with the existing dtype option, that would be preferable

maybe

dtype = {'foo' : (float, 'coerce')}

or we introduce a helper function

dtype = {'foo' : parser.coerce(float)}

where coerce just returns an instance

@jreback : I like the first option, seems more user-friendly IMO. I can investigate that and see where it takes me in the process.
