Pandas: unexpected behavior when using infer_objects function

Created on 6 Aug 2018 · 4 Comments · Source: pandas-dev/pandas

Code Sample

In the code sample below, column "A" consists entirely of numbers formatted as strings.

import pandas as pd

df = pd.DataFrame({"A": ["1", "2", "3"]})
df.convert_objects(convert_numeric=True).dtypes  # deprecated; parses the strings as numbers
df.infer_objects().dtypes  # leaves column "A" as object

Problem description

At present, I am using the convert_objects function to convert any columns which are entirely made up of numbers formatted as strings, to numeric values if possible. I note that the convert_objects function is deprecated, so I attempted to update my code to use infer_objects instead.

However, the infer_objects function appears to work differently: it converts a column to a numeric type only if every value in the column is already a number and the series was merely stored with object dtype. A column of numbers formatted as strings, as in the example, is left as object.
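To make the difference concrete, here is a small sketch (not part of the original report) that compares an object column holding real integers with one holding numeric strings. The variable names are illustrative only, and the behavior described is what I would expect on recent pandas releases.

```python
import pandas as pd

# Object-dtype column holding real Python ints: soft conversion can succeed.
ints_as_objects = pd.DataFrame({"A": [1, 2, 3]}, dtype=object)

# Column holding strings that merely look like numbers.
numeric_strings = pd.DataFrame({"A": ["1", "2", "3"]})

inferred_ints = ints_as_objects.infer_objects().dtypes["A"]
inferred_strings = numeric_strings.infer_objects().dtypes["A"]

print(inferred_ints)     # an integer dtype
print(inferred_strings)  # still a non-numeric dtype
```

In other words, infer_objects only recovers a better dtype for values that are already the right Python type; it does not parse strings.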

I understand that converting columns consisting entirely of string-formatted numbers to numeric types may not be desirable as the default behavior; however, it would be handy to have an argument that allows either behavior.

Alternatively, one must loop through each column and attempt conversion using the to_numeric function.
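A minimal sketch of that loop follows. The helper name to_numeric_where_possible is hypothetical (it is not a pandas function), and skipping non-string columns is my own assumption about the desired behavior, so that existing numeric or datetime columns are not touched.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def to_numeric_where_possible(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df in which string columns that parse fully as
    numbers are converted; all other columns are left unchanged.

    Hypothetical helper for illustration, not part of pandas.
    """
    result = df.copy()
    for column in result.columns:
        series = result[column]
        # Only attempt conversion on object/string columns (assumption).
        if not (is_object_dtype(series) or is_string_dtype(series)):
            continue
        try:
            result[column] = pd.to_numeric(series)
        except (ValueError, TypeError):
            # At least one value is not numeric: keep the column as-is.
            pass
    return result


df = pd.DataFrame({"A": ["1", "2", "3"], "B": ["x", "2", "3"]})
converted = to_numeric_where_possible(df)
print(converted.dtypes)
```

Column "A" parses cleanly and becomes numeric, while column "B" contains "x" and is left as it was. Passing errors="coerce" to pd.to_numeric instead would turn unparseable values into NaN rather than leaving the column untouched.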

Expected Output

# output from df.convert_objects(convert_numeric=True).dtypes
A    int64
dtype: object

# output from df.infer_objects().dtypes
A    object
dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.1
pytest: 3.2.1
pip: 18.0
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.2.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Numeric

All 4 comments

I see that numeric coercion was specifically disabled as part of #16915 and 3670711cc8f56085783a22ccdc8274de041779df; what is the reasoning behind coercing dates and timedeltas but not numbers? It would be handy to have a to_numeric function that operates across the whole dataframe.

@rora002 : IIUC, you're proposing to have pd.to_numeric expand to accept a DataFrame?

@gfyoung : that's right, or something to that effect. In my mind, numbers can be inferred from columns that consist entirely of string-formatted numbers, so this would logically be associated with the infer_objects function, although I understand that this may not be desirable by default.

Any progress/different thoughts on this?
