Pandas: [0.24.1] New nullable integer fillna with non-int doesn't coerce to object

Created on 12 Feb 2019 · 11 comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd

sample_data = []

sample_data.append({"integer_column":None})
sample_data.append({"integer_column":1})
sample_data.append({"integer_column":2})

df = pd.DataFrame(sample_data)

# Check the initial dtype (float64 here, since None becomes NaN)
# df.dtypes

df.loc[:,'integer_column'] = df.loc[:,'integer_column'].astype('Int64')

# Check new type is Int64, nullable
# df.dtypes

df.fillna('null_string')

Problem description

Using the new nullable Int64 dtype, it is not possible to fill NaN values with a non-integer value.

Error raised

TypeError:

Expected Output

The new dataframe should have replaced its NaN values with the value passed to the .fillna() method.
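For comparison, a minimal sketch (using the same values as above): the default float64 series already coerces to object on a string fill, which is the behaviour the report expects from the nullable Int64 column as well.

import pandas as pd

# Default float64 series: filling with a string silently coerces to object
pd.Series([None, 1, 2]).fillna('null_string')

# Nullable Int64 series: the same call raises TypeError in 0.24.1 (this issue)
pd.Series([1, 2, None], dtype='Int64').fillna('null_string')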

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.24.1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml.etree: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.18
pymysql: None
psycopg2: None
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: Dtypes, ExtensionArray

All 11 comments

This works fine if you use an actual integer value to fill, so there's not really much point in using Int64 in this case since you're still asking for an object column in the end.

In any case, I suppose it should still coerce to object for you, like using float here would. Investigation and PRs are always welcome.
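A short sketch of the "actual integer value" case, which keeps the nullable dtype intact:

import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')
s.fillna(0)    # works: result is still Int64, with 0 in place of the missing value
s.fillna('x')  # raises TypeError in 0.24.1 instead of coercing to object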

I vaguely recall some discussion on whether ExtensionArray.fillna should allow coercing the array to the dtype of the fill_value. I don't recall if we reached a final conclusion. It's somewhat inconvenient to have to manually .astype before filling with a different dtype, but the type stability ensured by ExtensionArray[T].fillna -> ExtensionArray[T] is nice.
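The manual workaround mentioned above, as a sketch against the example from the report:

import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')

# Cast to object first, then fill with a value of a different dtype
s.astype(object).fillna('null_string')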


In fact, since I'm using pandas for an ETL tool, this doesn't look nice to me. Having to change the type to "object" inevitably adds ".0" after the integer numbers and breaks my code.

The alternative I used is to strip the ".0" part after astype(object) and then fill the NaN values.

@jelther getting slightly off topic, but if you are getting zeros appended to your integers, it's because they are getting cast to float at some point. Explicitly constructing your DataFrame with dtype=object as an argument would let you mix and match the None value with integers without the implicit cast to float.

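A sketch of that suggestion, using the sample data from the report:

import pandas as pd

sample_data = [{"integer_column": None},
               {"integer_column": 1},
               {"integer_column": 2}]

# dtype=object keeps the integers as Python ints next to None,
# so there is no implicit cast to float and no trailing ".0"
df = pd.DataFrame(sample_data, dtype=object)
df.fillna('null_string')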

@WillAyd, I think this is the best alternative, but I'm not able to specify that since I'm extracting the data from a SQL Server database with pandas.read_sql. I don't see in the documentation how I would be able to specify the dtypes when selecting the data.

but the type stability ensured by ExtensionArray[T].fillna -> ExtensionArray[T] is nice.

+1. For eg DatetimeArray, the fill value also needs to be a datetime-like.

I would assume that .fillna() would coerce the series to object dtype when I am trying to fill it with an object, just like, for example, adding a float to it coerces it to float64:

>>> pd.Series([1, 2, None], dtype='Int64') + 0.5
0    1.5
1    2.5
2    NaN
dtype: float64

However:

>>> pd.Series([1, 2, None], dtype='Int64').fillna('')
TypeError: <U1 cannot be converted to an IntegerDtype

This coercion already happens when using .fillna() with a string on a series of floats:

>>> pd.Series([1, 2, None], dtype='float64').fillna('')
0    1
1    2
2     
dtype: object

would coerce the series into being of type object when I am trying to fill it with an object

Why do you prefer coercing the series to the dtype of the fill value, rather than the other way around? It's not clear to me that one is preferable to the other.

Because I need to fill the <NA> values with something else. Just like when adding a float to an integer it becomes a float. And like I said, this is already .fillna()'s current behaviour on floats.

But I also understand there is something to say for not doing so. Maybe a boolean argument such as coerce=True would be a solution?

Definitely, on the IntegerArray itself fillna should raise instead of cast, but in ExtensionBlock.fillna we would expect it to fall back by casting to object.
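A sketch of the distinction described above (in 0.24.1 both calls raise; the proposal only changes the Series/block level):

import pandas as pd

arr = pd.array([1, 2, None], dtype='Int64')  # the underlying IntegerArray
s = pd.Series(arr)

# Array level: stays strict, a non-integer fill value keeps raising
# arr.fillna('null_string')   # TypeError

# Series / block level: proposed to fall back to object dtype instead of raising
# s.fillna('null_string')     # raises in 0.24.1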
