In [5]: df['Gold Categories'].count()
Out[5]: 135218
In [6]: df['Gold Categories'].isna().sum()
Out[6]: 0
In [7]: df['Gold Categories'].iloc[256]
Out[7]: <NA>
In [8]: pd.isna(df['Gold Categories'].iloc[256])
Out[8]: False
In [9]: type(df['Gold Categories'].iloc[256])
Out[9]: pandas._libs.missing.NAType
In [10]: pd.__version__
Out[10]: '1.0.1'
pd.show_versions()
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.16-200.fc30.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : nb_NO.UTF-8
LOCALE : nb_NO.UTF-8
pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 2.2.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.5.2
tabulate : 0.8.5
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.46.0
Could you include a reproducible example? I could not reproduce this on master:
>>> pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0]
<NA>
>>> pd.isna(pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0])
True
@MarcoGorelli I can upload a sample, if that will suffice, but I do not know how to reproduce
@tsoernes if you run the two lines I posted above, do you get the same output?
@MarcoGorelli Yes
Here is a sample of that column with 2 rows. It is a zipped pickle file (Github only allows zips).
In [167]: na_test = read_pickle('/tmp/na_test.pickle')
Loaded 2 entries (a Series) from /tmp/na_test.pickle (2020-02-11 21:51)
In [168]: na_test.isna().sum()
Out[173]: 0
In [174]: na_test
Out[176]:
268 [Fintech]
269 <NA>
Name: Gold Categories, dtype: object
@tsoernes I'm afraid we can't accept raw pickle files in bug reports, as they could be unsafe. Please remove the attachment from your message :)
Could you please paste the output of na_test.to_dict(), so we can copy-and-paste it and reproduce the issue?
na_test.to_dict()
Out[195]: {268: ['Fintech'], 269: <NA>}
Thanks @tsoernes
TBH I still can't reproduce the issue though
>>> pd.Series({268: ['Fintech'], 269: pd.NA}).isna()
268 False
269 True
dtype: bool
I can't either, when going via a dictionary.
When pickling/unpickling, I can reproduce this:
In [40]: s = pd.Series({268: ['Fintech'], 269: pd.NA})
In [41]: s.isna()
Out[41]:
268 False
269 True
dtype: bool
In [42]: s.to_pickle('test_na_pickle.pkl')
In [43]: s2 = pd.read_pickle('test_na_pickle.pkl')
In [44]: s2.isna()
Out[44]:
268 False
269 False
dtype: bool
In [45]: type(s2.values[1])
Out[45]: pandas._libs.missing.NAType
In [46]: s2.values[1] is pd.NA
Out[46]: False
So apparently, when unpickling, it doesn't return the same singleton.
So we should probably explicitly implement methods for pickling/unpickling on the NA class
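To illustrate the idea, here is a minimal, standard-library-only sketch of how a sentinel singleton can survive a pickle round trip. It is not pandas' implementation: the class names `BrokenNAType` and `FixedNAType` and the globals `BROKEN_NA` and `FIXED_NA` are hypothetical, made up for this example. It relies on the documented pickle behaviour that when `__reduce__` returns a string, the object is stored as a reference to the module-level global of that name.

```python
import pickle


class BrokenNAType:
    """Sentinel with no pickling support: unpickling builds a new object."""

    def __repr__(self):
        return "<NA>"


class FixedNAType:
    """Sentinel that pickles as a reference to its module-level instance."""

    def __repr__(self):
        return "<NA>"

    def __reduce__(self):
        # Returning a string makes pickle store the name of a module-level
        # global, so loading looks up the existing object instead of
        # constructing a new one.
        return "FIXED_NA"


# Hypothetical module-level singletons, standing in for pd.NA.
BROKEN_NA = BrokenNAType()
FIXED_NA = FixedNAType()


def roundtrip(obj):
    """Pickle and immediately unpickle an object in memory."""
    return pickle.loads(pickle.dumps(obj))


print(roundtrip(BROKEN_NA) is BROKEN_NA)  # False: a second instance was created
print(roundtrip(FIXED_NA) is FIXED_NA)    # True: the singleton is preserved
```

Any check that relies on identity, such as `value is NA`, only keeps working after unpickling in the second case.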
A simple example of the problem:
In [1]: import pandas as pd
In [2]: pd.DataFrame([[pd.NA]]).to_pickle('na_problem.pkl')
In [3]: df = pd.read_pickle('na_problem.pkl')
In [4]: df.isna()
Out[4]:
       0
0  False
In [5]: id(df.loc[0, 0]), id(pd.NA)
Out[5]: (140393643089760, 140393944655632)
This can also cause exceptions when working with dtypes other than object.
In [1]: import pandas as pd
In [2]: pd.DataFrame([[pd.NA]], dtype='string').to_pickle('na_problem.pkl')
In [3]: pd.read_pickle('na_problem.pkl').head()
Out[3]: ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
... removed for brevity
/home/mephph/.local/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _validate(self)
168 """Validate that we only store NA or strings."""
169 if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
--> 170 raise ValueError("StringArray requires a sequence of strings or pandas.NA")
171 if self._ndarray.dtype != "object":
172 raise ValueError(
ValueError: StringArray requires a sequence of strings or pandas.NA
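The following function replaces the incorrect NA values in place. It operates one column at a time to preserve dtypes. flake8 complains about comparing types rather than using isinstance, but I find this easier to read.
import pandas as pd

def fix_wrong_na(df):
    # Replace NA objects that are not the pd.NA singleton, modifying df in place.
    for column in df.columns:
        # Match by type, because the unpickled values fail identity checks against pd.NA.
        isna_mask = df[column].apply(type) == type(pd.NA)
        # Assign through .loc to avoid chained assignment on a possible copy.
        df.loc[isna_mask, column] = pd.NA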