Pandas: BUG: pd.NA doesn't pickle/unpickle faithfully

Created on 10 Feb 2020 · 12 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible


In [5]: df['Gold Categories'].count()
Out[5]: 135218

In [6]: df['Gold Categories'].isna().sum()
Out[6]: 0

In [7]: df['Gold Categories'].iloc[256]
Out[7]: <NA>

In [8]: pd.isna(df['Gold Categories'].iloc[256])
Out[8]: False

In [9]: type(df['Gold Categories'].iloc[256])
Out[9]: pandas._libs.missing.NAType

In [10]: pd.__version__
Out[10]: '1.0.1'


Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.16-200.fc30.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : nb_NO.UTF-8
LOCALE : nb_NO.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 2.2.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.5.2
tabulate : 0.8.5
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.46.0

Labels: Bug, IO Pickle, NA - MaskedArrays

All 12 comments

Could you include a reproducible example? I could not reproduce this on master:

>>> pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0]                                                                                                                                   
<NA>

>>> pd.isna(pd.DataFrame({'Gold categories': [pd.NA]})['Gold categories'].iloc[0])                                                                                                                          
True

@MarcoGorelli I can upload a sample, if that will suffice, but I do not know how to reproduce it.

@tsoernes if you run the two lines I posted above, do you get the same output?

@MarcoGorelli Yes

Here is a sample of that column with 2 rows. It is a zipped pickle file (GitHub only allows zips).


In [167]: na_test = read_pickle('/tmp/na_test.pickle')
Loaded 2 entries (a Series) from /tmp/na_test.pickle (2020-02-11 21:51)

In [168]: na_test.isna().sum()
Out[173]: 0

In [174]: na_test

Out[176]: 
268    [Fintech]
269         <NA>
Name: Gold Categories, dtype: object

@tsoernes I'm afraid we can't accept raw pickle files in bug reports, as they could be unsafe. Please remove the attachment from your message :)

Could you please paste the output of na_test.to_dict(), so we can copy-and-paste it and reproduce the issue?

na_test.to_dict()
Out[195]: {268: ['Fintech'], 269: <NA>}

Thanks @tsoernes

TBH I still can't reproduce the issue though

>>> pd.Series({268: ['Fintech'], 269: pd.NA}).isna()                                                        
268    False
269     True
dtype: bool

I can't either, when going via a dictionary.

When pickling/unpickling, I can reproduce this:

In [40]: s = pd.Series({268: ['Fintech'], 269: pd.NA})                                                                                                                                                             

In [41]: s.isna()                                                                                                                                                                                                  
Out[41]: 
268    False
269     True
dtype: bool

In [42]: s.to_pickle('test_na_pickle.pkl')                                                                                                                                                                         

In [43]: s2 = pd.read_pickle('test_na_pickle.pkl')                                                                                                                                                                 

In [44]: s2.isna()                                                                                                                                                                                                 
Out[44]: 
268    False
269    False
dtype: bool

In [45]: type(s2.values[1])                                                                                                                                                                                        
Out[45]: pandas._libs.missing.NAType

In [46]: s2.values[1] is pd.NA                                                                                                                                                                                     
Out[46]: False

So apparently, when unpickling, it doesn't return the same singleton.
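
The behaviour above can be sketched without pandas. This is only an illustration under assumptions: `Marker`, `SENTINEL`, and `is_missing` are hypothetical stand-ins, and the sketch assumes that missing-value detection relies on an identity check against a single module-level object. With default pickling, the round trip builds a fresh instance of the same class, so the type matches but the identity check fails.

```python
import pickle


class Marker:
    """Hypothetical sentinel class that relies on default pickling."""


# The one instance that the rest of the code treats as "missing".
SENTINEL = Marker()


def is_missing(value):
    # Identity check: only the original module-level object counts.
    return value is SENTINEL


# Default pickling reconstructs a new Marker instance on load.
restored = pickle.loads(pickle.dumps(SENTINEL))

print(type(restored) is type(SENTINEL))  # the class is the same
print(restored is SENTINEL)              # but it is a different object
print(is_missing(restored))              # so the identity check fails
```

This mirrors the transcript: the unpickled value has the right type, yet it is not the object that the check compares against.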

So we should probably explicitly implement pickling/unpickling methods on the NA class.
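
One way such methods could look, as a minimal sketch rather than the actual pandas implementation: `MissingType` and `MISSING` are hypothetical names. The sketch relies on a documented feature of Python's `pickle` module: when `__reduce__` returns a string, pickle stores a reference to the module-level global of that name instead of rebuilding an instance, so loading returns the existing singleton.

```python
import pickle


class MissingType:
    """Hypothetical stand-in for a singleton missing-value marker."""

    def __repr__(self):
        return "<MISSING>"

    def __reduce__(self):
        # A string return value makes pickle save a reference to the
        # module-level name "MISSING" rather than a new instance.
        return "MISSING"


# The single module-level instance that the name above refers to.
MISSING = MissingType()

# Round trip the singleton on its own and nested inside a container.
restored = pickle.loads(pickle.dumps(MISSING))
restored_list = pickle.loads(pickle.dumps([MISSING, "Fintech"]))

print(restored is MISSING)
print(restored_list[0] is MISSING)
```

The design choice here is to pickle by reference rather than by value, which only works if the name is importable from the same module at load time.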

A simple example of the problem:

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]]).to_pickle('na_problem.pkl')

In [3]: df = pd.read_pickle('na_problem.pkl')

In [4]: df.isna()

Out[4]: 
       0
0  False

In [5]: id(df.loc[0, 0]), id(pd.NA)

Out[5]: (140393643089760, 140393944655632)

This can also cause exceptions when working with dtypes other than object.

In [1]: import pandas as pd

In [2]: pd.DataFrame([[pd.NA]], dtype='string').to_pickle('na_problem.pkl')

In [3]: pd.read_pickle('na_problem.pkl').head()
Out[3]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
... removed for brevity
/home/mephph/.local/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _validate(self)
    168         """Validate that we only store NA or strings."""
    169         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
--> 170             raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    171         if self._ndarray.dtype != "object":
    172             raise ValueError(

ValueError: StringArray requires a sequence of strings or pandas.NA

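The following function replaces the incorrect NA values in place. It operates one column at a time to preserve dtypes. flake8 complains about comparing types rather than using isinstance, but I find this easier to read.

import pandas as pd

def fix_wrong_na(df):
    """Replace unpickled NA objects with the pd.NA singleton, in place."""
    for column in df.columns:
        # Match on type: the unpickled value is an NAType but is not pd.NA.
        isna_mask = df[column].apply(type) == type(pd.NA)
        # Use .loc to avoid chained assignment on a possible copy.
        df.loc[isna_mask, column] = pd.NA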