Pandas: SparseDataFrame.to_parquet fails with new error

Created on 13 May 2019 · 11 comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd # v0.24.2
import scipy.sparse # v1.1.0

df = pd.SparseDataFrame(scipy.sparse.random(1000, 1000), 
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
df.to_parquet('rpd.pq', engine='pyarrow')

Gives the error

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column 0 with type Sparse[float64, 0.0]')

Problem description

This error occurs when saving a pandas sparse DataFrame with the to_parquet method. It can be avoided by running df.to_dense().to_parquet(...), but that can require a lot of memory for very large sparse matrices.

The issue was also raised in https://github.com/apache/arrow/issues/1894 and https://github.com/pandas-dev/pandas/issues/20692

Expected Output

The expected output is a parquet file on disk.

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 3.9.1
pip: 19.0.3
setuptools: 40.2.0
Cython: None
numpy: 1.16.3
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

All 11 comments

This is not a pandas issue; it is up to Arrow whether (or, more likely, not) to support this format.

We are deprecating SparseDataFrame, but supporting SparseArray as an extension type, so this might be supported in the future.

Okay, @wesm recommended raising the issue here: https://github.com/apache/arrow/issues/1894#issuecomment-491991095

@jreback SparseDataFrame is being deprecated? So it will not be possible to have a sparse Pandas DataFrame in future versions? Or will it be possible to make one using the Sparse array extension type?

https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

The SparseDataFrame subclass is being deprecated. It's functionally equivalent to a DataFrame with sparse values.

And support for SparseArrays in to_parquet / arrow might depend on the discussion in https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556

Thanks, @TomAugspurger @jreback @wesm. Is there an example of making a Pandas DataFrame from SparseArray values?

I'm trying this out on this kaggle kernel using the sparr variable from the documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparsearray), but the DataFrame does not appear sparse.

link to kernel (fork to re-run) - https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray?scriptVersionId=14173944

cc @melaniedavila @manugarciaquismondo

We should clearly update the user guide on this (http://pandas-docs.github.io/pandas-docs-travis/user_guide/sparse.html), as that still shows the "old" way. @TomAugspurger is adding some documentation in his PR deprecating the subclass: https://github.com/pandas-dev/pandas/pull/26137

But basically, if you have SparseArray values, you can put them in a DataFrame by using the DataFrame constructor as normal, eg:

In [40]: arr = pd.SparseArray([0,0,0,1])

In [41]: arr
Out[41]: 
[0, 0, 0, 1]
Fill: 0
IntIndex
Indices: array([3], dtype=int32)

In [42]: df = pd.DataFrame({'a': arr})

In [43]: df
Out[43]: 
   a
0  0
1  0
2  0
3  1

In [44]: df.dtypes 
Out[44]: 
a    Sparse[int64, 0]
dtype: object

(what version of pandas are you using?)

Feedback on using it in a normal pandas DataFrame instead of the SparseDataFrame subclass is very welcome! (we are all not very regular users of the sparse functionality)

@cornhundred thanks for the notebook. From seeing the output there, I assume you are using an older version of pandas? (the SparseArray support inside DataFrame itself is only available in 0.24)

Thanks @jorisvandenbossche. I modified your example a bit and got it to run on Google Colab, which is running Pandas 0.24.2:

https://colab.research.google.com/gist/cornhundred/c231f02b2edbc83f466756915ffdfbab/sparsearray_to_dataframe_pandas_0-24-2.ipynb

The DataFrame made with sparse data is smaller in memory than the dense matrix. The original issue with saving the sparse DataFrame to parquet is demonstrated at the bottom of the notebook.

Kaggle, however, is running Pandas 0.23.4:

https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray-0-23-4?scriptVersionId=14175226

In terms of how we are using sparse data: we start by loading a sparse matrix (of single-cell gene expression data) in Matrix Market format (MTX) using scipy.io.mmread, perform some filtering on the data, and then save back to a new Matrix Market file using scipy.io.mmwrite. The scipy read/write functionality allows us to load and save data directly in scipy sparse matrix format (coo_matrix) without having to make it dense (which would cause us to run out of RAM).

We're looking into parquet since it allows reading selected columns without loading the entire dataset (as well as predicate pushdown for row-group filtering). However, it seems that we first have to convert to dense matrices before saving to parquet (see the bottom of the Colab notebook gist). Ideally we could have the same sparse matrix IO we have with the Matrix Market format, but with parquet instead.

I'm looking into pyarrow to see if they have this functionality https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files

Hi @jorisvandenbossche, it's probably a naive question, but SparseArray is one-dimensional (as far as I understand), so to make a 2D DataFrame do I have to make a bunch of Series and then combine them into a DataFrame? Are there methods (e.g. df.to_sparse and df.to_dense) that exist (or are planned) to support easy swapping between sparse and dense DataFrames (using SparseArray as an extension)?

cc @manugarciaquismondo

@cornhundred yes, if you have a DataFrame with sparse columns, it is each column that is separately stored as a 1D sparse array (that was the same before with the SparseDataFrame as well).

But you can convert a 2D sparse matrix into that format without needing to build a full dense array. With the currently released version, the pd.SparseDataFrame(..) constructor accepts a scipy sparse matrix; in the upcoming version this will be replaced with pd.DataFrame.sparse.from_spmatrix.
Going from sparse to dense also exists, as you mentioned, with to_dense()
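With the newer accessor API (pandas >= 0.25, so an assumption relative to the versions discussed in this thread), the swap in both directions can be sketched as:

```python
import numpy as np
import pandas as pd

# Dense -> sparse: cast the columns to a SparseDtype
# (this plays the role of the old df.to_sparse()).
dense = pd.DataFrame(np.eye(4), columns=list("abcd"))
sp_df = dense.astype(pd.SparseDtype("float64", 0.0))

# Sparse -> dense: the .sparse accessor (the counterpart of df.to_dense()).
back = sp_df.sparse.to_dense()
```

Each column of `sp_df` is stored as a 1D SparseArray, but the frame is constructed and used like any other DataFrame.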

Thanks @jorisvandenbossche that makes sense.
