import pandas as pd     # v0.24.2
import scipy.sparse     # v1.1.0

df = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                        columns=list(map(str, range(1000))),
                        default_fill_value=0.0)
df.to_parquet('rpd.pq', engine='pyarrow')
Running this gives the error:
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column 0 with type Sparse[float64, 0.0]')
This error occurs when trying to save a pandas sparse DataFrame using the to_parquet method. It can be avoided by calling df.to_dense().to_parquet() instead, but densifying can require a lot of memory for very large sparse matrices.
The issue was also raised at https://github.com/apache/arrow/issues/1894 and https://github.com/pandas-dev/pandas/issues/20692
The expected output is a parquet file on disk.
Output of pd.show_versions():
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 3.9.1
pip: 19.0.3
setuptools: 40.2.0
Cython: None
numpy: 1.16.3
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
This is not a pandas issue; it is up to Arrow whether (or, more likely, not) to support this format.
We are deprecating SparseDataFrame, but supporting SparseArray as an extension type, so this might be supported in the future.
Okay, @wesm recommended filing the issue there: https://github.com/apache/arrow/issues/1894#issuecomment-491991095
@jreback SparseDataFrame is being deprecated? So it will not be possible to have a sparse Pandas DataFrame in future versions? Or will it be possible to make one using the Sparse array extension type?
https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
The SparseDataFrame subclass is being deprecated. It's functionally equivalent to a DataFrame with sparse values.
And support for SparseArrays in to_parquet / arrow might depend on the discussion in https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
Thanks, @TomAugspurger @jreback @wesm. Is there an example of making a Pandas DataFrame from SparseArray values?
I'm trying this out in this Kaggle kernel using the sparr variable from the documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparsearray), but the DataFrame does not appear to be sparse.
link to kernel (fork to re-run) - https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray?scriptVersionId=14173944
cc @melaniedavila @manugarciaquismondo
We should clearly update the user guide on this (http://pandas-docs.github.io/pandas-docs-travis/user_guide/sparse.html), as it still shows the "old" way. @TomAugspurger is adding some documentation in his PR deprecating the subclass: https://github.com/pandas-dev/pandas/pull/26137
But basically, if you have SparseArray values, you can put them in a DataFrame by using the DataFrame constructor as normal, eg:
In [40]: arr = pd.SparseArray([0,0,0,1])
In [41]: arr
Out[41]:
[0, 0, 0, 1]
Fill: 0
IntIndex
Indices: array([3], dtype=int32)
In [42]: df = pd.DataFrame({'a': arr})
In [43]: df
Out[43]:
a
0 0
1 0
2 0
3 1
In [44]: df.dtypes
Out[44]:
a Sparse[int64, 0]
dtype: object
(what version of pandas are you using?)
Feedback on using it in a normal pandas DataFrame instead of the SparseDataFrame subclass is very welcome! (we are all not very regular users of the sparse functionality)
@cornhundred thanks for the notebook. From the output there, I assume you are using an older version of pandas? (SparseArray support inside DataFrame itself is only available in 0.24)
Thanks @jorisvandenbossche. I modified your example a bit and got it to run on Google Colab, which is running Pandas 0.24.2:
The DataFrame made with sparse data is smaller on memory than the dense matrix. The original issue with saving the sparse DataFrame to parquet is demonstrated at the bottom of the notebook.
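The memory saving mentioned here is easy to see directly; a minimal sketch using pd.arrays.SparseArray (the extension array discussed above) on a mostly-zero column:

```python
import numpy as np
import pandas as pd

# A mostly-zero column: 100 non-zero values out of 100,000.
data = np.zeros(100_000)
data[::1000] = 1.0

dense = pd.Series(data)
sparse = pd.Series(pd.arrays.SparseArray(data, fill_value=0.0))

# The sparse column stores only the non-zero values plus their indices,
# so its footprint scales with the number of non-zeros, not the length.
print(dense.memory_usage(deep=True), sparse.memory_usage(deep=True))
```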
Kaggle, however, is running Pandas 0.23.4:
https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray-0-23-4?scriptVersionId=14175226
In terms of how we are using sparse data - we start by loading a sparse matrix (of single-cell gene expression data) in Matrix Market format (MTX) using scipy.io.mmread, perform some filtering on the data, and then save back to a new Matrix Market file using scipy.io.mmwrite. The scipy read/write functionality allows us to load and save data directly in scipy's sparse matrix format (coo_matrix) without having to make it dense (which would cause us to run out of RAM).
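That Matrix Market round trip can be sketched as follows (an in-memory buffer stands in for the .mtx file on disk; the matrix shape and density are made up for illustration):

```python
import io

import scipy.io
import scipy.sparse

# A random sparse matrix in COO format, as mmread would produce.
mat = scipy.sparse.random(100, 50, density=0.05, format="coo", random_state=0)

# mmwrite stores only the non-zero (row, col, value) triplets, so the
# matrix is never densified on the way to disk.
buf = io.BytesIO()
scipy.io.mmwrite(buf, mat)

buf.seek(0)
back = scipy.io.mmread(buf)  # returns a COO sparse matrix
print(back.shape, back.nnz)
```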
We're looking into parquet since it allows reading selected columns without loading the entire dataset (as well as predicate pushdown for row-group filtering). However, it seems we first have to convert to dense matrices before saving to parquet (see the bottom of the Colab notebook gist). Ideally we would have the same sparse-matrix IO we have with the Matrix Market format, but with parquet instead.
I'm looking into pyarrow to see if they have this functionality https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files
Hi @jorisvandenbossche, probably a naive question, but SparseArray is one-dimensional (as far as I understand), so to make a 2D DataFrame do I have to make a bunch of Series and then combine them into a DataFrame? Are there methods (e.g. df.to_sparse and df.to_dense) that exist (or are planned) to support easy swapping between sparse and dense DataFrames (using the SparseArray extension type)?
cc @manugarciaquismondo
@cornhundred yes, if you have a DataFrame with sparse columns, it is each column that is separately stored as a 1D sparse array (that was the same before with the SparseDataFrame as well).
But you can convert a 2D sparse matrix into that format without needing to make a full dense array. With the currently released version, the pd.SparseDataFrame(..) constructor accepts a scipy matrix, and in the upcoming version this will be replaced with pd.DataFrame.sparse.from_spmatrix.
And going from sparse to dense also exists, as you mentioned, with to_dense()
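A short sketch of that conversion path with the accessor API (available from pandas 0.25 on; a tiny identity matrix keeps the output readable):

```python
import pandas as pd
import scipy.sparse

# 2D scipy sparse matrix -> DataFrame with one 1D SparseArray per column,
# without creating a dense intermediate.
sp = scipy.sparse.eye(4, format="csr")
df = pd.DataFrame.sparse.from_spmatrix(sp, columns=list("abcd"))
print(df.dtypes["a"])     # a SparseDtype, e.g. Sparse[float64, ...]
print(df.sparse.density)  # 4 stored values out of 16 -> 0.25

# Sparse -> dense, and back to sparse column-wise via an astype.
dense = df.sparse.to_dense()
sparse_again = dense.astype(pd.SparseDtype("float64", 0.0))
```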
Thanks @jorisvandenbossche that makes sense.