import ray
import pyarrow.parquet as pq
Then the traceback will be:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-5-dc8a4f7832af> in <module>()
----> 1 import pyarrow.parquet as pq
~/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py in <module>()
26
27 from pyarrow.filesystem import FileSystem, LocalFileSystem, S3FSWrapper
---> 28 from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa
29 RowGroupMetaData, ParquetSchema)
30 import pyarrow._parquet as _parquet # noqa
ModuleNotFoundError: No module named 'pyarrow._parquet'
The import fails. I'm working on loading Parquet files into a Ray DataFrame in a distributed fashion, and this issue is blocking that work.
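For anyone who needs their script to keep running while this is open, here is a minimal sketch of guarding the import. The helper name `has_module` is made up for illustration. It assumes, as the traceback suggests, that the bundled pyarrow is the one found once ray has been imported, so it would need to run after `import ray` to probe that copy.

```python
import importlib.util


def has_module(name):
    """Return True if the module `name` can be located on the import path.

    find_spec imports parent packages (e.g. pyarrow) but does not execute
    the target module itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package is missing or fails to import.
        return False


if has_module("pyarrow._parquet"):
    import pyarrow.parquet as pq
else:
    pq = None  # callers have to handle missing Parquet support
    print("pyarrow._parquet is not available in this environment")
```

This only detects the problem up front; it does not make Parquet reading work with the bundled pyarrow.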
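The pyarrow module bundled in the Ray binary does have the parquet files in it (parquet.py, _parquet.pxd, _parquet.pyx), though the listing shows no compiled _parquet extension next to lib and plasma: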
simonmo@Simons-MBP ~/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow $ ls
__init__.pxd io-hdfs.pxi libplasma.dylib
__init__.py io.pxi memory.pxi
__pycache__ ipc.pxi orc.py
_config.pyx ipc.py pandas_compat.py
_orc.pxd lib.cpython-36m-darwin.so parquet.py
_orc.pyx lib.pxd plasma.cpython-36m-darwin.so
_parquet.pxd lib.pyx plasma.pyx
_parquet.pyx lib_api.h plasma_store
array.pxi libarrow.0.0.0.dylib public-api.pxi
compat.py libarrow.0.dylib scalar.pxi
error.pxi libarrow.dylib serialization.pxi
feather.pxi libarrow_python.0.0.0.dylib serialization.py
feather.py libarrow_python.0.dylib table.pxi
filesystem.py libarrow_python.dylib types.pxi
formatting.py libplasma.0.0.0.dylib types.py
hdfs.py libplasma.0.dylib util.py
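The same check can be done programmatically. Below is a rough sketch that lists `.pyx` sources in a directory with no compiled counterpart. The function name `uncompiled_pyx_modules` and the suffix list are assumptions made for illustration, not part of Ray or pyarrow.

```python
import os

# Suffixes treated as "compiled extension" here. This is an assumption that
# covers the common CPython cases (.so on macOS/Linux, .pyd on Windows).
COMPILED_SUFFIXES = (".so", ".pyd")


def uncompiled_pyx_modules(pkg_dir):
    """List .pyx module names in pkg_dir that have no compiled counterpart."""
    names = os.listdir(pkg_dir)
    # "lib.cpython-36m-darwin.so" -> "lib"
    compiled = {
        n.split(".", 1)[0] for n in names if n.endswith(COMPILED_SUFFIXES)
    }
    # "_parquet.pyx" -> "_parquet"
    sources = {n[: -len(".pyx")] for n in names if n.endswith(".pyx")}
    return sorted(sources - compiled)
```

Run against the directory listed above, this would report `_config`, `_orc` and `_parquet`. A flagged name is only a hint, since some `.pyx` files may be included into another module rather than compiled on their own.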
From a discussion offline: The current plan to resolve this is to add parquet-cpp to the build process. When arrow is stable, we can just use the binary because it will include the required libraries. @simon-mo will lead this effort.
@robertnishihara We are having a hard time getting parquet to build against the pyarrow version that's currently in Ray. If we update to the most recent pyarrow, it seems to work.
I see, updating to the most recent is fine.
Is this going to be merged soon? The latest whl on S3 still has this issue for me. Thanks.
@virtualluke you're trying to use pyarrow.parquet? We're still working on it in #1531. @simon-mo do you have a sense of how long it will take?
@robertnishihara I would expect this issue to be fixed (along with adding support for reading parquet into ray DataFrame) within this week!
Thanks, yes I have some scripts making heavy use of pyarrow.parquet that I want to use ray to parallelize.
I look forward also to seeing how I can work with parquet data in a ray dataframe. Exciting stuff!
@virtualluke If you need it sooner, you can check out my PR #1531 and build the needed pyarrow_files by running:
src/thirdparty/download_thirdparty.sh
src/thirdparty/build_thirdparty.sh
This will generate a directory in ray/pyarrow_files. You can then replace the installed copy (anaconda3/lib/python3.6/site-packages/ray/pyarrow_files) with the one just built.
I haven't built ray in a while as the daily whl files are getting uploaded to S3 (nice feature!).
My build environment (connected to the internet) is not the same as my run environment (isolated from the internet), so I build via the instructions at ray/python/README-building-wheels.md.
I made the changes from your PR #1531 but the wheel build failed.
I see, I think #1531 isn't working yet. Should be soon.