Ray: [DataFrames] Can't import pyarrow.parquet

Created on 6 Feb 2018 · 11 Comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.13.1
  • Ray installed from (source or binary): pip
  • Ray version: '0.3.1'
  • Python version: Python 3.6.2 :: Anaconda custom (64-bit)
  • Exact command to reproduce:
import ray
import pyarrow.parquet as pq

Then the traceback will be:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-dc8a4f7832af> in <module>()
----> 1 import pyarrow.parquet as pq

~/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py in <module>()
     26 
     27 from pyarrow.filesystem import FileSystem, LocalFileSystem, S3FSWrapper
---> 28 from pyarrow._parquet import (ParquetReader, FileMetaData,  # noqa
     29                               RowGroupMetaData, ParquetSchema)
     30 import pyarrow._parquet as _parquet  # noqa

ModuleNotFoundError: No module named 'pyarrow._parquet'

Describe the problem


I am unable to import pyarrow.parquet after importing ray. I'm working on loading parquet files into a Ray DataFrame in a distributed fashion, and this issue is blocking that work.

Source code / logs


The pyarrow package bundled with the Ray binary does contain the parquet source files (parquet.py, _parquet.pyx, _parquet.pxd):

 simonmo@Simons-MBP ~/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow $ ls
__init__.pxd                 io-hdfs.pxi                  libplasma.dylib
__init__.py                  io.pxi                       memory.pxi
__pycache__                  ipc.pxi                      orc.py
_config.pyx                  ipc.py                       pandas_compat.py
_orc.pxd                     lib.cpython-36m-darwin.so    parquet.py
_orc.pyx                     lib.pxd                      plasma.cpython-36m-darwin.so
_parquet.pxd                 lib.pyx                      plasma.pyx
_parquet.pyx                 lib_api.h                    plasma_store
array.pxi                    libarrow.0.0.0.dylib         public-api.pxi
compat.py                    libarrow.0.dylib             scalar.pxi
error.pxi                    libarrow.dylib               serialization.pxi
feather.pxi                  libarrow_python.0.0.0.dylib  serialization.py
feather.py                   libarrow_python.0.dylib      table.pxi
filesystem.py                libarrow_python.dylib        types.pxi
formatting.py                libplasma.0.0.0.dylib        types.py
hdfs.py                      libplasma.0.dylib            util.py
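
The listing contains _parquet.pyx and _parquet.pxd, but no compiled _parquet extension next to lib.cpython-36m-darwin.so and plasma.cpython-36m-darwin.so, which is consistent with the ModuleNotFoundError above. A minimal sketch for spotting this in a package directory follows; the helper name missing_extensions and the suffix list are assumptions for illustration, not part of Ray or pyarrow:

```python
from pathlib import Path

# File suffixes treated as compiled extension modules (an assumption:
# .so on macOS/Linux, .pyd on Windows).
COMPILED_SUFFIXES = (".so", ".pyd")


def missing_extensions(package_dir):
    """Return sorted names of Cython sources (.pyx) with no compiled module.

    The module name is taken as the part of the file name before the first
    dot, so "lib.cpython-36m-darwin.so" counts as the module "lib".
    """
    cython_sources = set()
    compiled = set()
    for path in Path(package_dir).iterdir():
        module_name = path.name.split(".", 1)[0]
        if path.suffix == ".pyx":
            cython_sources.add(module_name)
        elif path.suffix in COMPILED_SUFFIXES:
            compiled.add(module_name)
    return sorted(cython_sources - compiled)
```

Run against a directory with the contents listed above, this would report _config, _orc, and _parquet, since only lib and plasma have a compiled counterpart.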

All 11 comments

From a discussion offline: The current plan to resolve this is to add parquet-cpp to the build process. When arrow is stable, we can just use the binary because it will include the required libraries. @simon-mo will lead this effort.

@robertnishihara We are having a hard time getting pyarrow to build with the one that's currently in Ray. If we update to the most recent pyarrow, it seems to work.

I see, updating to the most recent is fine.

Is this going to be merged soon? The latest whl on S3 still has this issue for me. Thanks.

@virtualluke you're trying to use pyarrow.parquet? We're still working on it in #1531. @simon-mo do you have a sense of how long it will take?

@robertnishihara I would expect this issue to be fixed (along with adding support for reading parquet into ray DataFrame) within this week!

Thanks, yes I have some scripts making heavy use of pyarrow.parquet that I want to use ray to parallelize.

I look forward also to seeing how I can work with parquet data in a ray dataframe. Exciting stuff!

@virtualluke If you need it sooner, you can check out my PR #1531 and build the required pyarrow_files as follows:

  1. run src/thirdparty/download_thirdparty.sh
  2. run src/thirdparty/build_thirdparty.sh. This script will generate a directory in ray/pyarrow_files
  3. replace your ray library's pyarrow (for example, anaconda3/lib/python3.6/site-packages/ray/pyarrow_files) with the one just built.
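
The three steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the scripts are assumed to be run with bash from the root of the Ray checkout, and keeping a ".bak" copy of the old directory is an extra precaution that the steps do not mention. The location of the freshly built pyarrow_files is passed in explicitly rather than guessed.

```python
import shutil
import subprocess
from pathlib import Path


def build_pyarrow_files(ray_checkout):
    """Steps 1 and 2: run the download and build scripts in order.

    Assumption: the scripts are invoked with bash from the checkout root.
    """
    ray_checkout = Path(ray_checkout)
    for script in ("download_thirdparty.sh", "build_thirdparty.sh"):
        subprocess.run(
            ["bash", str(Path("src") / "thirdparty" / script)],
            cwd=str(ray_checkout),
            check=True,  # stop if a script fails
        )


def replace_pyarrow_files(built_dir, installed_dir):
    """Step 3: swap the installed pyarrow_files for the newly built one.

    The previous directory is kept next to it with a ".bak" suffix
    (an addition for safety, not part of the original steps).
    """
    built_dir = Path(built_dir)
    installed_dir = Path(installed_dir)
    if not built_dir.is_dir():
        raise FileNotFoundError(f"built directory not found: {built_dir}")
    backup = installed_dir.with_name(installed_dir.name + ".bak")
    if installed_dir.exists():
        if backup.exists():
            shutil.rmtree(backup)
        installed_dir.rename(backup)
    shutil.copytree(built_dir, installed_dir)
    return backup
```

For example, replace_pyarrow_files(built, "anaconda3/lib/python3.6/site-packages/ray/pyarrow_files") would perform step 3, where built is whatever path the build script produced on your machine.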

I haven't built ray in a while as the daily whl files are getting uploaded to S3 (nice feature!).

My build environment (connected to the internet) is not the same as my run environment (isolated from internet) so I build via the instructions at ray/python/README-building-wheels.md

I made the changes from your PR #1531 but the wheel build failed.

I see, I think #1531 isn't working yet. Should be soon.
