Pandas: DISCUSS: What would an ORC reader/writer API look like?

Created on 8 Feb 2019 · 11 comments · Source: pandas-dev/pandas

cc @mrocklin for dask.dataframe visibility

I'm one of the developers of https://github.com/rapidsai/cudf and we're working on adding GPU-accelerated file readers / writers to our library. It seems most of the standard formats are covered quite nicely in the Pandas API, but ORC isn't. Before we went off defining our own API I wanted to open a discussion for defining what that API would look like so we can be consistent with the Pandas and Pandas-like community.

At the top level, I imagine it would look almost identical to Parquet in something like the following:

```python
def read_orc(path, engine='auto', columns=None, **kwargs):
    """
    Load an ORC object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : string
        File path
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    columns : list, default None
        If not None, only these columns will be read from the file.
    **kwargs
        Additional keyword arguments passed to the engine.

    Returns
    -------
    DataFrame
    """
    ...


def to_orc(self, fname, engine='auto', compression='snappy', index=None,
           partition_cols=None, **kwargs):
    """
    Write a DataFrame to the binary ORC format.

    This function writes the dataframe as an `ORC file
    <https://orc.apache.org/>`_. You can choose different ORC
    backends, and have the option of compression. See
    :ref:`the user guide <io.orc>` for more details.

    Parameters
    ----------
    fname : str
        File path or root directory path. Will be used as the root
        directory path while writing a partitioned dataset.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
        Name of the compression to use. Use ``None`` for no compression.
    index : bool, default None
        If ``True``, include the dataframe's index(es) in the file output.
        If ``False``, they will not be written to the file. If ``None``,
        the behavior depends on the chosen engine.
    partition_cols : list, optional, default None
        Column names by which to partition the dataset.
        Columns are partitioned in the order they are given.
    **kwargs
        Additional keyword arguments passed to the ORC library. See
        :ref:`pandas io <io.orc>` for more details.
    """
    ...
```
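For illustration, here is a minimal sketch of what the `engine='pyarrow'` read path could look like, assuming only pyarrow's existing ORC reader (`pyarrow.orc.ORCFile`); option plumbing, path handling, and error checking are omitted:

```python
import pyarrow.orc


def _read_orc_pyarrow(path, columns=None, **kwargs):
    # ORCFile wraps a local file path or a file-like object.
    orc_file = pyarrow.orc.ORCFile(path)
    # read() accepts an optional column subset and returns a pyarrow.Table.
    table = orc_file.read(columns=columns)
    # Convert the Arrow table to a pandas DataFrame; any remaining
    # keyword arguments are forwarded to Table.to_pandas().
    return table.to_pandas(**kwargs)
```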
Labels: API Design, IO Data

Most helpful comment

From a user perspective I think that it might be better to have explicit `read_parquet` and `read_orc` functions. Though of course on the implementation side hopefully there is some reuse as Arrow's ORC reader becomes more consistent with its parquet reader.

+1 to everything that @xhochy said

All 11 comments

+1 for making it look like the Parquet API. Both formats are very similar and could be considered "competitors". They should also roughly match on the pyarrow side in the future (ORC is currently missing Dataset support in the style of `pyarrow.parquet.ParquetDataset`, and we're missing a writer API, which makes testing hard).

We can skip the engine argument here though as there is only one implementation at the moment.
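To make the dispatch concrete: a hypothetical engine-resolution helper, mirroring the `get_engine` pattern pandas uses for parquet (the `io.orc.engine` option is assumed from the proposal above and does not exist in pandas today), might look like:

```python
def _get_orc_engine(engine='auto'):
    # Hypothetical helper mirroring pandas.io.parquet.get_engine.
    if engine == 'auto':
        # A registered ``io.orc.engine`` option would be consulted here
        # (assumed from the proposal above); its default would be 'pyarrow'.
        engine = 'pyarrow'
    if engine == 'pyarrow':
        # pyarrow is currently the only implementation, so 'auto'
        # and 'pyarrow' resolve to the same reader.
        return _read_orc_pyarrow  # the sketch shown earlier
    raise ValueError(f"Unknown ORC engine: {engine!r}")
```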

Is this 'close' enough to Parquet in people's minds that we could just add a `flavor='parquet|orc'` argument to the parquet reader/writer (in pandas) and then dispatch appropriately?

From a user perspective I think that it might be better to have explicit `read_parquet` and `read_orc` functions. Though of course on the implementation side hopefully there is some reuse as Arrow's ORC reader becomes more consistent with its parquet reader.

+1 to everything that @xhochy said

+1 to having separate functions for `read_parquet` and `read_orc` and everything that @xhochy suggested.

To play the devil's advocate for a moment: do we think this is actually worth including in pandas as a top-level function?

I haven't seen much usage of ORC personally (but my view is also limited; and it's of course also a chicken and egg problem, having it in pandas would give it more exposure).

In my experience ORC is less commonly used than Parquet, but is still fairly common, at least among enterprise Hadoop shops. I think that everyone who bought a Hadoop/Spark cluster from Cloudera ended up using Parquet, while everyone who bought a Hadoop/Spark cluster from Hortonworks ended up using ORC (that's a generalization though). I commonly find ORC in companies that historically used Hortonworks but are now increasing their use of Python.

I think that ORC is less popular than Parquet, and so not a strong priority, but still common enough to be well in scope for a project like Pandas.

@mrocklin thanks for that context! Sounds good to me then

@mrocklin ORC has different use cases than Parquet, especially with its powerful predicate pushdown, block-level indexes, and bloom filters.
Many people are using it with Presto due to the huge amount of work they invested in streamlining ORC.
Also, in our tests ORC massively outperformed Parquet for our use case (20%+ speed increases).

We are absolutely committed to ORC as a format simply due to the amount of data we manage on a tiny budget and ORC having the features required to allow us to do this within that budget.

With support already in Spark and cuDF, and BigQuery support recently added, I think this should be bumped up the roadmap!

...but what about the writer API (`to_orc()`)?

There is no support for writing ORC in pyarrow.

Looks like they have a ticket for it: https://issues.apache.org/jira/browse/ARROW-3014
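As a forward-looking note: later pyarrow releases resolved that ticket and added `pyarrow.orc.write_table`. A minimal `to_orc` sketch built on that newer API (not available at the time of this discussion) might look like:

```python
import pyarrow
import pyarrow.orc


def _to_orc_pyarrow(df, path, index=None):
    # preserve_index mirrors the proposed ``index`` argument: None keeps
    # pyarrow's default index handling, True/False force it on or off.
    table = pyarrow.Table.from_pandas(df, preserve_index=index)
    # write_table exists only in pyarrow releases that include the
    # ORC writer (ARROW-3014).
    pyarrow.orc.write_table(table, path)
```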
