Presto: Please improve parquet connector to not require hive

Created on 16 Jun 2019 · 8 comments · Source: prestodb/presto

Please improve the Parquet connector to not require Hive, and ideally not HDFS either.

Working directly with Parquet files would make Presto much easier for data scientists to adopt.

Related: https://github.com/prestodb/presto/issues/11955

All 8 comments

@mariuss Marius, is this something you'd like to work on?

I don't have the skills :(

We're looking for a way to query large CSV or Parquet files, and Presto seems like a great solution. However, it's currently too complex to get started with Parquet, due to all the Hadoop/Hive dependencies and setup steps, versus being able to connect to the Parquet files directly:
https://scientific-software.netlify.com/howto/how-to-query-big-csv-files

Thanks, Maria!

@mariusa You don't need Hadoop/HDFS. All you need is S3 and Hive (Hive can run without HDFS, NameNode, etc.)

Even the standalone Hive metastore needs Hadoop-specific environment variables/jars. Maybe implement the Presto Thrift connector to communicate with S3? I will give it a go :-)

I would be very interested in this connector too. I believe (though admittedly I haven't tried it) that querying large Parquet files directly would be significantly more efficient than passing the data between Hive and Presto over Thrift.

@jeromeof The data does not come from Hive; only the metadata does.

I haven't tried it, but I've heard there is an experimental setting (hive.metastore=file) that could support reading from Parquet directly - https://github.com/prestodb/presto/tree/master/presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file

You can indeed [ab]use hive.metastore=file to avoid setting up a real Hive metastore. You can then run MinIO to serve your local files over "S3".
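For reference, a minimal catalog configuration along these lines might look like the following sketch. The property names are from the Presto Hive connector; the catalog directory, MinIO endpoint, and credentials are placeholder assumptions you'd replace with your own:

```properties
connector.name=hive-hadoop2
# Use the experimental file-based metastore instead of a Thrift metastore
hive.metastore=file
hive.metastore.catalog.dir=file:///tmp/hive-catalog
# Point the S3 client at a local MinIO instance serving your files
hive.s3.endpoint=http://localhost:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.s3.path-style-access=true
```

This would go in etc/catalog/hive.properties on the Presto coordinator.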

I put together a Docker container with the above setup; it seems to work well for ad-hoc analysis of local files, though I've not used it extensively yet, so YMMV.

docker run -it --mount source=/<data-dir>/,destination=/parquet,type=bind floatingwindow/presto-local-parquet

The above will start Presto and the necessary bits to read local files, then give you a Presto shell.

More info: https://github.com/floating-window/presto-local-parquet

At the moment you still need to define the schemas manually. I'm partway through writing a script to scan a set of Parquet files and automagically set up the corresponding schema & file mappings in Presto - I'll update the container above when it's finished.
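The schema-generation idea above can be sketched roughly as follows: read each Parquet file's schema (e.g. via pyarrow.parquet.read_schema) and emit a matching CREATE TABLE statement for the Hive connector. This is an untested illustration, not the author's actual script; the Arrow-to-Presto type mapping is partial, and the table and location names are made-up examples:

```python
# Sketch: turn a Parquet schema into a Presto CREATE TABLE statement.
# In practice the (name, type) pairs would come from
# pyarrow.parquet.read_schema(path); here they are passed in directly.

# Partial mapping from Arrow type names to Presto types (an assumption;
# extend as needed for decimals, dates, nested types, etc.)
ARROW_TO_PRESTO = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE",
    "bool": "BOOLEAN",
    "string": "VARCHAR",
}

def create_table_ddl(table, location, columns):
    """Build DDL for an external Parquet table.

    columns: list of (column_name, arrow_type_name) pairs.
    Unknown types fall back to VARCHAR.
    """
    cols = ",\n  ".join(
        f"{name} {ARROW_TO_PRESTO.get(arrow_type, 'VARCHAR')}"
        for name, arrow_type in columns
    )
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"WITH (format = 'PARQUET', external_location = '{location}')"
    )

print(create_table_ddl(
    "hive.default.events",            # hypothetical catalog.schema.table
    "s3a://data/events/",             # hypothetical S3 location
    [("id", "int64"), ("name", "string"), ("score", "double")],
))
```

Running this prints a CREATE TABLE statement you could paste into the Presto CLI against the hive catalog.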
