Please improve the Parquet connector to not require Hive, and ideally also not require HDFS
Working directly with Parquet files would make Presto much easier for data scientists to adopt.
@mariuss Marius, is this something you'd like to work on?
I don't have the skills :(
We're looking for a way to query large CSV or Parquet files, and Presto seems like a great solution. However, it's currently too complex to get started with Parquet because of all the Hadoop/Hive dependencies and setup steps, compared to being able to connect to Parquet files directly:
https://scientific-software.netlify.com/howto/how-to-query-big-csv-files
Thanks, Maria!
@mariusa You don't need Hadoop/HDFS. All you need is S3 and Hive (Hive can run without HDFS, NameNode, etc.)
Even the standalone Hive metastore needs Hadoop-specific environment variables/jars. Maybe implement the Presto Thrift connector to communicate with S3? I'll give it a go :-)
I would be very interested in this connector too. I believe (though obviously I haven't tried it) that querying large data from Parquet files directly would be significantly more efficient than requiring the data to pass between Hive and Presto over Thrift.
@jeromeof the data does not come from Hive; only the metadata does.
I haven't tried it, but I've heard there is an experimental setting (hive.metastore=file) that could support reading from Parquet directly - https://github.com/prestodb/presto/tree/master/presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file
You can indeed [ab]use hive.metastore=file to avoid setting up a real Hive metastore. You can then run MinIO to serve your local files over "S3".
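For reference, a minimal etc/catalog/hive.properties along those lines might look like this; the catalog directory and the MinIO endpoint/credentials are placeholders you'd swap for your own setup:

# Hive connector backed by the experimental file-based metastore,
# so no real Hive metastore service is needed
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=file:///tmp/hive-catalog

# point "S3" at a local MinIO instance serving your files
hive.s3.endpoint=http://127.0.0.1:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false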
I put together a Docker container with the above setup; it seems to work well for ad-hoc analysis of local files, though I haven't used it extensively yet, so YMMV.
docker run -it --mount source=/<data-dir>/,destination=/parquet,type=bind floatingwindow/presto-local-parquet
The above will start Presto and the necessary bits to read local files, then give you a Presto shell.
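Once the shell is up, a quick sanity check might look like this (the schema and table names are purely illustrative; they depend on what you've defined):

presto> SHOW CATALOGS;
presto> SHOW SCHEMAS FROM hive;
presto> SELECT * FROM hive.parquet.events LIMIT 10;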
More info: https://github.com/floating-window/presto-local-parquet
At the moment you still need to define the schemas manually (see the sketch below). I'm partway through writing a script that scans a set of Parquet files and automagically sets up the corresponding schema and file mappings in Presto - I'll update the container above once it's finished.
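In the meantime, defining a table by hand against the file metastore looks roughly like this; the schema name, table name, columns, and path are examples, not anything the container pre-creates:

-- point a schema at the mounted directory from the docker run command
CREATE SCHEMA hive.parquet WITH (location = 'file:///parquet');

-- map a directory of Parquet files to a table;
-- the column definitions must match the file schema
CREATE TABLE hive.parquet.events (
    user_id BIGINT,
    event_type VARCHAR,
    ts TIMESTAMP
)
WITH (
    external_location = 'file:///parquet/events',
    format = 'PARQUET'
);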