Please improve the Parquet connector to not require Hive, and ideally also not require HDFS
Working directly with Parquet files would make Presto much easier for data scientists to adopt.
@mariuss Marius, is this something you'd like to work on?
I don't have the skills :(
We're looking for a way to query large CSV or Parquet files, and Presto seems like a great solution. However, it's currently too complex to get started with Parquet because of all the Hadoop/Hive dependencies and setup steps, compared to being able to connect to Parquet files directly:
https://scientific-software.netlify.com/howto/how-to-query-big-csv-files
Thanks, Maria!
@mariusa You don't need Hadoop/HDFS. All you need is S3 and Hive (Hive can run without HDFS, NameNode, etc.)
Even the standalone Hive metastore needs Hadoop-specific environment variables/jars. Maybe implement the Presto Thrift connector to communicate with S3? I'll give it a go :-)
I would be very interested in this connector too. I believe (though obviously I haven't tried it) that querying large data from Parquet files directly would be significantly more efficient than requiring the data to pass between Hive and Presto over Thrift.
@jeromeof the data does not come from Hive; only the metadata does.
I haven't tried it, but I've heard there is an experimental setting (hive.metastore=file) that could support reading from Parquet directly - https://github.com/prestodb/presto/tree/master/presto-hive-metastore/src/main/java/com/facebook/presto/hive/metastore/file
You can indeed [ab]use hive.metastore=file to avoid setting up a real Hive metastore. You can then run MinIO to serve your local files over "S3".
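For reference, a minimal etc/catalog/hive.properties along those lines might look like this; the catalog directory and the MinIO endpoint/credentials are placeholders you'd swap for your own setup:

# Hive connector backed by the experimental file-based metastore,
# so no real Hive metastore service is needed
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=file:///tmp/hive-catalog

# point "S3" at a local MinIO instance serving your files
hive.s3.endpoint=http://127.0.0.1:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false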
I put together a Docker container with the above setup; it seems to work well for ad-hoc analysis of local files, though I haven't used it extensively yet, so YMMV.
docker run -it --mount source=/<data-dir>/,destination=/parquet,type=bind floatingwindow/presto-local-parquet
The above will start Presto and the necessary bits to read local files, then give you a Presto shell.
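Once the shell is up, a quick sanity check might look like this (the schema and table names are purely illustrative; they depend on what you've defined):

presto> SHOW CATALOGS;
presto> SHOW SCHEMAS FROM hive;
presto> SELECT * FROM hive.parquet.events LIMIT 10;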
More info: https://github.com/floating-window/presto-local-parquet
At the moment you still need to define the schemas manually (see the sketch below). I'm partway through writing a script that scans a set of Parquet files and automagically sets up the corresponding schema and file mappings in Presto - I'll update the container above once it's finished.
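In the meantime, defining a table by hand against the file metastore looks roughly like this; the schema name, table name, columns, and path are examples, not anything the container pre-creates:

-- point a schema at the mounted directory from the docker run command
CREATE SCHEMA hive.parquet WITH (location = 'file:///parquet');

-- map a directory of Parquet files to a table;
-- the column definitions must match the file schema
CREATE TABLE hive.parquet.events (
    user_id BIGINT,
    event_type VARCHAR,
    ts TIMESTAMP
)
WITH (
    external_location = 'file:///parquet/events',
    format = 'PARQUET'
);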