From my understanding, the hive-connector uses the Presto runtime to interact with ORC, Parquet, and Avro files on file systems such as S3-compatible object storage and distributed file systems. It can use the hive-metastore to hold metadata about the tables it is querying, or this can even be replaced with other metastore services such as AWS Glue.
Can we consider renaming the hive-connector to a more generic name?
A few suggestions may be:
I'm not sure what other use cases this connector covers beyond what I mentioned above, so the community can suggest more names if these are lacking. I do think we should aim for something more generic and clear, though. This is a common point of confusion, and I often find myself explaining that the Presto hive-connector isn't using the Hive runtime.
The Hive connector uses the Hive metadata model. Using a more generic name like “file system connector” would be worse, since it’s not generic but fundamentally relies on Hive metadata.
Isn't it possible to replace that with a metastore such as AWS Glue, making it a "hive-free" system?
Glue Data Catalog is an implementation of the Hive metastore. A user of the Hive connector should not see any differences in behavior, and still needs to understand Hive metadata concepts such as partitioning, bucketing, data types, etc.
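To illustrate the point, here is a rough sketch of two catalog properties files, assuming the property names documented for the Hive connector (`connector.name=hive-hadoop2`, `hive.metastore.uri`, `hive.metastore=glue`, `hive.metastore.glue.region`). The file names, host, and region are placeholders, not values from this thread.

```properties
# etc/catalog/hive.properties (hypothetical file name)
# Catalog backed by a Thrift Hive metastore service.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083

# etc/catalog/lake.properties (hypothetical file name)
# Same connector, but with AWS Glue Data Catalog as the metastore.
connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=us-east-1
```

In both cases the connector is the same and only the metastore settings differ, so tables in either catalog still follow the Hive metadata model.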
Okay, then it makes sense to avoid going too generic, and this is definitely an odd connector to name. But leaving the name as hive-connector still feels like a bit of a misnomer.
Most connectors interact with some runtime belonging to the system that stores the data. As you point out, this is partially the case with the Hive connector, since we use the Hive metastore, but most of the processing logic lives in Presto. For that reason, it seems odd to me that we continue calling it the Hive connector.
I understand the confusion, but I think that's mainly because it is inherently complicated. You could think of the Hive connector as the "data lake" connector, but what's a data lake and what are the implications? You still need to explain the Hive metadata model and metastore requirements. Renaming also has a high cost and creates its own confusion.
Yeah, I absolutely agree that this naming is inherently complicated. I also agree with your points about keeping the name Hive connector. The frog book does a great job of covering those complexities and making its uses clear.
I was hoping to get some more overview documentation about the connectors and their usage in catalogs going in the near future. That could include some guidance on which connector to look at for object storage, or whatever we call it.
In addition, @electrum and I are planning to break up the Hive connector docs, since they have grown too much. That will have to include a good overview of use cases for the connector, as an expanded ToC/guidance as well. Maybe that helps enough and we can avoid the complexities of renaming, while at the same time fixing the user confusion around when to use this connector.
I'm going to close this issue since I don't think we will come to an agreement on a better name and renaming it would cause confusion. Thanks for the discussion.
I wrote a blog post to help clarify this confusion.
https://prestosql.io/blog/2020/10/20/intro-to-hive-connector.html