Add $location column in $partitions Hive system table,
or some other way to discover partition locations of a table.
The current $partitions table only needs to list the partition names. Getting the location requires fetching the full metadata, which is much more expensive. We also don't currently have a way to know which columns are requested for a system table (so we couldn't fetch it conditional based on the column being selected).
I'm thinking we should have a separate table that includes all the metadata for each partition, like $partitions_metadata. We could structure it the same as a $metadata table for each table that we've discussed. This would contain the storage descriptor, parameters, etc.
The current $partitions table only needs to list the partition names. Getting the location requires fetching the full metadata, which is much more expensive.
This is why I stopped implementing and created this issue instead.
We also don't currently have a way to know which columns are requested for a system table
Yep, this could be considered a prerequisite.
I'm thinking we should have a separate table that includes all the metadata for each partition, like $partitions_metadata
This can be a replacement or a complementary feature.
With hidden column ($location) and hidden tables (..$partitions_metadata) there is always a problem of feature-discovery by the users.
Since users are already sold on the idea of ..$partitions table, a new hidden column is arguably easier to be used.
Of course, this is under an assumption that location is the most interesting attribute of a partition that we currently don't show.
This feature enables a use case for Presto in ETL pipelines. Specifically, we can use it to enable partition swapping, along with data retention policies. When we partition swap, we don't want to immediately delete the old data, we want to leave it around to allow other jobs to finish processing.
Right now, this involves us having custom code to make a round trip to the metastore and get the location. While it works, this limits us in some ways. We use Python as the wrapper around our queries, and Python metastore support (via thriftpy) seems to be rather stagnant. Having a straightforward mechanism that relies on just running a SQL query through an engine that returns quickly (unlike hive itself) would allow us to drop usage of stale code.
I really have to agree with @pedersen here. I am trying to do partition swapping and it would be very helpful if I could just use presto to get the answer of what the location is of the partition is so I know where to swap it too.
One cheat is create a second partition that can use the location string from the first partition as the partition name in the second.
Example:
CALL system.register_partition('schema', 'table', ARRAY['real_partition', 'location_partition'], ARRAY['partition_data', 'location'], 'location');
Most helpful comment
This feature enables a use case for Presto in ETL pipelines. Specifically, we can use it to enable partition swapping, along with data retention policies. When we partition swap, we don't want to immediately delete the old data, we want to leave it around to allow other jobs to finish processing.
Right now, this involves us having custom code to make a round trip to the metastore and get the location. While it works, this limits us in some ways. We use Python as the wrapper around our queries, and Python metastore support (via thriftpy) seems to be rather stagnant. Having a straightforward mechanism that relies on just running a SQL query through an engine that returns quickly (unlike hive itself) would allow us to drop usage of stale code.