Presto: Add $location column in $partitions Hive system table

Created on 9 Sep 2019 · 6Comments · Source: prestosql/presto

Add $location column in $partitions Hive system table,
or some other way to discover partition locations of a table.

enhancement

Source

findepi

Most helpful comment

This feature enables a use case for Presto in ETL pipelines. Specifically, we can use it to enable partition swapping, along with data retention policies. When we partition swap, we don't want to immediately delete the old data, we want to leave it around to allow other jobs to finish processing.

Right now, this involves us having custom code to make a round trip to the metastore and get the location. While it works, this limits us in some ways. We use Python as the wrapper around our queries, and Python metastore support (via thriftpy) seems to be rather stagnant. Having a straightforward mechanism that relies on just running a SQL query through an engine that returns quickly (unlike hive itself) would allow us to drop usage of stale code.

pedersen on 7 Feb 2020

👍4

All 6 comments

The current $partitions table only needs to list the partition names. Getting the location requires fetching the full metadata, which is much more expensive. We also don't currently have a way to know which columns are requested for a system table (so we couldn't fetch it conditional based on the column being selected).

I'm thinking we should have a separate table that includes all the metadata for each partition, like $partitions_metadata. We could structure it the same as a $metadata table for each table that we've discussed. This would contain the storage descriptor, parameters, etc.

electrum on 10 Sep 2019

The current $partitions table only needs to list the partition names. Getting the location requires fetching the full metadata, which is much more expensive.

This is why I stopped implementing and created this issue instead.

We also don't currently have a way to know which columns are requested for a system table

Yep, this could be considered a prerequisite.

I'm thinking we should have a separate table that includes all the metadata for each partition, like $partitions_metadata

This can be a replacement or a complementary feature.

With hidden column ($location) and hidden tables (..$partitions_metadata) there is always a problem of feature-discovery by the users.
Since users are already sold on the idea of ..$partitions table, a new hidden column is arguably easier to be used.
Of course, this is under an assumption that location is the most interesting attribute of a partition that we currently don't show.

findepi on 10 Sep 2019

https://github.com/prestosql/presto/pull/1282

tooptoop4 on 10 Dec 2019

pedersen on 7 Feb 2020

👍4

I really have to agree with @pedersen here. I am trying to do partition swapping and it would be very helpful if I could just use presto to get the answer of what the location is of the partition is so I know where to swap it too.

jim12321 on 24 Sep 2020

👍1

One cheat is create a second partition that can use the location string from the first partition as the partition name in the second.

Example:
CALL system.register_partition('schema', 'table', ARRAY['real_partition', 'location_partition'], ARRAY['partition_data', 'location'], 'location');

jim12321 on 24 Sep 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Add Presto Druid connector

findepi · 4Comments

Prestsql + Hudi MOR Tables

anismiles · 3Comments

Allow password authentication when using HTTP

bill-warshaw · 4Comments

'Equi criteria are empty, so INNER join should not have PARTITIONED distribution type' error

tooptoop4 · 4Comments

show create table is not checked with accessControl

dpolonsky · 4Comments