Presto: Hive tables' Migration to Iceberg tables

Created on 14 Jul 2020  路  5Comments  路  Source: prestosql/presto

Problem Statement

Currently we're unable to migrate some existing Hive tables to Iceberg in a user-unaware way. One catalog is configured with one connector. Suppose we have catalog hive configured with Hive Connector and iceberg configured with Iceberg Connector. In order to query a table u_xinlin.orders with Iceberg Connector, we have to specify as iceberg.u_xinlin.orders.

More specifically, in our production environment, the default catalog is configured as hive associated with a Hive Connector. Consider a user query select * from u_xinlin.orders o join u_xinlin.customer c on o.custkey = c.custkey. Suppose table u_xinlin.orders has been migrated to Iceberg but u_xinlin.customer has not. We hope that Presto can dynamically resolve u_xinlin.orders with Iceberg Connector and u_xinlin.customer with Hive Connector.

Proposed Solution: Catalog Redirection

Add such a method to ConnectorMetadata SPI:

default Optional<String> redirectToCatalog(ConnectorSession session,  SchemaTableName tableName)
{
    return Optional.empty();
}

Add a config property in Hive Connector:

hive.redirect.iceberg.catalog=iceberg

Implement the redirection logic in HiveMetadata: redirect to the corresponding Iceberg catalog when hive.redirect.iceberg.catalog is set and the underlying table is an Iceberg table.

For relevant methods in MetadataManager, call redirectToCatalog to redirect.

Alternative Solution: Connector Federation

Change the current one-catalog-to-one-connector mapping to one-catalog-to-multiple-connectors mapping. One hive catalog can be associated with both a Hive Connector and an Iceberg Connector. A logic component will need to be introduced to dynamically resolve connectors given a catalog name and a table name.


@electrum Can you give pointers on what you think of this? cc: @rdsr

Part of #1324

Most helpful comment

For another project, we have a requirement that may be related to this. We need to be able to redirect tables from one connector to another for the purpose of caching. This may involve tables that have only a subset of columns or rows. That could be easily done if the redirection happened during optimization instead. For a given TableHandle that might include predicates or column projections, the connector may decide the redirect the table to another table in another catalog.

There may be some subtle differences in how the permission checks need to be performed for each case, though.

All 5 comments

I'm going to call this a synonym to make the discussion easier since that's what they're called in other systems.

There are various issues to consider here:

  • From an API perspective, we'd probably want to support full name redirection, so it would return CatalogSchemaTableName rather than just catalog name.

  • When are security checks performed? Today, we call checkCanSelectFromColumns() in Analyzer on the result of the Analysis from StatementAnalyzer. Do we want to perform security checks on the synonym, or just on the final target? Checks are performed for views as well as the underlying references.

  • Consider that a synonym can point at another synonym. Any checks would need to happen for each one. We'd also need a max depth with loop detection.

  • Is the original synonym or chain available anywhere? For example, QueryMetadata contains a list of TableInfo to record the referenced tables. This information doesn't seem to be available today for views, so perhaps this is not a concern.

  • How transparent is this expected to be for users? It seems unavoidable that the target table name will be visible in many places such as error messages.

  • There are many places that ask for table handles without going through the analyzer. In particular, everything implementing DataDefinitionTask. Doing the synonym resolution in Metadata could make this easier, but precludes things like security checks on the synonyms.

  • The semantics of DDL operations are not clear. What does it mean to drop or rename a table using a synonym?

  • How do the various ConnectorMetadata APIs such as listTables() and listTableColumns() work? Perhaps the rule is that synonyms would appear in the table list but nowhere else.

  • How do these tables appear in information_schema.columns? We currently use listTableColumns() which is designed to allow bulk fetch and return. If we need to resolve synonyms for metadata, and it seems like we do, then this API would be replaced with (slower) per-table calls.

A possible alternative that avoids many of these issues is to add search path support for table resolution. If we used a schema search path like PostgreSQL, then you could make your search path hive.u_xinlin, iceberg.u_xinlin, allowing references to customer and orders to resolve to the correct catalog.

If we need to support references like u_xinlin.orders, then we might need a catalog search path, which could be workable, but we'd need to think about the implications, especially when combined with a schema search path. Or perhaps the schema search path would work with this and match schema-qualified names with same-named schemas in the search path.

Implementing this would require changing all the places that call createQualifiedObjectName() to combine that with table name resolution, rather than having it be separate steps.

@electrum Thanks! Let me organize the discussions as follows.

Approach 1: Synonym Resolution (Full Name Redirection)

Connector SPI Change

Add such a synonym resolution method to the ConnectorMetadata SPI

default Optional<CatalogSchemaTableName> resolveToCanonical(ConnectorSession session, SchemaTableName tableName)
{
    return Optional.empty();
}

Hive Connector Implementation

Add a config property

hive.iceberg.redirection=iceberg

resolveToCanonical: Redirects from hive.u_xinlin.orders to iceberg.u_xinlin.orders when hive.iceberg.redirection is set and the underlying table is in Iceberg

Core Engine Behaviors

DML and DDL: In the context of Hive-to-Iceberg redirection, I think it makes sense to perform both DML and DDL on the eventual canonical table. In other words, reading, writing, dropping, renaming, and altering will happen in Iceberg Connector when it is an Iceberg table, and Hive Connector when it is not. In this use case, DML and DDL on the canonical table implicitly takes effect on the synonyms. However, it's unclear whether this behavior works for other use cases in the future.

Security Checking: I think we can do it for each synonym in the resolution chain.

Catalog-wise Operations: thinking in the case of Hive-Iceberg, I think we can keep listTables and listTableColumns as they are now in HiveMetadata. In MetadataManager, we can keep listTables as the same while doing synonym resolution in listTableColumns. It's true that we'll have to introduce per-table calls. But it should be fine with Hive does per-table calls itself when computing table columns. We can avoid doing synonym resolution for other connectors by introducing a boolean method in ConnectorMetadata that returns whether a connector supports synonym resolution or not.

Core Engine Implementation

In order to perform necessary security checks as you suggested, I think we can do the synonym resolution in StatementAnalyzer and other classes that implement DataDefinitionTask, rather than in the MetadataManager.

A Comment

After walking through all of these, I feel although full name redirection applies to more use cases from an API perspective, it doesn't really help in the Hive-Iceberg case. Would it be better to keep the original catalog redirection API to minimize scope and improve understandability?

Approach 2: Search Path Support for Table Resolution

This alternative approach sounds reasonable. And it seems that we should introduce a method that checks whether a table exists in ConnectorMetadata?


@electrum Overall speaking, which approach do you think is more appropriate? I can draft a proof-of-concept PR and we can discuss on that.

For another project, we have a requirement that may be related to this. We need to be able to redirect tables from one connector to another for the purpose of caching. This may involve tables that have only a subset of columns or rows. That could be easily done if the redirection happened during optimization instead. For a given TableHandle that might include predicates or column projections, the connector may decide the redirect the table to another table in another catalog.

There may be some subtle differences in how the permission checks need to be performed for each case, though.

@martint I think that's an entirely different scenario that we shouldn't try to combine here. The goal of this is to make the physical catalog largely invisible to end users. We want to allow DDL operations like renaming or dropping tables. These operations, especially rename, don't really make sense when the target is not 1:1 with the source.

Was this page helpful?
0 / 5 - 0 ratings