Presto: Row-level filtering and masking

Created on 10 Sep 2019 · 6Comments · Source: prestosql/presto

Presto should support row-level filtering and data masking for more granular access controls to data.

The general approach would be for SystemAccessControl/AccessControl to provide filters/projections on the base table given an authorization identifier (user/role) that the query planner injects into the plan.

Some questions that need to be considered:

Are the filters/masks provided as an expression or a reference to a function (e.g., a table function)?
If an expression, is it provided as SQL text or IR?
In the case of SQL text,
- What's the "environment" under which expressions get evaluated (path, catalog, schema, authorization identifier)?
- What happens if the SQL has free variables? Do they get bound to the enclosing scope if possible? Or is it treated as an error? (this would require some complex changes to the analyzer)
- What happens if the SQL has semantic errors (references to missing columns, type mismatch, missing functions, etc)? It might be hard/impossible to report a user-friendly error.

There are some potential benefits of modeling the filter/masks as references to (polymorphic) table functions:

The invocation semantics of functions in SQL is well-defined, especially around how the session and authorization contexts behave
It can potentially support opaque implementations (where the engine is not exposed to the filtering/masking logic) or transparent implementations by leveraging SQL-defined functions and related machinery.

Relates to https://github.com/prestosql/presto/issues/18

roadmap

Source

martint

👍2

Most helpful comment

It seems like the most user-friendly way to provide filters and masks is to allow connectors and access control plugins to describe them in terms of textual SQL expressions. This can be rationalized as an analog to dynamic views: the table and the filter/mask expressions form a dynamic view that's then parsed, analyzed and inlined into the query plan.

Clearly, the expressions must fit within some constraints:

The return type of a filter has to be boolean (or unknown, which would happen if the filter only ever returns null).
The return type of masks has to match the type of the column being masked to avoid breaking queries in hard-to-diagnose ways due to type mismatches further up the query tree.
All expressions need to be scalars (no aggregations or window functions) and no special functions such as grouping() can be involved.

Expressions can contain arbitrary subqueries (as part of an IN, EXISTS or quantified operator clause) and function references. This introduces a number of issues we need to resolve.

What authorization identifier is used to access tables and functions referenced by the filter?

Evaluating the filter and masks as the user who submitted the query is simple, but severely cripples the capabilities of this feature. Just like with views, being able to execute the filter with the permissions of a higher privilege user allows for more advanced filtering scenarios involving lookups into tables that might have restricted access. If we conceptualize row filters as just another form of dynamic views, where the view is returned "in parts" (the FROM clause and the WHERE clause separately), it follows that having a similar authorization model as views makes sense. In this case, the row filter would be returned by the plugin or connector together with an authorization identifier to be used to perform access checks on the tables and functions referenced by the filter.

This may seem pointless for a connector access control: if the connector is the one in charge of returning the filter for a table given a user identity, wouldn't that suffice to "trust" the filter and so any further permission checks on the filter should be skipped? Once we consider multiple connectors or global access control, it's clear that we need to perform those checks. Otherwise, a poorly written connector could result in a leak of information from another connector.

Should filters and masks be applied to tables/columns referenced by a filter or mask?

Assuming filters and masks are associated with an authorization identifier that will be for access control checks against any table referenced by them, should filters and masks be applied recursively to those tables? That would make sense if we want to have consistency, otherwise we'd next to explain why some access checks involve filters and masks and others don't.

However, applying filters and masks recursively introduces another potential problem: how far down the rabbit hole do we go? The solution to this may be to continue to resolve and apply row filters and masks as long as the given table and identity haven't been considered.

If both a connector and global access control provide masks for a column, what should the behavior be?

Some obvious choices are:

Require that only one of them provide masks and fail the query otherwis
Apply them all. In this case, in what order should they be applied?

martint on 5 Feb 2020

👍4

All 6 comments

Presto should support row-level filtering and data masking for more granular access controls to data.

The general approach would be for SystemAccessControl/AccessControl to provide filters/projections on the base table given an authorization identifier (user/role) that the query planner injects into the plan.

Some questions that need to be considered:
* Are the filters/masks provided as an expression or a reference to a function (e.g., a table function)?

* If an expression, is it provided as SQL text or IR?

* In the case of SQL text,

  * What's the "environment" under which expressions get evaluated (path, catalog, schema, authorization identifier)?
  * What happens if the SQL has free variables? Do they get bound to the enclosing scope if possible? Or is it treated as an error? (this would require some complex changes to the analyzer)
  * What happens if the SQL has semantic errors (references to missing columns, type mismatch, missing functions, etc)? It might be hard/impossible to report a user-friendly error.
There are some potential benefits of modeling the filter/masks as references to (polymorphic) table functions:
* The invocation semantics of functions in SQL is well-defined, especially around how the session and authorization contexts behave

* It can potentially support opaque implementations (where the engine is not exposed to the filtering/masking logic) or transparent implementations by leveraging SQL-defined functions and related machinery.
Relates to #18

@martint Could we just leverage create VIEW to expose a subset of table (rows or cols) to user ? then, we just need to grant SELECT privilege on VIEW to user - using connector level access control.

51yu on 10 Sep 2019

@l0yu, that's something that Presto supports today, but it requires someone to create those views and for users to know which views they are supposed to query. The feature we're discussing here is more powerful, as it would allow plugins and connectors to dynamically enforce filtering/masking policies based on the identity of the user without having to pollute the object namespace.

martint on 11 Sep 2019

@l0yu, that's something that Presto supports today, but it requires someone to create those views and for users to know which views they are supposed to query. The feature we're discussing here is more powerful, as it would allow plugins and connectors to dynamically enforce filtering/masking policies based on the identity of the user without having to pollute the object namespace.

OK Cool !

51yu on 11 Sep 2019

Clearly, the expressions must fit within some constraints:

The return type of a filter has to be boolean (or unknown, which would happen if the filter only ever returns null).
The return type of masks has to match the type of the column being masked to avoid breaking queries in hard-to-diagnose ways due to type mismatches further up the query tree.
All expressions need to be scalars (no aggregations or window functions) and no special functions such as grouping() can be involved.

Expressions can contain arbitrary subqueries (as part of an IN, EXISTS or quantified operator clause) and function references. This introduces a number of issues we need to resolve.

What authorization identifier is used to access tables and functions referenced by the filter?

Should filters and masks be applied to tables/columns referenced by a filter or mask?

If both a connector and global access control provide masks for a column, what should the behavior be?

Some obvious choices are:

Require that only one of them provide masks and fail the query otherwis
Apply them all. In this case, in what order should they be applied?

martint on 5 Feb 2020

👍4

What authorization identifier is used to access tables and functions referenced by the filter?

I think this decision should be left to a plugin, so connectors and access control plugins might return that identity along with the filter or mask.

I guess that in most cases it will be the same identity which is running the actual query. This is something that external service that stores authorization rules has to support in a first place. The ones that I most familiar with do not support that anyway. Because of that, we could also defer the support for using different authorization identifier to evaluate filter and mask until the moment we find the use case, otherwise it is going to be a "dead" code anyway.

wouldn't that suffice to "trust" the filter and so any further permission checks on the filter should be skipped?

I don't think so. We should still verify access. Even if plugin return a "trusted" identity that is going to evaluate filter and mask we still would need to make sure what access given "trusted" identity has.

Should filters and masks be applied to tables/columns referenced by a filter or mask?

I think so. I would expect to apply filter and then mask.

how far down the rabbit hole do we go?

I think we should go to the bottom with the caveat that planner would need to make sure that we are not within an endless loop.

If both a connector and global access control provide masks for a column, what should the behavior be?

I vote for Require that only one of them provide masks and fail the query otherwise. Notice that mask applies some semantic over a column (column is going to be a hash). If we consider the case of applying two masking functions then most like their semantics are conflicting (like one wants to generate a hash and other null). Choosing any of them might be misleading to the user. I don't expect this to be a common problem, so we could defer this until the moment we find actual valid use case for that. So failing for now sounds like a best choice.

kokosing on 5 Feb 2020

If both a connector and global access control provide masks for a column, what should the behavior be?

Since the masks have different "masking strengths" (eg. asterisk for everything vs asterisk for everything except last 4 characters), in principle it is not safe to apply _some_ mask.

Now consider a case of connector configured to apply a mask on credit_card_number column.
If we now want to apply same security measure system-wide with a new system access control configuration, we will easily end-up having two masks for one column.
It is completely reasonable usage scenario and we should not fail in such a case. (If we failed, how would users resolve this problem?)

The masks should be applied both, in order: connector mask, then system-level mask.