Policy tags are a BigQuery feature that allows users to set column level access control (docs). After a tag has been created, it can be added programatically via the BigQuery python library to columns as needed.
This proposed feature would allow users to add (already created) policy tags to their existing dbt models through their schema.yml files.
Considerations:
google-cloud-bigquery in version 1.25.0 (latest version). Dbt currently is using version 1.24.0 so this will need to be bumped upcolumns is set to true for persist_docs.meta field. Alternatively a new BigQuery specific schema field could be usedupdate_column_description adapter function that is. called during persist_docs. The advantage here is that we're already iterating over the columns so we wouldn't have to do it again and the function name can be changed to something more generic eg. update_columnFrom my understanding, this is something that currently can only be supported for models in the schema.yml files as I don't think docs can be updated for sources but that would be an ideal situation to support this as well
This feature would be specific to BigQuery.
BigQuery users who currently use (or want to start using) policy tags and manage them through their dbt setup.
Thanks for the detailed writeup, @azhard! Since dbt adds labels and descriptions to tables and views when initially creating them, it makes sense enough to me that dbt could add these policy tags as well.
Is this a feature you'd be interested in contributing?
On the dbt side, we could either take each piece of information and construct the policy tag "name" for the user or have them input the fully qualified name themselves.
I'm inclined to require users to fully qualify tag names, unless you have a strong feeling for us systematizing these in a highly opinionated way.
--full-refresh run.run-operation, both to persist column-level descriptions and apply column-level access policies.Yeah more than happy to work on this @jtcohen6 ! With respect to the tag names, I think having the user fully qualify them makes sense as well so will incorporate that into the PR.
What're your thoughts around the meta field that I mentioned? Do you think it makes sense to have users define the tags in meta or have a new field specific to bigquery schema files?
Also re: the related issues. Both make sense to me and would love to get this working with the future persist_source macro!
@jtcohen6 ended up sticking the policy tags in the tags field that already exists in my PR. Happy to change that as needed
I'm supportive of that. tags is a field that we understand to have some functional purpose. As I see it, meta really _just_ exists to persist an arbitrary dict of information into manifest.json and the documentation site. What we may want to do is differentiate between tags that are dbt-only and tags that should be persisted to BigQuery as access policy tags. @drewbanin let me know if you disagree.
Quick turnaround on the PR! I'll take a look a bit later today
hmmm, I think we may not want to repurpose tags as policyTags! It's a good and sensible idea, but I think we'll be better off long-term by making them totally separate. tags is a dbt-wide config, and I think that our future selves will be grateful if we implement BigQuery policy tags with a BigQuery-specific config. Maybe we can call it something like policy_tags instead?
That might be a little bit tricky to implement today as I don't think we have any notion of database-specific schema.yml configs. I truthfully am not sure of the best way to implement something like this, but I did want to raise this point sooner rather than later.
What do y'all think?
Yeah I think that makes sense actually, don't want to override any dbt specific functionality that leverages the tags feature in the future... With the current lack of database-specific schema.yml files is this something that will have to be put on hold or is there another way you can think of to implement this (and maybe the database-specific .ymls too)?
do we specifically want this config to live in the schema.yml file? If we instead allow it to be a model-level config (configured in the models: dict in dbt_project.yml or in a {{ config() }} call), then the path to implementation should be pretty straightforward! The tricky thing is of course that it's a column-level config. I could imagine an API like:
{{ config(
policyTags: {
"col_1": ['tag1', 'tag2'],
"col_2": ['tag3', 'tag4']
}
) }}
I totally get that it's not ideal to write out the column list twice, but this is definitely one option!
If we _really_ really want to, we could add this config as a dbt Core-level config that's only used on BigQuery. That's definitely not my preference.... but in looking at the config above, I do think schema.yml is the right place for this config long-term
Aside from defining the columns twice, my main concern with the first option is that I don't believe that method can be leveraged for sources. Can't speak to other orgs but for our case I was hoping to add policy tags to both models and sources (using a method similar to the one described in #2540).
I'm also hesitant on adding this as a core-level config, don't think it makes sense to expose it to other adapters. Not sure where that would leave implementation for this feature for now though
Ah, ok, thanks for that context @azhard! I gave this some more thought, and I'm increasingly happy with the idea of adding a BQ specific config in the columns section of dbt_project.yml. The big caveat I will add is that at some point in the future, we might decide we need to change up the interface to support per-adapter configs more generally. I can't say with any certainty what that might look like, but I did want to call it out!
I'll take a look at https://github.com/fishtown-analytics/dbt/pull/2589 and follow up over there with a review!