Add column descriptions to the BigQuery DDL statement creating a table in BigQuery from the model column definition in the relevant .yml file. Basically addition to https://docs.getdbt.com/docs/bigquery-configs#section-persisting-model-descriptions to add the column descriptions as well.
Though one challenge I see here is that dbt might need to derive the type for each column in the result table, not sure if that's being checked right now.
Thanks @shivam-darmora
So I was curious what it would take to do this and found it was beyond my ability to contribute, but I did some spelunking that I'd like to document for anybody picking this up.
The first thing to note is that BigQuery Data Definition Language statements do contain an option for the column schema which can contain a column list option for descriptions. "That means that CREATE SQL statements _could_ include column descriptions, but it would require us to change how the DDL statement is compiled for BigQuery," I said to myself, just before searching the codebase for 20 minutes and not finding any DDL statements.
It looks like dbt is using BigQuery API functionality instead of DDL statements to build tables in two ways:
This change is beyond my skill level because it might prompt a refactor. Should we use DDL statements across the board? Should we use the Table method across the board? I really don't know.
There are also other ways that dbt interacts with BQ tables beyond just creating new ones, and we would need to account for column descriptions in those as well. It looks like dbt's native Column class is sometimes converted to the google.cloud.bigquery.schema.SchemaField class for BigQuery API calls. The init function for this class _does_ have a parameter for description, but we're not currently calling it.
This SchemaField init function is called when:
To determine if descriptions could work with these methods, we need to answer a few questions:
Based on my spelunking, the answers are roadblocks:
In closing, adding column descriptions looks like a PITA but hopefully somebody reads this, says "actually that doesn't sound too bad", and actually goes for it 馃檹 馃檹 馃檹
Hey @mferryRV - thanks for this very thoughtful and well-researched comment! My ultimate takeaway is indeed "actually that doesn't sound too bad" :p
Couple of things to note:
The use of the table API is a vestige of BigQuery's historically poor support for date partitioned tables. Before column partitioning was released, dbt needed to create an ingestion-time partitioned table (using the table API) and then create partitions with a date decorator for each date specified in the model. We intend to remove this bit of functionality in a future release - column partitioning is unreservedly better for dbt's needs.
dbt _does_ execute DDL to create tables and views on BigQuery! This DDL looks something like:
create or replace table dbt_dbanin.debug_comments as
(
select
1 as id,
'drew' as name
);
Instead, dbt could build a DDL query that looks more like:
create or replace table dbt_dbanin.debug_comments (
id int64 options(description="abc 123"),
name string options(description="def 456")
)
options (
description = "my table description"
)
as
(
select
1 as id,
'drew' as name
);
This example will create a table with descriptions for both the specified columns and the table itself. The really challenging thing here is that _all of the columns in the table need to be specified in the table schema_. We could not, for instance, supply the schema for the id column but omit the name column in the example above.
This sort of leave us with two options:
schema.yml files for models which should contain column-level commentsschema.yml filesNeither of these options sound super pleasant IMO. The first option is really tractable for us to implement, but puts a pretty big burden on dbt users! There are some places where I think enumerating a full schema for a model is a good idea, but I hesitate to require something that verbose just to get column descriptions working.
The other broad approach is to create the table, then to iterate through the list of columns that contain descriptions in the schema.yml file. I believe we can fetch the table schema from BigQuery, modify it to include user-supplied comments, then push those descriptions back to BigQuery. Other databases support alter table comment on columns id to '....' syntax - this approach is somewhat analogous and would provide us a path forwards for implementing similar functionality on other databases.
We can totally add a comment or description field to the Column class, or we could implement it specifically for BigQueryColumns - either approach is appropriate here I think. Long term, I'd like to support this functionality on all databases where it's tractable.
Last, you can ignore Agate in this context - I don't think it's going to play much of a role in the implementation here.
Thanks! Super happy to follow up on this - let me know if there's anything else I can clarify, or if there's anything important I missed here :)
Thanks for digging into this, Drew!
My feeling is that creating the table and then iterating through the columns to add descriptions introduces the feature without making the functionality less intuitive. For example, I'd hate to see errors like:
Sorry, you need to specify column descriptions for all columns
or
Sorry, we guessed the column type wrong and now it doesn't work
Iterating through after the table already exists would make for more intuitive errors - the table still gets built and the secondary concern of column documentation is addressed later. It also feels like it might be faster than interpreting the column type for every column.
I think I'm with you here @mferryRV! Requiring users to specify _all_ of the columns in a table is a good feature in its own right, but it's definitely something that should be opted into #1570
I think we can implement this differently on different databases. BigQuery will probably let us update the column comments with a single API call, whereas we might need to dispatch one alter table ... comment on column .... is '...' query per column. Such is life!
Hi
Did someone make any progress on this topic? We are also looking into how to provide column level descriptions to our dbt models that we deploy in BigQuery.
thanks
Bob
Hi @bodschut - no progress to-date that I'm aware of. My current thinking is that we should:
I _think_ that would work, and it appears to be a best-of-both worlds approach as described above. I think that a proof-of-concept like this would entail ~80% of the work required to implement the eventual pull request, so if anyone here is keen to dig in, I'd be happy to support you however I can!
@drewbanin I am interested in getting this implemented for Snowflake as well so there is a chance in the next few weeks to months I might give it a stab.
It would be SUPER cool to have our Worksheets 4.0 (Numeracy) be able to have tight integration with column and table comments (or some higher-order construct of internal documentation).
Thanks @snowflakeseitz - I updated the title and labels on this issue. We should support this on every database we can. I really agree that tight editor integrations would be.... tight!
Snowflake makes this pretty tractable -- we can specify comments for multiple columns all at once. BigQuery is going to require usage of the rest API (as far as I understand), and Redshift is an MPP database.
We're currently operating with partial support for the persist_docs config across the core-supported database. I'd like to just take the time to round that feature out and support this everywhere we can! @snowflakeseitz if you're interested in sending through the Snowflake piece of this, we'd certainly love to merge that PR :)
Adding this to the next feature release regardless (codenamed Octavius Catto).
@drewbanin
This is how I have tackled it. I am creating the table via dbt and then as part of an airflow dag am getting the schema and running a script. Let me know how I can help getting this feature into dbt :)
@drewbanin I made some progress on the Snowflake connector side and got my docs to add to columns descriptions would love to set up some time to review before making any more effort
Thanks @snowflakeseitz! Going to check this PR out in more detail again today!
For everyone else - check out this issue which better tracks database-specific implementations for both relation-level and column-level docs persistence: https://github.com/fishtown-analytics/dbt/issues/1573
We'd like to get all of these in for the next release, 0.17.0, so if anyone is interested in picking up support for BQ/pg/Redshift, let us know in the relevant issue!
persist docs is shipping for all core plugins (at the relation and column level, where possible/applicable) in v0.17.0 馃帀
Thanks all for your help with making this happen!
Most helpful comment
So I was curious what it would take to do this and found it was beyond my ability to contribute, but I did some spelunking that I'd like to document for anybody picking this up.
The first thing to note is that BigQuery Data Definition Language statements do contain an option for the column schema which can contain a column list option for descriptions. "That means that
CREATESQL statements _could_ include column descriptions, but it would require us to change how the DDL statement is compiled for BigQuery," I said to myself, just before searching the codebase for 20 minutes and not finding any DDL statements.It looks like dbt is using BigQuery API functionality instead of DDL statements to build tables in two ways:
This change is beyond my skill level because it might prompt a refactor. Should we use DDL statements across the board? Should we use the Table method across the board? I really don't know.
There are also other ways that dbt interacts with BQ tables beyond just creating new ones, and we would need to account for column descriptions in those as well. It looks like dbt's native Column class is sometimes converted to the google.cloud.bigquery.schema.SchemaField class for BigQuery API calls. The init function for this class _does_ have a parameter for description, but we're not currently calling it.
This SchemaField init function is called when:
To determine if descriptions could work with these methods, we need to answer a few questions:
Based on my spelunking, the answers are roadblocks:
In closing, adding column descriptions looks like a PITA but hopefully somebody reads this, says "actually that doesn't sound too bad", and actually goes for it 馃檹 馃檹 馃檹