We should propagate docs to BQ table and column descriptions.
Cool that dbt generates a web page and docs for your project. Unfortunately, it's a bit removed from where our org is used to looking for documentation. We use BQ extensively and rely on the table and column descriptions provided in the BQ UI.
Anyone using BQ would benefit.
Since the BQ description field is just a giant text blob, I'd suggest putting the raw markdown description in there. I'd also love to see the raw (uncompiled) query appended after the description in the same field. Then, folks would have some documentation even if there were no description provided for the table.
Plus, devs tend to document their SQL in the .sql file itself. If we included both, we'd be getting decent documentation in the native BQ UI, with minimal documentation effort.
Got it, thanks for the report @jakebiesinger! The big challenge here is that we need to make this work for BigQuery, Redshift, Postgres, and Snowflake. Table and column comments are supported for all of these databases, so I don't view that as a problem, but it's a decent amount of work!
When do you see these descriptions being persisted? Is it when the models are built? Or via some other process, like dbt docs --persist?
It'd be awesome if this happened as part of the normal dbt run IMHO, so that documentation (always an afterthought) isn't left off in the normal case.
dbt docs --persist would be nice as a way to update them w/out rerunning the world though.
This feature would be very helpful.
It would also be nice to add additional comments like "made by DBT" with a link to the Git repo.
I'd be more than happy to contribute to this if we can decide on a solution 👍
Cool idea @darrenhaken. To summarize, there's a matrix of comments that dbt should support to the extent supported by the underlying databases:
| | schema/dataset | view | table | column |
|-----------|----------------|------|-------|--------|
| bigquery | | | | |
| snowflake | | | | |
| redshift | | | | |
| postgres | | | | |
Doing this correctly will require knowledge of and access to each of these databases. For the more SQL compatible ones (snowflake, redshift, postgres), we can issue an alter statement:
alter table schema.table add comment '...';
And for BigQuery, I think we can supply a value for the object description.
I'm not sure how best to surface schema-level descriptions, as these aren't currently supported in dbt's documentation. Let's leave that out for the first cut, but we should consider ways to support them in dbt in the near future.
@darrenhaken you mentioned using labels in dbt slack. Do you think those map to dbt's tags? I haven't used them before, and I'm curious what kind of information you'd be interested in putting into a label.
@drewbanin I think the dbt's tags could be used for labels. I wanted to have labels like:
created_by: dbt,
github_project_url: https://github.com/project-id`,
Re BigQuery - I would propose focusing on object description as a first iteration of expanding out the matrix.
cool, i agree, let's start with bigquery table/view descriptions. I think those can be embedded into the create or replace table/view statements that dbt uses, right?
In practice, that looks something like:
create or replace table `dbt_dbanin`.`my_table`
options (
description="my description goes here"
)
as (
select 1 as id
)
The macro that generates create table|view as ... statements can be found here: https://github.com/fishtown-analytics/dbt/blob/b0a4f5c9814629d48230898a19eadb9c53938109/plugins/bigquery/dbt/include/bigquery/macros/adapters.sql#L30
You can see where dbt actually calls this create_table_as macro here: https://github.com/fishtown-analytics/dbt/blob/b0a4f5c9814629d48230898a19eadb9c53938109/plugins/bigquery/dbt/include/bigquery/macros/materializations/table.sql#L68
We do something similar for views and incremental models in the corresponding materializations as well. This materialization code is kind of a mess and in need of a cleanup, but I think this is where all of the magic will likely need to happen.
@drewbanin the links you shared create 404 pages, is that because you shared your branch and not master?
I'd be keen to start development on this if we agree on an approach, thoughts?
It doesn't sound like a tonne of work.
Thanks @darrenhaken - just updated my comment to use permalinks. Database-specific implementation questions aside, I think you're right that we just need to agree upon a syntax for specifying docs persistence.
Let's add a config option to models called persist_docs which accepts a dictionary or None. If the value is None, dbt should not persist docs. If the value is a dictionary, then it may contain the keys:
{
"relation": true/false,
"columns": true/false,
"schema": true/false
}
Let's definitely only implement this for relations initially (i think columns is going to be more involved for sure). I like the idea of using a dict from the outset though, as it will afford us the option to persist docs to other types of object in the future.
I think that for the first cut of this, dbt should only update docs when models are rebuilt. While I do think there's merit to updating documentation without re-running models, that's also a lot more involved.
Lmk if you buy all of this. I imagine you'll need some more info (like how to get documentation for a relation inside of a materialization), but very happy to help you with that in turn once we're all set on the implementation.
@drewbanin trying to get up to speed with all the dbt lingo 👍
What would relation contain in your example?
Could you show a dummy model file containing the relation with a description for the table?
So I can see how it would look for the user.
Or is the idea to use the description from the schema.yml file for the model?
hey @darrenhaken - relation is a database term for objects like "tables" and "views". Both of these things are types of "relations", and it's a convenient way to refer to them! In dbt, using table to refer to both "tables" and "views" is confusing IMO.
I'm picturing something like:
-- models/my_model.sql
{{ config(materialized='table', persist_docs={"relation": true}) }}
Because dbt's configs are hierarchical, you can also configure _all_ of your model to use this setting in your dbt_project.yml file, eg:
models:
my_project:
persist_docs:
relation: true
I absolutely think dbt should use the description from a schema.yml file! If relation is true (as in the above example), then dbt should grab the description (if any) specified from schema.yml and use that in the create table as statement to supply a description for a relation
Thanks for all the info! That all makes complete sense and the approach sounds good to me.
Can you point me in the right direction which files in the codebase need to be modified to support it, I assume Python?
@darrenhaken just out the links to the code above. I don't know that any python changes will be required here (at least for the first cut).
You can find the description for a model in the materialization context with: {{ model.description }}
You can find the value for a config (like persist_docs) with:
{%- set persist_docs = config.get('persist_docs', {}) -%}
{%- set persist_relation_docs = persist_docs.get('relation', false) -%}
Give the BigQuery links above a once-over and lmk if you have any other questions!
I’ll try and take a look at this week in more detail
I'm picking this up now, sorry for the delay.
@drewbanin could you keep an eye on this issue as I'd love some feedback on it?
@drewbanin what's the correct way to set the persist_docs in a model with the dict like you suggested:
{
"relation": true/false,
"columns": true/false,
"schema": true/false
}
Got any example?
@drewbanin I worked this out, see PR
@drewbanin @darrenhaken @jakebiesinger
Super psyched about this PR: https://github.com/fishtown-analytics/dbt/pull/1285/
BigQuery's DDL also supports:
This means these could be inserted via a macro from the schema.yml as well: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language
This looks something like:
create or replace table `test-project.test_dataset.test_table`
(
column_int INT64 NOT NULL OPTIONS(description='This is an INT64 field'),
column_str STRING NOT NULL OPTIONS(description='This is a STRING field'),
column_array ARRAY <
STRUCT <
column_struct_str STRING NOT NULL OPTIONS(description='This is an ARRAY-STRUCT-STRING field'),
column_struct_int INT64 NOT NULL OPTIONS(description='This is a ARRAY-STRUCT-INT64 field')
>
> OPTIONS(description='This is an ARRAY of STRUCTS field'),
column_bool BOOL NOT NULL OPTIONS(description='This is a BOOL field')
)
OPTIONS(
description='Test DDL Schema \n - https://docs.getdbt.com/docs/schemayml-files \n - https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language',
labels=[('key1', 'value1'),('key2', 'value2')]
)
I think this is on your radar, but we have a very similar use case to @jakebiesinger and having these docs persist through to BQ would be really great.
Thanks,
Gil
Great to hear! Yes, I also wanted the labels/tags and column descriptions. If you have time feel free to contribute beyond this work to add them in! If not I’m going to try and get the time over the next month to do it but help would be ideal.
@gturetsky thanks for the writeup! Yeah - this should be totally possible. Should be straightforward-ish to extend #1285 for this use case :)
This was (finally) merged in https://github.com/fishtown-analytics/dbt/pull/1532 :D
Thanks for your hard work on the original PR @darrenhaken!
Hurray!
On Thu, 13 Jun 2019 at 18:00, Drew Banin notifications@github.com wrote:
This was (finally) merged in #1532
https://github.com/fishtown-analytics/dbt/pull/1532 :DThanks for your hard work on the original PR @darrenhaken
https://github.com/darrenhaken!—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/fishtown-analytics/dbt/issues/1031?email_source=notifications&email_token=AAHIY67VI43IXJHJE4T6VGLP2J4MRA5CNFSM4FXTH4RKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXULMUQ#issuecomment-501790290,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAHIY63NE3RVFGULZIIHS33P2J4MRANCNFSM4FXTH4RA
.
Most helpful comment
Great to hear! Yes, I also wanted the labels/tags and column descriptions. If you have time feel free to contribute beyond this work to add them in! If not I’m going to try and get the time over the next month to do it but help would be ideal.