Dbt: Feature: Command to auto-generate schema.yml files

Created on 23 Oct 2018  路  12Comments  路  Source: fishtown-analytics/dbt

Feature description

DBT should have a command to bootstrap a schema.yml file (or, if you ask me, files) for a given model.

dbt bootstrap_schmea --model my_model

Should create one model_name.yml file for every model, with the appropriate boilerplate (model names and column names) filled in.

It should warn and skip if model_name.yml already exists.

Who will this benefit?

When creating a new dbt project from scratch, creating all of the testing files by hand is a PITA. This will help encourage people to use tests by alleviating the need for boiler-plate copy/pasting.

How to achieve?

In redshift, you can get the column names by creating a view in an un-used namespace and then inspecting the relevant system tables. Not sure about BQ / SF but I'm hopeful there's an equivalent method.

Most helpful comment

Great idea. Shocked we don't have an issue for this already!

This becomes a _much_ easier problem if dbt can just inspect existing relations built from your model sql. That would require you to run your models first, but I imagine that's more inline with a typical development workflow anyway, right?

All 12 comments

Great idea. Shocked we don't have an issue for this already!

This becomes a _much_ easier problem if dbt can just inspect existing relations built from your model sql. That would require you to run your models first, but I imagine that's more inline with a typical development workflow anyway, right?

I think it would be cool if we could define the level at which the schema file is defined, maybe from the project.yml.

I can see cases where having one schema file for the entire project would be nice, but can definitely see an argument for having a {{model}}_schema.yml created in a folder structure that mirrors the model structure, where a schemas folder would be created at the root directory of a project.

@kevinsanz93 thanks for raising this. I have always used separate schema files, one for each model, and had been assuming it would be implemented this way, but clearly that's not a good assumption!

bootstrap populate a dbt schema.yml is a good idea!

I wrote a schema.yml generator query using the information_schema of the generated tables in a postgres warehouse. In case it's useful to anyone:

with "columns" as (
    select '- name: "' || column_name || '"' as column_statement,
        table_name
    from information_schema."columns"
    where table_schema = 'schema_name'
)
select table_name,
'version: 2

models:
  - name: "' || table_name || '"
    columns:
' || string_agg('      ' || column_statement, E'\n') || E'\n' as yml_file
from "columns"
group by table_name
order by table_name

@thalesmello this is super cool! Thanks for sharing :)

For snowflake :

with "columns" as (
    select '- name: ' || column_name  || '\n       description: '|| lower(column_name) || ' (data type '|| lower(DATA_TYPE) || ')'

            as column_statement,
        table_name
    from information_schema.columns
    where table_schema = 'MY_TABLE'
  and table_name in ('')
  order by 1
),
tables as (
select table_name,
'
  - name: ' || table_name || '
    columns:
' || listagg('      ' || column_statement || '\n'|| '\n')  as table_desc
from "columns"
group by table_name
order by table_name
)

select '---Generated automatically, please update after generation
version: 2
sources:
  - name: '  || '
\n\nmodels:' || listagg(table_desc )
from tables;

It still would be great if a DBT macro that generates SQL could generate YML files at the same time.

We are now using a DBT macro to generate a series of views in Snowflake. These views depack JSON data in discrete columns using source system information_schema metadata

As an example, each generate view as a column {table_name}_BK. This is a column that should be unique. We would like to create a accompanying YML file for each view that will validate this uniqueness....

In a similar fashion we'd like to add more Data Quality tests to these YML files: _in a metadata-driven way!_

For SQL Server:

DECLARE @schema_yaml varchar(max)='';

WITH [columns] AS (
       SELECT [table] = '- name: ' + TABLE_NAME + char(13) + char(10)
                      + '  description: ' + char(13) + char(10)
                      + '  columns: ' + char(13) + char(10)

              ,[column] = '    - name: ' + COLUMN_NAME + char(13) + char(10) 
                        + '      description: ' + char(13) + char(10)
              ,[column_index]  = ORDINAL_POSITION  
              ,[table_schema] = '- '+ TABLE_SCHEMA + char(13) + char(10) + char(13) + char(10)          
         FROM INFORMATION_SCHEMA.COLUMNS 
)

,[tables] AS (
       SELECT [table_full] = [table] + STRING_AGG(CAST([column] as varchar(max)),'') WITHIN GROUP (ORDER BY [column_index])
              ,[table_schema]
         FROM [columns]
        GROUP BY [table], [table_schema]
)

,[schema] AS (
       SELECT [table_schema_full] = [table_schema] + STRING_AGG([table_full],'')  WITHIN GROUP (ORDER BY [table_full])
         FROM [tables]
         GROUP BY [table_schema]
)

,[final] AS (
       SELECT [text] = STRING_AGG([table_schema_full],'')
         FROM [schema]
)

SELECT @schema_yaml = [text] FROM final

PRINT @schema_yaml


---- if result is longer than 8000 characters, you have to print in iterations
--DECLARE @i int = 1
--WHILE EXISTS(SELECT(SUBSTRING(@schema_yaml,@i,4000))) and (@i < LEN(@schema_yaml))
--BEGIN
--     PRINT SUBSTRING(@schema_yaml,@i,4000)
--     SET @i = @i+4000
--END

For BigQuery:

``SQL WITH columns as ( SELECT '- name: "' || column_name || '"\n description: ' || lower(column_name) || ')\n' || ' tests:\n' as column_statement, table_name FROM ( SELECT * FROM.`.INFORMATION_SCHEMA.COLUMNS
where table_name='

'
)
),

tables as (
SELECT
table_name,
'- name: ' || table_name || '\n'
|| ' description:\n\n'
|| ' columns:\n' || string_agg(' ' || column_statement, '\n') as table_desc
FROM columns
GROUP BY table_name
ORDER BY table_name
)

SELECT
'#Generated automatically, please update after generation\n'
|| 'version: 2\n\n'
|| 'models:\n'
|| string_agg('' || table_desc || '\n')
FROM tables;

Variant of the above script that works on Redshift

select p.name,
'version: 2

models:
  - name: "' || p.name || '"
    columns:
' || listagg(distinct ('      ' || '- name: ' || attname || '\n')) WITHIN GROUP (ORDER BY 1) AS "columns"
FROM pg_attribute a, pg_namespace ns, pg_class c, pg_type t, stv_tbl_perm p, pg_database db
WHERE t.oid=a.atttypid AND a.attrelid=p.id AND ns.oid = c.relnamespace AND db.oid = p.db_id AND c.oid = a.attrelid
AND typname NOT IN ('oid','xid','tid','cid')
AND ns.nspname = 'schema_name'
group by p.name

I think we should extend the scope, not only to initialize a new schema.yml file but also to sync existing shema.yml with current model.
Currently I add or delete columns in schema.yml manually and don鈥檛 enjoy it. So many times the change remains undocumented in the yml.
Btw I use one yml per one model.

Was this page helpful?
0 / 5 - 0 ratings