DBT should have a command to bootstrap a schema.yml file (or, if you ask me, files) for a given model.
dbt bootstrap_schmea --model my_model
Should create one model_name.yml file for every model, with the appropriate boilerplate (model names and column names) filled in.
It should warn and skip if model_name.yml already exists.
When creating a new dbt project from scratch, creating all of the testing files by hand is a PITA. This will help encourage people to use tests by alleviating the need for boiler-plate copy/pasting.
In redshift, you can get the column names by creating a view in an un-used namespace and then inspecting the relevant system tables. Not sure about BQ / SF but I'm hopeful there's an equivalent method.
Great idea. Shocked we don't have an issue for this already!
This becomes a _much_ easier problem if dbt can just inspect existing relations built from your model sql. That would require you to run your models first, but I imagine that's more inline with a typical development workflow anyway, right?
I think it would be cool if we could define the level at which the schema file is defined, maybe from the project.yml.
I can see cases where having one schema file for the entire project would be nice, but can definitely see an argument for having a {{model}}_schema.yml created in a folder structure that mirrors the model structure, where a schemas folder would be created at the root directory of a project.
@kevinsanz93 thanks for raising this. I have always used separate schema files, one for each model, and had been assuming it would be implemented this way, but clearly that's not a good assumption!
bootstrap populate a dbt schema.yml is a good idea!
I wrote a schema.yml generator query using the information_schema of the generated tables in a postgres warehouse. In case it's useful to anyone:
with "columns" as (
select '- name: "' || column_name || '"' as column_statement,
table_name
from information_schema."columns"
where table_schema = 'schema_name'
)
select table_name,
'version: 2
models:
- name: "' || table_name || '"
columns:
' || string_agg(' ' || column_statement, E'\n') || E'\n' as yml_file
from "columns"
group by table_name
order by table_name
@thalesmello this is super cool! Thanks for sharing :)
For snowflake :
with "columns" as (
select '- name: ' || column_name || '\n description: '|| lower(column_name) || ' (data type '|| lower(DATA_TYPE) || ')'
as column_statement,
table_name
from information_schema.columns
where table_schema = 'MY_TABLE'
and table_name in ('')
order by 1
),
tables as (
select table_name,
'
- name: ' || table_name || '
columns:
' || listagg(' ' || column_statement || '\n'|| '\n') as table_desc
from "columns"
group by table_name
order by table_name
)
select '---Generated automatically, please update after generation
version: 2
sources:
- name: ' || '
\n\nmodels:' || listagg(table_desc )
from tables;
It still would be great if a DBT macro that generates SQL could generate YML files at the same time.
We are now using a DBT macro to generate a series of views in Snowflake. These views depack JSON data in discrete columns using source system information_schema metadata
As an example, each generate view as a column {table_name}_BK. This is a column that should be unique. We would like to create a accompanying YML file for each view that will validate this uniqueness....
In a similar fashion we'd like to add more Data Quality tests to these YML files: _in a metadata-driven way!_
For SQL Server:
DECLARE @schema_yaml varchar(max)='';
WITH [columns] AS (
SELECT [table] = '- name: ' + TABLE_NAME + char(13) + char(10)
+ ' description: ' + char(13) + char(10)
+ ' columns: ' + char(13) + char(10)
,[column] = ' - name: ' + COLUMN_NAME + char(13) + char(10)
+ ' description: ' + char(13) + char(10)
,[column_index] = ORDINAL_POSITION
,[table_schema] = '- '+ TABLE_SCHEMA + char(13) + char(10) + char(13) + char(10)
FROM INFORMATION_SCHEMA.COLUMNS
)
,[tables] AS (
SELECT [table_full] = [table] + STRING_AGG(CAST([column] as varchar(max)),'') WITHIN GROUP (ORDER BY [column_index])
,[table_schema]
FROM [columns]
GROUP BY [table], [table_schema]
)
,[schema] AS (
SELECT [table_schema_full] = [table_schema] + STRING_AGG([table_full],'') WITHIN GROUP (ORDER BY [table_full])
FROM [tables]
GROUP BY [table_schema]
)
,[final] AS (
SELECT [text] = STRING_AGG([table_schema_full],'')
FROM [schema]
)
SELECT @schema_yaml = [text] FROM final
PRINT @schema_yaml
---- if result is longer than 8000 characters, you have to print in iterations
--DECLARE @i int = 1
--WHILE EXISTS(SELECT(SUBSTRING(@schema_yaml,@i,4000))) and (@i < LEN(@schema_yaml))
--BEGIN
-- PRINT SUBSTRING(@schema_yaml,@i,4000)
-- SET @i = @i+4000
--END
For BigQuery:
`` tables as ( SELECT Variant of the above script that works on Redshift I think we should extend the scope, not only to initialize a new schema.yml file but also to sync existing shema.yml with current model.SQL
WITH columns as (
SELECT
'- name: "'
|| column_name
|| '"\n description: '
|| lower(column_name)
|| ')\n'
|| ' tests:\n' as column_statement,
table_name
FROM (
SELECT *
FROM
where table_name=''
)
),
SELECT
table_name,
'- name: ' || table_name || '\n'
|| ' description:\n\n'
|| ' columns:\n' || string_agg(' ' || column_statement, '\n') as table_desc
FROM columns
GROUP BY table_name
ORDER BY table_name
)
'#Generated automatically, please update after generation\n'
|| 'version: 2\n\n'
|| 'models:\n'
|| string_agg('' || table_desc || '\n')
FROM tables;select p.name,
'version: 2
models:
- name: "' || p.name || '"
columns:
' || listagg(distinct (' ' || '- name: ' || attname || '\n')) WITHIN GROUP (ORDER BY 1) AS "columns"
FROM pg_attribute a, pg_namespace ns, pg_class c, pg_type t, stv_tbl_perm p, pg_database db
WHERE t.oid=a.atttypid AND a.attrelid=p.id AND ns.oid = c.relnamespace AND db.oid = p.db_id AND c.oid = a.attrelid
AND typname NOT IN ('oid','xid','tid','cid')
AND ns.nspname = 'schema_name'
group by p.name
Currently I add or delete columns in schema.yml manually and don鈥檛 enjoy it. So many times the change remains undocumented in the yml.
Btw I use one yml per one model.Related issues
keiththompson
路
3Comments
smomen
路
3Comments
kconvey
路
3Comments
drewbanin
路
4Comments
jtcohen6
路
3Comments
Most helpful comment
Great idea. Shocked we don't have an issue for this already!
This becomes a _much_ easier problem if dbt can just inspect existing relations built from your model sql. That would require you to run your models first, but I imagine that's more inline with a typical development workflow anyway, right?