dbt_project.yml should get a version 2. We can denote it by adding an optional config-version: field that defaults to v1 and encourage users to migrate to v2 by creating a migration script.
Many of these changes will enable/make it easier to implement #2269 and #2283 - in fact, v2 will probably be required for those features. It'll also make fixing #2265 a lot easier.
The changes:
models config block will be split out into seeds, models, analyses, sources, etc.config - that key will denote the actual values to be applied to the config.config in the search paths will be an error.vars its own top-level key. vars will be project-scoped only, no full paths/FQNsvars in config blocks will be treated as a config value named vars.We should provide a migration path: 0.17.0 will support both versions (but you can't have the "good stuff" on v1) and emit deprecation warnings. 0.18.0 will only support v2.
We can just not do it.
N/A
anyone developing sources
An example dbt_project.yml file that conforms to the v2 spec:
name: project_name
version: 1.0.0
config-version: 2
# vars cannot be scoped to specific directories
# they may only be scoped to packages, or "global"
vars:
testing_days_of_data: 7
my_project:
testing_days_of_data: 2
some_package:
testing_days_of_data: 4
models:
project_name:
config:
materialized: table
base:
disabled:
config:
enabled: false
config:
materialized: ephemeral
seeds:
snowplow:
config:
enabled: false
snapshots:
config:
strategy: check
check_cols: ['id']
The dbt_project.yml file spec is currently ambiguous. Consider the existing (v1) syntax:
models:
my_project:
partition_by:
field_name: date_day
data_type: date
From this syntax alone, it is not clear that partition_by is a configuration, rather than a path in the dbt project directory! Today, we're handling this by 1) hoping users don't name folders something like partition_by and 2) enumerating the list of configurations that dbt is aware of, eg:
If one of these keys is found in the dbt_project.yml file, then dbt knows that it is a config, not a folder path or model name. Historically, this has been _mostly fine_, most people don't name folders "partition_by", but it would be much better if there was no ambiguity here. Our proposed solution is that we have _one_ special key, config, which denotes that the corresponding value is a configuration instead of a hierarchical file tree.
Today, vars are scoped hierarchically in a dbt project. There isn't a _ton_ of merit to doing it this way, and it's more a fallout of the initial implementation than it is a well-considered design decision. The big downside of this approach is that variable resolution is hard to reason about, and hard to implement in dbt.
For models, folder-scoped variables are reasonably well defined. Consider:
# dbt_project.yml
models:
my_project:
folder_path:
vars:
abc: 123
some_other_path:
vars:
abc: 456
Here, the abc variable will have the value 123 for all models defined in or under models/folder_path. It is _less_ clear what should happen for a schema.yml file located at models/folder_path/schema.yml. Does it inherit the variable scope based on its path? Or should the variable scope be inferred from the model it is describing?
Consider the following two files:
# models/folder_path/schema.yml
models:
- name: my_model
description: "{{ var('abc') }}"
-- models/some_other_path/my_model.sql
select 1 as id
In this example, both 123 and 456 would both be pretty reasonable values for {{ var('abc') }} in the description field, and to be honest, I don't think I could tell you which one dbt returns off the top of my head! Making vars non-hierarchical won't _totally_ solve this exact problem, but it would certainly help!
Let's try to collect some feedback here around how folks are using vars. My personal take is:
var out of models: will be _incredibly_ controversial (confidence = 60%)config: change to models:, and the more I look at it, the more I'm inclined to agree with them :) Let's keep the goals in mind here (consistency, unambiguity, ease of parsing) and remain flexible on the exact implementation that we move forwards with.
Yes love all of this.
Yeah, there's never a time when I use vars scoped for a subdirectory. And moving it out of models: makes sense to me (esp. if you want to use them in sources too)
I'm in favour of config: (configs:?). From a documentation standpoint, it's hard to reason about "the list of keys that can be under models:" (I was literally thinking about this the other day).
I would also be in favor of:
models:
project_name:
materialized: view
enabled: true
tags: ['my_project']
base:
my_big_table:
materialized: table
materialization_configs: # ok not this but something like this
partition_by:
field_name: date_day
data_type: date
For me, there's a difference between the configs that a materialization uses (and are consumed in Jinja-land), versus configs that are consumed by the python parts of dbt. IDK if that's valid though
I'm not worried about the "using vars in a subdirectory" because empirically you can still accomplish the same use cases, you would just need to create multiple variables. This is how you would solve this problem in...normal programming :) There is no concept of "hierarchical variables scoped at a folder level" in Python, so we shouldn't attempt to create one if it's just creating problems for us.
I 100% massively agree that we should have a config key in the yaml spec for this file.
If we are going to make a bunch of changes to the spec for this file, I have two others that I'd really like to get in there!
dbt_project.yml should _only_ include the directives required to start up dbt, and I think that's just the file paths. All other directives should follow our other yaml should go into arbitrarily-named files, just like sources.yml works. This would include models, vars, seeds (anything else?). I would imagine creating a single models.yml or seeds.yml for most projects, but I could imagine wanting to separate those contents out into multiple files in certain situations (large projects, decentralized teams...). But dbt would parse them all in the same way that it currently parses sources.yml, which is to say that it shouldn't care about what specific file the yaml content comes from. Beyond just making dbt_project.yml more tightly scoped and resistant to sprawl, this also gives dbt project developers more control about how they package and architect their code.
The top-level config in models can be _either_ a project name or a top-level-directory name _within_ a project. I think this inconsistency is incredibly confusing and that we should be less flexible on this, whatever the right answer is. Maybe we have a project: key that is optional and defaults to the current project? Dunno. But resolving this feels consistent with our drive here to reduce uncertainty.
and I think that's just the file paths.
In fact, you only need:
name: jaffle_shop
version: 0.0.1
Everything else is optional 馃槃
Speaking of, while we're here, do we want to do something about version:? Either use it meaningfully or nix it?
dbt_project.ymlshould only include the directives required to start up dbt, and I think that's just the file paths. All other directives should follow our other yaml should go into arbitrarily-named files, just likesources.ymlworks.
I agree! This appears to be the cause of a decent amount of confusion. Unfortunately, this is very tricky to specify without also breaking schema.yml behavior and/or creating ambiguities. Currently models, sources, etc are all defined as lists of dicts, where each dict represents one model/source/whatever. I'm not sure I can come up with a model that preserves that without adding the ambiguity back in or having a second layer of keys. and this feels really ugly and verbose:
config:
models:
my_project:
my_folder:
config:
enabled: true
The top-level config in models can be either a project name or a top-level-directory name within a project. I think this inconsistency is incredibly confusing and that we should be less flexible on this, whatever the right answer is. Maybe we have a project: key that is optional and defaults to the current project? Dunno. But resolving this feels consistent with our drive here to reduce uncertainty.
This should just be required to be a project name. I actually thought it was that way already, though config resolution is currently so opaque I could easily be wrong.
@clrcrl Everything is optional but if you _do_ want to specify a file path, that specification is required at startup :)
And good ping on version.
Speaking of, while we're here, do we want to do something about
version:? Either use it meaningfully or nix it?
It's used by the hubsite to determine what version a project is, I thought?
No, not at all. Hubsite uses GitHub releases
Unfortunately, this is very tricky to specify without also breaking schema.yml behavior and/or creating ambiguities.
@beckjake I don't fully understand the challenges you're describing here, although I certainly trust that you're correct. My hope would be that we not totally give up on moving towards that desirable end-state without a fight though :) I'm very game to hop on a call to dig in further on this topic if others have an appetite to do that.
@jthandy more than happy to do so! Just so everyone can follow along, this is a currently valid schema.yml:
version: 2
models:
- name: emails
columns:
- name: email
tests:
- unique
- name: users
columns:
- name: id
tests:
- unique
That models section becomes this in python:
[
{'name': 'emails', 'columns': [{'name': 'email', 'tests': ['unique']}]},
{'name': 'users', 'columns': [{'name': 'id', 'tests': ['unique']}]}
]
It's not possible to mix lists and dicts in yaml, so this naive merge of the dbt_project.yml and schema.yml syntax isn't valid:
version: 2
models:
my_project:
my_folder:
config:
enabled: true
- name: emails
columns:
- name: email
tests:
- unique
- name: users
columns:
- name: id
tests:
- unique
I think we probably _could_ get this to work, but I don't love it - I think it's even _less_ intuitive than having two config files:
version: 2
models:
- config:
my_project:
my_folder:
config:
enabled: true
- name: emails
columns:
- name: email
tests:
- unique
- name: users
columns:
- name: id
tests:
- unique
I really like the proposed solution for handling vars. There was a time where I felt a bit too clever trying to folder scope vars but in reality I didn't really need to, I just thought it would be cleaner. I do think setting them globally/package specific prevents users from building out crazy use cases for vars in the long run! And having its own top-level key makes it really easy to find and define!
Most helpful comment
An example dbt_project.yml file that conforms to the v2 spec:
The config key
The
dbt_project.ymlfile spec is currently ambiguous. Consider the existing (v1) syntax:From this syntax alone, it is not clear that
partition_byis a configuration, rather than a path in the dbt project directory! Today, we're handling this by 1) hoping users don't name folders something likepartition_byand 2) enumerating the list of configurations that dbt is aware of, eg:If one of these keys is found in the
dbt_project.ymlfile, then dbt knows that it is a config, not a folder path or model name. Historically, this has been _mostly fine_, most people don't name folders "partition_by", but it would be much better if there was no ambiguity here. Our proposed solution is that we have _one_ special key,config, which denotes that the corresponding value is a configuration instead of a hierarchical file tree.vars
Today, vars are scoped hierarchically in a dbt project. There isn't a _ton_ of merit to doing it this way, and it's more a fallout of the initial implementation than it is a well-considered design decision. The big downside of this approach is that variable resolution is hard to reason about, and hard to implement in dbt.
For models, folder-scoped variables are reasonably well defined. Consider:
Here, the
abcvariable will have the value123for all models defined in or undermodels/folder_path. It is _less_ clear what should happen for aschema.ymlfile located atmodels/folder_path/schema.yml. Does it inherit the variable scope based on its path? Or should the variable scope be inferred from the model it is describing?Consider the following two files:
In this example, both
123and456would both be pretty reasonable values for{{ var('abc') }}in thedescriptionfield, and to be honest, I don't think I could tell you which one dbt returns off the top of my head! Making vars non-hierarchical won't _totally_ solve this exact problem, but it would certainly help!Let's try to collect some feedback here around how folks are using
vars. My personal take is:varout ofmodels:will be _incredibly_ controversial (confidence = 60%)config:change tomodels:, and the more I look at it, the more I'm inclined to agree with them :)Let's keep the goals in mind here (consistency, unambiguity, ease of parsing) and remain flexible on the exact implementation that we move forwards with.