dbt_project.yml v2

Created on 7 Apr 2020  路  11Comments  路  Source: fishtown-analytics/dbt

Describe the feature

dbt_project.yml should get a version 2. We can denote it by adding an optional config-version: field that defaults to v1 and encourage users to migrate to v2 by creating a migration script.

Many of these changes will enable/make it easier to implement #2269 and #2283 - in fact, v2 will probably be required for those features. It'll also make fixing #2265 a lot easier.

The changes:

  • The models config block will be split out into seeds, models, analyses, sources, etc.
  • In those config blocks, every entry will be used as a part of the FQN hierarchy, except for the key named config - that key will denote the actual values to be applied to the config.

    • This will resolve some ambiguities

    • Directories and models named config in the search paths will be an error.

  • Give vars its own top-level key. vars will be project-scoped only, no full paths/FQNs

    • this will still make it possible to override dependency vars, which is important

    • but we won't have to do a bunch of hierarchy-parsing from configs.

    • vars in config blocks will be treated as a config value named vars.

We should provide a migration path: 0.17.0 will support both versions (but you can't have the "good stuff" on v1) and emit deprecation warnings. 0.18.0 will only support v2.

Describe alternatives you've considered

We can just not do it.

Additional context

N/A

Who will this benefit?

anyone developing sources

1.0.0 enhancement

Most helpful comment

An example dbt_project.yml file that conforms to the v2 spec:

name: project_name
version: 1.0.0

config-version: 2 

# vars cannot be scoped to specific directories
# they may only be scoped to packages, or "global"
vars:
  testing_days_of_data: 7

  my_project:
    testing_days_of_data: 2

  some_package:
    testing_days_of_data: 4

models:
  project_name:
    config:
      materialized: table
    base:
      disabled:
        config:
          enabled: false
      config:
        materialized: ephemeral

seeds:
  snowplow:
    config:
      enabled: false

snapshots:
  config:
    strategy: check
    check_cols: ['id']

The config key

The dbt_project.yml file spec is currently ambiguous. Consider the existing (v1) syntax:

models:
  my_project:
    partition_by:
      field_name: date_day
      data_type: date

From this syntax alone, it is not clear that partition_by is a configuration, rather than a path in the dbt project directory! Today, we're handling this by 1) hoping users don't name folders something like partition_by and 2) enumerating the list of configurations that dbt is aware of, eg:

  • enabled, materialized
  • sort, dist
  • cluster_by, partition_by
  • etc etc etc

If one of these keys is found in the dbt_project.yml file, then dbt knows that it is a config, not a folder path or model name. Historically, this has been _mostly fine_, most people don't name folders "partition_by", but it would be much better if there was no ambiguity here. Our proposed solution is that we have _one_ special key, config, which denotes that the corresponding value is a configuration instead of a hierarchical file tree.

vars

Today, vars are scoped hierarchically in a dbt project. There isn't a _ton_ of merit to doing it this way, and it's more a fallout of the initial implementation than it is a well-considered design decision. The big downside of this approach is that variable resolution is hard to reason about, and hard to implement in dbt.

For models, folder-scoped variables are reasonably well defined. Consider:

# dbt_project.yml

models:
  my_project:
    folder_path:
      vars:
        abc: 123
    some_other_path:
      vars:
        abc: 456

Here, the abc variable will have the value 123 for all models defined in or under models/folder_path. It is _less_ clear what should happen for a schema.yml file located at models/folder_path/schema.yml. Does it inherit the variable scope based on its path? Or should the variable scope be inferred from the model it is describing?

Consider the following two files:

# models/folder_path/schema.yml

models:
  - name: my_model
    description: "{{ var('abc') }}"
-- models/some_other_path/my_model.sql

select 1 as id

In this example, both 123 and 456 would both be pretty reasonable values for {{ var('abc') }} in the description field, and to be honest, I don't think I could tell you which one dbt returns off the top of my head! Making vars non-hierarchical won't _totally_ solve this exact problem, but it would certainly help!


Let's try to collect some feedback here around how folks are using vars. My personal take is:

  • I don't think moving var out of models: will be _incredibly_ controversial (confidence = 60%)
  • People aren't going to love the config: change to models:, and the more I look at it, the more I'm inclined to agree with them :)

Let's keep the goals in mind here (consistency, unambiguity, ease of parsing) and remain flexible on the exact implementation that we move forwards with.

All 11 comments

An example dbt_project.yml file that conforms to the v2 spec:

name: project_name
version: 1.0.0

config-version: 2 

# vars cannot be scoped to specific directories
# they may only be scoped to packages, or "global"
vars:
  testing_days_of_data: 7

  my_project:
    testing_days_of_data: 2

  some_package:
    testing_days_of_data: 4

models:
  project_name:
    config:
      materialized: table
    base:
      disabled:
        config:
          enabled: false
      config:
        materialized: ephemeral

seeds:
  snowplow:
    config:
      enabled: false

snapshots:
  config:
    strategy: check
    check_cols: ['id']

The config key

The dbt_project.yml file spec is currently ambiguous. Consider the existing (v1) syntax:

models:
  my_project:
    partition_by:
      field_name: date_day
      data_type: date

From this syntax alone, it is not clear that partition_by is a configuration, rather than a path in the dbt project directory! Today, we're handling this by 1) hoping users don't name folders something like partition_by and 2) enumerating the list of configurations that dbt is aware of, eg:

  • enabled, materialized
  • sort, dist
  • cluster_by, partition_by
  • etc etc etc

If one of these keys is found in the dbt_project.yml file, then dbt knows that it is a config, not a folder path or model name. Historically, this has been _mostly fine_, most people don't name folders "partition_by", but it would be much better if there was no ambiguity here. Our proposed solution is that we have _one_ special key, config, which denotes that the corresponding value is a configuration instead of a hierarchical file tree.

vars

Today, vars are scoped hierarchically in a dbt project. There isn't a _ton_ of merit to doing it this way, and it's more a fallout of the initial implementation than it is a well-considered design decision. The big downside of this approach is that variable resolution is hard to reason about, and hard to implement in dbt.

For models, folder-scoped variables are reasonably well defined. Consider:

# dbt_project.yml

models:
  my_project:
    folder_path:
      vars:
        abc: 123
    some_other_path:
      vars:
        abc: 456

Here, the abc variable will have the value 123 for all models defined in or under models/folder_path. It is _less_ clear what should happen for a schema.yml file located at models/folder_path/schema.yml. Does it inherit the variable scope based on its path? Or should the variable scope be inferred from the model it is describing?

Consider the following two files:

# models/folder_path/schema.yml

models:
  - name: my_model
    description: "{{ var('abc') }}"
-- models/some_other_path/my_model.sql

select 1 as id

In this example, both 123 and 456 would both be pretty reasonable values for {{ var('abc') }} in the description field, and to be honest, I don't think I could tell you which one dbt returns off the top of my head! Making vars non-hierarchical won't _totally_ solve this exact problem, but it would certainly help!


Let's try to collect some feedback here around how folks are using vars. My personal take is:

  • I don't think moving var out of models: will be _incredibly_ controversial (confidence = 60%)
  • People aren't going to love the config: change to models:, and the more I look at it, the more I'm inclined to agree with them :)

Let's keep the goals in mind here (consistency, unambiguity, ease of parsing) and remain flexible on the exact implementation that we move forwards with.

Yes love all of this.

Yeah, there's never a time when I use vars scoped for a subdirectory. And moving it out of models: makes sense to me (esp. if you want to use them in sources too)

I'm in favour of config: (configs:?). From a documentation standpoint, it's hard to reason about "the list of keys that can be under models:" (I was literally thinking about this the other day).

I would also be in favor of:

models:
  project_name:
    materialized: view
    enabled: true
    tags: ['my_project']
    base:
      my_big_table:
        materialized: table
        materialization_configs: # ok not this but something like this
         partition_by: 
            field_name: date_day
            data_type: date

For me, there's a difference between the configs that a materialization uses (and are consumed in Jinja-land), versus configs that are consumed by the python parts of dbt. IDK if that's valid though

  1. I'm not worried about the "using vars in a subdirectory" because empirically you can still accomplish the same use cases, you would just need to create multiple variables. This is how you would solve this problem in...normal programming :) There is no concept of "hierarchical variables scoped at a folder level" in Python, so we shouldn't attempt to create one if it's just creating problems for us.

  2. I 100% massively agree that we should have a config key in the yaml spec for this file.

  3. If we are going to make a bunch of changes to the spec for this file, I have two others that I'd really like to get in there!

  • dbt_project.yml should _only_ include the directives required to start up dbt, and I think that's just the file paths. All other directives should follow our other yaml should go into arbitrarily-named files, just like sources.yml works. This would include models, vars, seeds (anything else?). I would imagine creating a single models.yml or seeds.yml for most projects, but I could imagine wanting to separate those contents out into multiple files in certain situations (large projects, decentralized teams...). But dbt would parse them all in the same way that it currently parses sources.yml, which is to say that it shouldn't care about what specific file the yaml content comes from. Beyond just making dbt_project.yml more tightly scoped and resistant to sprawl, this also gives dbt project developers more control about how they package and architect their code.

  • The top-level config in models can be _either_ a project name or a top-level-directory name _within_ a project. I think this inconsistency is incredibly confusing and that we should be less flexible on this, whatever the right answer is. Maybe we have a project: key that is optional and defaults to the current project? Dunno. But resolving this feels consistent with our drive here to reduce uncertainty.

and I think that's just the file paths.

In fact, you only need:

name: jaffle_shop
version: 0.0.1

Everything else is optional 馃槃

Speaking of, while we're here, do we want to do something about version:? Either use it meaningfully or nix it?

dbt_project.yml should only include the directives required to start up dbt, and I think that's just the file paths. All other directives should follow our other yaml should go into arbitrarily-named files, just like sources.yml works.

I agree! This appears to be the cause of a decent amount of confusion. Unfortunately, this is very tricky to specify without also breaking schema.yml behavior and/or creating ambiguities. Currently models, sources, etc are all defined as lists of dicts, where each dict represents one model/source/whatever. I'm not sure I can come up with a model that preserves that without adding the ambiguity back in or having a second layer of keys. and this feels really ugly and verbose:

config:
  models:
    my_project:
      my_folder:
        config:
          enabled: true

The top-level config in models can be either a project name or a top-level-directory name within a project. I think this inconsistency is incredibly confusing and that we should be less flexible on this, whatever the right answer is. Maybe we have a project: key that is optional and defaults to the current project? Dunno. But resolving this feels consistent with our drive here to reduce uncertainty.

This should just be required to be a project name. I actually thought it was that way already, though config resolution is currently so opaque I could easily be wrong.

@clrcrl Everything is optional but if you _do_ want to specify a file path, that specification is required at startup :)

And good ping on version.

Speaking of, while we're here, do we want to do something about version:? Either use it meaningfully or nix it?

It's used by the hubsite to determine what version a project is, I thought?

No, not at all. Hubsite uses GitHub releases

Unfortunately, this is very tricky to specify without also breaking schema.yml behavior and/or creating ambiguities.

@beckjake I don't fully understand the challenges you're describing here, although I certainly trust that you're correct. My hope would be that we not totally give up on moving towards that desirable end-state without a fight though :) I'm very game to hop on a call to dig in further on this topic if others have an appetite to do that.

@jthandy more than happy to do so! Just so everyone can follow along, this is a currently valid schema.yml:

version: 2
models:
- name: emails
  columns:
  - name: email
    tests:
    - unique
- name: users
  columns:
  - name: id
    tests:
    - unique

That models section becomes this in python:

[
  {'name': 'emails', 'columns': [{'name': 'email', 'tests': ['unique']}]},
  {'name': 'users', 'columns': [{'name': 'id', 'tests': ['unique']}]}
]

It's not possible to mix lists and dicts in yaml, so this naive merge of the dbt_project.yml and schema.yml syntax isn't valid:

version: 2
models:
 my_project:
   my_folder:
     config:
       enabled: true
- name: emails
  columns:
  - name: email
    tests:
    - unique
- name: users
  columns:
  - name: id
    tests:
    - unique

I think we probably _could_ get this to work, but I don't love it - I think it's even _less_ intuitive than having two config files:

version: 2
models:
- config:
    my_project:
     my_folder:
      config:
       enabled: true
- name: emails
  columns:
  - name: email
    tests:
    - unique
- name: users
  columns:
  - name: id
    tests:
    - unique

I really like the proposed solution for handling vars. There was a time where I felt a bit too clever trying to folder scope vars but in reality I didn't really need to, I just thought it would be cleaner. I do think setting them globally/package specific prevents users from building out crazy use cases for vars in the long run! And having its own top-level key makes it really easy to find and define!

Was this page helpful?
0 / 5 - 0 ratings