Dbt: Advanced node selection syntax

Created on 28 Feb 2020 · 5Comments · Source: fishtown-analytics/dbt

We want to enable a mechanism of node selection that is:

More powerful, composable, extensible
In a structured data format
Possible to check into version control

We think that this is best implemented as YML. It should be similar to CLI --models and --select syntaxes, but it will also allow us to move beyond what's possible with CLI flags + arguments.

Selectors

resource name
resource type
model materialization type
tags
project/package
subdirectory
- file path literals
Node dependencies. We can make these more verbose in YML than the current selectors on the CLI:
- parents: +my_model
- children: my_model+
- children and all their parents: @my_model
- proposed: models that depend on macros (and their children?): my_macro+

Set logic

unions (inclusive OR): current default behavior
exclusion: possible on the CLI with --exclude
intersections (AND): not yet possible, proposed in #2167
exclusive OR will be possible as the combination of the three above (union(A,B) —exclude intersect(A,B))

Well defined "pseudo-selectors"

We can encode a dynamic selector that returns resources based on a set of conditions, which dbt uses to pick specific nodes at build time. I'm including a couple possibilities of varying complexity, mainly to spur the imagination:

this_package_only
- Only execute models + tests that are defined in the current "home? project
- Dynamic based on the project directory from which it's run.
build_if_missing
- Exclude model nodes that already exist as relations in the target database + schema
build_if_changed
- point to manifest.json from a different dbt build, and dbt can compare to infer changed resources
- sensible pattern: select all nodes with changes + their children
build_if_updated
- point to a manifest.json from a different dbt build, _and_ the result of a more recent dbt source snapshot-freshness. dbt can determine whether

(Very) hypothetical spec

version: 2

selectors:

  - name: snowplow_marketing_nightly    # human-friendly name for this custom node grouping
    definition:
      - union:          # include nodes for which ANY of the selectors below is true
        - intersect:    # include nodes for which ALL of the selectors below are true
          - tag: nightly
          - tag: marketing
          - package: snowplow
          - materialized: incremental
        - union:
          - resource_name: snowplow_marketing_custom_events
          - file_path: "models/snowplow/marketing/custom_events.sql"
          - model_dir: "snowplow/marketing"
        - intersect:
          - resource_type: seed
          - package: snowplow
          - exclude:
              resource_name: country_codes

  - name: ci    # a different custom node grouping
    definition:
      - dynamic: build_if_changed
        parents: false
        children: true

dbt run --selector snowplow_marketing_nightly

dbt run --selector ci
dbt test --selector ci

Prior art

This carries on the legacy of several past issues (going back to #550, if not earlier). It's something we've been thinking about for some time.

Looking ahead, I believe that a good approach here will form the basis for features we're very interested in supporting:

complex workflows
smarter CI
dev/prod environments in larger/multi-package projects

enhancement

Source

jtcohen6

👍2

Most helpful comment

I'd like to propose a possible implementation for the "diff-only" (build_if_changed) feature which is based upon my own prior learnings with similar architectures. I'm not sure if this is already the plan but I wanted to document here in case it would be helpful.

During DBT run, the source code of this node is hashed, and the result is hashed with the hashes of all upstream models and the dbt version number. The output of this process is a unique hash can be stored for each model - either on the database itself, and/or in manifest.json as a uniqueness key hash for that run.
During subsequent executions to DBT run - the hashes again are calculated and compared. If --diff-only (or --skip-unchanged or similar) is specified, any object with an exactly matching hash is skipped.
Objects which are missing would always fail the comparison and would therefor be built.

Importantly, this can be performed using static code analysis and is sensitive to upstream model changes. The use cases supported here are:

I'm a developer (using dbt-cloud or similar incremental development process) and I don't want to wait for things to rebuild when we already know what their outputs will be.
- Developers could (and probably would), leave this as the default setting and only disable/override it when changes are applied outside of the DBT environment (e.g. new data loaded or raw table schemas updated).
I'm in production and I just released a bugfix to the main branch. Without rebuilding my entire environment, I want to automatically rebuild only objects who's source code definition has changed (along with its downstream models) - without having to manually identify which those objects are.

Would this type of "smart rebuild" be feasible and is this similar perhaps to what is already being planned?

aaronsteers on 24 Jun 2020

👍3

All 5 comments

@drewbanin @jtcohen6 - I'm very invested in this feature. I think it could meaningfully improve the incremental run times of our production DAG. Especially the ability to skip any view materialised models and just run a pruned DAG of incremental and table models. I'm really pleased to find such a well through approach detailed here and in the linked issues.

I looks like this depends on #2203, so I'm assuming there's nothing I can do to help right now, but I'm very keen to help out if I can - event if that's just constructing a bank of potential test cases. Please let me know if I can help. 😁

alanmcruickshank on 14 Apr 2020

👍3

@beckjake to review and advise. Sounds like PowerShell and jq have good syntaxes for arbitrary selection over a list -- what do those look like, and can we be inspired by them?

drewbanin on 29 Apr 2020

During DBT run, the source code of this node is hashed, and the result is hashed with the hashes of all upstream models and the dbt version number. The output of this process is a unique hash can be stored for each model - either on the database itself, and/or in manifest.json as a uniqueness key hash for that run.
During subsequent executions to DBT run - the hashes again are calculated and compared. If --diff-only (or --skip-unchanged or similar) is specified, any object with an exactly matching hash is skipped.
Objects which are missing would always fail the comparison and would therefor be built.

Importantly, this can be performed using static code analysis and is sensitive to upstream model changes. The use cases supported here are:

I'm a developer (using dbt-cloud or similar incremental development process) and I don't want to wait for things to rebuild when we already know what their outputs will be.
- Developers could (and probably would), leave this as the default setting and only disable/override it when changes are applied outside of the DBT environment (e.g. new data loaded or raw table schemas updated).
I'm in production and I just released a bugfix to the main branch. Without rebuilding my entire environment, I want to automatically rebuild only objects who's source code definition has changed (along with its downstream models) - without having to manually identify which those objects are.

Would this type of "smart rebuild" be feasible and is this similar perhaps to what is already being planned?

aaronsteers on 24 Jun 2020

👍3

This could also improve the data lineage usability in dbt docs.

I don't think this is covered above. When working with massive DAGs I don't want all children/parents recursively. But want to traverse the tree a level at a time or specify the depth I want to traverse.

Much like the nix command tree takes an argument to list X many levels deep OR recursive. This might look something like e.g.

dbt model_name^1 # only immediate children
dbt model_name^2 # immediate children and grandchildren
dbt 1^model_name # immediate parents

ucg8j on 13 Jul 2020

👍1

@ucg8j Check direct child model selector syntax added here: https://github.com/fishtown-analytics/dbt/pull/2485 . It should be released in next feature release (maybe 0.18.0 or something)