Dbt: Advanced node selection syntax

Created on 28 Feb 2020  路  5Comments  路  Source: fishtown-analytics/dbt

We want to enable a mechanism of node selection that is:

  • More powerful, composable, extensible
  • In a structured data format
  • Possible to check into version control

We think that this is best implemented as YML. It should be similar to CLI --models and --select syntaxes, but it will also allow us to move beyond what's possible with CLI flags + arguments.

Selectors

  • resource name
  • resource type
  • model materialization type
  • tags
  • project/package
  • subdirectory

    • file path literals

  • Node dependencies. We can make these more verbose in YML than the current selectors on the CLI:

    • parents: +my_model

    • children: my_model+

    • children and all their parents: @my_model

    • proposed: models that depend on macros (and their children?): my_macro+

Set logic

  • unions (inclusive OR):聽current default behavior
  • exclusion:聽possible on the CLI with --exclude
  • intersections (AND):聽not yet possible, proposed in #2167
  • exclusive OR will be possible as the combination of the three above (union(A,B) 鈥攅xclude intersect(A,B))

Well defined "pseudo-selectors"

We can encode a dynamic selector that returns resources based on a set of conditions, which dbt uses to pick specific nodes at build time. I'm including a couple possibilities of varying complexity, mainly to spur the imagination:

  • this_package_only

    • Only execute models + tests that are defined in the current "home? project

    • Dynamic based on the project directory from which it's run.

  • build_if_missing

    • Exclude model nodes that already exist as relations in the target database + schema

  • build_if_changed

    • point to manifest.json from a different dbt build, and dbt can compare to infer changed resources

    • sensible pattern: select all nodes with changes + their children

  • build_if_updated

    • point to a manifest.json from a different dbt build, _and_ the result of a more recent dbt source snapshot-freshness. dbt can determine whether

(Very) hypothetical spec

version: 2

selectors:

  - name: snowplow_marketing_nightly    # human-friendly name for this custom node grouping
    definition:
      - union:          # include nodes for which ANY of the selectors below is true
        - intersect:    # include nodes for which ALL of the selectors below are true
          - tag: nightly
          - tag: marketing
          - package: snowplow
          - materialized: incremental
        - union:
          - resource_name: snowplow_marketing_custom_events
          - file_path: "models/snowplow/marketing/custom_events.sql"
          - model_dir: "snowplow/marketing"
        - intersect:
          - resource_type: seed
          - package: snowplow
          - exclude:
              resource_name: country_codes

  - name: ci    # a different custom node grouping
    definition:
      - dynamic: build_if_changed
        parents: false
        children: true
dbt run --selector snowplow_marketing_nightly
dbt run --selector ci
dbt test --selector ci

Prior art

This carries on the legacy of several past issues (going back to #550, if not earlier). It's something we've been thinking about for some time.

Looking ahead, I believe that a good approach here will form the basis for features we're very interested in supporting:

  • complex workflows
  • smarter CI
  • dev/prod environments in larger/multi-package projects
enhancement

Most helpful comment

I'd like to propose a possible implementation for the "diff-only" (build_if_changed) feature which is based upon my own prior learnings with similar architectures. I'm not sure if this is already the plan but I wanted to document here in case it would be helpful.

  1. During DBT run, the source code of this node is hashed, and the result is hashed with the hashes of all upstream models and the dbt version number. The output of this process is a unique hash can be stored for each model - either on the database itself, and/or in manifest.json as a uniqueness key hash for that run.
  2. During subsequent executions to DBT run - the hashes again are calculated and compared. If --diff-only (or --skip-unchanged or similar) is specified, any object with an exactly matching hash is skipped.
  3. Objects which are missing would always fail the comparison and would therefor be built.

Importantly, this can be performed using static code analysis and is sensitive to upstream model changes. The use cases supported here are:

  1. I'm a developer (using dbt-cloud or similar incremental development process) and I don't want to wait for things to rebuild when we already know what their outputs will be.

    • Developers could (and probably would), leave this as the default setting and only disable/override it when changes are applied outside of the DBT environment (e.g. new data loaded or raw table schemas updated).

  2. I'm in production and I just released a bugfix to the main branch. Without rebuilding my entire environment, I want to automatically rebuild only objects who's source code definition has changed (along with its downstream models) - without having to manually identify which those objects are.

Would this type of "smart rebuild" be feasible and is this similar perhaps to what is already being planned?

All 5 comments

@drewbanin @jtcohen6 - I'm very invested in this feature. I think it could meaningfully improve the incremental run times of our production DAG. Especially the ability to skip any view materialised models and just run a pruned DAG of incremental and table models. I'm really pleased to find such a well through approach detailed here and in the linked issues.

I looks like this depends on #2203, so I'm assuming there's nothing I can do to help right now, but I'm very keen to help out if I can - event if that's just constructing a bank of potential test cases. Please let me know if I can help. 馃榿

@beckjake to review and advise. Sounds like PowerShell and jq have good syntaxes for arbitrary selection over a list -- what do those look like, and can we be inspired by them?

I'd like to propose a possible implementation for the "diff-only" (build_if_changed) feature which is based upon my own prior learnings with similar architectures. I'm not sure if this is already the plan but I wanted to document here in case it would be helpful.

  1. During DBT run, the source code of this node is hashed, and the result is hashed with the hashes of all upstream models and the dbt version number. The output of this process is a unique hash can be stored for each model - either on the database itself, and/or in manifest.json as a uniqueness key hash for that run.
  2. During subsequent executions to DBT run - the hashes again are calculated and compared. If --diff-only (or --skip-unchanged or similar) is specified, any object with an exactly matching hash is skipped.
  3. Objects which are missing would always fail the comparison and would therefor be built.

Importantly, this can be performed using static code analysis and is sensitive to upstream model changes. The use cases supported here are:

  1. I'm a developer (using dbt-cloud or similar incremental development process) and I don't want to wait for things to rebuild when we already know what their outputs will be.

    • Developers could (and probably would), leave this as the default setting and only disable/override it when changes are applied outside of the DBT environment (e.g. new data loaded or raw table schemas updated).

  2. I'm in production and I just released a bugfix to the main branch. Without rebuilding my entire environment, I want to automatically rebuild only objects who's source code definition has changed (along with its downstream models) - without having to manually identify which those objects are.

Would this type of "smart rebuild" be feasible and is this similar perhaps to what is already being planned?

This could also improve the data lineage usability in dbt docs.

I don't think this is covered above. When working with massive DAGs I don't want all children/parents recursively. But want to traverse the tree a level at a time or specify the depth I want to traverse.

Much like the nix command tree takes an argument to list X many levels deep OR recursive. This might look something like e.g.

dbt model_name^1 # only immediate children
dbt model_name^2 # immediate children and grandchildren
dbt 1^model_name # immediate parents

@ucg8j Check direct child model selector syntax added here: https://github.com/fishtown-analytics/dbt/pull/2485 . It should be released in next feature release (maybe 0.18.0 or something)

Was this page helpful?
0 / 5 - 0 ratings