We want to enable a mechanism of node selection that is:
We think that this is best implemented as YML. It should be similar to CLI --models and --select syntaxes, but it will also allow us to move beyond what's possible with CLI flags + arguments.
+my_model, my_model+, @my_model, my_macro+, --exclude, union(A,B) --exclude intersect(A,B)
We can encode a dynamic selector that returns resources based on a set of conditions, which dbt uses to pick specific nodes at build time. I'm including a couple of possibilities of varying complexity, mainly to spur the imagination:
this_package_only
build_if_missing: target database + schema
build_if_changed: manifest.json from a different dbt build, and dbt can compare to infer changed resources
build_if_updated: manifest.json from a different dbt build, _and_ the result of a more recent dbt source snapshot-freshness. dbt can determine whether
version: 2
selectors:
  - name: snowplow_marketing_nightly  # human-friendly name for this custom node grouping
    definition:
      - union:  # include nodes for which ANY of the selectors below is true
          - intersect:  # include nodes for which ALL of the selectors below are true
              - tag: nightly
              - tag: marketing
              - package: snowplow
              - materialized: incremental
          - union:
              - resource_name: snowplow_marketing_custom_events
              - file_path: "models/snowplow/marketing/custom_events.sql"
              - model_dir: "snowplow/marketing"
          - intersect:
              - resource_type: seed
              - package: snowplow
              - exclude:
                  resource_name: country_codes
  - name: ci  # a different custom node grouping
    definition:
      - dynamic: build_if_changed
        parents: false
        children: true
dbt run --selector snowplow_marketing_nightly
dbt run --selector ci
dbt test --selector ci
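To make the union/intersect/exclude semantics concrete, here is a minimal Python sketch of how a nested definition like snowplow_marketing_nightly could be evaluated against a list of nodes. This is not dbt's implementation. The node records, the select function, and the choice to treat exclude as "everything except the inner selector" are all assumptions made for illustration. The file_path and model_dir entries and the dynamic selector are left out to keep it short.

```python
from typing import Any

# Hypothetical node records; field names mirror the selector keys in the YAML proposal.
NODES: list[dict[str, Any]] = [
    {"resource_name": "snowplow_page_views", "package": "snowplow",
     "resource_type": "model", "materialized": "incremental",
     "tags": ["nightly", "marketing"]},
    {"resource_name": "snowplow_marketing_custom_events", "package": "snowplow",
     "resource_type": "model", "materialized": "view", "tags": []},
    {"resource_name": "country_codes", "package": "snowplow",
     "resource_type": "seed", "materialized": "seed", "tags": []},
    {"resource_name": "referer_mapping", "package": "snowplow",
     "resource_type": "seed", "materialized": "seed", "tags": []},
    {"resource_name": "orders", "package": "shop",
     "resource_type": "model", "materialized": "table", "tags": ["nightly"]},
]

# The snowplow_marketing_nightly definition, written as plain Python data.
DEFINITION = [
    {"union": [
        {"intersect": [
            {"tag": "nightly"},
            {"tag": "marketing"},
            {"package": "snowplow"},
            {"materialized": "incremental"},
        ]},
        {"union": [{"resource_name": "snowplow_marketing_custom_events"}]},
        {"intersect": [
            {"resource_type": "seed"},
            {"package": "snowplow"},
            {"exclude": {"resource_name": "country_codes"}},
        ]},
    ]},
]


def select(definition: Any, nodes: list[dict[str, Any]]) -> set[str]:
    """Return the names of the nodes matched by a (possibly nested) definition."""
    everything = {n["resource_name"] for n in nodes}
    if isinstance(definition, list):
        # A bare list is treated as a union of its entries (assumption).
        return set().union(*(select(d, nodes) for d in definition))
    if len(definition) != 1:
        raise ValueError(f"expected a single-key selector, got {definition!r}")
    (key, value), = definition.items()
    if key == "union":
        return select(value, nodes)
    if key == "intersect":
        parts = [select(d, nodes) for d in value]
        return set.intersection(*parts) if parts else set()
    if key == "exclude":
        # Assumption: exclude means "all nodes except those the inner selector matches".
        return everything - select(value, nodes)
    if key == "tag":
        return {n["resource_name"] for n in nodes if value in n.get("tags", [])}
    # Any other key is compared directly against the node attribute of the same name.
    return {n["resource_name"] for n in nodes if n.get(key) == value}


selected = select(DEFINITION, NODES)
print(sorted(selected))
```

With these sample nodes, the selection contains snowplow_page_views, snowplow_marketing_custom_events, and referer_mapping; country_codes is dropped by the exclude, and orders never matches because it is in a different package.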
This carries on the legacy of several past issues (going back to #550, if not earlier). It's something we've been thinking about for some time.
Looking ahead, I believe that a good approach here will form the basis for features we're very interested in supporting:
@drewbanin @jtcohen6 - I'm very invested in this feature. I think it could meaningfully improve the incremental run times of our production DAG, especially the ability to skip any view-materialised models and run only a pruned DAG of incremental and table models. I'm really pleased to find such a well-thought-through approach detailed here and in the linked issues.
It looks like this depends on #2203, so I'm assuming there's nothing I can do to help right now, but I'm very keen to help out if I can, even if that's just constructing a bank of potential test cases. Please let me know if I can help. 😁
@beckjake to review and advise. Sounds like PowerShell and jq have good syntaxes for arbitrary selection over a list -- what do those look like, and can we be inspired by them?
I'd like to propose a possible implementation for the "diff-only" (build_if_changed) feature which is based upon my own prior learnings with similar architectures. I'm not sure if this is already the plan but I wanted to document here in case it would be helpful.
When --diff-only (or --skip-unchanged or similar) is specified, any object with an exactly matching hash is skipped.
Importantly, this can be performed using static code analysis and is sensitive to upstream model changes. The use cases supported here are:
main branch. Without rebuilding my entire environment, I want to automatically rebuild only objects whose source code definition has changed (along with their downstream models), without having to manually identify which objects those are.
Would this type of "smart rebuild" be feasible, and is this perhaps similar to what is already being planned?
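A rough Python sketch of the hash comparison described in this comment, under one stated assumption: each object's hash covers its own source text plus the hashes of its upstream models, so a change anywhere upstream changes every downstream hash. The function names, the sample models, and the hashing scheme are hypothetical, not an existing dbt feature.

```python
import hashlib


def model_hashes(sources: dict[str, str], parents: dict[str, list[str]]) -> dict[str, str]:
    """Hash each model's source text together with the hashes of its parents.

    Assumes the dependency graph is acyclic, as a dbt DAG is.
    """
    hashes: dict[str, str] = {}

    def visit(name: str) -> str:
        if name not in hashes:
            digest = hashlib.sha256(sources[name].encode("utf-8"))
            # Sort parents so the result does not depend on declaration order.
            for parent in sorted(parents.get(name, [])):
                digest.update(visit(parent).encode("utf-8"))
            hashes[name] = digest.hexdigest()
        return hashes[name]

    for name in sources:
        visit(name)
    return hashes


def models_to_build(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return models whose hash is new or different; exact matches are skipped."""
    return sorted(name for name, h in current.items() if previous.get(name) != h)


# Hypothetical project: orders depends on stg_orders; customers stands alone.
PARENTS = {"orders": ["stg_orders"]}
before = model_hashes(
    {
        "stg_orders": "select * from raw.orders",
        "orders": "select * from {{ ref('stg_orders') }}",
        "customers": "select * from raw.customers",
    },
    PARENTS,
)
after = model_hashes(
    {
        "stg_orders": "select id, amount from raw.orders",  # edited model
        "orders": "select * from {{ ref('stg_orders') }}",
        "customers": "select * from raw.customers",
    },
    PARENTS,
)

print(models_to_build(before, after))
```

Here only stg_orders was edited, yet orders is rebuilt too because its hash includes its parent's hash; customers is skipped.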
This could also improve the data lineage usability in dbt docs.
I don't think this is covered above. When working with massive DAGs, I don't want all children/parents recursively. Instead, I want to traverse the graph one level at a time, or specify the depth I want to traverse.
This is much like the *nix command tree, which takes an argument (-L) to list X levels deep instead of recursing fully. It might look something like:
dbt model_name^1 # only immediate children
dbt model_name^2 # immediate children and grandchildren
dbt 1^model_name # immediate parents
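The depth-limited selection proposed above amounts to a bounded breadth-first walk over the DAG. A minimal Python sketch, with a made-up graph and hypothetical helper names (within_depth, invert), not dbt's own graph code:

```python
from collections import deque


def within_depth(edges: dict[str, list[str]], start: str, depth: int) -> list[str]:
    """Return nodes reachable from start in at most depth hops, nearest first."""
    seen = {start}
    frontier = deque([(start, 0)])
    found: list[str] = []
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand past the requested depth
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, dist + 1))
    return found


def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn a parent -> children map into a child -> parents map."""
    parents: dict[str, list[str]] = {}
    for parent, kids in edges.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents


# Hypothetical DAG: raw -> model_name -> (child_a -> grandchild, child_b)
CHILDREN = {
    "raw": ["model_name"],
    "model_name": ["child_a", "child_b"],
    "child_a": ["grandchild"],
}
PARENTS = invert(CHILDREN)

print(within_depth(CHILDREN, "model_name", 1))  # like model_name^1
print(within_depth(CHILDREN, "model_name", 2))  # like model_name^2
print(within_depth(PARENTS, "model_name", 1))   # like 1^model_name
```

The same function serves both directions: walking the children map gives descendants, and walking the inverted map gives ancestors.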
@ucg8j Check the direct-child model selector syntax added in https://github.com/fishtown-analytics/dbt/pull/2485 (it should ship in the next feature release, maybe 0.18.0 or something).