Dbt: Dynamically reference dbt models

Created on 4 Jan 2019 · 8Comments · Source: fishtown-analytics/dbt

Feature

Feature description

The dbt-utils documentation notes that get_tables_by_prefix pairs well with the union_tables macro and I agree. However, get_tables_by_prefix only works with tables in the database, not models within the DBT project.

I propose a new feature that allows us to put model names into a list.

Example

Suppose we have the following model directory:

.
└── activity
    ├── activities
    │   ├── activity_comment.sql
    │   └── activity_login.sql
    └── daily_activity.sql

daily_activity.sql would like to UNION ALL ref('activity_comment') and ref('activity_login').

Maybe the workflow looks something like this:

-- in daily_activity.sql
{%
  for table_name in get_model_names('activities')
%}

SELECT
  ...
FROM ref(table_name)

Who will this benefit?

Anyone who wants to benefit from DRY principles. The current workflow forces users manually write the names of files to be unioned, instead of depending on the natural structure of the DBT project.

@drewbanin was also interested in this feature in the DBT slack channel, prompting me to create this feature request.

enhancement

Source

tmastny

👍24

Most helpful comment

cc @beckjake

Ok, let's figure out how to make this happen! I have a bunch of ideas bouncing around in my head, and there are a couple of constraints that will guide this feature.

Thoughts:

This needs to be some sort of post-processing step that happens after parsing but before running
We'd need to do something special for ephemeral models
- ref returns a Relation for most models, but a CTE string name for ephemeral models
- Do we also need to defer ephemeral cte injection until after parsing? I think that might actually be how it already how it works...
- There's merit to returning ParsedNode representations in addition to just Relations
- I don't know that we need to do that in pursuit of this particular feature, but it's certainly adjacent!
- This wouldn't return a Relation (or build an edge), but would instead be a sort of Manifest API
- At some point, we'll probably want to think about how to import dbt from Python and do something similar...
- Each of models, sources, seeds, and archives should be selectable

Proposal:

nodes.ref.resources(*selectors, resource_types=None)

Overview

The nodes.ref family of functions exist to 1) find resources that exist in the graph and 2) dynamically ref them, returning a Relation. The full family is:

nodes.ref.resources(*selectors, resource_types=None)
nodes.ref.models(*selectors)
nodes.ref.sources(*selectors)
nodes.ref.archives(*selectors)
nodes.ref.seeds(*selectors)

These functions should all return Relation objects that match any of the specified selectors. The models/sources/archives/seeds variants are sugar over nodes.ref.resources.

Example usage:

{% for model in nodes.ref.models("events.*") %}
  select * from {{ model }} {% if not loop.last %} union all {% endif %}
{% endfor %}

*NOTE: * I think it's important that the ref() (or source()) happens implicitly in the nodes.ref.resources method. We need to capture these references, then post-process them before runtime. An alternative implementation which returns resources names would be hard for two reasons:

the syntax for referencing models and source differs between resource types (ref(name) vs. source(source_name, table_name). Pushing this logic to the caller would be unduly difficult/error prone i think
dbt needs to build graph edges in the post-processing step. Ideally, it could do this directly from the arguments passed to invocations of nodes.ref.models. An alternative implementation which returned model names and left refing to the caller would need to re-parse the whole model!

Node selection

The nodes.ref.* family of functions should accept a series of node selectors as varargs. These node selectors should use the same selection syntax as provided on the CLI. This includes:

selecting by fqn (eg. package.*, package.dir.*, dir.path.to.models)
selecting by tag (eg. event)
selecting by name (eg. stg_snowplow_event)
selecting by source (eg. source:snowplow.*+)

If it turns out that there are other selectors that are useful here (like selecting by name prefix, or regex, or similar), we should consider adding those as available selectors on the CLI too.

Implementation

Calls tonodes.ref.relations should return an empty list at parse-time. The function call should be stored for post-processing. When the function is called at run-time, dbt should evaluate the specified selectors and return a list of matched Relations.

Caveats

Ephemeral models are going to be annoying here. I personally dislike the idea of having to document this beautiful suite of functions, noting that they either return a Relation for literally every resource type, or a string for ephemeral models :)

It might be worth investigating if we can reconcile ephemeral models here. A ref of an ephemeral model returns the name of a CTE as a string. Maybe we can encode this as a Relation with its type set to CTE, excluding a database and schema. We might already be doing something like this in some places?

Other relevant things to think about

It wouldn't be crazy to support other functions in the future, like nodes.find.resources which returns ParsedNode objects (or their dict representations). This would _not_ build an edge in the graph, and would probably be most useful in context like run-operations

Would love to hear everyone's thoughts on an approach like this!

drewbanin on 16 May 2019

👀3 🎉3

All 8 comments

Thanks for the request @tmastny. This is something we've run into a bunch, and I agree: writing out the model names to union manually is error prone and a chore. I'd need to have a think about how this could work.

I don't want to make dbt parse all of the models in a project twice, but I'm not sure how else to accomplish something like this. I imagine we'd need to do one pass to find all of the models, then another pass to correctly process functions like get_model_names shown above.

Maybe there's some world in which the graph isn't fully finalized until just before "running". After parsing, dbt could conceivably create some more edges between nodes like daily_activity and the dynamic list of nodes that it would select from. In the example above, activity_comment and activity_login would need to run prior to daily_activity, then appear in the results of something like get_model_names.

So, let me ponder this one for a while. Keen to hear your thoughts about any of the above!

drewbanin on 5 Jan 2019

I think I now understand the root of this issue: dbt has no way to dynamically generate dependencies. In other DAG tools such as make, Snakemake, and drake, dynamic dependencies are a part of the design. These dynamic dependencies are found by preprocessing the workflow specification (the makefile, or the models in dbt).

Let's take look at Snakemake, which is a make-like DAG system for Python.

Snakemake

Suppose we are in the following working directory.

.
├── Snakefile
└── text
    ├── hello.txt
    └── world.txt

The objective is to "build" the models in text/ and then "union" the contents into union.txt.

# Snakefile: the snakemake "makefile"
path = 'text/{name}.txt'
names = glob_wildcards(path).name

rule all:
    input:
        'union.txt'

rule union:
    input:
        expand('text/{name}_compiled.txt', name=names)
    output:
        'union.txt'
    shell:
        'cat {input} > {output}'

rule build:
    input:
        path
    output:
        temp('text/{name}_compiled.txt')
    shell:
        'cat {input} > {output}'

This workflow specification alone doesn't determine all the dependencies, since the number of compiled files depends on the structure of text/. However, snakemake will preprocess Snakefile and evaluate the wildcards {. From there, snakemake determines the DAG.

For example, we can evaluate the wildcards without executing the DAG:

# in bash
snakemake --dryrun

# output
Building DAG of jobs...
Job counts:
    count   jobs
    1   all
    2   build
    1   union
    4

rule build:
    input: text/world.txt
    output: text/world_compiled.txt
    jobid: 3
    wildcards: name=world


rule build:
    input: text/hello.txt
    output: text/hello_compiled.txt
    jobid: 2
    wildcards: name=hello


rule union:
    input: text/world_compiled.txt, text/hello_compiled.txt
    output: union.txt
    jobid: 1


localrule all:
    input: union.txt
    jobid: 0

dbt

In dbt, the wildcard-like functionality of jinja templating is overloaded. Not only is the jinja used to expand out variables like {name} in the example above, it is also used to determine the DAG via ref.

The think your second idea is spot on. There needs to be some initial evaluation of the jinja templating, and then a pass through to evaluate the refs. As I mentioned above, I realize this is difficult because right now they are so intertwined.

tmastny on 6 Jan 2019

Nice writeup, thanks @tmastny. One totally different way of pursuing something like this might be a sort of upsert materialization?

Rather than collecting all of the tables to union in daily_activity, it might be interesting to have sub-tables like activity_login and activity_comment upsert their data into a table. This way, each model is responsible for itself and we can side step the whole question of dynamic graph building.

Another benefit of this type of materialization is that if the logic in activity_login changes but activity_comment does not, you can just run activity_login to refresh those specific rows. The naive version would delete all of the records from activity_login before inserting the full dataset. A smarter version would work more like dbt's incremental models. I think there's a way to guarantee idempotence here.

This approach presents other problems, but I have a feeling that they may be more tractable than dynamic graph generation at present. Do you think this approach adequately addresses your use case?

drewbanin on 6 Jan 2019

I'll have to do more research to understand if this could address my use case. I'm relatively new to SQL and I'm not familiar with upsert. I'm also on redshift, which this article indicates might be an additional source of difficulty.

However, I view dbt's DAG as a strength and would love to see more features around DAGs supported!

Thanks for all the hard work on dbt!

tmastny on 6 Jan 2019

cc @beckjake

Ok, let's figure out how to make this happen! I have a bunch of ideas bouncing around in my head, and there are a couple of constraints that will guide this feature.

Thoughts:

This needs to be some sort of post-processing step that happens after parsing but before running
We'd need to do something special for ephemeral models
- ref returns a Relation for most models, but a CTE string name for ephemeral models
- Do we also need to defer ephemeral cte injection until after parsing? I think that might actually be how it already how it works...
- There's merit to returning ParsedNode representations in addition to just Relations
- I don't know that we need to do that in pursuit of this particular feature, but it's certainly adjacent!
- This wouldn't return a Relation (or build an edge), but would instead be a sort of Manifest API
- At some point, we'll probably want to think about how to import dbt from Python and do something similar...
- Each of models, sources, seeds, and archives should be selectable

Proposal:

nodes.ref.resources(*selectors, resource_types=None)

Overview

The nodes.ref family of functions exist to 1) find resources that exist in the graph and 2) dynamically ref them, returning a Relation. The full family is:

nodes.ref.resources(*selectors, resource_types=None)
nodes.ref.models(*selectors)
nodes.ref.sources(*selectors)
nodes.ref.archives(*selectors)
nodes.ref.seeds(*selectors)

These functions should all return Relation objects that match any of the specified selectors. The models/sources/archives/seeds variants are sugar over nodes.ref.resources.

Example usage:

{% for model in nodes.ref.models("events.*") %}
  select * from {{ model }} {% if not loop.last %} union all {% endif %}
{% endfor %}

the syntax for referencing models and source differs between resource types (ref(name) vs. source(source_name, table_name). Pushing this logic to the caller would be unduly difficult/error prone i think
dbt needs to build graph edges in the post-processing step. Ideally, it could do this directly from the arguments passed to invocations of nodes.ref.models. An alternative implementation which returned model names and left refing to the caller would need to re-parse the whole model!

Node selection

The nodes.ref.* family of functions should accept a series of node selectors as varargs. These node selectors should use the same selection syntax as provided on the CLI. This includes:

selecting by fqn (eg. package.*, package.dir.*, dir.path.to.models)
selecting by tag (eg. event)
selecting by name (eg. stg_snowplow_event)
selecting by source (eg. source:snowplow.*+)

If it turns out that there are other selectors that are useful here (like selecting by name prefix, or regex, or similar), we should consider adding those as available selectors on the CLI too.

Implementation

Caveats

Other relevant things to think about

Would love to hear everyone's thoughts on an approach like this!

drewbanin on 16 May 2019

👀3 🎉3

I use this sort of pattern a fair bit, however, I would not use this feature to dynamically get the list of models. I have done this for models that union 13 tables, and tbh writing out the names of 13 models in order to loop over a union is already very dry, and more so it is _explicit_.

The main reason I think this use case is not great is that it amounts to a hidden rule based on the existence or not of models within a directory. What if someone added a new model there, or deleted one? I'd want it to break if that happened, not silently succeed with bad data.

Given the original example:

.
└── activity
    ├── activities
    │   ├── activity_comment.sql
    │   └── activity_login.sql
    └── daily_activity.sql

I would just do this:

{%-
    set source_models = [
         'activity_comment',
         'activity_login'
    ]
-%}

WITH all_sources AS (
{% for source in source_models %}
    SELECT
       col1
      ,col2
      ,col3
    FROM {{ ref(source) }}

    {%- if not loop.last %}
    UNION ALL
    {%- endif %}

{%- endfor %}
)
SELECT * FROM all_sources

By itself that is not useful, so I'd also use materialized='incremental' in the config and add a unique_key which could be the individual model's primary keys if they don't collide, or more likely they do collide so use something like the dbt-utils surrogate_key function to concatenate across some useful type/categorization of the source tables and their primary key.

If there are variations in the columns between source tables, e.g. some source tables need a fake NULL column or require a different column naming, then some jinja ternary statements can help.

e.g. instead of selecting just col2 in the SQL you could do:
{{ 'NULL' if source in [activity_login] else 'col2' }} AS col2

If the rules start to get more complex then instead of setting a list of strings up front I would use a list of dictionaries e.g.

{%-
set source_model_rule = [
    dict(name = 'activity_comment,
           type = 'small',
           depth = 'mid'
    ),
    dict(name = 'activity_login',
           type = 'extra-medium',
           depth = 'deep'
    ) 
]
-%}

On the other hand I do see use cases for this. For example I would like to use this feature for tests. It would be nice to define a test template or pattern to be run for all models in a directory, and additionally be able to access config values for models, e.g. the start_date of the dbt-utils increment_by_period materialization.

davehowell on 18 Sep 2019

I think that's a really fair point @davehowell. I am inclined to agree with you -- making the model filepath subtly significant like this can definitely lead to confusion. I'd probably recommend that a tag selector is used here to fetch all of the models with a given tag. That feels like the right combination of "obvious" and "abstracted" to me.

I think we can decouple something like nodes.find.resources() (or nodes.find.resource()) from the larger question of dynamic dag references. It would also be compelling to make the Relation object returned by ref() include config info about the node itself, but I think we're less well suited to implement that in dbt today. I'll have a further think about this and follow up here with any comments!

drewbanin on 18 Sep 2019

👍2

As requested from @clrcrl , adding a use case here from dbt slack channel: https://getdbt.slack.com/archives/C0VLZPLAE/p1576167419187600

Related to @drewbanin's comment above,
The use case is for conditional post-hooks based on the model tags (i.e. models tagged with bi_tool should get different grants than those tagged with something else) rather than having to put all grants separately under each model section or in each config