The dbt-utils documentation notes that get_tables_by_prefix pairs well with the union_tables macro and I agree. However, get_tables_by_prefix only works with tables in the database, not models within the DBT project.
I propose a new feature that allows us to put model names into a list.
Suppose we have the following model directory:
.
โโโ activity
โโโ activities
โ โโโ activity_comment.sql
โ โโโ activity_login.sql
โโโ daily_activity.sql
daily_activity.sql would like to UNION ALL ref('activity_comment') and ref('activity_login').
Maybe the workflow looks something like this:
-- in daily_activity.sql
{%
for table_name in get_model_names('activities')
%}
SELECT
...
FROM ref(table_name)
Anyone who wants to benefit from DRY principles. The current workflow forces users manually write the names of files to be unioned, instead of depending on the natural structure of the DBT project.
@drewbanin was also interested in this feature in the DBT slack channel, prompting me to create this feature request.
Thanks for the request @tmastny. This is something we've run into a bunch, and I agree: writing out the model names to union manually is error prone and a chore. I'd need to have a think about how this could work.
I don't want to make dbt parse all of the models in a project twice, but I'm not sure how else to accomplish something like this. I imagine we'd need to do one pass to find all of the models, then another pass to correctly process functions like get_model_names shown above.
Maybe there's some world in which the graph isn't fully finalized until just before "running". After parsing, dbt could conceivably create some more edges between nodes like daily_activity and the dynamic list of nodes that it would select from. In the example above, activity_comment and activity_login would need to run prior to daily_activity, then appear in the results of something like get_model_names.
So, let me ponder this one for a while. Keen to hear your thoughts about any of the above!
I think I now understand the root of this issue: dbt has no way to dynamically generate dependencies. In other DAG tools such as make, Snakemake, and drake, dynamic dependencies are a part of the design. These dynamic dependencies are found by preprocessing the workflow specification (the makefile, or the models in dbt).
Let's take look at Snakemake, which is a make-like DAG system for Python.
Suppose we are in the following working directory.
.
โโโ Snakefile
โโโ text
โโโ hello.txt
โโโ world.txt
The objective is to "build" the models in text/ and then "union" the contents into union.txt.
# Snakefile: the snakemake "makefile"
path = 'text/{name}.txt'
names = glob_wildcards(path).name
rule all:
input:
'union.txt'
rule union:
input:
expand('text/{name}_compiled.txt', name=names)
output:
'union.txt'
shell:
'cat {input} > {output}'
rule build:
input:
path
output:
temp('text/{name}_compiled.txt')
shell:
'cat {input} > {output}'
This workflow specification alone doesn't determine all the dependencies, since the number of compiled files depends on the structure of text/. However, snakemake will preprocess Snakefile and evaluate the wildcards {. From there, snakemake determines the DAG.
For example, we can evaluate the wildcards without executing the DAG:
# in bash
snakemake --dryrun
# output
Building DAG of jobs...
Job counts:
count jobs
1 all
2 build
1 union
4
rule build:
input: text/world.txt
output: text/world_compiled.txt
jobid: 3
wildcards: name=world
rule build:
input: text/hello.txt
output: text/hello_compiled.txt
jobid: 2
wildcards: name=hello
rule union:
input: text/world_compiled.txt, text/hello_compiled.txt
output: union.txt
jobid: 1
localrule all:
input: union.txt
jobid: 0
In dbt, the wildcard-like functionality of jinja templating is overloaded. Not only is the jinja used to expand out variables like {name} in the example above, it is also used to determine the DAG via ref.
The think your second idea is spot on. There needs to be some initial evaluation of the jinja templating, and then a pass through to evaluate the refs. As I mentioned above, I realize this is difficult because right now they are so intertwined.
Nice writeup, thanks @tmastny. One totally different way of pursuing something like this might be a sort of upsert materialization?
Rather than collecting all of the tables to union in daily_activity, it might be interesting to have sub-tables like activity_login and activity_comment upsert their data into a table. This way, each model is responsible for itself and we can side step the whole question of dynamic graph building.
Another benefit of this type of materialization is that if the logic in activity_login changes but activity_comment does not, you can just run activity_login to refresh those specific rows. The naive version would delete all of the records from activity_login before inserting the full dataset. A smarter version would work more like dbt's incremental models. I think there's a way to guarantee idempotence here.
This approach presents other problems, but I have a feeling that they may be more tractable than dynamic graph generation at present. Do you think this approach adequately addresses your use case?
I'll have to do more research to understand if this could address my use case. I'm relatively new to SQL and I'm not familiar with upsert. I'm also on redshift, which this article indicates might be an additional source of difficulty.
However, I view dbt's DAG as a strength and would love to see more features around DAGs supported!
Thanks for all the hard work on dbt!
cc @beckjake
Ok, let's figure out how to make this happen! I have a bunch of ideas bouncing around in my head, and there are a couple of constraints that will guide this feature.
Thoughts:
ref returns a Relation for most models, but a CTE string name for ephemeral modelsimport dbt from Python and do something similar...models, sources, seeds, and archives should be selectableProposal:
The nodes.ref family of functions exist to 1) find resources that exist in the graph and 2) dynamically ref them, returning a Relation. The full family is:
nodes.ref.resources(*selectors, resource_types=None)
nodes.ref.models(*selectors)
nodes.ref.sources(*selectors)
nodes.ref.archives(*selectors)
nodes.ref.seeds(*selectors)
These functions should all return Relation objects that match any of the specified selectors. The models/sources/archives/seeds variants are sugar over nodes.ref.resources.
Example usage:
{% for model in nodes.ref.models("events.*") %}
select * from {{ model }} {% if not loop.last %} union all {% endif %}
{% endfor %}
*NOTE: * I think it's important that the ref() (or source()) happens implicitly in the nodes.ref.resources method. We need to capture these references, then post-process them before runtime. An alternative implementation which returns resources names would be hard for two reasons:
ref(name) vs. source(source_name, table_name). Pushing this logic to the caller would be unduly difficult/error prone i thinknodes.ref.models. An alternative implementation which returned model names and left refing to the caller would need to re-parse the whole model!The nodes.ref.* family of functions should accept a series of node selectors as varargs. These node selectors should use the same selection syntax as provided on the CLI. This includes:
package.*, package.dir.*, dir.path.to.models)event)stg_snowplow_event)source:snowplow.*+)If it turns out that there are other selectors that are useful here (like selecting by name prefix, or regex, or similar), we should consider adding those as available selectors on the CLI too.
Calls tonodes.ref.relations should return an empty list at parse-time. The function call should be stored for post-processing. When the function is called at run-time, dbt should evaluate the specified selectors and return a list of matched Relations.
Ephemeral models are going to be annoying here. I personally dislike the idea of having to document this beautiful suite of functions, noting that they either return a Relation for literally every resource type, or a string for ephemeral models :)
It might be worth investigating if we can reconcile ephemeral models here. A ref of an ephemeral model returns the name of a CTE as a string. Maybe we can encode this as a Relation with its type set to CTE, excluding a database and schema. We might already be doing something like this in some places?
It wouldn't be crazy to support other functions in the future, like nodes.find.resources which returns ParsedNode objects (or their dict representations). This would _not_ build an edge in the graph, and would probably be most useful in context like run-operations
Would love to hear everyone's thoughts on an approach like this!
I use this sort of pattern a fair bit, however, I would not use this feature to dynamically get the list of models. I have done this for models that union 13 tables, and tbh writing out the names of 13 models in order to loop over a union is already very dry, and more so it is _explicit_.
The main reason I think this use case is not great is that it amounts to a hidden rule based on the existence or not of models within a directory. What if someone added a new model there, or deleted one? I'd want it to break if that happened, not silently succeed with bad data.
Given the original example:
.
โโโ activity
โโโ activities
โ โโโ activity_comment.sql
โ โโโ activity_login.sql
โโโ daily_activity.sql
I would just do this:
{%-
set source_models = [
'activity_comment',
'activity_login'
]
-%}
WITH all_sources AS (
{% for source in source_models %}
SELECT
col1
,col2
,col3
FROM {{ ref(source) }}
{%- if not loop.last %}
UNION ALL
{%- endif %}
{%- endfor %}
)
SELECT * FROM all_sources
By itself that is not useful, so I'd also use materialized='incremental' in the config and add a unique_key which could be the individual model's primary keys if they don't collide, or more likely they do collide so use something like the dbt-utils surrogate_key function to concatenate across some useful type/categorization of the source tables and their primary key.
If there are variations in the columns between source tables, e.g. some source tables need a fake NULL column or require a different column naming, then some jinja ternary statements can help.
e.g. instead of selecting just col2 in the SQL you could do:
{{ 'NULL' if source in [activity_login] else 'col2' }} AS col2
If the rules start to get more complex then instead of setting a list of strings up front I would use a list of dictionaries e.g.
{%-
set source_model_rule = [
dict(name = 'activity_comment,
type = 'small',
depth = 'mid'
),
dict(name = 'activity_login',
type = 'extra-medium',
depth = 'deep'
)
]
-%}
On the other hand I do see use cases for this. For example I would like to use this feature for tests. It would be nice to define a test template or pattern to be run for all models in a directory, and additionally be able to access config values for models, e.g. the start_date of the dbt-utils increment_by_period materialization.
I think that's a really fair point @davehowell. I am inclined to agree with you -- making the model filepath subtly significant like this can definitely lead to confusion. I'd probably recommend that a tag selector is used here to fetch all of the models with a given tag. That feels like the right combination of "obvious" and "abstracted" to me.
I think we can decouple something like nodes.find.resources() (or nodes.find.resource()) from the larger question of dynamic dag references. It would also be compelling to make the Relation object returned by ref() include config info about the node itself, but I think we're less well suited to implement that in dbt today. I'll have a further think about this and follow up here with any comments!
As requested from @clrcrl , adding a use case here from dbt slack channel: https://getdbt.slack.com/archives/C0VLZPLAE/p1576167419187600
Related to @drewbanin's comment above,
The use case is for conditional post-hooks based on the model tags (i.e. models tagged with bi_tool should get different grants than those tagged with something else) rather than having to put all grants separately under each model section or in each config
Most helpful comment
cc @beckjake
Ok, let's figure out how to make this happen! I have a bunch of ideas bouncing around in my head, and there are a couple of constraints that will guide this feature.
Thoughts:
refreturns a Relation for most models, but a CTE string name for ephemeral modelsimport dbtfrom Python and do something similar...models,sources,seeds, andarchivesshould be selectableProposal:
nodes.ref.resources(*selectors, resource_types=None)
Overview
The
nodes.reffamily of functions exist to 1) find resources that exist in the graph and 2) dynamicallyrefthem, returning a Relation. The full family is:These functions should all return Relation objects that match any of the specified selectors. The
models/sources/archives/seedsvariants are sugar overnodes.ref.resources.Example usage:
*NOTE: * I think it's important that the
ref()(orsource()) happens implicitly in thenodes.ref.resourcesmethod. We need to capture these references, then post-process them before runtime. An alternative implementation which returns resources names would be hard for two reasons:ref(name)vs.source(source_name, table_name). Pushing this logic to the caller would be unduly difficult/error prone i thinknodes.ref.models. An alternative implementation which returned model names and leftrefing to the caller would need to re-parse the whole model!Node selection
The
nodes.ref.*family of functions should accept a series of node selectors as varargs. These node selectors should use the same selection syntax as provided on the CLI. This includes:package.*,package.dir.*,dir.path.to.models)event)stg_snowplow_event)source:snowplow.*+)If it turns out that there are other selectors that are useful here (like selecting by name prefix, or regex, or similar), we should consider adding those as available selectors on the CLI too.
Implementation
Calls to
nodes.ref.relationsshould return an empty list at parse-time. The function call should be stored for post-processing. When the function is called at run-time, dbt should evaluate the specified selectors and return a list of matched Relations.Caveats
Ephemeral models are going to be annoying here. I personally dislike the idea of having to document this beautiful suite of functions, noting that they either return a
Relationfor literally every resource type, or astringfor ephemeral models :)It might be worth investigating if we can reconcile ephemeral models here. A
refof an ephemeral model returns the name of a CTE as a string. Maybe we can encode this as a Relation with its type set to CTE, excluding a database and schema. We might already be doing something like this in some places?Other relevant things to think about
It wouldn't be crazy to support other functions in the future, like
nodes.find.resourceswhich returns ParsedNode objects (or their dict representations). This would _not_ build an edge in the graph, and would probably be most useful in context likerun-operationsWould love to hear everyone's thoughts on an approach like this!