I would like to propose the idea of adding ttl as part of a model's optional configurations.
Simply put, this would give an indicator of the expected life expectancy of the data. If a run is called that involves the said model and it's life expectancy has not expired, the run job can simply ignore the materialization of this table to save time.
ttl could be defined as a unit of time in seconds, in the example below, the model would be refreshed at most once every 3600 seconds (1h)
{{-
config(
ttl=3600,
)
-}}
...
I think this feature would be most useful for larger data pipelines that may have cross dependencies. As a project grows, it is more likely that different datasets require different refresh rates.
For example, you may have one table, tableA, that is pulling info from a source table that updates once a day. Another table, tableB, pulls info from a source table that updates every minute. Assuming both of these tables are required to materialize yet another table, tableC, we understand logically that it is less than optimal to refresh tableA at the same rate as tableB and tableC. In fact, we could leave tableA alone for a while and still keep tableC 'fresh' for most of the day.
ttl would allow data engineers to provide extra information to prevent unecessary computation and reduce time required to refresh entire data pipelines.
hey @mlavoie-sm360 - thanks for the feature request!
We typically recommend handling something like this from an orchestrator. You might be able to approximate this by adding tags to your models (like hourly, and nightly), then just re-running those models on a specified interval. That might look like:
# Run this hourly
dbt run --models tag:hourly+
# Run this nightly
dbt run
You also mentioned that you have source tables that are loaded with varying frequencies. Maybe another option is to use sources to denote the link between your source tables and dbt models? Then you could do:
# Run models that depend on Source A hourly
dbt run --models source:sourceA+
# Run models that depends on Source B more frequently
dbt run --model source:sourceB+
I think the ttl you've described here is pretty interesting, but there are a couple of issues I'd like to think through further:
Of these, I think the second issue is the bigger concern -- we could figure out some way to store state if it proved valuable for a use case like this. I have a harder time wrapping my head around how the ttl should behave in typical workflow settings. Curious to hear your thoughts on if this logic can be pushed into the orchestrator, or if you think there are other benefits of using a ttl here.
Thanks again!
Hey @drewbanin - thanks for the feedback!
You are right when you talk about approximating this behaviour by using tags and an orchestrator. In fact, that is our current approach to solving this issue, but I know it could be better. Bear with me here.
Our orchestrator is Airflow. In our SQL, we use tags to identify when things should be refreshed (daily, twice_a_day, every_3_hours, hourly, near_rt, etc.). These tags are only applied to 'leaf' queries, the ones that don't have children.
We have a handful of tables that we have identified as candidates for TTLing, these are usually part of a schedule that refreshes more often than is really necessary _for that table_. For example, tableA is part of a flow that refreshes every hour because data in tableB refreshes every hour and they are both required to generate tableC, as I had explained earlier.
To avoid refreshing tableA, we apply a tag:ttl to tableA and our orchestrator applies an --exclude tag:ttl whenenver it runs it's _normal_ scheduled jobs (daily, twice_a_day, etc.). Our orchestrator then has a separate task for tableA with it's own custom schedule. A new task is created manually for each new ttl tagged tables.
It works!
But I feel like it could be simpler DRYer. Every time we identify one of these TTL candidates, two things need to happen, the SQL needs to be tagged as a ttl, a new task in the orchestrator needs to be created to run that table on it's own schedule. This creates a risk, there is no way to know that every ttl has it's matching task. Again, on a small scale that is not an issue, as your pipeline grows, the concerns grow with it. If both of these things could happen in an atomic kind of way, risk would be reduced.
I shared all of this to add more context to the conversation, hopefully it helps paint a clearer picture for pinning down the need I am hoping this feature would cover.
In regards to your two questions, I think they are right on, so let me play ball a little with you and give it a go.
I'll start with this one: _what happens when the model logic changes? Do consumers just get the old version of the table until the ttl expires?_
Absolutely, things will keep running as they are until the time to live expires. What if I need it to expire right away? Perhaps, using force-refresh could ignore the ttl flag, like sodbt run -m +tag:daily --force-refresh?
No matter the approach taken, I feel like that question has the same answer, the underlying question is the one I feel we should ask, how do you refresh the table earlier if required?
Second: _dbt doesn't store any state outside of the database - where should we record the timeout timestamp?_
Great question, I don't know, I don't know enough about the inner workings of DBT to make a statement on that. Since we are bouncing ideas, I am going to go in a completely different direction then the one I intended in the first place, but what if dbt didn't handle any of this other then to expose the information?
What if I could do something like this and ask dbt to list every query tagged ttl and then have an elegant way of also getting the configuration values in a json format for example.
{{-
config(
tags=['ttl'],
my_custom_var='0 0 * * *'
)
-}}
select * from ...
It would then be easy to create a much DRYer task in the orchestrator asking for a list of tables with a given tag and then pulling out the required info to derive the schedule for that task. We would then have increased the flexibility of DBT without increasing it's complexity.
DBT's role here would be to provide enough flexibility within it's own framework to be leveraged by other systems to do more complex things without adding more bells and whistles.
What are your thoughts?
Two other ideas regarding the question where should we record the timeout timestamp?
source snapshot-freshnessHey @mlavoie-sm360 - thanks for the measured and thoughtful response! I think you made an _incredibly_ good point here. What do you think about this?
What if I could do something like this and ask dbt to list every query tagged ttl and then have an elegant way of also getting the configuration values in a json format for example.
We're adding a dbt ls command for v0.14.0. This command accepts model selectors, then returns all of the matching resources (eg. models).
The feature spec shows some options (docs to come!). You could do something like:
$ dbt ls --models tag:ttl --output json
{
"name": "my_model",
"resource_type": "model",
"package_name": "debug",
...
"config": {
"enabled": true,
"materialized": "incremental",
"my_custom_var": "0 0 * * *"
}
}
The output is one line per matched resource, so you could pipe this into some process that plucked out a config field and operated on it accordingly.
I feel confident that would allow us to be DRY and ATOMIC in our setup. I am looking forward to giving this approach a try.
Hey @mlavoie-sm360 - dbt ls is merged and going live in dbt v0.14.0! Check out the (prerelease) docs here: https://docs.getdbt.com/v0.14/reference#list
Most helpful comment
Hey @mlavoie-sm360 -
dbt lsis merged and going live in dbt v0.14.0! Check out the (prerelease) docs here: https://docs.getdbt.com/v0.14/reference#list