Allow model developer the ability to choose the set of columns that require updating on incremental loads
This will benefit all users that are performing an incremental update of a wide table with few columns that are mutable
hey @richmintz! Can you say more about this? I think it's a good idea - I'm curious if you're interested in this feature for performance reasons, or something else.
Hi @drewbanin!
My primary goal would be performance, minimizing writes and index rebuilds. However I also think it will generate more concise sql for large tables when a merge is required.
Got it! Sure, this makes a lot of sense in merge statements. Is this on Snowflake/BigQuery? I think a similar paradigm would apply to Redshift/Postgres, but we'd modify the update statement instead of the merge statement.
I can imagine a config like update_columns which could be supplied like:
{{
config(
materialized='incremental',
update_columns=['order_total', 'count_orders']
)
}}
If update_columns is not provided, or if it is null, dbt will default to updating all of the columns in the model.
You buy all of that?
As part of this proposed change, it would also be nice to be able to exclude the when matched then update set part of the merge altogether, as in some of my models I'm only interested in adding new rows since the source data is never updated (for event-based data for example or other append-only tables), and it makes the model execution faster (at least in BigQuery).
It could be a separate config no_update = True or just a convention that doesn't include the SQL block when update_columns is empty.
Please note I tested this and would work for BigQuery, other databases might need a different syntax to support no-op for updates.
Any thoughts?
hey @bfil - if you do not supply a unique_key for your incremental model's config, then dbt should not inject a when matched then update set.... clause to the merge statement.
Check out the implementation here:
https://github.com/fishtown-analytics/dbt/blob/7a07017b96ac332c872725343833a94b49129c68/core/dbt/include/global_project/macros/materializations/common/merge.sql#L23-L48
@drewbanin I know, but I still want to compare and insert only the new data based on that unique key rather than merging on an always FALSE condition. Makes sense?
ah! Sure, that makes a lot of sense. What do you think about pushing this code down from the materialization layer and into the modeling layer. Could you do something like:
{{ config(materialized='incremental') }}
select id, col1, col2 from {{ ref('some_table') }}
{% if is_incremental() %}
where id not in (select id from {{ this }})
{% endif %}
This is a kind of funny adaptation of the typical incremental modeling approach, but it should serve to only select the _new_ records from your source table, inserting them directly into the destination table.
@drewbanin : any update on where we are with this feature...i have a wide table [Snowflake] but i only need to update very few columns in the incremental model. any ETA's on the feature ?
hey @ee07dazn - no movement on this issue on our end to report. I think the spec I mentioned above (pasted below) is still a good idea:
{{
config(
materialized='incremental',
update_columns=['order_total', 'count_orders']
)
}
I think that this feature should only be supported on databases that support merge statements, as dbt's delete+insert approach doesn't really lend itself well to this approach. If anyone is interested in working on this one, I'd be happy to try to point you in the right direction.
The key change here will be to replace the call to adapter.get_columns_in_relation with config.get('update_columns'). We'll want to implement this for both the Snowflake and BigQuery incremental materializations when the merge strategy is used.
Thanks @drewbanin for the information and a pointer. Apart from what you had suggested, i had to make change to the default__get_merge_sql macro to make that approach work. It works now but i am just not happy to make a change to something as low level as default__get_merge_sql. Probably i think i can update the snowflake_get_merge_sql to do this job. Thinking out loud...but thanks for the help.
Hi @drewbanin : i have another question and please bear with me if its a silly one as i am new to the tool and still exploring it through a POC.
So i need to update the incremental materialisation to make sure i can use the "update_column" in config but rest of the materialisation does not need to change.
{% materialization incremental, adapter='snowflake' -%}
{%- set unique_key = config.get('unique_key') -%}
{%- set full_refresh_mode = config.get('should_full_refresh') -%}
{% set target_relation = this %}
{% set existing_relation = load_relation(this) %}
{% set tmp_relation = make_temp_relation(this) %}
{#-- Validate early so we don't run SQL if the strategy is invalid --#}
{% set strategy = dbt_snowflake_validate_get_incremental_strategy(config) -%}
-- setup
{{ run_hooks(pre_hooks, inside_transaction=False) }}
-- `BEGIN` happens here:
{{ run_hooks(pre_hooks, inside_transaction=True) }}
{% if existing_relation is none %}
{% set build_sql = create_table_as(False, target_relation, sql) %}
{% elif existing_relation.is_view %}
{#-- Can't overwrite a view with a table - we must drop --#}
{{ log("Dropping relation " ~ target_relation ~ " because it is a view and this model is a table.") }}
{% do adapter.drop_relation(existing_relation) %}
{% set build_sql = create_table_as(False, target_relation, sql) %}
{% elif full_refresh_mode %}
{% set build_sql = create_table_as(False, target_relation, sql) %}
{% else %}
{% do run_query(create_table_as(True, tmp_relation, sql)) %}
{% do adapter.expand_target_column_types(
from_relation=tmp_relation,
to_relation=target_relation) %}
{% set dest_columns = config.get('update_columns') %}
{% set build_sql = dbt_snowflake_get_incremental_sql(strategy, tmp_relation, target_relation, unique_key, dest_columns) %}
{% endif %}
{%- call statement('main') -%}
{{ build_sql }}
{%- endcall -%}
{{ run_hooks(post_hooks, inside_transaction=True) }}
-- `COMMIT` happens here
{{ adapter.commit() }}
{{ run_hooks(post_hooks, inside_transaction=False) }}
{% set target_relation = target_relation.incorporate(type='table') %}
{% do persist_docs(target_relation, model) %}
{{ return({'relations': [target_relation]}) }}
{%- endmaterialization %}
In this code, while the sql query gets executed correctly, i get the following error :
"Compilation Error in macro materialization_incremental_snowflake (macros/incremental.sql)
'persist_docs' is undefined"
which is the 3rd last line. I want the default macro to be executed if i have not redefined a particular as in this case, i don't need to.
How can i tell dbt to execute the default persist_docs macro in this model ? If you prefer, i can create it as a seperate issue but felt it was tied to the original question.
I did redefine the macro "persist_docs" in my project and made the change in the materialisation to point to my macro. But now i get
" 'dict object' has no attribute 'default__persist_docs' "
UPDATE : I updated to v0.17.0. and there is no issue any longer. Thanks anyways guys for help.
ha! Ok @ee07dazn - glad to hear that you got it working in 0.17.0! I think we should still proceed with getting a feature like this merged into dbt. If you're interested in opening a PR we'd be happy to help, otherwise, someone else might be able to pick it up in the future :)
@drewbanin : sorry, missed this comment. I will open a PR soon for this and tag you on it.
@drewbanin any update on implemeting
Got it! Sure, this makes a lot of sense in
mergestatements. Is this on Snowflake/BigQuery? I think a similar paradigm would apply to Redshift/Postgres, but we'd modify theupdatestatement instead of themergestatement.I can imagine a config like
update_columnswhich could be supplied like:{{ config( materialized='incremental', update_columns=['order_total', 'count_orders'] ) }}If
update_columnsis not provided, or if it isnull, dbt will default to updating all of the columns in the model.You buy all of that?
hey @ee07dazn - no movement on this issue on our end to report. I think the spec I mentioned above (pasted below) is still a good idea:
{{ config( materialized='incremental', update_columns=['order_total', 'count_orders'] ) }I think that this feature should only be supported on databases that support
mergestatements, as dbt's delete+insert approach doesn't really lend itself well to this approach. If anyone is interested in working on this one, I'd be happy to try to point you in the right direction.The key change here will be to replace the call to
adapter.get_columns_in_relationwithconfig.get('update_columns'). We'll want to implement this for both the Snowflake and BigQuery incremental materializations when themergestrategy is used.
@drewbanin any updates on implementing the update_columns feature?
Most helpful comment
Got it! Sure, this makes a lot of sense in
mergestatements. Is this on Snowflake/BigQuery? I think a similar paradigm would apply to Redshift/Postgres, but we'd modify theupdatestatement instead of themergestatement.I can imagine a config like
update_columnswhich could be supplied like:If
update_columnsis not provided, or if it isnull, dbt will default to updating all of the columns in the model.You buy all of that?