Dbt: Specify Columns to update on incremental Models

Created on 24 Oct 2019 · 15Comments · Source: fishtown-analytics/dbt

Describe the feature

Allow model developer the ability to choose the set of columns that require updating on incremental loads

Who will this benefit?

This will benefit all users that are performing an incremental update of a wide table with few columns that are mutable

enhancement good first issue

Source

richmintz

Most helpful comment

Got it! Sure, this makes a lot of sense in merge statements. Is this on Snowflake/BigQuery? I think a similar paradigm would apply to Redshift/Postgres, but we'd modify the update statement instead of the merge statement.

I can imagine a config like update_columns which could be supplied like:

{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}}

If update_columns is not provided, or if it is null, dbt will default to updating all of the columns in the model.

You buy all of that?

drewbanin on 17 Nov 2019

👍4 😕1

All 15 comments

hey @richmintz! Can you say more about this? I think it's a good idea - I'm curious if you're interested in this feature for performance reasons, or something else.

drewbanin on 28 Oct 2019

Hi @drewbanin!
My primary goal would be performance, minimizing writes and index rebuilds. However I also think it will generate more concise sql for large tables when a merge is required.

richmintz on 4 Nov 2019

I can imagine a config like update_columns which could be supplied like:

{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}}

If update_columns is not provided, or if it is null, dbt will default to updating all of the columns in the model.

You buy all of that?

drewbanin on 17 Nov 2019

👍4 😕1

As part of this proposed change, it would also be nice to be able to exclude the when matched then update set part of the merge altogether, as in some of my models I'm only interested in adding new rows since the source data is never updated (for event-based data for example or other append-only tables), and it makes the model execution faster (at least in BigQuery).

It could be a separate config no_update = True or just a convention that doesn't include the SQL block when update_columns is empty.

Please note I tested this and would work for BigQuery, other databases might need a different syntax to support no-op for updates.

Any thoughts?

bfil on 17 Jan 2020

hey @bfil - if you do not supply a unique_key for your incremental model's config, then dbt should not inject a when matched then update set.... clause to the merge statement.

Check out the implementation here:
https://github.com/fishtown-analytics/dbt/blob/7a07017b96ac332c872725343833a94b49129c68/core/dbt/include/global_project/macros/materializations/common/merge.sql#L23-L48

drewbanin on 21 Jan 2020

@drewbanin I know, but I still want to compare and insert only the new data based on that unique key rather than merging on an always FALSE condition. Makes sense?

bfil on 21 Jan 2020

ah! Sure, that makes a lot of sense. What do you think about pushing this code down from the materialization layer and into the modeling layer. Could you do something like:

{{ config(materialized='incremental') }}

select id, col1, col2 from {{ ref('some_table') }}
{% if is_incremental() %}
where id not in (select id from {{ this }})
{% endif %}

This is a kind of funny adaptation of the typical incremental modeling approach, but it should serve to only select the _new_ records from your source table, inserting them directly into the destination table.

drewbanin on 21 Jan 2020

@drewbanin : any update on where we are with this feature...i have a wide table [Snowflake] but i only need to update very few columns in the incremental model. any ETA's on the feature ?

ee07dazn on 8 Jun 2020

hey @ee07dazn - no movement on this issue on our end to report. I think the spec I mentioned above (pasted below) is still a good idea:

{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}

I think that this feature should only be supported on databases that support merge statements, as dbt's delete+insert approach doesn't really lend itself well to this approach. If anyone is interested in working on this one, I'd be happy to try to point you in the right direction.

The key change here will be to replace the call to adapter.get_columns_in_relation with config.get('update_columns'). We'll want to implement this for both the Snowflake and BigQuery incremental materializations when the merge strategy is used.

drewbanin on 8 Jun 2020

Thanks @drewbanin for the information and a pointer. Apart from what you had suggested, i had to make change to the default__get_merge_sql macro to make that approach work. It works now but i am just not happy to make a change to something as low level as default__get_merge_sql. Probably i think i can update the snowflake_get_merge_sql to do this job. Thinking out loud...but thanks for the help.

ee07dazn on 9 Jun 2020

👍1

Hi @drewbanin : i have another question and please bear with me if its a silly one as i am new to the tool and still exploring it through a POC.

So i need to update the incremental materialisation to make sure i can use the "update_column" in config but rest of the materialisation does not need to change.

{% materialization incremental, adapter='snowflake' -%}

  {%- set unique_key = config.get('unique_key') -%}
  {%- set full_refresh_mode = config.get('should_full_refresh') -%}

  {% set target_relation = this %}
  {% set existing_relation = load_relation(this) %}
  {% set tmp_relation = make_temp_relation(this) %}

  {#-- Validate early so we don't run SQL if the strategy is invalid --#}
  {% set strategy = dbt_snowflake_validate_get_incremental_strategy(config) -%}

  -- setup
  {{ run_hooks(pre_hooks, inside_transaction=False) }}

  -- `BEGIN` happens here:
  {{ run_hooks(pre_hooks, inside_transaction=True) }}

  {% if existing_relation is none %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% elif existing_relation.is_view %}
    {#-- Can't overwrite a view with a table - we must drop --#}
    {{ log("Dropping relation " ~ target_relation ~ " because it is a view and this model is a table.") }}
    {% do adapter.drop_relation(existing_relation) %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% elif full_refresh_mode %}
    {% set build_sql = create_table_as(False, target_relation, sql) %}
  {% else %}
    {% do run_query(create_table_as(True, tmp_relation, sql)) %}
    {% do adapter.expand_target_column_types(
           from_relation=tmp_relation,
           to_relation=target_relation) %}
    {% set dest_columns = config.get('update_columns') %}
    {% set build_sql = dbt_snowflake_get_incremental_sql(strategy, tmp_relation, target_relation, unique_key, dest_columns) %}
  {% endif %}

  {%- call statement('main') -%}
    {{ build_sql }}
  {%- endcall -%}

  {{ run_hooks(post_hooks, inside_transaction=True) }}

  -- `COMMIT` happens here
  {{ adapter.commit() }}

  {{ run_hooks(post_hooks, inside_transaction=False) }}

  {% set target_relation = target_relation.incorporate(type='table') %}
  {% do persist_docs(target_relation, model) %}
  {{ return({'relations': [target_relation]}) }}
{%- endmaterialization %}

In this code, while the sql query gets executed correctly, i get the following error :

"Compilation Error in macro materialization_incremental_snowflake (macros/incremental.sql)
  'persist_docs' is undefined"

which is the 3rd last line. I want the default macro to be executed if i have not redefined a particular as in this case, i don't need to.

How can i tell dbt to execute the default persist_docs macro in this model ? If you prefer, i can create it as a seperate issue but felt it was tied to the original question.

ee07dazn on 9 Jun 2020

I did redefine the macro "persist_docs" in my project and made the change in the materialisation to point to my macro. But now i get
" 'dict object' has no attribute 'default__persist_docs' "

UPDATE : I updated to v0.17.0. and there is no issue any longer. Thanks anyways guys for help.

ee07dazn on 10 Jun 2020

👍1

ha! Ok @ee07dazn - glad to hear that you got it working in 0.17.0! I think we should still proceed with getting a feature like this merged into dbt. If you're interested in opening a PR we'd be happy to help, otherwise, someone else might be able to pick it up in the future :)

drewbanin on 12 Jun 2020

@drewbanin : sorry, missed this comment. I will open a PR soon for this and tag you on it.

ee07dazn on 15 Jul 2020

@drewbanin any update on implemeting

Got it! Sure, this makes a lot of sense in merge statements. Is this on Snowflake/BigQuery? I think a similar paradigm would apply to Redshift/Postgres, but we'd modify the update statement instead of the merge statement.

I can imagine a config like update_columns which could be supplied like:
{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}}
If update_columns is not provided, or if it is null, dbt will default to updating all of the columns in the model.

You buy all of that?

hey @ee07dazn - no movement on this issue on our end to report. I think the spec I mentioned above (pasted below) is still a good idea:
{{
  config(
    materialized='incremental',
    update_columns=['order_total', 'count_orders']
  )
}
I think that this feature should only be supported on databases that support merge statements, as dbt's delete+insert approach doesn't really lend itself well to this approach. If anyone is interested in working on this one, I'd be happy to try to point you in the right direction.

The key change here will be to replace the call to adapter.get_columns_in_relation with config.get('update_columns'). We'll want to implement this for both the Snowflake and BigQuery incremental materializations when the merge strategy is used.

@drewbanin any updates on implementing the update_columns feature?