When running a full refresh of incremental tables, the table is locked when the new table is being built. This locks the table for several minutes in production for large tables.
Expected the original table to be available for querying when the temporary table is being built. The table should be locked only when moving the contents from the temporary table to the original table.
Below is a prototype where only the delete from target table and the insert into the target table are part of the transaction. This locks the table for a much smaller duration:
{% call statement(auto_begin=True) %}
delete from {{ target_table }};
insert into {{ target_table }} select * from {{ temporary_table }};
commit;
{% endcall %}
Which database are you using dbt with?
The output of dbt --version:
installed version: 0.16.0
latest version: 0.16.1
Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
The operating system you're using:
macOS Catalina 10.15.4
The output of python --version:
Python 3.6.3
When I checked in Slack, I was told this may be a solved problem. However, I do notice a significant difference in query time between the two approaches. Let me know if I am missing something.
I believe the relevant bit of code is here:
https://github.com/fishtown-analytics/dbt/blob/8686ab9a9ddffb63b64d3ee2861f132bf77f54c1/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L20-L27
I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.
Here's the table order of operations:
intermediate_relationtarget_relation --> backup_relationintermediate_relation --> target_relationbackup_relationThe full-refresh mode of the incremental materialization instead does:
target_relation --> backup_relationtarget_relationbackup_relationIt's worth saying that, in much older versions of dbt, the way the incremental materialization executed full-refresh runs was by... dropping the preexisting table before even opening the transaction to recreate it. So this is definitely better than that, if still not as good as it could be.
@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?
hey @jtcohen6 - I can't think of a good reason why incrementals would work differently than tables on redshift. I'd be very supportive of changing the atomic swap flow for 0.18.0!
That's great! I am currently working on a custom materialization for our project using the pattern described in @jtcohen6 's response. I would be happy to contribute if that works!
As discussed above, changing to the table order of operations helped remove the lock on the table when building and resolved our issue. My change involved a custom materialization with the following lines of code:
to something like:
{% set tmp_identifier = this.identifier ~ "__dbt_tmp" %}
{%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
schema=this.schema,
database=this.database,
type='table') -%}
{% do run_query(create_table_as(False, tmp_relation, sql)) %}
{% do adapter.rename_relation(target_relation, backup_relation) %}
{% do adapter.rename_relation(tmp_relation, target_relation) %}
-- define build_sql so that it does not fail
{% set build_sql = "select 1;" %}
Let me know if it would be helpful to contribute this in a PR or if it would be better for the team to make the change since it involves changes to an important functionality. There are also definitely improvements to be made like removing the dummy sql statements.
@drkarthi sure thing - please do feel free to PR a change for this against dev/marian-anderson. If you have any questions about how to structure the code to remove the dummy sql statement, we'd be happy to take a look at the diff and send through some ideas :)
Most helpful comment
I believe the relevant bit of code is here:
https://github.com/fishtown-analytics/dbt/blob/8686ab9a9ddffb63b64d3ee2861f132bf77f54c1/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L20-L27
I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.
Here's the table order of operations:
intermediate_relationtarget_relation --> backup_relationintermediate_relation --> target_relationbackup_relationThe full-refresh mode of the incremental materialization instead does:
target_relation --> backup_relationtarget_relationbackup_relationIt's worth saying that, in much older versions of dbt, the way the incremental materialization executed full-refresh runs was by... dropping the preexisting table before even opening the transaction to recreate it. So this is definitely better than that, if still not as good as it could be.
@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?