Dbt: Full refresh of incremental tables locks the table during build

Created on 11 May 2020  路  5Comments  路  Source: fishtown-analytics/dbt

Describe the bug

When running a full refresh of incremental tables, the table is locked when the new table is being built. This locks the table for several minutes in production for large tables.

Steps To Reproduce

  1. Run dbt with full refresh on an incremental table that takes a few minutes to build.
  2. Query the table when the full refresh of the table is happening.

Expected behavior

Expected the original table to be available for querying when the temporary table is being built. The table should be locked only when moving the contents from the temporary table to the original table.

Below is a prototype where only the delete from target table and the insert into the target table are part of the transaction. This locks the table for a much smaller duration:

{% call statement(auto_begin=True) %} 
        delete from {{ target_table }};
        insert into {{ target_table }} select * from {{ temporary_table }};
        commit;
 {% endcall %}

System information

Which database are you using dbt with?

  • [ ] postgres
  • [x] redshift
  • [ ] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

installed version: 0.16.0
   latest version: 0.16.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

The operating system you're using:
macOS Catalina 10.15.4

The output of python --version:
Python 3.6.3

Additional context

When I checked in Slack, I was told this may be a solved problem. However, I do notice a significant difference in query time between the two approaches. Let me know if I am missing something.

enhancement good first issue redshifpg

Most helpful comment

I believe the relevant bit of code is here:
https://github.com/fishtown-analytics/dbt/blob/8686ab9a9ddffb63b64d3ee2861f132bf77f54c1/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L20-L27

I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.

Here's the table order of operations:

  1. Create the new object as an intermediate_relation
  2. Rename target_relation --> backup_relation
  3. Rename intermediate_relation --> target_relation
  4. Close the transaction
  5. Drop the backup_relation

The full-refresh mode of the incremental materialization instead does:

  1. Rename target_relation --> backup_relation
  2. Create the new object as target_relation
  3. Close the transaction
  4. Drop the backup_relation

It's worth saying that, in much older versions of dbt, the way the incremental materialization executed full-refresh runs was by... dropping the preexisting table before even opening the transaction to recreate it. So this is definitely better than that, if still not as good as it could be.

@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?

All 5 comments

I believe the relevant bit of code is here:
https://github.com/fishtown-analytics/dbt/blob/8686ab9a9ddffb63b64d3ee2861f132bf77f54c1/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L20-L27

I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.

Here's the table order of operations:

  1. Create the new object as an intermediate_relation
  2. Rename target_relation --> backup_relation
  3. Rename intermediate_relation --> target_relation
  4. Close the transaction
  5. Drop the backup_relation

The full-refresh mode of the incremental materialization instead does:

  1. Rename target_relation --> backup_relation
  2. Create the new object as target_relation
  3. Close the transaction
  4. Drop the backup_relation

It's worth saying that, in much older versions of dbt, the way the incremental materialization executed full-refresh runs was by... dropping the preexisting table before even opening the transaction to recreate it. So this is definitely better than that, if still not as good as it could be.

@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?

hey @jtcohen6 - I can't think of a good reason why incrementals would work differently than tables on redshift. I'd be very supportive of changing the atomic swap flow for 0.18.0!

That's great! I am currently working on a custom materialization for our project using the pattern described in @jtcohen6 's response. I would be happy to contribute if that works!

As discussed above, changing to the table order of operations helped remove the lock on the table when building and resolved our issue. My change involved a custom materialization with the following lines of code:

https://github.com/fishtown-analytics/dbt/blob/1d543b373734ab78e0ebf441888fb7d7c7d6e8fd/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L25-L26

to something like:

{% set tmp_identifier = this.identifier ~ "__dbt_tmp" %}
{%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                                      schema=this.schema,
                                                      database=this.database,
                                                      type='table') -%} 

{% do run_query(create_table_as(False, tmp_relation, sql)) %}
{% do adapter.rename_relation(target_relation, backup_relation) %}
{% do adapter.rename_relation(tmp_relation, target_relation) %}
-- define build_sql so that it does not fail
{% set build_sql = "select 1;" %}

Let me know if it would be helpful to contribute this in a PR or if it would be better for the team to make the change since it involves changes to an important functionality. There are also definitely improvements to be made like removing the dummy sql statements.

@drkarthi sure thing - please do feel free to PR a change for this against dev/marian-anderson. If you have any questions about how to structure the code to remove the dummy sql statement, we'd be happy to take a look at the diff and send through some ideas :)

Was this page helpful?
0 / 5 - 0 ratings