Dbt: Full refresh of incremental tables locks the table during build

Created on 11 May 2020 · 5Comments · Source: fishtown-analytics/dbt

Describe the bug

When running a full refresh of incremental tables, the table is locked when the new table is being built. This locks the table for several minutes in production for large tables.

Steps To Reproduce

Run dbt with full refresh on an incremental table that takes a few minutes to build.
Query the table when the full refresh of the table is happening.

Expected behavior

Expected the original table to be available for querying when the temporary table is being built. The table should be locked only when moving the contents from the temporary table to the original table.

Below is a prototype where only the delete from target table and the insert into the target table are part of the transaction. This locks the table for a much smaller duration:

{% call statement(auto_begin=True) %} 
        delete from {{ target_table }};
        insert into {{ target_table }} select * from {{ temporary_table }};
        commit;
 {% endcall %}

System information

Which database are you using dbt with?

[ ] postgres
[x] redshift
[ ] bigquery
[ ] snowflake
[ ] other (specify: ____________)

The output of dbt --version:

installed version: 0.16.0
   latest version: 0.16.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

The operating system you're using:
macOS Catalina 10.15.4

The output of python --version:
Python 3.6.3

Additional context

When I checked in Slack, I was told this may be a solved problem. However, I do notice a significant difference in query time between the two approaches. Let me know if I am missing something.

enhancement good first issue redshifpg

Source

drkarthi

Most helpful comment

I believe the relevant bit of code is here:
https://github.com/fishtown-analytics/dbt/blob/8686ab9a9ddffb63b64d3ee2861f132bf77f54c1/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L20-L27

I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.

Here's the table order of operations:

Create the new object as an intermediate_relation
Rename target_relation --> backup_relation
Rename intermediate_relation --> target_relation
Close the transaction
Drop the backup_relation

The full-refresh mode of the incremental materialization instead does:

Rename target_relation --> backup_relation
Create the new object as target_relation
Close the transaction
Drop the backup_relation

It's worth saying that, in much older versions of dbt, the way the incremental materialization executed full-refresh runs was by... dropping the preexisting table before even opening the transaction to recreate it. So this is definitely better than that, if still not as good as it could be.

@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?

jtcohen6 on 11 May 2020

👍2

All 5 comments

I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.

Here's the table order of operations:

Create the new object as an intermediate_relation
Rename target_relation --> backup_relation
Rename intermediate_relation --> target_relation
Close the transaction
Drop the backup_relation

The full-refresh mode of the incremental materialization instead does:

Rename target_relation --> backup_relation
Create the new object as target_relation
Close the transaction
Drop the backup_relation

@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?

jtcohen6 on 11 May 2020

👍2

hey @jtcohen6 - I can't think of a good reason why incrementals would work differently than tables on redshift. I'd be very supportive of changing the atomic swap flow for 0.18.0!

drewbanin on 12 May 2020

That's great! I am currently working on a custom materialization for our project using the pattern described in @jtcohen6 's response. I would be happy to contribute if that works!

drkarthi on 14 May 2020

As discussed above, changing to the table order of operations helped remove the lock on the table when building and resolved our issue. My change involved a custom materialization with the following lines of code:

https://github.com/fishtown-analytics/dbt/blob/1d543b373734ab78e0ebf441888fb7d7c7d6e8fd/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L25-L26

to something like:

{% set tmp_identifier = this.identifier ~ "__dbt_tmp" %}
{%- set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                                      schema=this.schema,
                                                      database=this.database,
                                                      type='table') -%} 

{% do run_query(create_table_as(False, tmp_relation, sql)) %}
{% do adapter.rename_relation(target_relation, backup_relation) %}
{% do adapter.rename_relation(tmp_relation, target_relation) %}
-- define build_sql so that it does not fail
{% set build_sql = "select 1;" %}

Let me know if it would be helpful to contribute this in a PR or if it would be better for the team to make the change since it involves changes to an important functionality. There are also definitely improvements to be made like removing the dummy sql statements.

drkarthi on 18 May 2020

👍1

@drkarthi sure thing - please do feel free to PR a change for this against dev/marian-anderson. If you have any questions about how to structure the code to remove the dummy sql statement, we'd be happy to take a look at the diff and send through some ideas :)

drewbanin on 21 May 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Feature Request: UNLOGGED Postgres tables

boxysean · 3Comments

Threads cannot be set from an environment variable

keiththompson · 3Comments

log-path (and other output paths) cannot be set to null, because of aggressive defaults

kconvey · 3Comments

ref'ing an ephemeral model inside of a statement does not work

drewbanin · 3Comments

Future art for exposures

jtcohen6 · 3Comments