Dbt: Make snapshot data staging atomic

Created on 2 Nov 2019 · 4Comments · Source: fishtown-analytics/dbt

Describe the bug

Note: Further testing is required to determine that this bug exists, and that the description provided here accurately depicts the errant behavior and steps to reproduce it.

The dbt snapshot command selects from the source dataset twice, once to generate a set of inserts, and then again to generate a set of updates. If these queries are run outside of a transaction (ie. on Snowflake or BigQuery), then it's possible for changes to the underlying source data to present two different views of the staging data to the snapshot query.

As such, it's possible to dbt to invalidate an existing snapshot record without correspondingly inserting a new entry for that record. This leaves the snapshot table in a bad state: since there is no "current" record in the Snapshot table, subsequent dbt snapshot invocations do not record changes to the underlying source data for "closed out" records as identified by their unique keys.

Steps To Reproduce

This is in the category of a "race condition". It appears that this issue can manifest if the source data for a snapshot is modified between the execution of these two sql statements: https://github.com/fishtown-analytics/dbt/blob/f985902a002fba36f6f709c6aacf9ae20778e58c/core/dbt/include/global_project/macros/materializations/snapshot/snapshot.sql#L165-L174

We should either combine these queries into a single query, or stage the source data into a temporary table in a way that guarantees consistency between the two queries. The code here was written this way for simplicity, but there's no real reason why the two queries cannot be combined into one.

Expected behavior

Mutations to source data during a dbt snapshot should be ignored and picked up in a subsequent snapshot job.

System information

Which database are you using dbt with?

[x] bigquery
[x] snowflake

The output of dbt --version:

>=0.14.0

Additional context

In addition to the primary defect noted here, dbt snapshot should be improved to handle (or help users fix) cases where there are no "current" records for a unique key in a snapshot table. This is basically going to be dbt's fault 100% of the time, so it should be easy to "fix" that if it should happen again in the future.

Secondly, dbt should probably also enforce the constraints that it requires by default in snapshot definitions. If a table's primary key is non-unique, or if the updated_at timestamp is null in snapshots configured with the timestamp strategy, then dbt snapshot is going to fail _very_ hard. We should help users determine when this is the case and prevent dbt snapshot from getting itself into a bad state.

Snapshots bigquery bug snowflake

Source

drewbanin

👍3 👀1

Most helpful comment

The reason, by the way, for the 9999-01-01 date in the dbt_valid_to column is that these SQL results were from a view that we have on top of the snapshot that casts NULLs in that column to "future eternity." But yes, I have checked in the underlying snapshot data, and the problem does exist there.

jrandrews on 2 Apr 2020

👍2

All 4 comments

This caused some problems for us yesterday. We have a snapshot. The underlying source table added a column and didn’t increment the lastchanged field (don’t ask, long story). So I tried to manually essentially force a new version of the record in the underlying table for the snapshot by running this SQL command:

SELECT * (EXCEPT lastchanged), CURRENT_TIMESTAMP() as lastchanged FROM {{ source (‘my_source_table’) }}

Well, then this happened in the snapshot records (filtering for a stream of changes to one common id value, look at how the valid_to and valid_from timestamps overlap for the 2nd and third row):

Screen Shot 2020-04-02 at 9 57 00 AM

My guess is that because it is two different SQL statements being run, the value of CURRENT_TIMESTAMP() was different at the exact moment that the two of them were being run. It would be great to get this issue addressed so things like this don't happen.

Luckily I have a backup of the snapshot before I did this and I can go back and re-try it another way, probably just doing something like TIMESTAMP(DATETIME('2020-04-02 10:00:00'), 'UTC'))

jrandrews on 2 Apr 2020

👍2

Would this bug also cause cases where an "old" version of a record in a snapshot would be end-dated but a new record with a newer "lastchanged" timestamp would not be added? We're seeing this happen and regardless of how many times we run the "snapshot" command the new record never gets inserted.

jrandrews on 25 Apr 2020

@jrandrews I think it could! dbt runs:

An insert for all brand-new records, plus new versions of existing records
An update for all invalidated records

If a new data point which would invalidate an existing snapshot record is added to the snapshot query between (1) and (2), then the record in the snapshot table would be invalidated without a corresponding new version being inserted.

Lots of really compelling reasons to fix this issue for sure

drewbanin on 28 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings