Dbt: Duplicate current rows in snapshot table

Created on 8 Jan 2020 · 7Comments · Source: fishtown-analytics/dbt

Describe the bug

dbt snapshot runs will fail after some error creates duplicate rows that each have a null dbt_valid_to value. At some point during a previous run, duplicate rows are generated that result in an error saying Database Error in snapshot user_campaign_audit (snapshots/user_campaign_audit.sql) 100090 (42P18): Duplicate row detected during DML action
when a subsequent snapshot run is invoked.

Steps To Reproduce

Honestly, not sure how to reproduce this! We are using a Fivetran/Snowflake set up, with dbt running on an hourly GitLab CI/CD pipeline. Pipelines run after the Fivetran load is finished. The snapshot runs after the dbt run and dbt test stages. We have not observed any errors in the unique schema test on the underlying table that is being snapshotted (i.e. there are not any duplicate records in the actual data).

Expected behavior

We would expect the snapshot function to accurately update the snapshot table, and only have one single record as the currently valid record (having a null dbt_valid_to field)

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

Which database are you using dbt with?

[ ] postgres
[ ] redshift
[ ] bigquery
[x] snowflake
[ ] other (specify: ____________)

The output of dbt --version:

0.15.0

The operating system you're using:

The output of python --version:
docker image python:3.7-slim-buster

Additional context

Add any other context about the problem here.

bug snowflake

Source

dave-connors3

👍1

All 7 comments

Thanks for the report @dave-connors3!

Check out this similar, but distinct, issue: https://github.com/fishtown-analytics/dbt/issues/1884

1884 looks like it's caused by a data load that happens concurrently with a `dbt snapshot` invocation. From your description, it doesn't sound like it's possible for a Fivetran load and `dbt snapshot` to occur at the same time.

The error you're seeing, Duplicate row detected during DML action, occurs when a merge statement encounters two records in the merge _source_ that correspond to a single row in the merge _destination_. Check out a quick example:

create or replace table dbt_dbanin.merge_dest_debug as (
    select 1 as id, 'drew1' as name
);

create or replace table dbt_dbanin.merge_source_debug as (
    select 1 as id, 'drew2' as name union all
    select 1 as id, 'dave' as name
);

merge into dbt_dbanin.merge_dest_debug
using dbt_dbanin.merge_source_debug
on merge_source_debug.id = merge_dest_debug.id
when matched then update set name = merge_source_debug.name
when not matched then insert (id, name) values (id, name);

--------------------------------------------------------------
ERROR: Duplicate row detected during DML action Row Values: [1, "drew1"]

For snapshots, the merge statement that dbt uses on Snowflake can be found here:
https://github.com/fishtown-analytics/dbt/blob/7a07017b96ac332c872725343833a94b49129c68/core/dbt/include/global_project/macros/materializations/snapshot/snapshot_merge.sql#L10-L23

So, I would definitely expect to see this failure if you had obvious duplicates (records with the same configured unique_key in the result of your snapshot source query. Since you said that you have a uniqueness test on this field, that probably means that the duplicate is coming in via one of the internal snapshot queries that dbt runs:

https://github.com/fishtown-analytics/dbt/blob/7a07017b96ac332c872725343833a94b49129c68/core/dbt/include/global_project/macros/materializations/snapshot/snapshot.sql#L110-L120

The join here will definitely multiply source records for each duplicate present in the destination snapshot table. A good fix for this might be to use a row_number() window function to filter for only a single record in the destination table, but the best fix for this would ensure that dupes don't end up in the destination table at all :)

There's still an open question about how these dupes got into the destination table in the first place. The only thing I can imagine is that we _might_ see this behavior if two dbt snapshot invocations were to occur at the same time. There are probably clever ways for us to make dbt snapshot more robust for this particular failure mode. Do you think something like this could have been possible in your setup?

drewbanin on 20 Jan 2020

👍1

Hey Drew --

This appears to have happened again today. The sequencing of our dbt run goes compile --> run --> test (uniqueness) --> snapshot. The snapshot is performed on one of our dbt models that passed the run and test stages, so no duplicates are present in the source table. I checked our Fivetran logs, and the database this model is based on finished it's last run at 11:17am CT, and the dbt pipeline did not start running until 11:22am CT. I think this means we can rule out concurrent data loading and the presence of duplicated records in the source table. The snowflake logs do not seem to indicate that there was any other snapshot running at the same time.

I don't think I can pin down anything that you outlined above that would trip up the snapshot functionality.

dave-connors3 on 27 Jan 2020

I am having this same issue, it first occurred on 4/6/2020. System context: we are running DBT on Snowflake, kicked off by Airflow managed by Astronomer.

We ensure we do not have duplicates in the table-to-be-snapshot by using a qualify statement in its model definition:

...
qualify row_number()
  over (
    partition by entity_id
    order by entity_id
) = 1

That entity_id is then used as the unique_key in the snapshot definition. This snapshot is using the check strategy on all columns.

We use Airflow to kickoff the snapshot after dbt run finishes, but we do have two DBT dags, one that runs hourly and one that runs daily, which could cause the snapshot to be issued during a data load. Is there a way to figure out if this is the case?

rsenseman on 16 Apr 2020

thanks @rsenseman and @dave-connors3. I do want to get to the bottom of why this is happening, but increasingly my thinking is just that we should make snapshots more resilient to dupe issues. That probably means running an extra query on the staging table to ensure uniqueness, and throwing an error instead of inserting dupes.

I'm digging way into this code over the next couple of days, and I'd like to come out the other side with a plan for updating snapshots to remove this entire class of errors

drewbanin on 21 Apr 2020

@drewbanin after I investigated further, I think our issue is caused by #1884 linked above, I just commented here before I saw that other issue

rsenseman on 21 Apr 2020

👍1

Thanks Drew!

For what it's worth, since we've upgraded dbt, we haven't run into any
additional issues, including after making schema changes to the snapshotted
table.

On Tue, Apr 21, 2020 at 4:36 PM Drew Banin notifications@github.com wrote:

thanks @rsenseman https://github.com/rsenseman and @dave-connors3
https://github.com/dave-connors3. I do want to get to the bottom of why
this is happening, but increasingly my thinking is just that we should make
snapshots more resilient to dupe issues. That probably means running an
extra query on the staging table to ensure uniqueness, and throwing an
error instead of inserting dupes.

I'm digging way into this code over the next couple of days, and I'd like
to come out the other side with a plan for updating snapshots to remove
this entire class of errors

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/fishtown-analytics/dbt/issues/2034#issuecomment-617427180,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AOHFHTETUKT6RZT33GNCVW3RNYGVPANCNFSM4KELHO7Q
.

dave-connors3 on 21 Apr 2020

👍1

closing this in favor of https://github.com/fishtown-analytics/dbt/issues/1884

drewbanin on 22 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

minimize downtime for incremental models with full-refresh

drewbanin · 4Comments

Future art for exposures

jtcohen6 · 3Comments

Support a flag to override the target and logs directories on the CLI

drewbanin · 3Comments

ref'ing an ephemeral model inside of a statement does not work