Dbt: [Wilt Chamberlain] Merge on Snowflake fails if unique_key values are not unique

Created on 19 Jun 2019  路  9Comments  路  Source: fishtown-analytics/dbt

Issue / Discussion

Description

There are known benefits to using merge on Snowflake:

  1. it's atomic
  2. it's simple
  3. it appears to be the recommended way to upsert records by some primary key

There is however a drawback: the merge statement will fail if the primary key used in the merge is not unique. For incremental models that merge by a primary key (like an order_id), this generally shouldn't be a problem, and might in fact be classified as a "feature".

For other types of incremental strategies, the merge statement pose problems. Approaches like this one incrementally load records by date, and they use an intentionally non-unique unique_key. This worked previously because dbt would simply delete data for the days present in the incremental model, then re-insert the results of the model select statement.

I feel somewhat comfortable saying something like "the name of the config is unique_key, and you shouldn't have used a non-unique value here", but I do think this is a legitimate and good incremental strategy, and it's one that I think we should continue to support if possible.

Possible approaches:

  1. Add a boolean config like incremental_merge for Snowflake which switches between the merge and delete+insert implementations (not terrible)
  2. Roll back merge for Snowflake (undesirable)
  3. Make a different materialization called merge which specifically implements the merge statement (probably undesirable)
  4. Make a different materialization called incremental_batch (or similar) which implements the delete+insert pattern (probably undesirable)

I think that the merge pattern is going to work well for the majority of existing and new incremental models out there. In testing, I actually found that some of our own internal-analytics incremental models fail with a non-unique error due to bugs in our model code, so this uniqueness assertion in the merge statement is a really handy thing to have!

What do y'all think?

help wanted snowflake

All 9 comments

As I mentioned previously, most of our incremental load patterns revolve around dates and specifically units of full days. This is similar to how Airflow looks at execution dates.

For fact tables, where we see the only real use case for incremental loads, the smallest unit of work is the day (and in a few small cases the hour of day) and rarely if ever an ID, such as Order ID. Order IDs, or keys or any unique identifers often have very little analytical significance in a data warehouse setting, so their presence is merely informational, and often optional, while the Order Date in this case is meaningful analytically.
So, as a logical pattern to drive ETL jobs, it makes sense to us to use business dates such as Order Date since they represent natural sequential slices to process new incoming data.

This pattern is somewhat rooted in our experience with Hive and Hadoop ETL, where working with well partitioned data is key to performance. Since in our use cases, most fact tables in those environments, as well as in Redshift, BigQuery and Snowflake are partitioned (or in Snowflake's case, clustered or at least sorted) by date, it makes sense to take advantage of those partitions when selecting and inserting new data.

In addition, many of our use cases involve late-arriving facts so most of our incremental dbt models are based on a sliding window of last n days. This guarantees that we not only get new data, but also process updated, as well as deleted rows where applicable.
If we were to match only at the ID level, we would not be deleting rows from our fact table that no longer exists in the incremental data from our source system.
While we may not always want to delete such rows from our fact tables, the date-based incremental load pattern gives us at least the ability to do that when needed.

I would love it if dbt could support both patterns, and if the existing incremental implementation would not break our existing load jobs.
I think option 3) above sounds reasonable to me and wouldn't just have to be limited to Snowflake, if the other paths support incremental load as intended?
So we'd have incremental which uses insert/delete on all platforms and merge which uses merge on platforms where possible.

I agree with @clausherther. All of my incremental logic revolves around simple dates too. I used to use incremental logic that would try to be very surgical and load only updated/new unique keys, but it was always overly complex to calculate and determine those. I found it much simpler to explain and reason with for what is being loaded/updated in date chunks. The increased I/O "penalty" from a daily incremental is easily worth it for my own sanity.

Another example of daily incremental really shining is when we've had awkward moments of data fidelity issues starting after x date. It was effortless to get dbt to reload those bad dates via the last n days logic. If anything, it was basically self-healing. Whereas a more surgical incremental would probably require a --full-refresh.

I would really like to see this capability maintained in dbt since it is such a simple and powerful incremental technique. Options 1 and 3 are okay for me.

Thanks for the cogent and detailed responses @joshpeng and @clausherther! We're going to add a sort of "strategy" that switches between the merge and insert_delete mechanisms for building incremental models. I'll make sure there's an easy way to flip this switch for all of the models in your project

Thanks @drewbanin for raising this issue and working so closely with the dbt community!

Are there plans to support this in GBQ?

hey @amcarvalho - no current plans to support this strategy on BigQuery, but I do agree that it would be worthwhile and compelling! Do you mind opening a new issue for us to discuss?

Will do!

Done here #2020

Was this page helpful?
0 / 5 - 0 ratings