Dbt: During a DBT run, dropping a view will also drop dependent views that haven't been recreated.

Created on 6 Mar 2020 · 2Comments · Source: fishtown-analytics/dbt

Describe the bug

When there are views in the DAGs e.g. (view_root -> view_middle -> view_leaf)

A dbt run causes problems because:

when view_root completes, it'll drop all its dependent views, i.e. view_middle and view_leaf in this case.
view_middle and view_leaf will not be re-created until they're run.
This is a problem because view_root might run very early in a run while view_middle and view_leaf run hours later and causes these views to not be available.
- Our BI tools relying on view_middle and view_leaf are broken while dbt is running.

Steps To Reproduce

Created a sample dbt project to reproduce: https://github.com/nickwu241/dbt_reproduce_drop_view_cascade_bug

It will

create a postgres docker locally
create a dbt DAG of view_root -> view_middle -> view_leaf
has post-hook wih SELECT * FROM pg_sleep(10)
- so that you can see view_middle and view_leaf are unavailable after view_root runs.
- feel free to adjust 10 to be 30 if you need more time to see the database

Expected behavior

I expect dbt to not drop the dependent views until the whole run has finished.
I expect DROP VIEW view_root_backup CASCADE; won't happen until all dependent views are re-created.

Screenshots and log output

dbt run --profiles-dir .
Running with dbt=0.15.2
Found 3 models, 0 tests, 0 snapshots, 0 analyses, 125 macros, 1 operation, 0 seed files, 1 source

09:57:30 | 
09:57:30 | Running 1 on-run-start hook
09:57:30 | 1 of 1 START hook: dbt_reproduce_drop_view_cascade_bug.on-run-star... [RUN]
09:57:30 | 1 of 1 OK hook: dbt_reproduce_drop_view_cascade_bug.on-run-start.0... [CREATE TABLE in 0.00s]
09:57:30 | 
09:57:30 | Concurrency: 1 threads (target='dev')
09:57:30 | 
09:57:30 | 1 of 3 START view model public.view_root............................. [RUN]
09:57:41 | 1 of 3 OK created view model public.view_root........................ [CREATE VIEW in 10.14s] <<<<<<<<<<< view_middle and view_leaf will be dropped here
09:57:41 | 2 of 3 START view model public.view_middle........................... [RUN]
09:57:51 | 2 of 3 OK created view model public.view_middle...................... [CREATE VIEW in 10.08s]
09:57:51 | 3 of 3 START view model public.view_leaf............................. [RUN]
09:58:02 | 3 of 3 OK created view model public.view_leaf........................ [CREATE VIEW in 10.06s]
09:58:02 | 
09:58:02 | Finished running 3 view models, 1 hook in 31.74s.

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

System information

Which database are you using dbt with?

[x] postgres
[ ] redshift
[ ] bigquery
[ ] snowflake
[ ] other (specify: ____________)

The output of dbt --version:

installed version: 0.15.2
   latest version: 0.15.2

Up to date!

The operating system you're using:
macOS Mojave 10.14.6

The output of python --version:
Python 3.7.3

Additional context

Let me know if you require more information or have any questions! You can reach me on the DBT slack as Nick Wu as well.

Thank you!

bug wontfix

Source

nickwu241

Most helpful comment

hey @nickwu241 - this is a good question! There isn't any mechanism in Postgres to define a view as "late bound" or "unbound" or anything like that. dbt unfortunately _must_ drop cascade the root tables, resulting in downstream views being dropped during the run. Here are some examples of how this is addressed on modern data warehouses:

Redshift has "late bound views" which do not need to be cascade-dropped when their parent tables are dropped
Snowflake views are late-bound
BigQuery views are late-bound

My recommendation would be to either:

create your Bi-focused models as tables instead of views so they aren't dropped during a run
consider building your dbt project in a separate schema (say, analytics_new), then swap the analytics schema for the analytics_new schema at the end of the run (this is called a blue/green deployment)
- consider (and I know, this is a heavy answer to your question) using an analytical database instead of Postgres - Snowflake, BigQuery, and Redshift are all great choices, and they come equipped with tools for analytics-minded applications like late-bound views (amongst others!)

Closing this as I don't anticipate making any changes to dbt to support this use-case on Postgres, but I do want to validate that the issue you're seeing is real, it's just not one that Postgres makes particularly easy for dbt to support today.

drewbanin on 6 Mar 2020

👍2 ❤1

All 2 comments

Redshift has "late bound views" which do not need to be cascade-dropped when their parent tables are dropped
Snowflake views are late-bound
BigQuery views are late-bound

My recommendation would be to either:

create your Bi-focused models as tables instead of views so they aren't dropped during a run
consider building your dbt project in a separate schema (say, analytics_new), then swap the analytics schema for the analytics_new schema at the end of the run (this is called a blue/green deployment)
- consider (and I know, this is a heavy answer to your question) using an analytical database instead of Postgres - Snowflake, BigQuery, and Redshift are all great choices, and they come equipped with tools for analytics-minded applications like late-bound views (amongst others!)

drewbanin on 6 Mar 2020

👍2 ❤1

@drewbanin Thank you for the detailed response and also with the recommendations, I appreciate it very much!

We are actually currently using Redshift for analytics but I produced the issue in Postgres because it's easier to reproduce with docker.

So for the short term I'll use bind: False for those views. And we'll definitely also consider the 2 other recommendations of

blue-green deploy with schemas
changing BI tools to depend on tables

Thanks again!

nickwu241 on 6 Mar 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Relation doesn't exist for dbt test after `dbt run -m state:modified` if the test was modified but the model wasn't

joellabes · 3Comments

Persist source descriptions to database

whittid4 · 3Comments

support persist_docs on all plugins

drewbanin · 3Comments

ref'ing an ephemeral model inside of a statement does not work

drewbanin · 3Comments

Add COPY GRANTS to #Snowflake table materialisation

bashyroger · 3Comments