Dbt: During a DBT run, dropping a view will also drop dependent views that haven't been recreated.

Created on 6 Mar 2020  路  2Comments  路  Source: fishtown-analytics/dbt

Describe the bug

When there are views in the DAGs e.g. (view_root -> view_middle -> view_leaf)

A dbt run causes problems because:

  1. when view_root completes, it'll drop all its dependent views, i.e. view_middle and view_leaf in this case.
  2. view_middle and view_leaf will not be re-created until they're run.
  3. This is a problem because view_root might run very early in a run while view_middle and view_leaf run hours later and causes these views to not be available.

    • Our BI tools relying on view_middle and view_leaf are broken while dbt is running.

Steps To Reproduce

Created a sample dbt project to reproduce: https://github.com/nickwu241/dbt_reproduce_drop_view_cascade_bug

It will

  • create a postgres docker locally
  • create a dbt DAG of view_root -> view_middle -> view_leaf
  • has post-hook wih SELECT * FROM pg_sleep(10)

    • so that you can see view_middle and view_leaf are unavailable after view_root runs.

    • feel free to adjust 10 to be 30 if you need more time to see the database

Expected behavior

I expect dbt to not drop the dependent views until the whole run has finished.
I expect DROP VIEW view_root_backup CASCADE; won't happen until all dependent views are re-created.

Screenshots and log output

dbt run --profiles-dir .
Running with dbt=0.15.2
Found 3 models, 0 tests, 0 snapshots, 0 analyses, 125 macros, 1 operation, 0 seed files, 1 source

09:57:30 | 
09:57:30 | Running 1 on-run-start hook
09:57:30 | 1 of 1 START hook: dbt_reproduce_drop_view_cascade_bug.on-run-star... [RUN]
09:57:30 | 1 of 1 OK hook: dbt_reproduce_drop_view_cascade_bug.on-run-start.0... [CREATE TABLE in 0.00s]
09:57:30 | 
09:57:30 | Concurrency: 1 threads (target='dev')
09:57:30 | 
09:57:30 | 1 of 3 START view model public.view_root............................. [RUN]
09:57:41 | 1 of 3 OK created view model public.view_root........................ [CREATE VIEW in 10.14s] <<<<<<<<<<< view_middle and view_leaf will be dropped here
09:57:41 | 2 of 3 START view model public.view_middle........................... [RUN]
09:57:51 | 2 of 3 OK created view model public.view_middle...................... [CREATE VIEW in 10.08s]
09:57:51 | 3 of 3 START view model public.view_leaf............................. [RUN]
09:58:02 | 3 of 3 OK created view model public.view_leaf........................ [CREATE VIEW in 10.06s]
09:58:02 | 
09:58:02 | Finished running 3 view models, 1 hook in 31.74s.

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

System information

Which database are you using dbt with?

  • [x] postgres
  • [ ] redshift
  • [ ] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

installed version: 0.15.2
   latest version: 0.15.2

Up to date!

The operating system you're using:
macOS Mojave 10.14.6

The output of python --version:
Python 3.7.3

Additional context

Let me know if you require more information or have any questions! You can reach me on the DBT slack as Nick Wu as well.

Thank you!

bug wontfix

Most helpful comment

hey @nickwu241 - this is a good question! There isn't any mechanism in Postgres to define a view as "late bound" or "unbound" or anything like that. dbt unfortunately _must_ drop cascade the root tables, resulting in downstream views being dropped during the run. Here are some examples of how this is addressed on modern data warehouses:

  • Redshift has "late bound views" which do not need to be cascade-dropped when their parent tables are dropped
  • Snowflake views are late-bound
  • BigQuery views are late-bound

My recommendation would be to either:

  • create your Bi-focused models as tables instead of views so they aren't dropped during a run
  • consider building your dbt project in a separate schema (say, analytics_new), then swap the analytics schema for the analytics_new schema at the end of the run (this is called a blue/green deployment)

    • consider (and I know, this is a heavy answer to your question) using an analytical database instead of Postgres - Snowflake, BigQuery, and Redshift are all great choices, and they come equipped with tools for analytics-minded applications like late-bound views (amongst others!)

Closing this as I don't anticipate making any changes to dbt to support this use-case on Postgres, but I do want to validate that the issue you're seeing is real, it's just not one that Postgres makes particularly easy for dbt to support today.

All 2 comments

hey @nickwu241 - this is a good question! There isn't any mechanism in Postgres to define a view as "late bound" or "unbound" or anything like that. dbt unfortunately _must_ drop cascade the root tables, resulting in downstream views being dropped during the run. Here are some examples of how this is addressed on modern data warehouses:

  • Redshift has "late bound views" which do not need to be cascade-dropped when their parent tables are dropped
  • Snowflake views are late-bound
  • BigQuery views are late-bound

My recommendation would be to either:

  • create your Bi-focused models as tables instead of views so they aren't dropped during a run
  • consider building your dbt project in a separate schema (say, analytics_new), then swap the analytics schema for the analytics_new schema at the end of the run (this is called a blue/green deployment)

    • consider (and I know, this is a heavy answer to your question) using an analytical database instead of Postgres - Snowflake, BigQuery, and Redshift are all great choices, and they come equipped with tools for analytics-minded applications like late-bound views (amongst others!)

Closing this as I don't anticipate making any changes to dbt to support this use-case on Postgres, but I do want to validate that the issue you're seeing is real, it's just not one that Postgres makes particularly easy for dbt to support today.

@drewbanin Thank you for the detailed response and also with the recommendations, I appreciate it very much!

We are actually currently using Redshift for analytics but I produced the issue in Postgres because it's easier to reproduce with docker.

So for the short term I'll use bind: False for those views. And we'll definitely also consider the 2 other recommendations of

  • blue-green deploy with schemas
  • changing BI tools to depend on tables

Thanks again!

Was this page helpful?
0 / 5 - 0 ratings