Dbt: Support Materialized Views

Created on 3 Dec 2018 · 16Comments · Source: fishtown-analytics/dbt

Docs: https://docs.snowflake.net/manuals/sql-reference/sql/create-materialized-view.html

Ideally this would also support the cluster by keyword:

create materialized view "schema"."my_mview"
cluster by ( id ) 
as 
select 1 as id,
         2 as name

bigquery help wanted redshifpg snowflake

Source

clausherther

👍13

Most helpful comment

BigQuery has a similar feature https://cloud.google.com/bigquery/docs/materialized-views-intro

Would definitely be useful

jonah-juul on 8 Apr 2020

👍8

All 16 comments

Looks like this was added in 0.13.1
(PR #1432 )

clausherther on 13 May 2019

@clausherther not so! #1432 fixed a problem where dbt couldn't run _if a materialized view lived in the dbt schema_. dbt still does not support the _creation_ of materialized views on Snowflake, though it is something I've been experimenting with recently.

In your mind, what's the advantage of using a materialized view over a dbt table model that's refreshed with some cadence? I seem to recall some pre-release Snowflake functionality that would auto-update materialized views, but I'm struggling to find that info presently

drewbanin on 13 May 2019

Ah, I misread this in the release notes: "Add "MaterializedView" relation type to the Snowflake adapter"

Personally, I don't think I have a good use case for MVs in Snowflake, that need has really been largely replaced by dbt. I think I opened the issue alongside the one for _transient_ tables, to keep track of new Snowflake features you potentially want to support.

clausherther on 13 May 2019

👍1

alright, thanks for the additional context! Happy to leave this open to track the Snowflake feature, but also welcome input from anyone who 👍'd the issue to chime in with some use cases!

drewbanin on 13 May 2019

I'm trying to think how we would deal with MViews in dbt. You don't really want to drop and recreate them every time because then there would be a lot of extra time and cost by Snowflake bringing them up-to-speed with the underlying data. They do act like views when they are "stale" or not completely refreshed so at least there wouldn't be any problems with what happens if they are queried if not completely rebuilt:

Data accessed through materialized views is always current, regardless of the amount of DML that has been performed on the base table. If a query is run before the materialized view is up-to-date, Snowflake either updates the materialized view or uses the up-to-date portions of the materialized view and retrieves any required newer data from the base table.
https://docs.snowflake.net/manuals/user-guide/views-materialized.html

Perhaps they would be like incremental tables where they are only re-created when a) they don't exist or b ) when forcing a full refresh?

There is also the potential for a lot of run-time error with MViews in Snowflake because they are limited in what they support from a SQL perspective. My concern is that it would be pretty difficult for dbt to parse and do a compile-time validation of what is not legitimate for an MView which would lead in many cases to dbt runs failing if the MView gets compiled but then fails creation or update at run-time. The list of things is too long for me to write here but see link above.

jrandrews on 7 Jun 2019

👍1

@drewbanin materialized views are useful for:

accelerating external tables with automated refreshes (better than using pipes)
maintaining a high-water mark on CDC tables eg

CREATE MATERIALIZED VIEW cdc_highwater 
AS 
(
    SELECT primary_key, column_name, MAX(wal_ts) AS wal_ts 
    FROM cdc_table
)

to accelerate consolidation into current-state tables like:

SELECT  primary_key
        , DECODE(column_name, 'col1', value) AS col1
        , DECODE(column_name, 'col2', value) AS col2
        , ...
        , DECODE(column_name, 'colN', value) AS colN
FROM cdc_table JOIN cdc_highwater USING (primary_key, column_name, wal_ts)

Frankly I think only use case 1 is value-generating, since the performance improvements are (allegedly) nontrivial for end users.

mike-weinberg on 11 Sep 2019

👍1

Thanks @mike-weinberg! I don't think it would be too challenging to create a materialized_view materialization in dbt. I think the logic would look like:

if the relation doesn't exist
- create a materialized view
else if the relation exists but is not a materialized view:
- drop the existing relation
- create a materialized view
else if the relation exists and is a materialized view and dbt is in full-refresh mode:
- replace the materialized view
else:
- no-op

I still think that the list of caveats are too restrictive for most modeling use cases (no window functions, no unions, limited aggregates, can't query views, etc etc etc). Still, happy to support these as I do think the first use-case you've pointed out here @mike-weinberg is good and legitimate. Plus, I imagine that these materialized views will only become more capable over time.

If someone is interested in creating this materialization, it all should be pretty doable in user-space! Would love to discuss the implementation further if anyone picks it up :)

drewbanin on 11 Sep 2019

🎉2

Redshift is also planning on supporting materialized views (in preview):

https://docs.aws.amazon.com/redshift/latest/dg/mv.html

It appears that many of the same limitations apply as with Snowflake materialized views.
https://docs.aws.amazon.com/redshift/latest/dg/mv-usage-notes.html

In addition, Redshift appears to keep the underlying materialized data static until a refresh command is issued at which time it either incrementally updates the data if certain conditions are met, or does a full rebuild.
https://docs.aws.amazon.com/redshift/latest/dg/mv-refresh.html#mv_REFRESH_MARTERIALIZED_VIEW_usage

While I agree that there will be limited cases where this makes sense in modeling, surely there are valid and interesting use cases.

automagic on 23 Jan 2020

Thanks @automagic! I used to be not-so-bullish on Materialized views, but the idea of these incremental refreshes is really appealing. One thing I particularly like about them: refreshing a table can happen concurrently with external queries. When dbt replaces tables, it currently blocks if there are long-running select statements hitting the table.

The one challenge here is going to be knowing when dbt should rebuild the MV, vs. just refresh it. We can leverage the --full-refresh flag, but that feels a little heavy handed. It would certainly be pretty cool if dbt could check if the columns in the view have changed, a la #320. Even better would be some sort of logical comparison (eg. check the hash of the SQL that created the model against the hash of the new model).... but that could be something for a separate issue :)

drewbanin on 23 Jan 2020

👍4 ❤1

BigQuery has a similar feature https://cloud.google.com/bigquery/docs/materialized-views-intro

Would definitely be useful

jonah-juul on 8 Apr 2020

👍8

Even with the limitations native Snowflake MV's have, supporting them in DBT will become more useful once the Transparent MV usage functionality has been released _(in private preview at the moment of writing this)_. Basically this is functionality akin to Oracle's QUERY_REWRITE.
See my write up on this new feature (and others) and the SF new features video that mentions it:
https://www.linkedin.com/pulse/my-pov-new-snowflake-features-announced-june-2020-werschkull-/

bashyroger on 17 Jun 2020

Hey @bashyroger, check out:

Initial work on a materialized_view dbt materialization in this repo
Yesterday's dbt community office hours (youtube). I mentioned the functionality akin to Oracle's QUERY_REWRITE, as it's the biggest feature currently supported for BigQuery's MVs (docs). I don't know as much about Snowflake's support for this, it sounds like it's TK.

jtcohen6 on 17 Jun 2020

Since this thread is warming up, let's make it spicy 🌶️ !

Snowflake recently announced a clever feature they're calling "transparent materialized views".

In the future, if a plan node of a query is equivalent to or is a subset of the plan node of a materialized view, then the materialized view will be used, even if it isn't referenced, because it is provably equal.

So, if you identify common aggregations, and build materialized views for those aggregations, then queries which have steps which are equivalent to those materialized views will be rewritten by the planner to use those materialized views. Alternatively, you could maintain multiple copies of the same table, but with different clustering keys, and snowflake will automatically pick the best one. This is analogous to Vertica's concept of projections.

Personally, I think this is a big deal, and if snowflake expands support to include INNER JOIN and UNION ALL clauses, then chains of MV's might become the best way to express many dags. Even in the absence of that, the ability to maintain copies of tables with various clustering keys means that MVs could accelerate regular views enough to enable decent-performing low latency reporting when combined with a CDC ELT vendor.

EDIT: When I wrote this, I either didn't see the prior two comments, or my page needed to be refreshed. Embarrassing! I'll leave this here though =)

mike-weinberg on 18 Jun 2020

👍2

What is the latest state on this one ? Is this issue still open ?

nareshkumaryadav on 16 Sep 2020

👍1

I also want to know this as I am going to source control snowflake and views also need to source control like tables.

kamleshsinhchauhan on 17 Sep 2020

Hey @nareshkumaryadav, this issue is definitely still open.

I played around with materialized views on the Core Four databases back in June, so there's code available if you want to implement a halfway-decent version in your own project.
There are some _small_ changes we'd want to make in dbt to include database metadata on MVs, specifically in Postgres + Redshift. Almost all the other code required here is SQL + Jinja, meaning that you have full power to try it, test it, and adapt it yourself.
Eventually, I imagine we'll want to define MVs as a native dbt materialization. We haven't yet for a couple of reasons:
- Unlike view, table, ephemeral, and incremental—which, with some small exceptions, have the same functionality across all four databases—a materialized_view necessarily means something quite different on each of Postgres, Redshift, Snowflake, and BigQuery.
- That functional variance also exists over time, as each warehouse continues to expand the footprint of its MV implementation. We're still not sure what we _want_ to do with MVs—real-time aggregation of event streams? rollup cubes for BI query rewrite?—and dbt has an opinionated modeling framework. We definitely don't want to constrain the bounds of what's _possible_ in dbt, but at the same time we don't want to write a bunch of code into dbt-core without having some strong recommendations for when to leverage that code.

All of which is to say: I'd find it really valuable to hear more about what you're thinking of doing with MVs in your own project.

@kamleshsinhchauhan It sounds like what you're after may be slightly different: we have long supported view and table as materialization types on Snowflake. Check out the docs on materializations. If you are thinking about Snowflake materialized views, then this is the right thread!

jtcohen6 on 21 Sep 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

alias models to change their names in the database

drewbanin · 4Comments

Add COPY GRANTS to #Snowflake table materialisation

bashyroger · 3Comments

ref'ing an ephemeral model inside of a statement does not work

drewbanin · 3Comments

log-path (and other output paths) cannot be set to null, because of aggressive defaults

kconvey · 3Comments

minimize downtime for incremental models with full-refresh

drewbanin · 4Comments