Dbt: Don't provide a diststyle to redshift tables when diststyle is configured as "auto"

Created on 25 Mar 2020  路  2Comments  路  Source: fishtown-analytics/dbt

Describe the bug

Redshift does not support explicitly setting a diststyle to auto. Instead, if a user configures a diststyle as "auto", dbt should ignore the config. This compels Redshift to use the "auto" diststyle.

See this thread for more info: https://github.com/fishtown-analytics/dbt/issues/1882

Steps To Reproduce

-- models/my_model.sql

{{ config(materialized='table', dist='auto') }}

select 1 as id

The model will fail to build indicating that auto is not a valid diststyle.

System information

Which database are you using dbt with?

  • [ ] postgres
  • [x] redshift
  • [ ] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

0.16.0
bug good first issue redshifpg

Most helpful comment

Hi @drewbanin

Thanks for keeping the burning flame of this topic! I like the idea of explicit is better than implicit.

I revisited the Redshift Docs and found it explaining how Redshift chooses the distribution key and the sort key for CTAs, referral documentation that brought me into this page here.

In a nutshell, if it is a simple query the CTA table will inherit the distribution key from the source table. In another hand, if it is a complex query with joins and group by, Redshift will choose the distribution key and the sort key based on the Query Plan.

If you don't specify DISTKEY or DISTSTYLE, CTAS determines the distribution style for the new table based on the query plan for the SELECT clause. For more information, see Inheritance of Column and Table Attributes.

My point is, it is still an automatic mechanism since the user can choose to not define the distribution key, and Redshift will choose one. However, it will not mark the distribution key as AUTO on the metadata tables definition. Probably, Redshift will not choose a new distribution key based on table usage for example.

I took the opportunity to execute your suggested example and check the distsyle _(I've omitted the materialization in the config block)_, here it is:

select "table", diststyle from SVV_TABLE_INFO
where "table" like 'my_test%';
-- result:
-- table        diststyle
-- my_test  EVEN

Anyway, I will be very happy contributing with this PR.

All 2 comments

Hi @drewbanin

Thanks for keeping the burning flame of this topic! I like the idea of explicit is better than implicit.

I revisited the Redshift Docs and found it explaining how Redshift chooses the distribution key and the sort key for CTAs, referral documentation that brought me into this page here.

In a nutshell, if it is a simple query the CTA table will inherit the distribution key from the source table. In another hand, if it is a complex query with joins and group by, Redshift will choose the distribution key and the sort key based on the Query Plan.

If you don't specify DISTKEY or DISTSTYLE, CTAS determines the distribution style for the new table based on the query plan for the SELECT clause. For more information, see Inheritance of Column and Table Attributes.

My point is, it is still an automatic mechanism since the user can choose to not define the distribution key, and Redshift will choose one. However, it will not mark the distribution key as AUTO on the metadata tables definition. Probably, Redshift will not choose a new distribution key based on table usage for example.

I took the opportunity to execute your suggested example and check the distsyle _(I've omitted the materialization in the config block)_, here it is:

select "table", diststyle from SVV_TABLE_INFO
where "table" like 'my_test%';
-- result:
-- table        diststyle
-- my_test  EVEN

Anyway, I will be very happy contributing with this PR.

Awesome, thanks @rodrigodelmonte! A PR would be wonderful and very much welcomed!

Was this page helpful?
0 / 5 - 0 ratings