Dbt: model blocks

Created on 19 Oct 2016  路  7Comments  路  Source: fishtown-analytics/dbt

This issue presents a suggested alternative to the way that dbt code is currently organized.

Currently, every .sql file in the models directory is its own model. This proposal is to break that 1:1 relationship between files and models, and instead allow models to be defined similarly to how functions are defined in a programming language. For example:

def [model_1_name]

SELECT a, b
from table_c
where d is not null

def [model_2_name]

...

By itself, this change doesn't accomplish anything except for adding some light organization to a models directory. But the relaxation of the constraint of 1 file : 1 model allows users to create models very differently.

Currently, the 1:1 constraint encourages users to write models that are fairly long and complex, frequently 50-100 lines long or more. These models often contain several CTEs, each of which represents a separate data transformation. These CTEs, because they are wrapped within a larger model, cannot be tested and are not a part of the dependency graph. Plus, long code with multiple CTEs is harder to read, write, and maintain.

This file structure would allow model authors to decompose their models into multiple components, and then control each component as a part of the DAG. Each def is a model, but the default for all models is to be ephemeral, unless instructed otherwise. Ephemeral models can still be tested but are not materialized directly into the database. We still need to consider how this default interacts with model configuration scoping.

Additionally, authors would be able to provide common aliases for queries within that file using an import directive. The import directive is a convenient way for model authors to create a single alias for a table that is then used throughout the entire file. Imports would happen at the beginning of a file:

import {{ref('a_long_model_name')}} as table_c

def [model_1_name]

SELECT a, b
from table_c
where d is not null

def [model_2_name]

...

This import functionality will increase code quality and reduce lines of code, as currently model authors are writing verbose CTEs at the beginning of every model to provide this aliasing functionality.

enhancement

Most helpful comment

A use case that came up today that I think is a good example of where the one-to-one paradigm breaks down:
A dbt user wants to create sanitized views over their fct_ tables, that excludes PII information (assume we have a list of known PII column names).
At the moment this would be achieved one of two ways:

  • creating an extra model for each fct_ table, leading to lots of redundant code
  • creating the view via a post-hook, which means that the objects don't show up in the DAG.

Just thought it worth adding this as a use-case to think through :)

All 7 comments

/cc @drewbanin

here are some notes on syntax after discussing this extensively over the past few days.


--how users create things:

create_test(args)
create_model(args)
create_macro(args)


--where args is:

args = [
  sql = "",
  materialized = "",
  ...
]

--syntax options follow


--HCL-like example

model "quickbooks_bill_payments" {
  name = "something_else"
  config {
  materialized = true
  }
  sql = """
    select *
    from ${model.quickbooks_billpayment_lines.table_name}
    where whatever
  """
}


--python-like example

def quickbooks_bill_payments (config=(args)) {

  with bill_payments as (

    select * from {{ref('quickbooks_bill_payments')}}

  ), billpayment_lines as (

    select * from {{ref('quickbooks_billpayment_lines')}}

  ), d1 as (

    select
      bp.id, bp.txn_date, bpl.amount, payment_account_id, ap.id as ap_id
    from bill_payments bp
      inner join billpayment_lines bpl on bp.id = bpl.bill_payment_id
      join (select id from {{ref('quickbooks_accounts')}} where type = 'Accounts Payable') ap
        on 1 = 1

  )

  select
    id, txn_date, amount, payment_account_id as account_id, 'credit' as transaction_type,
    'bill payment' as source
    {% if var('uses_classes') == "true" %}
      , null::bigint as class_id
    {% endif %}
  from
    d1

  union all

  select
    id, txn_date, amount, ap_id, 'debit', 'bill payment'

    {% if var('uses_classes') == "true" %}
      , null::bigint
    {% endif %}
  from
    d1

}

Sweet writeup -- thanks @jthandy

Reasons we might want this:

  • type safety when passing data between packages
  • specific functions (like import)
  • managing scope
  • managing file count
  • more idiomatic override

A quick update here: this is still very relevant. We've doubled down much harder on jinja since this issue was originally raised, so my guess is the most likely syntax would look something closer to this:

--jinja-like example

{% model quickbooks_bill_payments() %}

  with bill_payments as (

    select * from {{ref('quickbooks_bill_payments')}}

  ), billpayment_lines as (

    select * from {{ref('quickbooks_billpayment_lines')}}

  ), d1 as (

    select
      bp.id, bp.txn_date, bpl.amount, payment_account_id, ap.id as ap_id
    from bill_payments bp
      inner join billpayment_lines bpl on bp.id = bpl.bill_payment_id
      join (select id from {{ref('quickbooks_accounts')}} where type = 'Accounts Payable') ap
        on 1 = 1

  )

  select
    id, txn_date, amount, payment_account_id as account_id, 'credit' as transaction_type,
    'bill payment' as source
    {% if var('uses_classes') == "true" %}
      , null::bigint as class_id
    {% endif %}
  from
    d1

  union all

  select
    id, txn_date, amount, ap_id, 'debit', 'bill payment'

    {% if var('uses_classes') == "true" %}
      , null::bigint
    {% endif %}
  from
    d1

{% endmodel %}

A use case that came up today that I think is a good example of where the one-to-one paradigm breaks down:
A dbt user wants to create sanitized views over their fct_ tables, that excludes PII information (assume we have a list of known PII column names).
At the moment this would be achieved one of two ways:

  • creating an extra model for each fct_ table, leading to lots of redundant code
  • creating the view via a post-hook, which means that the objects don't show up in the DAG.

Just thought it worth adding this as a use-case to think through :)

cooked this up to explore what sub-tables (like @clrcrl is describing) could look like as custom materializations: https://gist.github.com/drewbanin/ea256260dcb8474414660d1c90d39e3f

Was this page helpful?
0 / 5 - 0 ratings