dbt: Record failing rows for tests into an auditable table

Created on 6 Aug 2018 · 10 Comments · Source: fishtown-analytics/dbt

Feature

Feature description

dbt should record rows that fail schema or data tests into tables, to be used for auditing. There are two general ways this could work:

1. Insert into an errors table with a schema like:
   - run started at
   - invocation id
   - test name
   - failing row primary key / json
2. Create a table for each test failure in a scratch schema, matching the schema of the test query that failed
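
As a rough illustration of the first option, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for the warehouse. The table name `test_failures`, the helper `record_failures`, and the sample `orders` data are all hypothetical; none of this is existing dbt behaviour.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def record_failures(conn, test_name, test_sql, invocation_id, run_started_at):
    """Run a test query and copy every row it returns into test_failures."""
    # Hypothetical shared errors table, following the schema listed above.
    conn.execute(
        """create table if not exists test_failures (
               run_started_at text,
               invocation_id  text,
               test_name      text,
               failing_row    text  -- failing row serialized as JSON
           )"""
    )
    cursor = conn.execute(test_sql)
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    conn.executemany(
        "insert into test_failures values (?, ?, ?, ?)",
        [
            (run_started_at, invocation_id, test_name,
             json.dumps(dict(zip(columns, row))))
            for row in rows
        ],
    )
    return len(rows)


# Sample data: two orders are missing a customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, customer_id integer)")
conn.executemany("insert into orders values (?, ?)", [(1, 10), (2, None), (3, None)])

failed = record_failures(
    conn,
    test_name="not_null_orders_customer_id",
    test_sql="select * from orders where customer_id is null",
    invocation_id=str(uuid.uuid4()),
    run_started_at=datetime.now(timezone.utc).isoformat(),
)
```

Storing the failing row as JSON lets one table hold failures from tests with different column sets, at the cost of less convenient querying than the per-test tables of the second option.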

The baked-in schema tests will need to be rewritten to provide debuggable information in these tables.

This should probably be an opt-in feature for tests. It could be configured in schema.yml, though it would also be good to be able to configure whole swaths of tests at once.

Who will this benefit?

dbt users could use this table to quickly determine why their tests failed. At present, it's surprisingly difficult to debug a failed test.

enhancement

drewbanin

👍22

Most helpful comment

I support and would use the use case of reporting on failed tests. I am more interested in using them for auditing than for debugging — I would like a nice easy table to query via Looker for a conditional "Go Fix ETL" alert.

When I went looking for a way to do this, I naturally reached for hooks, as I'm already using them for audit purposes, much as described in the documentation.

I'm logging the individual model run:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_runs (model text, state text, time timestamp)"

models:
  pre-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'start', getdate())"
  post-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'end', getdate())"

As well as the overall results:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

(I have a simple macro results_values to turn the results into insertable values.)

In this case I would generally like the exact same thing for tests. Something exactly analogous to the on-run-start/on-run-end with the end run context results would suit my needs:

on-test-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_test_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-test-end:
    - "insert into {{ target.schema }}.audit_dbt_test_results (node, status, execution_time, time) values {{ results_values(results) }}"

ianterrell on 1 Apr 2019

❤5

All 10 comments

@drewbanin - I totally get that we wouldn't want to bring back all the rows by default. What would be awesome is some kind of executable SQL on fail, so the workflow would look like:

1. Run test
2. Evaluate Pass/Fail
3. _[optionally]_ On fail, run SQL and spit out the result to the logs.

The reason I think this would work nicely is that our analysts are used to looking at the logs of a build to see if something has failed, and having the info right there in the logs would be a massive productivity gain, even if there's a config overhead in writing out explicitly what should happen on failure for each test.
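
The workflow above could be sketched roughly as follows in Python, again with SQLite standing in for the warehouse. The `on_failure_sql` argument is a hypothetical opt-in setting rather than an existing dbt config, and the sketch assumes the test query returns a count of failing rows.

```python
import sqlite3


def run_test(conn, name, test_sql, on_failure_sql=None, log=print):
    """Run a test; on failure, optionally run a diagnostic query and log its rows.

    Assumes test_sql returns a single count, where 0 means the test passed.
    on_failure_sql is a hypothetical per-test setting, not a real dbt option.
    """
    failures = conn.execute(test_sql).fetchone()[0]
    passed = failures == 0
    log(f"{'PASS' if passed else 'FAIL'} {name} ({failures} failing rows)")
    if not passed and on_failure_sql is not None:
        # Only pay for the diagnostic query when the test has actually failed.
        for row in conn.execute(on_failure_sql):
            log(f"  {name}: {row}")
    return passed


conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer, email text)")
conn.executemany("insert into users values (?, ?)", [(1, "a@example.com"), (2, None)])

lines = []
ok = run_test(
    conn,
    "not_null_users_email",
    test_sql="select count(*) from users where email is null",
    on_failure_sql="select id from users where email is null limit 10",
    log=lines.append,
)
```

Keeping the diagnostic query separate from the test query means passing tests cost nothing extra, and the `limit` keeps the log output bounded.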

alanmcruickshank on 14 Nov 2018

Totally with you @alanmcruickshank. I think also related is: https://github.com/fishtown-analytics/dbt/issues/517

I'm into the idea, but need to give some more thought to how dbt would support this. We don't have a good way of configuring groups of tests at the moment, and I imagine setting report_on_failure: true for every schema test would not be ideal.

Very open to ideas if you have them!

drewbanin on 14 Nov 2018

that looks like a great start to a 'dbt_results_mart', where you could have models as a dim, dbt_runs as a dim and fact and errors as an optional fact

ironmonkey88 on 1 Apr 2019

@ianterrell This is a great idea. Are you able & willing to share your results_values macro?

Aylr on 15 May 2019

I'd be interested in getting a look at that macro as well.

ironmonkey88 on 15 May 2019

Here's an easy macro for this.

{% macro unpack_results(results) %}

  {% for res in results -%}
    {# Jinja still runs inside SQL "--" comments, so the debug call is disabled with a Jinja comment instead. #}
    {# Useful for debugging: {{ log(res, info=True) }} #}
        ('{{ invocation_id }}', '{{run_started_at}}', NOW(), '{{ res.node.unique_id }}', '{{ res.status }}', '{{ res.error if res.error != None else ""}}', '{{ res.skip if res.skip != None else "" }}', '{{ res.fail if res.fail != None else "" }}', {{ res.execution_time }} ){{ "," if not loop.last }}
  {% endfor %}

{% endmacro %}

Aylr on 15 May 2019

👍1

Hi @Aylr and @gordonhwong!

In case it's still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"
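
To make the rendered SQL concrete, here is a small Python sketch of what the macro and the two hooks produce together. The `Result` class is a hypothetical stand-in for dbt's result objects, SQLite stands in for the warehouse, and `current_timestamp` replaces `getdate()`, which SQLite does not have.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for one entry of dbt's `results` list."""
    alias: str
    status: str
    execution_time: float


def results_values(results):
    """Render one (node, status, execution_time, time) tuple per result."""
    # Plain string interpolation, like the macro: a status containing a
    # single quote would break the statement.
    return ",\n".join(
        f"('{r.alias}', '{r.status}', {r.execution_time}, current_timestamp)"
        for r in results
    )


results = [Result("orders", "CREATE TABLE", 1.25), Result("customers", "ERROR", 0.4)]
values = results_values(results)

conn = sqlite3.connect(":memory:")
# Equivalent of the on-run-start hook.
conn.execute(
    "create table if not exists audit_dbt_results "
    "(node text, status text, execution_time decimal(10,2), time timestamp)"
)
# Equivalent of the on-run-end hook.
conn.execute(
    "insert into audit_dbt_results (node, status, execution_time, time) "
    f"values {values}"
)
```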

ianterrell on 17 May 2019

👍2

"that looks like a great start to a 'dbt_results_mart', where you could have models as a dim, dbt_runs as a dim and fact and errors as an optional fact"

This is exactly what I am looking for!
IMO a run / test / error mart is a prerequisite for any mature data pipeline, especially for tracking run duration and data quality over time. Such a model can then be used _as a source_ for firing alerts when data quality starts to _deviate from an average_, a run deviates from an average, etc.

Of course one could do this by bringing in other tools that help with this _(like https://docs.greatexpectations.io, a tool I have seen suggested a few times)._
But IMO this should be part of the CORE of dbt; for me it falls under the alerting and logging features displayed here: https://www.getdbt.com/product/
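
As one possible reading of "deviate from an average", here is a minimal Python sketch that flags a run whose duration is far from the historical mean. The function name `deviates`, the three-standard-deviation threshold, and the sample durations are all assumptions for illustration; in practice the history would come from the audit tables discussed above.

```python
from statistics import mean, stdev


def deviates(history, latest, threshold=3.0):
    """Return True when latest is more than `threshold` standard deviations
    away from the mean of history."""
    if len(history) < 2:
        # Not enough data to estimate a spread, so never alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values were identical; any change counts as a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical run durations in seconds from previous runs.
durations = [61.0, 58.5, 60.2, 59.8, 62.1]

normal_run = deviates(durations, 60.5)
slow_run = deviates(durations, 95.0)
```

The same check could be applied to a per-test failure count instead of a duration.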

bashyroger on 12 Jun 2020

👍1

Hey @bashyroger, thanks for bringing this topic back to the fore. I just opened a new issue to sketch out our latest thinking about how dbt should store tests in the database: https://github.com/fishtown-analytics/dbt/issues/2593