Dbt: strict field validation for schema.yml

Created on 25 Jun 2019  路  11Comments  路  Source: fishtown-analytics/dbt

Issue: field validation for schema.yml

Issue description

Invalid/nonexistent field names can have descriptions in schema.yml that are populated in the documentation.
If a field is deleted, but was previously documented (correctly), the docs will indicate that that field still exists.

Fix

Apply column validation for fields that don't have tests applied to them

Steps to reproduce

You can create any column name and add a description in schema.yml and it will be populated in the documentation

    columns:
        - name: deleted_column_name
          description: this field has been deleted

Most helpful comment

Drew pointed me to this issue when I asked about this on Slack.

My question was:

Is there a way to configure tests to identify columns that are in the results but _not_ covered by a schema test?

Love the strict: true idea - that seems like it would do the trick.

All 11 comments

Hey @jwerderits - thanks for making this issue! How do you think dbt should handle this in practice? Do you think a dbt run should fail? Or dbt docs generate? Or something else?

@drewbanin I think it makes the most sense for a dbt test to fail on this condition because it is already involved with schema validation. Encapsulating the logic here provides a warning that there are errors within the project without preventing a run or docs generate to fail. It might also be useful to dbt docs generate to help find where the erroneous field(s) exists.

This would be really useful! Would it be possible to extend the validation to check for the opposite scenario, columns which exist in the model but don't have entry in the schema.yml file?

@JackArthurton yeah! I think that's a great idea. I'm imagining that this would be opt-in, so you could annotate a schema.yml specification with strict: true (or similar) which would throw an error if the schema specification and the model are mismatched

Drew pointed me to this issue when I asked about this on Slack.

My question was:

Is there a way to configure tests to identify columns that are in the results but _not_ covered by a schema test?

Love the strict: true idea - that seems like it would do the trick.

Any progress regarding field validation?

I'd like to be able to define the schema like this:

models:
  - name: my_model
    columns:
      - name: column_1
         tests:
           - exists

hey @smomni - we haven't prioritized this one yet! If you're in a pinch, you can actually define your own custom schema test in your dbt project: https://docs.getdbt.com/docs/custom-schema-tests

If you create a macro in your macros/ directory like this:

{% macro test_exists(model, column_name) %}

    select count({{ column_name }})
    from {{ model }}
    where 1=0
    limit 1

{% endmacro %}

Then you'll be able to assert that columns exist with:

models:
  - name: my_model
    columns:
      - name: column_1
         tests:
           - exists

I think the version of this that we add to dbt natively will be a little bit smarter than this. We can just run a single query to find all of the columns in a table, then check them against the columns in the schema file. The schema test i shared above will run one query per column which probably isn't optimal

Hey guys this looks like it could be a really useful idea. For some context - we want to create a self service ELT pipeline whereby people can create their own datasets from upstream sources. In order to mitigate the risk of undocumented datasets entering the warehouse it would be nice to enforce people to document the newly created tables and views with descriptions.

The ideal use would be a project level flag that turns on the necessity of schemas to have descriptions. This would then be used as part of CI/CD to stop undocumented schemas getting to production.

hey @smomni - we haven't prioritized this one yet! If you're in a pinch, you can actually define your own custom schema test in your dbt project: https://docs.getdbt.com/docs/custom-schema-tests

If you create a macro in your macros/ directory like this:

{% macro test_exists(model, column_name) %}

    select count({{ column_name }})
    from {{ model }}
    where 1=0
    limit 1

{% endmacro %}

Then you'll be able to assert that columns exist with:

models:
  - name: my_model
    columns:
      - name: column_1
         tests:
           - exists

I think the version of this that we add to dbt natively will be a little bit smarter than this. We can just run a single query to find all of the columns in a table, then check them against the columns in the schema file. The schema test i shared above will run one query per column which probably isn't optimal

this approach doesn't fail if the column is not present in the table

@ArafathC are you sure about that? The query should fail because the specified column does not exist in the table. I'm curious if you could elaborate about what you mean

@drewbanin Sorry about that.. Was able to double check and confirm it works

Was this page helpful?
0 / 5 - 0 ratings