Invalid/nonexistent field names can have descriptions in schema.yml that are populated in the documentation.
If a field is deleted, but was previously documented (correctly), the docs will indicate that that field still exists.
Apply column validation for fields that don't have tests applied to them
You can create any column name and add a description in schema.yml and it will be populated in the documentation
columns:
- name: deleted_column_name
description: this field has been deleted
Hey @jwerderits - thanks for making this issue! How do you think dbt should handle this in practice? Do you think a dbt run should fail? Or dbt docs generate? Or something else?
@drewbanin I think it makes the most sense for a dbt test to fail on this condition because it is already involved with schema validation. Encapsulating the logic here provides a warning that there are errors within the project without preventing a run or docs generate to fail. It might also be useful to dbt docs generate to help find where the erroneous field(s) exists.
This would be really useful! Would it be possible to extend the validation to check for the opposite scenario, columns which exist in the model but don't have entry in the schema.yml file?
@JackArthurton yeah! I think that's a great idea. I'm imagining that this would be opt-in, so you could annotate a schema.yml specification with strict: true (or similar) which would throw an error if the schema specification and the model are mismatched
Drew pointed me to this issue when I asked about this on Slack.
My question was:
Is there a way to configure tests to identify columns that are in the results but _not_ covered by a schema test?
Love the strict: true idea - that seems like it would do the trick.
Any progress regarding field validation?
I'd like to be able to define the schema like this:
models:
- name: my_model
columns:
- name: column_1
tests:
- exists
hey @smomni - we haven't prioritized this one yet! If you're in a pinch, you can actually define your own custom schema test in your dbt project: https://docs.getdbt.com/docs/custom-schema-tests
If you create a macro in your macros/ directory like this:
{% macro test_exists(model, column_name) %}
select count({{ column_name }})
from {{ model }}
where 1=0
limit 1
{% endmacro %}
Then you'll be able to assert that columns exist with:
models:
- name: my_model
columns:
- name: column_1
tests:
- exists
I think the version of this that we add to dbt natively will be a little bit smarter than this. We can just run a single query to find all of the columns in a table, then check them against the columns in the schema file. The schema test i shared above will run one query per column which probably isn't optimal
Hey guys this looks like it could be a really useful idea. For some context - we want to create a self service ELT pipeline whereby people can create their own datasets from upstream sources. In order to mitigate the risk of undocumented datasets entering the warehouse it would be nice to enforce people to document the newly created tables and views with descriptions.
The ideal use would be a project level flag that turns on the necessity of schemas to have descriptions. This would then be used as part of CI/CD to stop undocumented schemas getting to production.
hey @smomni - we haven't prioritized this one yet! If you're in a pinch, you can actually define your own custom schema test in your dbt project: https://docs.getdbt.com/docs/custom-schema-tests
If you create a macro in your
macros/directory like this:{% macro test_exists(model, column_name) %} select count({{ column_name }}) from {{ model }} where 1=0 limit 1 {% endmacro %}Then you'll be able to assert that columns exist with:
models: - name: my_model columns: - name: column_1 tests: - existsI think the version of this that we add to dbt natively will be a little bit smarter than this. We can just run a single query to find all of the columns in a table, then check them against the columns in the schema file. The schema test i shared above will run one query per column which probably isn't optimal
this approach doesn't fail if the column is not present in the table
@ArafathC are you sure about that? The query should fail because the specified column does not exist in the table. I'm curious if you could elaborate about what you mean
@drewbanin Sorry about that.. Was able to double check and confirm it works
Most helpful comment
Drew pointed me to this issue when I asked about this on Slack.
My question was:
Love the
strict: trueidea - that seems like it would do the trick.