Dbt: Node selection for @ on dbt test is incomplete

Created on 11 Oct 2019 · 7Comments · Source: fishtown-analytics/dbt

Describe the bug

Our CI runs dbt run followed by dbt test in a completely clean snowflake warehouse. When using the @ syntax for both, run passes fine, but tests will fail because there are relationship tests on models that aren't built.

Steps To Reproduce

In as much detail as possible, please provide steps to reproduce the issue. Sample data that triggers the issue, example model code, etc is all very helpful here.

If you view the graph on https://gitlab-data.gitlab.io/analytics/dbt/snowflake/#!/model/model.gitlab_dw?g_v=1&g_i=@zuora you'll see that the sfdc_executive_business_review model is not built, but on a dbt test this relationship test is run https://gitlab.com/gitlab-data/analytics/blob/master/transform%2Fsnowflake-dbt%2Fmodels%2Fsfdc%2Fbase%2Fschema.yml#L66

Expected behavior

I would either expect that test to be skipped or the relevant model to be built.

Screenshots and log output

These are the logs for an example failed job with @zuora

System information

Which database are you using dbt with?

[x] snowflake

The output of dbt --version:

Running with dbt=0.14.2

The operating system you're using:
Mac OS X

The output of python --version:
3.7.4

Additional context

Add any other context about the problem here.

bug

Source

tayloramurphy

👍1

All 7 comments

The repro for this case will be very complicated, but I think the issue is when tests get added in to the final graph.

We first collect all the matching models in the graph (and properly perform the children's ancestors calculation there).
Then we go find all the tests that descend from those collected values.
But we never go back up the graph and find all the ancestors of the tests we just collected, we assume that the original collection pass got all of them.

I'm not sure if we should be excluding the tests whose ancestors aren't in the graph, or doing another pass and adding all the test ancestors in, but one of those choices is the appropriate solution here.

beckjake on 11 Oct 2019

👍1

I think a simple reproduction case should be:

-- models/model1.sql

select 1 as id

-- models/model2.sql

select 1 as fk

models/schema.yml

models:
  - name: model1
    columns:
      - name: id
         relationships:
           to: ref('model2')
           field: fk

Running:

# Runs only model1
dbt run --models @model1

# Runs no tests
dbt test --models @model1

# Runs the test
dbt test --models model1 model2

I'm not sure if we should be excluding the tests whose ancestors aren't in the graph, or doing another pass and adding all the test ancestors in, but one of those choices is the appropriate solution here.

It's compelling to think that tests should only be run if _all_ of their ancestors are selected in the run, but that will make custom data tests pretty wonky! This is probably the right answer for @tayloramurphy's use case, but probably the wrong answer in the more general case.

I do not think that in this case, the selected nodes in dbt run --models @zuroa should be made aware of the tests that exist on these models -- that would _definitely_ be unexpected behavior! I do want to note that that would be the natural behavior of my suggestion in #2203 :)

So, I think the only other path forward here is to make relationship tests smarter. I think it would be pretty reasonable to special-case schema tests that select from more than one model. We can add functionality to support tests returning with a warning that can be escalated to an error with --warn-error, and we could leverage the relation cache to determine if the to relation exists in the database or not.

Curious what y'all think?

drewbanin on 10 Apr 2020

It's compelling to think that tests should only be run if all of their ancestors are selected in the run, but that will make custom data tests pretty wonky!

What will be wonky about it? Custom data tests should use ref and source just like everything else, right?

beckjake on 10 Apr 2020

True, but something about this behavior for custom data tests feels wrong to me. Maybe custom data tests are less of an issue because you can conceivably select them by name/path? Let me think on that some more... it's not quite as jarring as I initially felt it was, but I don't feel _great_ about it still.

drewbanin on 10 Apr 2020

I'm here from the future to say that the @ operator seems to work for tests now, insofar as the reproduction case that @drewbanin included above does the following in dbt v0.18.1:

# Runs both model1 and model2
dbt run --models @model1

# Runs the test
dbt test --models @model1

I came here because I've been thinking a lot about:

if we should be excluding the tests whose ancestors aren't in the graph

Particularly with reference to state:modified + testing caveats, where one "leg" of a test may be modified but the other one may not be, and we'll only have one in our selection criteria for both run + test. "All parents must be present" is one of a few potential approaches I'm mulling over.

In any case, I'm going to close this issue :)

jtcohen6 on 10 Nov 2020

I'm here from the future to say that the @ operator seems to work for tests now, insofar as the reproduction case that @drewbanin included above does the following in dbt v0.18.1:
# Runs both model1 and model2
dbt run --models @model1

# Runs the test
dbt test --models @model1
I came here because I've been thinking a lot about:

if we should be excluding the tests whose ancestors aren't in the graph

Particularly with reference to state:modified + testing caveats, where one "leg" of a test may be modified but the other one may not be, and we'll only have one in our selection criteria for both run + test. "All parents must be present" is one of a few potential approaches I'm mulling over.

In any case, I'm going to close this issue :)