compare subcommand for identifying stale/unmanaged relations.Similar to https://github.com/fishtown-analytics/dbt/issues/615, but maybe less objectionable!
dbt compare should inspect the code repository and determine which views and tables are described therein. DBT should compare those relations to the system tables, and produce some output identifying any discrepancies. Something like
$ dbt compare
Comparing local codebase to data warehouse.
...
Identified 2 relations in the warehouse that are not managed by DBT.
VIEW: schema_name.view_name
TABLE: schema_name.table_name
This feature is useful for warehouse admins who want to identify and remove stale relations.
I frequently forget to clean up after myself if I've done some refactoring (e.g., renaming models or deleting old models that have been superseded by others). Since DBT has all of the requisite information to tell me about this, it seems like the right tool for the job.
DBT probably shouldn't be in the business of dropping relations, and there's some risk that this feature could be mis-used or the output misinterpreted (in particular, for people who have complex schema-renaming rules for different targets). @drewbanin can add some more detailed thoughts about where this might go wrong.
@mikekaminsky I think I'm into this idea. It would be convenient if the command would output a list of statements that could be run to drop the tables too. I think this would work really well as an operation. Maybe those should have access to the manifest so they can inspect the models in the project, and the state of the database. Then if users want to create an operation to actually drop the tables... that would be none of my business :)
For anyone else coming to this thread: there are heaps of ways that dbt could get this "wrong". If a user forgets to run dbt deps, or if they have configs to change the materialization setting of a model from table to ephemeral in dev, or if they aren't up to date with master, dbt could conceivably advise that current and in-use tables should be dropped. I want to enable a workflow for users, but I don't want dbt to make it easy to do something really bad by accident.
Thanks for the request @mikekaminsky!
My 2 cents: would be useful to have a config setting per model/model folder how dbt should behave in regards to synching the list of models and the list of db objects. E.g. value safe - do not ever drop anything from target. except tables - drop views but not tables, sync - full sync dropping everything that is not in the list of current models. Something along those lines.
If dbt is going to provide features like this, then there needs to be something other than console output.
In the future dbt-as-api world then it should just be returning a model describing the changes, but in the meantime I thing a useful solution would be to be able to specify an output file (json?) to return results to.
Stumbled onto this after the tenth time of us deleting models but forgetting to drop them in the dbt-managed schema.
It would be helpful to have a command like this, maybe dbt diff, that shows what objects exist in dbt-managed schemas and spit out the commands that would need to be run to remove old relations.
Thanks for the bump @tayloramurphy - I feel better about building this these days than I have in the past! The thing that's changed is that dbt's core constructs have coalesced around a handful of resources (seeds, models, snapshots). I actually think the obviously better version of this (one which both enumerates the relations, but also supports dropping the deleted ones) is within our reach too :)
My one hesitation is that adding a new subcommand (diff, compare, etc) is pretty heavy-weight! I feel ok about that in this particular case, but I wouldn't want to do that as a rule for future features similar to this one.
You buy all of that?
@drewbanin for sure I buy that! I don't have a strong opinion that this should be a separate command, I'm just not knowledgeable enough about the internal workings to know if it would make sense as a flag against run/test or via some other implementation.
The key thing for me on this one is, if we make the assumption that dbt is an infra as code tool where we define what the state of our warehouse should be, then there needs to be a way to sync the warehouse to what the code says should exist. Basically - here's what the information schema _should_ say exists, here's what the information schema _does_ say exists, and here's how to sync them.
I'd be for pulling compare out of dbt-helper and into dbt-core since I don't have time to maintain dbt-helper any more if it's helpful to have it as reference:
https://github.com/mikekaminsky/dbt-helper/blob/master/core/compare.py
In thinking about this for an upcoming sprint, I agree with the comments above that:
dbt compare or dbt diff have some semantic overlap with other proposals (e.g. https://github.com/fishtown-analytics/dbt/issues/1677) to identify models that are new/changed since a previous run.To my mind, one way we could accomplish the compare or diff operation would be as a flag to dbt ls. It can return console ouput, JSON, or (with extra specification) execute a series of drop statements. Here's what I could envision:
dbt ls --orphaned # list database relations that do not map to current dbt models
dbt ls --orphaned --execute-drop --dry-run # list DDL dbt will run
dbt ls --orphaned --execute-drop # execute drop statements
I admit that this is different behavior from current dbt ls commands, which returns a list of dbt resources, never database objects.
For what it's worth, dbt ls already requires a valid profile definition (though I don't think it requires that the profile actually be able to connect), so the break from current behavior is no big deal from a development perspective - it's not like we'd have to do anything along the lines of what deps does.
Most helpful comment
@mikekaminsky I think I'm into this idea. It would be convenient if the command would output a list of statements that could be run to drop the tables too. I think this would work really well as an operation. Maybe those should have access to the
manifestso they can inspect the models in the project, and the state of the database. Then if users want to create an operation to actually drop the tables... that would be none of my business :)For anyone else coming to this thread: there are heaps of ways that dbt could get this "wrong". If a user forgets to run
dbt deps, or if they have configs to change the materialization setting of a model fromtabletoephemeralin dev, or if they aren't up to date withmaster, dbt could conceivably advise that current and in-use tables should be dropped. I want to enable a workflow for users, but I don't want dbt to make it easy to do something really bad by accident.Thanks for the request @mikekaminsky!