Dbt: Add namespacing for dbt resources

Created on 30 Jan 2019 · 20 Comments · Source: fishtown-analytics/dbt

Placeholder issue for thoughts related to namespacing. When we're able to prioritize work on namespacing, let's convert this to a more actionable issue:

Questions:

how can we namespace models to avoid ambiguity?
what do package namespaces look like for models?
is there a way to "import" resources from a package, rather than injecting everything into the toplevel context?

Other similar questions / use cases welcomed :)

help wanted

drewbanin

👍17

Most helpful comment

@drewbanin I like the two notions of making packages within a project easy, and dropping the name uniqueness across packages.

This is similar to ideas that Betterment has been tossing around _a lot_ lately. We have models across a small but slowly growing number of internal domains that all belong in the same project, but have different owners and use cases depending on the domain. We want to relax the requirement for unique model names across these domains (_especially_ for ephemeral models).

We were thinking of suggesting custom schema names (not the actual materialized schema names, which may differ from run to run) as a namespace, but I think if we could easily create intra-project packages, that could suffice. The requirements that we've been thinking about are as follows (let's say for these requirements that namespace could either be the name of a package or of a custom schema):

Models can be identified either by "model_name" or ("namespace", "model_name").
e.g., if two models have the same name x but different namespaces A and B:
- DBT does _not_ fail to compile the project, as it does now
- Using ref('x') from a model in a different namespace (not A or B) fails at compile time because the reference is ambiguous
- Using ref('A', 'x') from a model in a different namespace than A is fine
- Using ref('x') from a model in namespace A implies ref('A', 'x') (i.e., DBT first checks whether x is unambiguous and, if it is not, tries to find 'A', 'x')
Two models named y within the same namespace, or that materialize to the same schema, do cause a compile failure (as is the current behavior)

I know everyone on our team would also be thrilled if _ephemeral_ models within different folders of the same package could be similarly named without creating conflicts as well, but the points above are more important.
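
The resolution rules above can be sketched in a few lines of Python. This is only an illustration of the proposed behavior, not dbt code: `build_index`, `resolve_ref`, and `AmbiguousRefError` are hypothetical names, models are reduced to (namespace, name) pairs, and the "materialize to the same schema" check is left out.

```python
from collections import defaultdict


class AmbiguousRefError(Exception):
    """Raised when a bare model name matches models in several namespaces."""


def build_index(models):
    """Map each model name to the set of namespaces that define it.

    `models` is an iterable of (namespace, name) pairs (a hypothetical shape).
    Two models with the same name in the same namespace are rejected.
    """
    index = defaultdict(set)
    for namespace, name in models:
        if namespace in index[name]:
            raise ValueError(f"duplicate model {name!r} in namespace {namespace!r}")
        index[name].add(namespace)
    return index


def resolve_ref(index, name, namespace=None, current_namespace=None):
    """Return the (namespace, name) pair that a ref() call points at."""
    candidates = index.get(name, set())
    if namespace is not None:
        # Explicit form, e.g. ref('A', 'x'): always allowed if the model exists.
        if namespace not in candidates:
            raise KeyError(f"no model {name!r} in namespace {namespace!r}")
        return (namespace, name)
    if len(candidates) == 1:
        # Bare name that is unambiguous across the whole project.
        return (next(iter(candidates)), name)
    if current_namespace in candidates:
        # Ambiguous bare name: fall back to the caller's own namespace.
        return (current_namespace, name)
    if not candidates:
        raise KeyError(f"no model named {name!r}")
    raise AmbiguousRefError(
        f"ref({name!r}) is ambiguous; candidates: {sorted(candidates)}"
    )
```

With models x in namespaces A and B, `resolve_ref(index, "x", current_namespace="A")` picks A's copy, while the same call from a third namespace raises `AmbiguousRefError`.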

mjumbewu on 9 Apr 2020

👍6 ❤3

All 20 comments

Hi @drewbanin, is this different / related to the comment here "be sure to namespace your models" ?

https://docs.getdbt.com/docs/configuring-models

We're looking for a clean way to separate sets of pipelines, e.g. all pipelines related to a consumer ML service, and pipelines related to analytics. This will be particularly useful as our pipelines grow: the ability to run dbt run/test on a specific namespace will ensure the processing completes faster, as unrelated models aren't built.

danimcgoo on 5 Feb 2019

hey @danimcgoo - yeah - this would be slightly different than the namespacing described in the link you shared above. The type of namespacing described here would make it possible to refer to models with more specific names. Imagine you wanted to have two models with the same name, but materialized in different schemas. Namespacing in dbt would let you do:

-- models/shopify_uk/sales.sql
select ...

-- models/shopify_fr/sales.sql
select ...

And then later:

select * from {{ ref('shopify_uk.sales') }}
union all
select * from {{ ref('shopify_fr.sales') }}

Here, there would be two models named sales.sql, but you'd be able to unambiguously refer to either using some sort of "namespace". TBD _exactly_ what shape that will take in dbt, but I think that as dbt projects grow, we'll increasingly be pushed away from our current paradigm of globally unique model names, and this is one way to accomplish that.

I would say that for the moment, dbt's approach to model selection might be sufficient for your case. Have you checked out tagging yet? You can use dbt_project.yml to apply tags to whole directories of models, then select the models with those tags on the CLI with --models!

Hope some of this is helpful -- lmk if there's anything else I can clarify

drewbanin on 5 Feb 2019

👍1

Hey @drewbanin, tagging looks like it'll do the trick. Thanks for the tip!

danimcgoo on 11 Feb 2019

Hi @drewbanin

Is this "Help Wanted" still current? I'm interested in supporting this feature in our internal projects because we have 100s of sql files that have a hierarchy of subdirectories under the "models" directory to help keep things semantically organized. Also, we have several SQL files with the same name but that exist in different directories (as illustrated above with the sales.sql example).

Offhand, I think that this feature needs to be enabled with a flag somewhere to avoid getting an error during the compile phase when two SQL files have the same name (i.e., the "dbt found two resources with the name" CompilationException). One place to add this flag would be in dbt_project.yml at the top level. Maybe like this:

dbt_project.yml:

  use_reference_namespace: true

The next trick is to thread this parameter through to the places where the unique identifiers are generated. As one example, in parser/base_sql.py BaseSqlParser.parse_sql_node, a unique_id is generated by joining the resource_type, package_name, and node name (i.e., filename base) using '.'. This unique_id is later split apart in utils.py id_matches(). As such, I don't think that we can use the '.' character to separate the namespace. Instead, maybe it could be a '/', just like the directory separator:

select * from {{ ref('shopify_uk/sales') }}
union all
select * from {{ ref('shopify_fr/sales') }}

So, I think one solution would approximately be to update the get_path methods (in parser/schemas.py and parser/base.py) to take a node instead of name as the last parameter, and then conditionally include the directory prefix in the name based on the use_reference_namespace flag. All these directories would be relative to the directory specified in dbt_project.yml, and could include subdirectories of subdirectories (e.g., ref('shopify_uk/sub_dir/sales')).

Thoughts? Thanks!

heisencoder on 26 Sep 2019

(Actually, in looking closer at the id_matches() logic, there is support for using '.' in the name. For example, NodeType.Source requires that there be a single '.' in the node_name)

heisencoder on 26 Sep 2019

Hey @heisencoder - before we begin any work here, I'd like to better develop our thinking around how namespacing will work in general. In your example (shopify_uk/sales), what is shopify_uk relative to? One obvious answer is the "root" models/ directory, but that doesn't extend super well to a world with packages.

dbt already has a notion of "fully qualified names" -- this looks like package_name.path.to.model_name. It's how the --models selector works, and how models are configured in the dbt_project.yml file. In dbt's selector logic, we have some code to assume that the package_name is the "root" package if one is not provided.
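
To make the "assume the root package" behavior concrete, here is a rough sketch. It is not dbt's actual selector code: `qualify` and its parameters are hypothetical, and it assumes a selector is a dotted string whose first component may or may not be a package name.

```python
def qualify(selector, known_packages, root_package):
    """Expand a dotted selector into a fully qualified name (a list of parts).

    If the first component is not a known package name, the root package is
    assumed, mirroring the default described above.
    """
    parts = selector.split(".")
    if parts[0] in known_packages:
        return parts
    return [root_package] + parts
```

For example, with packages `{"my_project", "dbt_utils"}` and root `my_project`, the selector `shopify_uk.sales` expands to `["my_project", "shopify_uk", "sales"]`. One caveat of this simple rule: a directory that shares its name with an installed package would be read as the package.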

If namespacing is turned on, should it be allowed to ref a model _without_ namespacing it? Or are namespaces solely intended to make it possible to handle models with shared names?

Presently, dbt will assert that no two models share the same name. I think we'll need to defer that validation until all of the refs are evaluated. If a ref can evaluate to more than one model, then we can make that a compile-time error.

I'm definitely more in favor of using dots (.) to represent namespaces compared to slashes (/) just insofar as these won't be "real" file paths.

What do you think about all of this?

drewbanin on 27 Sep 2019

👍1

Thanks @drewbanin for your thoughts!

I haven't looked into the "fully qualified names" that you've mentioned for the --models selector, so don't know how this proposal plays into that. Maybe it can just slide in with minimal changes? The package_name is already included in the node's unique_id, so maybe they can work together?

Here's a proposed approach that hopefully fits in the framework you've outlined:

Update the name computation to use the full path relative to the resources-path in which the model was found. Use '.' to separate path components. e.g. A file in 'models/dir1/dir2/model.sql' would have a name of 'dir1.dir2.model'.
Update utils.find_in_list_by_name and utils.find_in_subgraph_by_name to first attempt to find an exact match, and then later default to looking only at the model name without the package prefix to match current behavior. However, if looking just at the model name, it will be necessary to find all such matches and raise an exception if two or more matches apply.
Hunt down and update any code that relies on the previous naming convention.
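
The name computation in the first step might look roughly like this. It is a sketch only: `dotted_name` is a hypothetical helper, and it assumes POSIX-style paths and a single source root called `models`.

```python
from pathlib import PurePosixPath


def dotted_name(file_path, source_root="models"):
    """Turn a model file path into a '.'-separated name relative to the root.

    Raises ValueError if file_path is not located under source_root.
    """
    relative = PurePosixPath(file_path).relative_to(source_root)
    # Drop the file extension, then join the remaining path components.
    return ".".join(relative.with_suffix("").parts)
```

`dotted_name("models/dir1/dir2/model.sql")` gives `"dir1.dir2.model"`, matching the example in the first step.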

We could do these changes without adding a namespacing mode, although adding a namespacing mode could help with forward compatibility issues with the manifest.json format.

As a proof-of-concept, here are examples of the changes I imagined to utils.find_in_list_by_name:

def find_in_list_by_name(haystack, target_name, target_package, nodetype):
    """Find an entry in the given list by name.

    target_name could either be the base of the model name or could include the
    package prefix separated with '.' characters.
    """
    # id_matches and raise_compiler_error are the existing dbt helpers.
    base_name = target_name.split('.')[-1]
    base_name_matches = []
    for model in haystack:
        name = model.get('unique_id')
        # An exact match on the full (possibly dotted) name wins immediately.
        if id_matches(name, target_name, target_package, nodetype, model):
            return model
        # Otherwise remember models that match on the base name alone.
        if id_matches(name, base_name, target_package, nodetype, model):
            base_name_matches.append(model)
    if len(base_name_matches) >= 2:
        raise_compiler_error(
            'Ambiguous reference %s has multiple matches: %s'
            % (target_name, base_name_matches))
    if len(base_name_matches) == 1:
        return base_name_matches[0]
    return None

(And then also update id_matches to not blow up when the target_name contains '.')

Note that as a side-effect of including the full package prefix in the target_name, the compile tests in _load_schema_tests no longer fail because they're using the fully qualified name.

heisencoder on 27 Sep 2019

Hey @heisencoder - I think everything you're saying here is totally reasonable. The big challenge I can imagine is that we've always held as a hard constraint that resource names are unique. As such, there may be pernicious failures in parts of the codebase which rely on these model names being unique. We'll definitely have to figure out how to adjust the unique_id attribute of nodes, as it would be _very_ bad if those unique ids were no longer unique :)

I'm supportive of this idea, but we need to figure out:

how to generate unique ids for nodes (we can probably leverage the FQN path here)
understand the breadth of changes outside of the ref function context that are required:
- i know the docs site will need to be updated
- i don't believe we parse values out of the unique_id string anywhere, but we'd need to verify that
- figure out things around the edges, like how schema.yml works when models can have duplicate names

This is probably going to be a pretty big project!

drewbanin on 1 Oct 2019

Given the large number of edge cases, I think it makes sense to put this new code behind a flag so that old code doesn't have to worry about these edge cases. So I'm thinking that this could be controlled by a use_namespaces: true option at the top level of the dbt_project.yml file. We could theoretically allow turning this off or on at different levels of the models section of the dbt_project.yml, but I'm hoping to avoid allowing that level of configuration.

The unique_ids for the nodes would then become the FQN path instead of just the base filename. This should make them be unique in all contexts.
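
As a rough illustration of that idea (hypothetical helpers, not the parser's real code), the id could be built by joining the resource type, the package name, and the path components with '.', as the existing unique_id does with the base filename:

```python
def make_unique_id(resource_type, package_name, path_parts):
    """Join resource type, package, and path components into one dotted id."""
    return ".".join([resource_type, package_name, *path_parts])


def assert_unique(unique_ids):
    """Raise ValueError on the first id that appears more than once."""
    seen = set()
    for unique_id in unique_ids:
        if unique_id in seen:
            raise ValueError(f"duplicate unique_id: {unique_id}")
        seen.add(unique_id)
```

With this scheme the two sales models become `model.my_project.shopify_uk.sales` and `model.my_project.shopify_fr.sales` (the package name `my_project` is made up for the example). Note that any code that splits a unique_id on '.' and expects exactly three parts would need to change.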

I can provide some suggested documentation text.

I hadn't thought about the name parameter in the schema.yml files, but here are my two thoughts:

The path prefix to the model name can be implied by the path prefix of the schema.yml file. I'm guessing that in general, each directory will have its own schema.yml file that describes the models in the same directory. Since this is only a convention, I'm guessing that it won't always be followed.
We could expand the syntax to allow the models.name field to include the dotted FQN to the model name. This would allow schema.yml files to describe models in other directories.

heisencoder on 1 Oct 2019

Why not just copy what other programming environments do (e.g., Python) and make the folder the package name? As a newbie, I expected it already worked like this. I think this should be the default, with an option to disable it that allows legacy users to migrate.

preston-hf on 13 Feb 2020

👍3

Also, I will add that the source selector works about like I'd want from a usage point of view. That may provide a better way to migrate as well. Any time the old ref() is called, uniqueness for that model name is enforced, and an exception is generated if there are duplicates. A new ref that uses packages (or maybe just a new arg!) would limit this check to the namespace that was specified.

preston-hf on 13 Feb 2020

I love the source('my_schema', 'my_table') syntax. To echo @preston-hf, could we replicate that syntax for ref with ref('my_schema', 'my_model')? If the schema name is set in schema.yml or dbt_project.yml, then that is the model's schema/namespace. If not set, then we keep the current behavior.

rpedela-recurly on 6 Apr 2020

hey @rpedela-recurly - we definitely don't want to namespace things by schema name! Destination schemas are configuration in dbt, so the actual schema that a model renders into should be totally different in dev vs. prod.

You can actually _currently_ supply two arguments to the ref function, eg:

{{ ref('package_name', 'model_name') }}

In practice, this isn't super useful though -- model names must currently be unique across all packages included in a project. Increasingly, my thinking is that we should:

make it easier to define multiple packages inside of a single project
drop the requirement that model names must be unique across packages

If two models share the same name, then you would need to qualify the model name with a package name in the ref statement. If the two models have the same database representation, then dbt could raise a compiler error.
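
A minimal sketch of that second check, assuming each model is described by a package, a name, and its target database and schema. The `Model` class and `find_relation_collisions` are hypothetical, and the case-insensitive comparison is an assumption that will not hold for every warehouse.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    package: str
    name: str
    database: str
    schema: str


def find_relation_collisions(models):
    """Group models by the relation they would build; return only the clashes."""
    by_relation = defaultdict(list)
    for model in models:
        # Assumption: identifiers are compared case-insensitively.
        key = (model.database.lower(), model.schema.lower(), model.name.lower())
        by_relation[key].append(model)
    return {key: group for key, group in by_relation.items() if len(group) > 1}
```

Two models named sales in different packages are fine as long as they land in different schemas; if both target the same database and schema, they show up in the returned dict and a compiler error could be raised from there.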

Do you buy that?

drewbanin on 6 Apr 2020

👍1

I didn't realize you could define package_name. Number 1 would suffice I think then, and at least I would make package_name = schema_name, but there would be more flexibility for others. As far as defining package/model relations, I still like setting the package for a set of models in either dbt_project.yml or schema.yml. But I also like having the ability to keep packages the way they are, as I see them being useful for large projects with many contributors. What if ref was updated to allow this:

ref('package', 'namespace_within_package', 'model')
ref('namespace_within_project', 'model')

Then namespace is defined in schema.yml or dbt_project.yml. I really like having the source/ref syntax look like SQL schema_name.table_name even if the real schema differs because of dev/prod environments.

rpedela-recurly on 6 Apr 2020

👍2

Also @drewbanin, when you say "make it easier to define multiple packages inside of a single project", what do you imagine that looks like? Something along the lines of the following structure, where DBT perhaps infers the package names of sub-projects, and maybe inherits project config from the "parent"?:

project/
|-- domain1/
|   |-- models/
|   +-- tests/
|
|-- domain2/
|   |-- models/
|   +-- tests/
|
|-- models/
|
|-- tests/
|
|-- dbt_project.yml
+-- packages.yml

# project/packages.yml

packages:
  - local: domain1
  - local: domain2

# project/dbt_project.yml

name: 'company_project'
version: '1.0'

models:
  domain1:
    schema: d1

  domain2:
    schema: d2

Maybe even allowing the packages definition to be folded into the _dbt_project.yml_ file?

mjumbewu on 9 Apr 2020

👍3

I am very interested in being able to namespace objects like models and sources and macros—at least in a rudimentary fashion by removing the requirement that model names be unique even between packages—and I'd be happy to try to put together a PR if someone more familiar with the project can give me a little guidance on where to get started.

jgysland on 13 Apr 2020

👍2

hey @mjumbewu - thanks for writing all of this out!

I really like your suggestions on how we should approach model-level namespacing: it's really sensible and well-specified.

Re: making namespacing easier: I like the general idea of your suggestion:

packages:
  - local: domain1
  - local: domain2

The only issue with this approach is that you need to run dbt deps to create symlinks from dbt_modules/domain1/ to domain1/. This would be a little bit circuitous and it doesn't work incredibly well on Windows (you know, no symlinks).

Instead, I think we could do one of the following:

1. Smarter local packages
We could add a config like symlink to the packages.yml file for local: packages. If this value is false, then dbt wouldn't try to symlink the package into dbt_modules/ and would instead point to the package dir directly.

Pros:

Feels pretty natural
You could use arbitrary file paths

Cons:

I think this would be a big change to dbt. We definitely _can_ read the packages.yml file and use it to inform parsing/running, but I don't think that's how things work today

2. A new dbt_project.yml config
The solution to having too many configs is to add another config. We could add a top-level dbt_project.yml config like package-paths. This config would let you enumerate paths in your project that contain dbt packages. Given the example you showed above, that would look like:

project/
|-- domain1/
|   |-- models/
|   +-- tests/
|
|-- domain2/
|   |-- models/
|   +-- tests/
|
|-- models/
|
|-- tests/
|
|-- dbt_project.yml
+-- packages.yml

# dbt_project.yml

source-paths: ['models']
package-paths: ['domain1', 'domain2']

Alternatively, we could get clever with source-paths and support something like:

source-paths:
  project: ['models']
  packages: ['domain1', 'domain2']

But I don't think I _love_ that approach.

Pros:

I think this should be relatively straightforward to implement today. dbt already knows to look in dbt_modules for packages, and this would be an extension to that logic

Cons:

???
I can't think of any real downsides to this approach
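
For illustration only, here is how a loader might turn the proposed config into package roots. `package-paths` is the hypothetical key suggested above and `package_roots` is a made-up helper; it also assumes the package name is simply the directory name, whereas a real implementation would more likely read each package's own project file.

```python
from pathlib import PurePosixPath


def package_roots(project_config, project_dir="."):
    """Map each in-project package name to its root directory.

    project_config is the parsed dbt_project.yml as a dict.
    """
    roots = {}
    for relative_path in project_config.get("package-paths", []):
        path = PurePosixPath(project_dir) / relative_path
        name = path.name  # assumption: directory name doubles as package name
        if name in roots:
            raise ValueError(f"package name {name!r} is defined twice")
        roots[name] = path
    return roots
```

Given `{"source-paths": ["models"], "package-paths": ["domain1", "domain2"]}` and a project directory of `project`, this yields roots at `project/domain1` and `project/domain2`.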

3. config-level namespaces

This was the one I had in mind when I wrote about making namespacing easier, but increasingly, I don't think this approach is such a good idea. I was picturing a models: level config which supports a namespace as a config value. That would look like:

models:
  my_project:
    path1:
      namespace: domain1
    path2:
      namespace: domain2

All of the models in models/path1 would be namespaced under the package domain1, and all of the models in models/path2 would be namespaced under the package domain2.

Pros:

It's pretty flexible!
Models/macros/seeds/snapshots/etc can live closer together, rather than in totally separate folders

Cons:

I think an approach like this would make correctly parsing models pretty challenging!
The models: config is only intended to configure models, but we'd want to namespace other resources (say, tests/snapshots/seeds/etc) in a package-aware way too

These are the things bouncing around in my head. I think approach 2 outlined above is going to be our best bet, but I'm curious what you all think too.

drewbanin on 22 Apr 2020

Hi there, just wanted to continue that discussion

I'm a beginner on DBT, so I would just like to give my opinion from a "beginner" point of view.

Seeds

First I would like to point out that this would be useful for Seeds as well, consider the following simple use-case:

Two pipelines are being worked on, targeting two different databases, A and B. This is the folder structure:

data/a/calendar.csv
data/b/calendar.csv

The 2 pipelines target 2 different Databases, so their names do not clash when they run against the Warehouse. But A needs a different fiscal calendar than B.

Now, as a beginner, my first intuition would be to use ref("my_proj", "a.calendar") because of the --select argument available to dbt seed

Sub projects

I think the "sub project" abstraction is quite hard to wrap my head around:

Now I need to look at the config to figure out if a directory has a special meaning?
What if I have a name clash in a large sub-project? Aren't we back to this issue?

Proposal

(I'm aware it's a summary/mix of some other solutions above)

Have the ability to call ref-able objects either by their full "module path" ref("my_proj", "a.calendar") or their name "excluding module" ref("my_proj", "calendar"), note that ref("a.calendar") or ref("calendar") should be valid if it's not ambiguous cross project
In the function that resolves the object, a lookup of a.calendar would return a single object, so it resolves cleanly. If only calendar is used, we could raise an exception such as "Ambiguous ref: use either a.calendar or b.calendar". I'm actually not familiar with how the code deals with ambiguous objects across packages, but I would imagine part of the logic could be re-used.
I'm not very experienced with the generated documentation, maybe we could display names of objects as full explicit references in case of ambiguity (perhaps by running a version of the above logic on all objects to be rendered), or just the object name otherwise. Again I'm not familiar with how ambiguous objects cross packages are rendered on the docs but it could inform implementation.

Pros:

Simplicity
No config change
Intuitive to beginners
Shielded against potential clashes of the sub-projects solution

Cons:

Can't assume uniqueness within a project
Need to introduce logic to detect ambiguous objects in modules
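
A small sketch of the lookup this proposal describes, assuming every ref-able object is known by its dotted module path. `resolve` and `AmbiguousRef` are hypothetical names; the error text follows the wording suggested above.

```python
class AmbiguousRef(Exception):
    """Raised when a short name matches more than one module path."""


def resolve(ref_name, module_paths):
    """Return the single module path whose trailing components match ref_name."""
    wanted = ref_name.split(".")
    matches = [
        path for path in module_paths
        if path.split(".")[-len(wanted):] == wanted
    ]
    if not matches:
        raise KeyError(ref_name)
    if len(matches) > 1:
        options = " or ".join(sorted(matches))
        raise AmbiguousRef(f"Ambiguous ref: use either {options}")
    return matches[0]
```

With the two seeds from the example, `resolve("a.calendar", ["a.calendar", "b.calendar"])` returns `"a.calendar"`, while `resolve("calendar", ...)` raises `AmbiguousRef`.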

dmateusp on 8 Jul 2020

@drewbanin Where does this go from here? Is this the 'DBT extension proposal'? Do you need to make a BDFL pronouncement?

Was shocked to discover that this is an outstanding issue still. I've always used multiple packages with qualified refs and I guess just got lucky that I never repeated a model name.