Placeholder issue for thoughts related to namespacing. When we're able to prioritize work on namespacing, let's convert this to a more actionable issue:
Questions:
Other similar questions / use cases welcomed :)
Hi @drewbanin, is this different / related to the comment here "be sure to namespace your models" ?
https://docs.getdbt.com/docs/configuring-models
We're looking for a clean way to separate sets of pipelines, e.g. all pipelines related to a consumer ML service, and pipelines related to analytics. This will be particularly useful as the size of our pipelines grow, the ability to run dbt run/test on a specific namespace will ensure the processing completes faster, as non-related models aren't built.
hey @danimcgoo - yeah - this would be slightly different than the namespacing described in the link you shared above. The type of namespacing described here would make it possible to refer to models with more specific names. Imagine you wanted to have two models with the same name, but materialized in different schemas. Namespacing in dbt would let you do:
-- models/shopify_uk/sales.sql
select ...
-- models/shopify_fr/sales.sql
select ...
And then later:
select * from {{ ref('shopify_uk.sales') }
union all
select * from {{ ref('shopify_fr.sales') }
Here, there would be two models named sales.sql, but you'd be able to unambiguously refer to either using some sort of "namespace". TBD _exactly_ what shape that will take in dbt, but I think that as dbt projects grow, we'll increasingly be pushed away from our current paradigm of globally unique model names, and this is one way to accomplish that.
I would say that for the moment, dbt's approach to model selection might be sufficient for your case. Have you checked out tagging yet? You can use dbt_project.yml to apply tags to whole directories of models, then these models with those tags on the CLI with --models!
Hope some of this is helpful -- lmk if there's anything else I can clarify
Hey @drewbanin, tagging looks like it'll do the trick. thanks fo the tip!
Hi @drewbanin
Is this "Help Wanted" still current? I'm interested in supporting this feature in our internal projects because we have 100s of sql files that have a hierarchy of subdirectories under the "models" directory to help keep things semantically organized. Also, we have several SQL files with the same name but that exist in different directories (as illustrated above with the sales.sql example).
Offhand, I think that this feature needs to be enabled with a flag somewhere to avoid getting an error during the compile phase when two SQL files have the same name (i.e., the "dbt found two resources with the name" CompilationException. One place to add this flag would be in dbt_project.yml at the top-level. Maybe like this:
dbt_project.yml:
use_reference_namespace = true
The next trick is to thread this parameter through to the places where the unique identifiers are generated. As one example, in parser/base_sql.py BaseSqlParser.parse_sql_node, a unique_id is generated by joining the resource_type, package_name, and node name (i.e., filename base) using '.'. This unique_id is later split apart in utils.py id_matches(). As such, I don't think that we can use the '.' character as the separate the namespace. Instead, maybe it could be a '/', just like the directory separator:
select * from {{ ref('shopify_uk/sales') }}
union all
select * from {{ ref('shopify_fr/sales') }}
So, I think one solution would approximately be to update the get_path methods (in parser/schemas.py and parser/base.py) to take a node instead of name as the last parameter, and then conditionally include the directory prefix in the name based on the use_reference_namespace flag. All these directories would be relative to the directory specified in dbt_project.yml, and could include subdirectories of subdirectories (e.,g, ref('shopify_uk/sub_dir/sales')).
Thoughts? Thanks!
(Actually, in looking closer at the id_matches() logic, there is support for using '.' in the name. For example, NodeType.Source requires that there be a single '.' in the node_name)
Hey @heisencoder - before we begin any work here, I'd like to better develop our thinking around how namespacing will work in general. In your example (shopify_uk/sales), what is shopify_uk relative to? One obvious answer is the "root" models/ directory, but that doesn't extend super well to a world with packages.
dbt already has a notion of "fully qualified names" -- this looks like package_name.path.to.model_name. It's how the --models selector works, and how models are configured in the dbt_project.yml file. In dbt's selector logic, we have some code to assume that the package_name is the "root" package if one is not provided.
If namespacing is turned on, should it be allowed to ref a model _without_ namespacing it? Or are namespaces solely intended to make it possible to handle models with shared names?
Presently, dbt will assert that no two models share the same name. I think we'll need to defer that validation until all of the refs are evaluated. If a ref can evaluate to more than one model, then we can make that a compile-time error.
I'm definitely more in favor of using dots (.) to represent namespaces compared to slashes (/) just insofar as these won't be "real" file paths.
What do you think about all of this?
Thanks @drewbanin for your thoughts!
I haven't looked into the "fully qualified names" that you've mentioned for the --models selector, so don't know how this proposal plays into that. Maybe it can just slide in with minimal changes? The package_name is already included in the node's unique_id, so maybe they can work together?
Here's a proposed approach that hopefully fits in the framework you've outlined:
resources-path in which the model was found. Use '.' to separate path components. e.g. A file in 'models/dir1/dir2/model.sql' would have a name of 'dir1.dir2.model'.utils.find_in_list_by_name and utils.find_in_subgraph_by_name to first attempt to find an exact match, and then later default to looking only at the model name without the package prefix to match current behavior. However, If looking just at the model name, it will be necessary to find all such matches and raise an exception if two or more matches apply.We could do these changes without adding a namespacing mode, although adding a namespacing mode could help with forward compatibility issues with the manifest.json format.
As a proof-of-concept, here are examples of the changes I imagined to utils.find_in_list_by_name:
def find_in_list_by_name(haystack, target_name, target_package, nodetype):
"""Find an entry in the given list by name.
target_name could either be the base of the model name or could include the
package prefix separated with '.' characters.
"""
base_name_matches = []
for model in haystack:
name = model.get('unique_id')
if id_matches(name, target_name, target_package, nodetype, model):
return model
if id_matches(name, target_name.split('.')[-1], target_package, nodetype, model):
base_name_matches.append(model)
if len(base_name_matches >= 2):
raise_compiler_error(
'Ambiguous reference %s has multiple matches: %s' % target_name, base_name_matches)
if len(base_name_matches == 1):
return base_name_matches[0]
return None
(And then also update id_matches to not blow up when the target_name contains '.')
Note that as a side-effect of including the full package prefix in the target_name, the compile tests in _load_schema_tests no longer fail because they're using the fully qualified name.
Hey @heisencoder - I think everything you're saying here is totally reasonable. The big challenge I can imagine is that we've always held as a hard constraint that resource names are unique. As such, there may be pernicious failures in parts of the codebase which rely on these model names being unique. We'll definitely have to figure out how to adjust the unique_id attribute of nodes, as it would be _very_ bad if those unique ids were no longer unique :)
I'm supportive of this idea, but we need to figure out:
ref function context that are required:schema.yml works when models can have duplicate namesThis is probably going to be a pretty big project!
Given the large number of edge cases, I think it makes sense to flag control this new code so that old code doesn't have to worry about these edge cases. So I'm thinking that this could be controlled by a use_namespaces = true option at the top-level of the dbt_project.yml file. We could theoretically allow turning this off or on at different levels of the models section of the dbt_project.yml, but I'm hoping to avoid allowing that level of configuration.
The unique_ids for the nodes would then become the FQN path instead of just the base filename. This should make them be unique in all contexts.
I can provide some suggested documentation text.
I hadn't thought about the name parameter in the schema.yml files, but here are my two thoughts:
schema.yml file. I'm guessing that in general, each directory will have its own schema.yml file that describes the models in the same directory. Since this is convention, I'm guessing that this won't always be followed.models.name field to include the dotted FQN to the model name. This would allow schema.yml files to describe models in other directories.Why not just copy what other programming environments do (python) and make the folder the package name? As a newbie, I expected it already worked list this. I think this should be the default with an option to disable it that allows legacy users to migrate.
Also, I will add that the source selector works about like I'd want from a usage point of view. That may provide a better way to migrate as well. Any time the old ref() is called, the uniqueness for that model name is ensured, and generates and exception if there are duplicates. A new ref that uses packages (or maybe just a new arg!) would limit this check to the namespace that was specified.
I love the source('my_schema', 'my_table') syntax. To echo @preston-hf, could we replicate that syntax for ref with ref('my_schema', 'my_model')? If the schema name is set in schema.yml or dbt_project.yml, then that is the model's schema/namespace. If not set, then we keep the current behavior.
hey @rpedela-recurly - we definitely don't want to namespace things by schema name! Destination schemas are configuration in dbt, so the actual schema that a model renders into should be totally different in dev vs. prod.
You can actually _currently_ supply two arguments to the ref function, eg:
{{ ref('package_name', 'model_name') }}
In practice, this isn't super useful though -- model names must currently be unique across all packages included in a project. Increasingly, my thinking is that we should:
If two models share the same name, then you would need to qualify the model name with a package name in the ref statement. If the two models have the same database representation, then dbt could raise a compiler error.
Do you buy that?
I didn't realize you could define package_name. Number 1 would suffice I think then, and at least I would make package_name = schema_name, but there would be more flexibility for others. As far as defining package/model relations, I still like setting the package for a set of models in either dbt_project.yml or schema.yml. But I also like having the ability to keep packages the way they are, as I see them being useful for large projects with many contributors. What if ref was updated to allow this:
ref('package', 'namespace_within_package', 'model')
ref('namespace_within_project', 'model')
Then namespace is defined in schema.yml or dbt_project.yml. I really like having the source/ref syntax look like SQL schema_name.table_name even if the real schema differs because of dev/prod environments.
@drewbanin I like the two notions of making packages within a project easy, and dropping the name uniqueness across packages.
This is similar to ideas that Betterment has been tossing around _a lot_ lately. We have models across a small but slowly growing number of internal domains that all belong in the same project, but have different owners and use cases depending on the domain. We want to relax the requirement for unique model names across these domains (_especially_ for ephemeral models).
We were thinking of suggesting custom schema names (not the actual materialized schema names, which may differ from run to run) as a namespace, but I think if we could easily create intra-project packages, that could suffice. The requirements that we've been thinking about are as follows (let's say for these requirements that namespace could either be the name of a package or of a custom schema):
"model_name" or ("namespace", "model_name").x but different namespaces A and B:ref('x') from a model in a different namespace (not A or B) fails in compile because the reference is ambiguousref('A', 'x') from. a model in a different namespace than A is fineref('x') from a model in namespace A implies ref('A', 'x') (i.e., DBT first it checks whether x is unambiguous and, if it is not, tries to find 'A', 'x')y within the same namespace, or that materialize to the same schema, do cause a compile failure (as is the current behavior)I know everyone on our team would also be thrilled if _ephemeral_ models within different folders of the same package could be similarly named without creating conflicts as well, but the points above are more important.
Also @drewbanin, when you say "make it easier to define multiple packages inside of a single project", what do you imagine that looks like? Something along the lines of the following structure, where DBT perhaps infers the package names of sub-projects, and maybe inherits project config from the "parent"?:
project/
|-- domain1/
| |-- models/
| +-- tests/
|
|-- domain2/
| |-- models/
| +-- tests/
|
|-- models/
|
|-- tests/
|
|-- dbt_project.yml
+-- packages.yml
# project/packages.yml
packages:
- local: domain1
- local: domain2
# project/dbt_project.yml
name: 'company_project'
version: '1.0'
models:
domain1:
schema: d1
domain2:
schema: d2
Maybe even allowing the packages definition to be folded into the _dbt_project.yml_ file?
I am very interested in being able to namespace objects like models and sources and macros鈥攁t least in a rudimentary fashion by removing the requirement that model names be unique even between packages鈥攁nd I'd be happy to try to put together a PR if someone more familiar with the project can give me a little guidance on where to get started.
hey @mjumbewu - thanks for writing all of this out!
I really like your suggestions on how we should approach model-level namespacing: it's really sensible and well-specified.
Re: making namespacing easier: I like the general idea of your suggestion:
packages:
- local: domain1
- local: domain2
The only issue with this approach is that you need to run dbt deps to create symlinks from dbt_modules/domain1/ to domain1/. This would be a little bit circuitous and it doesn't work incredibly well on Windows (you know, no symlinks).
Instead, I think we could do one of the following:
1. Smarter local packages
We could add a config like symlink to the packages.yml file for local: packages. If this value is false, then dbt wouldn't try to symlink the package into dbt_modules/ and would instead point to the package dir directly.
Pros:
Cons:
packages.yml file and use it to inform parsing/running, but I don't think that's how things work today2. A new dbt_project.yml config
The solution to having too many configs is to add another config. We could add a top-level dbt_project.yml config like package-paths. This config would let you enumerate paths in your project that contain dbt packages. Given the example you showed above, that would look like:
project/
|-- domain1/
| |-- models/
| +-- tests/
|
|-- domain2/
| |-- models/
| +-- tests/
|
|-- models/
|
|-- tests/
|
|-- dbt_project.yml
+-- packages.yml
# dbt_project.yml
source-paths: ['models']
package-paths: ['domain1', 'domain2']
Alternatively, we could get clever with source-paths and support something like:
source-paths:
project: ['models']
packages: ['domain1', 'domain2']
But I don't think I _love_ that approach.
Pros:
dbt_modules for packages, and this would be an extension to that logicCons:
3. config-level namespaces
This was the one I had in mind when I wrote about making namespacing easier, but increasingly, I don't think this approach is such a good idea. I was picturing a models: level config which supports a namespace as a config value. That would look like:
models:
my_project:
path1:
namespace: domain1
path2:
namespace: domain2
All of the models in models/path1 would be namespaced under the package domain1, and all of the models in models/path2 would be namespaced under the package domain2.
Pros:
models: config is only intended to configure models, but we'd want to namespace other resources (say, tests/snapshots/seeds/etc) in a package-aware way tooThese are the things bouncing around in my head. I think approach 2 outlined above is going to be our best bet, but I'm curious what you all think too.
Hi there, just wanted to continue that discussion
I'm a beginner on DBT, so I would just like to give my opinion from a "beginner" point of view.
First I would like to point out that this would be useful for Seeds as well, consider the following simple use-case:
2 pipelines are being worked on, they target 2 different databases, A and B, this is the folder structure
data/a/calendar.csv
data/b/calendar.csv
The 2 pipelines target 2 different Databases, so their names do not clash when they run against the Warehouse. But A needs a different fiscal calendar than B.
Now, as a beginner, my first intuition would be to use ref("my_proj", "a.calendar") because of the --select argument available to dbt seed
I think the "sub project" abstraction is quite hard to wrap my head around:
(I'm aware it's a summary/mix of some other solutions above)
Have the ability to call ref-able objects either by their full "module path" ref("my_proj", "a.calendar") or their name "excluding module" ref("my_proj", "calendar"), note that ref("a.calendar") or ref("calendar") should be valid if it's not ambiguous cross project
In the function to resolve the object, we would look at a.calendar and see that there aren't duplicate objects returned, in case calendar is used, we could return an Exception "Ambiguous ref: use either a.calendar or b.calendar" . I'm actually not familiar with how the code deals with ambiguous objects cross packages but I would imagine part of the logic could be re-used.
I'm not very experienced with the generated documentation, maybe we could display names of objects as full explicit references in case of ambiguity (perhaps by running a version of the above logic on all objects to be rendered), or just the object name otherwise. Again I'm not familiar with how ambiguous objects cross packages are rendered on the docs but it could inform implementation.
Pros:
Cons:
@drewbanin Where does this go from here? Is this the 'DBT extension proposal'? Do you need to make a BDFL pronouncement?
Was shocked to discover that this is an outstanding issue still. I've always used multiple packages with qualified refs and I guess just got lucky that I never repeated a model name.
Most helpful comment
@drewbanin I like the two notions of making packages within a project easy, and dropping the name uniqueness across packages.
This is similar to ideas that Betterment has been tossing around _a lot_ lately. We have models across a small but slowly growing number of internal domains that all belong in the same project, but have different owners and use cases depending on the domain. We want to relax the requirement for unique model names across these domains (_especially_ for ephemeral models).
We were thinking of suggesting custom schema names (not the actual materialized schema names, which may differ from run to run) as a namespace, but I think if we could easily create intra-project packages, that could suffice. The requirements that we've been thinking about are as follows (let's say for these requirements that
namespacecould either be the name of a package or of a custom schema):"model_name"or("namespace", "model_name").xbut different namespacesAandB:ref('x')from a model in a different namespace (notAorB) fails in compile because the reference is ambiguousref('A', 'x')from. a model in a different namespace thanAis fineref('x')from a model in namespaceAimpliesref('A', 'x')(i.e., DBT first it checks whetherxis unambiguous and, if it is not, tries to find'A', 'x')ywithin the same namespace, or that materialize to the same schema, do cause a compile failure (as is the current behavior)I know everyone on our team would also be thrilled if _ephemeral_ models within different folders of the same package could be similarly named without creating conflicts as well, but the points above are more important.