Kedro: [KED-1512] Ability to add catalogs together

Created on 26 Mar 2020 · 8 Comments · Source: quantumblacklabs/kedro

Description

I would like to request the ability to add two DataCatalog objects together to get a single catalog containing the data sets of both. This would make it easier to reuse pipelines and catalogs from existing projects.

Example

Let's say I already have two DataCatalogs loaded in c1 and c2. I would like the ability to add them together.

full_catalog = c1 + c2

Context

I am trying to combine pipelines from two different projects. Kedro natively gives me the ability to add the pipelines together, but not the DataCatalogs.

Possible Implementation

Create a new DataCatalog using dictionary unpacking from the individual DataCatalogs. In the case of duplicate data set names, the second catalog takes precedence.

def __add__(self, other):
    # Later keys win, so data sets in `other` override same-named ones in `self`.
    return DataCatalog({**self._data_sets, **other._data_sets})
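To make the proposed semantics concrete, here is a minimal, self-contained sketch. The DataCatalog class below is a hypothetical stand-in written only for this illustration (it is not kedro.io.DataCatalog), and plain strings take the place of real data set objects. It only shows how dictionary unpacking resolves a name clash in favour of the right-hand catalog.

```python
class DataCatalog:
    """Hypothetical stand-in for kedro.io.DataCatalog, for illustration only."""

    def __init__(self, data_sets=None):
        # Copy the mapping so the new catalog never shares a dict with its inputs.
        self._data_sets = dict(data_sets or {})

    def list(self):
        # Return the registered data set names in sorted order.
        return sorted(self._data_sets)

    def __add__(self, other):
        if not isinstance(other, DataCatalog):
            return NotImplemented
        # Later keys win: on a name clash, the entry from `other` replaces ours.
        return DataCatalog({**self._data_sets, **other._data_sets})


# Plain strings stand in for real data set objects.
c1 = DataCatalog({"cars": "cars.csv", "boats": "boats_v1.csv"})
c2 = DataCatalog({"boats": "boats_v2.csv", "planes": "planes.csv"})

full_catalog = c1 + c2
```

Because the operator builds a new object, c1 and c2 are left unchanged, and c2 + c1 would keep the "boats" entry from c1 instead.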
Feature Request

All 8 comments

Thanks for raising this @WaylonWalker, I've added it to our issue tracker to discuss. :)

@WaylonWalker Sorry, why is this necessary? You can already combine multiple (sets of) catalog files by naming them starting with catalog at the root level, or placing them anywhere under a directory at the root level whose name starts with catalog. See https://github.com/quantumblacklabs/kedro/blob/0.15.8/kedro/context/context.py#L242.

This is for combining two entirely separate projects without copy/paste

Without copy-pasting what? Sorry if I'm being dense.

In your combined architecture, you can add the pipelines together. You still need your KedroContext, where you can customize the _get_catalog function, if you need to read from multiple hierarchies.

Could you provide a minimal example if I'm wrong?

In order to utilize catalogs from completely separate projects, started with different kedro run commands and stored in completely separate git repos, I would currently need to copy the YAML files from one project to the other. Am I correct in thinking that? Or is there another way to share catalogs from completely separate projects?

Hope this makes a bit more sense now.

You can override _get_catalog, changing:

conf_catalog = self.config_loader.get("catalog*", "catalog*/**")

to:

conf_catalog = self.config_loader.get(
    "catalog*",
    "catalog*/**",
    "/path/to/other_project/base/conf/catalog*",
    "/path/to/other_project/base/conf/catalog*/**",
)

or something, I believe (not tested). You could also create a second config loader.
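For the second-config-loader route, the two loaded configurations still have to be merged before the catalog is built. A rough sketch of that step follows, assuming each loader returns a plain dict keyed by data set name. The function merge_catalog_configs and the sample entries are hypothetical and not part of Kedro. Failing on duplicate names is a choice made for this sketch, and it is the opposite of the "second catalog wins" behaviour proposed in the issue.

```python
from typing import Any, Dict


def merge_catalog_configs(
    primary: Dict[str, Any], secondary: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge two catalog config dicts, refusing to silently override entries."""
    # Data set names that appear in both projects.
    duplicates = sorted(primary.keys() & secondary.keys())
    if duplicates:
        raise ValueError(f"Data sets defined in both projects: {duplicates}")
    return {**primary, **secondary}


# Hypothetical configs, standing in for what each project's loader returns.
this_project = {"cars": {"type": "CSVLocalDataSet", "filepath": "data/cars.csv"}}
other_project = {"boats": {"type": "CSVLocalDataSet", "filepath": "data/boats.csv"}}

conf_catalog = merge_catalog_configs(this_project, other_project)
```

The merged dict could then be handed to whatever builds the catalog inside the overridden _get_catalog. Relative file paths from the other project would likely need adjusting, since they were written relative to that project's root.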

It's not super clean since you're constructing based on paths outside the project, but neither is constructing a data catalog elsewhere IMO.

Also, sorry if I'm opining too much here... (I'm not part of the team or anything, just opinionated :P)

I had not thought about this as a solution before, so I will need to think through how I can integrate it. I have a group of folks who mostly do hypothesis testing and general analysis rather than pipeline building. I am trying to give them a way to utilize multiple catalogs together easily. I think being able to use the + operator would be the easiest and most natural for them. I will continue to think about the solution you proposed.

Hi @WaylonWalker! We believe that the suggestion made by @deepyaman is the recommended way to tackle this.

Catalogs are rarely created or manipulated by hand, and usually there is only one "master" catalog object that is used to feed data into the pipelines. Therefore, if you rely on KedroContext to construct the catalog object from all the necessary catalog configs, then using the built-in ConfigLoader feature to read multiple configs would make perfect sense.

The Pipeline is slightly different in that sense since it is always composed programmatically, and the project usually works with at least a handful of those. Therefore pipeline algebra is a useful "syntactic sugar" for the end users.

I will close the issue for now, but please let us know if this doesn't resolve your issue. And as always, thanks a lot for your input 👍
