Kedro: [KED-2143] Adding a ConfigLoader instance into hook specs params

Created on 9 Sep 2020 · 7 Comments · Source: quantumblacklabs/kedro

Description

When developing a kedro plugin, I regularly need to access configs, and potentially some plugin-specific config files. Since the plugin uses the hook mechanism, I can no longer bring arbitrary context attributes into my hook implementation (only the parameters defined in the hook specs are available).

Context

Here in the kedro-mlflow plugin we were forced to redefine a ConfigLoader instance inside the plugin.
That led to inconsistencies between the context's ConfigLoader property and the new ConfigLoader created inside the hook.

Other plugins will need this functionality. I imagine a kedro-spark plugin that uses the hook mechanism and accesses a Spark config file from the project folder (spark.yml), or a kedro-sas plugin that does the same thing (getting configs in order to create a parametrized session).

Possible Implementation

A possible implementation is to pass the context config_loader to the hook.

hook specs

    @hook_spec
    def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog, config_loader: ConfigLoader
    ) -> None:
        pass

context

hook_manager = get_hook_manager()
hook_manager.hook.before_pipeline_run(  # pylint: disable=no-member
    run_params=record_data,
    pipeline=filtered_pipeline,
    catalog=catalog,
    config_loader=self.config_loader,
)
Feature Request

All 7 comments

Hi @takikadiri, you've highlighted a very good point. We thought about this and we've actually added a set of hooks to register library components, such as pipelines, data catalog, and config loader, with a Kedro project. I think this might solve your use case.
This functionality will be made available in 0.16.5, which is going to be released very soon. :)

Thank you @lorenabalan for the quick reply! It's really great to be able to register library components such as the config loader, and I will certainly use it.
But my point here is that there is no way to pass the config loader instance (created with register_config_loader) to another hook, let's say the before_pipeline_run hook.
There may be something that escapes me about the hook mechanism :)

Hello @lorenabalan, I am not sure if I miss the point but I think this is not what is at stake here, correct me if I'm wrong.

I don't know if this is the best place to write this or if it should be in another issue, but here is a more detailed description of the problem and a discussion of different design decisions and potential solutions.

Description

Since hooks were released in kedro==0.16.0, they have become a popular tool among developers who create kedro plugins (to be honest the community is small but quite active 😉).

It is a common pattern for a hook to need access to configuration files (for instance to create a session with an external tool using credentials, to use parameters inside the hook and, more likely in the case of kedro-mlflow, to use a custom config file for the plugin).

I personally feel that this configuration file access must be template-independent. The hook is not supposed to assume anything about the template (which may be changed by the user), since the ProjectContext already has all the necessary information (i.e. mainly the instantiated ConfigLoader, but potentially other attributes of the ProjectContext). If the hook needs to recreate any attribute of the ProjectContext (for instance the ConfigLoader), there is a high risk that the hook behaves differently from the ProjectContext, which is something we absolutely want to avoid.

Concrete use cases:

Use case 1: Accessing proprietary configuration file inside hook

  • Let's say that the user modifies the register_config_loader (for instance to use the TemplatedConfigLoader in your documentation):
from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict={"param1": "pandas.CSVDataSet"}
        )
  • Now imagine I have another plugin that must access its own configuration file (say mlflow.yml) inside hook calls. For instance
class MlflowNodeHook:
    @hook_impl
    def before_node_run(
        self,
        node: Node,
        catalog: DataCatalog,
        inputs: Dict[str, Any],
        is_async: bool,
        run_id: str,
    ) -> None:
        # get the config loader of the current context
        # actually, config_loader is not available here, this magic function does not exist!
        # I eventually need to get the one registered in the project
        config_loader = get_config_loader()

        # do whatever I want using the conf and implementing my own logic
        conf_mlflow = config_loader.get("mlflow*", "mlflow*/**")
        do_my_own_logic(conf_mlflow)

Use case 2: Accessing credentials file inside hook

Let's say that I want to create a global connection to a remote server (say SAS) and interact with it before/after each node, and eventually inside a node:

class MlflowPipelineHook:
    @hook_impl
    def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
    ) -> None:
        # get the config loader of the current context
        # actually, config_loader is not available here, this magic function does not exist!
        config_loader = get_config_loader()

        # do whatever I want using the conf and implementing my own logic
        credentials = config_loader.get("credentials*", "credentials*/**")
        saspy.SASsession(credentials)

@WaylonWalker @deepyaman You guys seem to develop a lot of hooks, are these use cases hitting you too? I see you sometimes use environment variables to configure your hooks, and I guess that is somehow related to this.

Overview of solutions to this problem

Existing solutions

Existing solution 1 : recreate config loader locally

For instance, use case 1 would become:

class MlflowNodeHook:
    @hook_impl
    def before_node_run(
        self,
        node: Node,
        catalog: DataCatalog,
        inputs: Dict[str, Any],
        is_async: bool,
        run_id: str,
    ) -> None:
        # recreate the config loader manually
        conf_paths = [
            # these attributes are not accessible outside the context, so in practice they must be hardcoded
            str(self.project_path / self.CONF_ROOT / "base"),
            str(self.project_path / self.CONF_ROOT / self.env),  # env is not accessible either
        ]
        hook_manager = get_hook_manager()
        config_loader = hook_manager.hook.register_config_loader(  # pylint: disable=no-member
            conf_paths=conf_paths
        ) or ConfigLoader(conf_paths)

        # do whatever I want using the conf and implementing my own logic
        conf_mlflow = config_loader.get("mlflow*", "mlflow*/**")
        do_my_own_logic(conf_mlflow)

Pros:

  • it is a "better than nothing" solution

Cons:

  • project_path is hardcoded
  • env is not accessible
  • code is not reused, so the hook breaks if CONF_ROOT changes

This is neither a reliable, nor a flexible, nor a maintainable way to do this.

Existing solution 2 : reload context when possible

Some hook methods have access to some of the project context attributes: for instance, after_catalog_created can access credentials, while before_pipeline_run and after_pipeline_run can access project_path. In these methods, we can call load_context(project_path) to access all of the context attributes.

Pros:

  • it works perfectly

Cons:

  • it recreates a whole context object for very little use and might get slow
  • this solution does not work for all hooks (before_node_run and after_node_run do not have access to the project_path for instance)
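
A minimal sketch of this reload pattern, written so that it runs without Kedro installed. FakeConfigLoader, FakeContext, fake_load_context and ContextReloadingHook are hypothetical stand-ins, and the sketch assumes project_path arrives through run_params; in a real plugin the injected callable would be Kedro's load_context and the method would carry @hook_impl.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Any, Callable, Dict


class FakeConfigLoader:
    """Hypothetical stand-in exposing a ConfigLoader-like get(*patterns)."""

    def __init__(self, files: Dict[str, Dict[str, Any]]) -> None:
        self._files = files

    def get(self, *patterns: str) -> Dict[str, Any]:
        # Merge the content of every "file" whose name matches a pattern.
        merged: Dict[str, Any] = {}
        for name, content in self._files.items():
            if any(fnmatch(name, pattern) for pattern in patterns):
                merged.update(content)
        return merged


class FakeContext:
    """Hypothetical stand-in for the object returned by load_context."""

    def __init__(self, project_path: Path) -> None:
        self.project_path = project_path
        self.config_loader = FakeConfigLoader({"mlflow.yml": {"tracking_uri": "mlruns"}})


def fake_load_context(project_path: Path) -> FakeContext:
    """Hypothetical stand-in for load_context(project_path)."""
    return FakeContext(Path(project_path))


class ContextReloadingHook:
    """Rebuilds the context from the project_path found in run_params."""

    def __init__(self, load_context: Callable[[Path], Any] = fake_load_context) -> None:
        self._load_context = load_context
        self.conf_mlflow: Dict[str, Any] = {}

    def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        # A whole context is rebuilt just to reach its config loader.
        context = self._load_context(run_params["project_path"])
        self.conf_mlflow = context.config_loader.get("mlflow*", "mlflow*/**")


hook = ContextReloadingHook()
hook.before_pipeline_run({"project_path": "/tmp/my-project"})
```

Storing the loaded configuration on the hook instance, as above, is one way to reuse it in later hook calls that do not receive project_path, provided the same hook object handles both.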

Existing solution 3 : assume call is made at the root of the kedro project and go back to solution 2

For the hooks without access to the project_path, call load_context() without the project_path argument.

Pros:

  • it is the best solution I've found so far

Cons:

  • it assumes your working directory is the root of the kedro project
  • it prevents the user from using kedro interactively with a different working directory, and may also break the use of kedro inside a jupyter notebook.
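
One possible way to soften the working-directory assumption is to search upwards for the project root before loading the context. The helper below is a hypothetical sketch, not part of Kedro; it assumes the project root is the folder that contains a .kedro.yml file.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, marker: str = ".kedro.yml") -> Path:
    """Walk up from `start` (default: the working directory) until `marker` is found.

    Hypothetical helper: the marker file name is an assumption about the template.
    """
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker} found in {current} or any parent folder")
```

The returned path could then be passed to load_context(project_path). This still fails when the session is started outside the project tree, so it only narrows the problem.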

Potential solutions which need development on Kedro's side

Solution 1: Add config loader to all hooks

As the title of this issue states, a solution would be to add the config loader to the parameters of each @hook_spec to make it accessible within hooks.
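
A rough sketch of why this could stay backward compatible: pluggy lets a hook implementation declare only a subset of the arguments in the spec, so existing hooks would not have to change. The dispatcher below is a simplified stand-in for that behaviour, not pluggy itself, and every name in it is hypothetical.

```python
import inspect
from typing import Any, Callable, Dict, List


def call_hook(implementations: List[Callable[..., Any]], **kwargs: Any) -> List[Any]:
    """Call each implementation with only the keyword arguments it declares."""
    results = []
    for impl in implementations:
        accepted = inspect.signature(impl).parameters
        results.append(impl(**{k: v for k, v in kwargs.items() if k in accepted}))
    return results


def legacy_before_pipeline_run(run_params: Dict[str, Any]) -> str:
    # Written before config_loader was added to the spec; works unchanged.
    return f"legacy:{run_params['run_id']}"


def new_before_pipeline_run(run_params: Dict[str, Any], config_loader: Dict[str, Any]) -> str:
    # Opts in to the new argument and reads the plugin's own configuration.
    return f"new:{config_loader['mlflow']['tracking_uri']}"


results = call_hook(
    [legacy_before_pipeline_run, new_before_pipeline_run],
    run_params={"run_id": "2020-09-09"},
    # A plain dict stands in for the context's config loader here.
    config_loader={"mlflow": {"tracking_uri": "mlruns"}},
)
```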

Solution 2: Use the KedroSession ?

By digging in the code, I noticed a merged but not yet documented feature called KedroSession. This creates a global variable which is accessible without any assumption about the template just by calling get_current_session(), and it contains the context, hence the ConfigLoader. It should be accessible in the hooks.
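
The idea can be illustrated with a generic sketch of a globally reachable session. This is not the KedroSession implementation, whose design was not documented at the time; the Session class, its dict-based context and the module-level variable are all hypothetical, and only the get_current_session name comes from the text above.

```python
from typing import Any, Dict, Optional

# Module-level slot holding the active session (hypothetical design).
_active_session: Optional["Session"] = None


class Session:
    """Hypothetical session that carries the project context while active."""

    def __init__(self, context: Dict[str, Any]) -> None:
        self.context = context

    def __enter__(self) -> "Session":
        global _active_session
        _active_session = self
        return self

    def __exit__(self, *exc_info: Any) -> None:
        global _active_session
        _active_session = None


def get_current_session() -> Session:
    """Return the active session, or fail loudly when there is none."""
    if _active_session is None:
        raise RuntimeError("No active session")
    return _active_session


def before_node_run() -> Dict[str, Any]:
    # The hook needs no extra parameter: it asks the session for the context.
    return get_current_session().context["config_loader"]["mlflow"]


with Session({"config_loader": {"mlflow": {"tracking_uri": "mlruns"}}}):
    conf_mlflow = before_node_run()
```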

Pros:

  • the problem will be solved very smoothly (and in an even more general way, since any attribute of the context will be accessible).

Cons:

@DmitriiDeriabinQB it seems you are the one developing KedroSession, is this how it is intended to be used in the future?

@Galileo-Galilei Steel Toes utilizes the project's context by defining your hooks as a property on the ProjectContext rather than a list.

from kedro.framework.context import KedroContext
from steel_toes import SteelToes

class ProjectContext(KedroContext):
   project_name = "kedro0160"
   project_version = "0.16.1"
   package_name = "kedro0160"

   @property
   def hooks(self):
      self._hooks = [ SteelToes(self), ]
      return self._hooks

You can see where the context is used inside the hook here. I do feel like this is a bit of a hack and asks users to implement hooks on their project in a non-traditional way. The next upcoming change will make the context argument optional. Note that context contains a config_loader property that might be useful for you.

I would really like to get access to the project's context inside of a hook, especially if we could configure hook behavior inside of .kedro.yml. I think this would align with how plugins work on pytest. I do not know how it works, but I know that when using a plugin like pytest-cov you can pass in command-line arguments, or add to a config file to configure how it is run. https://pytest-cov.readthedocs.io/en/latest/config.html.

Hello @WaylonWalker and thanks for the reply. This is a clever hack and works like a charm, but it breaks auto-discovery and configuration in kedro.yml as you mention (not to mention that it is a user-facing change, even if it is easy to set up). I feel that it can be a temporary way to make the hook more stable, but it is definitely not a long-term solution and this should be integrated into kedro core IMHO. Aligning with pytest sounds reasonable indeed.

By the way, it seems @tamsanh is hitting the same problem and needs to access the context inside his KedroWings hook to be able to use it interactively (which is the same issue some users mention here and here in kedro-mlflow).

Some tests on KedroSession look promising (initialise a session in before_pipeline_run and retrieve it anywhere you need it), but I don't want to rely on it since it is explicitly mentioned in the script that it is not stable and may change even between releases.

@Galileo-Galilei you are right in assuming that KedroSession has been designed to eventually become responsible for carrying KedroContext (and project data in general), which would make the use case that you've described much less painful. Hence, as you have already noticed, it has been made a singleton to ensure its accessibility from hooks, for example.

However, this is still a work in progress and currently it's not at the stage where we can officially announce it and freeze the design. The general idea is that KedroSession will gradually take over the responsibility for the lifecycle events, while KedroContext will be treated as a "gatekeeper to the library components" (definition by @limdauto) in a new model.

Thanks for the insightful answer. I'll use one of the solutions above as a better-than-nothing way to achieve what I want, and wait for KedroSession to become more official :smile: If you need beta-testers, feel free to ask!
