Kedro: Allow extra hooks to be passed via the Kedro CLI

Created on 14 Jul 2020 · 8Comments · Source: quantumblacklabs/kedro

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

I should be able to run with or without a set of optional hooks. For example:

kedro run --hooks src.hookshot.hooks.TeePlugin

kedro run --hooks src.hookshot.hooks.TeePlugin,src.hookshot.hooks.CachePlugin

Context

Why is this change important to you? How would you use it? How can it benefit other users?

Not all hooks are core to functionality. Using something like a debugging hook shouldn't require a code change.

This is also helpful for testing different hooks as part of a CI/CD process.

Possible Implementation

See https://github.com/deepyaman/hookshot/commit/0f792f9112031fc2642d30e161c21a0b0c9f539d#diff-ebf803a458716ccda133fadc42e45057 (would add it to KedroContext though).

Feature Request

Source

deepyaman

👍1

Most helpful comment

@deepyaman your timing is uncanny. We are building this as we speak 😂

However, we are going with an opt-out model, hence @yetudada's question. Basically, after user installs a plugin, we will automatically register the plugin's hooks through a dedicated kedro.hooks entrypoint. This is similar to how global_commands and project_commands are working at the moment. This is the pytest's model. We will also provide a mechanism for users to disable hooks that they don't want to run on a per-run basis.

I'm curious what you and @WaylonWalker think about these two models. Do you think an opt-out (auto-discovery) model or an opt-in model (your proposal) would work better?

limdauto on 15 Jul 2020

🎉2

All 8 comments

Hi Deepyaman, thanks for creating this! If I play back what you would like to do, you would like to be able to use different Hooks in Kedro without modifying your code? So Kedro automatically discovers them?

yetudada on 15 Jul 2020

Hey @yetudada! Kedro doesn't need to automatically discover them. For example:

kedro run --hooks src.hookshot.hooks.TeePlugin

kedro run --hooks src.hookshot.hooks.TeePlugin,src.hookshot.hooks.CachePlugin

If the project context already had hooks = (ExistingPlugin(),) defined, the latter would be equivalent to modifying the code to be hooks = (ExistingPlugin(), TeePlugin(), CachePlugin()).

deepyaman on 15 Jul 2020

Using something like a debugging hook shouldn't require a code change.

👏 Well said. It would be great to be able to run ad-hoc/one-off hooks on the fly without a code change.

For now/alternatively you could use an env variable to toggle what the hook does. This requires you to implement the reading of env variables in your hook, and implementing the hook on the pipeline.

WaylonWalker on 15 Jul 2020

@deepyaman your timing is uncanny. We are building this as we speak 😂

I'm curious what you and @WaylonWalker think about these two models. Do you think an opt-out (auto-discovery) model or an opt-in model (your proposal) would work better?

limdauto on 15 Jul 2020

🎉2

Off the top of my head, a potential issue I foresee with the auto-discovery model is users who install a bunch of plugin hooks for testing, or—_shudder_—use a single base conda env with Kedro installed for all their projects. This isn't a big issue for commands, because it's just your kedro -h output littered with commands you don't need; however, when it comes to plugins, do they need to uninstall or explicitly disable all the unused hooks? I'm curious what the mechanism to opt-out would be, maybe this isn't such a big deal.

I think opt-in is also easier for my use case, where I want to benchmark raw pipeline vs pipeline + plugin1 vs pipeline + plugin 2, and I like the explicitness, but these are probably smaller concerns from a dev perspective.

deepyaman on 16 Jul 2020

Hi @deepyaman thank you for sharing, these are some really interesting questions and comments. I'm adding some of my thoughts, let me know what you make of them.

It feels like the opt-in / opt-out choice really just depends on how easy it is to opt out. Indeed uninstalling stuff would be a hassle. I saw pytest uses an env variable to blanket-disable auto-discovery, which we considered for Kedro. But they also have another environment variable to specify which plugins to use (or just a variable), so maybe something similar can be mirrored in Kedro (it’d be different conf environments for different plugin combinations, so for your benchmarking example you'd have 3 environments). Hooks behaviour is sth that feels too heavy to just be configured at runtime, because it strongly interacts with or change the run flow, to me it’d be safer as static config.

There’s a model in my head that makes me quite reticent about feeding hooks through the CLI, where it’s not a developer that messes with the job spec, so the less technical it is, or the less code knowledge you need to understand it, the better. The person on support doesn’t need to understand what hooks to activate, they can just switch to a different environment and job done, if the env is there. If not, a developer should create it, which feels _intentional_.

lorenabalan on 16 Jul 2020

👍1

I really do like the simplicity of installing pytest plugins and they just work. For the most part though the auto-discovered plugins do not change my test. Typically those only change outside the scope of a single test (im thinking about pytest-watch, parallel runner, coverage, sugar). Things that actually change the execution of my test generally require me to add it in the code by function run, or fixture. I feel that kedro plugins are able to so easily fall into the second category and change the output of your pipeline that it should not be autodiscovered.

As much as we --__shutter__-- about users with a single base environment it's a common workflow. Much of the learning content focuses very heavily on the details of manipulating DataFrames that most new data scientists are missing out on how to manage their local dev environment, and where to place their solution in the codebase.

I think @lorenabalan brings up a good point about support. If I am running support on an issue I want to quickly get some information about it before making some big code changes. It would be nice to be able to quickly add a retry on fail, or full failure report hook without a change to the prod environment.

Whether this just sits in your teams template (hooks list) and turned off by default, or the framework allows it to be enabled with a flag isn't a big differnence to me. Both require the pre-thought that we may want some extra information from PROD at some point. Both solutions require me to reach into PROD to install the hook, making the code change to add it to the list is not a big deal. note my PROD env is using kedro-docker, other folks with a different prod solution may have a different opinion.

WaylonWalker on 16 Jul 2020

It feels like the opt-in / opt-out choice really just depends on how easy it is to opt out.

Agree that this is of paramount importance.

But they also have another environment variable to specify which plugins to use (or just a variable), so maybe something similar can be mirrored in Kedro

Would be happy with this. It's the current way of specifying hooks plus an alternative solution to this issue, which is to specify the same (almost same, as it seems to overwrite rather than append) list as an environment variable.

(it’d be different conf environments for different plugin combinations, so for your benchmarking example you'd have 3 environments).

Eeeh. I don't think I like only depending on this because Kedro only supports one level of inheritance in conf. If I already have a dev and prod env, and I want to run my benchmarks for dev, I need to create dev-hooklist1, dev-hooklist2, etc.? Maybe this is just a separate issue, since struggles with conf inheritance aren't unique to this use case.

There’s a model in my head that makes me quite reticent about feeding hooks through the CLI, where it’s not a developer that messes with the job spec, so the less technical it is, or the less code knowledge you need to understand it, the better. The person on support doesn’t need to understand what hooks to activate, they can just switch to a different environment and job done, if the env is there. If not, a developer should create it, which feels _intentional_.

From a support perspective, this makes sense. The current implementation only allows one set of hooks to be defined across envs, and your aforementioned suggestion to have hooks defined on a per-env basis makes this more flexible.

However, from a dev perspective, I want more control. Creating a new env mirroring a non-base env isn't trivial, so the process is heavy and annoying. On our last project, we were basically replacing files in config and reloading to get around this (despite already having a lot of copy-paste config at higher levels).

Can we support both? It's not like the extra_hooks implementation pattern is foreign to Kedro, as it's similar to how extra_params are handled.

deepyaman on 19 Jul 2020

Was this page helpful?

0 / 5 - 0 ratings