First of all, thank you very much for the great framework. I haven't worked in notebooks for weeks since I switched to Kedro. Now I can debug everything and still have interactive access to my data and models. You've improved the quality of my professional life and I am deeply grateful for this.
Now, my proposal. I think that, in addition to the existing Python-based API for pipeline definition, it would be nice to be able to describe pipelines in YAML files. Moreover, the more I think about this, the more I'm convinced that this should be the default way of defining pipelines.
I've found that in 95% of cases my pipeline.py files are "static", meaning that I don't need to dynamically define pipeline nodes and their IO. Also (my subjective opinion), most "dynamic" use cases can be handled by config templates, and we already have this feature.
By putting pipeline definitions into conf we'll make modular pipelines easier to adjust and configure: any pipeline consumer will have a clear picture of the inputs and outputs used, which is extremely important since we have "global" dataset naming. We can also mix nodes from different pipelines at the consumer level (see the possible implementation below).
Another reason to use YAML-first pipeline definitions is that it could be much easier to integrate pipeline consistency checks into modern IDEs, since we just need to parse YAML files (parameters, pipelines and catalog) and check that the corresponding paths to Python modules are correct.
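To make the consistency-check idea concrete, here is a minimal sketch in plain Python. It assumes the pipeline, catalog and parameters files have already been parsed into dictionaries (for example with a YAML parser) and that nodes use the `handler`/`inputs`/`outputs` layout proposed in this issue. The helper names `dataset_names` and `find_problems` are hypothetical and not part of Kedro.

```python
import importlib
from typing import Any, Dict, List


def dataset_names(io: Any) -> List[str]:
    """Flatten a node's inputs/outputs (None, str, list or dict) into a list of names."""
    if io is None:
        return []
    if isinstance(io, str):
        return [io]
    if isinstance(io, dict):
        return list(io.values())
    return list(io)


def find_problems(
    pipeline_spec: Dict[str, Any],
    catalog: Dict[str, Any],
    parameters: Dict[str, Any],
) -> List[str]:
    """Return human-readable problems found in one parsed pipeline definition."""
    nodes = pipeline_spec.get("nodes", [])
    # Datasets produced inside the pipeline may live in memory only,
    # so they are not required to have a catalog entry.
    produced = {name for n in nodes for name in dataset_names(n.get("outputs"))}
    problems: List[str] = []
    for n in nodes:
        module_name, _, attr = n["handler"].rpartition(".")
        try:
            # Importing executes the module; a real IDE integration would
            # more likely inspect the source statically instead.
            module = importlib.import_module(module_name) if module_name else None
        except ImportError:
            module = None
        if module is None or not hasattr(module, attr):
            problems.append(f"handler not found: {n['handler']}")
        for name in dataset_names(n.get("inputs")):
            if name == "parameters":
                continue  # the whole parameters dict is always available
            if name.startswith("params:"):
                # Only the top-level parameter key is checked in this sketch.
                key = name[len("params:"):].split(".")[0]
                if key not in parameters:
                    problems.append(f"unknown parameter: {name}")
            elif name not in catalog and name not in produced:
                problems.append(f"unknown dataset: {name}")
    return problems
```

An editor plugin could run a check like this on every save and underline the offending lines. Doing the same for a Python pipeline definition would require executing or statically analysing arbitrary code.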
We can use the same structure as in the current pipeline.py and hooks.py created by starters, and the same conventions as in catalog.yml.
Let's say I have the following pipeline in src/my_package/pipelines/analysis/pipeline.py:
    from kedro.pipeline import Pipeline, node

    from .nodes import select_important_features, visualize


    def create_pipeline(**kwargs):
        return Pipeline(
            [
                node(
                    select_important_features,
                    inputs=dict(
                        dataset='05_model_input/inliers.parquet',
                        cluster_labels='07_model_output/cluster-labels.parquet',
                    ),
                    outputs='08_reporting/important-features.yml',
                ),
                node(
                    visualize,
                    inputs=dict(
                        dataset='05_model_input/inliers.parquet',
                        important_features='08_reporting/important-features.yml',
                    ),
                    outputs=dict(
                        charts='08_reporting/important-features-charts.html',
                        histograms='08_reporting/important-features-histograms.html',
                    ),
                    tags=['html_reports'],
                ),
            ],
            tags=['cluster_analysis'],
        )
I also have to register it in src/my_package/hooks.py:
    from typing import Dict

    from kedro.framework.hooks import hook_impl
    from kedro.pipeline import Pipeline

    from my_package.pipelines import analysis


    class ProjectHooks:
        @hook_impl
        def register_pipelines(self) -> Dict[str, Pipeline]:
            analysis_pipeline = analysis.create_pipeline()
            # ...
            return {
                'analysis': analysis_pipeline,
                # ...
            }
With a YAML API we could do the same in conf/base/pipelines/analysis/pipelines.yml:
    analysis:
      nodes:
        # or even `- handler: select_important_features` since the pipeline is modular and we have agreement that node handlers live in `nodes.py`
        - handler: my_package.pipelines.analysis.nodes.select_important_features
          inputs:
            dataset: 05_model_input/inliers.parquet
            cluster_labels: 07_model_output/cluster-labels.parquet
          outputs: 08_reporting/important-features.yml
        - handler: my_package.pipelines.analysis.nodes.visualize
          inputs:
            dataset: 05_model_input/inliers.parquet
            important_features: 08_reporting/important-features.yml
          outputs:
            charts: 08_reporting/important-features-charts.html
            histograms: 08_reporting/important-features-histograms.html
          tags:
            - html_reports
      tags:
        - cluster_analysis
From this we can automatically infer the name of the registered pipeline, so no annoying boilerplate in hooks.py is required. It will also be much easier to mix nodes from several modular pipelines. For example, in the use case presented here we might have a global reports_publishing pipeline that publishes HTML reports.
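A possible loader for such a file could look roughly like the sketch below. It assumes the YAML has already been parsed into a dictionary shaped like the example above. `NodeSpec`, `resolve_handler` and `build_node_specs` are hypothetical names, and the standard-library functions in the demo merely stand in for real node handlers.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class NodeSpec:
    """Everything needed to create one pipeline node."""
    func: Callable[..., Any]
    inputs: Any
    outputs: Any
    tags: List[str] = field(default_factory=list)


def resolve_handler(dotted_path: str) -> Callable[..., Any]:
    """Import `package.module.function` and return the function object."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"handler must be a dotted path, got {dotted_path!r}")
    return getattr(importlib.import_module(module_name), attr)


def build_node_specs(config: Dict[str, Any]) -> Dict[str, List[NodeSpec]]:
    """Map each top-level key (the pipeline name) to its resolved nodes."""
    pipelines: Dict[str, List[NodeSpec]] = {}
    for name, spec in config.items():
        pipeline_tags = set(spec.get("tags", []))
        pipelines[name] = [
            NodeSpec(
                func=resolve_handler(n["handler"]),
                inputs=n.get("inputs"),
                outputs=n.get("outputs"),
                # Pipeline-level tags apply to every node in the pipeline.
                tags=sorted(pipeline_tags | set(n.get("tags", []))),
            )
            for n in spec["nodes"]
        ]
    return pipelines


# Demo with stand-in handlers from the standard library.
example = {
    "analysis": {
        "nodes": [
            {"handler": "math.sqrt", "inputs": "x", "outputs": "y",
             "tags": ["html_reports"]},
        ],
        "tags": ["cluster_analysis"],
    }
}
specs = build_node_specs(example)
print(list(specs), specs["analysis"][0].tags)
```

In a real project each `NodeSpec` could then be passed to `kedro.pipeline.node(func, inputs, outputs, tags=...)` and the results wrapped in `Pipeline([...])`, with the dictionary keys serving as the registered pipeline names.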
We can also use internal YAML templating to create an alias for 05_model_input/inliers.parquet, which is the most frequent reason I introduce any "dynamics" into my pipeline definitions.
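As an illustration of the aliasing idea, the sketch below substitutes `${name}` placeholders throughout an already-parsed pipeline definition. It uses only `string.Template` from the standard library. `resolve_aliases` and the `inliers` alias are made up for this example and do not reflect how Kedro's own config templating is implemented.

```python
from string import Template
from typing import Any, Mapping


def resolve_aliases(value: Any, aliases: Mapping[str, str]) -> Any:
    """Recursively replace ${alias} placeholders in strings, lists and dicts.

    Returns a new structure and leaves the input untouched. An unknown
    placeholder raises KeyError, which surfaces typos early.
    """
    if isinstance(value, str):
        return Template(value).substitute(aliases)
    if isinstance(value, dict):
        return {key: resolve_aliases(item, aliases) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_aliases(item, aliases) for item in value]
    return value


aliases = {"inliers": "05_model_input/inliers.parquet"}
node = {
    "handler": "my_package.pipelines.analysis.nodes.visualize",
    "inputs": {
        "dataset": "${inliers}",
        "important_features": "08_reporting/important-features.yml",
    },
}
print(resolve_aliases(node, aliases)["inputs"]["dataset"])
```

With this approach the dataset path is written once and every node refers to it by alias, which removes the main reason for building the definition dynamically in Python.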
I wanted to define Kedro pipelines in YAML too, so I implemented this option in the PipelineX package.
You can define a Kedro pipeline like this:
    # parameters.yml
    PIPELINES:
      __default__:
        =: pipelinex.FlexiblePipeline
        module: # Optionally specify the default Python module so you can omit the module name to which functions belongs
        decorator: # Optionally specify function decorator(s) to apply to each node
        nodes:
          - inputs: ["params:model", train_df, "params:cols_features", "params:col_target"]
            func: sklearn_demo.train_model
            outputs: model
          - inputs: [model, test_df, "params:cols_features"]
            func: sklearn_demo.run_inference
            outputs: pred_df
`func` corresponds to `handler` in your proposal.
In this example, lists are used for inputs, but dictionaries can be used for inputs/outputs as well.
For more options, please see:
https://github.com/Minyus/pipelinex#enhanced-kedro-context-yaml-interface-for-kedro-pipelines
Here is an example Kedro project using Iris dataset:
https://github.com/Minyus/pipelinex_sklearn
PipelineX supports Kedro 0.16.x.
I plan to make PipelineX work with Kedro 0.17.x, but I hope Kedro will natively support pipelines in YAML too.
Thank you a lot, @Minyus!
Looks like you've done a great job with PipelineX. I am now trying to extend Kedro with a set of "experiment-running" tools that use Kedro pipelines as the basis for experiment setup and then provide an interface for parameter tuning via "meta-parameters". I should probably take a closer look at your project, since you've probably done something similar (and more).
For better or worse I've already switched to 0.17.0, so, I'll give you a star and wait.
However, despite all your great work, I still believe that YAML pipeline definition should be part of the Kedro framework. Let's see what the core contributors' point of view on this is.
@Sitin Thank you for the detailed issue/suggestion.
> However, despite all your great work, I still believe that YAML pipeline definition should be a part of Kedro framework. Let's see what is the core contributors' point of view on this.
A team within QuantumBlack Labs has also built a plugin (internally) that allows users to define pipelines in YAML, quite similar to what you describe; @drqb or @willashford may be able to chime in here, as its developers. At this time, it's not used on the majority of Kedro projects, but having it as a plugin allows the Kedro framework itself to remain (comparatively) lightweight.
P.S. You may get sparse replies from the core team until January, as a lot of people are on holiday.
Hi @Sitin
Yes, I hope Kedro will natively support a YAML interface for pipelines too.
Meanwhile, I prepared Kedro starter templates (based on the pandas-iris starter) that work with Kedro 0.17.0 at:
https://github.com/Minyus/kedro-starters-sklearn
The YAML pipeline is at https://github.com/Minyus/kedro-starters-sklearn/blob/master/sklearn-mlflow-yamlholic-iris/%7B%7B%20cookiecutter.repo_name%20%7D%7D/conf/base/parameters.yml#L34-L50
To use the YAML interface for the pipeline and run config, run:

    kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-mlflow-yamlholic-iris

Hooks for MLflow tracking are included, but it should work as is even if MLflow is not installed.
@Sitin Thank you for your kind words, we're really happy Kedro changed your way of working for the better and made you happier as a result! Thank you for being part of the community as well by opening this issue.
Currently in Kedro we do not plan to add native support for yml pipeline definitions and we'd rather leave that to plugins to do it for people who are interested. We are aware that some of our users (even internally at QuantumBlack) prefer to define their pipelines in yml, but currently we favour using Python as pipeline definition language.
Just to give more context why, I will list some of our main considerations:
catalog.yml, parameters.yml or logging.yml can and should be defined by the DevOps/MLOps person deploying the application, thus they are config
Defining pipelines in Python has some drawbacks as well:
yaml, xml or json)
To summarise, Python is too powerful as a pipeline definition language, but on the flip side it has excellent tooling support. YAML, on the other hand, is more concise and closer to being declarative, but its tooling support is lacking and, if used to define pipelines, it can easily be mistaken for config.