I have been through the tutorial and the docs and have found something which I think could be included as a new feature. It's possible that there may be an alternative way to do this so please enlighten me if there is.
I wanted to define a reusable node that could be configured through function arguments and used in multiple stages of the pipeline. To my understanding, the parameters config doesn't fit what I am trying to do because there will be multiple instances of the same node in the pipeline but with different arguments. The example I have at hand is a node which vectorizes some text based on an input column. I want to be able to define a node like:
    def vectorize(data: pd.DataFrame, vectorizer: TfidfVectorizer, column: str):
        # Select the text column, then transform it with the fitted vectorizer.
        documents = data[column]
        return vectorizer.transform(documents)
so that in create_pipeline I can do:
    pipeline = Pipeline(
        [
            ...
            node(vectorize,
                 ["question_pairs", "vectorizer", "question1"],
                 "matrix1"),
            node(vectorize,
                 ["question_pairs", "vectorizer", "question2"],
                 "matrix1"),
            ...
        ]
    )
where the third item in the input list is the column I wish to vectorize.
Allowing this kind of parameterisation means that a single node function can be reused throughout the pipeline.
If I was to run this code, it would fail saying that "question1" or "question2" cannot be found in the DataCatalog. A possible implementation would be to allow any surplus parameters that are not defined in the DataCatalog to be passed in as function arguments. This would allow for the type of behaviour I am looking to implement with a reusable node.
A possible alternative is to wrap the vectorize function with functools.partial and specify the column parameter to create a new partial object, but this fails because __name__ is undefined for a functools partial object.
Hi @oasis789, you might be interested in this commit which has been merged to develop, and allows nodes made of partial functions.
Please, either clone from develop and pip install ., or wait until the next release :)
Also, thank you a lot for your suggestions, the core team will get back to you shortly on them!
With the current release, you could do the following:
* Use partial as you suggested, but also wrap it with update_wrapper.
* Add parameters to the catalog, using feed_dict (ugly, although this might be very handy in other cases):
    catalog = DataCatalog.from_config(conf_catalog, conf_creds)
    catalog.add_feed_dict({
        "question1": "question1",
        "question2": "question2"
    })
    pipeline = Pipeline(
        [
            ...
            node(vectorize,
                 ["question_pairs", "vectorizer", "question1"],
                 "matrix1"),
            node(vectorize,
                 ["question_pairs", "vectorizer", "question2"],
                 "matrix1"),
            ...
        ]
    )
    from kedro.contrib.io.catalog_with_default import DataCatalogWithDefault
    from kedro.io import MemoryDataSet, DataCatalog
    catalog = DataCatalog.from_config(catalog_conf, credentials_conf)
    catalog = DataCatalogWithDefault.from_data_catalog(
        catalog,
        default=lambda x: MemoryDataSet(x)
    )
    # same pipeline as above and same behavior
Keep in mind that this solution includes DataCatalogWithDefault which is _not_ in the core module, but rather in contrib.
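For the first option above (partial plus update_wrapper), here is a minimal sketch using only the standard library. The helper `named_partial`, the stand-in `vectorize` and the toy `count_tokens` callable are hypothetical names made up for illustration; they are not part of Kedro, and how a given Kedro release treats the resulting object may differ.

```python
from functools import partial, update_wrapper


def vectorize(data, vectorizer, column):
    """Stand-in for the node function from the original post."""
    documents = data[column]
    return vectorizer(documents)


def named_partial(func, name, **bound_kwargs):
    """Hypothetical helper: bind keyword arguments and give the result a __name__."""
    bound = partial(func, **bound_kwargs)
    # Copy __module__, __doc__, etc. from func; plain partial objects lack __name__.
    update_wrapper(bound, func)
    # Use a distinct name so the two nodes can be told apart.
    bound.__name__ = name
    return bound


vectorize_q1 = named_partial(vectorize, "vectorize_question1", column="question1")
vectorize_q2 = named_partial(vectorize, "vectorize_question2", column="question2")

# Toy data and a toy "vectorizer" (assumptions, standing in for a DataFrame
# and a fitted TfidfVectorizer).
data = {"question1": ["a b", "c"], "question2": ["d e f"]}
count_tokens = lambda docs: [len(doc.split()) for doc in docs]

result_q1 = vectorize_q1(data, count_tokens)
result_q2 = vectorize_q2(data, count_tokens)
```

Each partial now takes only the two remaining arguments, so the column name no longer has to appear in the node's input list.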
Hi @oasis789, thank you for contributing to Kedro by opening a feature request here! We're really happy you find Kedro useful for your projects!
You can reuse nodes in Kedro as it is at the moment, you only need to make sure they output to different datasets. When designing the pipeline functionality, we've decided that we will not allow different nodes to output the same dataset due to the undefined behaviour of a pipeline in that case - it is unclear whether the first output is the valid one or the second; or maybe a merge of the two, and if so - how that merge should happen when we don't know the format of the data beforehand? That's why no pipelines can contain nodes which output the same thing. What we encourage instead is users to be explicit about merging their results by adding a merging node.
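To make the rule concrete, here is a toy check written for this thread; it is not Kedro's implementation. Node definitions are modelled as plain `(function name, inputs, output)` tuples, and `merge_matrices`, `matrix2` and `merged_matrix` are hypothetical names for the explicit merging step.

```python
from collections import Counter


def duplicated_outputs(node_specs):
    """Return the output names that more than one node writes to."""
    counts = Counter(output for _, _, output in node_specs)
    return sorted(name for name, n in counts.items() if n > 1)


# The pipeline from the original post: both nodes write to "matrix1".
original = [
    ("vectorize", ["question_pairs", "vectorizer", "question1"], "matrix1"),
    ("vectorize", ["question_pairs", "vectorizer", "question2"], "matrix1"),
]

# Same function reused, distinct outputs, plus an explicit merging node.
explicit = [
    ("vectorize", ["question_pairs", "vectorizer", "question1"], "matrix1"),
    ("vectorize", ["question_pairs", "vectorizer", "question2"], "matrix2"),
    ("merge_matrices", ["matrix1", "matrix2"], "merged_matrix"),
]

clashes_original = duplicated_outputs(original)
clashes_explicit = duplicated_outputs(explicit)
```

The second layout keeps the reused function but makes it unambiguous which result each downstream node consumes.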
As for your feature request for adding parameters to your functions when you want to reuse them, it's something we've been thinking about for some time and we think that the best way to do it for now is by using @tsanikgr's suggestion for partial function applications.
An alternative future functionality we've discussed is an easy way to provide parameter values directly. In your example, "question1" and "question2" would be parameters in parameters.yml, each holding the name of a column. You would then just list the parameter names in your node inputs and their values would be injected into your nodes, so you could control the column names from parameters.yml in the configuration. However, that might make your pipeline less explicit and harder for a new user to understand. We'd be really happy to hear your opinion on the topic, since we'd like to collect as many viewpoints as possible before we commit to certain functionality.
Please feel free to share more ideas on this topic or any others. Glad to see you contributing with your ideas.
Hi @idanov and @tsanikgr - thanks for getting back to me and for the warm response.
I think the partial approach is the most intuitive from my perspective since it will facilitate the design of _reusable_ functions that can be configured with partial into distinct and unique nodes. This is always good because it will reduce code duplication and allow a single function to define many nodes.
I am experimenting with a few use cases using kedro to get a better feel of the framework so I can make a recommendation to adopt it for projects internally.
I have a few ideas for features that you may have already decided against:
* Implementing `__call__` for `Pipeline` and `Node` objects so that you can do
      adder_node(dict(a=2, b=3))
  instead of
      adder_node.run(dict(a=2, b=3))
  It's a small addition that adds a bit of syntactic sugar. I'd be more than happy to open a PR for this if you think it's worth including.
* With datasets in the DataCatalog, inferring the file extension from the dataset type.
* Altering the definition of datasets so that the name is inferred from the filepath. At the moment, the name is duplicated: once in the filepath and once as the dataset name used within kedro. For datasets added by the user this makes sense, but for persisting output from pipelines the dataset definition can be tedious, especially if there are a lot of things you would like to persist. A more concise definition could be:
      sgd_model:
        type: PickleLocalDataSet
        output_dir: data/06_models
  where the name, extension and filepath are all inferred.
* Automatically injecting nested dictionaries in `parameters.yml` to functions or nodes with matching names. For example, if I define the following in `parameters.yml`:
      train_vectorizer:
        max_features: 5000
  this dictionary of parameters should be injected into the corresponding function or node named train_vectorizer. Having this functionality would allow a more pythonic definition of functions, so I could do the following to pass the parameters on:
      def train_vectorizer(documents, **kwargs) -> TfidfVectorizer:
          vectorizer = TfidfVectorizer(**kwargs)
          return vectorizer.fit(documents)
  This would prevent the "parameters" input argument from littering the Pipeline definition.
These are just a few ideas that have come to mind in the short time I have been exploring kedro. I'd be more than happy to discuss and even contribute.
I would also like to see the ability to pass parameters directly to nodes. I agree that this would encourage code reuse. The resulting pipeline might look like this:
    pipeline = Pipeline(
        [
            node(
                my_node,
                {'kwarg_1': 'value_1', 'kwarg_2': 'value_2', ...},
                'data_catalog_entry_1',
                'data_catalog_entry_2',
            ),
        ]
    )
where the my_node function has a signature like:
    def my_node(df, kwarg_1=None, kwarg_2=None):
        ...something...
(I'd be happy to implement this myself and open a PR)
Hi @zacernst. Thank you for your suggestion. Would the usage of partial work for your case as suggested by @tsanikgr?
> I have a few ideas for features that you may have already decided against:
>
> * Implementing `__call__` for `Pipeline` and `Node` objects so that you can do `adder_node(dict(a=2, b=3))` instead of `adder_node.run(dict(a=2, b=3))`. It's a small addition that adds a bit of syntactic sugar. I'd be more than happy to open a PR for this if you think it's worth including.
@oasis789 the idea above seems very reasonable, we'd be happy to consider that in our tech discussions. We'll keep you posted on our decision about adding this syntactic sugar. Meanwhile, it would be nice if you can open a separate issue for each of your ideas, so we can keep track of them independently.
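The `__call__` idea quoted above amounts to a one-line delegation to `run`. A minimal sketch, using a hypothetical `SimpleNode` class written for this thread rather than Kedro's actual `Node`:

```python
class SimpleNode:
    """Hypothetical stand-in for a pipeline node; not Kedro's Node class."""

    def __init__(self, func, inputs, output):
        self.func = func
        self.inputs = inputs
        self.output = output

    def run(self, data):
        """Look up each input by name, call func, and return {output: result}."""
        args = [data[name] for name in self.inputs]
        return {self.output: self.func(*args)}

    def __call__(self, data):
        # Syntactic sugar: calling the node is the same as calling run().
        return self.run(data)


adder_node = SimpleNode(lambda a, b: a + b, ["a", "b"], "sum")

via_call = adder_node(dict(a=2, b=3))
via_run = adder_node.run(dict(a=2, b=3))
```

Both spellings return the same dictionary, so the change would be purely cosmetic for callers.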
> * With datasets in the DataCatalog, inferring the file extension from the dataset type
> * Altering the definition of datasets so that the name is inferred from the filepath. At the moment, there is a duplication of naming in the filepath and of the actual name of the dataset used within kedro. For datasets added by the user this makes sense, but for persisting output from pipelines, the dataset definition can be tedious, especially if there are a lot of things you would like to persist. A more concise definition could be:
>
>       sgd_model:
>         type: PickleLocalDataSet
>         output_dir: data/06_models
>
>   where the name, extension and filepath are all inferred.
For the moment we've decided to keep the names of datasets decoupled from filenames and extensions, to allow greater flexibility in referring to different files. If we coupled names to filenames, we'd struggle to provide the main benefit of the DataCatalog, namely being able to swap out the input / output files for your pipeline without any code changes. This problem could be partly solved by some other suggestions like https://github.com/quantumblacklabs/kedro/issues/42 . I suggest you keep track of the development of that one, and please let us know whether it solves the repetition problem you are referring to.
> * Automatically injecting nested dictionaries in `parameters.yml` to functions or nodes with matching names. For example, if I define the following in `parameters.yml`:
>
>       train_vectorizer:
>         max_features: 5000
>
>   this dictionary of parameters should be injected into the corresponding function or node named train_vectorizer. Having this functionality will allow a more pythonic definition of functions so I could do the following to pass up the parameters:
>
>       def train_vectorizer(documents, **kwargs) -> TfidfVectorizer:
>           vectorizer = TfidfVectorizer(**kwargs)
>           return vectorizer.fit(documents)
>
>   This would prevent the "parameters" input argument from littering the Pipeline definition.
This one sounds like an interesting suggestion and could improve the experience of pipeline design. However, it is at odds with another benefit of the current approach: the pipeline definition is explicit, so you can tell that a particular node depends on the parameters just by looking at the pipeline definition. We have something on our backlog which will partially address your concerns by letting you pass subsets of the parameters to your node: you refer to a particular parameter set by adding params:train_vectorizer to the inputs of your node. Please watch this space; we hope to add this quite soon if we don't foresee any major usability issues, since the technical implementation is trivial. Hopefully that would solve the problem for you, or at least the biggest part of it :)
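As a rough illustration of the `params:train_vectorizer` idea, here is a toy model of the lookup. It is only a sketch of the concept, not Kedro's code: `resolve_input` is a hypothetical helper, and the stand-in `train_vectorizer` just echoes the settings it would forward to `TfidfVectorizer(**params)`.

```python
PARAMS_PREFIX = "params:"


def resolve_input(name, datasets, parameters):
    """Toy lookup: 'params:<key>' reads from parameters, anything else from datasets."""
    if name.startswith(PARAMS_PREFIX):
        return parameters[name[len(PARAMS_PREFIX):]]
    return datasets[name]


def train_vectorizer(documents, params):
    """Stand-in node function: report what it received instead of fitting a model."""
    return {"n_documents": len(documents), **params}


# Contents of parameters.yml, already loaded into a dict (assumed layout).
parameters = {"train_vectorizer": {"max_features": 5000}}
datasets = {"documents": ["first question", "second question"]}

# The node's inputs name the parameter subset explicitly.
inputs = ["documents", "params:train_vectorizer"]
args = [resolve_input(name, datasets, parameters) for name in inputs]
result = train_vectorizer(*args)
```

The dependency on the parameters stays visible in the input list, which is the explicitness the reply above wants to preserve.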
Hi @oasis789, please see this commit with respect to @idanov 's reference to params:train_vectorizer.
With regards to the __call__ suggestion, I've updated the title with our internal ticket number to keep track of this more easily. :)
Awesome! I'd also be keen on contributing some code. Let me know if there is something from this that I can pick up.
That is excellent to hear! You're more than welcome to open a PR addressing your above suggestion:
> Implementing `__call__` for `Pipeline` and `Node` objects so that you can do `adder_node(dict(a=2, b=3))` instead of `adder_node.run(dict(a=2, b=3))`
If you do, it would be super handy if you referenced our ticket number in the PR title as well. 😄
Hi @oasis789 just to give you a heads up that we've started the work for the above ahead of our next release, we will reference the commit here once it's merged. Please do feel free to browse our issues page and pick up anything labelled with "good first issue" or "help wanted" that hasn't got a PR in the works.
@uwaisiqbal here is the relevant commit as per my last comment: https://github.com/quantumblacklabs/kedro/commit/38abd00587a63842af6981e0244ae3d371690e36
I will close this issue as answered, but you're welcome to raise new ones you believe are worth discussing. Many thanks for your suggestions, we look forward to more contributions from you! :)
@uwaisiqbal
Have you found a way of passing node parameters? I am trying to get what @tsanikgr posted here to work, but I don't really get it. I am just finding my way into Python...
I'd need to put that somehow in my run.py to update the catalog with the parameters before a pipeline is run, but I am not sure how to approach that. I would be glad if you could point me in the right direction :)
edit: I found the answer here:
I can just prefix the parameters like so:
    node(
        split_data,
        ["master_table", "params:test_size", "params:random_state"],
        ["X_train", "X_test", "y_train", "y_test"],
    )
and add the following in parameters.yml:
    test_size: 5
    random_state: 1
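With that node definition, `split_data` would receive the table plus the two parameter values as positional arguments and return four outputs in the listed order. A minimal stdlib-only sketch, assuming (the thread does not say) that `master_table` is a list of `(features, target)` pairs and that an integer `test_size` means a number of rows:

```python
import random


def split_data(master_table, test_size, random_state):
    """Shuffle deterministically, then hold out the first test_size rows."""
    rows = list(master_table)
    random.Random(random_state).shuffle(rows)
    test_rows, train_rows = rows[:test_size], rows[test_size:]
    X_train = [features for features, _ in train_rows]
    X_test = [features for features, _ in test_rows]
    y_train = [target for _, target in train_rows]
    y_test = [target for _, target in test_rows]
    # Order matches the node's output list: X_train, X_test, y_train, y_test.
    return X_train, X_test, y_train, y_test


# Toy table (an assumption): 20 rows of ([feature, feature], target).
master_table = [([i, i * 2], i % 2) for i in range(20)]

X_train, X_test, y_train, y_test = split_data(
    master_table, test_size=5, random_state=1
)
```

The function itself knows nothing about parameters.yml; the values arrive as ordinary arguments.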