I defined a function called "split_train_test", which splits a dataset into training and test datasets. "split_train_test" has two parameters: one is an input dataframe (defined in catalog.yml), the other is a specific date used to split the dataset.
I got an error: "ValueError: Pipeline input(s) {'201801'} not found in the DataCatalog". It seems that in a node definition, only the names of datasets can be passed as inputs to the function.
pipeline.py
node(
    func=split_train_test,
    inputs=dict(df="preprocessed_transactions", test_date="201801"),
    outputs=["preprocessed_training", "preprocessed_test"]
)
nodes.py
```
def split_train_test(df: pd.DataFrame, test_date: int) -> pd.DataFrame:
    log.info("Start to split dataset into training and test datasets")
    df = train_test_split.split_data(df, test_date=int(test_date))
    return df
```
Hi @adslwang4601, node inputs and outputs do indeed need to be datasets rather than plain Python values. So in your case, you can add this to your catalog.yml:
test_date:
  type: MemoryDataSet
  data: 201801 # or whatever test date you have in mind
And define your inputs as
inputs=dict(df="preprocessed_transactions", test_date="test_date")
Actually, on second thought, your case looks like a perfect fit for parameters: https://kedro.readthedocs.io/en/latest/04_user_guide/03_configuration.html#using-parameters. You can specify the test date in `parameters.yml` and refer to it in the node as `params:test_date`.
Hi @adslwang4601 I'm going to close this issue. If you still need help, please feel free to reopen it.