Kedro: Load data from intermediate after processing?

Created on 18 Sep 2020 · 3 Comments · Source: quantumblacklabs/kedro

Hi, I am new to Kedro and have been looking through the documentation, but I can't find a way to automatically load an intermediate (already processed) dataset instead of reprocessing it each time I run a pipeline. In other words, I would like to preprocess a file and save the result to an intermediate location:

kibot_minute_ibm:
  type: pandas.CSVDataSet
  filepath: data/01_raw/kibot/minute/ibm.csv

X_trn:
  type: pickle.PickleDataSet
  filepath: data/02_intermediate/X_trn.pkl

X_tst:
  type: pickle.PickleDataSet
  filepath: data/02_intermediate/X_tst.pkl

The above does that, but the next time I run "kedro run" it runs the whole pipeline again, even though the original source data file hasn't changed. Is there a way to enable caching when neither the node nor the data has changed?

jmrichardson

Most helpful comment

Hi @jmrichardson,

Unfortunately, this feature is not yet supported by Kedro's high-level API (Kedro context or CLI), although several Kedro users have requested it:

https://github.com/quantumblacklabs/kedro/issues/30 @gotin
https://github.com/quantumblacklabs/kedro/issues/55 @Minyus
https://github.com/quantumblacklabs/kedro/pull/60 @Minyus
https://github.com/quantumblacklabs/kedro/issues/82 @anuarora1990

I have seen three approaches used by Kedro users:

1. Since this feature is implemented in Kedro's low-level API (the runner) as run_only_missing, I (@Minyus) implemented it in the custom Kedro context in my PipelineX package:
   https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/framework/context/flexible_run_context.py#L131
2. @miyamonz posted a great suggestion that lets users add the feature easily:
   https://github.com/quantumblacklabs/kedro/issues/509 @miyamonz
3. @deepyaman implemented TeePlugin:
   https://github.com/quantumblacklabs/kedro/issues/420 @deepyaman
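
To make the run_only_missing idea concrete, here is a minimal sketch of the selection logic in plain Python. It is a simplified illustration and not Kedro's actual implementation: the Node tuple, the nodes_to_run helper, the node names split_data and train_model, and the model dataset are all hypothetical, and the nodes are assumed to be listed in execution order.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline node (not Kedro's node class)."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def nodes_to_run(nodes: Iterable[Node], existing: Set[str]) -> List[str]:
    """Return the names of nodes that must run, given datasets already on disk.

    `nodes` is assumed to be in topological (execution) order.
    """
    to_run: List[str] = []
    regenerated: Set[str] = set()  # datasets that a rerun node will overwrite
    for node in nodes:
        missing_output = any(out not in existing for out in node.outputs)
        stale_input = any(inp in regenerated for inp in node.inputs)
        if missing_output or stale_input:
            to_run.append(node.name)
            regenerated.update(node.outputs)
    return to_run


# Hypothetical two-node pipeline built around the catalog entries in the question.
pipeline = [
    Node("split_data", ("kibot_minute_ibm",), ("X_trn", "X_tst")),
    Node("train_model", ("X_trn",), ("model",)),
]

# Both intermediate pickles exist, so only the downstream node runs.
print(nodes_to_run(pipeline, {"kibot_minute_ibm", "X_trn", "X_tst"}))
# X_tst is missing, so split_data reruns and train_model follows it.
print(nodes_to_run(pipeline, {"kibot_minute_ibm", "X_trn"}))
```

Note that this kind of check only looks at whether outputs exist. It does not notice that the raw file or the node's code has changed, so stale intermediate files would have to be deleted by hand to force a rebuild.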

I hope Kedro will support this feature, as other tools such as Spotify's Luigi do.

Minyus on 19 Sep 2020

All 3 comments

I believe this isn't supported by Kedro at the moment, at least not out of the box. There is CachedDataSet, but that caches data _within_ a given Kedro run, not between runs. A common pattern, which you might like to adopt, is to have two (or more!) pipelines:

- raw_to_intermediary, a pipeline whose first node(s) take raw datasets and whose last node(s) output intermediary ones, saving them to disk.
- process_intermediary, a pipeline whose first node(s) take the intermediary datasets and process them.

Then kedro run --pipeline=raw_to_intermediary processes the raw data and saves the intermediary data to disk. You can then experiment within the process_intermediary pipeline and do kedro run --pipeline=process_intermediary to only run the second pipeline, using the saved data from disk as input, without running the raw_to_intermediary pipeline.

You can then run kedro run --pipeline=raw_to_intermediary whenever the source file changes or you change the code within that pipeline. It's a bit of a manual solution, but it works nonetheless.
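
The "whenever the source file changes" decision can be partly automated outside Kedro. The sketch below is an assumption-laden illustration, not a Kedro feature: file_digest, source_changed, pipeline_to_run, and the stamp file are hypothetical helpers. It hashes the raw file, compares the hash with the one recorded on the previous call, and returns the name of the pipeline to pass to kedro run --pipeline=...

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_changed(source: Path, stamp: Path) -> bool:
    """Compare the source digest with the one stored in `stamp`, then update `stamp`."""
    current = file_digest(source)
    previous = stamp.read_text().strip() if stamp.exists() else None
    stamp.write_text(current)
    return current != previous


def pipeline_to_run(source: Path, stamp: Path) -> str:
    """Pick the pipeline name based on whether the raw file changed."""
    if source_changed(source, stamp):
        return "raw_to_intermediary"
    return "process_intermediary"


# Hypothetical usage (the stamp file location is an assumption):
# name = pipeline_to_run(Path("data/01_raw/kibot/minute/ibm.csv"),
#                        Path("data/02_intermediate/ibm.sha256"))
```

This only detects changes to the data, not to the node code, and it records the new hash before the pipeline has actually succeeded, so a failed run would need the stamp file deleted. After raw_to_intermediary has been rerun, process_intermediary still needs to be run to pick up the refreshed files.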