Kedro: Tracking dependencies by artifact's creation time like GNU make

Created on 27 Sep 2020 · 13 Comments · Source: quantumblacklabs/kedro

I'm sure we really need this.
It would make it possible to resume a calculation based on the availability of artifacts and their creation time.
Without this feature, kedro calculates the entire pipeline from scratch, which is not suitable in many cases.

Feature Request

Source

sergun


All 13 comments

Hi @sergun, thank you for opening the feature request. Could you please elaborate a bit more on what kind of feature you are looking for? There's a template for feature requests, and it would be great if you could fill it in.
What calculation would you like to know about? The runtime of pipeline execution, IO operations, etc.?

921kiyo on 28 Sep 2020

Description

Currently kedro can only calculate pipelines from scratch or with some manually specified settings (you can specify the nodes from which it should calculate everything, or you can ask kedro to calculate only missing datasets with Runner.run_only_missing).

Context

In many use cases the GNU make model works better. As you know, make automatically tracks dependencies between artifacts (as kedro does), but it also tracks each artifact's mtime and works out which artifacts should be re-created by comparing the mtime of an output artifact with the mtimes of its dependencies.
Such a model is suitable, for example, for feature engineering, where you have a lot of SQL scripts and each of them creates some temporary table. The tables are joined and finally you have a resulting table with feature values. It is nice that a data scientist can modify one script and make will automatically understand that the table created by this script should be re-created, and that all tables that depend on it should be re-calculated as well, while the other tables stay untouched. This use case is completely unsupported by kedro.
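To make the rule above concrete, here is a minimal sketch in plain Python of make-style staleness propagation. It does not use kedro or make; the table names, script names, timestamps and the `stale_targets` helper are all hypothetical and only illustrate the idea: a target is rebuilt if it is missing, if any input is newer than it, or if any input is itself being rebuilt.

```python
from typing import Dict, List

# Hypothetical feature-engineering pipeline: each target lists its inputs.
# Targets are listed in topological order (inputs before dependents).
DEPS: Dict[str, List[str]] = {
    "tmp_a": ["script_a.sql"],
    "tmp_b": ["script_b.sql"],
    "features": ["tmp_a", "tmp_b"],
}


def stale_targets(deps: Dict[str, List[str]], mtimes: Dict[str, float]) -> List[str]:
    """Return the targets that need rebuilding, in execution order.

    Assumes every source file (an input that is not itself a target)
    has an entry in `mtimes`; a target without an entry counts as missing.
    """
    stale: List[str] = []
    for target, inputs in deps.items():
        target_mtime = mtimes.get(target)
        needs_rebuild = (
            target_mtime is None                        # output does not exist yet
            or any(i in stale for i in inputs)          # an upstream target is rebuilt
            or any(mtimes[i] > target_mtime for i in inputs)  # an input is newer
        )
        if needs_rebuild:
            stale.append(target)
    return stale


# script_a.sql was edited after tmp_a was built; script_b.sql was not touched.
example_mtimes = {
    "script_a.sql": 50.0,
    "script_b.sql": 10.0,
    "tmp_a": 20.0,
    "tmp_b": 20.0,
    "features": 30.0,
}
print(stale_targets(DEPS, example_mtimes))
```

In this example only `tmp_a` and the downstream `features` table are selected, and `tmp_b` is left untouched.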

Possible Implementation

No concrete ideas. But I think we could integrate mtime into the Dataset class and add the ability to select the nodes to be calculated based on the mtime of each node's input and output Datasets.

Possible Alternatives

I do not see any.

sergun on 28 Sep 2020

👍3

As for me, it would be really great to add such functionality!

DChulok on 29 Sep 2020

I am not 100% sure that I follow what you are looking for, but my personal feeling so far is that this is a feature that can be achieved through the use of hooks.

All of the kedro DataSets are currently backed by fsspec. A quick scan of its API revealed that there is a .modified() method that returns a timestamp for the modified path. I am not exactly sure how you would get to this inside of a hook, though.

You can access the kedro dataset instance dynamically by using getattr(catalog.datasets, 'dataset_name'). Inside the dataset instance, you will find all sorts of information about your dataset. I was able to get to the modified time of my datasets, but it did not appear that it was leveraging fsspec for the file system agnostic methods. Instead it seemed like it was specific to the filesystem type I was using.

Once you have the time, what do you need to do with it? I am not familiar with GNU make and how it uses mtime to avoid recreating artifacts that do not need to be remade.

This would be another great application for #400. If I could set a default update frequency (daily) and override that frequency in my catalog to tell kedro that a particular dataset is only refreshed weekly or monthly (or by a cron expression?), then it could figure out whether it's time to update or not.
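A rough sketch of that frequency check, in plain Python rather than any existing kedro API. The default frequency, the per-dataset override table, the dataset names and the `is_due` helper are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Hypothetical settings: a default refresh frequency and per-dataset overrides.
DEFAULT_FREQUENCY = timedelta(days=1)
FREQUENCY_OVERRIDES: Dict[str, timedelta] = {
    "monthly_report": timedelta(days=30),
}


def is_due(name: str, last_updated: Optional[datetime], now: datetime) -> bool:
    """Return True if the dataset has never been built or its refresh interval has elapsed."""
    if last_updated is None:
        return True
    frequency = FREQUENCY_OVERRIDES.get(name, DEFAULT_FREQUENCY)
    return now - last_updated >= frequency


now = datetime(2020, 10, 1)
print(is_due("daily_table", datetime(2020, 9, 29), now))     # older than one day
print(is_due("monthly_report", datetime(2020, 9, 29), now))  # refreshed two days ago
```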

WaylonWalker on 30 Sep 2020

Thanks @WaylonWalker !

But I do not see how to skip the execution of a node's processing function from a hook.
The idea is to not call this function if all of the node's output datasets exist and the mtimes of its input datasets are earlier than the mtimes of its output datasets.
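As a minimal sketch of that rule, assuming each dataset is a plain local file (which is not true of every kedro dataset), the check could look like this. `can_skip_node` is a hypothetical helper, not part of kedro.

```python
import os
from typing import Iterable


def can_skip_node(input_paths: Iterable[str], output_paths: Iterable[str]) -> bool:
    """Return True if every output exists and every input is older than every output."""
    outputs = list(output_paths)
    # A node with no outputs, or with any missing output, must run.
    if not outputs or not all(os.path.exists(p) for p in outputs):
        return False
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    # Strictly earlier: an input with the same timestamp as an output
    # triggers a rerun, which is the conservative choice.
    return all(os.path.getmtime(p) < oldest_output for p in input_paths)
```

A runner or hook would still need some way to act on the result, which is exactly the open question here.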

sergun on 1 Oct 2020

I've quickly hacked together (with emphasis on _hack_) a prototype of what a hook would look like that enables something like this: https://gist.github.com/mzjp2/076bfd73b0215bda01ee71186966389d

mzjp2 on 1 Oct 2020

👍2

That is a really cool hook @mzjp2! If we could tag nodes with a run frequency and combine with this it would make things easy to blindly run everything and only update out of date nodes.

WaylonWalker on 1 Oct 2020

> Thanks @WaylonWalker !
>
> But I do not see how to skip the execution of a node's processing function from a hook.
> The idea is to not call this function if all of the node's output datasets exist and the mtimes of its input datasets are earlier than the mtimes of its output datasets.

I see now. What if you grabbed the AST or bytecode of each node's function and cached it? Then you could check whether the function itself has changed since the last run, or whether the input data has changed since the last run.
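A small sketch of the AST variant, using only the standard library. It takes the function's source as a string; the `code_fingerprint` and `node_changed` helpers and the in-memory cache dict are hypothetical, and a real version would need to persist the cache between runs.

```python
import ast
import hashlib
from typing import Dict


def code_fingerprint(source: str) -> str:
    """Hash the parsed AST of the source, ignoring formatting and comments."""
    tree = ast.parse(source)
    # ast.dump omits line and column numbers by default, so whitespace-only
    # edits and comments do not change the fingerprint.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def node_changed(name: str, source: str, cache: Dict[str, str]) -> bool:
    """Compare against the cached fingerprint, then update the cache."""
    fingerprint = code_fingerprint(source)
    changed = cache.get(name) != fingerprint
    cache[name] = fingerprint
    return changed


original = "def f(x):\n    return x + 1\n"
reformatted = "def f(x):\n    # add one\n    return x+1\n\n"
edited = "def f(x):\n    return x + 2\n"

cache: Dict[str, str] = {}
print(node_changed("f", original, cache))     # first run: nothing cached yet
print(node_changed("f", reformatted, cache))  # only formatting differs
print(node_changed("f", edited, cache))       # logic differs
```

Because the hash is taken over the AST rather than the raw text, a reformatted function is not treated as changed, while a docstring edit still would be.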

WaylonWalker on 1 Oct 2020

I found a parallel conversation is happening over on kedro.community.

https://discourse.kedro.community/t/speeding-up-pipeline-processing-with-change-detection/90

WaylonWalker on 1 Oct 2020

👍3

@WaylonWalker thanks for the link!
It is interesting that in our company we created a very similar "in-house" solution based on old-school Makefiles :-)
It does change detection based on the mtime of data files and processing scripts.
It works well, but we decided to try kedro and were surprised that nothing similar is implemented here.

sergun on 3 Oct 2020

@mzjp2 thanks a lot! Great job!
It seems that in the future this should be a core concept rather than something hook-based. What do you think?

sergun on 3 Oct 2020

I think I'd like to get my hands dirty with this one. I'll look into this in the context of Hacktoberfest. I'll make a draft PR where I reference this issue and the discussion in kedro.community. Should be a fun one.
@dataengineerone @sergun FYI

pascalwhoop on 6 Oct 2020

@pascalwhoop I think it would be one of the most significant contributions to kedro :1st_place_medal:
I think it makes sense to consider both approaches: hash-based and time-based tracking of changes.

From my experience with Makefiles for ML, I can say it is really cool when you do not need to think about which nodes should be executed after some change (of params, data, or maybe code). BTW, in the Makefile-based solution parameters were also files, and they were included in make recipes as dependencies.
The only problematic case with time-based tracking is when you add whitespace or a line ending to a parameter file or a source script, and make wants to recalculate everything that depends on it :-)
I also like make because you identify your task by its artifact (filename), not by the id of a node. I find this more intuitive.

sergun on 6 Oct 2020
