Pipelines: Data Versioning with Kubeflow

Created on 10 Feb 2021 · 3 Comments · Source: kubeflow/pipelines

Hello,

I am setting up an in-house ML infrastructure for my company, and we decided to go with Kubeflow. We need to ensure that our pipelines support data versioning as well. I understand from the official Kubeflow documentation that this is possible with Rok Data Management. However, we would like to explore other options too, so my question comes in two parts:

Is there any alternative to using Rok for data versioning with Kubeflow pipelines?
Is it possible to use DVC for data versioning with Kubeflow?

Thanks :)

kind/question


VindhyaSRajan

Most helpful comment

Can KFP be set to take Kubernetes snapshots after each step and then pass the name of that snapshot as the data_source for the next step? If not, I think that would be a solid place to start for building data versioning into KFP. I have been working on code for Kale that, once it works, should generate pipelines that create a snapshot after each step: https://github.com/kubeflow-kale/kale/pulls.

DavidSpek on 27 Feb 2021

👍2
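
To make the snapshot-per-step idea in the comment above concrete, here is a minimal local sketch in plain Python. It is only an analogy: it copies directories on the local filesystem instead of taking Kubernetes snapshots, it does not use the KFP or Kale APIs, and `run_with_snapshots`, `ingest`, `total`, and the `work-N` / `snap-N` naming are all hypothetical names made up for illustration.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def run_with_snapshots(steps: List[Callable[[Path], None]], root: Path) -> List[str]:
    """Run steps in order and snapshot the working data after each one.

    Every step after the first starts from a copy of the previous snapshot,
    mimicking "pass the snapshot name as the data source of the next step".
    """
    snapshots: List[str] = []
    data_source = None  # name of the snapshot the next step starts from
    for i, step in enumerate(steps, start=1):
        workdir = root / f"work-{i}"
        if data_source is None:
            workdir.mkdir(parents=True)  # first step starts from empty data
        else:
            shutil.copytree(root / data_source, workdir)  # restore snapshot
        step(workdir)
        name = f"snap-{i}"
        shutil.copytree(workdir, root / name)  # stand-in for a real snapshot
        snapshots.append(name)
        data_source = name
    return snapshots


# Two toy steps (hypothetical): write raw data, then derive a total from it.
def ingest(data_dir: Path) -> None:
    (data_dir / "raw.txt").write_text("1\n2\n3\n")


def total(data_dir: Path) -> None:
    values = [int(v) for v in (data_dir / "raw.txt").read_text().split()]
    (data_dir / "total.txt").write_text(str(sum(values)))
```

Calling `run_with_snapshots([ingest, total], some_empty_dir)` leaves one snapshot directory per step, so the data as it looked after any step can be inspected or used to rerun later steps.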

All 3 comments

Hi @VindhyaSRajan!

Have you looked into https://github.com/pachyderm/pachyderm? I think they also have some integration with KFP.
KFP is flexible enough to work with any external system through components.

Any feedback on gaps?
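
As a rough illustration of the point about components, the sketch below shows the kind of plain Python function that could serve as the body of a versioning step in a pipeline. It is not the API of DVC, Pachyderm, or KFP: `dataset_version` and the manifest layout are hypothetical, and identifying a dataset version by its content hash is just one simple assumption about what such a step might do.

```python
def dataset_version(data_path: str, manifest_path: str) -> str:
    """Hash a dataset file and record the digest in a small JSON manifest.

    Hypothetical helper for illustration; not part of DVC, Pachyderm, or KFP.
    """
    # Imports are kept inside the function so it stays self-contained if it is
    # later wrapped as a pipeline component (an assumption about deployment).
    import hashlib
    import json
    from pathlib import Path

    digest = hashlib.sha256()
    with open(data_path, "rb") as f:
        # Read in 1 MiB chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    version = digest.hexdigest()

    # The manifest is small enough to commit alongside the pipeline code.
    manifest = {"file": Path(data_path).name, "sha256": version}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return version
```

A downstream step could take the returned hash as an input and refuse to run if the data it sees does not match, which gives a basic form of data versioning without tying the pipeline to any particular tool.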