Hello,
I am setting up in-house ML infrastructure for my company, and we have decided to go with Kubeflow. We need to make sure the pipelines support data versioning as well. I understand from the official Kubeflow documentation that this is possible with Rok Data Management. However, we are interested in exploring other options too. Thus, my question comes in two parts:
Thanks :)
Hi @VindhyaSRajan!
Did you look into https://github.com/pachyderm/pachyderm? I think they also have some integration with KFP.
KFP is flexible enough to work with any external system through components.
Any feedback on gaps?
Can KFP be set to take Kubernetes snapshots after each step, and then pass the name of that snapshot as the `data_source` for the next step? If not, I think that is a solid place to start for having data versioning built into KFP. I have been working on code for Kale that, once working, should create pipelines that take a snapshot after each step: https://github.com/kubeflow-kale/kale/pulls.
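To make the snapshot-per-step idea concrete, here is a rough sketch in plain Python that only builds the Kubernetes manifests involved: a `VolumeSnapshot` taken after each step, and a `PersistentVolumeClaim` for the next step whose `dataSource` points at that snapshot. The helper names (`snapshot_manifest`, `pvc_from_snapshot`, `chain_steps`), the naming scheme, the access mode, and the `10Gi` size are all made up for illustration, and it assumes a CSI driver that supports the `snapshot.storage.k8s.io/v1` API. It is not how Kale or KFP implement this.

```python
import json

# API group used by Kubernetes CSI volume snapshots.
SNAPSHOT_API_GROUP = "snapshot.storage.k8s.io"


def snapshot_manifest(step_name, pvc_name, snapshot_class=None):
    """Build a VolumeSnapshot manifest that captures pvc_name after step_name."""
    spec = {"source": {"persistentVolumeClaimName": pvc_name}}
    if snapshot_class:
        spec["volumeSnapshotClassName"] = snapshot_class
    return {
        "apiVersion": f"{SNAPSHOT_API_GROUP}/v1",
        "kind": "VolumeSnapshot",
        # Hypothetical naming scheme: "<step>-snapshot".
        "metadata": {"name": f"{step_name}-snapshot"},
        "spec": spec,
    }


def pvc_from_snapshot(step_name, snapshot_name, size="10Gi"):
    """Build a PVC for step_name that is restored from snapshot_name."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        # Hypothetical naming scheme: "<step>-data".
        "metadata": {"name": f"{step_name}-data"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],  # assumed access mode
            "resources": {"requests": {"storage": size}},
            "dataSource": {
                "apiGroup": SNAPSHOT_API_GROUP,
                "kind": "VolumeSnapshot",
                "name": snapshot_name,
            },
        },
    }


def chain_steps(step_names, initial_pvc):
    """Snapshot the volume after every step; feed each snapshot to the next step."""
    manifests = []
    current_pvc = initial_pvc
    for index, step in enumerate(step_names):
        snapshot = snapshot_manifest(step, current_pvc)
        manifests.append(snapshot)
        if index + 1 < len(step_names):
            next_pvc = pvc_from_snapshot(
                step_names[index + 1], snapshot["metadata"]["name"]
            )
            manifests.append(next_pvc)
            current_pvc = next_pvc["metadata"]["name"]
    return manifests


if __name__ == "__main__":
    for manifest in chain_steps(["preprocess", "train"], "raw-data"):
        print(json.dumps(manifest, indent=2))
```

Each snapshot name then acts as a version identifier for the data as it stood after that step. If I remember correctly, the KFP v1 SDK already exposes `dsl.VolumeSnapshotOp` and a `data_source` argument on `dsl.VolumeOp` that could submit the equivalent resources from inside a pipeline, but please check that against the SDK version you are running.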
That's an interesting idea. How do you envision that becoming a first-party feature? Does it need to be one?