I see a lot of value in using DVC during the development phase of a DS project, especially having the ability to reproduce outputs only if dependencies have changed.
One of the problems we are trying to solve is how we can move a data scientist's code back and forth between development and production. Ideally we would want their local development experience to translate easily into production. I've created a toy project with DVC to see if we could use it for developing a multi-step pipeline which does data transfers between each step.
However, there is one thing that is unclear when scheduling this same pipeline in Kubeflow/Airflow. Let's assume that my pipeline is as follows: `Get Data` → `Transform Data` → `Train Model` → `Evaluate Model`.
If I do all of my local development (`dvc run`, `dvc repro`) then everything works. But in a production setting I will have unique inputs to my pipeline. For example, the datetime stamp or other input variables will change. I can integrate this with DVC by having a file called `parameters` as a dependency of the `Get Data` step.
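For concreteness, the toy pipeline looks roughly like this (the script and file names are made up for illustration, and the exact `dvc run` flags may vary by DVC version):

```bash
# Rough sketch of the toy pipeline (script and file names are illustrative only)
dvc run -f get_data.dvc  -d parameters     -d get_data.py  -o data/raw.csv   python get_data.py
dvc run -f transform.dvc -d data/raw.csv   -d transform.py -o data/clean.csv python transform.py
dvc run -f train.dvc     -d data/clean.csv -d train.py     -o model.pkl      python train.py
dvc run -f evaluate.dvc  -d model.pkl      -d evaluate.py  -M metrics.json   python evaluate.py

# Changing the contents of `parameters` (e.g. a new datetime stamp) invalidates
# get_data.dvc, so reproducing the last stage re-runs the whole chain.
dvc repro evaluate.dvc
```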
So when I run the pipeline on Airflow on different days, then the dependencies for step 1 will be different, which means it will get recomputed.
The problem that I have is that all of the steps in the graph have their hashes hardcoded based on the local development environment. So even if I rerun this whole pipeline multiple times with the same input parameters, none of the `*.dvc` files in the pipeline will be updated, meaning everything will rerun from scratch. That's because they are running in an isolated production environment and not committing code back to the project repo. So `dvc` loses its value when wrapped in a scheduler.
Am I missing something, or is DVC primarily useful in local development only?
@woop, great question!
Could you please clarify a few things to make sure I understand your scenario correctly:

1. Do all of the steps (`Get Data`, `Transform Data`, `Train Model`, `Evaluate Model`) work in a single Airflow job/k8s container?
2. Does the `Get Data` step work outside dvc and update data files in a dvc repo?
3. Then you run `dvc repro` which does the rest: `Transform Data`, `Train Model`, `Evaluate Model`. Right?
4. Do you use `dvc add`, commit/`git commit` and push/`dvc push` in production? Is it correct?

Thanks for the fast reply!
Locally I run `dvc repro evaluate.dvc`, which would run everything. In production I could do the same (if I was using a single task/container), or what I wanted was to `dvc repro`/`dvc run` each step while running multiple containers. I'm not clear how that would help though, because it would seem like each step would recompute all dependencies up to `get data`.

I'm also not sure how we would handle `git commit` if I want reproducibility. Especially because our orchestrator is set up in such a way that it git clones the project repo into a docker container based on a git SHA. So if we did a git commit then it would not update subsequent steps.
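Roughly, each container in our orchestrator does something like this on startup (a simplified sketch; the repo URL, the `GIT_SHA` variable and the stage name are just placeholders):

```bash
#!/bin/sh
# Illustrative entrypoint for one pipeline step (names and variables are hypothetical).
git clone https://github.com/<org>/<repo>.git project
cd project
git checkout "$GIT_SHA"    # pinned commit chosen when the pipeline run was scheduled
dvc repro transform.dvc    # the DVC-file hashes come from local development, so this
                           # step (and everything upstream of it) gets recomputed
```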
Thank you for sharing details! It is clear with 1-3.

Re (4)... When do you clone the git repo based on a SHA into a docker container? When you create the docker image that needs to be run, or while the container is running (it gets the SHA from somewhere and clones the repo based on it)?
For reproducibility between container runs in prod you need to share state between runs. A Git repo is one of the ways to share that state. To get the state in prod you need to clone the most recent version of the repo when you run the container (not a repo pinned to some SHA). To update the state you need to run something like `git commit && git push && dvc push` from prod.

It might look unnatural to commit/push from prod, and it can create a mess in your git history. To avoid these issues you might use a separate Git branch for prod (like in traditional development - `master` and `prod` branches).
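Roughly something like this (just a sketch; the remote, repo URL and stage file are placeholders):

```bash
# Sketch: share pipeline state between prod runs through a dedicated "prod" branch.
git clone --branch prod https://github.com/<org>/<repo>.git project   # latest prod state, not a pinned SHA
cd project
dvc pull                       # fetch cached data/models referenced by the committed DVC-files
dvc repro evaluate.dvc         # only stages whose dependencies changed are re-executed
git commit -am "Update DVC-files after prod run"
git push origin prod           # persist the new state for the next scheduled run
dvc push                       # upload the new data/model artifacts to the DVC remote
```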
We basically have two approaches.
Let's assume we don't pin the cloned repo to a specific git commit. Let's just take the latest one with `git clone`. Wouldn't we create a race condition if we have multiple of these pipelines running at the same time, because they are presumably sharing the "latest" commit? Or is the suggestion that each pipeline type should have its own branch? We'd be creating a git commit for every step in the pipeline for every pipeline run.
I guess I need to think about this a bit more. I like the idea of being able to pick up from a very specific commit and completely reproduce the state of that system, but it seems this could be quite difficult to manage and standardize.
@woop can you elaborate on running multiple pipelines in parallel? If you have a single production system that is retraining some stuff on cron (let's say daily), it makes sense to me to have a separate "production" branch. Every commit on it should be the latest state of master with updates to the DVC-files forced to the top of it.
So the idea is to have endlessly growing branches?
Every time the same step runs, the query will change, which will update the input data, which will rerun all the steps. Then it will commit all of this to the repository for each step in each pipeline every time it runs.
We would need to squash those git repos eventually; they will become massive, I think.
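For example, we would probably have to do something like this periodically (just a sketch of one way to flatten a branch; it doesn't shrink the DVC cache itself):

```bash
# Sketch: collapse the prod branch history into a single commit once it gets too large.
git checkout --orphan prod-squashed prod   # same working tree as prod, but no history
git commit -m "Squash prod history"
git branch -M prod-squashed prod           # replace the old prod branch locally
git push --force origin prod               # rewrite the remote branch
```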
@woop I'm not sure I'm following you on "Then it will commit all of this to the repository for each step in each pipeline every time it runs." Why do you need to commit every step separately? Usually, you would run the whole pipeline and then do `git commit`. Am I missing something?
If step 1 doesn't commit, then step 2 will have to rerun step 1 when it runs, and step 3 will have to rerun step 2 and step 1 when it runs. That is because they are still looking at old data hashes. The only way to get around this (it seems) is to commit the latest hashes at every step.
> If step 1 doesn't commit, then step 2 will have to rerun step 1 when it runs

This is true only if you are going to run step 2 in a different environment/on a separate machine. Is that your case? In general you can run `dvc repro`, interrupt it in the middle, and it won't run completed steps again if you run `dvc repro` again.

To some extent, `dvc repro` is automatically doing "commits" locally by updating the DVC-files as it processes the steps.

Unless I'm missing something :)
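Roughly (a sketch using the stage names from your example):

```bash
dvc repro evaluate.dvc   # interrupted, say, right after the Transform Data stage finished
dvc status               # should report only the remaining stages as changed
dvc repro evaluate.dvc   # completed stages are detected via their updated DVC-files and skipped
```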
Correct. In my case we are running each step in a pipeline as a separate container (Kubeflow Pipelines, or Airflow with Kubernetes). What this means is that I need to somehow get the DVC files into the next container so that those previous steps don't rerun. One way is to do git commits, another is having a data management layer that does this between steps.
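With the git-commit approach, each step's container would have to do something like this (a sketch only; I'm assuming a shared `prod` branch, that `--single-item` behaves as I expect, and the names are placeholders):

```bash
# Sketch: one pipeline step per container, passing state to the next step through git.
git clone --branch prod https://github.com/<org>/<repo>.git project   # latest state, not a pinned SHA
cd project
dvc pull                                 # bring in the outputs produced by the previous step
dvc repro --single-item transform.dvc    # run only this stage, not its upstream stages
git commit -am "Update transform.dvc"
git push origin prod                     # so the next container sees the new hashes
dvc push                                 # upload this stage's outputs to the DVC remote
```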
I think running DVC stages in totally isolated containers is an anti-pattern. We can be creative with the workarounds, but having each command run in isolated contexts defeats the purpose of a tool that has at its core the inspection of existing state.

If you're not persisting this state otherwise, by having a shared file system or committing changes upstream, you're better off having a single job that runs `dvc repro` for the entire pipeline.
I solved this in my project by creating a small script to sync DVC stages between prod and dev. It's something like this:

```
sync-dvc data/dev/stage-1.dvc data/prod/stage-1.dvc
```

It copies the stage config, but keeps the asset hashes unchanged where possible. I agree it would be great if DVC supported this out of the box.