I see a lot of value in using DVC during the development phase of a DS project, especially having the ability to reproduce outputs only if dependencies have changed.
One of the problems we are trying to solve is how we can move a data scientist's code back and forth between development and production. Ideally we would want their local development experience to translate easily into production. I've created a toy project with DVC to see if we could use it for developing a multi-step pipeline which does data transfers between each step.
However, there is one thing that is unclear when scheduling this same pipeline in Kubeflow/Airflow. Let's assume that my pipeline is as follows: `Get Data` → `Transform Data` → `Train Model` → `Evaluate Model`.
If I do all of my local development (`dvc run`, `dvc repro`) then everything works. But in a production setting I will have unique inputs to my pipeline. For example, the datetime stamp or other input variables will change. I can integrate this with DVC by having a file called `parameters` as a dependency of the `Get Data` step.
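For concreteness, the toy pipeline looks roughly like this (the script and file names are made up for illustration, and the exact `dvc run` flags may vary by DVC version):

```bash
# Rough sketch of the toy pipeline (script and file names are illustrative only)
dvc run -f get_data.dvc  -d parameters     -d get_data.py  -o data/raw.csv   python get_data.py
dvc run -f transform.dvc -d data/raw.csv   -d transform.py -o data/clean.csv python transform.py
dvc run -f train.dvc     -d data/clean.csv -d train.py     -o model.pkl      python train.py
dvc run -f evaluate.dvc  -d model.pkl      -d evaluate.py  -M metrics.json   python evaluate.py

# Changing the contents of `parameters` (e.g. a new datetime stamp) invalidates
# get_data.dvc, so reproducing the last stage re-runs the whole chain.
dvc repro evaluate.dvc
```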
So when I run the pipeline on Airflow on different days, then the dependencies for step 1 will be different, which means it will get recomputed.
The problem that I have is that all of the steps in the graph have their hashes hardcoded based on the local development environment. So even if I rerun this whole pipeline multiple times with the same input parameters, none of the `*.dvc` files in the pipeline will be updated, meaning everything will rerun from scratch. That's because they are running in an isolated production environment and not committing code back to the project repo. So `dvc` loses its value when wrapped in a scheduler.
Am I missing something, or is DVC primarily useful in local development only?
@woop, great question!
Could you please clarify a few things to make sure I understand your scenario correctly:

1. Do all of the steps (`Get Data`, `Transform Data`, `Train Model`, `Evaluate Model`) work in a single Airflow job/k8s container?
2. Does the `Get Data` step work outside dvc and update data files in a dvc repo?
3. Then you run `dvc repro` which does the rest: `Transform Data`, `Train Model`, `Evaluate Model`. Right?
4. Do you use `dvc add`, commit/`git commit` and push/`dvc push` in production? Is it correct?

Thanks for the fast reply!
Locally I run `dvc repro evaluate.dvc`, which would run everything. In production I could do the same (if I was using a single task/container), or what I wanted was to `dvc repro`/`dvc run` each step while running multiple containers. I'm not clear how that would help though, because it would seem like each step would recompute all dependencies up to `get data`.

I'm also not sure how we would handle `git commit` if I want reproducibility. Especially because our orchestrator is set up in such a way that it git clones the project repo into a docker container based on a git SHA. So if we did a git commit then it would not update subsequent steps.
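Roughly, each container in our orchestrator does something like this on startup (a simplified sketch; the repo URL, the `GIT_SHA` variable and the stage name are just placeholders):

```bash
#!/bin/sh
# Illustrative entrypoint for one pipeline step (names and variables are hypothetical).
git clone https://github.com/<org>/<repo>.git project
cd project
git checkout "$GIT_SHA"    # pinned commit chosen when the pipeline run was scheduled
dvc repro transform.dvc    # the DVC-file hashes come from local development, so this
                           # step (and everything upstream of it) gets recomputed
```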
Thank you for sharing details! It is clear with 1-3.

Re (4)... When do you clone the git repo based on a SHA into a docker container? When you create the docker image that needs to be run, or while the container is running (it gets the SHA from somewhere and clones the repo based on it)?
For reproducibility between container runs in prod you need to share state between runs. A Git repo is one of the ways to share that state. To get the state in prod you need to clone the most recent version of the repo when you run the container (not a repo pinned to some SHA). To update the state you need to run something like `git commit && git push && dvc push` from prod.

It might look unnatural to commit/push from prod, and it can create a mess in your git history. To avoid these issues you might use a separate Git branch for prod (like in traditional development - `master` and `prod` branches).
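Roughly something like this (just a sketch; the remote, repo URL and stage file are placeholders):

```bash
# Sketch: share pipeline state between prod runs through a dedicated "prod" branch.
git clone --branch prod https://github.com/<org>/<repo>.git project   # latest prod state, not a pinned SHA
cd project
dvc pull                       # fetch cached data/models referenced by the committed DVC-files
dvc repro evaluate.dvc         # only stages whose dependencies changed are re-executed
git commit -am "Update DVC-files after prod run"
git push origin prod           # persist the new state for the next scheduled run
dvc push                       # upload the new data/model artifacts to the DVC remote
```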
We basically have two approaches.
Let's assume we don't pin the cloned repo to a specific git commit. Let's just take the latest one with `git clone`. Wouldn't we create a race condition if we have multiple of these pipelines running at the same time, because they are presumably sharing the "latest" commit? Or is the suggestion that each pipeline type should have its own branch? We'd be creating a git commit for every step in the pipeline for every pipeline run.
I guess I need to think about this a bit more. I like the idea of being able to pick up from a very specific commit and completely reproduce the state of that system, but it seems this could be quite difficult to manage and standardize.
@woop can you elaborate on running multiple pipelines in parallel? If you have a single production system that is retraining some stuff on cron (let's say daily), it makes sense to me to have a separate "production" branch. Every commit on it should be the latest state of master with updates to the DVC-files forced to the top of it.
So the idea is to have endlessly growing branches?
Every time the same step runs, the query will change, which will update the input data, which will rerun all the steps. Then it will commit all of this to the repository for each step in each pipeline every time it runs.
We would need to squash those git repos eventually; they will become massive, I think.
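For example, we would probably have to do something like this periodically (just a sketch of one way to flatten a branch; it doesn't shrink the DVC cache itself):

```bash
# Sketch: collapse the prod branch history into a single commit once it gets too large.
git checkout --orphan prod-squashed prod   # same working tree as prod, but no history
git commit -m "Squash prod history"
git branch -M prod-squashed prod           # replace the old prod branch locally
git push --force origin prod               # rewrite the remote branch
```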
@woop I'm not sure I'm following you on "Then it will commit all of this to the repository for each step in each pipeline every time it runs." Why do you need to commit every step separately? Usually, you would run the whole pipeline and then do `git commit`. Am I missing something?
If step 1 doesn't commit, then step 2 will have to rerun step 1 when it runs, and step 3 will have to rerun step 2 and step 1 when it runs. That is because they are still looking at old data hashes. The only way to get around this (it seems) is to commit the latest hashes at every step.
> If step 1 doesn't commit, then step 2 will have to rerun step 1 when it runs

This is true only if you are going to run step 2 in a different environment/on a separate machine. Is that your case? In general you can run `dvc repro`, interrupt it in the middle, and it won't run completed steps again if you run `dvc repro` again.

To some extent, `dvc repro` is automatically doing "commits" locally by updating the DVC-files as it processes the steps.

Unless I'm missing something :)
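Roughly (a sketch using the stage names from your example):

```bash
dvc repro evaluate.dvc   # interrupted, say, right after the Transform Data stage finished
dvc status               # should report only the remaining stages as changed
dvc repro evaluate.dvc   # completed stages are detected via their updated DVC-files and skipped
```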
Correct. In my case we are running each step in a pipeline as a separate container (Kubeflow Pipelines, or Airflow with Kubernetes). What this means is that I need to somehow get the DVC files into the next container so that those previous steps don't rerun. One way is to do git commits, another is having a data management layer that does this between steps.
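With the git-commit approach, each step's container would have to do something like this (a sketch only; I'm assuming a shared `prod` branch, that `--single-item` behaves as I expect, and the names are placeholders):

```bash
# Sketch: one pipeline step per container, passing state to the next step through git.
git clone --branch prod https://github.com/<org>/<repo>.git project   # latest state, not a pinned SHA
cd project
dvc pull                                 # bring in the outputs produced by the previous step
dvc repro --single-item transform.dvc    # run only this stage, not its upstream stages
git commit -am "Update transform.dvc"
git push origin prod                     # so the next container sees the new hashes
dvc push                                 # upload this stage's outputs to the DVC remote
```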
I think running DVC stages in totally isolated containers is an anti-pattern. We can be creative with the workarounds, but having each command run in isolated contexts defeats the purpose of a tool that has at its core the inspection of existing state.

If you're not persisting this state otherwise, by having a shared file system or committing changes upstream, you're better off having a single job that runs `dvc repro` for the entire pipeline.
I solved this in my project by creating a small script to sync DVC stages between prod and dev. It's something like this:

```
sync-dvc data/dev/stage-1.dvc data/prod/stage-1.dvc
```

It copies the stage config, but keeps the asset hashes unchanged where possible. I agree it would be great if DVC supported this out of the box.