dvc repro: use build cache for deterministic stages

Created on 18 Oct 2018 · 23 comments · Source: iterative/dvc

In my experiment I run a few different preprocessing steps, each of which creates a different CSV file; then I model this data and also check different parameters.
I want to run the same experiment on 4 different machines (dvc is connected to the same remote cache on each).
Currently every type of preprocessing gets re-run on every machine, which takes a lot of time and could be avoided by running dvc pull before dvc repro and dvc push after it.
It could work with one command, like dvc repro --remote

enhancement p0-critical

Most helpful comment

Yes, I will definitely give it a try :)

All 23 comments

Hi @AratorField !

Thank you for sharing your scenario! It would indeed be a very useful feature! I think that dvc checkout would benefit from it as well. How about something like an autofetch option for dvc config, which would tell dvc that it should fetch the cache before performing operations such as repro and checkout? Would that be suitable for you?

Thanks,
Ruslan

@efiop a config option would change the API: it wouldn't be clear from the code what exactly dvc repro does. An option like dvc repro --fetch might be a better choice than a global config param.

I am running dvc repro in a loop, so something like -f -p / --fetch --push would be ok for me.
Setting autofetch and autopush in the config is ok too.

@AratorField If you are running it in a loop, why can't you simply call dvc pull before it and dvc push after it? I was thinking about an interactive scenario, where such new options make sense, but in a loop it is easy enough to just call pull/push in your script.

I do it like you said when I run an experiment on the server where I run most of my experiments. But then I manually reproduce some of the best experiments locally.

Ah, I see. So it looks something like:

git pull
dvc pull
dvc repro
git commit ...
git push
dvc push

Right? If so, I think a more suitable approach would be to add git hooks that call dvc pull after git pull and dvc push after git push, similar to what we currently have with dvc checkout: you can call dvc install to install a post-checkout hook so that dvc checkout is automatically called after git checkout. Would that suit you?

Currently dvc install only installs the post-checkout hook, so we will have to add support for post-pull and post-push hooks as well, which totally fits within our current API architecture.

Oh... So it works totally differently than I thought. Now I know why my algorithm does preprocessing every time.

Let me explain a bit more.
I am building about 100 models with different parameters (standard hyperparameter optimization).
Before data modeling there is preprocessing, so in dvc repro there are two steps, preprocessing and modeling.
I am working with text, and in preprocessing I have a script which can do two things:

  1. convert all to lowercase
  2. remove punctuation

So there could be 4 different types of preprocessing:

  • None
  • convert all to lowercase
  • remove punctuation
  • convert all to lowercase and remove punctuation

I set the type of preprocessing in a config file, which is a dependency of preprocessing.py, which is in the pipeline.
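The two toggles described above can be sketched in a few lines. This is a hypothetical illustration of such a preprocessing script, not the actual code from this thread:

```python
import string

def preprocess(text: str, lowercase: bool = False, remove_punct: bool = False) -> str:
    """Apply the two independent preprocessing toggles described above.

    The 4 combinations of (lowercase, remove_punct) give the 4 different
    preprocessing variants, each producing a different output file.
    """
    if lowercase:
        text = text.lower()
    if remove_punct:
        # Strip every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text
```

With both toggles off the text passes through unchanged, which corresponds to the "None" preprocessing variant in the list above.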

What I thought DVC was doing is:
After running dvc repro with some type of preprocessing, and then running dvc repro again with another type, it won't do the preprocessing again if I run it a third time with the first type, because DVC will say:

Hey! We have already done this type of preprocessing and stored the output file in the cache, so now I can just take this file from the cache instead of running the preprocessing again!

But it looks like it stores only the last version of the file, right?

You are right, dvc doesn't currently remember every dependencies + command = outputs combination that has occurred in the past, so it can't just pull up outputs from the cache this way. That being said, I have been thinking about this scenario a lot, and it would indeed be extremely useful to have such a feature.

The best way we could implement this is to utilize git (or any other scm system the repository is based on) to build a table of dependencies + command = outputs values, which would help us quickly identify whether this combination of dependencies has already been processed by this particular command, so we could pull up the appropriate outputs without recomputing. I would call this something like a "build cache", as opposed to the current "data cache", and would indeed enable it with a special option for dvc repro, e.g. something like dvc repro --use-build-cache, which would tell dvc to check every stage to see if this combination (of deps and command) has been built before, and if it has, that it is okay for dvc to just pull up the result without recomputing.

It is also worth pointing out that since dvc doesn't keep 100% track of the environment you run your pipeline in (e.g. system lib versions and so on), there is always a chance that you won't get the same result if you build something in two environments. That said, with a little bit of care from the user (e.g. probabilistic models will always produce a different result, so if you were to use the --use-build-cache option, you need to be aware that it will pull up the old result, which will not be equal to the one you would've got by actually rebuilding), this feature should be extremely useful. Is this something that you would be interested in? If so, we can up the priority for this one.
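The "dependencies + command = outputs" table described above can be sketched roughly as follows. All names here are hypothetical; this is an illustration of the idea, not DVC's actual implementation:

```python
import hashlib

def stage_key(dep_hashes: list, command: str) -> str:
    """Identify one stage run by the hashes of its dependencies plus the exact command."""
    payload = "\n".join(sorted(dep_hashes)) + "\n" + command
    return hashlib.md5(payload.encode()).hexdigest()

# key -> {output_path: output_md5}; the hypothetical "build cache" table.
build_cache = {}

def repro(dep_hashes, command, run):
    """Run a stage, or pull up its outputs if this deps+command combination was seen before."""
    key = stage_key(dep_hashes, command)
    if key in build_cache:
        return build_cache[key]  # cache hit: skip recomputation
    outputs = run()              # cache miss: actually execute the stage
    build_cache[key] = outputs
    return outputs
```

The caveat from the comment above applies directly here: if `run()` is non-deterministic, the cached outputs will not match what a fresh run would produce.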

I would love to see it!
As you said, one has to be careful because not every script is deterministic, so maybe a better way to do that is to add an appropriate option to dvc run, like --deterministic. Then, when you run dvc repro, it will know which parts of the pipeline can be taken from the cache.

Makes sense! I will look into it soon. Thank you so much for the feedback!

NOTE: we can use git history as a build cache by searching for existing committed dvc files in history. We could also cache that operation so we only parse git history once. We also need to store a local build cache for uncommitted changes. Let's start with the latter.

This Git (or any SCM) history based solution from @efiop seems good. However, it is specific to SCM and won't work if a user does not commit changes (which might be natural for a hyperparameter search).

It might be beneficial to support a more "dynamic" data structure which is not tied to SCM history and stores build caches after each run (even without commits). This structure can still be populated from the Git history if the history exists.

One possible solution: support a "build cache" directory with symlinks/hardlinks to outputs in the cache. Link: md5(dependencies)_md5(command)_outputname --> output_in_cache. So, if dvc run finds a command with a corresponding cache entry, it creates the outputs without rerunning the command.
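The proposed link-name scheme could be computed roughly like this. A sketch only; the function names are made up for illustration:

```python
import hashlib

def md5(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def build_cache_link_name(dependencies, command: str, output_name: str) -> str:
    """Build the md5(dependencies)_md5(command)_outputname link name proposed above.

    The link itself would then point at the corresponding output file in the
    data cache (e.g. via os.link or os.symlink).
    """
    deps_digest = md5("\n".join(sorted(dependencies)))
    cmd_digest = md5(command)
    return "{}_{}_{}".format(deps_digest, cmd_digest, output_name)
```

Sorting the dependency hashes makes the key independent of dependency order, so the same stage always maps to the same link name.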

I totally agree with @dmpetrov on that! It would be cool to have the ability to match stage inputs to outputs. I can imagine several scenarios when this feature would be extremely useful:
1) You just want to play around with data locally, and caring about branches/cache is just too much bother.
2) You or your teammate run an experiment which has been done before but everyone has already forgotten about that.
3) You want to create several models using a dvc pipeline by changing some inputs at the beginning of the pipeline and putting the results aside. BTW, this is a very common use case for me.

I understand that there might be some caveats related to using different environments, but I believe that most people who use DVC as a tool that guarantees reproducibility of experiments do freeze their dependencies. If someone updates their working environment on purpose, they should just reset the build-cache manually or run the stage/pipeline with the --force flag.

Thank you, @vasinkd, for the insights and the scenarios.

The third scenario is especially interesting. I'd appreciate it if you could provide more details on this scenario and explain the difference from the 2nd. I feel this scenario and the pain it solves, but a more solid use case would help to define the requirements.

Yeah, DVC definitely lacks this feature.

Actually, all three scenarios are the same: we run the same experiment several times and do not want to recalculate outputs if they are available in local/remote cache.

The second scenario is more about checking data in the remote cache. This is going to be helpful during the experimentation phase.
The third scenario is related to retraining of models: e.g. I run the same pipeline on different input data each month. I do it sequentially, in a Docker container containing the dvc pipeline and the required source code. Some stage inputs in the middle of the pipeline might happen to be the same across different pipeline input data. A local build-cache might be helpful in that situation.

BTW, I think it is not so difficult to implement. We could create a build-cache folder inside the .dvc folder and store full .dvc files (or just the part related to outputs) under a hash of the inputs. Therefore it would be possible to merge branches painlessly if outputs are written in a deterministic order. Merge conflicts would signal that something is wrong with the experiment setup on one of the machines.
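A minimal sketch of that idea, assuming a build-cache folder that stores the outputs section as JSON keyed by an md5 of the input hashes (all names here are hypothetical):

```python
import hashlib
import json
import pathlib

def save_build_cache_entry(cache_dir: pathlib.Path, input_hashes, outputs: dict) -> pathlib.Path:
    """Store the outputs section of a stage file under a hash of its inputs.

    Writing with sort_keys gives a deterministic file for a given set of
    outputs, so entries merge cleanly across branches; a merge conflict
    would mean the same inputs produced different outputs somewhere.
    """
    key = hashlib.md5("\n".join(sorted(input_hashes)).encode()).hexdigest()
    entry = cache_dir / key
    entry.write_text(json.dumps(outputs, sort_keys=True, indent=2))
    return entry
```

A hypothetical lookup would then simply check whether `cache_dir / key` exists before rerunning the stage.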

@vasinkd Yeah, we've thought about filling a similar type of cache by parsing the git repo history to find all previously run dvc-files, plus some non-committed ones that were run in this local repo. I might be missing something, but I think we could totally start with the approach you've suggested. 👍 Maybe you would like to contribute a patch? 🙂

Yes, I will definitely give it a try :)

Did you get anywhere on this @vasinkd ?

@elgehelge Not really. I have been a little bit busy recently. I'm going to look at this issue closely these weekends. Yet, I would highly appreciate any help! :)

I am trying to sort it all out in my head, but somehow all of the changes that I would like are tied closely together 😅 Having a build cache is a very important piece of the puzzle. However, in my opinion it should not be based on md5(command), but rather something a little more fine-grained, like params, as discussed in this issue: https://github.com/iterative/dvc/issues/3393

Anyway, this should not scare you off. Starting with an md5 hash of the command would be a good place to start 👍

Related to #1871 btw.

@efiop, could you help clarify how it is related? I don't understand the relation.

@elgehelge In that issue we also talk about a build cache as a way to store pipeline results. Those are related, but could be implemented separately. This ticket is the easiest to implement if you don't take params into account, but once params are implemented they will change the dvc-file hash anyway, so the logic should (at least I don't foresee any issues right now) work as is.
