dvc: consider introducing build matrix

Created on 14 Aug 2018 · 15 Comments · Source: iterative/dvc

https://github.com/iterative/dvc/issues/973#issuecomment-412739728

I.e. something like:

matrix:
  include:
    - workdir: runs/gs1
    - workdir: runs/gs2
cmd: process.py input output
deps:
  - path: input
outs:
  - path: output
    cache: True
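
A minimal sketch of how the proposed matrix could be expanded into concrete stages, assuming each `include` entry is simply merged into a copy of the shared stage definition. The function name `expand_matrix` and the merge semantics are illustrative assumptions, not existing DVC behavior.

```python
import copy

def expand_matrix(stage):
    """Return one concrete stage per entry in stage["matrix"]["include"]."""
    # Everything except the matrix itself is shared by all expanded stages.
    base = {k: v for k, v in stage.items() if k != "matrix"}
    entries = stage.get("matrix", {}).get("include", [])
    if not entries:
        return [copy.deepcopy(base)]
    expanded = []
    for entry in entries:
        concrete = copy.deepcopy(base)
        concrete.update(entry)  # e.g. adds workdir: runs/gs1
        expanded.append(concrete)
    return expanded

stage = {
    "matrix": {"include": [{"workdir": "runs/gs1"}, {"workdir": "runs/gs2"}]},
    "cmd": "process.py input output",
    "deps": [{"path": "input"}],
    "outs": [{"path": "output", "cache": True}],
}

stages = expand_matrix(stage)
for s in stages:
    print(s["workdir"], "->", s["cmd"])
```

Each expanded stage keeps the same cmd, deps and outs and differs only in its workdir.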

enhancement p3-nice-to-have

efiop

👍5

All 15 comments

Also maybe something like:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

It will produce output.experiment1, output.experiment2 and so on for the stages down the pipeline.
So basically, output files down the pipeline will have suffixes corresponding to the experiment that they belong to. Or maybe, instead of suffixes, there would be automatically created directories that would store those outputs for each experiment.
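
The suffix variant could be sketched as follows, assuming `$PARAMS` is substituted per matrix entry and the shared output name is replaced by `output.<name>`. The helper `expand_experiments` is hypothetical and only illustrates the idea.

```python
from string import Template

def expand_experiments(cmd_template, matrix, out="output"):
    """Build one command and one suffixed output name per matrix entry."""
    runs = []
    for entry in matrix:
        suffixed = f"{out}.{entry['name']}"
        # Swap the shared output name for the per-experiment one, token by token.
        tokens = [suffixed if t == out else t for t in cmd_template.split()]
        cmd = Template(" ".join(tokens)).substitute(PARAMS=entry["params"])
        runs.append({"name": entry["name"], "cmd": cmd, "out": suffixed})
    return runs

matrix = [
    {"name": "experiment1", "params": "--option 1"},
    {"name": "experiment2", "params": "--option 2"},
]

runs = expand_experiments("mycmd input output $PARAMS", matrix)
for run in runs:
    print(run["cmd"])
```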

efiop on 15 Nov 2018

If I understand it correctly, this can already be handled by outputting a directory and using a command that contains a for loop, right?

Something like this:

mkdir output; for i in {0..100}; do mycmd input/gs${i}/options.json output/gs${i}; done

This approach also makes it possible to run all tasks in parallel, if you are able to submit asynchronously and wait for all tasks to finish:

dvc run -d input -o output 'mkdir output; for i in {1..100}; do mycmd input/gs${i}/options.json output/gs${i} & done; wait_for_results gs{1..100}'

# Formatted script:
mkdir output
for i in {1..100}; do
    # Run each task in the background
    mycmd input/gs${i}/options.json output/gs${i} &
done
wait_for_results gs{1..100}
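
The same submit-then-wait pattern can be sketched in Python with a thread pool. This is only an illustration: `run_all` is a hypothetical helper, and because `mycmd` is a placeholder, the demo uses the Python interpreter as a stand-in command whose task 3 deliberately fails.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_all(commands, max_workers=4):
    """Run each argv list as a subprocess in parallel; return exit codes in order."""
    def run_one(argv):
        return subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Collecting the results blocks until every command has finished.
        return list(pool.map(run_one, commands))

# Stand-in for `mycmd input/gs${i}/options.json output/gs${i}`:
# each task exits with status 0, except task 3, which exits with 1.
task_ids = range(1, 6)
commands = [
    [sys.executable, "-c", f"import sys; sys.exit({1 if i == 3 else 0})"]
    for i in task_ids
]

exit_codes = run_all(commands)
failed = [i for i, code in zip(task_ids, exit_codes) if code != 0]
print("failed tasks:", failed)
```

Knowing which tasks failed is what a single directory output cannot exploit: the whole stage still has to be rerun.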

The problem with outputting a directory is that when you want to run an additional experiment, or if some of your experiments fail, you have to rerun all of the other ones as well. Therefore I think it's better to think in terms of one experiment = one DVC file. Making it possible to run these tasks in parallel (#755) would make that approach usable.

For example:

mkdir output; 
# Move to output directory to create DVC files there
cd output;
for i in {1..100}; do 
    # Would have to execute in parallel
    dvc run -d ../input/gs${i}/options.json -o gs${i} mycmd ../input/gs${i}/options.json gs${i}; 
done;

prihoda on 22 Nov 2018

👍1

@prihoda Great point! This https://github.com/iterative/dvc/issues/1214 should be useful for such scenarios as well, since you will be able to tell dvc to not remove output before reproduction.

efiop on 23 Nov 2018

👍1

I'm still trying to understand the build matrix stuff. And I think we cannot solve this problem without introducing the concept of reconfigurable stages. Let me explain this.

Parallelism

First, it looks like the build matrix can be part of the parallel execution problem (#755), where parallel steps are specified in a single stage as a build matrix with a certain level of parallelism.

However, an ideal parallelization solution should be able to run commands even from different stages. So, I'd discuss the parallel execution problem and build-matrix problem separately.

Reconfiguration

Second, there are many issues that point to a build matrix. Most of them are related to reconfiguring a step or a pipeline:

In #973, @Hong-Xiang was asking about reusing (reconfigurable) pipelines.
#1416 asks to parametrize a pipeline or step: not a config file, just parameters.
#1119 is about repetitive commands. I see a similarity with parametrizable commands where only a single output is in use and no separate directory is created for each experiment (./output.p instead of gs1/output.p).

To make a stage reconfigurable, many questions have to be answered (how to pass configs and params, how to specify inputs and outputs) and some assumptions have to be made. Reconfigurable steps are the problem we need to solve first, before introducing a build matrix and before trying to implement something like this:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

Only after that will we be able to introduce a build matrix, or decide to just use loops as @prihoda suggested.

I've created a new issue #1462 for reconfigurable stages.

dmpetrov on 31 Dec 2018

👍2

Have you looked into getting this behaviour using e.g. snakemake and building this into a dvc run step?

hhoeflin on 8 Aug 2019

👍1

@hhoeflin We didn't. Could you elaborate, please? How would that look? 🙂

efiop on 8 Aug 2019

@efiop My own experience with snakemake is limited, so take this with a grain of salt. But the command is basically just a rule. Outputs are the targets, where the experiment name would be encoded in the directory name of the output or in the target suffix (as you have suggested before). Snakemake allows you to easily parse these target names into their subcomponents. For the parameters, you could use a dictionary that injects different parameters into the rule depending on the experiment name.
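
The idea described above can be sketched in plain Python. This is not Snakemake syntax, only an imitation of the mechanism: the target layout `<experiment>/<filename>`, the `PARAMS` dictionary and the `command_for` helper are assumptions made for illustration.

```python
import re

# Hypothetical per-experiment parameters, keyed by experiment name.
PARAMS = {
    "experiment1": ["--option", "1"],
    "experiment2": ["--option", "2"],
}

# The experiment name is encoded in the directory part of the target.
TARGET_PATTERN = re.compile(r"^(?P<experiment>[^/]+)/(?P<filename>[^/]+)$")

def command_for(target):
    """Derive the command for a target such as 'experiment1/output'."""
    match = TARGET_PATTERN.match(target)
    if match is None:
        raise ValueError(f"unrecognised target: {target!r}")
    experiment = match.group("experiment")
    # Inject the parameters that belong to this experiment.
    return ["mycmd", "input", target, *PARAMS[experiment]]

print(command_for("experiment2/output"))
```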

Hope this helps. The documentation of snakemake is really good. Have a look there.

hhoeflin on 9 Aug 2019

👍2

@hhoeflin Thanks for elaborating! 🙂 I don't think we will be able to natively integrate snakemake into dvcfiles, since we use pure YAML, but we could definitely check it out to see whether we can draw some conclusions from it and use them while implementing our own feature.

efiop on 12 Aug 2019

@efiop
One other interesting project to look into would be makepp (http://makepp.sourceforge.net/)

It is a make program that tracks inputs and outputs using md5 checksums that are stored inside a project in .makepp directories. It is a "drop-in" replacement for make.
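
Checksum-based change detection of this kind can be sketched as below. This is not makepp's actual storage format: the JSON state file and the helper names `md5_of`, `is_stale` and `record` are assumptions for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def md5_of(path):
    """Return the md5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def is_stale(deps, state_file):
    """True if any dependency's checksum differs from the recorded one."""
    state_file = Path(state_file)
    recorded = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = {str(d): md5_of(d) for d in deps}
    return current != recorded

def record(deps, state_file):
    """Store the current checksums so the next check can compare against them."""
    Path(state_file).write_text(json.dumps({str(d): md5_of(d) for d in deps}))

with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "input.txt"
    state = Path(tmp) / "checksums.json"  # hypothetical state file
    dep.write_text("v1")
    first = is_stale([dep], state)   # nothing recorded yet
    record([dep], state)
    second = is_stale([dep], state)  # unchanged since recording
    dep.write_text("v2")
    third = is_stale([dep], state)   # content changed

print(first, second, third)
```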

hhoeflin on 14 Aug 2019

👍3

Small bump out of curiosity.

Is there any plan to introduce such a feature?

We are currently producing 500+ .dvc files and updating them frequently (changing deps, outs, cmd, not just the hashes).
With the introduction of multi-stage files in dvc v1, we will be able to reduce this to ~80 files. Kudos!
With a matrix feature like the one discussed here, we would be able to reduce this to 1 .dvc file with the 80 elements as a matrix.

For context:
We are currently using cookiecutter (https://github.com/cookiecutter/cookiecutter) to produce dvc pipelines.
This works, but having a matrix system would greatly improve readability, usage (repro) and maintainability.

Wirg on 2 Jul 2020

@Wirg I think the mechanisms for that will be introduced as a part of https://github.com/iterative/dvc/issues/2799 . We are actively working on that right now, though we are still in the early stages of development.

efiop on 2 Jul 2020

@Wirg, can you use YAML anchors? It might not be sufficient considering our YAML structure, but for small cases (such as sharing wdir), it might work.

skshetry on 2 Jul 2020

@efiop

Thanks for the lightning fast answer and the expectation management on the development stage.

I subscribed to the provided issue.

Wirg on 2 Jul 2020

👍1

@skshetry thanks for your suggestion

@Wirg, can you use YAML anchors? It might not be sufficient considering our YAML structure, but for small cases (such as sharing wdir), it might work.

I am not clear on how YAML anchors would improve the situation.
We will probably use them for the multi-stage update with dvc v1.
But I am not clear on how they would replace a matrix feature.

As an example, we run this kind of .dvc file with various configs (HP and/or input data; currently ~80):

cmd: python src/pipeline.py
  --sub_folder {{cookiecutter.sub_folder}}
  --base_dir {{cookiecutter.base_dir}}
  --input_data {{cookiecutter.input_data}}
  --config_file {{cookiecutter.config_file}}
wdir: {{cookiecutter.dvc_wdir}}
deps:
- path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/annotations/{{cookiecutter.input_data}}
- path: src/pipeline.py
- path: {{cookiecutter.config_file}}
outs:
- path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/outputs

Our current approach is to generate one such file per config with cookiecutter and run them all with dvc repro -R.
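
The one-template, many-configs workflow can be sketched without cookiecutter itself. The regex-based `render` helper below is a simplified stand-in for real template rendering, and the config values are made up for illustration.

```python
import re

# Matches placeholders of the form {{cookiecutter.key}}.
PLACEHOLDER = re.compile(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}")

def render(template, context):
    """Replace each {{cookiecutter.key}} with context[key]; KeyError if missing."""
    return PLACEHOLDER.sub(lambda m: context[m.group(1)], template)

TEMPLATE = (
    "outs:\n"
    "- path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/outputs\n"
)

# Hypothetical configs; in practice there would be ~80 of these.
configs = [
    {"base_dir": "data", "sub_folder": "run_a"},
    {"base_dir": "data", "sub_folder": "run_b"},
]

rendered = [render(TEMPLATE, c) for c in configs]
print(rendered[0])
```

A matrix feature would move this expansion into DVC itself, so a single file could describe all configs.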

Wirg on 2 Jul 2020

That was a really helpful long-running discussion. It helped us a lot in identifying a possible solution (#3633) and the first implementation (#4734).

Let's close this issue and move all the discussions to #3633.

dmpetrov on 18 Oct 2020
