DVC 0.91.0, installed with pipx on Ubuntu 18.04 with Python 3.6.9
The `repro` command seems to ignore dependency order when multiple targets are specified. For example, if I'm doing transfer learning from model A to model B, then training stage B depends on A:
01-prep-a 01-prep-b
* **** *
* **** *
* **** *
02-train-a ********* 02-train-b
When reproducing just the models, I would like to run:
`dvc repro -f --downstream 02-train-*`
But sometimes this causes `02-train-B` to run before `02-train-A`. Perhaps it depends on the order that results when the shell expands the wildcard.
I think inter-stage dependencies should be resolved by DVC when multiple targets are given.
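To illustrate the symptom, here is a minimal sketch of what could happen if each target were handled independently, in command-line order. This is an assumption made for illustration, not DVC's actual code; the `CHILDREN` map and both function names are hypothetical.

```python
# Hypothetical pipeline: stage -> stages that consume its output.
CHILDREN = {
    "02-train-a": ["02-train-b"],
    "02-train-b": [],
}


def run_downstream(target, log):
    """Record target, then everything downstream of it, depth first."""
    log.append(target)
    for child in CHILDREN[target]:
        run_downstream(child, log)


def repro_each_target(targets):
    """Handle every target on its own, in the order given on the command line."""
    log = []
    for target in targets:
        run_downstream(target, log)
    return log


# With B listed first, B runs before A and then runs a second time.
print(repro_each_target(["02-train-b", "02-train-a"]))
```

Under this assumption the run order follows the argument order, and a stage shared by several targets is executed more than once.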
Hi @z0u !
So A and B somehow depend on each other, right? In that case, it seems it won't make a difference in the `-f` case, as all of them and their dependencies will be reproduced anyway. Or do you mean that it would be nice to avoid reproducing common deps multiple times in that scenario?
Hi @efiop ,
Sorry for not being clear about the dependencies. Yep, B depends on A. So what I expect to happen is: A always executes before B.
|Command|Effect|
|---|---|
|`dvc repro`|Runs 01-prep-A, 01-prep-B, 02-train-A, 02-train-B|
|`dvc repro -f 02-train-A`|Runs 01-prep-A, 01-prep-B, 02-train-A|
|`dvc repro -f --downstream 02-train-A`|Runs 02-train-A, 02-train-B|
|`dvc repro -f --downstream 02-train-B 02-train-A`|Runs 02-train-A, 02-train-B|
|`dvc repro -sf 02-train-*`|Runs 02-train-A, 02-train-B|
Got it @z0u . So to implement it we will need to pass all targets in https://github.com/iterative/dvc/blob/0.91.3/dvc/command/repro.py#L41 and then properly collect them and put them into the common stages to run in https://github.com/iterative/dvc/blob/0.91.3/dvc/repo/reproduce.py#L53 . Unfortunately we will not be able to get to implementing this right away ourselves, but we would definitely help out if anyone else is interested in contributing a PR :slightly_smiling_face:
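The idea of collecting all targets into one common set of stages can be sketched as follows. This is only an illustration under stated assumptions, not the DVC implementation: the `DEPS` map, `downstream`, and `plan_downstream` are hypothetical names, and the prep-stage edges are assumed. It relies on `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: stage -> stages it depends on.
DEPS = {
    "01-prep-a": set(),
    "01-prep-b": set(),
    "02-train-a": {"01-prep-a"},
    "02-train-b": {"01-prep-b", "02-train-a"},
}


def downstream(deps, target):
    """Return target plus every stage that depends on it, directly or indirectly."""
    found = {target}
    changed = True
    while changed:
        changed = False
        for stage, parents in deps.items():
            if stage not in found and parents & found:
                found.add(stage)
                changed = True
    return found


def plan_downstream(deps, targets):
    """Merge the downstream sets of all targets, then order them once."""
    selected = set()
    for target in targets:
        selected |= downstream(deps, target)
    # Keep only the edges between selected stages, then sort topologically
    # so that every stage comes after the selected stages it depends on.
    subgraph = {stage: deps[stage] & selected for stage in selected}
    return list(TopologicalSorter(subgraph).static_order())


print(plan_downstream(DEPS, ["02-train-b", "02-train-a"]))
```

Because the targets are merged before ordering, the argument order no longer matters and each stage appears in the plan only once.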
I don't think the issue is reproducible in the latest releases of dvc (at least in commit 0275a0d8).
If stages are stored in `dvc.yaml`, then the output of `dvc repro -f --downstream 02-train-*` is:
ERROR: "Stage '02-train-*' not found inside 'dvc.yaml' file"
If stages are stored the old-fashioned way (i.e. dvc files with no `dvc.yaml` file), then the output of `dvc repro -f --downstream 02-train-*` is:
ERROR: 'dvc.yaml' does not exist.
P.S.: it seems that it is impossible to use wildcards with `dvc repro` in the latest releases. Not sure if it is a bug though.
@nik123 dvc itself doesn't evaluate the wildcards and never did - your shell does that and then passes it to dvc. What shell do you use? And do you have our completion scripts installed? If so, could you check that you have the latest ones?
@efiop , I use the standard Ubuntu interpreter (which is bash AFAIK). `bash --version` gives me: 4.4.20(1)-release (x86_64-pc-linux-gnu)
I've reinstalled the latest version (1.1.1) via pip (`pip install dvc`) and rerun `dvc repro -f --downstream 02-train-*`. Now I get the same error for the dvc.yaml file, but the error disappeared with old-fashioned dvc files.
@nik123, why are you moving back-and-forth with different formats?
Also, when you are using `02-train-*`, there's no `.dvc` files, so it's complaining as no file exists with that matching glob. `dvc repro` cannot understand globs right now. Regarding when it was `.dvc` files, it was getting interpreted by your shell and the expanded version was being sent to `dvc`. Now that you don't have any files matching the glob, it is sent as-is to `dvc`.
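The shell behaviour described here can be approximated in a few lines. This is a rough sketch of bash's default filename expansion, assuming options such as `nullglob` and `failglob` are off; `expand_like_shell` is a hypothetical helper, and real bash sorts matches by locale collation rather than plain string order.

```python
import fnmatch
import os
import tempfile


def expand_like_shell(pattern, cwd):
    """Approximate bash's default glob handling for one argument.

    If any file name in cwd matches, the argument is replaced by the sorted
    matches; otherwise the pattern is passed through unchanged.
    """
    names = os.listdir(cwd)
    if not pattern.startswith("."):
        # A leading wildcard does not match hidden files by default.
        names = [name for name in names if not name.startswith(".")]
    matches = sorted(fnmatch.filter(names, pattern))
    return matches if matches else [pattern]


with tempfile.TemporaryDirectory() as workdir:
    # No matching files yet: the literal pattern is what the program receives.
    print(expand_like_shell("02-train-*", workdir))
    for name in ("02-train-a.dvc", "02-train-b.dvc"):
        open(os.path.join(workdir, name), "w").close()
    # Matching files exist: the program receives the expanded file names.
    print(expand_like_shell("02-train-*", workdir))
```

This matches the two situations in the thread: with `.dvc` files present the program sees file names, and without them it sees the raw `02-train-*` string.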
> @nik123, why are you moving back-and-forth with different formats?

Once I failed to reproduce the issue with the new dvc.yaml format, I tried to reproduce it with dvc-files. That's the only reason.

> Also, when you are using `02-train-*`, there's no `.dvc` files, so it's complaining as no file exists with that matching glob. `dvc repro` cannot understand globs right now. Regarding when it was `.dvc` files, it was getting interpreted by your shell and the expanded version was being sent to `dvc`. Now that you don't have any files matching the glob, it is sent as-is to `dvc`.

@skshetry I've checked it with a debugger and it seems you are absolutely right. Thanks for the help!
P.S.: I don't know what are the plans for old format support but it will probably be canceled at some point. That means that `dvc repro -f --downstream 02-train-*` probably won't be the valid command and it raises the question if this issue should be fixed at all.
> P.S.: I don't know what are the plans for old format support but it will probably be canceled at some point. That means that `dvc repro -f --downstream 02-train-*` probably won't be the valid command and it raises the question if this issue should be fixed at all.

@nik123 It will be valid with our completion scripts, we just didn't get to working on that feature yet. https://github.com/iterative/dvc/issues/3743
Hi! Thanks for continuing to look into this, but I think the current discussion misses the point of this issue. The wildcard was just a concise way to describe the problem. The point is: `02-train-B` should not run before `02-train-A` if both are specified. The corresponding example in the new syntax is (I think; untested):
`dvc repro -f --downstream 02-train-B 02-train-A`
Which should execute `02-train-A` and then `02-train-B`.