dvc pipeline list only shows 1 pipeline as a result even though there are several stages that start from the same previous stage.
The use case that helped me discover it is that I am creating several models. Each model feeds from a baseline pipeline that loads and preprocesses data.
Platform and method of installation: pkg Mac
DVC version: 0.87.0
Python version: 3.7.5
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: True
Package: osxpkg
Filesystem type (workspace): ('apfs', '/dev/disk1s1')
Reproduction script:
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
set -ex
git init --quiet
dvc init -q
echo data >> data
dvc add data -q
dvc run -d data -o preprocessed -f 1.dvc "cat data > preprocessed"
dvc run -d preprocessed -o res1 -f 2.dvc "cat preprocessed > res1"
dvc run -d preprocessed -o res2 -f 3.dvc "cat preprocessed > res2"
dvc pipeline list
Will show:
+ dvc pipeline list
1.dvc
2.dvc
3.dvc
data.dvc
While one could expect two lists:
1: data -> 1.dvc -> 2.dvc
2: data -> 1.dvc -> 3.dvc
Isn't it related to #2392?
While one could expect two lists:
@pared But that is the same pipeline. dvc pipeline list just dumps the pipeline stages, not really each possible path from root to leaves.
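The difference between the two readings can be sketched in plain Python. This is only an illustration, not DVC's implementation: the `deps` dictionary is a hand-written copy of the stage graph from the reproduction script, and the function names `connected_components` and `root_to_leaf_paths` are hypothetical.

```python
# Stage graph from the reproduction script (hand-written assumption):
# each stage file maps to the stage files it depends on.
deps = {
    "data.dvc": [],
    "1.dvc": ["data.dvc"],
    "2.dvc": ["1.dvc"],
    "3.dvc": ["1.dvc"],
}


def connected_components(deps):
    """Group stages linked by any dependency, ignoring edge direction."""
    neighbours = {stage: set() for stage in deps}
    for stage, parents in deps.items():
        for parent in parents:
            neighbours[stage].add(parent)
            neighbours[parent].add(stage)
    seen, components = set(), []
    for start in sorted(deps):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


def root_to_leaf_paths(deps):
    """List every path from a stage with no dependencies to a stage nothing depends on."""
    children = {stage: [] for stage in deps}
    for stage, parents in deps.items():
        for parent in parents:
            children[parent].append(stage)
    paths = []

    def walk(node, path):
        path = path + [node]
        if not children[node]:
            paths.append(path)
        for child in sorted(children[node]):
            walk(child, path)

    for root in sorted(s for s, parents in deps.items() if not parents):
        walk(root, [])
    return paths


# One group of stages: the "same pipeline" reading.
print(connected_components(deps))
# Two paths: the reading expected in the original report.
print(root_to_leaf_paths(deps))
```

Under these assumptions the first call yields a single group containing all four stage files, while the second yields the two paths ending in 2.dvc and 3.dvc.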
@guillecarc take note of @efiop's comment.
Still, looking at the behaviour of show and list, it seems that:
- for show, the pipeline is the target stage with its predecessors
- for list, (in the given example) the pipeline is the root stage with all its "child" stages.
It seems to me we should either unify the behaviour or make this discrepancy clear.
Would you agree?
@efiop @jorgeorpinel?
If I remember correctly, the problem (also discussed in #2392) is that show returns the whole connected component of the stage and not only "up to the stage".
@drorata As I recall, yes, that was the case back then. But along the development process we had some changes that influenced how pipeline commands are handled.
For example:
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
git init --quiet && dvc init -q
echo data >> data
dvc add data -q
dvc run -q -d data -o data_train "echo data_train >> data_train"
dvc run -q -d data -o data_test "echo data_test >> data_test"
dvc run -q -d data_test -d data_train -o result "echo result >> result"
dvc run -q -d data -o branch "echo branch >> branch"
When we run dvc pipeline show result.dvc we get:
                   +----------+
                   | data.dvc |
                   +----------+*
                ***            ***
              **                  **
            **                      **
+---------------+              +----------------+
| data_test.dvc |              | data_train.dvc |
+---------------+              +----------------+
                ***            ***
                   **        **
                     **    **
                  +------------+
                  | result.dvc |
                  +------------+
When we run dvc pipeline show --ascii data_test.dvc:
  +----------+
  | data.dvc |
  +----------+
        *
        *
        *
+---------------+
| data_test.dvc |
+---------------+
So now, show takes the target and its predecessors.
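As a rough illustration of "target and its predecessors", here is a plain-Python sketch. It is not how DVC collects stages internally: the `deps` mapping is transcribed by hand from the example script above, and `target_with_predecessors` is a hypothetical helper name.

```python
# Dependency graph of the example above (hand-written assumption):
# each stage file maps to the stage files it depends on.
deps = {
    "data.dvc": [],
    "data_train.dvc": ["data.dvc"],
    "data_test.dvc": ["data.dvc"],
    "result.dvc": ["data_test.dvc", "data_train.dvc"],
    "branch.dvc": ["data.dvc"],
}


def target_with_predecessors(deps, target):
    """Return the target plus every stage it depends on, directly or indirectly."""
    collected, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in collected:
            collected.add(stage)
            stack.extend(deps[stage])
    return sorted(collected)


print(target_with_predecessors(deps, "result.dvc"))
print(target_with_predecessors(deps, "data_test.dvc"))
```

In this sketch neither call returns branch.dvc, which matches the two diagrams: a stage that merely shares an ancestor with the target is left out.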
I think we should talk through how we are handling pipeline subcommands.
Now, I would say we cannot get the full idea of how our DAG looks without analysing the stage files or running pipeline show a few times.
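To make "a few times" concrete, the sketch below finds the stages that nothing else depends on and checks that showing each of them would, taken together, cover the whole graph. It is an illustration under assumptions, not DVC code: `deps` is a hand-written copy of the example graph, and `leaf_stages` and `with_predecessors` are hypothetical names.

```python
# Dependency graph of the example above (hand-written assumption):
# each stage file maps to the stage files it depends on.
deps = {
    "data.dvc": [],
    "data_train.dvc": ["data.dvc"],
    "data_test.dvc": ["data.dvc"],
    "result.dvc": ["data_test.dvc", "data_train.dvc"],
    "branch.dvc": ["data.dvc"],
}


def leaf_stages(deps):
    """Stages that no other stage depends on."""
    depended_on = {parent for parents in deps.values() for parent in parents}
    return sorted(stage for stage in deps if stage not in depended_on)


def with_predecessors(deps, target):
    """The target plus all of its direct and indirect dependencies."""
    collected, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in collected:
            collected.add(stage)
            stack.extend(deps[stage])
    return collected


leaves = leaf_stages(deps)
covered = set()
for leaf in leaves:
    covered |= with_predecessors(deps, leaf)

print(leaves)                # the targets one would need to show
print(covered == set(deps))  # whether those views together cover every stage
```

Every stage is a predecessor of some leaf, so one view per leaf is always enough to see the full DAG; the cost is that the reader has to stitch the views together.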
In the given example we would have to:
- run dvc pipeline list to see all interconnected stages
- run dvc pipeline show at least 2 times (for result.dvc and branch.dvc) to get an idea of what the whole project looks like. Of course, when seeing the project for the first time, it will be much harder.
It'd be great to review the 2 cmd refs in the docs repo. Maybe this issue can be transferred there or a new one opened. Thanks
I cannot say whether it is a documentation discussion or design, but what I can say... unfortunately, I didn't update DVC in the main project I'm using it in (I'm too worried that something will break). So maybe indeed the behavior changed. @pared when did it change?
Otherwise, I would say that there should be the following options:
.dvc stage show all:
Does it make sense?
@drorata just to note: we are talking here about stage collection for the pipeline command, which has only "visual" value.
Collecting stages and building graph for reproduction happens elsewhere, and has not been touched for quite some time.
What version do you use in your main repo? If newer versions were to introduce some problems, you could probably roll them back using git. As long as the cache is not removed and you use git, you should be fine.
The change that introduced the different show rules was made in version 0.82.9.
Closing, as there's no pipeline list anymore; there's dvc dag in place of dvc pipeline show.