Dvc: DVC pipeline list only shows 1 pipeline

Created on 9 Mar 2020 · 10 Comments · Source: iterative/dvc

dvc pipeline list only shows 1 pipeline as a result even though there are several stages that start from the same previous stage.

The use case that helped me discover this is that I am creating several models. Each model feeds from a baseline pipeline that loads and preprocesses data.

Platform and method of installation: pkg Mac

DVC version: 0.87.0
Python version: 3.7.5
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: True
Package: osxpkg
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

discussion p2-medium

All 10 comments

Reproduction script:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
set -ex

git init --quiet
dvc init -q

echo data >> data
dvc add data -q


dvc run -d data -o preprocessed -f 1.dvc "cat data > preprocessed"
dvc run -d preprocessed -o res1 -f 2.dvc "cat preprocessed > res1"
dvc run -d preprocessed -o res2 -f 3.dvc "cat preprocessed > res2"

dvc pipeline list

Will show:

+ dvc pipeline list
1.dvc
2.dvc
3.dvc
data.dvc

While one could expect two lists:

1: data -> 1.dvc -> 2.dvc
2: data -> 1.dvc -> 3.dvc

Isn't it related to #2392?

> While one could expect two lists:

@pared But that is the same pipeline. dvc pipeline list just dumps the pipeline stages, not really each possible path from root to leaves.
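The two readings of "pipeline" in this exchange can be made concrete with a small sketch. It is plain Python over a hand-written dependency dict, not DVC's actual code: the `DEPS` mapping and both function names are made up for illustration, and the stage names are taken from the reproduction script. The first function groups stages into connected components, which is roughly what list appears to print; the second enumerates root-to-leaf paths, which is what the two expected lists describe.

```python
from collections import defaultdict

# Hypothetical model of the reproduction script: stage -> stages it depends on.
DEPS = {
    "data.dvc": [],
    "1.dvc": ["data.dvc"],
    "2.dvc": ["1.dvc"],
    "3.dvc": ["1.dvc"],
}


def connected_components(deps):
    """Group stages that are linked by any chain of dependencies."""
    neighbours = defaultdict(set)
    for stage, parents in deps.items():
        neighbours[stage]  # register stages that have no links at all
        for parent in parents:
            neighbours[stage].add(parent)
            neighbours[parent].add(stage)
    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


def root_to_leaf_paths(deps):
    """List every path from a stage without dependencies to a stage nothing depends on."""
    children = defaultdict(list)
    for stage, parents in deps.items():
        for parent in parents:
            children[parent].append(stage)
    roots = sorted(stage for stage, parents in deps.items() if not parents)

    def walk(node, prefix):
        path = prefix + [node]
        if not children[node]:
            return [path]
        return [p for child in sorted(children[node]) for p in walk(child, path)]

    return [path for root in roots for path in walk(root, [])]


print(connected_components(DEPS))
print(root_to_leaf_paths(DEPS))
```

For this graph the first call yields a single component containing all four stages, while the second yields two paths, one ending in 2.dvc and one ending in 3.dvc.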

@guillecarc take note of @efiop's comment.
Still, looking at the behaviour of show and list, it seems that for
show - the pipeline is the target stage with its predecessors
list - (in the given example) the pipeline is the root stage with all its "child" stages.

It seems to me we should either unify the behaviour or make this discrepancy clear.
Would you agree?
@efiop @jorgeorpinel?

If I remember correctly, the problem (also discussed in #2392) is that show returns the whole connected component of the stage and not only the stages "up to the stage".

@drorata As I recall, yes, that was the case back then. But during development we made some changes that affected how the pipeline commands are handled.
For example:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo

git init --quiet && dvc init -q

echo data >> data

dvc add data -q

dvc run -q -d data -o data_train "echo data_train >> data_train"
dvc run -q -d data -o data_test "echo data_test >> data_test"
dvc run -q -d data_test -d data_train -o result "echo result >> result"

dvc run -q -d data -o branch "echo branch >> branch"

When we run dvc pipeline show --ascii result.dvc we get:

                 +----------+
                 | data.dvc |
                 +----------+*                   
               ***            ***                
             **                  **              
           **                      **            
+---------------+            +----------------+  
| data_test.dvc |            | data_train.dvc |  
+---------------+            +----------------+  
               ***            ***                
                  **        **                   
                    **    **                     
                +------------+                   
                | result.dvc |                   
                +------------+                   

When we run dvc pipeline show --ascii data_test.dvc:

  +----------+
  | data.dvc |
  +----------+     
        *          
        *          
        *          
+---------------+  
| data_test.dvc |  
+---------------+  

So now, show takes the target and its predecessors.
I think we should talk through how we handle the pipeline subcommands.
As it stands, I would say we cannot get a full picture of our DAG without analysing the stage files or running pipeline show a few times.
In the given example we would have to:

  1. Run dvc pipeline list to see all interconnected stages
  2. Run dvc pipeline show at least 2 times (for result.dvc and branch.dvc) to get an idea of what the whole project looks like. Of course, when seeing the project for the first time, it will be much harder.
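Why two show calls are needed here can be sketched in a few lines. This is an illustration over a hand-written dict, not how DVC collects stages: `DEPS`, `with_predecessors` and `leaf_stages` are hypothetical names, and the mapping simply mirrors the script above. The idea is that show on a target returns that target plus everything it depends on, so only the stages nothing else depends on need to be queried to cover the whole graph.

```python
# Hypothetical model of the second script: stage -> stages it depends on.
DEPS = {
    "data.dvc": [],
    "data_train.dvc": ["data.dvc"],
    "data_test.dvc": ["data.dvc"],
    "result.dvc": ["data_test.dvc", "data_train.dvc"],
    "branch.dvc": ["data.dvc"],
}


def with_predecessors(deps, target):
    """Return the target plus every stage it depends on, directly or indirectly."""
    collected, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in collected:
            collected.add(stage)
            stack.extend(deps[stage])
    return collected


def leaf_stages(deps):
    """Return the stages that no other stage depends on."""
    used = {parent for parents in deps.values() for parent in parents}
    return sorted(stage for stage in deps if stage not in used)


leaves = leaf_stages(DEPS)
covered = set().union(*(with_predecessors(DEPS, leaf) for leaf in leaves))

print(sorted(with_predecessors(DEPS, "data_test.dvc")))
print(leaves)
print(covered == set(DEPS))
```

In this model the leaves are branch.dvc and result.dvc, and the union of their predecessor sets covers every stage, which matches the "at least 2 times" estimate above.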

It'd be great to review the 2 cmd refs in the docs repo. Maybe this issue can be transferred there or a new one opened. Thanks

I cannot say whether it is a documentation discussion or design, but what I can say... unfortunately, I didn't update DVC in the main project I'm using it in (I'm too worried that something will break). So maybe indeed the behavior changed. @pared when did it change?

Otherwise, I would say that there should be the following options:

  1. Given a .dvc stage show all:

    1. preceding stages

    2. all stages in the connected component of the DAG to which the stage belongs

  2. or, all connected components of the DAG.

Does it make sense?

@drorata just to note:
we are talking here about stage collection for the pipeline commands, which has only "visual" value.
Collecting stages and building the graph for reproduction happens elsewhere and has not been touched for quite some time.

What version do you use in your main repo? If newer versions were to introduce some problems, you could probably roll them back using git. As long as the cache is not removed and you use git, you should be fine.

The different show rules were introduced in version 0.82.9.

Closing, as dvc pipeline list no longer exists and dvc dag has replaced dvc pipeline show.
