Some notes on the playground, how to reproduce, and what I'm testing for.
Repo to reproduce: here.
I'm running testing on a new version of the example-get-started
that I built with this updated gen script.
The repo itself (that script from above generates) can be downloaded here. Should be enough to reproduce issues below described below. And can be useful for everyone as an easy entry point to try DVC 1.0 on a bit more sophisticated project.
This list below might include things not related 100% to the latest changes we introduced in 1.0, but all of them are potentially relevant. E.g., we change pipelines interface and should try scenarios around.
dvc repro
behavior and UI/UXdvc.yaml
prefix by default. Instead of running a stage like this dvc repro dvc.yaml:prepare
, I should be able to run it like this (if dvc.yaml
is found in the current dir? or if it is a single one in the project?)$ dvc repro prepare
$ dvc repro model.pkl
WARNING
is a rare thing that we need to show), or don't show again. Instead make a message about the stage that we execute more meaningful (see below)$ dvc repro
WARNING: assuming default target 'dvc.yaml'.
$ dvc repro
WARNING: stage: 'data/data.xml.dvc' is locked. Its dependencies are not going to be reproduced.
[x] When I run #3834dvc repro
I don't really see the stage name being reproduced - we can probably include it w/o within the "Running command" message. It can solve the problem of some WARNING (see above).
[x] When you change something and it hits cache it feels that we need to explain a bit on what's going on? show summary of changes to the workspace? (Partially fixed by #3834)
dvc repro
options--dry
does not work at all. $ vim scr/train.py (some dummy change)
$ dvc repro --dry`
[x] -p
,. --pipeline
- deprecate? It does not make sense moving forward to accept .dvc
files.
-p
still makes sense for running a pipeline of a given stage, even in 1.0.
Though,repro -p
will not work as either dvc file or stage name needs to be specified.
[x] Related to the previous one - we still accept .dvc
files into dvc repro
- we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.
dvc repro
is still backward compatible. And, we cannot detect if the repo is 1.0-one,
as the repo can have a mixture of old-style dvcfiles and new pipeline files.
[x] #3831--help
review and fix it. e.g. we don't expect Dvcfile
by default anymore.
$ dvc repro --help
DVC-file to reproduce. 'Dvcfile' by default.
[x] We probably need "ignore-run-cache" option? https://github.com/iterative/dvc/pull/3788
[x] https://github.com/iterative/dvc/pull/3788--ignore-build-cache
- name can be confusing now. Do we keep the same semantics at all? (Same question regarding the dvc run
.)
[ ] dvc pipeline show
- expects Dvcfile
, should now do the same as dvc repro
+ fix help.
[ ] dvc pipeline show
output is broken from a lot of different perspectives - see this comment for suggestions and details.
[x] We broke the progress bar for the get/import
. It is now Downloading 0/1 files
vs Downloading 0/36.2M
before. Mean that we effectively don't show a progress bar for the case like:
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
This is not fixed by reverting https://github.com/iterative/dvc/commit/b312895e77138d298ba492315058918fc697902f?w=1
Extracted to #3874
$ dvc repro prepare
The default is dvc.yaml
. The addressing is via :stage_name
(i.e. without explicit dvc.yaml). This is ugly, I know. But, this was done so as to have one way of addressing things (on data related commands, the stage name might collide with the file name, eg: dvc checkout foo
). I'm fine with just having it addressed without colon on pipeline-related commands though.
$ dvc repro model.pkl
I don't like this, because that specific stage might be generating lots of other stuffs than just model.pkl
. And, if we go this way, the same argument I mentioned of colliding with stage name applies here.
When I run dvc repro I don't really see the stage name being reproduced
Good point. That's a bad UX.
--dry does not work at all.
Broken by 18e8f075728
Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.
You mean, not supporting old .dvc
files at all bearing a cmd
/pipeline stage? I'm okay with deprecation.
--help review and fix it. e.g. we don't expect Dvcfile by default anymore.
Oops.
dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help
The problem is what to show for bare pipeline show. One idea is to show all possible DAGs
(could be too much, could also limit to just dvc.yaml
by default) and other is to show all of the stages inside a dvc.yaml
file one-by-one.
because that specific stage might be generating lots of other stuffs than just model.pkl.
not a problem I believe, I think most people will understand that stage will be running (means generating outputs). Also in majority of cases there will be one - two outputs, one of the metrics - it's totally fine. On the other hand - it's very convenient - I don't care about some stage names, I care about my model.
And, if we go this way, the same argument I mentioned of colliding with stage name applies here.
yes, but it's a matter of defining rules (first check stage name, second output or something like this). I would also be surprised (and it's bad design) if stage and output that are not related to each other have the same name.
@shcheklein thank you for the great summary!
Also, I'd add that repro is too verbose. I do repro for two unchanged stages without lock:
$ dvc repro
WARNING: assuming default target 'dvc.yaml'.
Stage is cached, skipping.
Output 'users.csv' didn't change. Skipping saving.
Stage is cached, skipping.
Output 'summary.json' doesn't use cache. Skipping saving.
Output 'logs.csv' didn't change. Skipping saving.
Output 'logs' didn't change. Skipping saving.
To track the changes with git, run:
git add dvc.lock
All this information is overwhelming - it takes time to read and understand that basically nothing happened. What is actually important - two stages hit the run-cache and lock file was generates (this info is not presented directly).
I'd expect something like:
$ dvc repro
Restoring stage 'proc' from run-cache.
Restoring stage 'trainme' from run-cache.
Lock file 'dvc.lock' was generated.
To track the changes with git, run:
git add dvc.lock
because that specific stage might be generating lots of other stuffs than just model.pkl.
not a problem I believe, I think most people will understand that stage will be running (means generating outputs). Also in majority of cases there will be one - two outputs, one of the metrics - it's totally fine. On the other hand - it's very convenient - I don't care about some stage names, I care about my model.
Maybe a message like "Additional output generated: someplot.png, metrics.json" would be nice in that case?
And, if we go this way, the same argument I mentioned of colliding with stage name applies here.
yes, but it's a matter of defining rules (first check stage name, second output or something like this). I would also be surprised (and it's bad design) if stage and output that are not related to each other have the same name.
Yes. As long as there is a clear hierarchy of rules. It should be fine. I would say look for a stage name first, then look for a filename. At least I would love dvc repro <stagename>
to work without the :
.
Another tiny thing. I am not sure how the plot files works, but I got a little surprised by the number of files I saw the first time I did a git status
. Here is what I saw:
I know there is _only_ 3 plot files, but still I got a little surprised. Maybe the plots could be combined in one json file?
-p,. --pipeline - deprecate? It does not make sense moving forward to accept .dvc files.
The dvc.yaml
is not a single pipeline, so it still makes sense. It is confusing when dvc pipeline -p dvc.yaml
is run, as it will try to run every pipeline of every stage inside the dvc.yaml
.
Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.
Can you please elaborate on this? We are keeping backward compatibility, so what do you mean by deprecate?
dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help.
We need to decide what should we do for dvc pipeline show
from options that we have:
--ascii
)dvc.yaml
./cc @shcheklein
The dvc.yaml is not a single pipeline, so it still makes sense. It is confusing when dvc pipeline -p dvc.yaml is run, as it will try to run every pipeline of every stage inside the dvc.yaml.
could you please elaborate a bit? not sure I completely understood your point.
so what do you mean by deprecate?
If we can detect new/old project - don't allow to run this if it's a new one? I guess what I want to simplify the logic eventually in terms of what do we accept into dvc repro
. I hope that we'll be able to accept an output path dvc repro model.pkl
, stage name (resolves to the dvc.yaml
in the current dir) and something advanced like - path to yaml + stage name, or muliple stages, multiple outputs (not sure about this).
But would love to start removing stuff that does not make much sense from the new version perpespective.
We need to decide what should we do for dvc pipeline show from options that we have
we have a separate ticket(s) for changing semantics of the dvc pipeline show
- e.g. #3661 , especially the problem that it's broken in a way right now - https://github.com/iterative/dvc/issues/3661#issuecomment-617886626
I would at say as a minimum for the DVC 1.0 release we can just keep the same semantics and accept exactly the same stuff the dvc repro
accepts. In case when no arguments provided, it means dvc.yaml:*
I guess? (it'll potentially solve that issue I mentioned btw).
@shcheklein, I have extracted two of the issues from here. And, pipeline show
has already an open issue. The rest are already done.
Closing this one...
Most helpful comment
Maybe a message like "Additional output generated: someplot.png, metrics.json" would be nice in that case?
Yes. As long as there is a clear hierarchy of rules. It should be fine. I would say look for a stage name first, then look for a filename. At least I would love
dvc repro <stagename>
to work without the:
.