Dvc: 1.0 (not only) testing - comments, finding, issues, etc

Created on 11 May 2020  路  8Comments  路  Source: iterative/dvc

Some notes on the playground, how to reproduce, and what I'm testing for.

Repo to reproduce: here.

  • I'm running testing on a new version of the example-get-started that I built with this updated gen script.

  • The repo itself (that script from above generates) can be downloaded here. Should be enough to reproduce issues below described below. And can be useful for everyone as an easy entry point to try DVC 1.0 on a bit more sophisticated project.

  • This list below might include things not related 100% to the latest changes we introduced in 1.0, but all of them are potentially relevant. E.g., we change pipelines interface and should try scenarios around.

dvc repro behavior and UI/UX

  • [x] I think we should assume dvc.yaml prefix by default. Instead of running a stage like this dvc repro dvc.yaml:prepare, I should be able to run it like this (if dvc.yaml is found in the current dir? or if it is a single one in the project?):
$ dvc repro prepare

3842

  • [x] Related to the previous one. We should probably accept an output as a target. It can solve some problems with autocomplete and improve experience. Related #3743 (especially last comments). (Extracted to #3875)
$ dvc repro model.pkl
  • [x] Repro excessive warning on the default target file name. Either make it info (in general I would say WARNING is a rare thing that we need to show), or don't show again. Instead make a message about the stage that we execute more meaningful (see below). #3822
$ dvc repro
WARNING: assuming default target 'dvc.yaml'.
  • [x] Import stage reusing lock mechanism leads to ugly warnings on every repro (PR: #3823):
$ dvc repro
WARNING: stage: 'data/data.xml.dvc' is locked. Its dependencies are not going to be reproduced.
  • [x] When I run dvc repro I don't really see the stage name being reproduced - we can probably include it w/o within the "Running command" message. It can solve the problem of some WARNING (see above). #3834

  • [x] When you change something and it hits cache it feels that we need to explain a bit on what's going on? show summary of changes to the workspace? (Partially fixed by #3834)

dvc repro options

$ vim scr/train.py (some dummy change)
$ dvc repro --dry`
  • [x] -p,. --pipeline - deprecate? It does not make sense moving forward to accept .dvc files.

    -p still makes sense for running a pipeline of a given stage, even in 1.0.
    Though, repro -p will not work as either dvc file or stage name needs to be specified.

  • [x] Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.

    dvc repro is still backward compatible. And, we cannot detect if the repo is 1.0-one,
    as the repo can have a mixture of old-style dvcfiles and new pipeline files.

  • [x] --help review and fix it. e.g. we don't expect Dvcfile by default anymore. #3831

$ dvc repro --help
DVC-file to reproduce. 'Dvcfile' by default.

Other commands

  • [ ] dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help.

  • [ ] dvc pipeline show output is broken from a lot of different perspectives - see this comment for suggestions and details.

  • [x] We broke the progress bar for the get/import. It is now Downloading 0/1 files vs Downloading 0/36.2M before. Mean that we effectively don't show a progress bar for the case like:

dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

This is not fixed by reverting https://github.com/iterative/dvc/commit/b312895e77138d298ba492315058918fc697902f?w=1

Extracted to #3874

bug p0-critical testing

Most helpful comment

because that specific stage might be generating lots of other stuffs than just model.pkl.

not a problem I believe, I think most people will understand that stage will be running (means generating outputs). Also in majority of cases there will be one - two outputs, one of the metrics - it's totally fine. On the other hand - it's very convenient - I don't care about some stage names, I care about my model.

Maybe a message like "Additional output generated: someplot.png, metrics.json" would be nice in that case?

And, if we go this way, the same argument I mentioned of colliding with stage name applies here.

yes, but it's a matter of defining rules (first check stage name, second output or something like this). I would also be surprised (and it's bad design) if stage and output that are not related to each other have the same name.

Yes. As long as there is a clear hierarchy of rules. It should be fine. I would say look for a stage name first, then look for a filename. At least I would love dvc repro <stagename> to work without the :.

All 8 comments

$ dvc repro prepare

The default is dvc.yaml. The addressing is via :stage_name (i.e. without explicit dvc.yaml). This is ugly, I know. But, this was done so as to have one way of addressing things (on data related commands, the stage name might collide with the file name, eg: dvc checkout foo). I'm fine with just having it addressed without colon on pipeline-related commands though.

$ dvc repro model.pkl

I don't like this, because that specific stage might be generating lots of other stuffs than just model.pkl. And, if we go this way, the same argument I mentioned of colliding with stage name applies here.

When I run dvc repro I don't really see the stage name being reproduced

Good point. That's a bad UX.

--dry does not work at all.

Broken by 18e8f075728

Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.

You mean, not supporting old .dvc files at all bearing a cmd/pipeline stage? I'm okay with deprecation.

--help review and fix it. e.g. we don't expect Dvcfile by default anymore.

Oops.

dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help

The problem is what to show for bare pipeline show. One idea is to show all possible DAGs
(could be too much, could also limit to just dvc.yaml by default) and other is to show all of the stages inside a dvc.yaml file one-by-one.

because that specific stage might be generating lots of other stuffs than just model.pkl.

not a problem I believe, I think most people will understand that stage will be running (means generating outputs). Also in majority of cases there will be one - two outputs, one of the metrics - it's totally fine. On the other hand - it's very convenient - I don't care about some stage names, I care about my model.

And, if we go this way, the same argument I mentioned of colliding with stage name applies here.

yes, but it's a matter of defining rules (first check stage name, second output or something like this). I would also be surprised (and it's bad design) if stage and output that are not related to each other have the same name.

@shcheklein thank you for the great summary!

Also, I'd add that repro is too verbose. I do repro for two unchanged stages without lock:

$ dvc repro
WARNING: assuming default target 'dvc.yaml'.
Stage is cached, skipping.
Output 'users.csv' didn't change. Skipping saving.
Stage is cached, skipping.
Output 'summary.json' doesn't use cache. Skipping saving.
Output 'logs.csv' didn't change. Skipping saving.
Output 'logs' didn't change. Skipping saving.

To track the changes with git, run:

    git add dvc.lock

All this information is overwhelming - it takes time to read and understand that basically nothing happened. What is actually important - two stages hit the run-cache and lock file was generates (this info is not presented directly).

I'd expect something like:

$ dvc repro
Restoring stage 'proc' from run-cache.
Restoring stage 'trainme' from run-cache.
Lock file 'dvc.lock' was generated.

To track the changes with git, run:

    git add dvc.lock

because that specific stage might be generating lots of other stuffs than just model.pkl.

not a problem I believe, I think most people will understand that stage will be running (means generating outputs). Also in majority of cases there will be one - two outputs, one of the metrics - it's totally fine. On the other hand - it's very convenient - I don't care about some stage names, I care about my model.

Maybe a message like "Additional output generated: someplot.png, metrics.json" would be nice in that case?

And, if we go this way, the same argument I mentioned of colliding with stage name applies here.

yes, but it's a matter of defining rules (first check stage name, second output or something like this). I would also be surprised (and it's bad design) if stage and output that are not related to each other have the same name.

Yes. As long as there is a clear hierarchy of rules. It should be fine. I would say look for a stage name first, then look for a filename. At least I would love dvc repro <stagename> to work without the :.

Another tiny thing. I am not sure how the plot files works, but I got a little surprised by the number of files I saw the first time I did a git status. Here is what I saw:
image
I know there is _only_ 3 plot files, but still I got a little surprised. Maybe the plots could be combined in one json file?

-p,. --pipeline - deprecate? It does not make sense moving forward to accept .dvc files.

The dvc.yaml is not a single pipeline, so it still makes sense. It is confusing when dvc pipeline -p dvc.yaml is run, as it will try to run every pipeline of every stage inside the dvc.yaml.

Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.

Can you please elaborate on this? We are keeping backward compatibility, so what do you mean by deprecate?

dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help.

We need to decide what should we do for dvc pipeline show from options that we have:

  1. Show all of the pipeline of each stages one at a time. (in --ascii)
  2. Show all the DAG of the repo
  3. Show all the DAG from stages inside dvc.yaml.

/cc @shcheklein

The dvc.yaml is not a single pipeline, so it still makes sense. It is confusing when dvc pipeline -p dvc.yaml is run, as it will try to run every pipeline of every stage inside the dvc.yaml.

could you please elaborate a bit? not sure I completely understood your point.

so what do you mean by deprecate?

If we can detect new/old project - don't allow to run this if it's a new one? I guess what I want to simplify the logic eventually in terms of what do we accept into dvc repro. I hope that we'll be able to accept an output path dvc repro model.pkl, stage name (resolves to the dvc.yaml in the current dir) and something advanced like - path to yaml + stage name, or muliple stages, multiple outputs (not sure about this).

But would love to start removing stuff that does not make much sense from the new version perpespective.

We need to decide what should we do for dvc pipeline show from options that we have

we have a separate ticket(s) for changing semantics of the dvc pipeline show - e.g. #3661 , especially the problem that it's broken in a way right now - https://github.com/iterative/dvc/issues/3661#issuecomment-617886626

I would at say as a minimum for the DVC 1.0 release we can just keep the same semantics and accept exactly the same stuff the dvc repro accepts. In case when no arguments provided, it means dvc.yaml:* I guess? (it'll potentially solve that issue I mentioned btw).

@shcheklein, I have extracted two of the issues from here. And, pipeline show has already an open issue. The rest are already done.

Closing this one...

Was this page helpful?
0 / 5 - 0 ratings

Related issues

robguinness picture robguinness  路  3Comments

ghost picture ghost  路  3Comments

dnabanita7 picture dnabanita7  路  3Comments

tc-ying picture tc-ying  路  3Comments

mdscruggs picture mdscruggs  路  3Comments