DVC: Metrics caching

Created on 8 Feb 2019 · 16 Comments · Source: iterative/dvc

Hi!
Currently DVC doesn't cache metrics, which means the user has to store them in Git history. However, DVC doesn't suggest adding them to Git.

Maybe it's possible to do something like:

1) When the pipeline finishes, include the metrics file in the suggested git add command (git add <metrics/file/path>). For example:

dvc run -d train.py -M metrics.json -o last_checkpoint.zip python train.py

and the output should be:

To track the changes with git run:

    git add .gitignore last_checkpoint.zip.dvc metrics.json

2) Add an -m option to dvc run that caches metrics (as an alternative to the -o and -O options).

I think the first option is best, because it doesn't break the paradigm "source code in git, outputs in DVC".
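The first option amounts to a small change in how the post-run hint is assembled. A minimal sketch in Python, not DVC's actual implementation; `build_git_add_hint` and its parameters are hypothetical names used only for illustration:

```python
def build_git_add_hint(stage_file, metrics_no_cache=(), gitignore=".gitignore"):
    """Return the hint text listing every file the user should add to Git.

    Hypothetical helper: metrics declared with -M are not cached by DVC,
    so they are appended to the list right after the stage file.
    """
    files = [gitignore, stage_file, *metrics_no_cache]
    return "To track the changes with git run:\n\n    git add " + " ".join(files)


# Mirrors the example command from the proposal.
print(build_git_add_hint("last_checkpoint.zip.dvc", ["metrics.json"]))
```

With no `-M` metrics the list falls back to the .gitignore and the stage file only, so the existing message would stay unchanged in that case.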

Thanks!

enhancement

All 16 comments

@mroutis I'm not sure about "good first issue" though. It requires significantly reworking the logic for dvc metrics show --all-branches. @efiop should have more context on this.

Oh, @shcheklein, I was thinking only about expanding the message to include git add metrics.json. Maybe I rushed a little to label it as a good first issue, indeed. What do you think the approach should be?

I would say, if changing the message (appending the file name to git add) solves enough of the problem for @toodef, then let's just do this, because it's 10x simpler than changing DVC to actually cache the file.

Implementing -m is really easy, we used to have it in the past and all the logic is still there. For example, you could work around it by specifying -o metrics.json in dvc run and then explicitly marking it as a metric with dvc metrics add metrics.json. :)

What was the reason for removing -m, @efiop?

@efiop does it handle metrics show --all-branches in this scenario?

@efiop OK, I will try. But when I run dvc run -m metrics.json -o metrics.json, dvc metrics show -a reports an error saying that metrics.json is marked as an output of the stage.
Is that different from -o metrics.json + metrics add metrics.json?

@mroutis The reason was our discussion with @dmpetrov, where we decided that it is a safer choice to just support -M, so that metrics are stored in git and not in the dvc cache. But, as it turned out, sometimes users want that, so we need to bring it back just to save the hassle :slightly_smiling_face:

@shcheklein It runs dvc checkout when checking out another branch.

@toodef
Ah, right. OK, then try adding metric: true manually to the entry for metrics.json in that dvc file. That should work :) I.e.

cmd: echo OLOLO > metrics
md5: cf910dde1bf5bdeb94723ddbfcc74ecd
outs:
- cache: true
  md5: cbccc79ac9213325a623c38851b01c88
  metric: true
  path: metrics
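The manual edit could also be scripted. A rough sketch, assuming the stage file has already been parsed into a plain dictionary (for example with a YAML library, which is left out here); `mark_as_metric` is a hypothetical helper, not part of DVC:

```python
def mark_as_metric(stage, path):
    """Set `metric: true` on the output entry whose path matches.

    `stage` mirrors the structure of the stage file in the example.
    Returns True if an entry was updated, False if no output matched.
    """
    for out in stage.get("outs", []):
        if out.get("path") == path:
            out["metric"] = True
            return True
    return False


# Stage data copied from the example, before the manual edit.
stage = {
    "cmd": "echo OLOLO > metrics",
    "md5": "cf910dde1bf5bdeb94723ddbfcc74ecd",
    "outs": [
        {"cache": True, "md5": "cbccc79ac9213325a623c38851b01c88", "path": "metrics"}
    ],
}
mark_as_metric(stage, "metrics")
```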

@efiop it's good to know (re dvc checkout), thanks! If we decide to switch to using the git API instead of (git checkout + dvc checkout), will we have a problem then? Doing dvc checkout can take some time, especially if the cache is in copy mode, right?

@efiop When using ls-files, we will be able to access cache files directly, instead of relying on dvc checkout to put them into the workspace.
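For context, the local DVC cache of this era is content-addressed: a file is stored in a directory named after the first two characters of its MD5, with the remaining characters as the file name. A minimal sketch of resolving a cached metric without dvc checkout, assuming the default .dvc/cache location; `cache_path` is a hypothetical helper, not a DVC API:

```python
import os


def cache_path(md5, cache_dir=os.path.join(".dvc", "cache")):
    """Map an output's md5 (as recorded in the stage file) to its cache file."""
    if len(md5) < 3:
        raise ValueError("md5 is too short to split into prefix and file name")
    # First two characters form the directory, the rest is the file name.
    return os.path.join(cache_dir, md5[:2], md5[2:])


# md5 taken from the stage file example earlier in the thread.
print(cache_path("cbccc79ac9213325a623c38851b01c88"))
```

In principle, reading each branch's stage file through Git and then opening the corresponding cache file would leave the workspace untouched, which is the benefit described in the comment.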

@efiop good point, makes total sense. So, what are the action points here? Should we introduce -m then?

@shcheklein We should do both: modify message and add -m. Totally agree with original post from @toodef :slightly_smiling_face:

Now metrics are really cached! Thank you for the very quick implementation!

But dvc metrics show -a shows the same values for both branches:

hnm:
        metrics.json: {"train": {"jaccard": 0.6354309320449829, "dice": 0.7804347276687622, "loss": 0.9194902181625366}, "validation": {"jaccard": 0.6705400347709656, "dice": 0.8039528727531433, "loss": 0.9115020632743835}}
master:
        metrics.json: {"train": {"jaccard": 0.6354309320449829, "dice": 0.7804347276687622, "loss": 0.9194902181625366}, "validation": {"jaccard": 0.6705400347709656, "dice": 0.8039528727531433, "loss": 0.9115020632743835}}

After git checkout hnm && dvc checkout, the output is:

hnm:
        metrics.json: {"train": {"jaccard": 0.6649578809738159, "dice": 0.8037408590316772, "loss": 0.8932506442070007}, "validation": {"jaccard": 0.6277576684951782, "dice": 0.772462785243988, "loss": 0.9959300756454468}}
master:
        metrics.json: {"train": {"jaccard": 0.6649578809738159, "dice": 0.8037408590316772, "loss": 0.8932506442070007}, "validation": {"jaccard": 0.6277576684951782, "dice": 0.772462785243988, "loss": 0.9959300756454468}}

OS: Ubuntu 18.04.2 LTS
DVC: 0.26.0
Git: 2.17.1

Reopened for now, to investigate. 👀 into it now.

Can confirm that dvc metrics show -a returns the same result for all branches. The workaround for now is dvc metrics show -a metrics.json. We will fix the command asap.

Now it works very well! Thanks a lot!
