Dvc.org: metrics/plots: clarify when/why to cache metrics/plots files or not (Git vs DVC tracking)

Created on 3 Sep 2020 · 8 Comments · Source: iterative/dvc.org

UPDATE: Jump to https://github.com/iterative/dvc.org/issues/1755#issuecomment-690321804

In this section:

https://dvc.org/doc/start/experiments#collecting-metrics

It says to add the metrics output files to the git repository:

git add scores.json prc.json

I initially assumed you were not supposed to track these files in git, and later saw this confirmed here (https://discuss.dvc.org/t/dvc-metrics-diff-on-metrics-stored-in-dvc-cache/386/2). I only tried to do this because I was getting "diff not supported" when trying to run dvc metrics diff.

Anyways, I attempted to add the metric files to git, based on this documentation and also as a troubleshooting effort, and got the error:

ERROR: failed to reproduce 'dvc.yaml':  output 'metrics/dmod1.json' is already tracked by SCM (e.g. Git).
    You can remove it from Git, then add to DVC.
        To stop tracking from Git:
            git rm -r --cached 'metrics/dmod1.json'
            git commit -m "stop tracking metrics/dmod1.json"

Which seems to be expected.
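
For context, the route suggested by that error message only removes the file from Git's index; the file itself stays in the workspace, so DVC can take over tracking it afterwards. A minimal sketch in a throwaway repository, assuming only that git is installed (the file content and the commit identity are placeholders, not from the real project):

```shell
# Throwaway repo so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholder metrics file, committed to Git as the tutorial suggests.
mkdir metrics
echo '{"placeholder": 0}' > metrics/dmod1.json
git add metrics/dmod1.json
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "track metrics/dmod1.json in Git"

# The two commands from the error message: drop the file from Git's index only.
git rm -r -q --cached metrics/dmod1.json
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "stop tracking metrics/dmod1.json"

ls metrics/dmod1.json            # the file is still in the workspace
git ls-files metrics/dmod1.json  # prints nothing: Git no longer tracks it
```

After this, the output is free to be cached by DVC the next time the pipeline is reproduced.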

bobertlo

All 8 comments

So, I found the cache: False option in the metrics documentation. It looks like this is covered in the dvc run invocation in the getting started guide, but I still feel like it is less than clear in the documentation. This seems like a really important piece of information to at least mention in the tutorial.
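
For reference, a minimal sketch of what that setting looks like in dvc.yaml, written out with a heredoc so it can be pasted into a terminal. Only scores.json and prc.json come from this thread; the stage name, command and dependencies are placeholders assumed for illustration.

```shell
# Write an illustrative dvc.yaml into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/dvc.yaml" <<'EOF'
stages:
  evaluate:                      # placeholder stage name
    cmd: python src/evaluate.py model.pkl scores.json prc.json
    deps:
      - model.pkl
      - src/evaluate.py
    metrics:
      - scores.json:
          cache: false           # not cached by DVC, so Git can track it
    plots:
      - prc.json:
          cache: false
EOF
cat "$workdir/dvc.yaml"
```

Without the cache: false lines, both files would be treated as regular cached outputs, and adding them to Git leads to the "already tracked by SCM" error quoted above.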

bobertlo on 3 Sep 2020

You can choose whether metrics/plots files are tracked by Git or DVC (cache field in the ...). But since they're usually small text files, tracking them with Git is recommended. dvc metrics/plots commands work on any metrics/plots files found in the workspace, regardless of which tool tracks them.

But yes, this is a common confusion. The docs could do a better job of explaining this, but it may be a problem in the design of DVC core. @bobertlo do you envision a better way these 2 options (caching metrics/plots files or not) could be handled?

jorgeorpinel on 8 Sep 2020

@jorgeorpinel Just to describe the process of my initial misunderstanding:

DVC tracking is the 'default' way to store the metrics (in the sense of not adding extra config options/flags) but https://dvc.org/doc/start/experiments adds an argument to track the files with git. I did the tutorial, then added metrics to my dvc.yaml manually, and did not realize this distinction, since I was pretty much working from memory.

One immediate solution could be to use the 'default' behavior in the tutorial (and maybe add a drop down mentioning the difference?) I just don't see how it could be described more clearly but there could be a good way to do the opposite.

Maybe then a more detailed description could be given in the command reference? I see that it is mentioned below one of the figures, but maybe a short section could just state that there are two ways to track the metrics, give a reason to recommend git, and mention both the dvc run flag and the cache: False config setting.

If tracking with git is recommended, it is unfortunate that it is the method that requires more special intervention.

I'd be happy to try to draft some of these changes if there is some agreement with any of these ideas?

bobertlo on 9 Sep 2020

I don't think it's super important to change the method used in the tutorial, but I think it would be really helpful if the difference was mentioned once somewhere outside of the drop down.

bobertlo on 9 Sep 2020

DVC tracking is the 'default' way to store the metrics... but https://dvc.org/doc/start/experiments adds an argument to track the files with git

Hmmm yeah I guess cache: true is the default, good point! Also on the inconsistency with dvc exps. With run though, you have to choose to specify either -m or -M, but still, maybe it feels like the lower case is the more common option that "should" be used typically.
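
To make the -m/-M distinction concrete, here is a rough sketch of a dvc run call that keeps both files out of the DVC cache. The stage name, script path and dependency are placeholders assumed for illustration; only scores.json and prc.json come from this thread. The call is wrapped in a function so that nothing runs until it is invoked inside an initialized Git + DVC project.

```shell
# Sketch only: stage name, script path and dependency are placeholders.
# -M (--metrics-no-cache) and --plots-no-cache record `cache: false` in
# dvc.yaml, leaving scores.json and prc.json for Git to track.
# The lowercase -m (--metrics) and --plots would have DVC cache them instead.
add_evaluate_stage() {
  dvc run -n evaluate \
    -d src/evaluate.py -d model.pkl \
    -M scores.json \
    --plots-no-cache prc.json \
    python src/evaluate.py model.pkl scores.json prc.json
}
```

After calling add_evaluate_stage in a real project, git add scores.json prc.json should work without the "already tracked by SCM" error. Newer DVC releases replace dvc run with dvc stage add, which as far as I know accepts the same options.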

We should probably open an issue on the core repo to report this, would you be able to do so @bobertlo? A feature request to make non-cached metrics/plots the default (and possibly suggest better option names for run).

Maybe then a more detailed description could be given in the command reference? ...

Which command reference are we talking about? That's something we can definitely address here but do you mean in metrics? in run? Somewhere else? Please link 🙂

jorgeorpinel on 9 Sep 2020

Hmmm yeah I guess cache: true is the default, good point! Also on the inconsistency with dvc exps. With run though, you have to choose to specify either -m or -M, but still, maybe it feels like the lower case is the more common option that "should" be used typically.

Ah, I didn't realize there was the -m/-M distinction, I don't use dvc run to create my pipelines and I was basically going from memory on this. I did not think to review that documentation.

We should probably open an issue on the core repo to report this, would you be able to do so @bobertlo? A feature request to make non-cached metrics/plots the default (and possibly suggest better option names for run).

Sure, I could look at options for this and bring it up there. I'm looking into this a little more though, and changing this value would definitely be a breaking change. Maybe this could be done on the next major release, but even then it is questionable?

Which command reference are we talking about? That's something we can definitely address here but do you mean in metrics? in run? Somewhere else? Please link 🙂

Wow, good question. I was just thinking metrics but I'm not sure about that now, since it applies to metrics, plots, and run. It is mentioned in run, but I'm not sure if a meaningful explanation duplicated across plots/metrics would help, and there doesn't look to be much space in the Get Started guide.

I wonder if a little overview user guide might be appropriate for plots/metrics, one that could unload some of the hidden text in the Get Started section and cover things like this? I think it could both de-duplicate content in other sections and also expand a little on some topics.

Just some ideas!

bobertlo on 9 Sep 2020

I thought a little more this morning and I guess the core problem was that I broke something and I understood that there was a mismatch, but wasn't sure what I did wrong to get there. I went back to the Get Started guide and also the metrics command reference and had a hard time trying to find the needed information (while I see now cache: False is in the Get Started guide, but somewhat hidden).

Maybe the simplest solution would be a little section in troubleshooting to link to from the error message? I suppose if I just did what the error message told me to do it would have all worked, but I was concerned about what I did wrong and why I got there. Why was it not allowing me to track the metrics in git? Am I not supposed to track them? The tutorial said to track them? etc.

I think I may be approaching this project from a different mindset than a lot of other users (and keep breaking things) but if someone else types -m instead of -M it's not a great place to leave a user.

EDIT:

So I think this would definitely solve the issue, and we could either:

Add a blurb in the troubleshooting section other errors point to.
Add a little section covering the reasoning behind git/dvc storage of plots and metrics, point all other mentions of this choice (Get Started, metrics, run, plots) to it, and either point the error there or have the troubleshooting section also point to this.

bobertlo on 10 Sep 2020

looking into this a little more though and changing this value would definitely be a breaking change
if someone else types -m instead of -M it's not a great place to leave a user

Doesn't necessarily break things, depends on how it's implemented e.g. rename the -m/M options (and leave the old ones hidden for backward compatibility). In any case it's still a reasonable and well-motivated feature request, once you open it and the team decides on it, we can then discuss release logistics — that's kind of secondary, don't worry about it 🙂

it applies to metrics, plots, and run.

[ ] OK so we should look into updating those 3 refs. I think also the DVC Files and Dirs guide, in fact I think I'd describe this there mostly and briefly mention + link to the explanation from the 3 cmd refs.

use the 'default' behavior in the tutorial (and maybe add a drop down mentioning the difference?)

I see now cache: False is in the Get Started guide, but somewhat hidden

[ ] Indeed. We can also check if that needs to be improved.

attempted to add the metric files to git, and got ERROR: failed to reproduce 'dvc.yaml': output 'metrics/dmod1.json' is already tracked by SCM

section in troubleshooting to link to from the error message?

Not a bad idea either. Also a change for the core repo though, let's include in the feature request as a first step/alternative just in case. Thanks

Why was it not allowing me to track the metrics in git? Am I not supposed to track them? The tutorial said to track them

Great questions; we should make sure they are covered somewhere in the docs.

jorgeorpinel on 12 Sep 2020
