dvc metrics diff

Created on 22 Dec 2019 · 13 Comments · Source: iterative/dvc

Today we can track metrics, but they become much more valuable when you can see differences or improvements over time (across commits or branches). A new dvc metrics diff command is needed.

$ dvc metrics diff HEAD^^
        metr.json:
        {
            "top1-error": 0.0385,
            "top5-error": 0.039221
        }

Open question: What should we do about metrics that are not floats or integers? Let's not support them (just ignore them). Any other ideas?

Note, we should deal with float formatting carefully - we don't want to see diff values like 0.0001624000000000000001.
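One way to avoid noisy values like the one above is to subtract in decimal rather than binary floating point. The following is only a minimal sketch of that idea, not how dvc implements it; the helper name `format_diff` is hypothetical, and new minus old is an assumed sign convention.

```python
from decimal import Decimal


def format_diff(old, new):
    """Return new - old as a fixed-point string without binary float noise.

    Hypothetical helper for illustration. In plain floats, 0.3 - 0.1
    evaluates to 0.19999999999999998; going through Decimal(str(...))
    subtracts the values as they were written in the metrics file.
    """
    delta = Decimal(str(new)) - Decimal(str(old))
    # "f" forces fixed-point notation, so small values are not shown as 1.76E-5.
    return format(delta, "f")
```

With the values from this issue, `format_diff(0.0385, 0.0384824)` yields the string "-0.0000176".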

Also, an easy-to-parse output option is required:

$ dvc metrics diff HEAD^^ --to-json
[
    {
        "file": "metr.json",
        "changed": {
            "top1-error": {
                "old": 0.0385,
                "new": 0.0384824,
                "diff": 0.0000176
            }
        },
        "removed": {
            "top5-error": {
                "old": 0.039221
            }
        },
        "added": {
            "loss": {
                "new": 0.0384824
            }
        }
    }
]
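The changed/removed/added layout proposed above can be sketched in a few lines of Python. This is an illustration under assumptions, not the dvc implementation: `metrics_diff` and `is_number` are hypothetical names, non-numeric values are simply skipped (as suggested in the open question), and "diff" is assumed to mean new minus old, so a decrease comes out negative.

```python
from decimal import Decimal
from numbers import Real


def is_number(value):
    """True for ints and floats; bool is excluded even though it subclasses int."""
    return isinstance(value, Real) and not isinstance(value, bool)


def metrics_diff(old, new):
    """Compare two parsed metrics files (dicts) key by key, ignoring key order."""
    result = {"changed": {}, "removed": {}, "added": {}}
    for key in sorted(set(old) | set(new)):
        old_value, new_value = old.get(key), new.get(key)
        if key not in new:
            if is_number(old_value):
                result["removed"][key] = {"old": old_value}
        elif key not in old:
            if is_number(new_value):
                result["added"][key] = {"new": new_value}
        elif is_number(old_value) and is_number(new_value) and old_value != new_value:
            # Assumed convention: diff = new - old, computed in decimal
            # to avoid floating-point noise in the reported value.
            delta = Decimal(str(new_value)) - Decimal(str(old_value))
            result["changed"][key] = {
                "old": old_value,
                "new": new_value,
                "diff": float(delta),
            }
    return result
```

Because the comparison works on parsed dictionaries, reordering keys in the metrics file has no effect on the result.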

feature request p2-medium product research

dmpetrov

👍2

Most helpful comment

@jorgeorpinel I was expecting some follow-up requirements on this one, but it looks like everything mentioned in this ticket is already implemented. So let's close it for now. Thanks for the heads up!

p.s. I know I'm late to this party but it seems more like a dvc metrics delta to me. diff typically shows the base value removed and then the current value added (doesn't compute an arithmetic difference).

But this diff is a metrics diff, so a direct comparison with git diff is not quite correct here. Plus, git diff shows before and after without the difference simply because it is working with strings, whereas we, in dvc, are able to tell if something is a number. The API part is actually closer to git diff, as it has both old and new, but CLI users don't want to see the old value, they just want to see the change. I'm sure there will be some follow-ups here after we receive some feedback.

efiop on 29 Jan 2020

👍2

All 13 comments

A workaround would be pretty trivial though: dvc get the metrics file from HEAD^^, dvc get it from the current HEAD, and run plain old diff on the two files.

efiop on 22 Dec 2019

👍1

@efiop yeap! Plain diff won't work, unfortunately - the order of metrics might easily change and you'd see a total mess. Also, we need nice-looking diff numbers, not old and new values.

But agree, we just need a nice looking shortcut for that.

dmpetrov on 23 Dec 2019
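The ordering problem mentioned above can be sidestepped by canonicalizing both files before running a plain diff. The sketch below assumes JSON metrics files and uses hypothetical helper names; note that it still only shows old and new lines, not the numeric difference asked for in this issue.

```python
import difflib
import json


def canonical_lines(text):
    """Re-serialize JSON with sorted keys so key order cannot affect the diff."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()


def text_diff(old_text, new_text):
    """Unified diff of two JSON documents after canonicalization."""
    return list(
        difflib.unified_diff(
            canonical_lines(old_text),
            canonical_lines(new_text),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

Two files that differ only in key order produce an empty diff with this approach.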

Are you guys considering every stage of the pipeline in the specs?
Something like this:

{
    "train": {
        "train_time": "3d 8h 23m 15s",
        "memory_consume": "8Gb"
    },
    "eval": {
        "inference_time": 0.001,
        "memory_consume": "124Mb",
        "top1-error": 0.0385,
        "top5-error": 0.039221
    }
}

DavidGOrtega on 26 Dec 2019

👀1 👍1
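If a per-stage layout like the one in the comment above were supported, one option would be to flatten it into dotted keys and keep only the numeric leaves before diffing. This is a sketch under assumptions: `flatten_numeric` is a hypothetical helper, and the "." separator is an arbitrary choice.

```python
from numbers import Real


def flatten_numeric(metrics, prefix=""):
    """Flatten nested dicts into {"stage.metric": value}, keeping numbers only."""
    flat = {}
    for key, value in metrics.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_numeric(value, path))
        elif isinstance(value, Real) and not isinstance(value, bool):
            flat[path] = value
        # Strings such as "3d 8h 23m 15s" or "8Gb" are ignored here.
    return flat
```

Applied to the example above, only eval.inference_time, eval.top1-error and eval.top5-error would survive; the time and memory strings would need custom handling.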

@dmpetrov

Open question: What should we do about metrics that are not floats or integers? Let's not support them (just ignore them). Any other ideas?

For starters, I would not support them, but I think at some point we could add functionality that lets the user define how to calculate the difference between particular metrics by executing some custom-defined method. That could resemble how we are dealing with summon right now.
That sounds kind of complicated for now.

Question from me:
Do we want to support diff only for 2 revs ("old" and "new")? Or will we want to support rev ranges at some point (e.g. dvc metrics diff HEAD~10 for observing changes during the last 10 iterations)? If so, the notion of "old", "new" and "diff" might have to be changed.

pared on 30 Dec 2019

❤1

That could resemble how we are dealing with summon right now.

👍 I have the same thoughts - we need a "custom" metrics file to support multiple formats (our basic json + csv are not enough). Probably, different types of metrics can be represented as separate summoning-objects (not necessarily metrics inside a summoning object).

Metrics are a bit more complicated than just numbers. There are numbers, numbers with confidence levels, confusion matrices, and so on. It is not easy to find a single way of dealing with them. Even for numeric metrics, it is important to know whether you are minimizing or maximizing them (is +013 good or bad?) in order to visualize them properly and find the "best one".

Do we want to support diff only for 2 revs

Only two. The same idea as git diff.

dmpetrov on 2 Jan 2020

👍1
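The point about minimizing versus maximizing can be made concrete with a small sketch. Here `goal` is a hypothetical per-metric setting ("min" or "max"); nothing in this thread defines where such a setting would live.

```python
def is_improvement(diff, goal):
    """Interpret a numeric change given whether the metric should go down or up.

    `goal` is a hypothetical per-metric setting: "min" for metrics such as
    error or loss, "max" for metrics such as accuracy.
    """
    if goal not in ("min", "max"):
        raise ValueError(f"goal must be 'min' or 'max', got {goal!r}")
    if diff == 0:
        return False  # no change is not counted as an improvement
    return diff < 0 if goal == "min" else diff > 0
```

The same positive change is an improvement for a "max" metric and a regression for a "min" metric, which is why the raw diff alone cannot be colored or ranked.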

Question from me:
Do we want to support diff only for 2 revs ("old" and "new")? Or will we want to support rev ranges at some point (e.g. dvc metrics diff HEAD~10 for observing changes during the last 10 iterations)? If so, the notion of "old", "new" and "diff" might have to be changed.

One of the best features would be observability. One of the best features of dvc is having all the metrics all together. I would vote for having that somehow.

DavidGOrtega on 2 Jan 2020

👍1

@DavidGOrtega @dmpetrov Well, that seems like a conflict between "behave as git" and "this might be useful in the ML use case". I guess we can always think about implementing another new metrics command (a --range flag for diff could also be an option, but I think it would not make much sense if we are going to stick to the old/new notion for diff).

pared on 3 Jan 2020

@DavidGOrtega I'd appreciate it if you could provide a use case (with a command example) where it would be helpful (and not easily replicable with multiple dvc diff).

dmpetrov on 3 Jan 2020

@dmpetrov is your question regarding my question:

Are you guys contemplating in the specs every stage of the pipeline?

I'm only asking, without any intention of pushing for it; it was just to get a better picture.

If your question is why "all together" observability would be one of the best features: it gives a general overview of the experiment. Conceptually, an experiment can have many permutations of data, parameters and even implementation, but in the end what should matter is how it performs according to the metrics that you want to measure.
Comparing only the last two runs does not tell you that you have the best performance, even if the latest run improved on the previous one.

So yes, you could do multiple diffs, but for an experiment with many trials that would be difficult to handle.

DavidGOrtega on 3 Jan 2020

Hi. Is this closed by #3051? (docs on their way too: iterative/dvc.org/pull/933)

p.s. I know I'm late to this party but it seems more like a dvc metrics delta to me. diff typically shows the base value removed and then the current value added (doesn't compute an arithmetic difference).

jorgeorpinel on 27 Jan 2020

👍1

A workaround would be pretty trivial...
agree, we just need a nice looking shortcut

Please note that we already provided a short script for this same workaround some time ago in https://github.com/iterative/dvc/issues/770#issuecomment-512693256 ! @efiop @dmpetrov

Even for numeric metrics, it is important to know whether you are minimizing or maximizing them (is +013 good or bad?)

Agree. For certain numeric scales and ranges a simple B-A calculation may be meaningless. For these cases, maybe add --min or --max flags (or --compare=min/max/etc) so it just tells you which version has the best metric (and its value)?

jorgeorpinel on 27 Jan 2020

👍1
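The --min/--max suggestion above (flags proposed in this thread, not existing ones) amounts to picking the best value across revisions rather than subtracting two of them. A minimal sketch, assuming the metric values per revision have already been collected into a dict; `best_revision` is a hypothetical name.

```python
def best_revision(values_by_rev, mode):
    """Return (revision, value) for the best metric value across revisions.

    `values_by_rev` maps a revision name to one numeric metric value;
    `mode` is "min" or "max", mirroring the flags proposed above.
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    if not values_by_rev:
        raise ValueError("no revisions to compare")
    pick = min if mode == "min" else max
    rev = pick(values_by_rev, key=values_by_rev.get)
    return rev, values_by_rev[rev]
```

This also covers the earlier observation that the latest run improving on the previous one does not make it the best run overall.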

it seems more like a dvc metrics delta to me. diff typically shows the base value removed and then the current value added (doesn't compute an arithmetic difference).

@jorgeorpinel Re the ‘git diff’ - that is a limitation of Git, which just cannot quantify the difference. If they could, they would show numbers instead of a line-by-line diff 😀

But really, if a numeric diff does not work for dvc metrics, it means the metrics were not defined properly. We need to introduce stricter requirements for metric files.

dmpetrov on 3 Feb 2020

👍1
