Today, we can track metrics, but metrics become much more valuable when you can see differences/improvements over time (across commits/branches). A new dvc metrics diff command is needed.
$ dvc metrics diff HEAD^^
metr.json:
{
"top1-error": 0.0385,
"top5-error": 0.039221
}
Open question: What should we do about metrics that are not floats/integers? Let's not support them (ignore them). Any other ideas?
Note, we should deal with float formatting carefully - we don't want to see diff values like 0.0001624000000000000001.
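A minimal sketch of one way to avoid such noisy values, assuming the difference is taken as new minus old (the thread does not fix a sign convention). The helper names metric_delta and format_delta are hypothetical and not part of dvc. The idea is to subtract the shortest decimal representations of the two floats instead of the binary floats themselves:

```python
from decimal import Decimal


def metric_delta(old: float, new: float) -> Decimal:
    """Return new - old, computed in decimal arithmetic (hypothetical helper)."""
    # repr() yields the shortest text that round-trips the float, so the
    # subtraction is done on the digits a user actually sees in the file.
    return Decimal(repr(new)) - Decimal(repr(old))


def format_delta(delta: Decimal) -> str:
    """Render the delta in fixed-point notation, without an exponent."""
    return format(delta, "f")


if __name__ == "__main__":
    print(format_delta(metric_delta(0.0385, 0.0384824)))
```

Whether dvc should round to a fixed number of digits instead is a separate design choice; this only shows that the long-tail digits come from binary subtraction rather than from the metrics themselves.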
Also, an easy-to-parse output option is required:
$ dvc metrics diff HEAD^^ --to-json
[
  {
    "file": "metr.json",
    "changed": {
      "top1-error": {
        "old": 0.0385,
        "new": 0.0384824,
        "diff": 0.0000176
      }
    },
    "removed": {
      "top5-error": {
        "old": 0.039221
      }
    },
    "added": {
      "loss": {
        "new": 0.0384824
      }
    }
  }
]
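A rough sketch of how the "changed" / "removed" / "added" sections could be computed from two already-loaded metric dictionaries. This is an illustration under assumptions, not the dvc implementation: diff_metrics is a hypothetical name, "diff" is assumed to mean new minus old, and non-numeric values are simply skipped, as suggested in the open question above.

```python
from decimal import Decimal
from numbers import Real


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def diff_metrics(old: dict, new: dict) -> dict:
    """Compare two flat metric dicts key by key (hypothetical helper)."""
    result = {"changed": {}, "removed": {}, "added": {}}
    # Iterating over the union of keys makes the result independent of
    # the order in which metrics appear in either file.
    for name in sorted(old.keys() | new.keys()):
        old_val, new_val = old.get(name), new.get(name)
        if name not in new:
            if _is_number(old_val):
                result["removed"][name] = {"old": old_val}
        elif name not in old:
            if _is_number(new_val):
                result["added"][name] = {"new": new_val}
        elif _is_number(old_val) and _is_number(new_val) and old_val != new_val:
            # Subtract decimal representations to avoid float noise.
            delta = Decimal(repr(new_val)) - Decimal(repr(old_val))
            result["changed"][name] = {
                "old": old_val,
                "new": new_val,
                "diff": float(delta),
            }
    return result


if __name__ == "__main__":
    import json

    old = {"top1-error": 0.0385, "top5-error": 0.039221}
    new = {"top1-error": 0.0384824, "loss": 0.0384824}
    print(json.dumps(diff_metrics(old, new), indent=2))
```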
A workaround would be pretty trivial though: dvc get from HEAD^^, dvc get from current HEAD, and run plain old diff on it.
@efiop yeap! Plain diff won't work, unfortunately - the order of metrics might easily change and you would see a total mess. Also, we need nice-looking diff numbers, not just old and new values.
But agree, we just need a nice looking shortcut for that.
Are you guys contemplating in the specs every stage of the pipeline?
Something like this:
{
"train": {
"train_time": "3d 8h 23m 15s",
"memory_consume": "8Gb"
},
"eval": {
"inference_time": 0.001,
"memory_consume": "124Mb",
"top1-error": 0.0385,
"top5-error": 0.039221
}
}
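If metrics were nested per stage like this, one possible way to compare them key by key would be to flatten the structure into dotted names first. The sketch below is only an illustration; flatten_metrics and the "stage.metric" naming are assumptions, not something dvc defines.

```python
def flatten_metrics(metrics: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {"stage.metric": value} (hypothetical helper)."""
    flat = {}
    for key, value in metrics.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into per-stage sections such as "train" or "eval".
            flat.update(flatten_metrics(value, name))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    nested = {
        "train": {"train_time": "3d 8h 23m 15s", "memory_consume": "8Gb"},
        "eval": {"inference_time": 0.001, "top1-error": 0.0385},
    }
    for name, value in flatten_metrics(nested).items():
        print(name, value)
```

String values such as "8Gb" would still need their own handling, since they are not plain numbers.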
@dmpetrov
Open question: What should we do about metrics that are not floats/integers? Let's not support them (ignore them). Any other ideas?
For starters, I would not support them, but I think at some point we could add functionality allowing the user to define how to calculate the difference between particular metrics by executing some custom-defined method. That could resemble how we are dealing with summon right now.
It sounds kind of complicated for now.
Question from me:
Do we want to support diff only for 2 revs ("old" and "new")? Or will we want to support revs ranges at some point (eg dvc metrics diff HEAD~10 for observing changes during last 10 iterations)? If so, the notion of "old", "new" and "diff" might have to be changed.
That could resemble how we are dealing with summon right now.
👍 I have the same thoughts - we need a "custom" metrics file to support multiple formats (our basic json + csv are not enough). Probably, different types of metrics can be represented as separate summoning-objects (not necessarily metrics inside a summoning object).
Metrics are a bit more complicated than just numbers. There are numbers, numbers with confidence level, confusion matrices and so on. It is not easy to find a single way of dealing with them. Even for numbers metrics, it is important to know if you are minimizing or maximizing it (is +013 good or bad?) to visualize properly and find the "best one".
Do we want to support diff only for 2 revs
Only two. The same idea as git diff.
Question from me:
Do we want to support diff only for 2 revs ("old" and "new")? Or will we want to support revs ranges at some point (eg dvc metrics diff HEAD~10 for observing changes during last 10 iterations)? If so, the notion of "old", "new" and "diff" might have to be changed.
One of the best features would be observability. One of the best features of dvc is having all the metrics all together. I would vote for having that somehow.
@DavidGOrtega @dmpetrov Well, that seems like a conflict: "behave as git" vs "this might be useful in ML use-case". I guess we can always think about implementing another new metrics command (a --range flag for diff could also be an option, but I think it would not make too much sense if we are going to stick to the old, new notion for diff).
@DavidGOrtega I'd appreciate it if you could provide a use case (with a command example) where it would be helpful (and not easily replicable with multiple dvc diff).
@dmpetrov is your question regarding my question:
Are you guys contemplating in the specs every stage of the pipeline?
I'm only asking, without any intention of pushing in that direction; it was just to get a better picture.
If your question is why "all together" observability would be one of the best features: it gives a general overview of the experiment. Conceptually, an experiment can have many permutations of data, parameters and even implementation, but in the end what should matter is how it performs according to the metrics that you want to measure.
Comparing only the last two runs may not tell you that you have the best performance, even if the latest one has improved on the previous one.
So yes, you could do multiple diffs, but for an experiment with many trials that would be difficult to handle.
Hi. Is this closed by #3051? (docs on their way too: iterative/dvc.org/pull/933)
p.s. I know I'm late to this party but it seems more like a dvc metrics delta to me. diff typically shows the base value removed and then the current value added (doesn't compute an arithmetic difference).
A workaround would be pretty trivial...
agree, we just need a nice looking shortcut
Please note that we already provided a short script for this same workaround some time ago in https://github.com/iterative/dvc/issues/770#issuecomment-512693256 ! @efiop @dmpetrov
Even for numbers metrics, it is important to know if you are minimizing or maximizing it (is +013 good or bad?)
Agree. For certain numeric scales and ranges a simple B-A calculation may yield no meaning. For these cases maybe add --min or --max flags (or --compare=min/max/etc) so it just tells you which version has the best metric (and its value)?
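To make the suggestion concrete, here is a small sketch of what such a min/max comparison could compute. The flags above are only a proposal in this thread, and best_revision is a hypothetical helper, not an existing dvc API; it assumes the metric values per revision have already been collected into a dict.

```python
def best_revision(values: dict, goal: str = "min") -> tuple:
    """Return (revision, value) with the best metric for the given goal."""
    if goal not in ("min", "max"):
        raise ValueError("goal must be 'min' or 'max'")
    if not values:
        raise ValueError("no metric values to compare")
    pick = min if goal == "min" else max
    # Compare revisions by their metric value, not by their name.
    rev = pick(values, key=values.get)
    return rev, values[rev]


if __name__ == "__main__":
    top1_error = {"HEAD^^": 0.0385, "HEAD": 0.0384824}
    print(best_revision(top1_error, goal="min"))
```

For an error metric the goal would be "min", for an accuracy metric "max"; the tool cannot infer this from the number alone, which is the point raised in the quoted comment.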
@jorgeorpinel I was expecting some follow up requirements on this one, but looks like everything mentioned in this ticket is already implemented. So let's close it for now. Thanks for the heads up!
p.s. I know I'm late to this party but it seems more like a dvc metrics delta to me. diff typically shows the base value removed and then the current value added (doesn't compute an arithmetic difference).
But this diff is a metrics diff; direct comparison with git diff is not quite correct here. Plus, git diff shows before and after without the difference simply because it is working with strings, and we, in dvc, have the ability to tell if something is a number. The API part is actually closer to git diff, as it has both old and new, but CLI users don't want to see the old value, they just want to see the change. I'm sure there will be some followups here, after we receive some feedback.
it seems more like a dvc metrics delta to me. diff typically shows the base value removed and then the current value added (doesn't compute an arithmetic difference).
@jorgeorpinel Re the 'git diff' - it is a limitation of Git, which just cannot quantify the difference. If they could, they would show numbers instead of a line-by-line diff 😀
But really, if a number diff does not work for dvc metrics, it means the metrics were not defined properly. We need to introduce stricter requirements for metric files.