Dvc: support push/pull/metrics/gc, etc across different commits

Created on 6 Mar 2019 · 70 Comments · Source: iterative/dvc

Currently dvc metrics show can show metric values across different branches (-a) and different tags (-T).
Can you consider supporting showing different metric values across different commits in the same branch?


The background of this is (simplified example): say I'm currently training a model, where I'm changing a certain parameter, param1 (for instance, number of trees in a forest). The way I probably would like to work is to find a first value for param1, commit the current state, continue changing param1 and continue committing the successive states that I consider worth saving. At some point I would like to look back and identify the setup that gave me the best results.

The way DVC currently works forces me to create a new branch/tag for each trial I want to keep track of, and this seems a bit overwhelming.

Depending on how different the experiments I'm running are and their level of granularity I could decide how to keep track of them (new commits VS new branches/tags).

Notes:

  • The example above is overly simplified, and there are better ways of tuning a specific model's parameters. But this gets more complicated if I'm changing more things (model hyperparameters, data processing, features to use, etc).
  • If dvc were to support what I'm proposing here, an extra argument would probably be required to limit how many commits DVC would look back at. Otherwise it would show all the metric values since the beginning of the repo history, which can be unhelpful and messy.
feature request p1-important research

Most helpful comment

To give a new user's perspective on the issue (talking about push/pull really rather than gc), I had assumed that dvc push was equivalent to git push: i.e. you make several local commits then push all of them to a remote. What @pared said basically:

By default, in git, if you want to save something in the repo, you commit it. You make 10 commits and then push, and all 10 commits have been pushed. What dvc currently allows you to do is push the current changes. So we make 10 commits and push dependencies/outputs from only the last one. I believe the default behaviour should be to push all dependencies from all commits that have not yet been pushed. That is the only way to make sure no commit is broken, without demanding that the user periodically push on the branch.

... so the actual behaviour of dvc push caught me by surprise initially. I understand that this is hard from a performance perspective, but from a data integrity point of view, I think it's an important option to have. Particularly for raw data which isn't reproducible from anywhere, the dvc cache is the only place where it exists: if you push in the wrong order then you can end up with lost data.

All 70 comments

@andrethrill Would you like to compare two specific commits or just the dynamics of your metrics changing across a range of commit? The latter one is probably more suitable for a graphical tool, like tensorboard or something. Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?

Hi @efiop !

I'm aware of TensorBoard but that's not exactly what I was talking about.

I would like to have a way of running a few consecutive different experiments and see their metrics. Just like dvc metrics show -a currently does but without needing to create different branches. DVC seems like a good fit for this because I could checkout the experiment that gave me the best results and have everything version controlled (model, data, etc).

Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?

If that were to be supported it would be great of course. But for what I'm talking about, just looking at the output in the same form as in dvc metrics show -a would be enough.

@andrethrill Ah, so something like dvc metrics show HEAD~10 to show metrics for 10 last commits on the current branch?

Exactly @efiop ! And/or some other nice variations of it: dvc metrics show HEAD~{commitHash} show metrics since commitHash on the current branch.

@andrethrill AFAIK HEAD~{commitHash} is not supported by git and it would be great to leave the syntax similar if not the same as in git :) But I get your point, there is probably a git-way to do that. Thanks for a great feature request!

@efiop indeed, I was not thinking from git perspective. The syntax would have to be different :)

@andrethrill @efiop It seems the ability to dvc metrics show only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.

Anyhow, I would find the ability to dvc metrics show only a specific tag or tags useful for some parts of my workflow.

Since the logic behind all the commands is similar, it probably makes sense to implement it for all commands that currently support the -T and -a options.

@andrethrill @efiop It seems the ability to dvc metrics show only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.

What about dvc metrics show for not just an arbitrary tag but an arbitrary commit? The same way the git checkout command allows checking out both an arbitrary tag (git checkout <tag_id>) and an arbitrary commit (git checkout <commit_id>).

The syntax of course would be different from git. Something like dvc metrics show --id <tag/commit>. Additionally you can use the same syntax in other commands, like dvc push --id <tag/commit>.

This approach also solves the issue where you have several local commits and in each commit the single data file tracked by DVC has been overwritten. The current implementation of dvc push only pushes the latest version of your data file, from the last commit. The --id option would allow pushing all the previous versions of your data by executing dvc push --id <commit> for each of the previous commits.

@andrethrill @brbarkley @nik123
  1. I see that we might have a few smaller tasks here. I'll try to identify them:
    - metrics show HEAD~X - which will show metrics from the last X commits
    - metrics show --since {rev} - show metrics since rev
    - metrics show --ids={rev1},{rev2},{rev3} - like show -a but restricted to particular revs
    Do those options make sense?

  2. This is just for metrics; how would you guys see other operations, like push, pull, etc.?
    a) I imagine that, in the case of push for example, one might want to push all dependencies that have been bound to some git revision by stage files. Do you think it would be feasible to include some option for that, like dvc push --all-revs? That could also be used for other operations, like pulling.
    b) Do the points from 1. make sense for operations like push, pull, etc.? Have you ever needed to pull for some range of commits/tags/branches? Or would --all-revs for push be enough?

My 2 cents on this.

  1. Since -a and -T are symmetric across push/pull/gc/metrics-show, we should make a new option symmetric as well. Especially considering that it does not make implementation more complicated.
  2. Using commits is yet another way to manage "experiments". So, it makes sense to provide these options to all the commands that support -a, -T.
  3. CLI-interface-wise: it would be great if we could keep a single option, --revisions or something like that. I would really want to avoid introducing positional arguments, and a few options on top, just to manage this case.
  4. (not scope of this issue, but can affect certain decisions) We'll need to introduce a filter on top -a, -T. Something like regexp to filter branch/tag names. Can we reuse this new option to simplify certain options in this case?

@shcheklein I agree that for simplicity it would be much better to implement --revisions, but looking at the original feature request, supporting something like dvc metrics show HEAD~10 looks like a desired approach too. I think that using revisions to compare the results of the last 10 commits will be a headache for users. One will need to either tag all commits, or run git log and copy-paste revs into the --revisions option.

EDIT: let me clarify that I am talking here specifically about show

I would also like to start another discussion about --revisions option.
As we discussed privately, we came to the conclusion that probably the most convenient way for the user to use --revisions would be to provide comma-separated revision ids, like:

--revisions {rev1},{rev2},{rev3} (note that providing revisions like --revisions {rev1} {rev2} {rev3} is not a viable option, since we would not know when to start parsing targets)

The problem with this approach is that a comma is a valid character in a branch name. So this edge case would break the currently considered approach.

The other way to do that would be a --revision option (--revision {rev1} --revision {rev2} --revision {rev3})

I think we cannot expect users to name branches in a way that would be convenient for us, do you agree?

EDIT: as discussed with @Suor, we do not necessarily need to use a comma as the separator; git forbids some characters in branch names, like colons.
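
To make the separator question concrete, here is a minimal Python sketch of how a colon-joined value could be split. The function name split_revs is hypothetical and is not part of DVC; the sketch assumes plain ref names (it does not try to handle Git's rev:path syntax).

```python
from typing import List


def split_revs(value: str) -> List[str]:
    """Split a value such as 'rev1:rev2:rev3' into individual revisions."""
    # Git does not allow ':' inside a ref name, so it cannot collide with one.
    revs = [rev for rev in value.split(":") if rev]
    if not revs:
        raise ValueError("expected at least one revision")
    return revs


# A comma is legal in a branch name, so 'exp,v2' must survive as one revision.
print(split_revs("master:exp,v2:v1.0"))
```

With a comma as the separator, the branch exp,v2 in this example would have been split into two bogus revisions.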

Possible solution: require revisions to be provided after the targets, which would make parsing multiple targets and multiple revisions possible.
What that could look like:

https://asciinema.org/a/pSVHHQ17uQBwzN2v0BUI8VaSK

We are using rev already, why not --revs?

@Suor I agree, especially since it's short and understandable.

My thinking was - is it possible to derive from the string that is passed to --revs what exactly do we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc . Is there an example in Git cli where it expects a list of revs?

@shcheklein looking through documentation, I think closest example would be using refspec:
https://git-scm.com/book/en/v2/Git-Internals-The-Refspec

What do you mean by what do we want to address? Determining whether it is commit sha, branch or tag?

Yep, either it's a commit, branch, tag or a list of those.

@shcheklein do we actually need to know what it is? AFAIK git checkout accepts any of those.

It seems to me that we need to decide which way to go with the implementation of this feature.

  1. I would leave the discussion "Do we support dvc metrics show HEAD~10" for some other issue, as it is a particular use case that is not related to push/pull etc.

  2. How to implement data sync operations for a few revisions. We have discussed 3 approaches so far:
    a) dvc pull -rev rev1 -rev rev2 -rev rev3 file.dvc
    b) dvc pull file.dvc --revs rev1 rev2
    c) dvc pull --revs=rev1:rev2:rev3 file.dvc

I think we should go with the last one, because it's faster to use than the first one and does not introduce a strong assumption as the second approach does (I mean requiring passing revs after targets).

@pared 1 and 2 are tied together. metrics and pull/push/etc should have (if it is feasible) the same syntax for working with references. Unless we decide to redesign it of course, but I don't see the point of that just yet. I totally agree with you that most users would probably just want the ability to do something with the last N commits or similar, so we need to give that syntax a bit of thought, which might actually change the approach with --revs.

We've discussed that the second approach (the one that is requiring passing revs after targets) is absolutely terrible, just forget about it :slightly_smiling_face:

My thinking was - is it possible to derive from the string that is passed to --revs what exactly do we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc . Is there an example in Git cli where it expects a list of revs?

@shcheklein I agree with @pared, this is a terrible idea: git doesn't distinguish between those, so neither should we, especially just to adopt some joining syntax. I would much rather go with --rev rev1 --rev rev2 and have it deterministic than invent a comma-joining syntax to be able to write --revs rev1,rev2. Though the : looks promising, if git indeed forbids it in tag/branch names. That being said, using colons is not intuitive at all.

I see that we might have a few smaller tasks here. I'll try to identify them:
metrics show HEAD~X - which will show metrics from the last X commits
metrics show --since {rev} - show metrics since rev
metrics show --ids={rev1},{rev2},{rev3} - like show -a but restricted to particular revs
Do those options make sense?

@pared This makes a lot of sense to me from user perspective, but I would probably go with something like

metrics show --from-rev HEAD~X (it is implied that --to-rev is HEAD, same as any git command does)
metrics show --from-rev rev
metrics show --rev rev1 --rev rev2 --rev rev3
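
As an illustration of why a repeated --rev is deterministic while a space-separated list is not, here is a small sketch using Python's standard argparse module. The parser, its program name and the option names are assumptions made for the example; this is not DVC's actual command-line code.

```python
import argparse


def build_parser(repeatable: bool) -> argparse.ArgumentParser:
    """Build a toy 'pull'-like parser with revisions and positional targets."""
    parser = argparse.ArgumentParser(prog="pull-sketch")
    if repeatable:
        # --rev may be given several times; each use appends one value.
        parser.add_argument("--rev", dest="revs", action="append", default=[])
    else:
        # --revs takes a space-separated list and consumes every plain
        # argument that follows it.
        parser.add_argument("--revs", dest="revs", nargs="+", default=[])
    parser.add_argument("targets", nargs="*")
    return parser


greedy = build_parser(repeatable=False).parse_args(
    ["--revs", "rev1", "rev2", "file.dvc"]
)
explicit = build_parser(repeatable=True).parse_args(
    ["--rev", "rev1", "--rev", "rev2", "file.dvc"]
)

print(greedy.revs, greedy.targets)      # file.dvc is swallowed as a revision
print(explicit.revs, explicit.targets)  # file.dvc is kept as a target
```

The space-separated form cannot tell where revisions end and targets begin, which is the parsing problem raised earlier in the thread.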

This is just for metrics; how would you guys see other operations, like push, pull, etc.?
a) I imagine that, in the case of push for example, one might want to push all dependencies that have been bound to some git revision by stage files. Do you think it would be feasible to include some option for that, like dvc push --all-revs? That could also be used for other operations, like pulling.
b) Do the points from 1. make sense for operations like push, pull, etc.? Have you ever needed to pull for some range of commits/tags/branches? Or would --all-revs for push be enough?

I'm not sure --all-revs makes any sense in git world, since it feels like it would include detached heads and stuff, which is generally considered to be trash. That being said, being able to push/pull/etc all history seems very useful to me. If talking in git terms, one would maybe expect dvc push to push everything from the initial commit, but we didn't do that because it might be excessive because of the data size. But maybe users have another opinion? :slightly_smiling_face:

@efiop by saying --all-revs I was thinking about current branch, probably the naming could be improved :)

@efiop --rev rev1 --rev rev2 is just too verbose.

@efiop @pared last N commits or from this commit is not a good idea. Why should I care to count commits or copy paste sha? It should work just like git log does, calculating and showing things as you scroll.

yep, I agree with @Suor . multiple --rev is too verbose. Btw, do we need to support a list of commits at all (back to the ticket itself)? Can we just support the last N commits for now? then we need just a simple option -n 10 or something.

@shcheklein I guess we could, though the --revs approach does not bind us to the current branch as -n does.
Let's imagine we have been working on two branches, iterated a bit, and have a few commits on each branch.

  • with the --revs approach we need to specify, by hand, all the commits that we want to push
  • with -n we would probably have to checkout branch_1 && dvc push -n {X}, then checkout branch_2 && dvc push -n {Y}

Both approaches require us to play around with git log to check which/how many commits we need to push.

How about doing it as git does?
If we produce a few commits on a local branch, we should push all of them. If the user does not need the outputs of some commit, they should squash/fixup that commit, effectively removing its dependencies from history. This way the user does not need to worry that they forgot to dvc push something and the repository is broken at some point, yet they still have the power to remove some intermediate results.

Guys, why are you talking about dvc push -n ...? This seems like a very strange thing to me and nobody asked for it. And we don't need -n for metrics either; as I explained above, dvc metrics show --log is what the topic starter really asked for.

@shcheklein dvc metrics --revs foo,bar,baz might be also useful and is easier to implement, this is why we are discussing it.

@Suor "an extra argument would probably be required to limit how many commits DVC would look back at. Otherwise it would show all the metric values since the beginning of the repo history, which can be unhelpful and messy." - this is exactly the -n I'm talking about. And more or less for the same reason.

I think --log behavior is less convenient here tbh. I want to see a table with just the recent metrics; I don't usually need a full log. --log and -n do not contradict each other. But I would love to see a case for --log first. In particular, we need -n (or a better name, if we come up with one) to support this across different commands like push/pull, etc. Yes, I want to push just a few recent commits, because I forgot to run dvc push recently. How can I do this? It can be and should be symmetric across push/pull/metrics show/gc.

--revs [list] - again, it's too complicated to start with, I don't see a clear analogy, I don't see how it can be actually used. Me specifying a list of git hash commits? No, I doubt that anyone would do this. But again, we can do this when we see more signals.

@pared -a (and -T) options can play very well along -n.

@shcheklein you calc and show metrics that fit on screen, then calc and show more if user scrolls down. What's messy about it? You propose user to guess or count instead - this is the messy one.

push -n still doesn't make sense to me at all. I don't think people need it, nobody ever asked for it and it contradicts to how dvc gc operates.

Scrolling breaks your context (the way it's usually done, the output disappears when you terminate it), scrolling is harder to implement (unless you do an actual fetch for all commits), and scrolling is not what people are asking for in the ticket ... and for a good reason. My intuition is that I just want to see a reasonable table right away. If you want more and want to scroll, just do -n 1000 | less.

The only reason I use the default log with scroll is when I actually need to search for something. But in our case it would mean that we'll have to put together the full table in advance, right?

Is it only push -n that does not make sense, or all other commands?

It looks to me that the author of this issue wanted the log feature, -n is an awkward substitute for it. I agree that -n is easier to implement.

I disagree that scrolling is only needed for searching, I almost never search git log, but I scroll it occasionally. And even for searching one doesn't need to precompute everything.

-n doesn't make sense for push, pull, gc, it only makes sense for metrics.

Okay, I think we need to decide what we are actually focusing on in this issue, and create a new issue accordingly. We talk about supporting data synchronization commands across commits (that's what the title indicates), while the original description is about tracking metrics. That splits the discussion into two threads.

I propose to focus here on the show functionality for the last commits and create a separate issue for dealing with data sync operations, since the original description is about supporting metrics.

So, there are two pieces under discussion.

  1. The CLI interface. What do we want to see? So far I see the following options:
  • -n 10 - simple to implement, matches what the ticket has in mind, feels like we will need it anyway (git log has the same kind of limit), works well with -a and -T, and can easily be extended with a more advanced --log argument.
  • dvc show --log - works like git log. I'm not sure tbh how we are going to implement this if we don't introduce -n with it to limit the number of commits we go back. Or it's not that easy to implement. Also, in my experience I want my table on my screen, not in less.
  • --revs - I doubt anyone is going to specify commits explicitly; it's too much of a burden. Can be implemented if we need it.
  2. Symmetry with data management. Usually, if something breaks symmetry, that is a strong reason to avoid it. Let's imagine the following situation: I have just cloned the repo, the cache is empty, and I want to see some of the latest progress on the current branch. How do I do this? How do I pull/fetch data for the last N commits to see my table of experiments?

We really need this. I will need it even to adjust the example in get started.
It's not a big deal to implement it across all commands at once.

So, I would say, let's go with the first option and symmetrical implementation of it. Don't see any reasons to complicate it in the first iteration.

I agree that --revs should wait for more demand.

However, if we go with the first option, we should not allow mixing -n with -a or -T. This doesn't make sense for metrics show (how would you order those?), and thus there is no need for it in push/pull either.

There is a ready-to-use pager implementation that accepts a generator, which makes implementing dvc metrics show --log comparable in complexity to the -n thing.

@shcheklein I am not a fan of -n option:
By default, in git, if you want to save something in the repo, you commit it. You make 10 commits and then push, and all 10 commits have been pushed. What dvc currently allows you to do is push the current changes. So we make 10 commits and push dependencies/outputs from only the last one. I believe the default behaviour should be to push all dependencies from all commits that have not yet been pushed. That is the only way to make sure no commit is broken, without demanding that the user periodically push on the branch.

Example:

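#!/bin/bash

# Start from a clean slate: a Git+DVC repo and a local directory as the remote.
rm -rf repo storage
mkdir repo storage
cd repo

git init >> /dev/null && dvc init >> /dev/null
dvc remote add -d storage ../storage

# First version of the data: committed and pushed.
echo data >> data
dvc add data >> /dev/null
git add -A
git commit -am "first"
dvc push

# Second version: committed but never pushed.
echo data2 >> data
dvc add data
git add -A
git commit -am "second"

# Third version: committed and pushed.
echo data3 >> data
dvc add data
git add -A
git commit -am "third"
dvc push

# Simulate a fresh clone by dropping the local cache and the workspace copy.
rm -rf .dvc/cache
rm data

git checkout HEAD~1
# no cache found
dvc checkout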
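
The failure in the script above can be reduced to a toy model in Python. This is only a sketch of the reasoning, not DVC code: the hashes h1..h3 and the helper names push_head, push_all and missing are made up for illustration.

```python
# Toy model: each commit records the hash of the data version it needs.
commits = {"first": "h1", "second": "h2", "third": "h3"}
history = ["first", "second", "third"]  # oldest to newest


def push_head(remote: set, head: str) -> None:
    """Mimic the current behaviour: upload only what `head` references."""
    remote.add(commits[head])


def push_all(remote: set, revs: list) -> None:
    """Mimic the proposed behaviour: upload what every listed commit references."""
    for rev in revs:
        remote.add(commits[rev])


def missing(remote: set, revs: list) -> list:
    """Return the commits whose data cannot be restored from the remote."""
    return [rev for rev in revs if commits[rev] not in remote]


remote = set()
push_head(remote, "first")  # push after the first commit
push_head(remote, "third")  # push after the third commit
print(missing(remote, history))  # the "second" commit was never pushed

push_all(remote, history)
print(missing(remote, history))  # nothing is missing any more
```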

@pared We would have to break the current behavior. And it's not only about dvc push. It's also about pull/checkout/fetch/gc. I gave you an example where you have to run pull (not push) to even be able to run metrics show.

The analogy with Git is relevant only to a certain extent. The biggest problem is that when you deal with data, all operations are far more expensive than with code, from every perspective: the number of files to analyze, the potential amount of data to push, etc. It means you need a much higher level of granularity. That is why we don't pull/push all commits by default: it would simply be too expensive. (And that is why we have options like -a and -T to extend the scope.)

@shcheklein still dvc push -n looks way too fragile, any dvc gc -r and whoosh, no data.

I would argue that dvc push/pull -n is not needed for dvc metrics show --log to be useful. The workflow may look like:

  1. Experiment, committing each change along the way.
  2. Use dvc metrics show --log to review.
  3. git tag whatever commits work the best.
  4. Finalize with:
    git push
    dvc push -T

@shcheklein
I have to agree that data needs to be approached differently, and breaking current behaviour is a no-no.

I like the approach proposed by @Suor; that way the user would not have a problem with a situation like "out of the last 10 commits I want to preserve 3, 7 and 9". It still requires some work on the user's side, though. -n seems to have the potential to grow the cache size unnecessarily, just like my idea of pushing all dependencies/outputs.

@Suor I don't see any difference with -a or -T in terms of "fragility" tbh. All those options expand the scope. If we have some concerns about gc -n we can just introduce a prompt (for -a, -T, -n, etc).

The workflow you described is too cumbersome. I have to tag experiments, which is not nice. And there is no way for me to see the stuff after I have cloned the repo. I don't see how --log will work in your case after I clone the repo. Or do you suggest ignoring missing files? Basically returning nothing for all previous commits, until I do dvc pull -aT? Which by itself can be very different from what I want: just to glance at the latest scope of experiments (not all tags, not all branches - those could be too expensive).

And to reiterate - we can implement the --log. I just believe we would still need -n. And we will need it across all commands. So why don't we just start with something simple and add more advanced and tricky (impossible?) to implement stuff later?

@shcheklein gc -n doesn't make sense at all; it is a sliding window destroying data. If you always run gc -aT, at least you won't ever remove tagged and branched data, which makes some sense. And if we ever use push -n we can't ever run dvc gc at all, so there is no point in having those either.

My workflow example is practical and is what author was doing. Yes, you need tags to persist data, this is what you are doing anyway with dvc, because there is no other way and dvc push -n won't add it. What is your proposed workflow?

Still no use for -n that I see. This is a feature nobody asked, it doesn't enable any new non-broken workflow. Why should we bother with it?

I'm not sure I understood the sliding window analogy here. gc -n makes perfect sense to me. First, gc is not destroying data; it cleans only the cache by default. And this is exactly a way for me to say: just keep the data to a certain depth, locally. Similar to git clone --depth.

We are going in circles, and I don't see any other new arguments there, not that much to add here. To clarify the workflow:

  1. I do modeling as I do; it's up to me whether to assign tags or just make commits.
  2. I have a simple command to glance at the metrics for the past.
  3. When someone clones they need to run pull -n.

On the sliding window destroying data - I meant gc -n -c/-r. Once we do dvc push -n we can't use dvc gc -r/-c in any form without destroying data: dvc gc -aT -n N is that sliding window. So I don't think we should allow push -n, and so pull -n doesn't make sense either.
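
The sliding-window concern can be sketched with a toy Python model. This illustrates the argument only; it is not how dvc gc is implemented, and gc_keep_last is a hypothetical helper standing in for a remote gc limited to the newest n commits.

```python
def gc_keep_last(remote: dict, history: list, n: int) -> dict:
    """Keep only the data referenced by the newest `n` commits."""
    keep = set(history[-n:])
    return {rev: data for rev, data in remote.items() if rev in keep}


history = ["c1", "c2", "c3"]
remote = {rev: "data-" + rev for rev in history}  # everything was pushed

remote = gc_keep_last(remote, history, n=3)
print(sorted(remote))  # all three commits still have their data

# The branch advances by two commits whose data is pushed as well.
history += ["c4", "c5"]
remote.update({"c4": "data-c4", "c5": "data-c5"})

remote = gc_keep_last(remote, history, n=3)
print(sorted(remote))  # c1 and c2 have slid out of the window
```

The same N that was safe before the branch moved now removes data that older commits still reference.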

I don't see how your workflow works:

  1. When someone runs pull -n they might get something or not. Depends on many factors.
  2. You need to communicate to your team a proper N for each branch that could be fetched.
  3. If someone adds a commit or two on top of the branch, they need to tell everyone that it's N + x now.
  4. If someone from point 3 forgets to push, then the branch is broken.
  5. You need to communicate to your team to never use dvc gc -c/-r.

We are not going in circles; we have only now got to discussing workflows. And yours looks broken to me.

Two side notes from this discussion:

  1. Looks like dvc gc has a dangerous design: it removes the most by default and becomes safer as you add options. Feels like it should be the opposite.

  2. We might want to treat metrics differently than other artifacts in push/pull/gc. They are small and can be used to review your work.

And again, with dvc push/pull/gc -n we are inventing a feature out of the blue, which nobody asked for and maybe nobody will ever use. We are providing a new way to shoot yourself in the foot though.

So, I think you addressed the gc concern yourself. At least that it's dangerous or not.

2 - partially answers your question re the workflow and why pull/push -n are needed. There are other cases when I need them - diff two datasets for example.

Don't feel any difference between --log and -n for metrics show in terms of what do you need to communicate, push (branch is broken), etc. But -n is way easier to implement, that's what was asked for, it's enough in 99% cases.

We are providing a much bigger way to shoot yourself in the foot by implementing --log. I don't even know what the workflow would look like for it (when I clone something). I don't know how to implement it.

-n comes for free more or less for push/pull/gc if it's implemented for metrics as far as I remember.

Can you describe your intended workflow step by step? Because I can't see it. I see dvc push/pull/gc -n as a source of grief, not something useful.

Basic case:

  • You clone something with git clone.
  • You run dvc pull -n 10 or dvc fetch -n 10 to fetch _some_ context about the model. You need this step if metrics are cached. And we had a case for cached metrics.
  • You run dvc metrics show -n 10 to print actual numbers as a table.

If you need more you can just increase the number.

When you do experiments:

  • run stuff as you want, do commits as you want, even push when you want.
  • from time to time run metrics -n N to see the progress
  • optionally push -n if you want to save only certain stuff to remote. It's the same as push -a or push -T, I don't see that much difference here.

And again, I see --log as a premature, complicated feature. -n comes for free, it's symmetric with other commands, and it's a must-have for pull -n/fetch -n.

You clone and dvc pull -n 10 and it doesn't work, because you or someone else didn't push, because the branch has moved, or because someone ran gc. You dvc push -n and it inflates your remote, and you can't call dvc gc -r/-c in any form to fix that because it will break everyone else's workflow (and people will occasionally do so). You lose time communicating or trying to guess whatever N you need.

As I say, it's a source of grief and a feature nobody asked for, and it breaks dvc conceptually as I see it. We should not release this into the wild. So the only thing we can safely do is dvc metrics show -n, and that one is inferior to --log, which is only marginally harder to implement.

We are hard on features already, we should not add more questionable ones on top.

You still haven't explained to me how you see --log being implemented at all. Or at least how it will work after you clone a repo and want to see something.

While I agree in general that complicating stuff is bad, in this specific case, as I have mentioned multiple times, -n for pull/fetch comes for free, it makes sense from the symmetry point of view, and it makes sense in certain cases.

Implementation: go over commits backwards, yield show metrics text, pass generator to click.echo_via_pager(). It won't work after clone unless you repro everything, my intended workflow does not need this, we can discuss treating metrics differently though.
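
A minimal sketch of the generator idea described above, in plain Python. The commit names and the fake_load_metrics helper are hypothetical stand-ins; a real implementation would read metrics at each revision and, as proposed, hand the generator to click.echo_via_pager(). Here itertools.islice plays the role of the pager asking for the first screenful.

```python
from itertools import islice


def iter_metrics(commits, load_metrics):
    """Yield one line of text per commit, computing metrics lazily."""
    for rev in commits:
        yield "{}: {}\n".format(rev, load_metrics(rev))


calls = []


def fake_load_metrics(rev):
    # Hypothetical stand-in for reading a metrics file at revision `rev`.
    calls.append(rev)
    return {"metric": len(calls)}


commits = ["commit{}".format(i) for i in range(1000)]  # newest first, made up
lines = iter_metrics(commits, fake_load_metrics)

# A pager pulls lines on demand; take only the first five here.
first_page = list(islice(lines, 5))
print("".join(first_page), end="")
print(len(calls))  # only the commits shown so far were evaluated
```

Under these assumptions the same generator would also cover a -n N limit, since islice(lines, N) simply stops after N commits.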

Nothing comes for free. And this desire for symmetry could be misleading, symmetry has no inherent value it's a heuristic of a value, which fails here. It looks to me like dvc push/pull/gc -n has low value (nobody asked, hard to use) and given all the complications a net negative one, so we should feature cut it.

"my intended workflow does not need this" - what's your intended workflow? I assume that I need a way to have the same experience for metrics show no matter whether the repo was just cloned or not.

I agree to postpone implementing -n for gc/push until we get a signal about it (I'm sure we will get it). But it's an essential part for metrics. And we need -n no matter whether we have --log or not. Again, --log is harder to implement and we might not need it at the end of the day.

There are already two types of metrics - cached and non-cached. I don't remember exact reasons but users were asking about them. It should be easy to find an issue.

Just had a good discussion with ODS guys who are trying DVC for huge datasets, intermediate artifacts, etc (up to 3TB input). They need a way to do gc by all commits - to keep the data that was committed and pushed. They raised the same concern with GC - default behavior is dangerous and destructive.

It looks like it makes a lot of sense to change the behavior - default should be collecting DVC files from all commits (should we fetch all branches as well? should we notify if something is not fetched from remote? how can we ensure that all commits exist locally?).

In this case dvc gc -n 1 will preserve the current behavior. And -n makes much more sense for gc. I'm pretty sure that push -n will be part of the workflow as well.

I like the idea of changing gc behaviour to account for all commits by default. We've been talking about it previously, but in order to implement that we would need to support particular commits, which is the feature that this PR is about.

I'm not very sure about the -n 1 syntax though, because so far it makes sense to re-use git syntax, e.g. dvc gc HEAD~20, but maybe there are some issues with it that I'm not seeing yet.

In any case, changing gc sounds like a great start to me.

Ok, so no matter how the discussion about -n goes from here, I think everyone agrees that the current state of gc should not be preserved, right? Can I start by changing the default behaviour of gc to collect all dependencies from all revisions and remove only those caches that are not in any revision?
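The rule proposed here (keep anything referenced by any revision, remove only the rest) can be sketched as a set difference. This is a minimal illustration, not DVC's implementation: `unused_hashes` and its input files are hypothetical, and it assumes the cache hashes and the per-revision hashes have already been gathered into plain text files, one hash per line.

```shell
#!/bin/sh
# Sketch only: the function name and input files are hypothetical.
# unused_hashes CACHE_LIST USED_LIST...
#   CACHE_LIST  hashes currently present in the local cache, one per line
#   USED_LIST   hashes referenced by one revision (pass one file per revision)
# Prints the hashes that no given revision references.
unused_hashes() {
    # Require the cache list plus at least one per-revision list.
    [ "$#" -ge 2 ] || return 2
    cache_list=$1
    shift
    tmp_cache=$(mktemp) || return 1
    tmp_used=$(mktemp) || { rm -f "$tmp_cache"; return 1; }
    LC_ALL=C sort -u "$cache_list" > "$tmp_cache"
    # Union of everything referenced by any of the given revisions.
    LC_ALL=C sort -u "$@" > "$tmp_used"
    # comm -23 keeps lines that appear only in the first (cache) list.
    LC_ALL=C comm -23 "$tmp_cache" "$tmp_used"
    rm -f "$tmp_cache" "$tmp_used"
}
```

Passing only the list for the latest revision corresponds to the current behaviour (what gc -n 1 would preserve): hashes that older commits still need show up as removable, which is exactly the data-loss concern raised above.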

@pared that might be slow on long history repos. So probably not trivial to do it right, I would discuss/plan it properly first.

@efiop the problem with accepting HEAD~20 is that it would force us to use a different style for gc than for other commands that already have positional arguments (metrics show, for example). I would really like to have a simple syntax first, one that is the same across all commands if possible.

I also don't like git's cryptic style - HEAD~20. I think it's way easier to understand an explicit -n, similar to git log -n 10.

Btw, I believe this ticket is not about particular commits in the first place (--rev or similar options we can decide to add in the future). This ticket is about going back to a certain depth (either log semantics like Alexander suggested, or -n, or log -n like in Git):

If dvc were to support what I'm proposing here, an extra argument would probably be required to limit how many commits DVC would look back at. Otherwise it would show all the metric values since the beginning of the repo history, which can be unhelpful and messy.

I agree with @Suor that it's better to discuss the gc change first. Maybe even create a separate ticket for that, and make this one depend on it.

Any update on this issue? I see it has been declared "important" but also removed from "In progress"... Would love to have this!

It seems to me that what the user wanted to accomplish (dvc metrics show across different commits -- making small parameter changes and checking the metrics for these parameter values) can be implemented more easily and cleanly with a directory for each experiment.

In general, let's say that the user has a table with parameters and their values. He can write a script that, for each set of parameter values, creates a new experiment directory and (re)produces the results. Then he stores all the results (metrics) in the table, removes all the experiment directories (cleanup), and commits to Git this table, which contains the parameter values and the corresponding results. This is much cleaner than making a small commit for each parameter value and considering each commit as an experiment.
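The workflow described above could look roughly like the following. It is only a sketch under assumptions: `run_experiment` is a hypothetical stand-in for the real pipeline (for example, reproducing the results inside the experiment directory), the `sweep` function and the `param1,metric` table layout are made up for illustration, and temporary directories are used as the per-experiment directories.

```shell
#!/bin/sh
# Sketch only: run_experiment is a hypothetical placeholder for the real pipeline.
run_experiment() {
    # $1 = experiment directory, $2 = parameter value.
    # Placeholder: echoes the parameter back as the "metric" so the loop runs.
    printf '%s\n' "$2" > "$1/metric.txt"
}

# sweep RESULTS_FILE VALUE...
# Runs one experiment per value and records "param1,metric" rows in RESULTS_FILE.
sweep() {
    results=$1
    shift
    printf 'param1,metric\n' > "$results"
    for value in "$@"; do
        # One fresh directory per experiment.
        exp_dir=$(mktemp -d) || return 1
        run_experiment "$exp_dir" "$value"
        printf '%s,%s\n' "$value" "$(cat "$exp_dir/metric.txt")" >> "$results"
        # Clean up; only the table is kept.
        rm -rf "$exp_dir"
    done
}
```

After something like `sweep results.csv 10 50 100`, only `results.csv` would need to be added and committed to Git, so a single commit records every trial.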

Regarding the other idea of limiting the output of dvc metrics show -a -T with a range of commits, this might be useful in some cases.

@yfarjoun Sorry for such a huge delay. We've introduced the required changes to the internal brancher, as well as a non-official hidden --all-commits flag for gc (please don't rely on it, it is really in beta mode for now). So the changes for metrics and other commands should not be that far off, yet they are not in this sprint. I'm bumping the priority to make this move faster. Thanks a lot for the feedback! 🙂

Btw, if anyone would be willing to take a shot at contributing a patch for this, we will be happy to help 🙂

thanks for the update. no need to apologize, I just wanted to make sure you know that this is still a desired feature!

To give a new user's perspective on the issue (talking about push/pull really rather than gc), I had assumed that dvc push was equivalent to git push: i.e. you make several local commits then push all of them to a remote. What @pared said basically:

By default, in git, if you want to save something in repo, you will commit it. You do 10 commits and then push. All 10 commits has been pushed. What dvc is currently allowing you to do, is to push current changes. So... we do 10 commits and push dependencies/outputs from only last one. I believe that default behaviour should be pushing all dependencies from all commits, that have not been yet pushed. That is the only way to make sure all commits are not broken, and not demanding from user to periodically making pushes on branch.

... so the actual behaviour of dvc push caught me by surprise initially. I understand that this is hard from a performance perspective, but from a data integrity point of view, I think it's an important option to have. Particularly for raw data which isn't reproducible from anywhere, the dvc cache is the only place where it exists: if you push in the wrong order then you can end up with lost data.

Yet another confusing point, I believe, is the missing option to push multiple commits: https://github.com/iterative/dvc.org/issues/1087 ... It may also make sense to have --all-commits.

Is this feature still in plans?

I ended up with a little workaround for pushing data across various commits. I simply added a git hook at .git/hooks/pre-commit, so every time I commit something my data is synchronized. Here is my hook:

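#!/bin/sh

# 1. List files staged for commit (excluding deleted files).
# 2. Keep only files whose names end in ".dvc".
# 3. Push the data tracked by those DVC files to the remote
#    (--no-run-if-empty is a GNU xargs option; it skips dvc push when nothing matched).
git diff --cached --name-only --diff-filter=d | grep -E '\.dvc$' | xargs --no-run-if-empty dvc push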

Of course, it noticeably increases the time for each commit, but it also solves my problem with data synchronization. I hope it helps someone besides me.

Hi! Resurrecting this discussion (per a support question related to deep learning: having to pick a winner from 500K epochs, and it's definitely not the last one):

Specifically on metrics diff commands, refer to #4211: dvc plots diff already accepts multiple revisions, so dvc metrics diff could also do so (and you can send it ranges of commits with something like git log --format=%h HEAD~10..).
But I'm guessing this will totally crash if I send it 500K SHAs... Plus you wouldn't even want to commit that many variations of an experiment (so this relates to run-cache as well)

But what about accepting standard Git commit ranges? (Both git diff and git range-diff accept them, for different purposes.) And then print a summary with just some stats like mean, norm, max, min (configurable, perhaps).

Ivan mentioned we may want to avoid cryptic Git syntax in https://github.com/iterative/dvc/issues/1691#issuecomment-513885271, but I'm not sure why. We use Git as the underlying versioning engine so why not leverage more of its features?
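To make the "summary over a commit range" idea concrete, here is a minimal sketch of the statistics step only. It is not an existing DVC command: `summarize` is a hypothetical helper that assumes the metric values for the commits in the range have already been extracted, one number per line, and it covers just min, max and mean.

```shell
#!/bin/sh
# Sketch only: summarize is a hypothetical helper, not part of DVC or Git.
# Reads one numeric metric value per line on stdin and prints min, max and mean.
summarize() {
    awk '
        NR == 1 { min = $1 + 0; max = $1 + 0 }
        {
            v = $1 + 0              # force numeric comparison
            sum += v
            if (v < min) min = v
            if (v > max) max = v
        }
        END { if (NR > 0) printf "min=%g max=%g mean=%g\n", min, max, sum / NR }
    '
}

# Hypothetical usage, assuming each commit stores a single number in metric.txt:
#   for rev in $(git rev-list HEAD~10..HEAD); do
#       git show "$rev:metric.txt"
#   done | summarize
```

The range `HEAD~10..HEAD` is standard Git revision syntax, so the commit selection is delegated to Git and only the aggregation is new.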

I don't think this issue is really related to that discussion. An epoch is not the result of a run, so there is no commit or model for each of those. In current terms it might be a datapoint in some plot or simply an intermediate state, which might or might not be saved, as the user wishes.

I think you're right with respect to that particular user's support case. Still I think this idea is worth considering for some of our commands:

accepting standard Git commit ranges? ... And then print a summary with just some stats like mean, norm, max, min (configurable

p.s. add dvc exp diff per another support case.
