Currently `dvc metrics show` can show metric values across different branches (`-a`) and different tags (`-T`).
Can you consider supporting showing different metric values across different commits in the same branch?
The background of this is (simplified example): say I'm currently training a model, where I'm changing a certain parameter, `param1` (for instance, number of trees in a forest). The way I probably would like to work is to find a first value for `param1`, commit the current state, continue changing `param1` and continue committing the successive states that I consider worth saving. At some point I would like to look back and identify the setup that gave me the best results.
The way DVC currently works forces me to create a new branch/tag for each trial I want to keep track of, and this seems a bit overwhelming.
Depending on how different the experiments I'm running are and their level of granularity I could decide how to keep track of them (new commits VS new branches/tags).
Notes:
@andrethrill Would you like to compare two specific commits or just the dynamics of your metrics changing across a range of commits? The latter one is probably more suitable for a graphical tool, like TensorBoard or something. Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?
Hi @efiop !
I'm aware of TensorBoard but that's not exactly what I was talking about.
I would like to have a way of running a few consecutive different experiments and see their metrics. Just like `dvc metrics show -a` currently does but without needing to create different branches. DVC seems like a good fit for this because I could checkout the experiment that gave me the best results and have everything version controlled (model, data, etc).
Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?
If that were to be supported it would be great of course. But for what I'm talking about, just looking at the output in the same form as in `dvc metrics show -a` would be enough.
@andrethrill Ah, so something like `dvc metrics show HEAD~10` to show metrics for 10 last commits on the current branch?
Exactly @efiop ! And/or some other nice variations of it: `dvc metrics show HEAD~{commitHash}` to show metrics since `commitHash` on the current branch.
@andrethrill AFAIK `HEAD~{commitHash}` is not supported by git and it would be great to leave the syntax similar if not the same as in git :) But I get your point, there is probably a git-way to do that. Thanks for a great feature request!
@efiop indeed, I was not thinking from git perspective. The syntax would have to be different :)
@andrethrill @efiop It seems the ability to `dvc metrics show` only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.
Anyhow, I would find the ability to `dvc metrics show` only a specific tag or tags useful for some parts of my workflow.
Since the logic behind all the commands is similar it probably makes sense to implement it for all commands that support the `-T`, `-a` options now.
@andrethrill @efiop It seems the ability to `dvc metrics show` only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.
What about `dvc metrics show` not just for an arbitrary tag but for an arbitrary commit? The same way the `git checkout` command allows checking out both an arbitrary tag (`git checkout <tag_id>`) and commits (`git checkout <commit_id>`).
The syntax of course would be different from git. Something like `dvc metrics show --id <tag/commit>`. Additionally you can use the same syntax in other commands, like `dvc push --id <tag/commit>`.
This approach also solves the issue when you have several local commits and in each commit the single data file tracked by DVC has been overridden. The current implementation of `dvc push` only pushes the last version of your data file from the last commit. The `--id` option will allow pushing all the previous versions of your data by executing `dvc push --id <commit>` for all the previous commits.
@andrethrill @brbarkley @nik123
1. I see that we might have a few smaller tasks here. I'll try to identify them:
- `metrics show HEAD~X` - which will show metrics from last `X` commits
- `metrics show --since {rev}` - show metrics since `rev`
- `metrics show --ids={rev1},{rev2},{rev3}` - like `show -a` but restricted to particular revs
Do those options make sense?
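2. This is just for metrics, how would you guys see other operations, like push, pull, etc?
a) I imagine that, for example in the case of push, one might want to push all dependencies that have been bound to some git revision with stage files. Do you think it would be feasible to include some option for that? Like `dvc push --all-revs`? That could also be used for other ops, like pulling.
b) Do the points from 1. make sense for operations like push, pull etc? Have you ever needed to pull for some range of commits/tags/branches? Or maybe `--all-revs` for `push` would be enough?
My 2 cents on this.
- `-a` and `-T` are symmetric across `push/pull/gc/metrics-show`, we should make a new option symmetric as well. Especially considering that it does not make implementation more complicated.
- `-a`, `-T`.
- `--revisions`? or something like that. Would really want to avoid introducing positional arguments, and a few options on top just to manage this case.
- `-a`, `-T`. Something like `regexp` to filter branch/tag names. Can we reuse this new option to simplify certain options in this case?
@shcheklein I agree that for simplicity it would be much better to implement `--revisions` but looking at the original feature request, supporting something like `dvc metrics show HEAD~10` looks like a desired approach too. I think that using revisions to compare the results of the last 10 commits will be a headache for users. One will need to either tag all commits, or `git log` and copy-paste revs to the `--revisions` option.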
EDIT: let me clarify that I am talking here specifically about show
I would also like to start another discussion about the `--revisions` option.
As we discussed privately, we came to the conclusion that probably the most convenient way for the user to use `--revisions` would be to provide comma-separated revision ids, like:
`--revisions {rev1},{rev2},{rev3}`
(note that providing revisions like `--revisions {rev1} {rev2} {rev3}` is not a viable option, since we would not know when to start parsing `targets`)
The problem with this approach is that a comma is a valid character in a branch name. So this edge case would break the currently considered approach.
The other way to do that would be a repeatable `--revision` option (`--revision {rev1} --revision {rev2} --revision {rev3}`).
I think we cannot expect users to name branches in a way that would be convenient for us, do you agree?
EDIT: as discussed with @Suor, we do not necessarily need to use a comma as the separator; git forbids some characters in branch names, like colons.
Possible solution: require providing revisions after parsing targets, that would make parsing multiple targets and multiple revisions possible.
How that could look like:
We are using `rev` already, why not `--revs`?
@Suor I agree, especially since it's short and understandable.
My thinking was - is it possible to derive from the string that is passed to `--revs` what exactly we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc. Is there an example in the Git CLI where it expects a list of revs?
@shcheklein looking through documentation, I think closest example would be using refspec:
https://git-scm.com/book/en/v2/Git-Internals-The-Refspec
What do you mean by `what do we want to address`? Determining whether it is a commit sha, branch or tag?
Yep, either it's a commit, branch, tag or a list of those.
@shcheklein do we actually need to know what it is? AFAIK `git checkout` accepts any of those.
It seems to me, that we need to decide which way we go with implementation of this feature.
I would leave the discussion "Do we support `dvc metrics show HEAD~10`" for some other issue, as it is a particular use case that is not related to push/pull etc...
How to implement data sync operations for a few revisions: we discussed 3 approaches so far:
1. `dvc pull -rev rev1 -rev rev2 -rev rev3 file.dvc`
2. `dvc pull file.dvc --revs rev1 rev2`
3. `dvc pull --revs=rev1:rev2:rev3 file.dvc`
I think we should go with the last one, because it's faster to use than the first one, and does not introduce a strong assumption as the second approach does (I mean requiring passing `revs` after `targets`).
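A minimal sketch of how the colon-separated form could be parsed, assuming a hypothetical helper `parse_revs` (not part of DVC). It relies on the fact that Git does not allow `:` in ref names while `,` is allowed, so a branch called `feature,v2` survives the split intact:

```python
from typing import List


def parse_revs(value: str) -> List[str]:
    """Split a colon-separated ``--revs`` value into individual revisions.

    Hypothetical helper for illustration only. Git rejects ':' in ref
    names, so splitting on it cannot cut a branch or tag name in half.
    """
    revs = [rev.strip() for rev in value.split(":")]
    if any(not rev for rev in revs):
        raise ValueError(f"empty revision in {value!r}")
    return revs


print(parse_revs("rev1:rev2:rev3"))
print(parse_revs("feature,v2:HEAD~1"))  # the comma stays part of the branch name
```

Whether the colon form is intuitive enough for users is a separate question; the sketch only shows that it is unambiguous to parse.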
@pared 1 and 2 are tied together. metrics and pull/push/etc should have (if it is feasible) the same syntax for working with references. Unless we decide to redesign it of course, but I don't see the point of that just yet. I totally agree with you that most would probably just want to have an ability to do something with the last N commits or something, so we need to give that syntax a bit of thought, which might actually change the approach with `--revs`.
We've discussed that the second approach (the one that is requiring passing revs after targets) is absolutely terrible, just forget about it :slightly_smiling_face:
My thinking was - is it possible to derive from the string that is passed to --revs what exactly do we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc . Is there an example in Git cli where it expects a list of revs?
@shcheklein I agree with @pared , this is a terrible idea, git doesn't distinguish between those so neither should we, especially just to adopt some joining syntax. I would much rather go with `--rev rev1 --rev rev2` and have it deterministic than invent comma joining syntax to be able to `--revs rev1,rev2`. Though, the `:` looks promising, if it is indeed forbidden by git to have tags/branches with those. That being said, using colons is not intuitive at all.
I see that we might have here few smaller tasks. Ill try to identify them:
metrics show HEAD~X - which will show metrics from last X commits
metrics show --since {rev}- show metrics since rev
metrics show --ids={rev1},{rev2},{rev3} - like show -a but restricted to particular revs
Do those options make sense?
@pared This makes a lot of sense to me from user perspective, but I would probably go with something like
metrics show --from-rev HEAD~X (it is implied that --to-rev is HEAD, same as any git command does)
metrics show --from-rev rev
metrics show --rev rev1 --rev rev2 --rev rev3
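As a rough illustration of the repeatable-flag variant, Python's standard `argparse` handles it with the `append` action, and targets can still follow as positionals. The program name `metrics-show` and the option layout below are assumptions made for the sketch, not DVC's actual parser:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `--rev` / `--from-rev` proposal."""
    parser = argparse.ArgumentParser(prog="metrics-show")
    # Each `--rev X` occurrence appends X to the same list.
    parser.add_argument("--rev", action="append", default=[], dest="revs")
    # Start of a range; the end is implied to be HEAD, as proposed above.
    parser.add_argument("--from-rev", default=None)
    parser.add_argument("targets", nargs="*")
    return parser


args = build_parser().parse_args(
    ["--rev", "rev1", "--rev", "rev2", "--rev", "rev3", "metrics.json"]
)
print(args.revs, args.targets)
```

Because every revision is introduced by its own flag, there is no ambiguity about where the revisions end and the targets begin, at the cost of the verbosity raised elsewhere in this thread.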
This is just for metrics, how would you guys see other operations, like push, pull, etc?
a) I imagine, that for example in case of push, one might want to push all dependencies that has been binded to some git revision with stage files. Do you think it would be feasible to include some option for that? Like dvc push --all-revs? That could also be used for other ops, like pulling.
b) Does points from 1. make sense for operations like push, pull etc? Have you ever needed to pull for some range of commits/tags/branches? Or maybe --all-revs for push would be enough?
I'm not sure `--all-revs` makes any sense in the git world, since it feels like it would include detached heads and stuff, which is generally considered to be trash. That being said, being able to push/pull/etc all history seems very useful to me. If talking in git terms, one would maybe expect `dvc push` to push everything from the initial commit, but we didn't do that because it might be excessive because of the data size. But maybe users have another opinion? :slightly_smiling_face:
@efiop by saying `--all-revs` I was thinking about the current branch, probably the naming could be improved :)
@efiop `--rev rev1 --rev rev2` is just too verbose.
@efiop @pared last N commits or from this commit is not a good idea. Why should I care to count commits or copy paste sha? It should work just like `git log` does, calculating and showing things as you scroll.
yep, I agree with @Suor . multiple `--rev` is too verbose. Btw, do we need to support a list of commits at all (back to the ticket itself)? Can we just support the `last N commits` for now? Then we need just a simple option `-n 10` or something.
@shcheklein I guess we could, though the approach with `--revs` does not bind us to the current branch as `-n` does.
Let's imagine we have been working on two branches, iterated a bit, and have a few commits on each branch.
- with the `--revs` approach we need to specify, by hand, all commits that we want to push
- with `-n` we would probably have to `checkout branch_1 && dvc push -n {X}`, then `checkout branch_2 && dvc push -n {Y}`
Both approaches require us to play around with `git log` to check which/how many commits we need to push.
How about doing it as git does?
If we produce a few commits on a local branch, we should push all of them. If the user does not need some commit's outputs, they should squash/fixup that commit, effectively removing its dependencies from history. This way the user does not need to worry that they forgot to `dvc push` something and the repository is broken at some point, yet they still have the power to remove some intermediate results.
Guys, why are you talking about `dvc push -n ...`? This is a very strange thing to my liking and nobody asked for it. And we don't need `-n` for metrics either, as I explained above `dvc metrics show --log` is what the topic starter really asked for.
@shcheklein `dvc metrics --revs foo,bar,baz` might be also useful and is easier to implement, this is why we are discussing it.
@Suor "an extra argument would probably be required to limit how many commits DVC would look back at. Otherwise it would show all the metric values since the beginning of the repo history, which can be unhelpful and messy." - this is exactly the `-n` I'm talking about. And more or less for the same reason.
I think `--log` behavior is less convenient here tbh. I want to see a table with just recent metrics, I don't need a full log usually. `--log` and `-n` do not contradict each other. But I would love to see a case for `--log` first. Especially since we need `-n` (or come up with a better name) to support this across different commands like push/pull, etc. Yes, I want to push just a few recent commits, because I forgot to run `dvc push` recently. How can I do this? It can be and should be symmetric across push/pull/metrics show/gc.
`--revs [list]` - again, it's too complicated to start with, I don't see a clear analogy, I don't see how it can be actually used. Me specifying a list of git commit hashes? No, I doubt that anyone would do this. But again, we can do this when we see more signals.
@pared the `-a` (and `-T`) options can play very well along with `-n`.
@shcheklein you calc and show metrics that fit on screen, then calc and show more if user scrolls down. What's messy about it? You propose user to guess or count instead - this is the messy one.
`push -n` still doesn't make sense to me at all. I don't think people need it, nobody ever asked for it and it contradicts how `dvc gc` operates.
Scrolls break your context (the way it's usually done is that it disappears when you terminate it), scroll is harder to implement (unless you do an actual fetch for all commits), scroll is not what ppl are asking for in the ticket ... and for a good reason. My intuition is that I just want to see a reasonable table right away. Want more and scroll - just do `-n 1000 | less`.
The only reason I use the default log with scroll is when I actually need to search for something. But in our case it would mean that we'll have to put together the full table in advance, right?
Is it only `push -n` that does not make sense, or all other commands?
It looks to me that the author of this issue wanted the log feature, `-n` is an awkward substitute for it. I agree that `-n` is easier to implement.
I disagree that scrolling is only needed for searching, I almost never search `git log`, but I scroll it occasionally. And even for searching one doesn't need to precompute everything.
`-n` doesn't make sense for push, pull, gc, it only makes sense for metrics.
Okay, I think we need to decide what we are actually focusing on in this issue, and create a new issue accordingly. We talk about supporting data synchronization commands among commits (that's what the title indicates), while the original description is about tracking metrics. That makes the discussion two-threaded.
I propose to focus here on `show` functionality for the last commits and create an issue for dealing with data sync operations, since the original description is about supporting metrics.
So, there are two pieces under discussion.
- `-n 10` - simple to implement, correlates in mind with the ticket, feels like we will need it anyway (like `git log` has the limit), works well with `-a` and `-T`, can be easily extended with a more advanced `--log` argument.
- `dvc show --log` works like `git log`. I'm not sure tbh how we are going to implement this if we don't introduce `-n` with it, to limit the number of commits we go back. Or it's not that easy to implement. Also, in my experience I want my table on my screen, not in `less`.
- `--revs` - I have doubts someone is going to explicitly specify commits, too much burden. Can be implemented if we need this.
We really need this. I will need it even to adjust the example in get started.
It's not a big deal to implement it across all commands at once.
So, I would say, let's go with the first option and symmetrical implementation of it. Don't see any reasons to complicate it in the first iteration.
I agree that `--refs` should wait for more demand.
However, if we go with the first option we should not allow mixing `-n` with `-a` or `-T`, this doesn't make sense for `show metrics` - how would you order those? And thus there is no need for it in `push/pull`.
There is a ready-to-use pager implementation, which accepts a generator and thus makes implementing `dvc metrics show --log` comparable in complexity with the `-n` thing.
@shcheklein I am not a fan of the `-n` option:
By default, in git, if you want to save something in repo, you will commit it. You do 10 commits and then push. All 10 commits has been pushed. What dvc is currently allowing you to do, is to push current changes. So... we do 10 commits and push dependencies/outputs from only last one. I believe that default behaviour should be pushing all dependencies from all commits, that have not been yet pushed. That is the only way to make sure all commits are not broken, and not demanding from user to periodically making pushes on branch.
Example:
#!/bin/bash
rm -rf repo storage
mkdir repo storage
cd repo
git init >> /dev/null && dvc init >> /dev/null
dvc remote add -d storage ../storage
echo data >> data
dvc add data >> /dev/null
git add -A
git commit -am "first"
dvc push
echo data2 >> data
dvc add data
git add -A
git commit -am "second"
echo data3 >> data
dvc add data
git add -A
git commit -am "third"
dvc push
rm -rf .dvc/cache
rm data
git checkout HEAD~1
# no cache found
dvc checkout
@pared We would have to break the current behavior. And it's not only about `dvc push`. It's also about `pull/checkout/fetch/gc`. I gave you an example where you have to run `pull` (not `push`) in order to even be able to run `metrics show`.
The analogy with Git is relevant only to a certain extent. The biggest problem is that when you deal with data, all operations are way more expensive than with code. It means you need a way higher level of granularity. Thus we don't `pull/push` all commits by default - it would be just too expensive. (Thus we have options like `-a`, `-T` to be able to extend the scope.) From all perspectives - number of files to analyze, potential amount of data to push, etc.
@shcheklein still `dvc push -n` looks way too fragile, any `dvc gc -r` and whoosh, no data.
I would argue that `dvc push/pull -n` is not needed for `dvc metrics show --log` to be useful. The workflow may look like:
1. `dvc metrics show --log` to review.
2. `git tag` whatever commits work the best.
3. `git push`, then `dvc push -T`.
@shcheklein
I have to agree that data needs to be approached differently, and breaking current behaviour is a no-no.
I like the approach proposed by @Suor, that way the user would not have a problem with a situation like "Out of 10 last commits I want to preserve 3, 7, 9". Though that still requires some work on the user side. `-n` seems to have the potential to grow cache size unnecessarily, just like my idea with pushing all dependencies/outputs.
@Suor I don't see any difference with `-a` or `-T` in terms of "fragility" tbh. All those options expand the scope. If we have some concerns about `gc -n` we can just introduce a prompt (for `-a`, `-T`, `-n`, etc).
The workflow you described is too cumbersome. I have to tag experiments, which is not nice. And there is no way for me to see the stuff after I cloned the repo. I don't see how `--log` will work in your case after I clone the repo. Or do you suggest to ignore missing files? Basically return nothing for all previous commits? Until I do `dvc pull -aT`? Which by itself can be very different from what I want - just to glance at the latest scope of experiments (not all tags, not all branches - those could be too expensive).
And to reiterate - we can implement `--log`. I just believe we would still need `-n`. And we will need it across all commands. So why don't we just start with something simple and add more advanced and tricky (impossible?) to implement stuff later?
@shcheklein `gc -n` doesn't make sense at all, it is a sliding window destroying data. If you always run `gc -aT` at least you won't ever remove tagged and branched data, this makes some sense. And if we ever use `push -n` we can't ever run `dvc gc` at all, so no point in having those too.
My workflow example is practical and is what the author was doing. Yes, you need tags to persist data, this is what you are doing anyway with dvc, because there is no other way and `dvc push -n` won't add it. What is your proposed workflow?
Still no use for `-n` that I see. This is a feature nobody asked for, it doesn't enable any new non-broken workflow. Why should we bother with it?
I'm not sure I understood the sliding window analogy here. `gc -n` makes perfect sense to me. First, `gc` is not destroying data, it cleans only the cache by default. And this is exactly a way for me to specify - just keep the data to a certain depth, locally. Similar to git clone depth.
We are going in circles, and I don't see any other new arguments there, not that much to add here. To clarify the workflow:
- `pull -n`.
On sliding window destroying data - I meant `gc -n -c/-r`. Once we do `dvc push -n` we can't use `dvc gc -r/-c` in any form without destroying data: `dvc gc -aT -n N` is that sliding window. So I don't think we should allow `push -n` and so `pull -n` doesn't make sense either.
I don't see how your workflow works:
- `pull -n` they might get something or not. Depends on many factors.
- `N` for each branch that could be fetched.
- `N + x` now.
- `dvc gc -c/-r`.
We are not going in circles, we only now got to discuss workflows. And yours looks broken to me.
Two side notes from this discussion:
1. Looks like `dvc gc` has a dangerous design: it removes the most by default and becomes safer as you add options. Feels like it should be the opposite.
2. We might want to treat metrics differently than other artifacts in `push/pull/gc`. They are small and can be used to review your work.
And again, with `dvc push/pull/gc -n` we are inventing a feature out of the blue, which nobody asked for and maybe nobody will ever use. We are providing a new way to shoot yourself in the foot though.
So, I think you addressed the `gc` concern yourself. At least whether it's dangerous or not.
2 - partially answers your question re the workflow and why `pull/push -n` are needed. There are other cases when I need them - `diff` two datasets for example.
Don't feel any difference between `--log` and `-n` for metrics show in terms of what you need to communicate, push (branch is broken), etc. But `-n` is way easier to implement, that's what was asked for, it's enough in 99% of cases.
We are providing a way bigger way to shoot yourself by implementing --log. I don't even know how will the workflow look like for it (when I clone something). I don't know how to implement it.
-n comes for free more or less for push/pull/gc if it's implemented for metrics as far as I remember.
Can you describe your intended workflow by steps? Because I can't see it. As I see it, `dvc push/pull/gc -n` is a source of grief, not something useful.
Basic case:
1. `git clone`.
2. `dvc pull -n 10` or `dvc fetch -n 10` to fetch _some_ context about the model. You need this step if metrics are cached. And we had a case for cached metrics.
3. `dvc metrics show -n 10` to print actual numbers as a table.
If you need more you can just increase the number.
When you do experiments:
- `metrics -n N` to see the progress
- `push -n` if you want to save only certain stuff to remote. It's the same as `push -a` or `push -T`, I don't see that much difference here.
And again, I see `--log` as a premature complicated feature. `-n` comes for free, it's symmetric with other commands, it's a must-have for `pull -n`/`fetch -n`.
You clone and `dvc pull -n 10` and it doesn't work, because you/someone else didn't push, because the branch has moved, because someone ran gc. You `dvc push -n` and it inflates your remote, you can't call `dvc gc -r/-c` in any form to fix that because it will break everyone else's workflow, which people will occasionally do. You lose time communicating or trying to guess whatever N you need.
As I say it's a source of grief and a feature nobody asked for, it breaks dvc conceptually as I see it. We should not release this into the wild. So the only thing we can safely do is `dvc metrics show -n` and that one is inferior to `--log`, which is only marginally harder to implement.
We are hard on features already, we should not add more questionable ones on top.
You still haven't explained to me how you see `--log` being implemented at all. Or at least how it will work after you clone a repo and want to see something.
While I agree in general that complicating stuff is bad, in this specific case I mentioned multiple times -n for pull/fetch comes for free, it makes sense from the symmetry point of view, it makes sense in certain cases.
Implementation: go over commits backwards, yield show metrics text, pass the generator to `click.echo_via_pager()`. It won't work after clone unless you repro everything, my intended workflow does not need this, we can discuss treating metrics differently though.
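The implementation idea above might be sketched roughly as follows. `iter_metrics_log`, `show_log`, `revs` and `read_metrics` are hypothetical names standing in for walking the Git history and loading the metrics stored at a revision; only `click.echo_via_pager`, which accepts a generator, is an existing call. The optional `limit` shows that a `-n`-style cap and the log view can share one code path:

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, Optional


def iter_metrics_log(
    revs: Iterable[str],
    read_metrics: Callable[[str], Dict[str, float]],
    limit: Optional[int] = None,
) -> Iterator[str]:
    """Lazily yield one text chunk per revision, in the order given."""
    if limit is not None:
        revs = islice(revs, limit)  # `-n`-style cap on how far back we go
    for rev in revs:
        metrics = read_metrics(rev)  # only called when the chunk is requested
        lines = [f"{rev}:"]
        lines += [f"\t{name}: {value}" for name, value in sorted(metrics.items())]
        yield "\n".join(lines) + "\n"


def show_log(revs, read_metrics, limit=None) -> None:
    """Feed the chunks to the system pager through click."""
    import click  # third-party; imported lazily so the generator has no dependency

    click.echo_via_pager(iter_metrics_log(revs, read_metrics, limit))


# Toy history, newest first (made-up values, not DVC output).
history = {"c3": {"auc": 0.91}, "c2": {"auc": 0.88}, "c1": {"auc": 0.85}}
for chunk in iter_metrics_log(history, history.__getitem__, limit=2):
    print(chunk, end="")
```

Since the generator reads metrics one revision at a time, nothing has to be precomputed for the whole history, though this says nothing about what happens when the data for older commits is missing locally.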
Nothing comes for free. And this desire for symmetry could be misleading: symmetry has no inherent value, it's a heuristic of value, which fails here. It looks to me like `dvc push/pull/gc -n` has low value (nobody asked, hard to use) and given all the complications a net negative one, so we should cut the feature.
my intended workflow does not need this
what's your intended workflow? I assume that I need a way to have the same experience for `metrics show` no matter whether the repo was just cloned or not.
I agree to postpone implementing `-n` for `gc/push` until we get a signal about it (I'm sure we will get it). But it's an essential part for metrics. And we need `-n` no matter if we have `--log` or not. Again, `--log` is harder to implement and we might not need it at the end of the day.
There are already two types of metrics - cached and non-cached. I don't remember exact reasons but users were asking about them. It should be easy to find an issue.
Just had a good discussion with ODS guys who are trying DVC for huge datasets, intermediate artifacts, etc (up to 3TB input). They need a way to do gc `by all commits` - to keep the data that was committed and pushed. They raised the same concern with GC - default behavior is dangerous and destructive.
It looks like it makes a lot of sense to change the behavior - default should be collecting DVC files from all commits (should we fetch all branches as well? should we notify if something is not fetched from remote? how can we ensure that all commits exist locally?).
In this case `dvc gc -n 1` will preserve the current behavior. And `-n` makes much more sense for `gc`. I'm pretty sure that `push -n` will be part of the workflow as well.
I like the idea of changing gc behaviour to account for all commits by default. We've been talking about it previously, but in order to implement that we would need to support particular commits, which is the feature that this PR is about.
I'm not very sure about the `-n 1` syntax though, because so far it makes sense to re-use git syntax. E.g. `dvc gc HEAD~20`, but maybe there are some issues with it that I'm not seeing yet.
In any case, changing `gc` sounds like a great start to me.
Ok, so no matter how the discussion goes further (about `-n`) I think everyone agrees that the current state of `gc` shall not be preserved, right? Can I start with changing the default behaviour of gc to collecting all dependencies from all revisions and removing only those caches that are not in any revision?
@pared that might be slow on long history repos. So probably not trivial to do it right, I would discuss/plan it properly first.
@efiop the problem with accepting `HEAD~20` can be that it will force us to use a different style for `gc` and for other commands that already have positional arguments (`metrics show`, for example). I would really like to have the simple syntax first, that is the same across all commands if possible.
I also don't like git's cryptic style - `HEAD~20`. I think it's way easier to understand an explicit `-n`. Similar to `git log -n 10`.
Btw, I believe this ticket is not about `particular commits` in the first place (`--rev` or similar options we can decide to add in the future). This ticket is about going back to a certain depth (either `log` semantics like Alexander suggested, or `-n`, or `log -n` like in Git):
If dvc were to support what I'm proposing here, an extra argument would probably be required to limit how many commits DVC would look back at. Otherwise it would show all the metric values since the beginning of the repo history, which can be unhelpful and messy.
I agree with @Suor that it's better to discuss the `gc` change first. Maybe even create a separate ticket for that. And make this one depend on it.
Any update on this issue? I see it has been declared "important" but also removed from "In progress"... Would love to have this!
It seems to me that what the user wanted to accomplish (`dvc metrics show` across different commits -- making small parameter changes and checking the metrics for these parameter values) can be implemented more easily and more cleanly with directories for each experiment.
In general, let's say that the user has a table with parameters and their values. He can write a script that for each parameter values creates a new experiment directory and (re)produces the results. Then he stores on the table all the results (metrics), removes all the experiment directories (cleanup), and commits on Git this table that contains the parameter values and the corresponding results. This is much cleaner than making a small commit for each parameter value and considering each commit as an experiment.
Regarding the other idea of limiting the output of `dvc metrics show -a -T` with a range of commits, this might be useful in some cases.
@yfarjoun Sorry for such a huge delay. We've introduced the required changes for the internal brancher, as well as a non-official hidden `--all-commits` flag for `gc` (please don't rely on it, it is really in a beta mode for now). So changes for metrics and other commands should not be that far off, yet they are not in this sprint. I'm bumping the priority to make this move faster. Thanks a lot for the feedback! 🙂
Btw, if anyone would be willing to give a shot at contributing a patch for this, we will be happy to help 🙂
thanks for the update. no need to apologize, I just wanted to make sure you know that this is still a desired feature!
To give a new user's perspective on the issue (talking about `push/pull` really rather than `gc`), I had assumed that `dvc push` was equivalent to `git push`: i.e. you make several local commits then push all of them to a remote. What @pared said basically:
By default, in git, if you want to save something in repo, you will commit it. You do 10 commits and then push. All 10 commits has been pushed. What dvc is currently allowing you to do, is to push current changes. So... we do 10 commits and push dependencies/outputs from only last one. I believe that default behaviour should be pushing all dependencies from all commits, that have not been yet pushed. That is the only way to make sure all commits are not broken, and not demanding from user to periodically making pushes on branch.
... so the actual behaviour of `dvc push` caught me by surprise initially. I understand that this is hard from a performance perspective, but from a data integrity point of view, I think it's an important option to have. Particularly for raw data which isn't reproducible from anywhere, the dvc cache is the only place where it exists: if you `push` in the wrong order then you can end up with lost data.
And yet another confusing and missing option to push multiple commits I believe - https://github.com/iterative/dvc.org/issues/1087 ... maybe it also makes sense to have `--all-commits`.
Is this feature still in plans?
I ended up with a little workaround for pushing data among various commits. I simply added a git hook at `.git/hooks/pre-commit`. So every time I commit something my data is synchronized. Here is my hook:
#!/bin/sh
# 1. List files staged for commit (excluding deleted files)
# 2. Filter dvc files.
# 3. Push updated dvc files into remote
git diff --cached --name-only --diff-filter=d | grep -E '\.dvc$' | xargs --no-run-if-empty dvc push
Of course it noticeably increases the time for each commit but it also solves my problem with data synchronization. I hope it helps someone besides me.
Hi! Resurrecting this discussion (per a support question related to deep learning: having to pick a winner from 500K epochs, and it's definitely not the last one):
Specifically on metrics diff commands, refer to #4211: `dvc plots diff` already accepts multiple revisions, so `dvc metrics diff` could also do so (and you can send it ranges of commits with something like `git log --format=%h HEAD~10..`).
But I'm guessing this will totally crash if I send it 500K SHAs... Plus you wouldn't even want to commit that many variations of an experiment (so this relates to run-cache as well)
But what about accepting standard Git commit ranges? (Both `git diff` and `git range-diff` accept them, for different purposes.) And then print a summary with just some stats like mean, norm, max, min (configurable, perhaps).
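To make the summary idea concrete, here is a small sketch using only the standard library. `summarize` is a hypothetical helper, the input values are made up, and "norm" is assumed to mean the Euclidean norm of the metric values across the range, which the comment above does not specify:

```python
import math
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Reduce one metric's values over a commit range to a few stats."""
    return {
        "mean": statistics.mean(values),
        "norm": math.sqrt(sum(v * v for v in values)),  # Euclidean norm (assumption)
        "max": max(values),
        "min": min(values),
    }


# Made-up values of one metric across three commits in a range.
print(summarize([0.5, 0.75, 1.0]))
```

A summary like this stays a fixed size no matter how many revisions the range covers, which is the point when the range is far too long to print as a table.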
Ivan mentioned we may want to avoid cryptic Git syntax in https://github.com/iterative/dvc/issues/1691#issuecomment-513885271, but I'm not sure why. We use Git as the underlying versioning engine so why not leverage more of its features?
I don't think this issue is really related to that discussion. Epoch is not the result of the run, so there is no commit or model for each of those. In current terms it might be a datapoint in some plot or simply an intermediate state, which might be saved or not upon users wish.
I think you're right with respect to that particular user's support case. Still I think this idea is worth considering for some of our commands:
accepting standard Git commit ranges? ... And then print a summary with just some stats like mean, norm, max, min (configurable
p.s. add `dvc exp diff` per another support case.