Dvc: Reconsider gc implementation

Created on 26 Jul 2019  ยท  73Comments  ยท  Source: iterative/dvc

As pointed out in discussion in #1691, we should reconsider gc implementation.
Currently, if called without any options, dvc will collect current branch dependencies and outputs checksums, and remove everything besides it. We can easily clear history of changes with this command. gc should be safer with default options. Straightforward implementation could get all outputs for all revisions in git repo and remove everything that is not on list.

As pointed out by @Suor, this approach might be slow for repository with long history.

enhancement p1-important research ui

Most helpful comment

@kenahoo If you don't mind --dry-run to be almost as slow as a non-dry one then that is not hard to implement. The thing is we don't really know what we will delete until we list all the cache/all the remote cache, which is the slow part.

All 73 comments

This is more about UI defaults than implementation, which will surely be affected too.

yes - very much agree. I am new to dvc, but the gc implementation is already making me nervous.

One additional note on gc. Should "metrics" files be treated differently than the rest? I agree that gc should easily collect old model files etc, but keeping the development of performance over time around would be nice.

@Suor Why do you think this is more about UI than implementation? I am not familiar in detail with the implementation, but the existence of the "--projects" parameter points to an interesting problem. What should happen to files that aren't referenced anywhere in the current git repository?

The conservative answer to this would be to not remove them. However this sounds like a major change in strategy, where only those items are considered for removal that
a) are known where they are referenced
b) are cleared for deletion (e.g. by being in a certain branch or by having never versions for the same file in the a branch)

Or am I misunderstanding this?

Thanks

Hi @hhoeflin ! Sorry for the delay.

Should "metrics" files be treated differently than the rest? I agree that gc should easily collect old model files etc, but keeping the development of performance over time around would be nice.

We usually recommend keeping metrics files in the git (e.g. dvc run -M) for that precise reason plus if you'd lose access to your cache at some point, you would still be able to use git to see what the metric was like.

I am not familiar in detail with the implementation, but the existence of the "--projects" parameter points to an interesting problem. What should happen to files that aren't referenced anywhere in the current git repository?

In the current state of things, those files are going to be removed, that is why gc needs --projects if you are using the same cache or remote for a number of projects. The perk of doing that is that you get inter-project deduplication of cache files, which might be handy :slightly_smiling_face: In other cases, we usually recommend keeping separate caches/remotes for your projects.

a) are known where they are referenced

Currently dvc cache/remote doesn't have any notion of which project it came from, it would require some additional complications to make it aware of that.

b) are cleared for deletion (e.g. by being in a certain branch or by having never versions for the same file in the a branch)

It is clear that we need different strategies for gc. Here are a few related tickets about other strategies: https://github.com/iterative/dvc/issues/678 https://github.com/iterative/dvc/issues/377 https://github.com/iterative/dvc/issues/155 https://github.com/iterative/dvc/issues/855
And we clearly need to adopt some reasonable default approach by gathering a number of strategies :slightly_smiling_face:

Improving gc is around the top of our TODO list, so we will get to dedicating it more time soon. Thank you for the feedback, we really appreciate it! :slightly_smiling_face:

An excellent overview and answer @efiop, thank you :)

A few interesting questions have come to mind reading this. For example, should it be analyzing _all_ commits in _all_ branches in the repo by default (vs only _all_ commits in the current branch)? How do we ensure that we have all the branches we need? How do we ensure (and should we?) that when we run it, the branch is correct?

Should it be for example, a global command (in a dvc get sense) - (e.g. dvc remote gc, dvc cache gc - just throwing some ideas) that takes a Git repo link (or multiple links) and analyzes it (them) holistically? By default saving everything - all commits/tags/branches/repos. Then a few options that will be cutting the scope down - branch names, tag names, dates of commits, etc.

One of the sources of issues is we are treating local gc and remote gc the same now, which is wrong for the very usual scenario of using both - in that case local cache is temporary, so we want to say what we want to keep. However for last/permanent storage such as remote cache we usually want to keep everything referenced by default. This is complicated by the existence of other scenarios like shared machine with shared local cache. We should somehow separate this cases. BTW, when we think like that #2037 becomes obvious.

Another thing is, given distributed nature of git, it's impossible to implement dvc gc against shared cache safely - one might just have pushed something only he has ref to and we will collect that unknowingly. So we shouldn't try to be perfect, we should try to be safe though. We may not collect recent files in shared cache for example.

@Suor
Thank you for that explanation. From my point of view, the main issue of gc is a bit of a different one than you describe. It is its basic philosophy of "deleting everything that I don't know I need" rather than "deleting everything that I know I don't need".

Extracted from #2147:

Since git has a distributed nature and we are collecting everything not referenced we may remove something recently pushed and only referenced in git commits not available locally yet - not pulled or not even pushed yet by an author.

The sane method to circumvent this is providing a grace period: do not gc anything newer than N days. In the case someone has just pushed something one shouldn't and wants to remove that it should be still possible to do that:

dvc gc -c --grace-period=0

This is "I know what I am doing even thoufgh I just messed up flag" :)

@Suor grace-period won't work, because a user might use some models that he has generated earlier and by removing those we would break the project for him. Next logical step is LRU, but, once again it is not possible to properly implement it in a distributed nature. I think gc should only be run with the upstream project, to include all important references, and for forks users should use separate storage until their PR is merged.

@efiop LRU won't work either, if that model is in local cache it won't be accessed anyway. But grace period will work for most cases, the example with using some old model, which is still in use, but never referenced by any tag/branch is artificial.

@Suor I'm not saying that is not referenced by any tag/commit/branch, I'm saying that you might've added a dataset 2 years ago and still using it on your master. How would a grace period work in that case?

@efiop if it's referenced then it won't be collected, the grace period is needed to protect something only recently added and not referenced in a particular repo dvc gc is run in, but maybe referenced in other clone.

@hhoeflin "deleting everything that I don't know I need" is the right thing for local repo backed by remote. So the issue is the same approach in both situations.

Interesting discussion. Note that I added issue #2451 above, somewhat related.

Something else I can add is that, as I discovered recently, after removing data tracked by DVC (that has been pushed to the default remote), running dvc gc --cloud doesn't do anything. I would expect it to collect the data from the remote, and probably also from the local cache. But it seems that with the --cloud (or --remote) flag, GC only compares between what's on local cache and remote storage to decide what to do. It will only have an effect if you run dvc gc first to clean up the local cache.

But maybe this should its own separate issue?

@jorgeorpinel It doesn't look that way, we use the same list of used cache for local and for remote: https://github.com/iterative/dvc/blob/0.57.0/dvc/repo/gc.py#L113 (note clist is the same). So it compares what is used in your workspace to local or remote cache. If it is not that way, then we have a bug.

@jorgeorpinel , as @efiop noticed, this is not expected.

However, it is not a bug, actually, I screwed it while I was writing the following script (the same one I shared on Discord):

git init
dvc init

remote=$(mktemp -d)
dvc remote add -d remote ${remote}

dvc run -o foo "echo foo > foo"
git add -A
git commit -m "Initial commit with foo"
dvc push

dvc run -o bar "echo bar > bar"
git add -A
git commit -m "add bar"
dvc push

tree ${remote}
# /tmp/tmp.X0CngjZ6Ri
# โ”œโ”€โ”€ c1
# โ”‚  โ””โ”€โ”€ 57a79031e1c40f85931829bc5fc552
# โ””โ”€โ”€ d3
#    โ””โ”€โ”€ b07384d113edec49eaa6238ad5ff00

# Rollback
git checkout HEAD~1

# `gc` with the `--cloud` flag, is the same as
# `gc` with the `--remote` flag using the default remote,
# in this case, the one named "remote"
#
# whenever you `gc` a remote, DVC compares the local cache (`.dvc/cache`)
# with the remote and tries to remove all the records that are not in it.
#
# The command below will do nothing because `.dvc/cache` and `${remote}`
# are the same.
dvc gc --cloud

# First, you need to `gc` locally
dvc gc

# As `bar.dvc` doesn't exist in this commit, it will garbage collect it:
tree .dvc/cache
# .dvc/cache
# โ”œโ”€โ”€ c1
# โ””โ”€โ”€ d3
#    โ””โ”€โ”€ b07384d113edec49eaa6238ad5ff00

# Now, you can do:
dvc gc --cloud

# And see that it removed the files :)
tree ${remote}
# /tmp/tmp.X0CngjZ6Ri
# โ”œโ”€โ”€ c1
# โ””โ”€โ”€ d3
#    โ””โ”€โ”€ b07384d113edec49eaa6238ad5ff00

As you can see, I didn't test/verify that gc --cloud deleted the cache (I just assume it didn't, since I was biased with such behavior -- I was confusing it with status :see_no_evil: )

You can run the script and see for yourself by running a tree ${remote} after the first dvc gc --cloud.

I really apologize for the misunderstanding!

Conclusion: dvc gc --cloud does the same comparison than dvc gc.

Thanks, @efiop , for pointing out this :zap:

gc should be safer with default options

I agree with this. By default gc should delete as little as possible (maybe nothing?), and as you add more options to the command it should delete more. This makes sense because if you delete something by mistake you cannot restore it back, but if you don't delete something by mistake you can run the command again with more options in order to delete it.

I would also say that the name gc is a bit cryptic and maybe misleading. I was wondering what it means, until I read the docs. Why not call it cleanup or delete?

Another thing is that a remote storage seems to me like a backup/restore place for the local cache. If we see it like this, then the only thing that has to deal with it is the script (or scripts, or command) that makes the synchronization between the local cache and the remote storage. Depending on the sync policy, everything that is deleted on the local cache may be deleted on the remote storage as well, or everything that is deleted on the remote storage will be deleted on the local cache as well.

The point is that dvc commands should operate only on the local cache and not touch the remote storage, except for the command(s) that synchronize them (the local cache and the remote storage).

I think that the only way to make the cleanup process both safe and flexible is to make it interactive. This means that the user should have enough information to decide which cached files to delete and which ones to keep.

When I look at the cache directory, I see only files and directories with strange names, and I am not able to understand which files in my workspace they represent, when they were cached, what is the git tag/branch to which they are related (if there is one), what is the git commit message (if there is one), etc. If I had this information, then I might be able to decide easily which ones of the cache files to delete and which ones to keep. I wouldn't mind even deleting them manually (plain rm command and other shell tricks).

So, the crucial question is: can DVC show such metadata information about the cached files? Maybe it stores such information on the state DB. Extracting this information from git would not be a good idea, both for performance reasons and because DVC is supposed to be able to work independently of git (with dvc init --no-scm).

In any case, the gc command (or whatever it is called) would be safer if it does not delete the files immediately, but shows a list of files that it is going to delete and asks for confirmation. But this list of files makes sense only if we know something about them and their context (the name of the file that was cached, the time, the commit message, etc.). Otherwise, if all we see is a list of checksums, we won't be able to make a useful decision.

@dashohoxha

Extracting this information from git would not be a good idea, both for performance reasons and because DVC is supposed to be able to work independently of git (with dvc init --no-scm).

--no-scm case is the simplest one, anything that is used in the current workspace stays, rest is deleted.

Interactivity would be one of the next stages. We at least should be able to delete everything that is not used in any commit, it is basic stuff.


Ok guys. Looks like the first step in this would be to make dvc gc by default walk through all commits for all branches/tags/etc to collect everything used and then remove anything that is not in that list. Optimizations and grace periods will follow later, for now we can keep assuming that users have ceased using the shared cache or remote when running gc. Your thoughts? @iterative/engineering

@efiop
I agree with that. This seems easy to implement, and will make gc much safer than it is now. But grace period should follow soon after initial improvements.

I would throw in one other suggestion into the mix - and that is 'register' actual usage in a cache by projects that use it.

In this situation, any project that is cloned, when it is being linked to a cache needs to register all the files it wants to keep in the cache, setting a flag in the cache. When the project doesn't need the files anymore, it removes the flag. Cleanup in a cache is then simply the removal of all files without a flag.

This way, cleaning of caches could be safe from deletions that other projects used in the case of shared caches. I think the "tricky" bit is to cleanly distinguish the 'source' of any flag set inside the cache. This could be a unique name with meta-information.

any project that is cloned, when it is being linked to a cache needs to register all the files it wants to keep in the cache

How can this be done? is it with dvc add file and dvc commit file.dvc, or you are thinking about something else?

This could be a unique name with meta-information.

Are you thinking about keeping a meta file for each cached file, like this: .dvc/cache/24/90a3d39b0004e4afeb517ef0ddbe2d.meta, or like this: .dvc/meta/24/90a3d39b0004e4afeb517ef0ddbe2d?

Maybe in dvc init for the initial registering of the repo in the cache file

'dvc add' and 'dvc commit', 'dvc push' ensure that the tagged files are updated.

I was thinking of keeping one meta-file in the cache per repo. The meta-file would be a listing of all files that the linked repo wants tagged.

For garbage collection, you concatenate all meta-files, and remove all cache-files that are not listed there.

Your thoughts? @iterative/engineering

So, beeing safe dvc gc will remove nothing if there is a clear history of commits\experiments. What option should a user set to actually remove data? Use case - there is no disk space left and user needs to clean up some old experiments/commits?

What option should a user set to actually remove data? Use case - there is no disk space left and user needs to clean up some old experiments/commits?

IMHO the next in line for removal should be the commits that have no tags and are not the top of a branch. Something like this:

dvc gc --commits

GC needs to have some more aggressive strategies from the get-go even if we make it safe by default.

dvc gc --commits

Yes, something is definitely needed. Surgical GC (a single commit) might be not enough. A user might have 100 of experiments. We need an option to remove everything starting from a commit or from a date or everything except the current commit (the current default). Aggressive strategies are needed to make GC useful.

We should think about remove everything that is already on remote strategy for local gc.

We need an option to remove everything starting from a commit or from a date or everything except the current commit (the current default). Aggressive strategies are needed to make GC useful.

What about something like this:

dvc gc --before-date="some-date-and-time-stamp"
dvc gc --before-commit="commit-id"

I think it is prudent if these options apply only on the current branch, unless another branch is specified, or --all-branches is given.

On the most aggressive side, dvc gc --all may be used to remove everything, except for the current commit. This should do the same thing that dvc gc does right now.

We need an option to remove everything starting from a commit or from a date or everything except the current commit (the current default).

I think I agree the current behavior makes sense (questioning the validity of this issue). I would guess that is what most people want to do with dvc gc. Maybe just add a confirmation dialog?

keeping one meta-file in the cache per repo...

Doesn't this start making the cache dir less friendly for manual editing? I guess regular internal cache dirs could by default not have such a repo registry meta-file, and then DVC would assume they are internal. But then internal caches could not be shared by other projects...

We should think about remove everything that is already on remote strategy for local gc.

This is a great point, but also think about edge cases where there are several remotes and only sets of the data has been pushed to each one.

And also as I mentioned in Discord I'm starting to think caches and remotes are basically the same thing. Say we rename them just "storage" and differentiate them as being local (either internal or external) and remote? But this I guess is mostly unrelated to dvc gc. Maybe I'll open a separate ticket...

dvc gc --before-date="some-date-and-time-stamp"

For something like this we should probably use dateutil.parser to recon most date formats but still its easy to break by users.

@jorgeorpinel

Doesn't this start making the cache dir less friendly for manual editing?

I would argue that the cache-dir shouldn't be manually edited. And in any case, I still think that the unresolved issue of using one cache for many projects needs a better solution than is currently implemented. Registration of what a project uses in which case is a simple option. But I would like to hear other solutions (and please not to list them on the command-line; that is way too brittle).

To be clear, suggested initial implementation would look like this:

dvc gc - remove everything that is not used in any commit in the local git repo;
dvc gc -a - would behave the same as in the current implementation;
dvc gc -T - would behave the same as in the current implementation;
dvc gc --workspace - would behave the same as `dvc gc`(no args) behaves in the current implementation; Naming could be disputed for sure.

This would basically make the current approach much safer, along with implementing most of the needed logic for walking commits, so that we could easily support per-commit gc later. This approach also leaves all the useful aggressive approaches available.

Thoughts @iterative/engineering ?

I would say that we should explicitly state what we want to remove. For now there are four areas that we separate - workspace, tags, branch heads, all other commits (history). So it might looks like:

dvc gc  # remove anything not referenced from any commit nor workspace
dvc gc --remove=history  # same as above plus history
dvc gc --remove=history,tags,branches  # only leave things referenced from workspace 

I would specifically don't want using -a, -T to not confuse people about possible push/pull/gc symmetry.

@Suor could you elaborate/clarify on the semantics. I'm not sure I understand how dvc gc --remove=history is different from dvc gc --remove=history,tags, for example. Do you mean that other commits == history == commits in the current branch that do not have tags assigned on them?

I agree, that if we revert the meaning of the -a and -T (from the options that specify a scope to keep, to the options that specify a scope to remove) we should probably avoid using the same names.

Decided to go with implementing dvc gc --all-commits first and then see what way to go later.

@efiop ๐Ÿ‘ could you please summarize and describe the semantics of the first wave of changes once again please? Ideally it should become a ticket for docs :)

I would argue that the cache-dir shouldn't be manually edited.

Still, I believe the cache dir was originally designed to be easy to at least browse and understand manually. So if possible, avoiding additional meta-files in the cache dir would be ideal.

dvc gc - remove everything that is not used in any commit in the local git repo;

@efiop what happens if the Git remote has branches that the local repo doesn't? Those versions will be collected, right? (i.e. dvc gc won't git fetch --all to check this.)

I would specifically don't want using -a, -T to not confuse people about possible push/pull/gc symmetry.

This is a good point actually. I second that! But the --remove syntax seems weird. An option per type of stuff to delete seems about right. It's just about finding the right names...

To be clear, our first step is to leave dvc gc as is, but add dvc gc --all-commits flag, which would remove anything that is not used in any commit in the repository. We've decided to go this way so that we could implement all the hard internals needed for it without waiting for naming/policy discussion to end. After the first step is done, --all-commits will most likely become the default dvc gc behavior, maybe even right before or right after the initial PR is merged, so we have some time to discuss the final naming here. We'll add a doc ticket (or even will submit an initial PR) after or just before that initial PR is merged, so we don't waste our efforts.

@efiop what happens if the Git remote has branches that the local repo doesn't? Those versions will be collected, right? (i.e. dvc gc won't git fetch --all to check this.)

@jorgeorpinel We can start by only using local commits first, as the logic behind it will be the same, the difference is only that we need to pull remote branches too. As a first step, we could even put that responsibility onto the user, but we should be definitely aware of that in dvc.

Once more, to be absolutely clear, we want to move along with the implementation for the internal logic (e.g. collecting all stages for all commits in the repo) and deal with naming of flags and default behavior later as a second step, since that one would be much easier in terms of implementation.

add dvc gc --all-commits flag

This is a good strategy - to implement the logic first. The option --all-commits should be hidden until we agreed on the name and switching the behavior.

One thing to keep in mind for upcoming iterations on this task:

When we want to gc remote, we need to specify two flags: -c and -r. It seems excessive. Maybe we should reconsider this.

Related: https://stackoverflow.com/questions/58163305/dvc-gc-and-files-in-remote-cache/58164544#58164544

@pared I think you can specify either of them and it should (-c is the same as `-r ). Btw, if you see that docs can be improved/are not consistent, please don't hesitate to edit/clarify :)

@shcheklein, you are right, but only in case of -c.
So -c can work alone (but it will handle only default remote). What I had in mind was to reconsider -r behavior, so that it wouldn't need -c to work. I was thinking that it should be dvc change, rather than docs. What do you think?

@pared you are right. I've mistaken it with dvc status. I agree that they -r should imply -c. The same way as in dvc status.

Ok, guys, judgement day has come, we need to agree on further steps. So we already have --all-commits, which should become default behaviour. So we will need to think about other options. We could start by renaming --all-tags/--all-branches to --keep-all-tags/--keep-all-branches as a direct transition and then bother with the rest (e.g. selecting specific revisions/tags/branches). What do you think? @iterative/engineering

Btw, need to not forget about the point brought by @jorgeorpinel , which is that we probably need to fetch all branches/tags when running gc --all-commits. Or maybe at least warn the user for now.

So we already have --all-commits, which should become default behaviour...
We could start by renaming --all-tags/--all-branches to --keep-all-tags/--keep-all-branches as a direct transition and then bother with the rest

Iterating back on this once that is done sounds like a practical plan to me ๐Ÿ‘

start by renaming --all-tags/--all-branches to --keep-all-tags/--keep-all-branches

Should we also discuss antonyms --keep-all-tags, -t and --no-keep-all-tags, -T? Might be useful if these options are configurable defaults in dvc's system-wide config which may need selective switching off in individual commands.

@casperdcl It is a separate discussion. For now we just want to make iterations in the right direction. New options could be considered later.

There should be no way for people to specify what they want to keep, only what they want to remove. Otherwise it won't be safe.

The only exception is when everything is backed by remote and you only remove local cache if that is pushed already, then it makes sense to specify what to keep.

There should be no way for people to specify what they want to keep, only what they want to remove. Otherwise it won't be safe.

But that is dangerous because of deduplication, when you delete some cache file which might be used by someone else. It is still useful, but safe approach would probably involve ref counting in some way or another (e.g. from git or from some remote-based refs).

But that is dangerous because of deduplication, when you delete some cache file which might be used by someone else.

If it is implemented the way it is now then it will be safe, no need for ref counting. The thing should just be translated from what to remove to what to keep inside Repo.gc(). For example:

dvc gc --remove=anon,tags

translates into "keep everything referenced from workspace and all branch heads", i.e. remove things only referenced from anonymous commits and tags (specified by user explicitly) and not referenced at all (default).

@Suor Ah, so you mean that we will still collect all references from the whole repo history, right? So that if some cache file is referenced in some tag but also referenced in a branch (that we don't want to delete) then it won't be removed from cache by gc, right?

@efiop we still collect all we want to keep and remove everything outside those set. So we keep the implementation and only change the UI/API.

So if we remove anon refs we don't need to collect all the history. If we are keeping anon refs then yes, we should collect the whole history.

@Suor Sorry, I don't quite get it. So say we run:

dvc gc --remove=mytag

then it would work as:
1) collect cache used in all commits except mytag
2) delete everything that is used in mytag, except if it is in the list of used cache from 1)

right?

I have followed this thread for a long time by now, and I must admit that I am confused how this is supposed to work with cross-repo reference counting in the case of remote storage that is shared. Also it is hard to follow in this thread what is really intended by now. Would it be possible (e.g. by the person who intends to do the implementation) to outline the exact design in a wiki?

@efiop we are discussing switching defaults/changing UI now. There is no way to remove (or keep) any particular tags at the moment, so let's not discuss that. There are 4 things, which can be referenced: working tree, all branch heads, all tags, all commits. We keep any union of things referenced from those. So if we don't say we want to remove all things referenced from anon commits then only unreferenced things are removed. I.e.:

dvc gc --remove=tags

is the same as

dvc gc

Because you need to explicitly say that you don't want things referenced from ordinary commits, so if you want to remove all anon commits and tags you say:

dvc gc --remove=commits,tags

I have followed this thread for a long time by now, and I must admit that I am confused how this is supposed to work with cross-repo reference

@hhoeflin That would work the same as it was before: through --projects ๐Ÿ™‚

For now we are discussing current steps that are: move to --all-commits by default and transform old options (--all-tags, --all-branches) into a new CLI. There will be next steps after that though, for sure. So stay tuned! ๐Ÿ™‚

@Suor Ok, so

OLD: dvc gc
NEW: dvc gc --remove=commits,tags,branches

OLD: dvc gc --all-commits
NEW: dvc gc

OLD: dvc gc --all-tags
NEW: dvc gc --remove=branches,commits

OLD: dvc gc --all-branches
NEW: dvc gc --remove=tags,commits

? Did I get you right?

@efiop yep

Few opinions that I'd like to add:

  1. We should move away from the current gc terminology (it's a very git-centric term). Let's have dvc cache remove/clean.
  1. Separate cleaning up remote and local cache: dvc remote vs dvc cache. And, let's move config commands to dvc config or, separately discuss those.

  2. gc/clear/remove should really be about removing stuffs. Keeping things sound reverse thinking and a weird semantic to use when removing stuffs (though, we do need this for few edgecases).

  3. If the commands are intuitive, it would be more helpful than the keeping safe thing.
    Of course, we prompt and try to be more transparent about removing things.

  4. But, Of course, let's iterate through history and try to be safe. But, I'd propose we only iterate to a certain depths (use a sane default? perhaps 10% of the history or top 100s/1000s). But, we should tell the user what's happening and the depths can be configurable with a flag.

I could see a few uses of removing caches, which I have enlisted below.

  1. I'm out of space. Do something (does not matter anything specific to remove/keep):
I believe these scenarios could be intelligently solved by users if we provide more information on

what's eating up the space (it might be hard, but we could try anyway). So, something like dvc cache stats? And, they can remove by using different cache clean commands (read along).

  1. I experimented with a branch, it didn't pan out. Clean cache from that branch (but, I'd like to keep the branch?).
dvc cache clean --branch=<branch(es)>
  1. Need to remove specific file from cache:
dvc cache remove <dvcfile(s)>
  1. Remove what's already pushed?
dvc cache clean --already-pushed
dvc cache clean --already-pushed -r <remote_name>
  1. Same scenarios as 1, 2 and 4 but, I'd like to keep something from given SHA/branch_head/tags as well:
dvc cache clean --exclude=<checksum(s)>
dvc cache clean --keep=<checksum(s)> #or, perhaps keep?

Obviously, we could have current --all-branches, --all-tags and --all-commits like functionality
on top of it, but, let's keep it out for first iteration (or, unless someone requests it).

  1. Remove everything that is currently unused:
dvc cache clean --unlinked/--unused/--workspace

Or, only remove via --aggressive option, which is pretty intuitive and easily understood

dvc cache clean --aggressive
  1. Remove unused from commits before:
    Like I said, we should only iterate to certain depths. So, this should be by default supported. But, could be configurable from --depths flag.
dvc cache clean --depths=<number>

Obviously, all of these things are already discussed. But, this is what I liked (and, gathered from different discussions).

There can be some combinations of these scenarios (eg: keeping some + removing others).

P.S. The only thing I don't really like is, remove and clean are synonymous. But, I feel like these behavior should be separated.

Mu suggestion are pretty much the same as before. I still think that we got stuck because we moved too far from the initial issue(s) in this ticket and expanded the scope way too much:

  1. Do first the thing we wanted to do - make gc safer. Two options - do not allow running gc in its current default at all (ask to specify -f or something). _Or_ make --all-commits the default and introduce the --workspace, -w option. No significant changes to any options, symmetry, etc. Has nothing to do with remove discussed above.

  2. Move remove semantics (pretty much all the discussion above - its need overall, speific command names -remove or clean, what are other commands needed to make that workflow usable, etc) to a separate ticket. I don't see any reason to remove or redo gc itself and it feels that providing some additional remove-like functionality is a separate topic.

  3. (note relate directly to this ticket) As a next step for gc/pull/push/show metrics/etc start introducing -n and filters to define scope by names (tag names, commit names, etc, etc). This is a much needed change that we started implementing initially and went into this hole of discussing removes, etc. To my mind this is more important than even providing remove.


@skshetry some additional consideration, most of them should go into a separate ticket(s) to my mind:

Keeping things sound reverse thinking and a weird semantic to use when removing stuffs

To me there is no superior semantics. Both of them are clear and have their place. Remove when it's exactly clear what is supposed to be removed (you put some burden on user's shoulders in terms of somehow getting this knowledge) - e.g. some large dataset that is unused, keep when it's exactly clear what I'm interested in - like some scope of experiments (it correlates very well with what we use for push/pull/show/, etc - users operate in terms of experiments they need).

We should move away from the current gc terminology (it's a very git-centric term). Let's have dvc cache remove/clean.

I don't quite understand/feel statement. Could you elaborate, please? gc is a very precise term and usually intuition is that is counts/collects references in some way and then removes everything that is not referenced. This is exactly what out gc does. I also don't see why is it git-centric. To me it corresponds more with GC in programming languages :)

And, let's move config commands to dvc config or, separately discuss those.

Yes, let's try to stay focused. See above. I think the whole ticket is not about something completely else now.

Separate cleaning up remote and local cache: dvc remote vs dvc cache.

As @dmpetrov pointed we should provide an option to dvc cache to remove stuff in remote along the way (removing both in cache and remote effectively). It is a regular workflow and this is how gc operates now in -c`-rmodes. It's a good question if we need a separate command togc\clean\remove` on the remote alone? I don't know. As I mentioned above, my opinion - it should be a separate ticket. We try to solve to many things at once.

Of course, we prompt and try to be more transparent about removing things.

as we discussed in the #2498 , it looks like the agreement is to not use prompts but fail instead

Yes, instead of walking a step at a time, I proposed wild lot of changes, and that's a failure in my part. But, yes, we can do what you proposed as a first iteration. The discussions or questions, that I raised is just that, discussion.

I don't quite understand/feel statement. Could you elaborate, please? gc is a very precise term and usually intuition is that is counts/collects references in some way and then removes everything that is not referenced. This is exactly what out gc does. I also don't see why is it git-centric. To me it corresponds more with GC in programming languages :)

gc is about removing unused stuffs automatically. ref-counting is an implementation detail, and we also doing the same could be co-incidental (or, not :) ). And, unused in our case is debatable (related to history? workspace? heads?). That's why we are focused on safety issues.

In all of the programming languages and in git, gc is automatic, meaning, user is not notified of the changes nor the user tells the tool to do it (usually). So, what we are really talking about is, cache cleaning/removing.

Also, this stems from how the user is using gc: to clean/remove items from the cache. If they are explicitly telling to clean stuffs, it's good to offer support for removing stuffs.

P.S: I'm not trying to say, gc is not important. But, it feels people are really trying to remove stuffs off of the cache that they think is unused than removing garbage.

And, unused in our case is debatable (related to history? workspace? heads?). That's why we are focused on safety issues.

I believe, it's a very precise definition in our case - everything that is not referenced in DVC-files anymore. And the scope of commits we collect references from corresponds one-to-one with all other commands in DVC - push/pull/metrics show, etc, meaning that is corresponds well with the higher level abstractions we operate - experiments.

gc is about removing unused stuffs automatically

Never felt this way regarding it being automatic. It feels like an implementation details on itself. If it's confusing to a lot of people, I'm fine to rename it to clean or something else. But keep the semantics. If we need remove - let's discuss it separately.

If they are explicitly telling to clean stuffs, it's good to offer support for removing stuffs.

All of this feels like unnecessary complication or at least does not make this workflow superior in any way. With "keep" semantics I don't care about old stuff, I just keep some experiments I care about - the most recent stuff. Also, I don't want to do additional commands to figure our what to remove. It feels too low level - like du -hs + rm. Not saying that it does not have its place, again. It just a separate remove like logic that we should be introducing separately.

Also, this stems from how the user is using gc: to clean/remove items from the cache. If they are explicitly telling to clean stuffs, it's good to offer support for removing stuffs.

that's the good point. And see my other responses - users use it in different modes.

@shcheklein and @skshetry you are going to the rabbit hole again. GC/not-GC is not the most important part here and it is not very important how it corresponds to programming languages. What is the important part - how to clean space?

If we just make GC safe - users won't be able to clean the space. So, we cannot take this step without introducing more aggressive strategies or at least without keeping the current one (probably as an optional). Also, the specific remove is needed dvc cache remove <datafile-or-dir>.

Actions that I'd suggest (the commands' names TBD) sorted by priorities:

  1. In this ticket. Replace the current default scenario to the safe one and keep the current one as an option (probably in the same command).

    1. (optional) Consider renaming dvc gc to dvc cache gc.

  2. Create a separate ticket for the specific remove dvc cache remove <datafile-or-dir>.
  3. Start introducing -n for others commands (see @shcheklein comment).
  4. Think about other gc/delete strategies.

Ok, had another round of discussion with @dmpetrov and @shcheklein . Here are the proposed first steps:

  1. Forbid dvc gc(no arguments) and create -w|--workspace flag to work as current behaviour; Remove confirmation prompt that we already have;
  2. For --all-tags and --all-commits assume -w automatically to match previous behaviour;
  3. Introduce -n(not sure about the best long option name) that will tell how much history commits to keep. So dvc gc -n 10 will keep 10 commits back in your workspace. dvc gc -n 10 -a will keep 10 commits back in all tags and the workspace and so on.

And the rest we can leave for the future discussion. What do you think guys, can we start with these steps?

@efiop thank you for the brief summary. Let's start with this.

PS: you can keep only (1) and (2) in this issue and make (3) as a separate issue if you think the scope is too big. Ups to you...

Ok, had another round of discussion with @dmpetrov and @shcheklein . Here are the proposed first steps:

  1. Forbid dvc gc(no arguments) and create -w|--workspace flag to work as current behaviour; Remove confirmation prompt that we already have;
  2. For --all-tags and --all-commits assume -w automatically to match previous behaviour;
  3. Introduce -n(not sure about the best long option name) that will tell how much history commits to keep. So dvc gc -n 10 will keep 10 commits back in your workspace. dvc gc -n 10 -a will keep 10 commits back in all tags and the workspace and so on.

And the rest we can leave for the future discussion. What do you think guys, can we start with these steps?

This is what you proposed last week as well, right @efiop? I stalled (eh? :smile:) on it a bit as I wanted to discuss more on the usecases (ref: https://github.com/iterative/dvc/issues/2325#issuecomment-586124729) that I felt like were important from reading this issue.

Though, we can start with this than getting stuck. Let's roll with these for now. :slightly_smiling_face:

Introduce -n(not sure about the best long option name)

How about --tail? though that would not match -n, on the other hand -t could be confused with -T

Coming late to this thread.

I expected to have a --dry-run option in dvc gc that would just report what would happen, without actually removing anything. That's similar in spirit to the "require --force to do anything" suggestion, but I think it's a little more straightforward and explicit.

I'd probably suggest requiring either --dry-run or --force to be explicitly specified, just like git clean does.

I also find the wording of the --cloud and --remote options confusing:

  • --cloud: "Collect garbage in remote repository"
  • --remote: "Remote storage to collect garbage in."

To me, that actually sounds like we will put the garbage into the remote repository, i.e. find garbage locally and upload it to the remote. I think if you just changed "in" to "from", that would help; otherwise I think the following would be more clear:

  • --cloud: "Remove garbage from remote repository"
  • --remote: "Remote repository to remove garbage from"

git clean is a good point. I use it all the time but never without arguments - to the extent that I forgot that it raises an error without args. Though you could argue that makes it a little pointless (surely the default should be dry run).

@casperdcl @kenahoo, it's not really possible to provide much information in --dry-run because we only delete what we don't know (though we can probably dig into the older commits and try to find out more information).

I also find the wording of the --cloud and --remote options confusing:

Thanks for pointing that out. It seems to be a help message. I'll try making a PR this week for that.

that the only way to make the cleanup process both safe and flexible is to make it interactive
Interactivity would be one of the next stages

I just want to also vote for considering interactivity when running dvc gc without flags. Or with an -i option not to interfere with shell scripting.

Related discussion: https://github.com/iterative/dvc.org/pull/1023#discussion_r387225516

@skshetry I don't think --dry-run has to be hugely informative, it just mainly has to disable the actual deletion.

@kenahoo If you don't mind --dry-run to be almost as slow as a non-dry one then that is not hard to implement. The thing is we don't really know what we will delete until we list all the cache/all the remote cache, which is the slow part.

I just want to also vote for considering interactivity when running dvc gc without flags

Interactivity is a very good idea but not as a default behavior. It is a bit related to prompt\not-prompt problem (#2498) because it complicates scripting.

From https://github.com/iterative/dvc/issues/2451

Maybe a report of what was collected (and from where) would be more informative. This way the user could know something actually happened, without having to understand and check DVC internals before and after running the command.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jorgeorpinel picture jorgeorpinel  ยท  45Comments

drorata picture drorata  ยท  46Comments

dmpetrov picture dmpetrov  ยท  35Comments

dmpetrov picture dmpetrov  ยท  34Comments

mdekstrand picture mdekstrand  ยท  43Comments