Garbage collection is, by design, very destructive and irreversible. Wouldn't it make sense to add a --dry-run flag that just lists what would happen if gc were run?
Hi @drorata !
Great idea! 🙂 We'll look into it. Thanks for the feedback!
At least for learning purposes, I'm also looking into it.
As far as I can tell, it is fairly straightforward to print a message to indicate which file in the cache is going to be deleted.
However, this would be of limited use because cache files would be printed as MD5 hashes. The user is more interested in the actual file/directory that each hash corresponds to.
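To make the limitation concrete, here is a minimal sketch of such a listing. It is not DVC's implementation: `dry_run_gc` and `used_hashes` are hypothetical names, and the cache layout (a two-character directory followed by the rest of the hash) is an assumption that may not hold for every DVC version.

```python
import os


def dry_run_gc(cache_dir, used_hashes):
    """Return the cache files a gc run would delete, without deleting anything.

    Assumes (hypothetically) a layout of <cache_dir>/<2 chars>/<rest of hash>.
    """
    to_remove = []
    for prefix in sorted(os.listdir(cache_dir)):
        subdir = os.path.join(cache_dir, prefix)
        # Skip anything that does not look like a two-character hash bucket.
        if len(prefix) != 2 or not os.path.isdir(subdir):
            continue
        for rest in sorted(os.listdir(subdir)):
            md5 = prefix + rest
            if md5 not in used_hashes:
                to_remove.append((md5, os.path.join(subdir, rest)))
    return to_remove


def print_dry_run(cache_dir, used_hashes):
    """Print one line per cache file that would be removed."""
    for md5, path in dry_run_gc(cache_dir, used_hashes):
        print(f"would remove {path} (md5: {md5})")
```

All this can report is a hash and a cache path, which is exactly the problem: nothing here tells the user which workspace file the hash used to belong to.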
Given that dvc gc would likely be called after dvc purge, which removes the .dvc files, the .dvc files cannot be used to map a cache file to user-readable output.
Is it possible to recover the user-readable file/directory name even after the .dvc files have been removed?
@tdeboissiere Nope, that is not possible right now. However, we could try to utilize git history of a project in order to retrieve that information, similar to https://github.com/iterative/dvc/issues/1234 .
What is dvc purge?
@drorata @tdeboissiere probably meant dvc remove --purge.
@efiop Exactly.
Using git history is a possible way, but as mentioned by @dmpetrov, this would tie us to a specific SCM, and we would need to have committed the corresponding .dvc file.
In that case, perhaps introducing something like dvc remove --purge_gc, optionally with a --dry flag, to carry out the purge plus the corresponding garbage collection would be a more straightforward solution?
@tdeboissiere sounds good. Maybe even --drop-cache or something, so it is even more straightforward.
@iterative/engineering, is this going to be affected after #2325? (Thinking about including this one for hacktoberfest, but it might not be that clear.)
I don't think #2325 affects this, but it's also not exactly clear what the interface for this feature should look like. Can we specify the output of dvc gc --dry-run? Essentially it can only print a lot of different hashsums, right? Is that valuable enough?
Ideally, I think it should be able to "describe" the removed file. For example: removing model.pkl from branch master, revision: 1.0. Though it surely will not be able to describe a file whose commit was somehow removed, for example, squashed.
As mentioned by @efiop (https://github.com/iterative/dvc/issues/1511#issuecomment-456751851), by digging the Git history it may be possible to find the relevant information for each cached file (i.e. the name of the corresponding .dvc file, the commit message, the revision id, and maybe branch and tag name).
But this might be very inefficient for a big Git repo (with hundreds or thousands of commits).
Unless somehow this information is indexed and saved in a DB.
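As a rough sketch of what such an index could look like: the code below builds an in-memory map from hash to the .dvc file, path, and revision that referenced it. Everything here is hypothetical (`parse_outs`, `build_index`, `describe`, and the `snapshots` input), and it assumes each .dvc file lists its outputs under `outs:` with `md5` and `path` keys. Collecting the snapshots from Git history (the expensive part) is left out.

```python
from collections import defaultdict


def parse_outs(dvc_text):
    """Extract (md5, path) pairs from the 'outs' list of a .dvc file's text.

    A deliberately naive line-based parser, used only to avoid a YAML dependency.
    """
    entries, current, in_outs = [], None, False
    for line in dvc_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not line.startswith((" ", "-")):
            # A top-level key: we are inside 'outs' only if this line is 'outs:'.
            in_outs = stripped == "outs:"
            continue
        if not in_outs:
            continue
        if stripped.startswith("- "):
            # A new list item starts a new output entry.
            current = {}
            entries.append(current)
            stripped = stripped[2:]
        if current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return [(e["md5"], e["path"]) for e in entries if "md5" in e and "path" in e]


def build_index(snapshots):
    """Map md5 -> list of references.

    `snapshots` is an iterable of (revision, dvc_file_name, dvc_file_text).
    """
    index = defaultdict(list)
    for revision, dvc_file, text in snapshots:
        for md5, path in parse_outs(text):
            index[md5].append(
                {"path": path, "dvc_file": dvc_file, "revision": revision}
            )
    return dict(index)


def describe(md5, index):
    """Return a human-readable description of a cache entry, if one is known."""
    refs = index.get(md5)
    if not refs:
        return f"{md5} (no known reference)"
    ref = refs[0]
    return f"{md5} -> {ref['path']} ({ref['dvc_file']} at revision {ref['revision']})"
```

Persisting the resulting dictionary (for example as JSON or in SQLite) would avoid re-walking the history on every run, at the cost of keeping the index up to date.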
The problem is that, by definition, GC removes "garbage", and in a lot of cases it won't be possible to find any references even if we were able to analyze the history. What should we do with those files? Especially if we change the default behavior (analyze history and keep whatever it references), gc will be removing only files that are not referenced anywhere.
Closing as stale. gc --dry-run itself doesn't get many requests these days, and we have https://github.com/iterative/dvc/issues/2325 as an umbrella issue.