There are use-cases where you want to completely purge a file permanently from DVC history. For instance, if given a database of personal data, and the user requests to have their data deleted, it should be deleted in such a manner that it can not be recovered.
I understand that dvc gc is capable of modifying the DVC cache such that the history is also modified. However, it seems that at the moment, there is no way to specify which file/directory you want deleted; so it deletes everything that is not being currently tracked. This might not be preferred because there could be deleted files in your database that you still want to be part of your history.
It would be great if a dvc gc-like feature were added that lets you specify a file/dir name, so that only that file or directory gets deleted from the history.
Conversation regarding this in the forum:
https://discuss.dvc.org/t/deleting-a-dvc-tracked-file-from-history/556
There are use-cases where you want to completely purge a file permanently from DVC history. For instance, if given a database of personal data, and the user requests to have their data deleted, it should be deleted in such a manner that it can not be recovered.
For this kind of example, when you say "purge a file permanently from DVC history", do you mean to just remove a particular file from DVC cache + DVC remote?
Or are we talking about also getting into re-writing git history, where the relevant file or directory also may need to be completely removed via git filter-branch plus removing the relevant entries in prior revisions of .dvc files?
I understand that dvc gc is capable of modifying the DVC cache such that the history is also modified. However, it seems that at the moment, there is no way to specify which file/directory you want deleted; so it deletes everything that is not being currently tracked. This might not be preferred because there could be deleted files in your database that you still want to be part of your history.
dvc gc doesn't modify your repo "history". That is really tracked through git commit history. dvc gc can remove cache and remote objects, but what gc considers as "currently tracked" depends on the arguments used. gc --workspace will remove everything that is not referenced by your current workspace, but other options like --all-commits exist, which will tell gc to keep any files which are referenced in any commit in your entire git history.
So deleting a file from DVC cache and remotes would prevent anyone from being able to access the actual file data, but historical references noting that the file existed (along with some metadata about the file) would still exist in git.
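To make the scope difference between gc --workspace and --all-commits concrete, here is a toy model in Python. It is only an illustration of the "keep what is referenced, remove the rest" idea described above, not DVC's actual implementation; the function name, the hash strings, and the commit names are all made up for the example.

```python
from typing import Dict, Set


def objects_to_remove(
    cache: Set[str],
    commits: Dict[str, Set[str]],
    workspace: Set[str],
    all_commits: bool = False,
) -> Set[str]:
    """Return the cached object hashes a gc-like pass would delete.

    Toy model only: `cache` is every hash in the cache, `commits` maps a
    commit name to the hashes it references, and `workspace` is the set
    of hashes referenced by the current checkout.
    """
    # The workspace is always part of what is kept.
    keep = set(workspace)
    if all_commits:
        # Also keep anything referenced by any commit in history.
        for referenced in commits.values():
            keep |= referenced
    return cache - keep


# Hypothetical data: three cached objects, two commits.
cache = {"aaa", "bbb", "ccc"}
commits = {"c1": {"aaa", "bbb"}, "c2": {"bbb"}}
workspace = {"bbb"}

workspace_only = objects_to_remove(cache, commits, workspace)
with_history = objects_to_remove(cache, commits, workspace, all_commits=True)
```

In this model the workspace-only pass removes both "aaa" and "ccc", while the all-commits pass removes only "ccc", because "aaa" is still referenced by an older commit. Neither mode lets you say "remove exactly this one object", which is the gap the original request is about.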
Thank you for the explanation. I guess "DVC history" was the wrong choice of words. I certainly didn't mean doing a re-write of git history. If working on a shared project, that would be a nightmare, right?
I meant removing a particular file from DVC cache + DVC remote. Just so that the file cannot be recovered by checking out a previous DVC hash (from a previous git commit).
However, if this gets implemented, I think it should probably wait until this issue gets resolved first, because at the moment, if a file is removed from the DVC cache and a commit using that file is checked out, DVC will throw an error instead of just checking out the subset that is still available.
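For what it's worth, the targeted removal from the local cache could be sketched roughly like this. This is not an existing DVC command or API, just a hypothetical helper. It assumes cache objects are stored under the cache directory as the first two hex characters of the hash followed by the remaining characters; the exact cache root depends on the DVC version, so it is passed in rather than hard-coded. It does not handle tracked directories, and it does not touch the remote, which would need the same object deleted separately.

```python
from pathlib import Path


def cache_object_path(cache_dir: Path, md5: str) -> Path:
    # Assumed layout: <cache_dir>/<first 2 hex chars>/<remaining chars>.
    return cache_dir / md5[:2] / md5[2:]


def purge_from_cache(cache_dir: Path, md5: str) -> bool:
    """Delete one cached object by hash.

    Returns True if a file was deleted, False if it was not in the cache.
    Hypothetical helper for illustration; not part of DVC.
    """
    path = cache_object_path(cache_dir, md5)
    if path.is_file():
        path.unlink()
        return True
    return False
```

The md5 value would come from the .dvc file in whichever old git commit referenced the data. After a purge like this, that commit still records the hash and metadata, but the content itself can no longer be restored from the local cache.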