It would be useful to be able to clear up disk space on one's local machine by removing the cache files of specific targets.
Hi @shaunirwin!
The problem with applying gc to specific dvc files is that the same cache files might be used by other dvc files, so it is not trivial to solve in the general case. Maybe you could describe your scenario so we could try to see if there are better approaches in your case?
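To make the problem concrete, here is a tiny, purely illustrative sketch (a hypothetical helper, not actual DVC code) of the check a per-target gc would have to do: a cache object is only safe to delete if no other .dvc file still references its hash.

# Hypothetical sketch, not DVC's real API: a per-target gc needs a global view,
# because the cache is content-addressed and several .dvc files may share objects.
def removable_hashes(target_hashes, other_dvc_file_hashes):
    """target_hashes: hashes referenced by the .dvc files we want to collect.
    other_dvc_file_hashes: iterable of hash sets, one per remaining .dvc file."""
    still_used = set().union(*other_dvc_file_hashes)
    # Only hashes referenced exclusively by the target are safe to delete.
    return set(target_hashes) - still_used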
Thanks,
Ruslan
I think I have a possible use case for this request.
TL;DR: there should be a way to dvc gc cache files which are already pushed to remote DVC storage.
Detailed use case
I have two dirs under dvc control: data1 and data2. Here is the file structure:
├── data1
├── data1.dvc
├── data2
└── data2.dvc
At first I worked with data1:
$ dvc add data1
$ git add .
$ git commit
$ dvc push
$ git push
Since I committed and pushed data1, it hasn't changed. Now I'm working with data2. The problem is that data1 is big and I'm running out of disk space on my local machine. I want to delete data1 both from my workspace and from the cache. Deleting the data1 cache is safe because it is already pushed to remote DVC storage. I can delete data1 from the workspace with dvc remove, but I can't clear my cache because dvc gc considers the data1 cache as "current" and does nothing.
P.S.:
I'm not sure that dvc gc for specific targets is the best option. Probably dvc gc should provide an option like "clear all cache files which are already synchronized with a specific remote storage".
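Conceptually that option is just a set intersection between the local cache and the remote. A rough sketch with made-up helper names (not DVC's real API):

# Rough sketch, hypothetical names: local cache objects that the remote already
# has can be dropped, since they can always be re-fetched with dvc pull.
def hashes_safe_to_drop(local_cache_hashes, remote_hashes):
    return set(local_cache_hashes) & set(remote_hashes)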
My use case:
I'm working on multiple projects in parallel that all have large datasets tracked by dvc. They are located on a (small) SSD, so I can't have all datasets on disk simultaneously. DVC-tracked files are backed up on an SSH remote. I want to be able to quickly clear the space taken by one project's files so that I can dvc checkout the files from another.
In terms of implementation, it would look like this:
1) Add a new flag to https://github.com/iterative/dvc/blob/master/dvc/command/gc.py . Not sure what to call it though, maybe anyone has any ideas about it? 🙂 Seems like it should be an additional flag for -c|--cloud though (see the argparse sketch after the snippet below).
2) Add support for that new flag in https://github.com/iterative/dvc/blob/master/dvc/repo/gc.py . To support it, if the flag is specified, in the last if cloud block, instead of used we should supply all of the local cache. E.g. a POC would look something like:
from dvc.cache import NamedCache
...
if cloud:
    if our_new_flag:
        # Build a NamedCache entry for every checksum currently in the local
        # cache and pass that along as `used`.
        used = [NamedCache.make("local", checksum, None) for checksum in self.cache.local.all()]
    _do_gc("remote", self.cloud.get_remote(remote, "gc -c").gc, used)
That last "used =" is needed to transform str checksums returned by all()
to NamedCache, which is expected by gc()
. Might be a nicer way to organize this, but it def gets the job done 🙂 And that is pretty much it in terms of implementation.
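For reference, step 1 boils down to one more boolean option on the gc command. A minimal, standalone argparse sketch; gc_parser and the --all-pushed name here are placeholders, not the actual code in command/gc.py:

import argparse

# Hypothetical, self-contained sketch; the real flag would be added to the
# existing gc parser, and "--all-pushed" is just a placeholder name.
gc_parser = argparse.ArgumentParser(prog="dvc gc")
gc_parser.add_argument(
    "--all-pushed",
    action="store_true",
    default=False,
    help="Collect local cache objects that already exist on the remote.",
)
args = gc_parser.parse_args(["--all-pushed"])
assert args.all_pushed is True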
@efiop I'd like to take a swing at it :)
@kaiogu Please let us know if you need any help 🙂
Hi, I was on a long vacation, but am back now and will start on this today.
@kaiogu Sounds good, let us know if you have any questions :) Btw, we have a dev-general channel in our discord, please feel free to join 🙂
@efiop, I was thinking of calling it "safe" because it would only clear the cache if there was a backup safely stored in a remote. I'll let you know if I get stuck :)
@kaiogu safe is too generic for my personal taste 🙂, maybe @iterative/engineering have any thoughts/ideas about it?
local, pushed, synced?
Another option is to include this into push somehow: dvc push --remove. Or dvc remote gc, as was suggested during the gc discussion.
I would expect dvc remote gc to gc the remote, not the local cache.