Dvc: gc: remove cache files that were already pushed to remote

Created on 22 May 2019  ·  14 Comments  ·  Source: iterative/dvc

It would be useful to be able to clear up disk space on one's local machine by removing the cache files of specific targets.

enhancement feature request help wanted p2-medium

All 14 comments

Hi @shaunirwin !

The problem with applying gc to specific dvc files is that the same cache files might be used by other dvc files, so it is not trivial to solve in the general case. Maybe you could describe your scenario so we can see whether there is a better approach in your case?

Thanks,
Ruslan
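
To make the shared-cache concern concrete, here is a minimal sketch in plain Python. It does not use any DVC API: the function name `removable_checksums` and the `used_by` mapping (dvc file name to the set of cache checksums it references) are hypothetical, chosen only for illustration. The point is that a cache file can be dropped for one target only if no other dvc file still references it.

```python
def removable_checksums(target, used_by):
    """Return the checksums referenced by `target` and by no other dvc file.

    `used_by` is a hypothetical mapping: dvc file name -> set of checksums.
    """
    others = set()
    for name, checksums in used_by.items():
        if name != target:
            others |= checksums
    # Anything also referenced elsewhere must stay in the cache.
    return used_by[target] - others


# Illustrative data: "bbb" is shared by both dvc files.
used_by = {
    "data1.dvc": {"aaa", "bbb"},
    "data2.dvc": {"bbb", "ccc"},
}
print(sorted(removable_checksums("data1.dvc", used_by)))  # ['aaa']
```

Here only "aaa" could be removed for data1.dvc, because "bbb" is still needed by data2.dvc.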

I think I have a possible use case for this request.

TL;DR: there should be a way to dvc gc cache files which are already pushed to remote DVC storage.

Detailed use case

I have two dirs under dvc control: data1 and data2. Here is the file structure:

├── data1
├── data1.dvc
├── data2
└── data2.dvc

At first I worked with data1:

$ dvc add data1
$ git add .
$ git commit
$ dvc push
$ git push

data1 hasn't changed since I committed and pushed it. Now I'm working with data2. The problem is that data1 is big and I'm running out of disk space on my local machine. I want to delete data1 from both my workspace and my cache. Deleting the data1 cache is safe because it has already been pushed to remote DVC storage. I can delete data1 from the workspace with dvc remove, but I can't clear my cache because dvc gc considers the data1 cache "current" and does nothing.

P.S.:

I'm not sure that dvc gc for specific targets is the best option. Perhaps dvc gc should instead provide an option like "clear all cache files which are already synchronized with a specific remote storage".
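
The idea of "clear all cache files which are already synchronized with a remote" can be sketched as follows. This is only an illustration under stated assumptions, not DVC's implementation: it assumes the local cache and the remote are both plain directories with the same relative layout, and the function name `remove_pushed` and its `dry_run` parameter are hypothetical.

```python
from pathlib import Path


def remove_pushed(local_cache, remote, dry_run=True):
    """Delete local cache files that also exist at the same relative path
    in `remote`. Returns the relative paths that were (or would be) removed.

    Assumption: both arguments are directories sharing one layout.
    """
    local_cache, remote = Path(local_cache), Path(remote)
    removed = []
    # sorted() materializes the listing before any file is deleted.
    for path in sorted(local_cache.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(local_cache)
        if (remote / rel).is_file():
            removed.append(rel.as_posix())
            if not dry_run:
                path.unlink()
    return removed
```

Files that exist only locally (not yet pushed) are left untouched. The default dry_run=True only reports what would be removed, which seems a sensible default for a destructive operation.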

My use case:
I'm working on multiple projects in parallel that all have large datasets tracked by dvc. They are located on a (small) SSD, so I can't have all datasets on disk simultaneously. DVC-tracked files are backed up on an SSH remote. I want to be able to quickly free the space used by one project's files so that I can dvc checkout the files from another.

In terms of implementation, it would look like this:
1) Add a new flag to https://github.com/iterative/dvc/blob/master/dvc/command/gc.py . Not sure what to call it though, maybe someone has ideas? 🙂 Seems like it should be an additional flag for -c|--cloud though.
2) Add support for that new flag to https://github.com/iterative/dvc/blob/master/dvc/repo/gc.py . To support it, if the flag is specified, then in the last if cloud block we should supply the entire local cache instead of used. E.g. a POC would look something like

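from dvc.cache import NamedCache
...
if cloud:
    if our_new_flag:  # placeholder name for the proposed flag
        # Treat every checksum in the local cache as "used": all() returns
        # str checksums, which are wrapped in NamedCache as gc() expects.
        used = [
            NamedCache.make("local", checksum, None)
            for checksum in self.cache.local.all()
        ]
    _do_gc("remote", self.cloud.get_remote(remote, "gc -c").gc, used)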

That last "used =" is needed to transform str checksums returned by all() to NamedCache, which is expected by gc(). Might be a nicer way to organize this, but it def gets the job done 🙂 And that is pretty much it in terms of implementation.

@efiop I'd like to take a swing at it :)

@kaiogu Please let us know if you need any help 🙂

Hi, I was on long vacation, but am back now and will start this today.

@kaiogu Sounds good, let us know if you have any questions :) Btw, we have a dev-general channel in our discord, please feel free to join 🙂

@efiop, I was thinking of calling it "safe" because it would only clear the cache if there was a backup safe in a remote. I'll let you know if I get stuck :)

@kaiogu safe is too generic for my personal taste 🙂, maybe @iterative/engineering have any thoughts/ideas about it?

local, pushed, synced?

Another option is to include this into push somehow:

dvc push --remove

Or dvc remote gc as it was suggested during the gc discussion.

I would expect dvc remote gc to gc on the remote, not in the local cache.
