When running a dvc pull
to fetch large files, interruptions at the download step do not clean the temp files in the cache.
I'm pulling files from a Google Storage remote. Multiple interruptions lead to lots of wasted space and eventually fill the disk, especially when iterating on a pipeline with large files.
A dvc gc
does not delete these files.
I started adding file cleaning myself by editing the RemoteGs.download
method to eventually find out that we cannot catch the KeyboardInterrupt
in the futures launched by the ThreadPoolExecutor. At first glance, fixing this does not seem trivial. What do you think ?
Here's an example of leftover files that accumulate (under the 30
directory)
.dvc/cache
โโโ 30
โย ย โโโ 6fef47bb205ecc71e7c5c79113818a.3d12ff9e-c67f-4e17-a771-62582c035353.tmp
โย ย โโโ 6fef47bb205ecc71e7c5c79113818a.4f729150-4961-44df-8327-855b48dfddc5
โย ย โโโ 6fef47bb205ecc71e7c5c79113818a.9c0756e7-7343-48e0-8146-8345e280e50f
โย ย โโโ 6fef47bb205ecc71e7c5c79113818a.c8bfc484-d205-45be-a1ab-48a9420e1a0a.tmp
โย ย โโโ 6fef47bb205ecc71e7c5c79113818a.e11ade77-f69c-4e67-8bb7-6e7e817712fe
โย ย โโโ 6fef47bb205ecc71e7c5c79113818a.f74fe939-1a44-4c2b-bd7c-446b49b08445.tmp
โย ย โโโ 6fef47bb205ecc71e7c5c79113818a.fcced860-64a8-4675-90c2-0a68e9405f31.tmp
โโโ ea
โโโ ffa5834c1819992cf5e94611d8f300
Setup : dvc version 0.29.0 freshly installed via pip (python 3.6)
Hi @jeremcs !
Great catch! The intended purpose of this was to add resume-download support at some point, but we just didn't get to implementing it for gs yet. I see you've been deep into the code already with the intention to submit a fix for this(i suppose), so maybe you would like to look into implementing resumable download support for gs instead of trying to remove these? :slightly_smiling_face:
A
dvc gc
does not delete these files.
This is pretty weird. It should delete them. Here is a reproducer script:
#!/bin/bash
set -e
set -x
rm -rf myrepo
mkdir myrepo
cd myrepo
git init
dvc init
echo foo > foo
dvc add foo
echo bar > .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00.3d12ff9e-c67f-4e17-a771-62582c035353.tmp
tree .dvc/cache
dvc gc -f
tree .dvc/cache
and it works as expected. Are you sure dvc gc
didn't work for you?
:warning: Please be warned that it is a destructive operation and improper use can result in cache loss. :warning:
Thanks,
Ruslan
Hi @efiop,
About the dvc gc
: I just reproduced the problem, tried a dvc gc
and the files are deleted as expected. It's most probably an error of mine. Sorry about that.
I can try implementing resumable downloads when I get the chance to work on it.
@jeremcs, thanks for the follow up! I'll close this issue for now.
And thanks for volunteering with resuming downloads, any effort is really appreciated :smile: !
If you have any doubts about the implementation or code, don't hesitate to contact us through the Discord channel.
@jeremcs No worries, glad it works :)
@mroutis @jeremcs Let's keep this open since while we don't support resumable downloads for most of the remotes, we should indeed remove this garbage right away and not keep it around polluting the cache. It is a matter of simply adding os.unlink(tmp_file)
https://github.com/iterative/dvc/blob/master/dvc/remote/gs.py#L202 . Also worth adding that logic to other remote types. @jeremcs let us know if you'd like to submit a patch for this :)
I'd be glad to submit a patch !
I think resumable downloads are useful but I agree that it would be better to create another issue.
Regarding cleaning, I think just adding os.unlink(tmp_file)
won't be enough. If the source of the download interruption does not come from the futures' thread (e.g. in the case of a KeyboardInterrupt
), we cannot just stop the thread. In the specific case of Google Storage, this is because we call the long-running download_to_filename
method.
I've got two ideas to solve this :
With a queue, we could keep track which temporary files were created in the threads and clean them up in the main thread.
In _get_chunks, we could create additional tmp_infos. I have put an example to illustrate this in my fork of the project : https://github.com/jeremcs/dvc/commit/9f864356f77841a52018d4f2ca4a7e551c19efb0
I know it's not the focus here, but this could also generalize to cleaning remote temporary files. I saw that we added this recently for atomic operations (https://github.com/iterative/dvc/pull/1678)
What do you think ?
@jeremcs You are right about os.unlink
, I've totally missed it. Your patch looks really good! :slightly_smiling_face: One thing to note is that it should actually use tmp_fname from #1678 to generate tmp fnames. Mind submitting a PR so we could proceed with it?
This seems like more than p4 priority to me?
And should we close PR #1720 for now?
@jorgeorpinel Moved to p3.
And should we close PR #1720 for now?
@jeremcs Are you working on that one? ๐
Most helpful comment
@jeremcs You are right about
os.unlink
, I've totally missed it. Your patch looks really good! :slightly_smiling_face: One thing to note is that it should actually use tmp_fname from #1678 to generate tmp fnames. Mind submitting a PR so we could proceed with it?