Dvc: Interrupting dvc pull at the download step does not cleanup temporary files

Created on 1 Mar 2019  ยท  8Comments  ยท  Source: iterative/dvc

When running a dvc pull to fetch large files, interruptions at the download step do not clean the temp files in the cache.
I'm pulling files from a Google Storage remote. Multiple interruptions lead to lots of wasted space and eventually fill the disk, especially when iterating on a pipeline with large files.
A dvc gc does not delete these files.

I started adding file cleaning myself by editing the RemoteGs.download method to eventually find out that we cannot catch the KeyboardInterrupt in the futures launched by the ThreadPoolExecutor. At first glance, fixing this does not seem trivial. What do you think ?

Here's an example of leftover files that accumulate (under the 30 directory)

.dvc/cache
โ”œโ”€โ”€ 30
โ”‚ย ย  โ”œโ”€โ”€ 6fef47bb205ecc71e7c5c79113818a.3d12ff9e-c67f-4e17-a771-62582c035353.tmp
โ”‚ย ย  โ”œโ”€โ”€ 6fef47bb205ecc71e7c5c79113818a.4f729150-4961-44df-8327-855b48dfddc5
โ”‚ย ย  โ”œโ”€โ”€ 6fef47bb205ecc71e7c5c79113818a.9c0756e7-7343-48e0-8146-8345e280e50f
โ”‚ย ย  โ”œโ”€โ”€ 6fef47bb205ecc71e7c5c79113818a.c8bfc484-d205-45be-a1ab-48a9420e1a0a.tmp
โ”‚ย ย  โ”œโ”€โ”€ 6fef47bb205ecc71e7c5c79113818a.e11ade77-f69c-4e67-8bb7-6e7e817712fe
โ”‚ย ย  โ”œโ”€โ”€ 6fef47bb205ecc71e7c5c79113818a.f74fe939-1a44-4c2b-bd7c-446b49b08445.tmp
โ”‚ย ย  โ””โ”€โ”€ 6fef47bb205ecc71e7c5c79113818a.fcced860-64a8-4675-90c2-0a68e9405f31.tmp
โ””โ”€โ”€ ea
    โ””โ”€โ”€ ffa5834c1819992cf5e94611d8f300

Setup : dvc version 0.29.0 freshly installed via pip (python 3.6)

bug good first issue p3-nice-to-have

Most helpful comment

@jeremcs You are right about os.unlink, I've totally missed it. Your patch looks really good! :slightly_smiling_face: One thing to note is that it should actually use tmp_fname from #1678 to generate tmp fnames. Mind submitting a PR so we could proceed with it?

All 8 comments

Hi @jeremcs !

Great catch! The intended purpose of this was to add resume-download support at some point, but we just didn't get to implementing it for gs yet. I see you've been deep into the code already with the intention to submit a fix for this(i suppose), so maybe you would like to look into implementing resumable download support for gs instead of trying to remove these? :slightly_smiling_face:

A dvc gc does not delete these files.

This is pretty weird. It should delete them. Here is a reproducer script:

#!/bin/bash                                                                                     

set -e                                                                                          
set -x                                                                                          

rm -rf myrepo                                                                                   
mkdir myrepo                                                                                    
cd myrepo                                                                                       

git init                                                                                        
dvc init                                                                                        

echo foo > foo                                                                                  
dvc add foo                                                                                     

echo bar > .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00.3d12ff9e-c67f-4e17-a771-62582c035353.tmp

tree .dvc/cache                                                                                 

dvc gc -f                                                                                       

tree .dvc/cache                                                                                 

and it works as expected. Are you sure dvc gc didn't work for you?

:warning: Please be warned that it is a destructive operation and improper use can result in cache loss. :warning:

Thanks,
Ruslan

Hi @efiop,

About the dvc gc : I just reproduced the problem, tried a dvc gc and the files are deleted as expected. It's most probably an error of mine. Sorry about that.

I can try implementing resumable downloads when I get the chance to work on it.

@jeremcs, thanks for the follow up! I'll close this issue for now.

And thanks for volunteering with resuming downloads, any effort is really appreciated :smile: !
If you have any doubts about the implementation or code, don't hesitate to contact us through the Discord channel.

@jeremcs No worries, glad it works :)

@mroutis @jeremcs Let's keep this open since while we don't support resumable downloads for most of the remotes, we should indeed remove this garbage right away and not keep it around polluting the cache. It is a matter of simply adding os.unlink(tmp_file) https://github.com/iterative/dvc/blob/master/dvc/remote/gs.py#L202 . Also worth adding that logic to other remote types. @jeremcs let us know if you'd like to submit a patch for this :)

I'd be glad to submit a patch !
I think resumable downloads are useful but I agree that it would be better to create another issue.

Regarding cleaning, I think just adding os.unlink(tmp_file) won't be enough. If the source of the download interruption does not come from the futures' thread (e.g. in the case of a KeyboardInterrupt), we cannot just stop the thread. In the specific case of Google Storage, this is because we call the long-running download_to_filename method.

I've got two ideas to solve this :

I know it's not the focus here, but this could also generalize to cleaning remote temporary files. I saw that we added this recently for atomic operations (https://github.com/iterative/dvc/pull/1678)

What do you think ?

@jeremcs You are right about os.unlink, I've totally missed it. Your patch looks really good! :slightly_smiling_face: One thing to note is that it should actually use tmp_fname from #1678 to generate tmp fnames. Mind submitting a PR so we could proceed with it?

This seems like more than p4 priority to me?

And should we close PR #1720 for now?

@jorgeorpinel Moved to p3.

And should we close PR #1720 for now?

@jeremcs Are you working on that one? ๐Ÿ™‚

Was this page helpful?
0 / 5 - 0 ratings