Dvc: [UX] `dvc remove` safety

Created on 24 Jan 2019 · 11Comments · Source: iterative/dvc

Backstory: I just dvc removed 2 tarballs which took me several hours to generate 😭.

Suggestions:

The current syntax of dvc remove some.tgz.dvc makes the user think they're removing the dvc file, which would seem to have some sort of "unlinking" functionality. Instead, it leaves the .dvc file and deletes the file which is under dvc control. This is unintuitive and destructive — a nasty combination. If I wanted to delete that file, why wouldn't I just rm some.tgz? What, if anything, did dvc do beyond removing that file? It's not clear to me. I expected dvc remove to do the inverse of dvc add. Why doesn't it? Is there some other way to do the inverse of dvc add?
In the user guide, I feel like it should be more clear that this command is destructive. Due to the misleading syntax issues above, and apparently reading too quickly, I didn't catch until too late that the statement This will remove train.tsv from the working dir did not match the file in the command. Another suggestion might be to put the non-destructive option (dvc unlink) first, or clearly label those headings as [Destructive] and [Non-destructive].

enhancement

Source

colllin

Most helpful comment

Another comment would be that dvc unprotect is unintuitive, too, because I don't remember dvc protecting anything in the first place, so it isn't clear why I would need to undo something I haven't done, especially when it doesn't seem like there's any way to re-protect it afterward, nor do I understand what "protection" I'd be giving up. It feels like dvc unprotect is designed to perform the inverse of dvc add, but the word "remove" is the semantic inverse of "add", and there's a dvc remove which doesn't seem to perform the inverse of dvc add.

I don't know the answers — I'm just trying to clarify the problems I'm feeling right now.

colllin on 31 Jan 2019

👍2

All 11 comments

Hi, @colllin. That sounds bad. First of all, let's try to run dvc checkout some.tgz.dvc. It should get your file back.

shcheklein on 24 Jan 2019

👍1

dvc remove is safe in a sense that it does not remove your data from the cache (dvc add or dvc run puts files into .dvc/cache). It removes it _only_ from your _workspace_. It is very similar to git, if you have a code file that is committed into git you can always get it back via git checkout -- file. The same here, just run dvc checkout file.dvc and it should get it back.

shcheklein on 24 Jan 2019

👍1

Sorry for the confusion though. May be we should put a note after dvc remove, something like: Your date is cached, run dvc pull or dvc checkout <file> if you want to get back. @colllin please, let us know what you think.

shcheklein on 25 Jan 2019

@colllin just want to follow up on this and make sure that you was able to checkout your data back. Please, let us know how things are going.

shcheklein on 26 Jan 2019

Hey @shcheklein, thank you for the follow-up! I think I mislead you into thinking that there was a trivial solution based on my example.

Here's the backstory:

We're working with image data, so it's expensive to download, extract, compute md5, etc, because there's a lot of it (~500GB on this project). I believe that this puts me in a bit of an edge case already.
Because of this, as I discussed with some of you on Discord a couple weeks ago, we had decided that it might be best to keep either the tarball or its extracted contents in the cache, but probably not both. So I dvc add images.tgz then something like dvc run -d images.tgz -O images tar -zxf images.tgz. This basically makes the decision in advance that I'm using dvc more to _tie my data to my code commits_ than to directly _version control my data_. Does that make sense? (Side note: I would even like a way to remove the cached tarball once I extract it. And I don't really want to cache the untarred images due to S3 requests to transfer ~100k images. Maybe there's something smarter like keeping all of the images in a large .h5? Or extracting/reading images directly out of the tgz during training rather than extracting ahead of time?)
I did end up needing to modify my dataset after I had already added it to dvc and extracted it, because I found some corrupt labels. Essentially I just removed about 10% of it.
Then I wanted to re-tar the images and replace the tarball in dvc. Again, it's non-trivially expensive to store all this data in S3, so, in this case, since I know that no one is ever going to care about the previous tarball, I would like to completely remove it from S3 and dvc (I'm fine with manually deleting the cache out of S3 without actually removing the references in my git history, knowing that this will break dvc checkout for those past commits). Is that as easy as looking at the old md5 and deleting the S3 files?

I guess it wasn't clear to me the correct way to go about modifying my data and updating the dvc references to replace the old dataset. I went into this by overwriting the tarball first, as I would in git (modify my file first, add/commit the changes later), but it wasn't clear how to do the same thing with dvc. I thought I might want to "replace the file [in dvc]", which led me to follow the instructions to erase the tarball I had just spent hours creating. That's obviously not what I wanted. Then I found dvc unprotect some.tgz; <modify some.tgz>; dvc add some.tgz further down the page, but I'm still not sure if that's correct, or how that would have played in the scenario <modify some.tgz>; dvc unlink some.tgz; dvc add some.tgz.

Does my confusion and complaint about "safety" make more sense now? (Also, what's the right way to accomplish this?) Have you considered a stage & commit strategy more like git?

As for making dvc remove safer or more intuitive, my #1 recommendation would be to require you to specify the file which will actually be removed, rather than the *.dvc reference to it, i.e. dvc remove train.tsv rather than dvc remove train.tsv.dvc. I would read the docs twice before I typed a command that included remove <some file I don't want to remove>. Consider that my *.dvc is committed to git, which means that if dvc does remove it unexpectedly (when I type dvc remove some.dvc, I'm comfortable with the fact that I can easily check it back out. So for those reasons I felt safe typing dvc remove some.tgz.dvc when I shouldn't have.

colllin on 31 Jan 2019

👍2

I don't know the answers — I'm just trying to clarify the problems I'm feeling right now.

colllin on 31 Jan 2019

👍2

@colllin your scenario makes sense! Thank you for a so detailed follow up! There are a lot of questions in it. I'll try to address all of them (may be not within a single comment and when I have more clarity and hopefully you too).

But before, we go deep into this, could you clarify me a little bit what is expectation and what value do you expect from DVC to get. I'm asking because the way to organize your DVC project might depend on this. For example, it looks like you don't care that much about tracking different versions of the images dataset, you are fine with a single version and you can update it outside of the DVC project. Is it correct?

Is that as easy as looking at the old md5 and deleting the S3 files?

Yes, you can manually remove it by looking for an md5 in the remote cache. Mind though that it has an hierarchical structure internally - something like S3://remote/ac/cdefghtr5736367282 for the accdefghtr5736367282 md5. You can also run dvc gc --cloud, be careful though because it might remove also some models and intermediate results.

Maybe there's something smarter like keeping all of the images in a large .h5

It definitely makes sense to store them as tar at least, it should be easier to read a single file. I don't quite understand why do you use compression? It might be you can switch to tar only and save some CPU and time?

dvc unprotect is unintuitive, too, because I don't remember dvc protecting anything in the first place

Yes, protected mode is opt-in now. Probably we will make it a default soon and there will be no way to edit/modify files w/o running dvc unprotect first.

Regarding dvc remove. I'm not sure I quite understand what happened still. As far as I understood we decided that you take the tarball (images.tgz) under DVC control with dvc add. When you do that it basically means that DVC adds this file into the cache (.dvc\cache\md5\of_images_tgz.md5) and links the source images.tgz to that file (this is done to save the space and avoid actual copy operation). You then avoided caching the extracted files because of -O. Now, what file(s) you lost and how?

shcheklein on 1 Feb 2019

As we said, I don’t care about tracking changes to my dataset. I’m more interested in the linking between datasets and git commits, and the dvc repro framework for running experiments.

That said, sometimes I do need to modify my dataset, for example I discovered some corrupt labels and removed those samples.

Note that at this point I’m talking about modifying the extracted, uncached dataset — image files and mask files.

So then I re-tarred this dataset in the hopes of replacing the original tarball, while keeping the accompanying pipeline which extracts the tarball to a specific location.

After I had recreated the tarball, it was not clear how to add the new tarball to dvc as a replacement for the original.

At this point I tried dvc remove some.tgz.dvc, which then deleted the tarball I had just spent hours creating.

Is my state and goal more clear now? What was the correct way to accomplish it?

colllin on 6 Feb 2019

@colllin I think I have a better sense now. Just to be completely on the same page.

Let's say you have images.tgz and images.tgz.dvc, and images initially. You made some changes (remove files, add files) inside the images directory. And right after that you ran tar zcf images.tgz image, basically overwriting the tar-ball. Is it correct?

To go back, and answer your initial questions. The workflow to update the tarball should have looked like this:

dvc unprotect images.tgz              # or dvc remove images.tgz.dvc - it's not destructive
<add/remove images or metafiles inside images>
tar zcf images.tgz images              # why do we use -z, is it actually beneficial?
dvc add images.tgz

What does dvc unprotect command do? It unlinks the file in your workspace (the one you actually see, the images.tgz in this case) from the cached version (the one inside .dvc/cache) to prevent cache corruption in case you actually overwrite the file in the workspace thus overwriting the cached (previous) version along the way.

What does dvc remove smth command do? It's a syntax sugar for the rm. It's beneficial when you run it with a stage file that have multiple outputs, for example. It also gives value when you use a protected mode by default (when all your files that are under DVC control set to be read only). In this case it's more like chmod + rm.

From what I see, it's definitely confusing to see that remove takes dvc stage files, unprotect just files, move just files. It's a good suggestion to make them all consistent with each other, accept files and provide an option to accept dvc stage files, for example.

Another thing is that dvc remove should warn a user if file is not in cache or remote storage before removing it from the working space.

Any thoughts, @efiop @dmpetrov @colllin ?

shcheklein on 8 Feb 2019

👍1

@shcheklein Thank you for your time and thorough responses. ❤️

I think the only confusing thing about my situation at the beginning is that I started out with the tar command _first_, which, based on my understanding after reading through #1599, means that I had already corrupted my cache. I think we're already addressing that in #1599.

As a new user, this dvc behavior was unlike git, where it's always safe to modify your files, and then you sort out the intended changes later before committing.

Feel free to close this issue if it is not helping you track anything.

colllin on 21 Feb 2019

Sure, thank you @colllin for the valuable feedback!

shcheklein on 23 Feb 2019

Was this page helpful?

0 / 5 - 0 ratings