Backstory: I just `dvc remove`d 2 tarballs which took me several hours to generate 😭.
Suggestions:
`dvc remove some.tgz.dvc` makes the user think they're removing the dvc file, which would seem to have some sort of "unlinking" functionality. Instead, it leaves the `.dvc` file and deletes the file which is under dvc control. This is unintuitive and destructive, a nasty combination. If I wanted to delete that file, why wouldn't I just `rm some.tgz`? What, if anything, did dvc do beyond removing that file? It's not clear to me. I expected `dvc remove` to do the inverse of `dvc add`. Why doesn't it? Is there some other way to do the inverse of `dvc add`?
"This will remove train.tsv from the working dir" did not match the file in the command. Another suggestion might be to put the non-destructive option (`dvc unlink`) first, or clearly label those headings as [Destructive] and [Non-destructive].
Hi, @colllin. That sounds bad. First of all, let's try to run `dvc checkout some.tgz.dvc`. It should get your file back.
`dvc remove` is safe in the sense that it does not remove your data from the cache (`dvc add` or `dvc run` puts files into `.dvc/cache`). It removes it _only_ from your _workspace_. It is very similar to git: if you have a code file that is committed into git, you can always get it back via `git checkout -- file`. The same applies here: just run `dvc checkout file.dvc` and it should get it back.
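To make the workspace/cache distinction concrete, here is a toy model in plain shell. It is only a sketch, not DVC's implementation: the `cache` directory, the flat layout keyed by file name, and the use of `cp` in place of links are all assumptions made for illustration.

```shell
#!/bin/sh
# Toy model of a workspace and a cache using plain files; NOT DVC's real code.
demo=$(mktemp -d)
cd "$demo"
mkdir cache

# "add": keep a copy of the data in the cache (hypothetical flat layout).
printf 'hours of work\n' > some.tgz
cp some.tgz cache/some.tgz

# "remove": delete the workspace file only; the cached copy is untouched.
rm some.tgz

# "checkout": restore the workspace file from the cache.
cp cache/some.tgz some.tgz

restored=$(cat some.tgz)
echo "restored: $restored"
```

In this model the restore only works because the data reached the cache before it was removed from the workspace.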
Sorry for the confusion though. Maybe we should put a note after `dvc remove`, something like: `Your data is cached, run dvc pull or dvc checkout <file> if you want to get it back`. @colllin please, let us know what you think.
@colllin just want to follow up on this and make sure that you were able to check out your data again. Please let us know how things are going.
Hey @shcheklein, thank you for the follow-up! I think I misled you into thinking that there was a trivial solution based on my example.
Here's the backstory:
`dvc add images.tgz`, then something like `dvc run -d images.tgz -O images tar -zxf images.tgz`. This basically makes the decision in advance that I'm using dvc more to _tie my data to my code commits_ than to directly _version control my data_. Does that make sense? (Side note: I would even like a way to remove the cached tarball once I extract it. And I don't really want to cache the untarred images due to S3 requests to transfer ~100k images. Maybe there's something smarter like keeping all of the images in a large .h5? Or extracting/reading images directly out of the tgz during training rather than extracting ahead of time?)
I guess the correct way to go about modifying my data and updating the dvc references to replace the old dataset wasn't clear to me. I went into this by overwriting the tarball first, as I would in git (modify my file first, add/commit the changes later), but it wasn't clear how to do the same thing with dvc. I thought I might want to "replace the file [in dvc]", which led me to follow the instructions to erase the tarball I had just spent hours creating. That's obviously not what I wanted. Then I found `dvc unprotect some.tgz; <modify some.tgz>; dvc add some.tgz` further down the page, but I'm still not sure if that's correct, or how that would have played out in the scenario `<modify some.tgz>; dvc unlink some.tgz; dvc add some.tgz`.
Does my confusion and complaint about "safety" make more sense now? (Also, what's the right way to accomplish this?) Have you considered a stage & commit strategy more like git?
As for making `dvc remove` safer or more intuitive, my #1 recommendation would be to require you to specify the file which will actually be removed, rather than the `*.dvc` reference to it, i.e. `dvc remove train.tsv` rather than `dvc remove train.tsv.dvc`. I would read the docs twice before I typed a command that included `remove <some file I don't want to remove>`. Consider that my `*.dvc` is committed to git, which means that if dvc does remove it unexpectedly (when I type `dvc remove some.dvc`), I'm comfortable with the fact that I can easily check it back out. So for those reasons I felt safe typing `dvc remove some.tgz.dvc` when I shouldn't have.
Another comment would be that `dvc unprotect` is unintuitive, too, because I don't remember `dvc protect`ing anything in the first place, so it isn't clear why I would need to undo something I haven't done, especially when it doesn't seem like there's any way to re-protect it afterward, nor do I understand what "protection" I'd be giving up. It feels like `dvc unprotect` is designed to perform the inverse of `dvc add`, but the word "remove" is the semantic inverse of "add", and there's a `dvc remove` which doesn't seem to perform the inverse of `dvc add`.
I don't know the answers — I'm just trying to clarify the problems I'm feeling right now.
@colllin your scenario makes sense! Thank you for such a detailed follow-up! There are a lot of questions in it. I'll try to address all of them (maybe not within a single comment, and once I have more clarity, and hopefully you too).
But before we go deep into this, could you clarify a little what you expect and what value you hope to get from DVC? I'm asking because the way to organize your DVC project might depend on this. For example, it looks like you don't care that much about tracking different versions of the images dataset; you are fine with a single version and you can update it outside of the DVC project. Is that correct?
> Is that as easy as looking at the old md5 and deleting the S3 files?
Yes, you can manually remove it by looking for the md5 in the remote cache. Mind, though, that it has a hierarchical structure internally, something like `S3://remote/ac/cdefghtr5736367282` for the `accdefghtr5736367282` md5. You can also run `dvc gc --cloud`, but be careful because it might also remove some models and intermediate results.
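As a small sketch of the layout described in this reply, the helper below splits an md5 string into a two-character directory and the remaining characters. The function name `cache_path` is hypothetical, and the hash is the placeholder from the comment, not a real md5.

```shell
#!/bin/sh
# Map an md5 string to the two-level layout described above:
# the first two characters form a directory, the rest is the file name.
cache_path() {
    md5=$1
    prefix=$(printf '%s' "$md5" | cut -c1-2)
    rest=$(printf '%s' "$md5" | cut -c3-)
    printf '%s/%s\n' "$prefix" "$rest"
}

# The example hash from the comment above (a placeholder, not a real md5).
cache_path accdefghtr5736367282   # prints ac/cdefghtr5736367282
```

Prefixing the result with the remote URL gives the object to look for when deleting by hand.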
> Maybe there's something smarter like keeping all of the images in a large .h5
It definitely makes sense to store them as a tar at least; it should be easier to read a single file. I don't quite understand why you use compression, though. Maybe you can switch to plain `tar` and save some CPU and time?
> `dvc unprotect` is unintuitive, too, because I don't remember `dvc protect`ing anything in the first place
Yes, protected mode is opt-in now. We will probably make it the default soon, and then there will be no way to edit/modify files without running `dvc unprotect` first.
Regarding `dvc remove`: I'm still not sure I quite understand what happened. As far as I understand, you took the tarball (`images.tgz`) under DVC control with `dvc add`. When you do that, DVC adds the file to the cache (`.dvc\cache\md5\of_images_tgz.md5`) and links the source `images.tgz` to that cached file (this is done to save space and avoid an actual copy operation). You then avoided caching the extracted files because of `-O`. Now, which file(s) did you lose, and how?
As we said, I don’t care about tracking changes to my dataset. I’m more interested in the linking between datasets and git commits, and the `dvc repro` framework for running experiments.
That said, sometimes I do need to modify my dataset, for example I discovered some corrupt labels and removed those samples.
Note that at this point I’m talking about modifying the extracted, uncached dataset — image files and mask files.
So then I re-tarred this dataset in the hopes of replacing the original tarball, while keeping the accompanying pipeline which extracts the tarball to a specific location.
After I had recreated the tarball, it was not clear how to add the new tarball to dvc as a replacement for the original.
At this point I tried `dvc remove some.tgz.dvc`, which then deleted the tarball I had just spent hours creating.
Is my state and goal more clear now? What was the correct way to accomplish it?
@colllin I think I have a better sense now. Just to be completely on the same page.
Let's say you initially have `images.tgz`, `images.tgz.dvc`, and `images`. You made some changes (removed files, added files) inside the `images` directory. And right after that you ran `tar zcf images.tgz images`, basically overwriting the tarball. Is that correct?
To go back, and answer your initial questions. The workflow to update the tarball should have looked like this:
dvc unprotect images.tgz # or dvc remove images.tgz.dvc - it's not destructive
<add/remove images or metafiles inside images>
tar zcf images.tgz images # why do we use -z, is it actually beneficial?
dvc add images.tgz
What does the `dvc unprotect` command do? It unlinks the file in your workspace (the one you actually see, `images.tgz` in this case) from the cached version (the one inside `.dvc/cache`). This prevents cache corruption: if you overwrote the still-linked workspace file, you would overwrite the cached (previous) version along the way.
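The corruption scenario can be reproduced with ordinary files. The sketch below assumes the workspace file is a hardlink to the cached copy (DVC may use a different link type depending on configuration), and the `cache/blob` name is made up for the demo. "Unlinking" is modeled as replacing the link with an independent copy.

```shell
#!/bin/sh
# Plain-shell illustration with a hardlink; DVC may use other link types.
demo=$(mktemp -d)
cd "$demo"
mkdir cache

# Simulate "add": the cache holds the data, the workspace file is a hardlink to it.
printf 'v1\n' > cache/blob
ln cache/blob images.tgz

# Overwriting the workspace file in place also changes the cached copy.
printf 'v2\n' > images.tgz
corrupted=$(cat cache/blob)      # now "v2": the old version is gone

# Reset, then simulate "unprotect": replace the link with an independent copy.
printf 'v1\n' > cache/blob
cp images.tgz images.tgz.tmp && mv images.tgz.tmp images.tgz
printf 'v2\n' > images.tgz
preserved=$(cat cache/blob)      # still "v1"

echo "without unlinking: cache=$corrupted; after unlinking: cache=$preserved"
```

The first overwrite silently replaces the cached version, which is the situation described earlier in the thread; after the copy step the cache keeps the previous version.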
What does the `dvc remove smth` command do? It's syntactic sugar for `rm`. It's beneficial when you run it with a stage file that has multiple outputs, for example. It also adds value when you use protected mode by default (when all your files under DVC control are set to read-only). In this case it's more like `chmod` + `rm`.
From what I see, it's definitely confusing that `remove` takes dvc stage files, while `unprotect` and `move` take just files. It's a good suggestion to make them all consistent with each other: accept files, and provide an option to accept dvc stage files, for example.
Another thing is that `dvc remove` should warn the user if the file is not in the cache or remote storage before removing it from the workspace.
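One way such a check could look, purely as a sketch: the `safe_remove` and `cache_path_for` helpers below are hypothetical, not DVC commands. They assume `md5sum` is available and that the local cache uses the same two-level layout mentioned above for the remote, which is an assumption.

```shell
#!/bin/sh
# Hypothetical wrapper: refuse to delete a file unless an identical copy is cached.
# Assumes md5sum and a cache laid out as <first 2 md5 chars>/<rest>.
cache_path_for() {   # usage: cache_path_for FILE CACHE_DIR
    sum=$(md5sum "$1" | cut -d' ' -f1)
    printf '%s/%s/%s\n' "$2" "$(printf '%s' "$sum" | cut -c1-2)" "$(printf '%s' "$sum" | cut -c3-)"
}

safe_remove() {      # usage: safe_remove FILE CACHE_DIR
    if [ -f "$(cache_path_for "$1" "$2")" ]; then
        rm -f "$1"
    else
        echo "warning: $1 is not in the cache; not removing it" >&2
        return 1
    fi
}

demo=$(mktemp -d)
mkdir "$demo/cache"
printf 'new tarball\n' > "$demo/images.tgz"

# Not cached yet: the wrapper warns and keeps the file.
safe_remove "$demo/images.tgz" "$demo/cache" || kept=yes

# Cache a copy under its md5, after which removal is allowed.
target=$(cache_path_for "$demo/images.tgz" "$demo/cache")
mkdir -p "$(dirname "$target")"
cp "$demo/images.tgz" "$target"
safe_remove "$demo/images.tgz" "$demo/cache" && removed=yes
```

A check like this would have caught the case in this thread, where the freshly created tarball had never been added to the cache.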
Any thoughts, @efiop @dmpetrov @colllin ?
@shcheklein Thank you for your time and thorough responses. ❤️
I think the only confusing thing about my situation at the beginning is that I started out with the `tar` command _first_, which, based on my understanding after reading through #1599, means that I had already corrupted my cache. I think we're already addressing that in #1599.
As a new user, I found this `dvc` behavior unlike git, where it's always safe to modify your files and then sort out the intended changes later before committing.
Feel free to close this issue if it is not helping you track anything.
Sure, thank you @colllin for the valuable feedback!