dvc add - and how I deleted all my data

Created on 2 Apr 2019 · 5 Comments · Source: iterative/dvc

I tested the functionality of dvc add and the entire pipeline on my macOS machine. I saw how it worked and I liked it. After the test, I deleted all of .dvc.

After that I ran dvc add on Ubuntu, then my disk got full, so I deleted .dvc. Suddenly I realized I had deleted almost all my data.

I think it's wrong that dvc add behaves differently on different systems. The behavior should be set explicitly, or a warning should be shown so that users don't delete their data like I did.

I don't know the right solution, but either would help a lot. My mistake was awful. Moreover, the one-line summary in the docs says that dvc add will

> Take a data file or a directory under DVC control.

which is confusing, because it literally moves all your data.

https://dvc.org/doc/commands-reference/add

All 5 comments

Hi @smolendawid!

Really sad to hear about that :slightly_frowning_face:

> After that I ran dvc add on Ubuntu, then my disk got full, so I deleted .dvc. Suddenly I realized I had deleted almost all my data.

That is pretty weird. Usually, hardlinks are used on Ubuntu, and they are safe even if you remove your .dvc/ directory. Did you see symlinks being used? Is there anything special about your project? E.g. did you use the external cache directory feature?

dvc add moves your data to the cache and creates links from the cache to your workspace, to improve performance and cut down the amount of disk space that your project uses. We are currently discussing the different link types, and which one should be the default, at https://github.com/iterative/dvc/issues/1599, and we would love to hear your thoughts about it.
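To make the difference between link types concrete, here is a minimal sketch in plain Python (it does not call DVC at all). It assumes a POSIX system where both hard links and symlinks are available; the directory names and the function name `demo_link_safety` are made up for illustration. It puts one file in a stand-in "cache", links it into a stand-in "workspace" both ways, and then deletes the cached file.

```python
import os
import tempfile


def demo_link_safety():
    """Return (hardlink_survives, symlink_survives) after the cached file is deleted."""
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "cache")       # stand-in for a cache directory
        work = os.path.join(root, "workspace")    # stand-in for the project directory
        os.mkdir(cache)
        os.mkdir(work)

        # Stand-in for a data file that was moved into the cache.
        cached = os.path.join(cache, "data.bin")
        with open(cached, "wb") as f:
            f.write(b"precious data")

        hard = os.path.join(work, "hard.bin")
        soft = os.path.join(work, "soft.bin")
        os.link(cached, hard)      # hard link: a second name for the same inode
        os.symlink(cached, soft)   # symlink: only a pointer to the cache path

        os.remove(cached)          # simulate deleting the cache

        # os.path.exists follows symlinks, so a dangling symlink reports False.
        return os.path.exists(hard), os.path.exists(soft)


if __name__ == "__main__":
    print(demo_link_safety())
```

On Linux or macOS this should return `(True, False)`: the hard-linked file keeps its contents because the data is only freed when the last name is removed, while the symlink is left pointing at nothing.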

Thanks,
Ruslan

I had my data on a big HDD and I set the cache on a smaller SSD. After my SSD got full, I deleted .dvc because I wanted to repeat the operation with the cache on the HDD. Yes, it looks like symlinks were used, but I can't check because I have to recover my disk, so the machine is off.
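For anyone in a similar setup who wants to check before deleting anything, the following is a rough sketch of how to list workspace entries that are symlinks pointing into a cache directory. It uses only the Python standard library; the function name `symlinks_into` and both path arguments are hypothetical, and nothing here is part of DVC itself.

```python
import os


def symlinks_into(workspace, cache_dir):
    """List workspace symlinks whose targets resolve to a path inside cache_dir."""
    cache_real = os.path.realpath(cache_dir)
    hits = []
    # os.walk does not follow symlinked directories by default, so a symlink
    # to a directory shows up in dirnames and is checked like any other entry.
    for dirpath, dirnames, filenames in os.walk(workspace):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            # The target is inside the cache if the cache is their common prefix.
            if os.path.commonpath([cache_real, target]) == cache_real:
                hits.append(path)
    return sorted(hits)
```

If the returned list is non-empty, deleting the cache directory would leave those workspace paths dangling, which is the situation described in this issue.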

BTW I am wondering if it will be easier to recover data from HDD or SSD.

Two worst things I see here are:

  1. I tested everything on macOS, where I could delete all of .dvc and the cache without any negative consequences
  2. dvc add is not like git add: it does more than just track the files

@smolendawid it's really sad to hear that happened :(. Its behavior indeed differs depending on the OS, the cache location relative to your project, etc. It's somewhat explained in the ticket Ruslan mentioned, #1599. Specifically, the copy, reflink cache type default suggested in that ticket could prevent this from happening. The downside is that DVC would physically copy files every time you run dvc add in your multi-disk configuration (unless you opt in to the symlink optimization that is currently turned on by default). In your specific case, does this tradeoff make sense?
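The tradeoff can be sketched as an ordered fallback over link strategies. This is an illustration in plain Python, not DVC's implementation: the function `place_in_workspace` and its `link_types` parameter are hypothetical, and reflinks are left out because the standard library has no call for them. The point is that hard links fail across filesystems (the HDD/SSD case), so a list that omits symlinks ends up copying.

```python
import errno
import os
import shutil


def place_in_workspace(cached, dest, link_types=("hardlink", "copy")):
    """Try each strategy in order and return the name of the one that worked."""
    for kind in link_types:
        try:
            if kind == "hardlink":
                os.link(cached, dest)
            elif kind == "symlink":
                os.symlink(cached, dest)
            elif kind == "copy":
                shutil.copy2(cached, dest)
            else:
                raise ValueError(f"unknown link type: {kind}")
            return kind
        except OSError as exc:
            # EXDEV means the link would cross filesystems; EPERM covers
            # filesystems or platforms that refuse the link. Try the next type.
            if exc.errno not in (errno.EXDEV, errno.EPERM):
                raise
    raise RuntimeError("no link type succeeded")
```

With the default `("hardlink", "copy")` order, the workspace file survives deletion of the cache either way, at the cost of a full copy when the cache lives on another disk. In DVC itself the strategy is controlled by the `cache.type` config option, for example `dvc config cache.type reflink,copy`.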

Another option is to improve the documentation. The dvc add docs already mention (in the description section) that we replace files with links. It's hard to squeeze all the detail into that short summary, but we could definitely try. How would you improve it, so that you would not have fallen into this trap in the first place?

Sorry again for this happening and let's try to improve it together!

@smolendawid We understand how bad the situation is, and we've decided to prioritize https://github.com/iterative/dvc/issues/1257, which will prevent your situation from ever happening again unless the user explicitly enables symlinks as an advanced feature (at which point we expect the user to be aware of the possible consequences). We will also update our docs to reflect that. Thank you for your feedback!

I understand how and why it works this way. In my opinion, symlinks should be an explicit option for those who opt in. I think this should be prioritized to prevent similar situations, so thanks for that.
