Consider the following situation:
You create a git repo with DVC for data management, using S3 remote storage. Let's say you then lose the repo (e.g. somebody force-pushes into master) and, consequently, you lose all the `.dvc` files. You want to recover the data solely from the S3 backend.
Currently, it is possible to recover the data, but without the filenames.
It would be nice if, for such situations, the filenames and any additional information were stored somewhere in remote storage.
I think the files are still in the dvc remote cache dir. Maybe you can download the whole remote cache dir and check through all the files. The file size and creation time can give you useful information.
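For example, something like this could work against an S3 remote (just a sketch, assuming boto3; the bucket name `my-bucket` and prefix `dvc-cache/` are placeholders for your actual remote config):

```python
# Sketch: list every object under the DVC remote prefix with its size and
# modification time, to help identify lost files by their metadata.
# "my-bucket" and "dvc-cache/" are placeholders -- use your remote's values.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="my-bucket", Prefix="dvc-cache/"):
    for obj in page.get("Contents", []):
        # DVC stores each file as <prefix>/<first 2 hash chars>/<rest of hash>
        print(obj["Key"], obj["Size"], obj["LastModified"])
```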
@adamsvystun You could simply `git push` to your s3 repo, which will back it up. Also, force pushes are usually forbidden for important branches (e.g. on github/gitlab/etc it is really easy to set up in the settings, and you could probably do it in plain git too), so it is worth doing, since you might lose not only dvc files but also your code. I don't think there is anything dvc could do, except upload a copy of the git repo, which seems like a hack that goes too far. Dvc usually goes along with your git repo, so as long as you take care of it (the git repo), you will be fine. And in most workflows, code matters as much as, if not more than, data.
@efiop
> And in most workflows, code matters as much as, if not more than, data.
Emmm, as a machine learning engineer, data is more important to me, especially manually labeled data.
@karajan1001 Touché, I went too far on that one :slightly_smiling_face: Thanks for the correction!
Thinking about it some more, theoretically we could store some list of filenames and their hashes in plain form, as a kind of tag of sorts. For example, for directories we store a `.dir` cache file, which has relpaths and their hashes in plain-text form, so you are able to find and recover a file/directory by path from our regular dvc remote pretty easily. We might consider something similar for standalone files and standalone dirs too. I guess one way to go about it is to kinda `dvc add` your project root, which would save a `.dir` cache file with the structure of your repo. Not sure how useful that would be or how we would organise the UI for viewing it, but nonetheless it could be done.
Also, we have been thinking about creating some command for listing your raw cache to be able to find lost files by their size/mtime/etc (pretty much what @karajan1001 suggested earlier). Maybe we need something similar here, but for remotes.
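Very roughly, such a listing could look something like this for the local cache (a sketch only, assuming the `.dvc/cache/<2-char prefix>/<rest of hash>` layout; not an existing command):

```python
# Sketch: walk .dvc/cache and print each object's hash, size, and mtime
# so that lost files can be matched up by their metadata.
import datetime
import os

cache_dir = ".dvc/cache"
for prefix in sorted(os.listdir(cache_dir)):
    subdir = os.path.join(cache_dir, prefix)
    if not os.path.isdir(subdir):
        continue
    for name in sorted(os.listdir(subdir)):
        path = os.path.join(subdir, name)
        st = os.stat(path)
        mtime = datetime.datetime.fromtimestamp(st.st_mtime)
        # <full hash>  <size>  <mtime>
        print(f"{prefix}{name}  {st.st_size}  {mtime:%Y-%m-%d %H:%M:%S}")
```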
Again, these are thoughts off the top of my head. If you have any suggestions on how this could look, please feel free to share :slightly_smiling_face:
E.g. `dvc add path/to/data` (md5 of `data` is `12345`) could create an empty file `.dvc/cache/skeleton/path/to/data/12345` to mark that such a path used to have such a hash at some point. If `data` later changes and has an md5 of `54321`, then dvc would create an empty file `.dvc/cache/skeleton/path/to/data/54321`. If `data` ever becomes a dir with a `33333.dir` hash, then it would create an empty file `.dvc/cache/skeleton/path/to/data/33333.dir`. This way it would be simple to push/pull, and it would probably be a good enough solution for finding your lost files. Maybe `skeleton` is not really needed and file/dirnames are enough, so we could flatten it.
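A minimal sketch of that marker logic (all names here are hypothetical, not an existing DVC API):

```python
# Sketch: on every add, drop an empty marker file whose directory path
# encodes the tracked path and whose filename is the content hash.
import os

def record_skeleton(repo_root, tracked_path, hash_value):
    marker_dir = os.path.join(repo_root, ".dvc", "cache", "skeleton", tracked_path)
    os.makedirs(marker_dir, exist_ok=True)
    # The empty file's existence alone says "tracked_path once had hash_value".
    open(os.path.join(marker_dir, hash_value), "w").close()

record_skeleton(".", "path/to/data", "12345")      # initial version
record_skeleton(".", "path/to/data", "54321")      # after data changes
record_skeleton(".", "path/to/data", "33333.dir")  # after data becomes a dir
```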
@karajan1001
> Maybe you can download the whole remote cache dir and check through all the files. The file size and creation time can give you useful information.
Yes, I can do that, and it would allow me to recover the files, but it would be a little nicer if the filenames were also in the remote.
@efiop
> force pushes are usually forbidden for important branches
I would know to properly set up the repo in this way, for sure. But many people don't do that, and eventually somebody might lose all their data because of some intern and a bad configuration. It's unlikely, but the probability is nonzero. Maybe the probability is too low to consider any changes necessary :) I am not sure.
@efiop Also, more broadly, saving filenames on the remote would give the remote more 'independence' in a sense, like you would be able to know what is what without the git repo. Not sure what your internal philosophy regarding DVC is, but I would consider that a step in a cool direction. Potentially you could have tools that browse the remote files based solely on the remote, without the need to sync up with the git repo - this might be a cool feature.
@adamsvystun The reason I am admittedly hesitant is that it feels like duplicating git's functionality in a sense :slightly_smiling_face: But I understand your point and do see the value here.
@adamsvystun @karajan1001 What kind of functionality would you expect from such a feature/command? What if a remote is used by multiple dvc projects? Should it show files as if they are in the same dvc project?
My 2c on this - we can just dump any information about file names w/o expecting any additional functionality, just to help recover information if needed. Any mapping from hash to a name is way better than having nothing. I think that alone can be a good step, at least in making things more secure.
I agree with @shcheklein. Simple, but good enough if you do lose something.
@efiop My opinion: currently the data is tracked by dvc, while the data information (file names? or something else) is tracked by git. There is only one mapping direction between them: it's easy to find data files from `.dvc` files, but the opposite operation is much harder. In most cases, using git to track the `.dvc` files is enough. Making it more secure means we have to be able to recover data with only the information from a remote repository. As a remote repository currently stores only data files, we would have to store more information in the remote. This can be done in three different ways:
1. Keep a table file in the remote that maps hashes to file names. This table file is hard to maintain: we have to pull it down and merge it before we push it up to the remote, or we may overwrite information from others.
2. Pack each data file together with its metadata. A packing and unpacking process is needed, and we can't use links to get better performance (the packed file is different from the original one).
3. Directly put the `.dvc` files into the remote and rename them? (A sketch of this is below.)
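For solution 3, a hedged sketch of what that could look like (assuming PyYAML and boto3; `my-bucket` and the `meta/` prefix are placeholders, and only the simple single-output `.dvc` layout is handled):

```python
# Sketch: upload each .dvc file to the remote, renamed after the hash it
# points to, so the hash -> filename mapping survives without the git repo.
import glob

import boto3
import yaml

s3 = boto3.client("s3")

for dvc_file in glob.glob("**/*.dvc", recursive=True):
    with open(dvc_file) as f:
        meta = yaml.safe_load(f)
    md5 = meta["outs"][0]["md5"]  # simplification: first output only
    # Mirror the cache layout, e.g. dvc-cache/meta/12/345....dvc
    key = f"dvc-cache/meta/{md5[:2]}/{md5[2:]}.dvc"
    s3.upload_file(dvc_file, "my-bucket", key)
```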
I don't know what the total benefits and costs are.
For example, with the mapping information, we could solve the problem that we can't show useful information about the data files that are about to be deleted in a `dvc gc` process.
mentioned in:
https://github.com/iterative/dvc/issues/1511
(I tried to solve it last weekend and found that showing the hash names of the data files gives little information.)
And finally, a `dvc recover` command could download all the files in the repository and rename them, maybe to a name like `{original name}.{creation time}.{git tag / git commit msg / git snapshot}` or with some other useful information.
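A toy illustration of that naming scheme (the `recovered_name` helper is purely hypothetical):

```python
# Sketch: build a recovered filename from the original name, the creation
# time, and a git ref, as suggested above.
import datetime

def recovered_name(original_name, ctime, git_ref):
    stamp = datetime.datetime.fromtimestamp(ctime).strftime("%Y%m%d-%H%M%S")
    return f"{original_name}.{stamp}.{git_ref}"

# e.g. "data.csv.20200427-153320.v1.2" (exact time depends on local timezone)
print(recovered_name("data.csv", 1588000000, "v1.2"))
```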