Dvc: Preserve file-extension when pushing to remote

Created on 18 Aug 2020 · 4 Comments · Source: iterative/dvc

Hi,
I set up a NextCloud instance as a remote for DVC. So far so good! The only weak spot is that I cannot browse my pushed data, because the file extension is missing after a push.
Is it possible to preserve the file extension during a push, so one can have a look at the data on a remote server / cloud?

Best regards
Sebastian

awaiting response

All 4 comments

Hi @smoosbau, thanks for creating an issue. This is fundamental to how dvc and dvc remotes work.

DVC remote is a backup cache, meant to be used for DVC.

And, because DVC deduplicates files, there's no 1-1 mapping between stored objects and file names that would let it preserve file extensions. I know that this does not look good on file-hosting services such as GDrive/NextCloud, etc., but right now the best approach is to treat the remote as a black box.

Not to mention that the metadata is not stored in the DVC remote but in the Git repository. The best way to check files is by using dvc commands, or by getting a direct link using dvc get --show-url.


That said, we had a similar open issue, can't seem to find it. I'll comment if I do. Thanks again.

Found the issue: https://github.com/iterative/dvc/issues/3621, though it does not solve this.

It's more of a way to recover a repo, i.e. mapping each file in the cache/remote to a new empty file based on its prefix; you'd still need to _ignore_ the cache.

Thanks for your quick response @skshetry!
Then I'll have to find a workaround! :)

@smoosbau just to add my 2cs to this :), primarily to help with a workaround.

As @skshetry mentioned, DVC's default mode is to use a so-called "content-addressable" key-value store to keep files. It's done for a few reasons: deduplication is one of them, another is the ability to version things in the first place, and there are other benefits that you can read about here, for example. If you have data.xml in your repo and you update its content, which one should we store? In DVC we store both; we just rename them in a way that DVC can find them later. Namely, it creates .dvc or dvc.lock/dvc.yaml files to store the information about the initial file name and puts them into Git.
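To make the idea concrete, here is a minimal sketch of a content-addressable store. It is not DVC's actual code: the store function, the two-character subdirectory layout, and the index dictionary (standing in for the .dvc files kept in Git) are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache_dir: Path, content: bytes) -> str:
    """Save content under a key derived from its MD5 hash and return the key."""
    key = hashlib.md5(content).hexdigest()
    # Assumed layout: the first two hex characters form a subdirectory.
    target = cache_dir / key[:2] / key[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)  # identical content always lands on the same path
    return key


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    # The index plays the role of the metadata in Git: it remembers names.
    index = {
        "data.xml (old)": store(cache, b"<data>v1</data>"),
        "data.xml (new)": store(cache, b"<data>v2</data>"),
        "copy-of-data.xml": store(cache, b"<data>v1</data>"),
    }
    stored = sorted(
        p.relative_to(cache).as_posix() for p in cache.rglob("*") if p.is_file()
    )

# Three names, but only two objects in the store, and none carries an extension.
print(len(index), len(stored))
```

Both versions of data.xml survive, the duplicate costs nothing, and the price is that the store itself knows nothing about names or extensions.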

This way Git becomes a place where we keep the information about file names, about extensions, etc. Thus, a few questions:

  • how do we easily get that information from Git in some form that would help to solve your initial problem?
  • is it enough to replace that UI you have already for the storage?

There is no one single answer to this, but I can share some things we have done so far, and would really love to hear your opinion on what is missing, what is your use case, etc:

  1. There is the dvc list command.

You do something like dvc list https://github.com/iterative/example-get-started and you get all the files in that repo, including the DVC-tracked ones:

.gitignore
README.md
data
dvc.lock
dvc.yaml
model.pkl
params.yaml
prc.json
scores.json
src

(Here model.pkl is tracked by DVC, so it is not visible in the UI here.)

Then you could use dvc get, dvc import or dvc.api, which have a similar interface, to access those files.
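For the dvc.api route, a minimal sketch could look like the one below. It assumes DVC is installed (pip install dvc) and that the repository and its remote are reachable; the two wrapper functions are hypothetical helpers, while dvc.api.get_url and dvc.api.read are the library calls they rely on.

```python
def tracked_file_url(path: str, repo: str) -> str:
    """Return the storage URL of a DVC-tracked file, similar to dvc get --show-url."""
    import dvc.api  # imported lazily so this module loads even without DVC installed

    return dvc.api.get_url(path, repo=repo)


def read_tracked_file(path: str, repo: str) -> bytes:
    """Return the raw bytes of a DVC-tracked file, fetched through the repo's remote."""
    import dvc.api

    return dvc.api.read(path, repo=repo, mode="rb")


# Example usage (needs network access, so it is left commented out):
# repo = "https://github.com/iterative/example-get-started"
# print(tracked_file_url("model.pkl", repo))
# model_bytes = read_tracked_file("model.pkl", repo)
```

Either way you ask for the file by its original name and let DVC resolve it to the object in the remote.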

  2. Another way is what the DagsHub guys have done: they built an online UI for DVC repos to view DVC-tracked data in a human-readable way.


  3. The third option is to use external data management in DVC. It's an advanced case. Some motivation and options are described in this post and the last comment to it.

Bottom line: with the default DVC way of organizing projects and data, you don't access data directly. The Git repo becomes the entry point, with all the benefits but also some potential limitations. It would be really great to learn more about your use case and what you think about the options ^^.
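As a rough starting point for a workaround, the names can be recovered from the .dvc files in Git. The sketch below makes several assumptions: the .dvc file is plain YAML with md5 and path entries under outs, the hash in the sample is made up, and the remote stores each object under the first two characters of its hash as a directory. The helper names are hypothetical, and a real script would be better off using a proper YAML parser.

```python
import re

# Hypothetical .dvc file content; the hash is invented for illustration.
SAMPLE_DVC_FILE = """\
outs:
- md5: 0123456789abcdef0123456789abcdef
  path: data.xml
"""


def hash_to_name(dvc_file_text: str) -> dict:
    """Map every md5 listed under outs to the file name recorded next to it."""
    mapping = {}
    entry = {}
    in_outs = False
    for line in dvc_file_text.splitlines():
        if not line.startswith((" ", "-")):
            # A top-level key: remember whether we are entering the outs section.
            in_outs = line.strip() == "outs:"
            continue
        if not in_outs:
            continue
        if line.lstrip().startswith("-"):
            entry = {}  # a new list item starts a new output entry
        match = re.match(r"[-\s]*(md5|path):\s*(\S+)", line)
        if match:
            entry[match.group(1)] = match.group(2)
        if "md5" in entry and "path" in entry:
            mapping[entry["md5"]] = entry["path"]
    return mapping


def object_path(md5: str) -> str:
    """Assumed remote layout: two-character directory, then the rest of the hash."""
    return f"{md5[:2]}/{md5[2:]}"


for md5, name in hash_to_name(SAMPLE_DVC_FILE).items():
    print(object_path(md5), "->", name)
```

Running something like this over all .dvc files in a checkout would give a table from object names in the remote to the original file names, extensions included.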
