Dvc: get: allow downloading regular files/dirs tracked by Git

Created on 19 Sep 2019  Β·  14Comments  Β·  Source: iterative/dvc

As we decide how to list data artifacts in external DVC repos in #2509 it became apparent listing regular files along with stage outputs could be especially useful. See https://github.com/iterative/dvc/issues/2509#issuecomment-532314755.

This made us think also then users could want to get some of those individual regular files after seeing them. (See https://github.com/iterative/dvc/issues/2509#issuecomment-532781756)

c13-half-a-week enhancement p1-important

Most helpful comment

@danihodovic

Have the questions above by @jorgeorpinel been resolved?

Import doesn't support regular files IIRC. Cached directories are supported, so it would be great to support non-cached ones too as a part of this ticket πŸ™‚ Recursive option doesn't exist for any of the commands, so it is not part of this ticket (though I'm not even sure what @jorgeorpinel meant by recursive, maybe he could clarify)

Is it supposed to some sort of compatibility with the future dvc list command? @dmpetrov mentions piping dvc list into dvc get with xargs: #2509 (comment)

Well, it is more about dvc list then dvc get. No need to keep dvc list in mind for this issue, unless I'm missing something.

Should it be possible to pipe the file to stdout for inspection with a pager such as less? Perhaps when -o is not provided. That would change the existing functionality of dvc get.

Not required and I don't really see a need for that functionality. Also, how would that even work with directories? πŸ™‚It is not the functionality meant for dvc get.

As for implementation - could it be as simple as checking if the file provided to dvc get is a not a dvc file and then downloading the file locally? I would inject the logic here:
...
Is there a helper to identify dvc files besides just looking at the file suffix?

Not sure what you mean, Dvcfiles are not supplied to dvc get as an argument, output paths are.

All 14 comments

Cc @shcheklein

Why dvc get and not dvc import though? What about directory/recursive support for this feature?

Thanks

Is there an agreement on this one, @efiop ?

@mroutis Most likely, this will be automatically fixed along the way with https://github.com/iterative/dvc/issues/2507

...this will be automatically fixed along the way with #2507

Per https://github.com/iterative/dvc/pull/2566#issuecomment-538271254, that does not seem to be the case after all, guys.

@jorgeorpinel hence the "most likely" part. As you can see, that PR doesn't have "Fixes" message for this issue. Thanks for the heads up, though.

Just keeping track of updates and validity of my issues.

A few questions regarding this issue:

What protocols should dvc get handle? The examples listed below include both http and ssh. Are there other protocols that need to be considered? Git uses 4 - https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols.

Why dvc get and not dvc import though? What about directory/recursive support for this feature?

Have the questions above by @jorgeorpinel been resolved?

Is it supposed to some sort of compatibility with the future dvc list command? @dmpetrov mentions piping dvc list into dvc get with xargs: https://github.com/iterative/dvc/issues/2509#issuecomment-537310716

Should it be possible to pipe the file to stdout for inspection with a pager such as less? Perhaps when -o is not provided. That would change the existing functionality of dvc get.

As for implementation - could it be as simple as checking if the file provided to dvc get is a not a dvc file and then downloading the file locally? I would inject the logic here: https://github.com/iterative/dvc/blob/3de46dd25f7640f65f54d61b2c303125fc5a780f/dvc/repo/get.py#L21-L64.

Is there a helper to identify dvc files besides just looking at the file suffix?

@danihodovic

What protocols should dvc get handle? The examples listed below include both http and ssh. Are there other protocols that need to be considered? Git uses 4 - https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols.

You mean for dvc gets url argument? Or path argument? Your reference to git makes me think like you are talking about url. If so, we don't really care, as we are passing it to gitpython anyway and whatever it supports - it supports :) But yeah, local, http and ssh are regular ones. But that doesn't really matter for this issue, unless I'm missing something. What would make sense is a path part, and we should only handle local for now, same as for cached arguments. πŸ™‚

@danihodovic

Have the questions above by @jorgeorpinel been resolved?

Import doesn't support regular files IIRC. Cached directories are supported, so it would be great to support non-cached ones too as a part of this ticket πŸ™‚ Recursive option doesn't exist for any of the commands, so it is not part of this ticket (though I'm not even sure what @jorgeorpinel meant by recursive, maybe he could clarify)

Is it supposed to some sort of compatibility with the future dvc list command? @dmpetrov mentions piping dvc list into dvc get with xargs: #2509 (comment)

Well, it is more about dvc list then dvc get. No need to keep dvc list in mind for this issue, unless I'm missing something.

Should it be possible to pipe the file to stdout for inspection with a pager such as less? Perhaps when -o is not provided. That would change the existing functionality of dvc get.

Not required and I don't really see a need for that functionality. Also, how would that even work with directories? πŸ™‚It is not the functionality meant for dvc get.

As for implementation - could it be as simple as checking if the file provided to dvc get is a not a dvc file and then downloading the file locally? I would inject the logic here:
...
Is there a helper to identify dvc files besides just looking at the file suffix?

Not sure what you mean, Dvcfiles are not supplied to dvc get as an argument, output paths are.

Agree with Ruslan. We would basically be using Git to download files in the repo so we will support everything Git does. I didn't get the part about path though, @efiop. The path is always local/internal to the repo's structure, is that what you mean?

So dvc get just has to detect whether this path represents a file in the repo or a file referenced in a DVC-file in the repo (the latter is what's currently supported) – this should help answer your Q about implementation @danihodovic.

Why dvc get and not dvc import though? What about directory/recursive support for this feature?

These Qs haven't really been decided Dani, thanks for checking. I vote to work on dvc get first but this should also apply to dvc import. I also vote to not complicate things by trying to support directories (recursively or not) for now. We can revisit later.

Cached directories are supported.

What do you mean @efiop? Not sure the cache has much to do with this issue. It's really about files hosted in the Git repo vs. files referenced in the DVC-files hosted in the Git repo, right?

Also agree with Ruslan about dvc list. It's related but separate from this, no need to worry about that rn. Agree as well about no need to pipe the file to stdout.

What do you mean @efiop? Not sure the cache has much to do with this issue. It's really about files hosted in the Git repo vs. files referenced in the DVC-files hosted in the Git repo, right?

I was talking about outputs of dvc files that are not cached by dvc, so possibly tracked by git @jorgeorpinel . E.g. dvc run -O dir 'mkdir dir && echo data > dir/data'(notice the big O). The handling sis pretty much the same, see implementation in api.

Agree with Ruslan. We would basically be using Git to download files in the repo so we will support everything Git does. I didn't get the part about path though, @efiop. The path is always local/internal to the repo's structure, is that what you mean?

Yes, that is what I was saying. Note that we might have external outputs (e.g. s3://bucket/data), but those are not supported and shouldn't be for now, so I was just clarifying.

Kudos to @danihodovic for PR #2837 which seems to get most of this resolved, even including support for downloading entire directories from Git. Just pending to check whether dvc import will have the same capabilities after that PR (of if we need a follow-up issue for it).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

GildedHonour picture GildedHonour  Β·  3Comments

anotherbugmaster picture anotherbugmaster  Β·  3Comments

dmpetrov picture dmpetrov  Β·  3Comments

mdscruggs picture mdscruggs  Β·  3Comments

mfrata picture mfrata  Β·  3Comments