As we decide how to list data artifacts in external DVC repos in #2509 it became apparent listing regular files along with stage outputs could be especially useful. See https://github.com/iterative/dvc/issues/2509#issuecomment-532314755.
This made us think also then users could want to get
some of those individual regular files after seeing them. (See https://github.com/iterative/dvc/issues/2509#issuecomment-532781756)
Cc @shcheklein
Why dvc get
and not dvc import
though? What about directory/recursive support for this feature?
Thanks
Related to https://github.com/iterative/dvc/issues/2507
Is there an agreement on this one, @efiop ?
@mroutis Most likely, this will be automatically fixed along the way with https://github.com/iterative/dvc/issues/2507
...this will be automatically fixed along the way with #2507
Per https://github.com/iterative/dvc/pull/2566#issuecomment-538271254, that does not seem to be the case after all, guys.
@jorgeorpinel hence the "most likely" part. As you can see, that PR doesn't have "Fixes" message for this issue. Thanks for the heads up, though.
Just keeping track of updates and validity of my issues.
A few questions regarding this issue:
What protocols should dvc get
handle? The examples listed below include both http and ssh. Are there other protocols that need to be considered? Git uses 4 - https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols.
Why dvc get and not dvc import though? What about directory/recursive support for this feature?
Have the questions above by @jorgeorpinel been resolved?
Is it supposed to some sort of compatibility with the future dvc list
command? @dmpetrov mentions piping dvc list
into dvc get
with xargs: https://github.com/iterative/dvc/issues/2509#issuecomment-537310716
Should it be possible to pipe the file to stdout for inspection with a pager such as less
? Perhaps when -o
is not provided. That would change the existing functionality of dvc get.
As for implementation - could it be as simple as checking if the file provided to dvc get
is a not a dvc file and then downloading the file locally? I would inject the logic here: https://github.com/iterative/dvc/blob/3de46dd25f7640f65f54d61b2c303125fc5a780f/dvc/repo/get.py#L21-L64.
Is there a helper to identify dvc files besides just looking at the file suffix?
@danihodovic
What protocols should dvc get handle? The examples listed below include both http and ssh. Are there other protocols that need to be considered? Git uses 4 - https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols.
You mean for dvc get
s url
argument? Or path
argument? Your reference to git makes me think like you are talking about url
. If so, we don't really care, as we are passing it to gitpython anyway and whatever it supports - it supports :) But yeah, local, http and ssh are regular ones. But that doesn't really matter for this issue, unless I'm missing something. What would make sense is a path
part, and we should only handle local
for now, same as for cached arguments. π
@danihodovic
Have the questions above by @jorgeorpinel been resolved?
Import doesn't support regular files IIRC. Cached directories are supported, so it would be great to support non-cached ones too as a part of this ticket π Recursive option doesn't exist for any of the commands, so it is not part of this ticket (though I'm not even sure what @jorgeorpinel meant by recursive, maybe he could clarify)
Is it supposed to some sort of compatibility with the future dvc list command? @dmpetrov mentions piping dvc list into dvc get with xargs: #2509 (comment)
Well, it is more about dvc list
then dvc get
. No need to keep dvc list
in mind for this issue, unless I'm missing something.
Should it be possible to pipe the file to stdout for inspection with a pager such as less? Perhaps when -o is not provided. That would change the existing functionality of dvc get.
Not required and I don't really see a need for that functionality. Also, how would that even work with directories? πIt is not the functionality meant for dvc get
.
As for implementation - could it be as simple as checking if the file provided to dvc get is a not a dvc file and then downloading the file locally? I would inject the logic here:
...
Is there a helper to identify dvc files besides just looking at the file suffix?
Not sure what you mean, Dvcfiles are not supplied to dvc get
as an argument, output paths are.
Agree with Ruslan. We would basically be using Git to download files in the repo so we will support everything Git does. I didn't get the part about path
though, @efiop. The path is always local/internal to the repo's structure, is that what you mean?
So dvc get
just has to detect whether this path represents a file in the repo or a file referenced in a DVC-file in the repo (the latter is what's currently supported) β this should help answer your Q about implementation @danihodovic.
Why
dvc get
and notdvc import
though? What about directory/recursive support for this feature?
These Qs haven't really been decided Dani, thanks for checking. I vote to work on dvc get
first but this should also apply to dvc import
. I also vote to not complicate things by trying to support directories (recursively or not) for now. We can revisit later.
Cached directories are supported.
What do you mean @efiop? Not sure the cache has much to do with this issue. It's really about files hosted in the Git repo vs. files referenced in the DVC-files hosted in the Git repo, right?
Also agree with Ruslan about dvc list
. It's related but separate from this, no need to worry about that rn. Agree as well about no need to pipe the file to stdout.
What do you mean @efiop? Not sure the cache has much to do with this issue. It's really about files hosted in the Git repo vs. files referenced in the DVC-files hosted in the Git repo, right?
I was talking about outputs of dvc files that are not cached by dvc, so possibly tracked by git @jorgeorpinel . E.g. dvc run -O dir 'mkdir dir && echo data > dir/data'
(notice the big O). The handling sis pretty much the same, see implementation in api.
Agree with Ruslan. We would basically be using Git to download files in the repo so we will support everything Git does. I didn't get the part about path though, @efiop. The path is always local/internal to the repo's structure, is that what you mean?
Yes, that is what I was saying. Note that we might have external outputs (e.g. s3://bucket/data), but those are not supported and shouldn't be for now, so I was just clarifying.
Kudos to @danihodovic for PR #2837 which seems to get most of this resolved, even including support for downloading entire directories from Git. Just pending to check whether dvc import
will have the same capabilities after that PR (of if we need a follow-up issue for it).
Most helpful comment
@danihodovic
Import doesn't support regular files IIRC. Cached directories are supported, so it would be great to support non-cached ones too as a part of this ticket π Recursive option doesn't exist for any of the commands, so it is not part of this ticket (though I'm not even sure what @jorgeorpinel meant by recursive, maybe he could clarify)
Well, it is more about
dvc list
thendvc get
. No need to keepdvc list
in mind for this issue, unless I'm missing something.Not required and I don't really see a need for that functionality. Also, how would that even work with directories? πIt is not the functionality meant for
dvc get
.Not sure what you mean, Dvcfiles are not supplied to
dvc get
as an argument, output paths are.