Especially useful for "browsing" external DVC projects on Git hosting before using `dvc get` or `dvc import`. Looking at the Git repo doesn't show the artifacts because they're only referenced in DVC-files (which can be found anywhere), not tracked by Git.

Perhaps `dvc list` or `dvc artifacts`? (Or/and both `dvc get list` and `dvc import list`.)
As mentioned in https://github.com/iterative/dvc.org/pull/611#discussion_r324998285 and other discussions.
UPDATE: Proposed spec (from https://github.com/iterative/dvc/issues/2509#issuecomment-533019513):
```
usage: dvc list [-h] [-q | -v] [--recursive [LEVEL]] [--rev REV | --versions]
                url [target [target ...]]

positional arguments:
  url         URL of Git repository with DVC project to download from.
  target      Paths to DVC-files or directories within the repository to list
              outputs for.
```
UPDATE: Don't forget to update docs AND tab completion scripts when this is implemented.
+1 for `dvc list` :slightly_smiling_face:
@efiop @jorgeorpinel another option is to do `dvc ls`, and it should behave exactly like a regular `ls` or `aws s3 ls`. Show _all_ the files (including hidden data) by specifying a Git url. This way you can control the scope to show (by not going into all directories by default) - also, you can see your data in the context (other files), with an option to filter them out.

On the other hand, it can be good to show just a list of all DVC outputs. It can be done with `dvc ls --recursive --outputs-only`, for example.

What do you think?
In general I'm +100 for `dvc list` or something similar :)

Clarification: `dvc list` should work on DVC repositories. E.g. `dvc list https://github.com/iterative/dvc` should list `scripts/innosetup/dvc.ico`, etc.
@efiop can we change the priority to p1 as we discussed because it's part of the get/import epic story?
@shcheklein Sure, forgot to adjust that. Thanks for the heads up!
@shcheklein I'm not sure about the `dvc ls` name if all we want is to list data artifacts specifically (not all files, as you propose).

> ...you can see your data in the context (other files) with an option to filter them out.

Will most users need this? The problem I think we need to solve is that it's hard to use `dvc get` and `dvc import` without a list of available data artifacts. Showing all Git-controlled files may or may not be useful, but can be done by existing means.

> by specifying a Git url...

What I get from that is: `dvc list [url]`, where `url` can be a file system path or HTTP/SSH URL to a Git repo containing a DVC project (same as the `dvc get` and `dvc import` argument `url`) – and if omitted, it tries to default to the local repo (containing `pwd`).
> This way you can control the scope to show (by not going into all directories by default)...

I don't see how just the Git repo URL can control the scope to show. Would need a `path` argument for this, I think (which could only accept directories), and/or a `--depth` option.

> dvc ls --recursive --outputs-only

Too complicated for users to remember if the main purpose of this new command is to list artifacts. Maybe `--recursive` though... (`aws s3 ls`, for example, isn't recursive by default, but has that option.)

In summary, I think just `dvc list` or `dvc outs` is best.

But I agree we would need to consider the case where there are lots of outputs. Another solution (besides a `path` arg and/or `--depth`, `--recursive` options) could be default pagination (and possibly interactive scrolling with ⬆️ ⬇️ – like a stdout pipe to `less`).

`aws s3 ls`, on the other hand, takes a simple approach: it has a hard limit of 1000 objects.
> Clarification: `dvc list` should work on dvc repositories. E.g. `dvc list https://github.com/iterative/dvc` should list `scripts/innosetup/dvc.ico`, etc.
@efiop yes, exactly. Extended example:

```
$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_left.bmp
scripts/innosetup/dvc_up.bmp
```

This makes me think, maybe an output that combines the DVC-files (similar to `dvc pipeline list`) with their outputs could be most informative (providing some of the context Ivan was looking for). Something like:

```
$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico (from scripts/innosetup/dvc.ico.dvc)
scripts/innosetup/dvc_left.bmp (from scripts/innosetup/dvc_left.bmp.dvc)
scripts/innosetup/dvc_up.bmp (from scripts/innosetup/dvc_up.bmp.dvc)
```

UPDATE: Thought of yet another 😓 name for the command above: `dvc stage list --outs`
@jorgeorpinel I think showing the full project in an ls-way is just more natural, as opposed to creating our own type of output. There are a few benefits:

1. No need to learn a special `dvc list` output. Instead you just use dvc and see the workspace, and can filter it if needed.
2. It looks like it's beneficial for `dvc get` to handle regular Git files. Why not? It can be useful.
3. Single place in CLI, no need to go to Github to get the full picture. People know `ls` and intuitively can expect the result.

The idea is that by default it's not recursive, and it's not limited to outputs only. You go down on your own if you need, by clarifying the path - the same way you do with `ls`, `aws s3 ls`, etc.

`ls`, `aws ls`, etc. - they are all not recursive by default for a reason. In a real case the output can be huge and just won't make sense for you. People tend to go down level by level, or use the recursive option when it's exactly clear that's what they need.

I really don't like making piping, less, and complex interfaces part of the tool. You can always use `less` if it's needed.
@shcheklein I concur only with benefit #1, so I can agree with showing all files, but with an optional flag (which can be developed later, with less priority), not as the default behavior. Thus I wouldn't call the command `dvc ls` – since it's not that similar to GNU `ls`. I would vote for `dvc outs` or `dvc list`.

> 2. It looks like it's beneficial for `dvc get` to handle regular Git files. Why not?

Because it can already be done with `git archive` (as explained in this SO answer).

> 3. ...single place in CLI, no need to go to Github to get the full picture.

The "full picture" could still be achieved from CLI by separately using `git ls-tree`.

> The idea is that by default it's not recursive... You go down on your own if you need by clarifying path...

I can also agree with this: non-recursive by default sounds easier to implement. So, it would definitely also need an optional `dir` (path) argument and a `--recursive` option.

> I really don't like making piping and less and complex interfaces part of the tool.

Also agree. I just listed it as an alternative.

Anyway, unless anyone else has relevant comments, I suggest Ivan decides the spec for this new command based on all the comments above, so we can get it onto a dev sprint.
Anyway, unless anyone else has relevant comments, I suggest Ivan decides the spec for this new command based on all the comments above, so we can get it onto a dev sprint.
p.s. my (initial) updated spec proposal is:

```
usage: dvc list [-h] [-q | -v] [--recursive] [--rev REV | --versions]
                url [target [target ...]]

positional arguments:
  url         URL of Git repository with DVC project to download from.
  target      Paths to DVC-files or directories within the repository to list
              outputs for.
```
I think that this command (`dvc list`) should be able to get a revision, like `--rev tag1`.

Its output should list the checksums of the output files on the given revision, as well as the file path and name. This would also allow an external tool to construct the right URL for downloading a certain revision of a datafile from remote storage.
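For context on why a checksum is enough to build such a URL: DVC's cache (which remotes mirror) stores each file under a path derived from its hash, with the first two hex characters as a directory name and the rest as the file name. A rough sketch of the kind of URL builder an external tool could use — the function name and the remote base URL are made up for illustration, not part of any DVC API:

```python
def remote_url(remote_base, md5):
    """Build a download URL for a cached file from its MD5 checksum.

    Assumes the standard DVC cache layout: the first two hex chars of
    the hash form a directory, the remainder is the file name.
    """
    return f"{remote_base.rstrip('/')}/{md5[:2]}/{md5[2:]}"

# e.g. for a hypothetical HTTP-exposed remote:
url = remote_url("https://bucket.s3.amazonaws.com/dvc-cache",
                 "d41d8cd98f00b204e9800998ecf8427e")
# -> https://bucket.s3.amazonaws.com/dvc-cache/d4/1d8cd98f00b204e9800998ecf8427e
```

This is exactly the "internal logic" concern raised later in the thread: the layout is an implementation detail, which is an argument for having DVC generate such URLs itself.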
This may be related: https://github.com/iterative/dvc/issues/2067
@jorgeorpinel @dashohoxha Few comments:

- `outs` is a terrible name. `output` is a super-confusing term when you want to list your `input` datasets :)
- `git-ls-tree`, other tools - that's the point: you will need two or more tools to get the full picture, and you will have to mentally do this exercise of merging lists every time. Also, don't forget that we are dealing with a remote repo - `git-ls-tree` won't even solve this problem, unless you clone the repo first, right?
- The same with `git-archive`: Github does not support it. And why would I want to know about it at all? Why don't we allow downloading/importing any files?
- Re `ls` similarity: in my mind it's what `aws s3 ls` is doing, or `hadoop fs ls` and other similar systems. They utilize common and familiar names, and that's why it's important for it to behave the same way as `ls`. I don't see any reason why it can't be implemented this way.
- Re checksums: we don't want users to construct URLs, because this way we expose internal logic. I think it can be part of `dvc get` to generate a URL for you, and part of the `dvc.api`.

> (`dvc list`) should be able to get a revision, like `--rev tag1`...

Interesting. Not sure it's central to the command, because you're not supposed to have knowledge about the repo when you use `dvc list`. But yes, it would definitely be a useful advanced feature to have. Maybe also a flag to list available revisions for a specified output (e.g. `dvc list --versions model.pkl`).
> outs is a terrible name...

You're probably right, but I'm just using our own terminology 😋 We would be listing outputs, basically. Anyway, I can agree with `dvc list` or even `dvc ls` if we go your route and list all files by default.

> the same with git-archive. Github does not support it...

There are special `https://raw.githubusercontent.com/user/repository/branch/filename` URLs you could construct and `wget` for GitHub. But yeah, I guess it wouldn't hurt that `dvc get` can download regular files from any Git host. Opened #2515 separately.

> Re checksums - we don't want users to construct URLs because this way we expose internal logic... I think it can be part of the `dvc get` to generate an URL for you and part of the `dvc.api`.
Do you have in mind something like this?

```
# return a list of data files, along with the corresponding `.dvc` file
dvc get <url> list

# download a datafile
dvc get <url> <path/to/datafile>

# with option '-h, --show-hashes' display the corresponding hashes as well
dvc get <url> list --show-hashes

# with option '-d, --show-download-url' display the corresponding URL
# for direct download from the default remote storage
# (maybe depends on the type of the remote storage)
dvc get <url> list --show-download-url

# limit listing only to certain DVC-files
dvc get <url> list <file1.dvc> <file2.dvc>

# etc.
```

This would be useful too for an external tool.
However, when I think about a `dvc list` command, I have in mind something like this:

- It should work both for a local workspace and for a remote (Git) workspace. So, it needs an option like `-g, --git-repo <url-of-git-repo>`, where `<url-of-git-repo>` may also be a local path, like `/path/to/git/repo/`. If this option is not given, the default is the current workspace/repository.
- It should get some DVC-files as targets, and list the outputs of only these targets. If no targets are given, then all the DVC-files in the current directory are used as targets. If a given target is a directory, then all the DVC-files inside it are used as targets. If the option `-R, --recursive` is given as well, then the search for DVC-targets inside the given directory will be recursive.
- The output should normally include the name of the DVC-file and the path of the datafile/output. However, with the option `-h, --show-hashes` it should display the corresponding hash as well.
- With the option `-d, --show-download-url` it should display the URL for direct download from the default remote storage. (Maybe it should check as well whether this file is available on the remote storage, and return `NULL`/`not-available` if it is not?) Maybe it should get the option `-r, --remote` to use another storage instead of the default one.
- With the option `--rev` it should show data from a certain revision/tag/branch. If this option is missing, then the current state of the project is used, and this should work with `--no-scm` as well.

If you have something else in mind, please feel free to disregard these suggestions.
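The target-expansion rules described above (no targets → DVC-files in the current directory; a directory target → DVC-files inside it; `-R, --recursive` → walk subdirectories too) could be sketched roughly like this. The function and parameter names are invented for illustration; this is not DVC's actual implementation:

```python
import os

def expand_targets(root, targets=None, recursive=False):
    """Resolve user-supplied targets into a list of DVC-file paths.

    - no targets: DVC-files directly in `root`
    - directory target: DVC-files inside it (one level, or all levels
      with recursive=True)
    - file target: used as-is if it looks like a DVC-file
    """
    targets = targets or ["."]  # default: current directory
    dvc_files = []
    for target in targets:
        path = os.path.join(root, target)
        if os.path.isdir(path):
            if recursive:
                for dirpath, _dirs, files in os.walk(path):
                    dvc_files += [os.path.join(dirpath, f)
                                  for f in files if f.endswith(".dvc")]
            else:
                dvc_files += [os.path.join(path, f)
                              for f in os.listdir(path) if f.endswith(".dvc")]
        elif path.endswith(".dvc"):
            dvc_files.append(path)
    return sorted(dvc_files)
```

For example, `expand_targets(repo_dir)` would return only the top-level DVC-files, while `expand_targets(repo_dir, recursive=True)` would also pick up ones buried in subdirectories.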
> # ...list of data files, along with the corresponding `.dvc` file
> dvc get <url> list

This would actually need to be `dvc get list [url]` to adhere to DVC syntax, but no, we're talking about a new, separate command (`dvc list`), not a new subcommand for `dvc get`. (It also affects `dvc import`, for example.)

Also, I think we've established we want the list to be of all the regular files along with "data files" (outputs and their DVC-files), not just the latter.

> dvc get <url> list --show-hashes ... dvc get <url> list --show-download-url

Please open a separate issue to decide on adding new options to `dvc get`, @dashohoxha.

> # limit listing only to certain DVC-files
> dvc get <url> list <file1.dvc> <file2.dvc>

This could be useful, actually. Optionally specifying target DVC-files to the `dvc list` command could be an alternative for limiting the number of results.
OK I hid some resolved comments and updated the proposed command spec in https://github.com/iterative/dvc/issues/2509#issuecomment-533019513 based on all the good feedback. Does it look good to get started on?
@jorgeorpinel sounds good to me, let's move the spec to the initial message?

Also, it would be great to specify how the output looks in different cases. Again, I suggest it behave the same way as if the Git repo is a bucket you are listing files in with `aws s3 ls`. Maybe it would be helpful to come up with a few different outputs and compare them.
OK, spec moved to initial issue comment.

> specify how does output look like in different cases...

I also think copying the `aws s3 ls` output is a good starting point.
Using file system paths as `url`:

See full example outputs in the "dvc list proposal: output examples" Gist.

```
$ cd ~
$ dvc list    # Default url = .
ERROR: The current working directory doesn't seem to be part of a DVC project.

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc list .
INFO: Listing LOCAL DVC project files, directories, and data at
/home/uname/example-get-started/
 17B 2019-09-20 .gitignore
6.0K 2019-09-20 README.md
...
339B 2019-09-20 train.dvc
5.8M └ out: (model.pkl)
...

$ dvc pull featurize.dvc
$ dvc list featurize.dvc    # With target DVC-file
...
INFO: Limiting list to data outputs from featurize.dvc stage.
367B 2019-09-20 featurize.dvc
2.7M └ out: data/features/test.pkl
 11M └ out: data/features/train.pkl
```

> NOTE: The latter case brings up several questions about how to display outputs located in a different dir than the `target` DVC-file, especially vs. using that location as the `target` instead. In this case I listed them even without `--recursive`.

Note that the `.dvc/` dir is omitted from the output. Also, the dates above come from the file system, same as `ls`. In the following examples, they come from git history.
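Aside: the `17B`/`6.0K`/`5.8M` size column above follows `ls -lh` conventions. A minimal formatter producing that style — thresholds chosen to mimic coreutils behavior, not taken from DVC code:

```python
def human_size(nbytes):
    """Format a byte count the way `ls -lh` does: 17B, 6.0K, 5.8M, ..."""
    for unit in ["B", "K", "M", "G", "T"]:
        if nbytes < 1024:
            # one decimal place only below 10 of a unit (e.g. 6.0K vs 11M)
            if unit != "B" and nbytes < 10:
                return f"{nbytes:.1f}{unit}"
            return f"{round(nbytes)}{unit}"
        nbytes /= 1024
    return f"{round(nbytes)}P"

# human_size(17) -> "17B"; human_size(6 * 1024) -> "6.0K"
```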
Using network Git `url`s:

See full example outputs in the "dvc list proposal: output examples" Gist.

```
$ dvc list git@github.com:iterative/example-get-started.git    # SSH URL
 17B 2019-09-03 .gitignore
...
339B 2019-09-03 train.dvc
5.8M └ out: model.pkl

$ dvc list https://github.com/iterative/dataset-registry    # HTTP URL
1.9K 2019-08-27 README.md
160B 2019-08-27 get-started/
128B 2019-08-27 tutorial/

$ dvc list --recursive https://github.com/iterative/dataset-registry tutorial    # Recursive inside target dir
...
INFO: Expanding list recursively.
 29B 2019-08-29 tutorial/nlp/.gitignore
178B 2019-08-29 tutorial/nlp/Posts.xml.zip.dvc
 10M └ out: tutorial/nlp/Posts.xml.zip
177B 2019-08-29 tutorial/nlp/pipeline.zip.dvc
4.6K └ out: tutorial/nlp/pipeline.zip
...
```

> NOTE: Another question is whether outputs having no date is OK (just associated to the date of their DVC-files), or whether we should also get that from the default remote. Or maybe we don't need dates at all...

Going through these made me realize this command can easily get very complicated, so all feedback is welcome to try and simplify it as much as possible, to a point where it still resolves the main problem (listing project outputs in order to know what's available for `dvc get` and `dvc import`) but doesn't explode in complexity.
@jorgeorpinel this is great! can you make it as a gist or a commit - so that we can leave some comments line by line?
OK, added to https://gist.github.com/jorgeorpinel/61719795628fc0fe64e04e4cc4c0ca1c and updated https://github.com/iterative/dvc/issues/2509#issuecomment-533685061 above.
@jorgeorpinel I can't comment on a specific line on them. I wonder if there is a better tool for this? Create a temporary PR with these files to dvc.org?
As a first step, we could simply print lines with all outputs. E.g.

```
$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_up.bmp
scripts/innosetup/dvc_left.bmp
```

and then move on to polishing.

We should plan the polishing in advance a bit, so that we won't be breaking the released stuff. Defaults, available options, etc.

Btw, when we just print outputs this way, wouldn't it be the same as `dvc pipeline show --outs`? 🤔
By default, the output should be simple and consumable by tools. It will give us the ability to chain commands like

```
dvc list https://github.com/iterative/dvc | grep '\.pth$' | xargs dvc get https://github.com/iterative/dvc
```

(yeah, I'm pretty sure the current `dvc get` does not support this yet. EDITED: actually it does, through `xargs`.)

For the same reason, headers like `INFO: Expanding list recursively.` are not needed.

Nice-looking lists with hierarchies like `└ out: tutorial/nlp/pipeline.zip` should be supported under a `--human-readable` option. And this can be postponed.
@shcheklein good point re `dvc pipeline show --outs`. But this command does not support remotes (like `dvc pipeline show https://github.com/iterative/dvc --outs`).

Also, based on my understanding, `pipeline show` does some additional stuff - it sorts outputs by order in the pipeline (the last output is always the last output in the pipeline). Please correct me if that is not correct.
> But this command does not support remotes

@dmpetrov yeah, it was just a note that we kind of have a command that outputs DVC-specific info (as opposed to general ls-like output). I have no idea how it can affect any decisions in this ticket. Was sharing in case someone else has something to say about it.

> the output should be simple and consumable by tools

If we follow unix tools - a lot of them (probably most of them?) are not made to be script-consumable by default. You would need to run `ls -1` to achieve a simple one-item-per-line format:

> Force output to be one entry per line. This is the default when output is not to a terminal.

`aws s3 ls` is only somewhat consumable. `find` is not consumable, as far as I remember.

Also, I believe it's a wider topic (similar to #2498). Ideally we should have some agreed-upon approach we are taking, something like this: https://devcenter.heroku.com/articles/cli-style-guide#human-readable-output-vs-machine-readable-output. And this approach should be (reasonably) consistent across all commands.
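The `ls` trick quoted above (one entry per line becomes the default when output is not a terminal) is easy to reproduce: check whether stdout is a TTY and switch formats accordingly. A sketch of how a `dvc list`-style command could do the same — the function is hypothetical, not from DVC:

```python
import sys

def format_listing(entries, force_plain=False):
    """Columnar output for humans at a terminal; one path per line for pipes.

    Mirrors GNU ls behavior, where `-1` is implied when stdout is
    redirected, so `dvc list ... | grep ...` would stay scriptable
    without any extra flag.
    """
    if force_plain or not sys.stdout.isatty():
        return "\n".join(entries)
    # naive columnar layout for interactive use
    return "  ".join(entries)
```

With this approach, piping into `grep` or `xargs` would get plain output automatically, while interactive use could keep the prettier layout.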
> should be supported under --human-readable

See above. I actually don't remember, off the top of my head, any tools that take this approach. The only common convention is `-h`, but it does not change the layout of the output; it changes the details.

> And this can be postponed.

Agreed that we can postpone adding nice details (file sizes, DVC-file per output, etc). But I think it's very important to decide on the default behavior: `ls` or `aws ls` vs a DVC-specific interface focusing on all outputs. Again, I don't see a reason to do this - everyone is familiar with `ls` or `aws s3 ls` behavior. It'll be way easier to explain, and easier to alter the behavior with an option.

To answer these behavior and UI questions, we need to return to the original motivation:
> "browsing" external DVC projects on Git hosting before using `dvc get` or `dvc import`. Looking at the Git repo doesn't show the artifacts because they're only referenced in DVC-files (which can be found anywhere), not tracked by Git.

First of all, I understand this pain of not seeing data files in my Git repo.

To my mind, a list of outputs will answer the question. The list of the other Git files (like `README.rst`) seems not relevant, and even introduces noise to the answer. Do we need to go recursively - hm.. probably yes, since most of the data files are hidden under `input`, `data`, `models` directories. We can avoid the need for multiple `dvc list`s. More than that, we have to go recursively (technically), because some DVC-file from a subfolder can generate outputs into the root of the repo.
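That last point is worth spelling out: an output path in a DVC-file is resolved relative to the stage file's own location, so mapping outputs back to workspace paths needs a pass over all stage files, wherever they live. A rough sketch, with stage files represented as already-parsed dicts carrying an `outs` list (this mirrors the general shape of DVC-file YAML but is simplified for illustration):

```python
import os

def collect_outputs(stages):
    """Map workspace-relative output paths to the DVC-file defining them.

    `stages` maps a DVC-file path to its parsed contents; out paths are
    taken to be relative to the stage file's directory (simplified).
    """
    outputs = {}
    for dvc_file, contents in stages.items():
        base = os.path.dirname(dvc_file)
        for out in contents.get("outs", []):
            workspace_path = os.path.normpath(os.path.join(base, out["path"]))
            outputs[workspace_path] = dvc_file
    return outputs

# A stage in a subfolder producing an output in the repo root:
stages = {
    "sub/train.dvc": {"outs": [{"path": "../model.pkl"}]},
    "featurize.dvc": {"outs": [{"path": "data/features"}]},
}
outs = collect_outputs(stages)
# "model.pkl" maps back to "sub/train.dvc" even though it sits in the root
```

This is why a directory-scoped, non-recursive listing can silently miss outputs whose defining DVC-file lives elsewhere.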
New questions that I just got:

- Is `dvc list` the right command name? Why not `dvc find` or `dvc whereis`? Do we need both `dvc list` (show all files) & `dvc whereis` (I don't like this idea, but still)?
- Should it show Git committed outputs?

Some details regarding the output formatting... Yes, in some cases it makes sense to give preference to human-readable commands. Like your Heroku example, where a complex/hierarchical output was produced.
But in this case, we are talking about a list of files which is human readable already. I see no benefits of introducing additional complexity (by default). If we decide that type of the file (Git\Remote\metrics) is important to see by default then it can be a reason to use a more complicated format. So, I'm not against this I just see no value.
> if we follow unix tools - a lot of them (probably most of them?) are not made to be scripts consumable by default.

I don't agree with this.

> I don't remember actually from the top of my head any of the tools that are taking this approach.

`aws s3 ls --human-readable`, `du -h *`

> `aws s3 ls` is only somewhat consumable. `find` is not consumable as far as I remember.

`aws s3 ls` is actually not the best example. It has quite a terrible UX. `find` output is very consumable, as well as `ls` (it does the hack internally) and most of the unix commands.

> if we follow unix tools - a lot of them (probably most of them?) are not made to be scripts consumable by default

Even if it is so, most of the unix tools have an option to produce scriptable output. For example, `git status` and `git status --porcelain`.
Regarding the scriptable output: `--porcelain`, `--json`, etc. - that is exactly the approach I was mentioning. The point is that most of the tools are not made to be scriptable by _default_. That's exactly the point of the link I shared:

> Terse, machine-readable output formats can also be useful but shouldn't get in the way of making beautiful CLI output. When needed, commands should offer a --json and/or a --terse flag when valuable to allow users to easily parse and script the CLI.

Btw, Git porcelain explains very well why it's better to provide an option for script-consumable outputs:

> Give the output in an easy-to-parse format for scripts. This is similar to the short output, but will remain stable across Git versions and regardless of user configuration. See below for details.

A _stable_ API - it'll be way easier to maintain and evolve the output later if we use an option.

So: `ls` is doing some magic and by default does not sacrifice the human UI; `heroku` and `git` have an option; `find` is somewhat consumable, but usually you should be providing `-print0`; `aws` is somewhat consumable - in the Heroku guide's sense.
> I wonder if there is a better tools for this? create a temporary PR with these files to dvc.org?

@shcheklein good idea. Here's one: https://github.com/iterative/dvc/pull/2569/files

> For the same reason headers like `INFO: Expanding list recursively.` are not needed...

Right, `INFO`/`WARNING` etc. logs/messages could be included in the `-v` (verbose) output, though.
> agreed that we can postpone adding nice details (file sizes, DVC-file per output, etc). But I think it's very important to decide on the default behavior...

We should definitely prioritize the features of this command, but I also think a good initial design can save lots of time in the future. Let's not go overkill though, because once we're actually implementing, I'm sure a lot of new questions will come up.

As for the default behavior, I think it should list all files (Ivan convinced me here: https://github.com/iterative/dvc/issues/2509#issuecomment-532781756), and with level 2 or 3 recursion to catch _most_ buried DVC-files, but warn when there are DVC-files in even deeper levels. It also needs a length limit, like 100 files (especially if we decide on full recursion as default) – not sure how to list more if the limit is reached.
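The level-2/3 default, the deeper-DVC-files warning, and the 100-entry cap from the previous paragraph could be combined in one walk. The numbers come from the comment above; the function itself is only an illustration of the idea, not a proposed implementation:

```python
import os

def limited_list(root, max_depth=2, limit=100):
    """List files at most `max_depth` directory levels below `root`.

    Truncates at `limit` entries and flags whether any DVC-files exist
    deeper than `max_depth` (so the command could print a warning).
    """
    results, truncated, deeper_dvc = [], False, False
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth > max_depth:
            # too deep to list; just note whether DVC-files would be missed
            deeper_dvc = deeper_dvc or any(f.endswith(".dvc") for f in filenames)
            continue
        for f in sorted(filenames):
            if len(results) >= limit:
                truncated = True
                break
            results.append(f if rel == "." else os.path.join(rel, f))
    return results, truncated, deeper_dvc
```

Returning the `truncated` and `deeper_dvc` flags separately keeps the listing itself clean while still letting the CLI warn about what was cut off.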
> DVC-file from a subfolder can generate outputs into the root of the repo.

Yes... This is confusing, and it's hard to figure out how to display this in `dvc list` if we show the relationship between DVC-files and output files, so perhaps that's one of the features with less priority.

> Should it show Git committed outputs?

Good Q. I think it would be important to mark any stage output that is Git-tracked in the output of `dvc list`.

Discussed briefly with @dmpetrov, and I think we agreed on no recursion, all-files mode by default. Then, `-o` and `-r` options to filter the list as you need. @dmpetrov could you chime in, please?
I'd prefer `ls-files` to match `git` more closely.

FYI, `git ls-files` accepts targets like:

- `.`
- `'*.py'`
- `*.py` (assumes shell expansion; git will exclude untracked file arguments)

@casperdcl interesting point.
In what cases do you use `git ls-files` vs regular `ls`? I think they serve a bit different purposes, and to me intuitively `dvc list` or `dvc ls` (the name does not matter that much) serves the role of `aws s3 ls` or `ls`. There are chances that the default behavior for `ls-files` is done this way to augment `ls`.

`git ls-files` sounds like an advanced Git-specific command. And if we go that path, we should probably indeed do a recursive list of outputs (files that are under DVC control) - but that will create some disconnect with `dvc get` and `dvc import` (which now serve the roles of wget, cp, aws s3 cp, etc.) that can deal with regular files now. Also, not sure how it will work for files inside outputs (when I want to see some specific images inside a specific output).
Cases where I use `git ls-files`:

- there are untracked files (e.g. excluded via `.gitignore`) and I only want to `ls` the tracked ones
- checking what `git archive` would include
- `-x` and `-X` patterns for systems without `grep`
- the `-t` flag: also marks file status
flag: also marks file status@casperdcl so the cases you are describing are indeed very different from ls
. They are specific to Git workflow. In case of this new command, my feeling that it's better to compare it with ls
or even aws s3 ls
since we kind-a take the ability to access buckets directly away.
I'd argue that dvc ls-files
listing only dvc-tracked files would be very valuable. I don't see much value in implementing something simpler which is a minor modification to ls
so, that's the point the is no ls
for us and a regular one does not work. It's exactly the same reason why aws s3 ls
exists. Regular ls
won't work for a remote Git repo, right? May be that's where some misunderstanding comes from. The idea is that it should enable some discoverability for DVC projects and you supposed to use it like:
dvc list https://github.com/iterative/example-get-started
and then:
dvc get https://github.com/iterative/example-get-started README.md
or
dvc list https://github.com/iterative/example-get-started data
dvc import https://github.com/iterative/example-get-started data/data.xml
Ah right. Remote listing pre-download is different. I was talking about listing local files locally. I guess they'll have different implementations, but it would be nice to have them both under the same command (I'd prefer `ls-files`, but I'm fine with `ls` or `list` too).

I think `dvc list .` will work fine in this case. It'll show all files tracked by Git + expansion of DVC-files to show outputs. So it is a DVC-specific ls-files in this sense. I would still prefer to have defaults like in `ls` - no recursion, show all files - Git-tracked + DVC outs.

`dvc list .` is an edge case for this command. The main usage is the remote one, and I would really like to see just a list of files, the same way I do with `ls` (+ some options to refine and filter if needed).
I'm going to work on the issue
Reopening since there are some details that we need to follow up on.
@efiop but we have #3381 for that. Maybe OK to close this one?
@jorgeorpinel Ok, let's close then :slightly_smiling_face: Thanks for the heads up!