Dvc: new command to list data artifacts in a DVC project

Created on 17 Sep 2019  ·  45 Comments  ·  Source: iterative/dvc

Especially useful for "browsing" external DVC projects on Git hosting before using dvc get or dvc import. Looking at the Git repo doesn't show the artifacts because they're only referenced in DVC-files (which can be found anywhere), not tracked by Git.

Perhaps dvc list or dvc artifacts? (And/or both dvc get list and dvc import list?)

As mentioned in https://github.com/iterative/dvc.org/pull/611#discussion_r324998285 and other discussions.


UPDATE: Proposed spec (from https://github.com/iterative/dvc/issues/2509#issuecomment-533019513):

usage: dvc list [-h] [-q | -v] [--recursive [LEVEL]] [--rev REV | --versions]
                url [target [target ...]]

positional arguments:
  url         URL of Git repository with DVC project to download from.
  target      Paths to DVC-files or directories within the repository to list outputs
              for.
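For illustration, a few hypothetical invocations under this proposed spec (none of these flags are implemented yet, and the revision tag below is made up):

# list the root of a remote DVC project (non-recursive by default)
dvc list https://github.com/iterative/example-get-started

# expand the listing a couple of directory levels deep
dvc list --recursive 2 https://github.com/iterative/dataset-registry

# list the outputs of a specific DVC-file at a given revision
dvc list --rev some-tag https://github.com/iterative/example-get-started train.dvc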

UPDATE: Don't forget to update docs AND tab completion scripts when this is implemented.

Labels: c8-full-day, feature request, p1-important

All 45 comments

+1 for dvc list :slightly_smiling_face:

@efiop @jorgeorpinel another option is to do dvc ls and it should behave exactly like a regular ls or aws s3 ls. Show _all_ the files (including hidden data) by specifying a Git url. This way you can control the scope to show (by not going into all directories by default) - also you can see your data in the context (other files) with an option to filter them out.

On the other hand it can be good to show just a list of all DVC outputs. It can be done with dvc ls --recursive --outputs-only for example.

What do you think?

In general I'm +100 for dvc list or something similar :)

Clarification: dvc list should work on dvc repositories. E.g. dvc list https://github.com/iterative/dvc should list scripts/innosetup/dvc.ico, etc.

@efiop can we change the priority to p1 as we discussed because it's part of the get/import epic story?

@shcheklein Sure, forgot to adjust that. Thanks for the heads up!

@shcheklein I'm not sure about the dvc ls name if all we want is to list data artifacts specifically (not all files as you propose).

...you can see your data in the context (other files) with an option to filter them out.

Will most users need this? The problem I think we need to solve is that it's hard to use dvc get and dvc import without a list of available data artifacts. Showing all Git-controlled files may or may not be useful but can be done by existing means.

by specifying a Git url...

What I get from that is: dvc list [url] where url can be a file system path or HTTP/SSH URL to a Git repo containing a DVC project (same as the dvc get and dvc import argument url) – and if omitted tries to default to the local repo (containing pwd).
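For example, assuming the url argument mirrors dvc get and dvc import, all of these forms would be accepted (the local path is just a placeholder):

dvc list                                                    # no url: the DVC project containing pwd
dvc list /path/to/local/dvc/project
dvc list git@github.com:iterative/example-get-started.git   # SSH
dvc list https://github.com/iterative/example-get-started   # HTTPS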

This way you can control the scope to show (by not going into all directories by default)...

I don't see how just the Git repo URL can control the scope to show. Would need a path argument for this I think (which could only accept directories) and/or a --depth option.

dvc ls --recursive --outputs-only

Too complicated for users to remember if the main purpose of this new command is to list artifacts. Maybe --recursive though... (aws s3 ls for example isn't recursive by default, but has that option.)

In summary I think just dvc list or dvc outs is best.
But I agree we would need to consider the case where there are lots of outputs. Another solution (besides path arg and/or --depth, --recursive options) could be default pagination (and possibly interactive scrolling with ⬆️ ⬇️ – like a stdout pipe to less).
aws s3 ls on the other hand takes a simple approach: It has a hard limit of 1000 objects.

Clarification: dvc list should work on dvc repositories. E.g. dvc list https://github.com/iterative/dvc should list scripts/innosetup/dvc.ico, etc.

@efiop yes, exactly. Extended example:

$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_left.bmp
scripts/innosetup/dvc_up.bmp

This makes me think, maybe an output that combines the DVC-files (similar to dvc pipeline list) with their outputs could be most informative (providing some of the context Ivan was looking for). Something like:

$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico   (from scripts/innosetup/dvc.ico.dvc)
scripts/innosetup/dvc_left.bmp  (from scripts/innosetup/dvc_left.bmp.dvc)
scripts/innosetup/dvc_up.bmp    (from scripts/innosetup/dvc_up.bmp.dvc)

UPDATE: Thought of yet another 😓 name for the command above: dvc stage list --outs

@jorgeorpinel I think showing the full project in an ls-way is just more natural as opposed to creating our own type of output. There are a few benefits:

  1. You don't have to use two interfaces to see the full picture - GitHub + dvc list. Instead you just use dvc and see the workspace. And can filter it if it's needed.
  2. It looks like it's beneficial for dvc get to handle regular Git files. Why not? It can be useful.
  3. Like I mentioned - single place in CLI, no need to go to Github to get the full picture.
  4. Just easier to understand since people are familiar with ls and intuitively can expect the result.

The idea is that by default it's not recursive, it's not limited to outputs only. You go down on your own if you need by clarifying path - the same way you do with ls, aws ls, etc.

ls, aws ls, etc - they are all not recursive by default for a reason. In a real case the output can be huge and just won't make sense for you. People tend to go down level by level, or use recursive option when it's exactly clear that it's what they need.

I really don't like making piping and less and complex interfaces part of the tool. You can always use less if it's needed.

@shcheklein I concur only with benefit #1, so I can agree with showing all files but with an optional flag (which can be developed later, with less priority), not as the default behavior. Thus I wouldn't call the command dvc ls – since it's not that similar to GNU ls. I would vote for dvc outs or dvc list.

  2. It looks like it's beneficial for dvc get to handle regular Git files. Why not?

Because it can already be done with git archive (as explained in this SO answer).
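For reference, this is roughly the git archive recipe from that answer; it only works against hosts that enable the upload-archive service (GitHub does not):

# fetch a single Git-tracked file from a remote repo, no clone needed
git archive --remote=ssh://git@server/user/repo.git HEAD path/to/file | tar -x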

3... single place in CLI, no need to go to Github to get the full picture.

The "full picture" could still be achieved from CLI by separately using git ls-tree.

The idea is that by default it's not recursive... You go down on your own if you need by clarifying path...

I can also agree with this: Non-recursive by default sounds easier to implement. So, it would definitely also need an optional dir (path) argument and a --recursive option.

I really don't like making piping and less and complex interfaces part of the tool.

Also agree. I just listed it as an alternative.


Anyway, unless anyone else has relevant comments, I suggest Ivan decides the spec for this new command based on all the comments above, so we can get it onto a dev sprint.

p.s. my (initial) updated spec proposal is:

usage: dvc list [-h] [-q | -v] [--recursive] [--rev REV | --versions]
                url [target [target ...]]

positional arguments:
  url         URL of Git repository with DVC project to download from.
  target      Paths to DVC-files or directories within the repository to list outputs
              for.

I think that this command (dvc list) should be able to get a revision, like --rev tag1.
Its output should list the checksums of the output files on the given revision, as well as the file path and name. This would also allow an external tool to construct the right URL for downloading from a remote storage a certain revision of a datafile.

@jorgeorpinel @dashohoxha Few comments:

  1. outs is a terrible name. output is a super-confusing term when you want to list your input datasets :)
  2. git ls-tree, other tools - that's the point: you will need two or more tools to get the full picture, and you will have to mentally do this exercise of merging lists every time. Also, don't forget that we are dealing with a remote repo - git ls-tree won't even solve this problem unless you git clone the repo first, right?
  3. The same with git archive: GitHub does not support it. And why would I want to know about it at all? Why don't we allow downloading/importing any files?
  4. Regarding ls similarity. In my mind it's what aws s3 ls is doing, or hadoop fs -ls and other similar systems. They utilize common and familiar names, and that's why it's important for it to behave the same way as ls. I don't see any reason why it can't be implemented this way.
  5. Re checksums - we don't want users to construct URLs because this way we expose internal logic (the way we store files) to them, which is not a part of the public protocol (at least for now). I totally understand the problem though. I think generating a URL for you could be part of dvc get and part of dvc.api (see the sketch below).
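A rough sketch of what that could look like on the dvc get side (the --show-url flag here is hypothetical, just to show the idea of DVC resolving the remote-storage URL instead of the user constructing it):

# hypothetical: print the remote-storage URL for an output instead of downloading it
dvc get --show-url https://github.com/iterative/example-get-started model.pkl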

(dvc list) should be able to get a revision, like --rev tag1...

Interesting. Not sure it's central to the command because you're not supposed to have knowledge about the repo when you use dvc list. But yes, it would definitely be a useful advanced feature to have. Maybe also a flag to list available revisions for a specified output (e.g. dvc list --versions model.pkl).

  1. outs is a terrible name...

You're probably right but I'm just using our own terminology 😋 We would be listing outputs, basically. Anyway, I can agree with dvc list or even dvc ls if we go your route and list all files by default.

  3. The same with git archive: GitHub does not support it...

There are special https://raw.githubusercontent.com/user/repository/branch/filename URLs you could construct and wget for GitHub. But yeah I guess it wouldn't hurt that dvc get can download regular files from any Git host. Opened #2515 separately.

Re checksums - we don't want users to construct URLs because this way we expose internal logic... I think generating a URL for you could be part of dvc get and part of dvc.api.

Do you have in mind something like this?

# return a list of data files, along with the corresponding `.dvc` file
dvc get <url> list

# download a datafile
dvc get <url> <path/to/datafile>

# with option '-h, --show-hashes' display the corresponding hashes as well
dvc get <url> list --show-hashes

# with option '-d, --show-download-url' display the corresponding URL
# for direct download from the default remote storage
# (maybe depends on the type of the remote storage)
dvc get <url> list --show-download-url

# limit listing only to certain DVC-files
dvc get <url> list <file1.dvc> <file2.dvc>

# etc.

This would be useful too for an external tool.

However when I think about a dvc list command I have in mind something like this:

  1. It should work both for a local workspace and for a remote (Git) workspace. So, it needs an option like -g, --git-repo <url-of-git-repo>, where <url-of-git-repo> may also be a local path, like /path/to/git/repo/. If this option is not given, the default is the current workspace/repository.

  2. It should get some DVC-files as targets, and list the outputs of only these targets. If no targets are given, then all the DVC-files in the current directory are used as targets. If a given target is a directory, then all the DVC-files inside it are used as targets. If the option -R, --recursive is given as well, then the search for DVC-targets inside the given directory will be recursive.

  3. The output should normally include the name of the DVC-file and the path of the datafile/output. However with the option -h, --show-hashes it should display the corresponding hash as well.

  4. With the option -d, --show-download-url should display the URL for direct download from the default remote storage. (Maybe it should check as well whether this file is available on the remote storage, and return NULL/not-available, if it is not?). Maybe it should get the option -r, --remote to use another storage instead of the default one.

  5. With the option --rev it should show data from a certain revision/tag/branch. If this option is missing, then the current state of the project is used, and this should work with --no-scm as well.

If you have something else in mind, please feel free to disregard these suggestions.
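For illustration, the interface proposed above might be invoked roughly like this (every option shown is part of the proposal; none of it is implemented):

dvc list                                           # outputs of DVC-files in the current directory
dvc list -g https://github.com/user/repo           # same, but for a remote Git repo
dvc list -R data/                                  # look for DVC-files under data/ recursively
dvc list --show-hashes train.dvc                   # include the output hashes
dvc list --rev v1.0 --show-download-url train.dvc  # remote-storage URLs at a given revision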

# ...list of data files, along with the corresponding `.dvc` file
dvc get <url> list

This would actually need to be dvc get list [url] to adhere to DVC syntax but no, we're talking about a new, separate command (dvc list), not a new subcommand for dvc get. (It also affects dvc import, for example.)

Also I think we've established we want the list to be of all the regular files along with "data files" (outputs and their DVC-files), not just the latter.

dvc get <url> list --show-hashes
...
dvc get <url> list --show-download-url

Please open a separate issue to decide on adding new options to dvc get @dashohoxha.

# limit listing only to certain DVC-files
dvc get <url> list <file1.dvc> <file2.dvc>

This could be useful actually. Optionally specifying target DVC-files to the dvc list command could be an alternative for limiting the number of results.

OK I hid some resolved comments and updated the proposed command spec in https://github.com/iterative/dvc/issues/2509#issuecomment-533019513 based on all the good feedback. Does it look good to get started on?

@jorgeorpinel sounds good to me, let's move the spec to the initial message?

Also, it would be great to specify how the output looks in different cases. Again, I suggest it behave the same way as if the Git repo were a bucket you are listing files in with aws s3 ls. Maybe it would be helpful to come up with a few different outputs and compare them.

OK spec moved to initial issue comment.

specify how the output looks in different cases...

I also think copying the aws s3 ls output is a good starting point.

Hypothetical example outputs

Using file system paths as url:

See full example outputs in dvc list proposal: output examples Gist.

$ cd ~
$ dvc list  # Default url = .
ERROR: The current working directory doesn't seem to be part of a DVC project.

$ git clone git@github.com:iterative/example-get-started.git
$ cd example-get-started
$ dvc list .
INFO: Listing LOCAL DVC project files, directories, and data at
      /home/uname/example-get-started/

 17B 2019-09-20 .gitignore
6.0K 2019-09-20 README.md
...
339B 2019-09-20 train.dvc
5.8M            └ out: (model.pkl)
...

$ dvc pull featurize.dvc
$ dvc list featurize.dvc  # With target DVC-file
...
INFO: Limiting list to data outputs from featurize.dvc stage.

367B 2019-09-20 featurize.dvc
2.7M            └ out: data/features/test.pkl
 11M            └ out: data/features/train.pkl

NOTE: The latter case brings up several questions about how to display outputs located in a different dir than the target DVC-file, especially vs. using that location as the target instead. In this case I listed them even without --recursive.

Note that the .dvc/ dir is omitted from the output. Also, the dates above come from the file system, same as ls. In the following examples, they come from git history.

Using network Git urls:

See full example outputs in dvc list proposal: output examples Gist.

$ dvc list git@github.com:iterative/example-get-started.git  # SSH URL
 17B 2019-09-03 .gitignore
...
339B 2019-09-03 train.dvc
5.8M            └ out: model.pkl

$ dvc list https://github.com/iterative/dataset-registry  # HTTP URL
1.9K 2019-08-27 README.md
160B 2019-08-27 get-started/
128B 2019-08-27 tutorial/

$ dvc list --recursive https://github.com/iterative/dataset-registry tutorial  # Recursive inside target dir
...
INFO: Expanding list recursively.

 29B 2019-08-29 tutorial/nlp/.gitignore
178B 2019-08-29 tutorial/nlp/Posts.xml.zip.dvc
 10M            └ out: tutorial/nlp/Posts.xml.zip
177B 2019-08-29 tutorial/nlp/pipeline.zip.dvc
4.6K            └ out: tutorial/nlp/pipeline.zip
...

NOTE: Another question is whether outputs having no date is OK (just associated to the date of their DVC-files) or whether we should also get that from the default remote. Or maybe we don't need dates at all...

Going through these made me realize this command can easily get very complicated, so all feedback is welcome to try and simplify it as much as possible to a point where it still resolves the main problem (listing project outputs in order to know what's available for dvc get and dvc import), but doesn't explode in complexity.

@jorgeorpinel this is great! can you make it as a gist or a commit - so that we can leave some comments line by line?

@jorgeorpinel I can't comment on a specific line on them. I wonder if there is a better tool for this? Create a temporary PR with these files to dvc.org?

As a first step, we could simply print lines with all outputs. E.g.

$ dvc list https://github.com/iterative/dvc
scripts/innosetup/dvc.ico
scripts/innosetup/dvc_up.bmp
scripts/innosetup/dvc_left.bmp

and then move on to polishing.

We should plan the polishing in advance a bit so that we won't be breaking the released stuff. Defaults, available options, etc.

Btw, when we just print outputs this way, wouldn't it be the same as dvc pipeline show --outs 🤔

By default, the output should be simple and consumable by tools. It will give us an ability to chain commands like

dvc list https://github.com/iterative/dvc | grep '\.pth$' | xargs dvc get https://github.com/iterative/dvc

(yeah, I'm pretty sure the current dvc get does not support this yet. EDITED: actually it does through xargs.)

For the same reason headers like INFO: Expanding list recursively. are not needed.

Nice-looking lists with hierarchies like └ out: tutorial/nlp/pipeline.zip should be supported under a --human-readable option. And this can be postponed.

@shcheklein good point re dvc pipeline show --outs. But this command does not support remotes (like dvc pipeline show https://github.com/iterative/dvc --outs).

Also, based on my understanding, pipeline show does some additional stuff - it sorts outputs by their order in the pipeline (the last output is always the last output in the pipeline). Please correct me if that is not correct.

But this command does not support remotes

@dmpetrov yeah, it was just a note that we kinda have a command that outputs DVC-specific info (as opposed to general ls-like output). I have no idea how it can affect any decisions in this ticket. Was sharing in case someone else has something to say about it.

the output should be simple and consumable by tools

if we follow unix tools - a lot of them (probably most of them?) are not made to be script-consumable by default. You would need to run ls -1 to achieve a simple one-item-per-line format:

Force output to be one entry per line.  This is the default when output is not to a terminal.

aws s3 ls is only somewhat consumable.
find is not consumable as far as I remember.

Also, I believe it's a wider topic (similar to #2498). Ideally we should have some agreed-upon approach we are taking, something like this: https://devcenter.heroku.com/articles/cli-style-guide#human-readable-output-vs-machine-readable-output And this approach should be (reasonably) consistent across all commands.

should be supported under --human-readable

See above. I actually don't remember off the top of my head any of the tools that take this approach. The only common convention is -h, but it does not change the layout of the output, only the details.

And this can be postponed.

agreed that we can postpone adding nice details (file sizes, DVC-file per output, etc). But I think it's very important to decide on the default behavior:

  1. Does it show _all_ files (and you need to specify a flag to show outputs only) vs only outputs? I've outlined the reasons I think we should be taking an ls-like approach and show all files, not DVC-specific only.
  2. Does it go down recursively, or do you need to provide a flag -r? Similar here: ls or aws ls vs a DVC-specific interface focusing on all outputs. Again, I don't see a reason to do this - everyone is familiar with ls or aws s3 ls behavior. It'll be way easier to explain and easier to alter the behavior with an option.

To answer these behavior and UI questions we need to return back to the original motivation:

"browsing" external DVC projects on Git hosting before using dvc get or dvc import. Looking at the Git repo doesn't show the artifacts because they're only referenced in DVC-files (which can be found anywhere), not tracked by Git.

First of all, I understand this pain of not seeing data files in my Git repo.
To my mind, a list of outputs will answer the question. The list of the other Git files (like README.rst) does not seem relevant and even introduces noise to the answer. Do we need to go recursively - hm.. probably yes, since most of the data files are hidden under input, data, models directories. We can avoid the need for multiple dvc list calls. More than that, we have to go recursively (technically), because some DVC-file from a subfolder can generate outputs into the root of the repo.

New questions that I just got:

  1. Do we have any other scenarios/questions in mind, or is this the only one? I have a feeling that @shcheklein has some.
  2. Is dvc list the right command name? Why not dvc find or dvc whereis? Do we need both dvc list (show all files) & dvc whereis? (I don't like this idea, but still.)
  3. Should it show Git-committed outputs (including metrics files)? Is it important (not nice-to-have, but important) to mention in the output (by default) that a certain file is committed to Git? If so, how to show it (a programmatically consumable list of files might not be enough)?

Some details regarding the output formatting... Yes, in some cases it makes sense to give preference to human-readable commands. Like your Heroku example where a complex/hierarchical output was produced.
But in this case, we are talking about a list of files, which is human-readable already. I see no benefit in introducing additional complexity (by default). If we decide that the type of the file (Git/remote/metrics) is important to see by default, then it can be a reason to use a more complicated format. So, I'm not against this, I just see no value.

if we follow unix tools - a lot of them (probably most of them?) are not made to be script-consumable by default.

I don't agree with this.

I actually don't remember off the top of my head any of the tools that take this approach.

aws s3 ls --human-readable, du -h *

aws s3 ls is only somewhat consumable.
find is not consumable as far as I remember.

aws s3 ls is actually not the best example. It has quite a terrible UX.
find output is very consumable. As well as ls (it does the hack internally) and most of the unix commands.

if we follow unix tools - a lot of them (probably most of them?) are not made to be script-consumable by default

Even if it is so, most of the unix tools have an option to produce scriptable output. For example git status and git status --porcelain.
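For instance (both forms exist in Git today):

git status              # human-oriented; wording and layout may change between versions
git status --porcelain  # stable, terse, one entry per line - made for scripts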

Regarding the scriptable output: --porcelain, --json, etc. - that is exactly the approach I was mentioning. The point is that most of the tools are not made to be scriptable by _default_. That's exactly the point of the link I shared:

Terse, machine-readable output formats can also be useful but shouldn’t get in the way of making beautiful CLI output. When needed, commands should offer a --json and/or a --terse flag when valuable to allow users to easily parse and script the CLI.

Btw, Git's --porcelain documentation explains very well why it's better to provide an option for script-consumable output:

Give the output in an easy-to-parse format for scripts. This is similar to the short output, but will remain stable across Git versions and regardless of user configuration. See below for details.

_stable_ API - It'll be way easier to maintain and evolve the output later if we use an option.

So: ls is doing some magic and by default does not sacrifice the human UI; heroku and git have an option; find is somewhat consumable, but usually you should be providing -print0; aws is somewhat consumable, in the Heroku guide's sense.
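E.g. the find case, where the safe script-consumable form needs an extra flag:

# newline-separated by default; -print0 + xargs -0 is the robust form for odd file names
find . -name '*.dvc' -print0 | xargs -0 -n1 echo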

I wonder if there is a better tool for this? Create a temporary PR with these files to dvc.org?

@shcheklein good idea. Here's one: https://github.com/iterative/dvc/pull/2569/files

For the same reason headers like INFO: Expanding list recursively. are not needed...

Right. INFO/WARNING etc. logs/messages could be included in the -v (verbose) output though.

agreed that we can postpone adding nice details (file sizes, DVC-file per output, etc). But I think it's very important to decide on the default behavior...

We should definitely prioritize the features of this command but I also think a good initial design can save lots of time in the future. Let's not go overkill though because once we're actually implementing I'm sure a lot of new questions will come up.

As for the default behavior, I think it should list all files (Ivan convinced me here: https://github.com/iterative/dvc/issues/2509#issuecomment-532781756) and with level 2 or 3 recursion to catch _most_ buried DVC-files, but warn when there are DVC-files in even deeper levels. It also needs a length limit like 100 files though (especially if we decide full recursion as default) – not sure how to list more if the limit is reached.

DVC-file from a subfolder can generate outputs into the root of the repo.

Yes... This is confusing, and it's hard to figure out how to output this in dvc list if we show the relationship between DVC-files and output files, so perhaps that's one of the features with less priority.

Should it show Git committed outputs?

Good Q. I think it would be important to mark any stage output that is Git-tracked in the output of dvc list.

Discussed briefly with @dmpetrov and I think we agreed on no recursion, all files mode by default. Then, -o and -r options to filter the list as you need. @dmpetrov could you chime in, please?

I'd prefer ls-files to match git more closely

fyi, for git ls-files:

  • recursively all files: (no args)
  • current dir: .
  • recursively all python files: '*.py'
  • tracked python files in current dir: *.py (assumes shell expansion, git will exclude untracked file arguments)
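For concreteness, the same invocations as shell commands (run inside a local clone):

git ls-files            # all tracked files, recursively
git ls-files .          # tracked files under the current directory
git ls-files '*.py'     # quoted: git matches the pattern itself, repo-wide
git ls-files *.py       # unquoted: the shell expands it; git drops untracked matches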

@casperdcl interesting point.

In what cases do you use git ls-files vs regular ls? I think they serve a bit different purposes, and to me intuitively dvc list or dvc ls (the name does not matter that much) serves the role of aws s3 ls or ls. Chances are the default behavior of ls-files is done this way to augment ls.

git ls-files sounds like an advanced Git-specific command. And if we go that path, we should probably indeed do a recursive list of outputs (files that are under DVC control) - but that will create some disconnect with dvc get and dvc import (which now serve the roles of wget, cp, aws s3 cp, etc. and can deal with regular files). Also, not sure how it will work for files inside outputs (when I want to see some specific images inside a specific output).

  • if there are many files (some untracked and some in .gitignore) and I only want to ls the tracked ones
  • immediately understand exactly what git archive would include
  • -x and -X patterns for systems without grep
  • -t flag: also marks file status

@casperdcl so the cases you are describing are indeed very different from ls. They are specific to the Git workflow. In the case of this new command, my feeling is that it's better to compare it with ls or even aws s3 ls, since we kind of take away the ability to access buckets directly.

I'd argue that dvc ls-files listing only dvc-tracked files would be very valuable. I don't see much value in implementing something simpler, which would be a minor modification to ls.

So, that's the point: there is no ls for us and a regular one does not work. It's exactly the same reason why aws s3 ls exists. Regular ls won't work for a remote Git repo, right? Maybe that's where some misunderstanding comes from. The idea is that it should enable some discoverability for DVC projects, and you are supposed to use it like:

dvc list https://github.com/iterative/example-get-started 

and then:

dvc get https://github.com/iterative/example-get-started README.md

or

dvc list https://github.com/iterative/example-get-started data
dvc import https://github.com/iterative/example-get-started data/data.xml

Ah right. Remote listing pre-download is different.

I was talking about locally listing local files.

I guess they'll have different implementations, but it would be nice to have them both called the same command (I'd prefer ls-files, but fine with ls or list too).

I think dvc list . will work fine in this case. It'll show all files tracked by Git + an expansion of DVC-files to show outputs. So it is a DVC-specific ls-files in this sense. I would still prefer defaults like ls - no recursion, show all files - Git-tracked + DVC outs.

dvc list . is an edge-case for this command. The main usage is the remote one and I would really like to see just a list of files the same way I do with ls. (+some options to refine and filter if needed).

I'm going to work on the issue

Reopening since there are some details that we need to follow up on.

@efiop but we have #3381 for that. Maybe OK to close this one?

@jorgeorpinel Ok, let's close then :slightly_smiling_face: Thanks for the heads up!
