dvc: provide granularity for commands that could target specific tracked files

Created on 2 Sep 2019 · 20 Comments · Source: iterative/dvc

This can be seen as revisiting feature request #1026

UPDATE: Please scroll down to https://github.com/iterative/dvc/issues/2458#issuecomment-566649020 for the most recent, summarized requirement.
Here is the original context also (still relevant):


There are different scenarios in which _being able to manipulate files granularly, independently of how they were committed/pushed to DVC,_ could be useful. The problem with using dvc add -R now is that it can generate lots of .dvc files. But what if a directory could be added without -R (producing a single DVC-file) and yet other commands (lock, update, get, etc.) could be applied to individual files inside the added directory tree?

Example (from https://github.com/iterative/dataset-registry/commit/7476a858f6200864b5755863c729bff41d0fb045)

Project 1:

$ tree
.
└── tutorial
    └── nlp
        ├── Posts.xml.zip
        └── pipeline.zip
$ dvc add tutorial
...
$ dvc push
...

Project 2:

$ dvc import {project-1-url} tutorial/nlp/pipeline.zip
...
$ tree
.
├── tutorial
│   └── nlp
│       └── pipeline.zip
└── tutorial.dvc

Not sure about where the .dvc would have to be placed in this example though.

And also this is how Git works, I believe. Files are tracked individually (in fact it doesn't even recognize empty dirs).

feature request p1-important product

All 20 comments

@jorgeorpinel so, you propose storing metadata for particular files inside the stage file, instead of storing metadata for the whole directory?

I didn't think about the implementation details but that sounds reasonable... unless there are thousands of files in there (which is possible): this could make it hard for Git to handle the DVC-file produced.

I would pull a "how does Git do it?" card here ♠️

A dir as a single entity is a desirable thing as I see it.

Well I think that is a bit dangerous ground:

big directory == a lot of metadata entries

which can cause problems when loading it; also, a big metadata file could be problematic for Git to handle.
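To make the size concern concrete, here is a rough sketch of how per-file metadata grows with directory size. The entry layout (a checksum plus a relative path, serialized as JSON) and the file names are assumptions chosen for illustration; this is not taken from DVC's code.

```python
import hashlib
import json


def fake_entries(n):
    """Build n hypothetical per-file metadata entries (checksum + relative path)."""
    return [
        {
            "md5": hashlib.md5(str(i).encode()).hexdigest(),
            "relpath": f"images/{i:07d}.jpg",
        }
        for i in range(n)
    ]


def serialized_size(entries):
    """Size in bytes of the entries when dumped as a single JSON document."""
    return len(json.dumps(entries).encode("utf-8"))


if __name__ == "__main__":
    # The metadata grows linearly with the number of files in the directory.
    for n in (10, 1_000, 100_000):
        print(f"{n:>7} files -> {serialized_size(fake_entries(n))} bytes of metadata")
```

With this layout every tracked file adds a fixed-size record, so a directory with hundreds of thousands of files yields a metadata file of several megabytes, which is the kind of file Git handles poorly.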

@jorgeorpinel I agree with @Suor on this. In most cases, given the way ML projects are organized, the current default behavior is the desirable choice. And we have -R in case someone needs to override it.

It might make sense though to implement -R in a different way that allows for more granular control. Maybe repurpose the ticket for some research on this?

What kind of granularity (other than the already existing -R) are you guys talking about?

@efiop It reminds me of the discussion we had with some other users and with the team about partial checkouts, for example. Basically, there is a feeling that people are using -R in some cases when dvc add and the single DVC-file it creates are not granular enough. But on the other hand, -R is cumbersome with all the DVC-files it creates, which is another painful issue.

OK, I repurposed this issue to explore providing more granular control of files in added dirs. (No longer related to -R, I think.)

@efiop

What kind of granularity... are you guys talking about?

Partial syncs like Ivan mentioned. Please see the description of this issue for an example.

I would pull a "how does Git do it?" card here

My impression is that the analogy between Git and DVC is not very good, because they try to do different things (code management vs. data management). So, we should not strive to imitate Git on everything.

I did not say we should imitate Git on everything. What is your suggestion?

@dashohoxha what is your suggestion?

I don't have any suggestion. I cannot even understand the problem. But if people are complaining about it, maybe there is some problem.

@shcheklein @jorgeorpinel Thanks for the explanation guys. Yeah, I remember those users that were asking about checking out or pulling only a particular number of files out of a directory. One of the big concerns at that time was that dvc was pretty slow with directories, which has been significantly improved since then (still a pretty long way to go though), so maybe it is acceptable now. But I can still see cases where someone would only want to pull one file out of the directory, and it seems to me that we could support that for most of the operations; we just need to make a few improvements to our logic.

For example, the first step would be to support using paths instead of DVC-files as arguments. E.g. dvc pull dir would be the same as dvc pull dir.dvc. After that, we will need to adjust the commands to understand subpaths, so that dvc pull dir/file would only download the cache for file and check it out.

So to summarize, in terms of the current architecture, granular operations are possible with dvc add dir (without -R), but we need to adjust particular dvc commands to understand granular arguments.
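The two steps described above (accepting a path instead of a DVC-file, then accepting a subpath inside a tracked directory) can be sketched as a small target resolver. This is only an illustration under stated assumptions: `resolve_target` and the `tracked_outputs` mapping are hypothetical names, not part of DVC's API.

```python
from pathlib import PurePosixPath


def resolve_target(target, tracked_outputs):
    """Return (dvc_file, subpath) for a command target, or None if untracked.

    tracked_outputs is a hypothetical mapping from an output path
    (e.g. "dir") to the DVC-file that tracks it (e.g. "dir.dvc").
    subpath is None when the target is the whole output.
    """
    # A DVC-file passed directly keeps working, for backward compatibility.
    if target in tracked_outputs.values():
        return target, None

    path = PurePosixPath(target)
    # Check the path itself, then each parent, for a tracked output.
    for candidate in (path, *path.parents):
        dvc_file = tracked_outputs.get(str(candidate))
        if dvc_file is not None:
            sub = str(path.relative_to(candidate))
            return dvc_file, (None if sub == "." else sub)
    return None


if __name__ == "__main__":
    outputs = {"dir": "dir.dvc"}  # assumed state after `dvc add dir`
    for target in ("dir.dvc", "dir", "dir/file", "elsewhere/file"):
        print(target, "->", resolve_target(target, outputs))
```

In this sketch, `dir` and `dir.dvc` resolve to the same DVC-file, while `dir/file` additionally carries the subpath that a command would use to limit which cache entries it downloads and checks out.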

Related https://github.com/iterative/dvc/issues/2180

I like the idea, and I'm glad to hear it doesn't seem too difficult to enable this. Using existing <stage>.dvc files as command targets (as is done now) could also be left as an option (for backward compatibility).

@jorgeorpinel sure, we will support both. It is also useful to have stage.dvc support when you have more than 1 output in it.

Related: https://discordapp.com/channels/485586884165107732/485596304961962003/634088447858180108:

    ...
    I sometimes want to have a peek at the data on the remote without downloading the entire dataset.
    ...

From Dmitry (on Slack):

There are two general scenarios:

1) - [x] (Higher priority) Make dvc pull <file_name> work without specifying the DVC-file (where file_name is tracked), similar to dvc get <file_name> I guess.
2) - [x] Get files from tracked data directory e.g. dvc pull data-dir/file.txt. (Requires 1.)

  • [x] It should work for any command but these have a higher priority: pull/push/fetch/get/status.

Further rationale/motivation:
There are quite a lot of requests about this problem: a guy from ScribbleData asked personally, there was a request on Discord a couple of months ago, and I asked for this recently for the dvc get command.
Also, this problem blocks automation scenarios like CD4ML: you cannot programmatically push only changed files to another remote, since dvc status returns changed file names, which cannot be used as input to dvc push, which requires DVC-files.
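Until push accepts file names directly, an automation script would have to translate the changed file names back into the DVC-files that track them. A minimal sketch of that workaround follows; the helper names, the `tracked_outputs` mapping, and the remote name `backup` are all hypothetical, and the script only builds the command line rather than running it.

```python
def owning_dvc_files(changed_paths, tracked_outputs):
    """Map changed data paths to the unique DVC-files that track them.

    tracked_outputs is a hypothetical mapping of output path -> DVC-file.
    Order of first appearance is preserved; untracked paths are ignored.
    """
    targets = []
    for path in changed_paths:
        for output, dvc_file in tracked_outputs.items():
            # A path belongs to an output if it is the output itself
            # or lives somewhere below it.
            inside = path == output or path.startswith(output.rstrip("/") + "/")
            if inside and dvc_file not in targets:
                targets.append(dvc_file)
    return targets


def build_push_command(changed_paths, tracked_outputs, remote):
    """Build a `dvc push` argument list that targets DVC-files only."""
    return ["dvc", "push", "-r", remote, *owning_dvc_files(changed_paths, tracked_outputs)]


if __name__ == "__main__":
    outputs = {"data": "data.dvc", "model.pkl": "model.pkl.dvc"}  # assumed project state
    changed = ["data/a.csv", "data/b.csv", "model.pkl"]  # e.g. names reported as changed
    print(" ".join(build_push_command(changed, outputs, "backup")))
```

Note the cost of this workaround: a change to one file under `data` still pushes everything tracked by `data.dvc`, which is exactly the lack of granularity this issue describes.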

Next step: support for get and import.

2 questions!

  1. does dvc list support this granularity? I.e. does it list directory contents? I think so but just double checking
  2. Did we ever update the docs to explicitly explain granularity in all the affected commands? I don't even remember which commands support this already 🙁

UPDATE: Yeah we have iterative/dvc.org/issues/886, OK I'll give it some priority

does dvc list support this granularity? I.e. does it list directory contents? I think so but just double checking

Nope :) It will soon though :)

Yeah, that doc is on me, I was planning to tackle it when I have time.
