dvc: provide granularity for commands that could target specific tracked files

Created on 2 Sep 2019 · 20 comments · Source: iterative/dvc

This can be seen as revisiting feature request #1026

UPDATE: Please scroll down to https://github.com/iterative/dvc/issues/2458#issuecomment-566649020 for most recent, summarized requirement.
Here is the original context also (still relevant):

There are different scenarios in which _being able to manipulate files granularly, independently of how they were committed/pushed to DVC_, could be useful. The problem with using dvc add -R now is that it can generate lots of .dvc files. But what if a directory could be added without -R (producing a single DVC-file) and yet other commands (lock, update, get, etc.) could be applied to individual files inside the added directory tree?

Example (from https://github.com/iterative/dataset-registry/commit/7476a858f6200864b5755863c729bff41d0fb045)

Project 1:

$ tree
.
└── tutorial
    └── nlp
        ├── Posts.xml.zip
        └── pipeline.zip
$ dvc add tutorial
...
$ dvc push
...

Project 2:

$ dvc import {project-1-url} tutorial/nlp/pipeline.zip
...
$ tree
.
├── tutorial
│   └── nlp
│       └── pipeline.zip
└── tutorial.dvc

Not sure about where the .dvc would have to be placed in this example though.

And also this is how Git works, I believe. Files are tracked individually (in fact it doesn't even recognize empty dirs).

feature request p1-important product

jorgeorpinel

👍2

All 20 comments

@jorgeorpinel so, you propose storing metadata for individual files inside the stage file, instead of storing metadata for the whole directory?

pared on 2 Sep 2019

I haven't thought about the implementation details, but that sounds reasonable... unless there are thousands of files in there (which is possible): this could make it hard for Git to handle the DVC-file produced.

I would pull a "how does Git do it?" card here ♠️

jorgeorpinel on 2 Sep 2019

A dir as a single entity is a desirable thing as I see it.

Suor on 2 Sep 2019

👍1

Well, I think that is somewhat dangerous ground:

big directory == a lot of metadata entries

which can cause problems when loading it. Also, a big metadata file could be problematic for Git to handle.
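
To make the size concern concrete, here is a rough sketch (not DVC code) that builds one metadata entry per file and measures how large the serialized result gets. The entry layout, the file names, and the helper names per_file_metadata and serialized_size are assumptions for illustration only; real DVC-files may be laid out differently.

```python
import hashlib
import json


def per_file_metadata(n_files):
    """Build one hypothetical metadata entry per file in a directory."""
    entries = []
    for i in range(n_files):
        # Made-up file names; zero-padded so every entry has the same length.
        relpath = f"images/{i:07d}.jpg"
        # Placeholder checksum derived from the name, not from file contents.
        md5 = hashlib.md5(relpath.encode("utf-8")).hexdigest()
        entries.append({"md5": md5, "path": relpath})
    return entries


def serialized_size(entries):
    """Size in bytes of the entries when dumped as JSON."""
    return len(json.dumps(entries).encode("utf-8"))


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        size = serialized_size(per_file_metadata(n))
        print(f"{n} files: {size} bytes of metadata")
```

The metadata grows linearly with the number of files, so a directory with many files would put a correspondingly large file under Git's control if every entry lived in the DVC-file itself.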

pared on 2 Sep 2019

👍2

@jorgeorpinel I agree with @Suor on this. In most cases, given the way ML projects are organized, the current default behavior is the desirable choice. And we have -R in case someone needs to override it.

It might make sense though to implement -R in a different way that allows for more granular control. Maybe repurpose the ticket for some research on this?

shcheklein on 2 Sep 2019

👍1

What kind of granularity (other than the already existing -R) are you guys talking about?

efiop on 2 Sep 2019

@efiop It reminds me of the discussion we had with some other users and with the team about partial checkouts, for example. Basically, there is a feeling that people are using -R in some cases when dvc add and the single DVC-file it creates are not granular enough. But on the other hand, -R is cumbersome with all the DVC-files it creates, which is another painful issue.

shcheklein on 3 Sep 2019

OK, I repurposed this issue for exploring how to provide more granular control of files in added dirs. (No longer related to -R, I think.)

@efiop

What kind of granularity... are you guys talking about?

Partial syncs like Ivan mentioned. Please see the description of this issue for an example.

jorgeorpinel on 3 Sep 2019

I would pull a "how does Git do it?" card here

My impression is that the analogy between Git and DVC is not very good, because they try to do different things (code management vs. data management). So we should not strive to imitate Git in everything.

dashohoxha on 3 Sep 2019

👍1

I did not say we should imitate Git on everything. What is your suggestion?

jorgeorpinel on 3 Sep 2019

@dashohoxha what is your suggestion?

I don't have any suggestion. I cannot even understand the problem. But if people are complaining about it, maybe there is some problem.

dashohoxha on 4 Sep 2019

@shcheklein @jorgeorpinel Thanks for the explanation guys. Yeah, I remember those users that were asking about checking out or pulling only particular files out of a directory. One of the big concerns at that time was that dvc was pretty slow with directories, which has improved significantly since then (still a pretty long way to go though), so maybe it is acceptable now. But I can still see cases where someone would only want to pull one file out of the directory, and it seems to me that we could support that for most of the operations; we just need to make a few improvements to our logic.

For example, the first step would be to support using paths instead of DVC-files as arguments. E.g. dvc pull dir would be the same as dvc pull dir.dvc. After that, we will need to adjust the commands to understand subpaths, so that dvc pull dir/file would only download the cache for file and check it out.

So to summarize, in terms of the current architecture, granular operations are possible with dvc add dir (without -R), but we need to adjust particular dvc commands to understand granular arguments.
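
A minimal sketch of the subpath idea, assuming the metadata of an added directory is available as a list of per-file entries, each with a relative path and a checksum. The manifest contents, the placeholder checksums, and the select_entries helper are hypothetical and are not DVC's actual internals.

```python
from pathlib import PurePosixPath

# Hypothetical manifest for a directory added with `dvc add tutorial`:
# one entry per file, paths relative to the tracked directory.
MANIFEST = [
    {"relpath": "nlp/Posts.xml.zip", "md5": "aaa"},  # placeholder checksum
    {"relpath": "nlp/pipeline.zip", "md5": "bbb"},   # placeholder checksum
    {"relpath": "README.md", "md5": "ccc"},          # placeholder checksum
]


def select_entries(tracked_dir, target, manifest):
    """Return the manifest entries that `target` refers to.

    `target` may be the tracked directory, its DVC-file (dir.dvc),
    a subdirectory, or a single file inside the tracked directory.
    """
    tracked = PurePosixPath(tracked_dir)
    wanted = PurePosixPath(target)

    # `dir` and `dir.dvc` both mean "the whole directory".
    if wanted == tracked or str(wanted) == f"{tracked}.dvc":
        return list(manifest)

    try:
        sub = wanted.relative_to(tracked)
    except ValueError:
        return []  # target is not inside the tracked directory

    selected = []
    for entry in manifest:
        rel = PurePosixPath(entry["relpath"])
        # Keep the entry if it is the target itself or lives under it.
        if rel == sub or sub in rel.parents:
            selected.append(entry)
    return selected


if __name__ == "__main__":
    for t in ("tutorial", "tutorial/nlp", "tutorial/nlp/pipeline.zip"):
        print(t, "->", [e["relpath"] for e in select_entries("tutorial", t, MANIFEST)])
```

Under these assumptions, a command given tutorial/nlp/pipeline.zip would only need to fetch and check out the single selected entry, while tutorial or tutorial.dvc would still select everything.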

efiop on 4 Sep 2019

👍1

I like the idea, and I'm glad to hear it doesn't seem too difficult to enable this. Using existing <stage>.dvc files as command targets (as is done now) could also be left as an option (for backward compatibility).

jorgeorpinel on 5 Sep 2019

👍1

@jorgeorpinel sure, we will support both. It is also useful to have stage.dvc support when you have more than 1 output in it.

efiop on 5 Sep 2019

👍1

    ... 
    I sometimes want to have a peek at the data on the remote without downloading the entire dataset.
   ...

shcheklein on 16 Oct 2019

👍1

From Dmitry (on Slack):

There are two general scenarios:

1) - [x] (Higher priority) Make dvc pull <file_name> work without specifying the DVC-file (where file_name is tracked) – similar to dvc get <file_name> I guess.
2) - [x] Get files from tracked data directory e.g. dvc pull data-dir/file.txt. (Requires 1.)

[x] It should work for any command but these have a higher priority: pull/push/fetch/get/status.

Further rationale/motivation:
There have been quite a lot of requests about this problem: someone from ScribbleData asked personally, there was a request on Discord a couple of months ago, and I recently asked for this for the dvc get command.
Also, this problem blocks automation scenarios like CD4ML: you cannot programmatically push only the changed files to another remote, since dvc status returns changed file names, which cannot be used as input for dvc push, which requires DVC-files.
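
The automation gap can be illustrated with a rough sketch of the mapping a script would have to do by hand today: turning changed file names into the DVC-files that track them. The dvc_files_for helper, the outputs mapping, and the example paths are all hypothetical; this is not how DVC represents its status output.

```python
from pathlib import PurePosixPath


def dvc_files_for(changed_paths, outputs):
    """Map changed file paths to the DVC-files that track them.

    `outputs` maps each tracked output path to its DVC-file
    (hypothetical data a script would have to collect itself).
    Returns DVC-files in first-seen order, without duplicates.
    """
    needed = []
    for path in changed_paths:
        p = PurePosixPath(path)
        # Tracked outputs that are the path itself or one of its parents.
        owners = [
            out for out in outputs
            if p == PurePosixPath(out) or PurePosixPath(out) in p.parents
        ]
        if not owners:
            continue  # not tracked by any known DVC-file
        # Prefer the deepest (most specific) tracked output.
        owner = max(owners, key=lambda out: len(PurePosixPath(out).parts))
        dvc_file = outputs[owner]
        if dvc_file not in needed:
            needed.append(dvc_file)
    return needed


if __name__ == "__main__":
    outputs = {"data": "data.dvc", "models/model.pkl": "models/model.pkl.dvc"}
    changed = ["data/raw/a.csv", "data/raw/b.csv", "models/model.pkl"]
    print(dvc_files_for(changed, outputs))
```

Note that even with this mapping, pushing data.dvc would cover the whole data directory rather than only the changed files, which is the limitation the two scenarios above are meant to remove.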

jorgeorpinel on 17 Dec 2019

👍1

Next step: support for get and import.

efiop on 28 Jan 2020

2 questions!

1) Does dvc list support this granularity? I.e. does it list directory contents? I think so but just double checking
2) Did we ever update the docs to explicitly explain granularity in all the affected commands? I don't even remember which commands support this already 🙁

jorgeorpinel on 15 May 2020

UPDATE: Yeah we have iterative/dvc.org/issues/886, OK I'll give it some priority

jorgeorpinel on 15 May 2020

👍1

does dvc list support this granularity? I.e. does it list directory contents? I think so but just double checking

Nope :) It will soon though :)

Yeah, that doc is on me, I was planning to tackle it when I have time.

efiop on 15 May 2020
