This can be seen as revisiting feature request #1026.
UPDATE: Please scroll down to https://github.com/iterative/dvc/issues/2458#issuecomment-566649020 for the most recent, summarized requirement.
Here is the original context as well (still relevant):
There are different scenarios in which _being able to manipulate files granularly, independently of how they were committed/pushed to DVC_, could be useful. The problem with using `dvc add -R` now is that it can generate lots of `.dvc` files. But what if a directory could be added without `-R` (producing a single DVC-file), and yet other commands (`lock`, `update`, `get`, etc.) could be applied to individual files inside the added directory tree?
Example (from https://github.com/iterative/dataset-registry/commit/7476a858f6200864b5755863c729bff41d0fb045)
Project 1:
$ tree
.
└── tutorial
    └── nlp
        ├── Posts.xml.zip
        └── pipeline.zip
$ dvc add tutorial
...
$ dvc push
...
Project 2:
$ dvc import {project-1-url} tutorial/nlp/pipeline.zip
...
$ tree
.
├── tutorial
│   └── nlp
│       └── pipeline.zip
└── tutorial.dvc
Not sure where the `.dvc` file would have to be placed in this example, though.
Also, this is how Git works, I believe: files are tracked individually (in fact, it doesn't even recognize empty directories).
@jorgeorpinel so, you propose storing particular files' metadata inside the stage file, instead of storing metadata for the whole directory?
I haven't thought about the implementation details, but that sounds reasonable... unless there are thousands of files in there (which is possible): that could make the produced DVC-file hard for Git to handle.
I would pull a "how does Git do it?" card here ⚠️
A dir as a single entity is a desirable thing, as I see it.
Well, I think that is a bit of a dangerous ground: big directory == a lot of metadata entries, which can cause problems when loading it; also, a big metadata file could be problematic for Git to handle.
@jorgeorpinel I agree with @Suor on this. In most cases, given the way ML projects are organized, the current default behavior is the desirable choice. And we have `-R` in case someone needs to override it.
It might make sense, though, to implement `-R` in a different way that allows for more granular control - maybe repurpose the ticket for some research on this?
What kind of granularity (other than the already existing -R) are you guys talking about?
@efiop It reminds me of the same discussion we had with some other users and with the team about partial checkouts, for example. Basically, there is a feeling that people are using `-R` in some cases because `dvc add` and the single DVC-file it creates are not granular enough. But on the other hand, `-R` is cumbersome with all the DVC-files it creates - it's another painful issue.
OK, I repurposed this issue to explore providing more granular control of files in added dirs. (No longer related to `-R`, I think.)
@efiop
What kind of granularity... are you guys talking about?
Partial syncs like Ivan mentioned. Please see the description of this issue for an example.
I would pull a "how does Git do it?" card here
My impression is that the analogy between Git and DVC is not very good, because they try to do different things (code management vs. data management). So we should not strive to imitate Git in everything.
I did not say we should imitate Git in everything. What is your suggestion?
@dashohoxha what is your suggestion?
I don't have any suggestion. I cannot even understand the problem. But if people are complaining about it, maybe there is some problem.
@shcheklein @jorgeorpinel Thanks for the explanation, guys. Yeah, I remember those users who were asking about checking out or pulling only a particular set of files out of a directory. One of the big concerns at that time was that DVC was pretty slow with directories, which has been significantly improved since then (still a pretty long way to go, though), so maybe it is acceptable now. But I can still see cases where someone would only want to pull one file out of a directory, and it seems to me that we could support that for most of the operations; we just need to make a few improvements to our logic.
For example, the first step would be to support using paths instead of DVC-files as arguments, e.g. `dvc pull dir` would be the same as `dvc pull dir.dvc`. After that, we will need to adjust the commands to understand subpaths, so that `dvc pull dir/file` would only download the cache for `file` and check it out.
So to summarize: in terms of the current architecture, granular operations are possible with `dvc add dir` (without `-R`), but we need to adjust particular DVC commands to understand granular arguments.
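To illustrate, here is a minimal sketch of the proposed behavior (not current functionality; `dir` and `file` are placeholder names):
$ dvc add dir          # produces a single dir.dvc tracking the whole directory
$ dvc pull dir         # step 1: the data path resolves to dir.dvc
$ dvc pull dir/file    # step 2: fetches and checks out only dir/file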
I like the idea, and I'm glad to hear it doesn't seem too difficult to enable this. Using existing `<stage>.dvc` files as command targets (as is done now) could also be left as an option (for backward compatibility).
@jorgeorpinel sure, we will support both. It is also useful to have `stage.dvc` support when you have more than one output in it.
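For reference, a rough sketch of a stage file with more than one output (file names are hypothetical; real DVC-files also carry checksums and other fields, elided here):
$ cat stage.dvc
cmd: python train.py
outs:
- path: model.pkl
  ...
- path: metrics.json
  ...
Targeting stage.dvc addresses both outputs at once, which pure path targets would not cover in a single argument.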
Related: https://discordapp.com/channels/485586884165107732/485596304961962003/634088447858180108:
...
I sometimes want to have a peek at the data on the remote without downloading the entire dataset.
...
From Dmitry (on Slack):
There are two general scenarios:
1) - [x] (Higher priority) Make `dvc pull <file_name>` work without specifying the DVC-file (where `file_name` is tracked), similar to `dvc get <file_name>`, I guess.
2) - [x] Get files from a tracked data directory, e.g. `dvc pull data-dir/file.txt`. (Requires 1; see the sketch below.)
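A sketch of both scenarios, assuming they land as described (file names are placeholders):
$ dvc pull file.txt            # 1) resolves file.txt to its DVC-file (file.txt.dvc)
$ dvc pull data-dir/file.txt   # 2) fetches/checks out one file from a tracked directory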
Further rationale/motivation:
There are quite a lot of requests for this: a guy from ScribbleData asked personally, there was a request from Discord a couple of months ago, and I asked for this recently for the `dvc get` command.
Also, this problem blocks automation scenarios like CD4ML: you cannot push only changed files programmatically to another remote, since `dvc status` returns changed file names that cannot be used as input to `dvc push`, which requires DVC-files.
Next step: support for get and import.
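For `get` and `import` that could look like the following (a sketch based on the dataset-registry example above, where `tutorial` is tracked as a single directory):
$ dvc get https://github.com/iterative/dataset-registry tutorial/nlp/pipeline.zip
$ dvc import https://github.com/iterative/dataset-registry tutorial/nlp/Posts.xml.zip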
2 questions!
UPDATE: Yeah, we have iterative/dvc.org/issues/886; OK, I'll give it some priority.
Does `dvc list` support this granularity? I.e., does it list directory contents? I think so, but just double-checking.
Nope :) It will soon though :)
Yeah, that doc is on me; I was planning to tackle it when I have time.
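For reference, once implemented, a granular `dvc list` could presumably behave like this (the output shown is an assumption, based on the dataset-registry example above):
$ dvc list https://github.com/iterative/dataset-registry tutorial/nlp
Posts.xml.zip
pipeline.zip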