Sometimes a user needs only a subset of the files in a data directory when running import or pull. It would be convenient to define a file pattern for an import.
From https://discuss.dvc.org/t/working-with-a-small-subset-of-remote-data/541
Related: https://github.com/iterative/dvc/issues/4705, https://github.com/iterative/dvc/issues/4815
Patterns to implement:
- [ ] dvc pull cats-dogs/data/train/dogs/*.img
- [ ] dvc pull cats-dogs/data/train/{dogs,cats}/???.img
- [ ] dvc pull cats-dogs/data/train/**/*.img
- [ ] dvc pull cats-dogs/data/train/dogs/%C.img?counter=1:100
- [ ] dvc pull users/%Y/%m/%d/users.csv?startdata=2020-09-01,enddate=now,ignoremissing

The first three patterns should use regular Unix file syntax, while the last two require a special language to define the pattern; we need to find good examples.
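To make the first three patterns concrete, here is a minimal sketch of how Unix-style matching could select files from a list of tracked paths. It is not DVC code: `glob_to_regex` and `select` are hypothetical helpers, and the sketch assumes `*` and `?` never cross a `/`, `**/` spans zero or more directories, and `{a,b}` is a flat alternation without nesting.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a Unix-style glob into a compiled regex (hypothetical helper).

    Supports *, ?, **/ and {a,b}; character classes are not handled.
    Raises ValueError if a '{' has no closing '}'.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")  # any run of characters within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")  # exactly one character within a segment
            i += 1
        elif pattern[i] == "{":
            end = pattern.index("}", i)
            alternatives = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = end + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def select(paths, pattern):
    """Return the paths that match the glob, preserving input order."""
    regex = glob_to_regex(pattern)
    return [p for p in paths if regex.match(p)]
```

With this sketch, `select(paths, "cats-dogs/data/train/{dogs,cats}/???.img")` would keep only files with a three-character stem in either directory. The last two patterns (`%C` and the date fields) are deliberately left out here, since their syntax is still undecided.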
Based on my experience I'd assign the priorities like this:
1. *
2. **
3. path/file-%Y-%m-%d.txt
4. %C
5. ?, ., {}

But we need to agree on the common pattern format (how to reflect the pattern in dvc-files) before implementing even the first step.
Regarding the first step
simple wildcard dvc pull cats-dogs/data/train/dogs/*.img
support for dir entries will simply require treating the existing filter_info in https://github.com/iterative/dvc/blob/6a9ab9cdfbf8ddd5ccb647b072cc36955a69a0e1/dvc/output/base.py#L403 appropriately. Right now we only check whether the filter equals or contains other files.
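As a rough illustration of that change, the sketch below contrasts an "equals or contains" check with a wildcard-aware one. This is my reading of the comment above, not the actual code in dvc/output/base.py: `matches_filter` and `matches_filter_glob` are hypothetical names, and paths are treated as plain POSIX-style strings. Note that `fnmatchcase` lets `*` match `/` as well, so a real implementation would probably need stricter semantics.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches_filter(entry: str, filter_info: str) -> bool:
    """Assumed current behaviour: keep the entry if the filter equals it
    or is one of its parent directories."""
    entry_path = PurePosixPath(entry)
    filter_path = PurePosixPath(filter_info)
    return entry_path == filter_path or filter_path in entry_path.parents


def matches_filter_glob(entry: str, filter_info: str) -> bool:
    """Hypothetical extension: treat the filter as a wildcard pattern when
    it contains glob characters, otherwise fall back to the check above."""
    if any(ch in filter_info for ch in "*?["):
        # Caveat: fnmatchcase does not stop '*' at directory separators.
        return fnmatchcase(entry, filter_info)
    return matches_filter(entry, filter_info)
```

The fallback keeps existing plain-path targets working unchanged, so only targets that actually contain wildcard characters take the new code path.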
Regular glob patterns are clearer than the proposed date/counter selectors, which need some research on existing solutions. So this is a multilayer ticket with a lot of special cases.
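For discussion purposes, here is one way the date and counter selectors could be expanded into concrete paths before any lookup. Everything about it is an assumption: `expand_counter` and `expand_dates` are hypothetical helpers, the counter range is taken as inclusive with no zero-padding, and the date template is simply handed to `strftime`. Parsing of the `?counter=...` and `?startdata=...` query parts is not covered.

```python
from datetime import date, timedelta


def expand_counter(template: str, start: int, stop: int) -> list:
    """Replace %C with each integer from start to stop.

    Assumption: the range is inclusive and numbers are not zero-padded.
    """
    return [template.replace("%C", str(n)) for n in range(start, stop + 1)]


def expand_dates(template: str, start: date, end: date) -> list:
    """Format the template once per day from start to end (inclusive),
    using the standard strftime fields %Y, %m and %d."""
    days = (end - start).days
    return [
        (start + timedelta(days=offset)).strftime(template)
        for offset in range(days + 1)
    ]
```

For example, `expand_dates("users/%Y/%m/%d/users.csv", date(2020, 9, 1), date.today())` would produce one candidate path per day; an option like ignoremissing would then decide what happens when a candidate does not exist.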
Related #4419.
I will be taking a stab at implementing the first step for this issue.
- [ ] simple wildcard dvc pull cats-dogs/data/train/dogs/*.img
Sounds like at least this check box could be marked, per #4864?
@jorgeorpinel No, only dvc add supports it right now.
@jorgeorpinel #4864 is only about dvc add. pull/push/import are missing for checking the first checkbox.
I can continue adding this functionality for all commands, if that's alright.
@ju0gri Thanks for looking into it! :pray:
In the case of dvc import, what is the desired behaviour for example when importing something like dir/subdir/foo* - should dir/subdir contain one individual entry for each file matching the pattern?
Also, when importing only files such as foo* in a passed output folder foos_imported - should this be a folder containing the individual files e.g. foos_imported/foo.dvc, foos_imported/foo123.dvc or should there be an entry for each foo file prefixed with the output value: e.g. foos_imported_foo.dvc, foos_imported_foo123.dvc?
@ju0gri Good question! We could start simple: dvc import and its signature only supports one target, so it would be safe to just error-out if after globbing you get more than one target.
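A minimal sketch of that "start simple" rule, assuming the candidate paths from the source repository are already available as a list. `resolve_single_target` and `MultipleTargetsError` are hypothetical names, not part of DVC, and `fnmatchcase` stands in for whatever matching DVC ends up using.

```python
from fnmatch import fnmatchcase


class MultipleTargetsError(Exception):
    """Raised when a pattern expands to more than one import target."""


def resolve_single_target(pattern: str, available: list) -> str:
    """Expand the pattern and return the only match, or raise.

    import supports a single target, so anything other than exactly
    one match is treated as an error.
    """
    matches = sorted(p for p in available if fnmatchcase(p, pattern))
    if not matches:
        raise FileNotFoundError(f"no path matches {pattern!r}")
    if len(matches) > 1:
        raise MultipleTargetsError(
            f"{pattern!r} matches {len(matches)} paths; expected exactly one"
        )
    return matches[0]
```

This keeps the existing one-target signature intact while still letting users abbreviate a long file name with a wildcard.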
OK, so I was going down the complicated route with my solution for this. Does it still make sense to add the functionality to import in this case? The only benefits I see are simplifying the typing of a long, complex filename, e.g. foo234783478432hjhfjdfd, and perhaps serving as a building block for future work where import might return a list of stages, similar to add.
@ju0gri Yep, still useful.
Question:
We've introduced the --glob option to a few commands to implement some of the patterns above (the ones covered by glob, i.e. 1, 2, and 5 from https://github.com/iterative/dvc/issues/4816#issuecomment-719996406).
Is the option temporary, with the expectation that this becomes the default behavior at some point? Otherwise I think we may need a better term, as discussed in https://github.com/iterative/dvc/pull/4976#issuecomment-736701953, and even more so now that I see patterns 3 (iterator) and 4 (date), which I think aren't covered by glob.
Thanks