Dvc: add support for wildcard/patterns

Created on 31 Oct 2020 · 15 Comments · Source: iterative/dvc

Sometimes a user needs only a subset of files when importing or pulling data from a data directory. It would be convenient to define a file pattern for such an import or pull.

From https://discuss.dvc.org/t/working-with-a-small-subset-of-remote-data/541
Related: https://github.com/iterative/dvc/issues/4705, https://github.com/iterative/dvc/issues/4815

Patterns to implement:

  • [ ] simple wildcard dvc pull cats-dogs/data/train/dogs/*.img
  • [ ] whole wildcard dvc pull cats-dogs/data/train/{dogs,cats}/???.img
  • [ ] globstar/recursive dvc pull cats-dogs/data/train/**/*.img
  • [ ] iterator dvc pull cats-dogs/data/train/dogs/%C.img?counter=1:100
  • [ ] date dvc pull users/%Y/%m/%d/users.csv?startdate=2020-09-01,enddate=now,ignoremissing

The first three patterns should use regular Unix file syntax, while the last two require a special language to define the pattern - we need to find good examples.
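To make the last two patterns more concrete, here is a minimal sketch of how a counter or date template could be expanded into plain paths. This is only an illustration and not DVC code: the helper names expand_counter and expand_dates are hypothetical, %C is treated as a plain placeholder as proposed above, and the date template is assumed to use standard strftime codes.

```python
from __future__ import annotations

from datetime import date, timedelta


def expand_counter(template: str, start: int, stop: int) -> list[str]:
    """Replace the (hypothetical) %C placeholder with each integer in [start, stop]."""
    return [template.replace("%C", str(i)) for i in range(start, stop + 1)]


def expand_dates(template: str, start: date, end: date) -> list[str]:
    """Format the template with strftime for every day in [start, end]."""
    days = (end - start).days
    return [
        (start + timedelta(days=n)).strftime(template)
        for n in range(days + 1)
    ]


if __name__ == "__main__":
    print(expand_counter("cats-dogs/data/train/dogs/%C.img", 1, 3))
    print(expand_dates("users/%Y/%m/%d/users.csv", date(2020, 8, 31), date(2020, 9, 1)))
```

Expanding to explicit paths up front would let options such as ignoremissing be handled as a simple filter over the resulting list, but that is a design assumption rather than anything decided in this issue.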

feature p2-medium

All 15 comments

Based on my experience I'd assign the priorities like this:

  1. simple wildcard *
  2. globstar/recursive **
  3. date path/file-%Y-%m-%d.txt
  4. iterator/count %C
  5. whole wildcard - ?, ., {}

But we need to agree on the common pattern format (how to reflect the pattern in dvc-files) before implementing even the first step.

Regarding the first step (simple wildcard, dvc pull cats-dogs/data/train/dogs/*.img): support for dir entries will simply require treating the existing filter_info in https://github.com/iterative/dvc/blob/6a9ab9cdfbf8ddd5ccb647b072cc36955a69a0e1/dvc/output/base.py#L403 appropriately. Right now we only check whether the filter equals or contains other files.
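For illustration, filtering directory entries by a pattern instead of by equality/containment could look roughly like the sketch below. It is not the actual filter_info logic in DVC: glob_to_regex and filter_entries are hypothetical helpers, entries are assumed to be POSIX-style relative paths, and only *, ? and a leading-or-middle **/ are handled (no braces or character classes).

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a small glob subset into a compiled regex.

    `*` and `?` never cross a `/`; `**/` matches zero or more directories.
    Anything else is matched literally.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more whole directory levels
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")  # any run of characters within one level
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")  # exactly one character within one level
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def filter_entries(entries, pattern):
    """Keep only the entries whose full path matches the glob pattern."""
    regex = glob_to_regex(pattern)
    return [entry for entry in entries if regex.fullmatch(entry)]


if __name__ == "__main__":
    entries = [
        "data/train/dogs/1.img",
        "data/train/dogs/sub/2.img",
        "data/train/cats/1.img",
        "data/train/dogs/readme.txt",
    ]
    print(filter_entries(entries, "data/train/dogs/*.img"))
    print(filter_entries(entries, "data/train/**/*.img"))
```

A hand-rolled translation is used here because Python's fnmatch lets * match across path separators, which would blur the difference between the simple wildcard and the globstar cases.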

Regular glob patterns are clearer than the proposed date/counter selectors, which need some research on existing solutions. So this is a multilayer ticket with a lot of special cases.

Related #4419.

I will be taking a stab at implementing the first step for this issue.

  • [ ] simple wildcard dvc pull cats-dogs/data/train/dogs/*.img

Sounds like at least this check box could be marked, per #4864?

@jorgeorpinel No, only dvc add supports it right now.

@jorgeorpinel #4864 is only about dvc add. pull/push/import are missing for checking the first checkbox.

I can continue adding this functionality for all commands, if that's alright.

@ju0gri Thanks for looking into it! :pray:

In the case of dvc import, what is the desired behaviour when importing something like dir/subdir/foo*? Should dir/subdir contain one individual entry for each file matching the pattern?
Also, when importing only files such as foo* into a given output folder foos_imported, should that be a folder containing the individual files (e.g. foos_imported/foo.dvc, foos_imported/foo123.dvc), or should there be one entry per foo file prefixed with the output value (e.g. foos_imported_foo.dvc, foos_imported_foo123.dvc)?

@ju0gri Good question! We could start simple: dvc import and its signature only supports one target, so it would be safe to just error-out if after globbing you get more than one target.
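A minimal sketch of that "start simple" behaviour, assuming the candidate paths are already known as a list of strings. The names resolve_single_target and MultipleTargetsError are made up for this example and are not part of DVC; matching uses the standard library's fnmatchcase, whose * also matches "/" characters.

```python
from fnmatch import fnmatchcase


class MultipleTargetsError(Exception):
    """Raised when a pattern matches more than one target (hypothetical)."""


def resolve_single_target(pattern, candidates):
    """Return the single candidate matching the pattern, or raise."""
    # Note: fnmatchcase lets `*` match across "/" as well.
    matches = [c for c in candidates if fnmatchcase(c, pattern)]
    if not matches:
        raise FileNotFoundError(f"no target matches '{pattern}'")
    if len(matches) > 1:
        raise MultipleTargetsError(
            f"'{pattern}' matched {len(matches)} targets, expected exactly one"
        )
    return matches[0]


if __name__ == "__main__":
    candidates = ["dir/subdir/foo.txt", "dir/subdir/foo234783478432hjhfjdfd"]
    print(resolve_single_target("dir/subdir/foo2*", candidates))
```

With this shape, a pattern that matches exactly one path behaves like typing the full name, and anything ambiguous fails loudly instead of silently picking one file.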

Ok, so I was going down the complicated route with the solution for this. Does it still make sense to add the functionality to import in this case? The only benefits I see are simplifying the typing of a long, complex filename (e.g. foo234783478432hjhfjdfd), and maybe serving as a building block for future work where import might return a list of stages, similar to add.

@ju0gri Yep, still useful.

Question:

We've introduced the --glob option to a few commands to implement some of the patterns above (the ones covered by glob, i.e. 1, 2, and 5 from https://github.com/iterative/dvc/issues/4816#issuecomment-719996406).

Is the option temporary, with the expectation of making this the default behavior at some point? Otherwise I think we may need a better term, as discussed in https://github.com/iterative/dvc/pull/4976#issuecomment-736701953, and even more so now that I see patterns 3 (date) and 4 (iterator), which I think aren't covered by glob.

Thanks
