DVC: Validation of DVC files and pipelines without downloading actual data

Created on 21 Nov 2020 · 5 comments · Source: iterative/dvc

While using DVC I often have the following two questions:

  1. Can I pull data from the remote? I.e., do all *.dvc files have corresponding data in the remote?

    Currently I solve this using dvc status -c | grep "missing:" (as described in #4436, and sketched right after this list), but I think that's suboptimal, because the cache is also checked and it's a bit of a workaround.

    I can't just use the exit code, because I don't want to download all the data first; as a result, all the files that are present in the remote get marked as deleted and all the files that aren't get marked as missing, which means the exit code is always non-zero.

  2. If I pull the data, will the pipeline's outputs change?

    I don't know an easy way to check this, because data that hasn't been downloaded is considered deleted rather than modified, even when the hash values in the *.dvc file and in dvc.lock don't match.
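
For reference, a minimal sketch of the case 1 workaround, built only from the commands already mentioned above (dvc status -c piped into grep); the CI context and error message are assumptions for illustration, not part of the issue:

```bash
#!/usr/bin/env bash
# Case 1 workaround: fail if any data referenced by *.dvc files is
# reported as missing from the remote. It greps human-readable output,
# which is part of why the issue calls it a workaround.
if dvc status -c | grep -q "missing:"; then
    echo "Some data referenced by *.dvc files is missing from the remote" >&2
    exit 1
fi
```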

I propose implementing a --pull flag for dvc status, which would make it behave as if dvc pull had been run just before it. In particular, that means no deleted entries in the output and no actual file hash computations before making the comparison.

This solves both the first and the second case: we can now just check the exit code.
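
To make the proposal concrete, here is a hypothetical CI snippet; --pull is the flag being proposed in this issue, not an option dvc status currently has, and the exit-code behavior is assumed to work as described above:

```bash
#!/usr/bin/env bash
# Hypothetical usage of the proposed flag: --pull does not exist yet.
# The command would behave as if `dvc pull` had just been run, so files
# that are merely not downloaded are not reported, and (per the proposal)
# a non-zero exit code would mean the pipeline outputs really would change.
if ! dvc status --pull; then
    echo "Pipeline outputs would change after dvc pull" >&2
    exit 1
fi
```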

I think it also doesn't complicate things much from an interface perspective.

Labels: enhancement, feature request, p2-medium

All 5 comments

Similar logic could also be applied to a dvc repro --pull command: we could skip pulling dependencies for outputs whose hashes haven't changed and just download those outputs (assuming that all pipelines are deterministic).

Question 1: related to #4657, but here it is about files, not dirs?

Question 2: pull is equal to fetch (download cache from a remote repo to the local cache) + checkout (move files from the local cache into the workspace according to the hash values in *.dvc or dvc.lock). So if a file's md5 in your workspace matches the value in *.dvc or dvc.lock, it would not change; if it does not match, it would be replaced.
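
A rough sketch of that decomposition using the two existing commands (ignoring pull's own options and granularity flags):

```bash
# `dvc pull` is approximately:
dvc fetch      # download the needed objects from the remote into the local cache
dvc checkout   # place files from the local cache into the workspace,
               # following the hashes recorded in *.dvc files and dvc.lock
```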

I don't think we should actually perform fetch and checkout, just pretend that we did and ignore the fact that some files are actually missing from the workspace. Frankly, I'm not sure how to name the flag so that its behavior would be obvious. --ignore-missing would be another option, but then we have "missing" in the output of dvc status -c. --ignore-deleted, maybe?

Well, in a way #4657 is connected, yes: one could run dvc pull --dir-only and check the exit code, but AFAIK this would overwrite the *.dvc files in the workspace, and I wouldn't do that.

I think the main issue with the current dvc status behavior is that it assumes the user can easily pull all the data onto a local machine, but in my experience that's not always the case. One usually works on only one pipeline/dataset at a time, and it's OK if some files aren't present locally.
