Dvc.org: update: how to update specific rev import? Clarify docs.

Created on 22 Oct 2019  ยท  21Comments  ยท  Source: iterative/dvc.org

UPDATE: See https://github.com/iterative/dvc.org/issues/735#issuecomment-545320373 and under for concepts that should be better explained in docs.


$ dvc version
DVC version: 0.63.4
Python version: 3.7.4
Platform: Darwin-19.0.0-x86_64-i386-64bit
Binary: False
Cache: reflink - True, hardlink - True, symlink - True
Filesystem type (cache directory): ('apfs', '/dev/disk1s5')
Filesystem type (workspace): ('apfs', '/dev/disk1s5')

$ mkdir test && cd test
$ dvc init --no-scm
...
$ dvc import --rev cats-dogs-v1 \
             [email protected]:iterative/dataset-registry.git \
             use-cases/cats-dogs
Importing 'use-cases/cats-dogs ([email protected]:iterative/dataset-registry.git)' -> 'cats-dogs'
Output 'cats-dogs' didn't change. Skipping saving.                                                                                                           
Saving information to 'cats-dogs.dvc'.
$ ls
total 8
drwxr-xr-x  3 usr  staff    96B Oct 22 14:05 cats-dogs/
-rw-r--r--  1 usr  staff   340B Oct 22 14:05 cats-dogs.dvc
$ tree cats-dogs --filelimit 3      
cats-dogs
โ””โ”€โ”€ data
    โ”œโ”€โ”€ train
    โ”‚ย ย  โ”œโ”€โ”€ cats [500 entries exceeds filelimit, not opening dir]
    โ”‚ย ย  โ””โ”€โ”€ dogs [500 entries exceeds filelimit, not opening dir]
...
$ dvc update cats-dogs.dvc
Saving information to 'cats-dogs.dvc'.
$ tree cats-dogs --filelimit 3
cats-dogs
โ””โ”€โ”€ data
    โ”œโ”€โ”€ train
    โ”‚ย ย  โ”œโ”€โ”€ cats [500 entries exceeds filelimit, not opening dir]
    โ”‚ย ย  โ””โ”€โ”€ dogs [500 entries exceeds filelimit, not opening dir]
...

Notice how the last command still counts 500 images each in cats-dogs/data/train cats/ and dogs/, however the latest version of cats-dogs.dvc in https://github.com/iterative/dataset-registry/tree/master/use-cases should track 1000 files in each.

doc-content

Most helpful comment

Even if you specify --rev <branch> I think what Ruslan said implies those imports will also never update.

@jorgeorpinel To understand the behavior of dvc update with respect to dvc import --rev <tag> and dvc import --rev <branch>, it might be useful to think about the difference between a <tag> and a <branch> in Git. A <tag> is a label that always points to the same commit (revision), unless explicitly moved. A <branch> is a floating label that always points at the top commit of a certain branch.
If you also think of a dvc update as a dvc get followed by a dvc commit, this should make clear its behavior.

I would say that if possible, updating should detect the branch where the imported rev exists. If there's more than one, use the one with the latest commit. What do you guys think?

If you understand properly how dvc update works, I think that this would be not only unnecessary, but also wrong and unexpected. When I import a <tag> or a <commit>, I would expect it not to change, unless I do a re-import with a different <tag> or <commit>.

For the same reason I think that adding a --rev option to dvc update is not a good idea. An explicit re-import would be better in my opinion.

All 21 comments

To confirm that all the files are on the remote, I go to my local copy of the dataset-registry repo and run:

$ git remote -v             
origin  [email protected]:iterative/dataset-registry.git (fetch)
origin  [email protected]:iterative/dataset-registry.git (push)
$ git rev-parse --short HEAD
99d1cdb
$ tree use-cases/cats-dogs --filelimit 3
use-cases/cats-dogs
โ””โ”€โ”€ data
    โ”œโ”€โ”€ train
    โ”‚ย ย  โ”œโ”€โ”€ cats [1000 entries exceeds filelimit, not opening dir]
    โ”‚ย ย  โ””โ”€โ”€ dogs [1000 entries exceeds filelimit, not opening dir]
...
$ dvc push                              
Everything is up to date.                                                                                                                                    

Note that the HEAD Git commit is https://github.com/iterative/dataset-registry/commit/99d1cdb9829ffb66323f862265b36c2b3ffc5dd4, same as the tip of master in that repo rn:

image

https://github.com/iterative/dataset-registry/commits/master

Can reproduce. Investigating.

Bug caused by https://github.com/iterative/dvc/pull/2583

EDIT: might be caused by that. Still investigating.

Ah, alright, I've missed this: cats-dogs-v1 is a tag that doesn't go anywhere. So update for that tag is not possible, as there is no newer version for it, it is not a branch. So this is not a bug.

@jorgeorpinel to update to cats-dogs-v2, you need to re-import it. There is no option for dvc update to tell it specific rev that you want to update to yet.

@jorgeorpinel We can introduce that option though :slightly_smiling_face: Something like dvc update cats-dogs.dvc --rev cats-dogs-v2. What do you think?

@efiop what would happen if I don't specify initially a --rev option when I do import and then I run dvc update. Will it update to the latest master?

Btw, what are the differences between running dvc update and dvc import again?

@efiop what would happen if I don't specify initially a --rev option when I do import and then I run dvc update. Will it update to the latest master?

@shcheklein Yes, it will import from default branch and if it moves then dvc update will update it to the new state.

Btw, what are the differences between running dvc update and dvc import again?

Update moves within the specified rev (the one specified during import) and so if it is a specific sha then it won't ever update it. If it is a specific tag that didn't move it also won't update it. Basically there is no difference between dvc update and dvc import --rev cats-dogs-v1 here, because first import before dvc update had --rev cats-dogs-v1, so dvc update can't update here.

Ah, alright, I've missed this: cats-dogs-v1 is a tag that doesn't go anywhere. So update for that tag is not possible...

Ah, thanks for the info! Interesting indeed. It makes sense...

to update to cats-dogs-v2, you need to re-import it...

Re-import it how? Using the commit hash instead of the tag name? That will detect the commit's branch when updating? (I want to import an older version and then update it to the latest.)

We can introduce that option though :slightly_smiling_face: Something like dvc update cats-dogs.dvc --rev cats-dogs-v2.

Yes, I was going to suggest that also.

Update moves within the specified rev (the one specified during import) and so if it is a specific sha then it won't ever update it.

Hm. Sounds like I can't do what I wanted then (import an older version and then update it to the latest), at least not with dvc update. I have to run dvc import again, knowing which rev contains the updates.

What happens if I (re-)import with dvc import cats-dogs.dvc --rev master, will that become a regular import that can be updated later with dvc update?

Besides or instead of adding --rev to dvc update, I would say that if possible, updating should detect the branch where the imported rev exists. If there's more than one, use the one with the latest commit. What do you guys think?

Hm. Sounds like I can't do what I wanted then (import an older version and then update it to the latest), at least not with dvc update

You can (and probably should). As an illustration, it won't be runnable. I think it will a regular workflow in certain case to import just a branch and then use dvc update from time to time to move to the latest commit in it.

As an illustration, it won't be runnable.

What do you mean?

in certain case to import just a branch and then use dvc update...

Even if you specify --rev <branch> I think what Ruslan said implies those imports will also never update. Let's wait for his feedback though.

What do you mean?

I mean that user won't be able to to run dcv import without specifying --rev to get the initial version. But, we should use dvc import without --rev to illustrate the regular workflow. In the real world your project evolves along with the data registry and you just use the most recent version w/o specifying --rev.

Ruslan said implies those imports will also never update. Let's wait for his feedback though.

no, I hope that when you use dvc import artifact (w/o rev) dvc update moves it ti the most recent version in that default branch.

Even if you specify --rev <branch> I think what Ruslan said implies those imports will also never update.

@jorgeorpinel To understand the behavior of dvc update with respect to dvc import --rev <tag> and dvc import --rev <branch>, it might be useful to think about the difference between a <tag> and a <branch> in Git. A <tag> is a label that always points to the same commit (revision), unless explicitly moved. A <branch> is a floating label that always points at the top commit of a certain branch.
If you also think of a dvc update as a dvc get followed by a dvc commit, this should make clear its behavior.

I would say that if possible, updating should detect the branch where the imported rev exists. If there's more than one, use the one with the latest commit. What do you guys think?

If you understand properly how dvc update works, I think that this would be not only unnecessary, but also wrong and unexpected. When I import a <tag> or a <commit>, I would expect it not to change, unless I do a re-import with a different <tag> or <commit>.

For the same reason I think that adding a --rev option to dvc update is not a good idea. An explicit re-import would be better in my opinion.

Makes sense about moving labelsreferences. I knew that, but was not understanding Ruslan's comments correctly.

But I don't agree that it would be wrong or unexpected that update detects the branch when imported with specific SHA or maybe even unmoved tag. When you run dvc update, and you know the repo has changes, you probably expect to get the changes back from update.

However I don't have a strong opinion either way. I don't think either approach is "wrong"; It's just a question of product design.

Of course its easier to not change/complicate it if there's no compelling reason. We could just should probably improve the docs to explain this behavior clearly.

@jorgeorpinel agreed on updating docs! (as usual, haha :))

I would clarify though that dvc update _updates_ the DVC-file and data when branch name is specified as a revision.

It does not update only when you explicitly told it to use an exact commit. It should be the same as regular package managers behave if you specify them a link to an exact commit on Github. Their update logic will keep them as is. And will be updating only if you use a branch name as a rev.

Agreed. I like the package manager analogy. Maybe we should've called these features dvc pkg install/update after all haha. ๐Ÿ˜‹

Adding details on all this to the docs in https://github.com/iterative/dvc.org/pull/679/commits/131af1ea370592c9237f8e55a2e0202c7fe4246d!

Should we also open a ticket to give special output from the command itself when it detects a fixed rev field in the DVC-file, and this rev hasn't changed? (An alert suggesting it may be necessary to re-import instead of updating if the rev is a tag or SHA commit)

It can be just a status message (not WARNING or something like that - since there is nothing wrong) that this file is up to date since exact commit is specified and there is nothing to update.

Opened iterative/dvc/issues/2696.

@jorgeorpinel ๐Ÿ™

Was this page helpful?
0 / 5 - 0 ratings