dvc checkout will leave empty folders

Created on 14 Aug 2019  Â·  8Comments  Â·  Source: iterative/dvc

dvc checkout doesn't restore the exact snapshot as it was in a previous state.
For example if you create a new folder and add files, and want to go back to a previous commit, where this folder and the files didn't exist, it will still show the folder, but empty.

Below you will find a reproducer of the problem.

#!/bin/bash                            

set -e                                 
set -x                                 

rm -rf myrepo                          
mkdir myrepo                           
cd myrepo                              

git init                               
dvc init                               

mkdir dir                              
dvc run -o dir/foo 'echo foo > dir/foo'

rm foo.dvc                             

dvc checkout                           

tree
bug p2-medium

All 8 comments

We are most likely using utils.remove method, so we probably need to make it(or some wrapper or whatever) remove empty parent directories up to repo root.

Ok, as we've discovered with @iterative/engineering , this

mkdir dir                              
dvc run -o dir/foo 'echo foo > dir/foo'

is actually wrong and dvc repro will break on it, because the command itself should be able to create all the needed subdirectories.

But even if we include mkdir sbudir, when you clone this repo elsewhere and you dvc checkout, dvc will automatically create that directory, so it should also remove it when it checks out elsewhere.

@efiop you explained that the first case is not correct. Could you please clarify with the second one - does DVC works properly and should we close this issue?

@dmpetrov No longer able to reproduce. Everything works as expected. Closing. Thanks for the heads up!

I am still able to reproduce this issue:

➜ dvc --version
1.8.2

➜ dvc init --no-scm
...

➜ mkdir foo

➜ dvc add foo
WARNING: 'foo' is empty.

➜ ls
foo/  foo.dvc

➜ cat foo.dvc
outs:
- md5: d751713988987e9331980363e24189ce.dir
  path: foo

➜ mkdir foo/bar

➜ dvc checkout --verbose
2020-10-12 14:27:46,792 DEBUG: Check for update is enabled.
2020-10-12 14:27:46,797 DEBUG: fetched: [(3,)]
2020-10-12 14:27:46,809 DEBUG: checking if 'foo'('HashInfo(name='md5', value='d751713988987e9331980363e24189ce.dir', dir_info=None)') has changed.
2020-10-12 14:27:46,809 DEBUG: Assuming '/Users/pokey/src/dvc-empty-test/.dvc/cache/d7/51713988987e9331980363e24189ce.dir' is unchanged since it is read-only
2020-10-12 14:27:46,811 DEBUG: Path '/Users/pokey/src/dvc-empty-test/foo' inode '87720699'
2020-10-12 14:27:46,811 DEBUG: fetched: [('99914b932bd37a50b983c5e7c90ae93b', '0', 'd751713988987e9331980363e24189ce.dir', '1602500101943105024')]
2020-10-12 14:27:46,812 DEBUG: 'foo' hasn't changed.
2020-10-12 14:27:46,813 DEBUG: Data 'foo' didn't change.
2020-10-12 14:27:46,814 DEBUG: fetched: [(4,)]
2020-10-12 14:27:46,817 DEBUG: Analytics is enabled.
2020-10-12 14:27:46,891 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/yb/k48gsy1s3qg1gdd72v9y6rdr0000gn/T/tmplekybbra']'
2020-10-12 14:27:46,893 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/yb/k48gsy1s3qg1gdd72v9y6rdr0000gn/T/tmplekybbra']'

➜ ls foo
bar/

At the end, I'd expect foo/bar to have been removed

@pokey is right, its still an issue:

def test_checkout_empty_dir(tmp_dir, dvc):
    tmp_dir.dvc_gen({"empty_dir":{}})
    tmp_dir.gen({"empty_dir": {"subdir": {}}})

    dvc.checkout()

    assert not os.path.exists(tmp_dir / "empty_dir" / "subdir")

This test fails. Seems like our check falsely reports that the directory is unchanged. This is probably due to the way we calculate cache for dir: introducing empty directories does not influence dir hash.

Obvious fix: introduce directories (even empty) information to dir hash is probably a no-go - that will make all dir hashes calculated prior to change invalid.

This is probably due to the way we calculate cache for dir: introducing empty directories does not influence dir hash.

Obvious fix: introduce directories (even empty) information to dir hash is probably a no-go - that will make all dir hashes calculated prior to change invalid.

Great points! Just for the record: that behaviour might be revisited in https://github.com/iterative/dvc/issues/829#issuecomment-711137723

But yeah, for now we could potentially just add some logic to clean up such cases.

Was this page helpful?
0 / 5 - 0 ratings