Details:
$ dvc version
DVC version: 0.62.1
Python version: 3.6.5
Platform: Linux-3.10.0-693.11.1.el7.x86_64-x86_64-with-debian-stretch-sid
Binary: False
Filesystem type (workspace): ('xfs', '/dev/mapper/docker-259:1-2147484675-2dfd687cd3e711f6528c675a70265f8bef1522f96104edf9ae18beec60932fc5')
Context: https://opendatascience.slack.com/archives/CGGLZJ119/p1571500778020600
dvc run was used to unzip an archive of 1M images (~11GB) into a directory, with cache: false:
md5: 1b59aa1f99b185cddbce8eccc10c86bb
cmd: unzip artifacts/input/resized_images_256.zip -d artifacts/input/resized_images_256
wdir: ../..
deps:
- md5: 7e1268ff3d039bd51ab49c3781e5819a
path: artifacts/input/resized_images_256.zip
outs:
- md5: 635645a153f7531c2bd38faee65b1992.dir
path: artifacts/input/resized_images_256
cache: false
metric: false
persist: false
dvc remove.dvc gc took a while to execute after that (why?)
dvc status after that:
dvc status -v
DEBUG: Trying to spawn '['/opt/conda/bin/python', '/opt/conda/bin/dvc', 'daemon', '-q', 'updater']'
DEBUG: Spawned '['/opt/conda/bin/python', '/opt/conda/bin/dvc', 'daemon', '-q', 'updater']'
DEBUG: PRAGMA user_version;
DEBUG: fetched: [(3,)]
DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
DEBUG: PRAGMA user_version = 3;
It hung for more than an hour and was killed.
Caused by #2492
@efiop can you give a bit more detail? In this case the directory is not under DVC control, so how can it be related to checkout?
I tried to reproduce this using ImageNet (~0.5M images):
dvc add images.tar
dvc run -d images.tar -O images tar -xf images.tar
dvc remove images.tar.dvc
dvc status <--- takes a few minutes
dvc gc <--- fast
dvc status <--- takes a few minutes, looks like it takes longer than the previous one
output of the last dvc status:
Computing md5 for a large number of files. This is only done once.
images.dvc:
changed deps:
deleted: images.tar
images.tar.dvc:
changed outs:
deleted: images.tar
dvc version:
DVC version: 0.63.3+1e01ce
Python version: 3.7.3
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Cache: reflink - True, hardlink - True, symlink - True
No luck reproducing this so far,
or something changed between 0.62.1 and the latest master - will check it on the next run.
@shcheklein it's not clear whether the dir was modified or not. If it was modified, say a few files are missing, then it will be slow. That is an older issue - https://github.com/iterative/dvc/issues/1970
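For readers following along, here is a rough sketch of why a status check over a large directory can be cheap when nothing changed and expensive otherwise. It is not DVC's actual code. It only assumes a checksum cache shaped loosely like the state table in the debug log above (inode, mtime, size, md5); the function names, the text-typed inode column, and the invalidation rule are all hypothetical.

```python
import hashlib
import os
import sqlite3


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_state(db_path=":memory:"):
    """Create a toy state table (hypothetical; inode kept as text for simplicity)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state ("
        "inode TEXT PRIMARY KEY, mtime TEXT NOT NULL, "
        "size TEXT NOT NULL, md5 TEXT NOT NULL)"
    )
    return conn


def cached_md5(conn, path):
    """Return (md5, was_cached); rehash only if the mtime or size on record differs."""
    st = os.stat(path)
    inode, mtime, size = str(st.st_ino), str(st.st_mtime_ns), str(st.st_size)
    row = conn.execute(
        "SELECT mtime, size, md5 FROM state WHERE inode = ?", (inode,)
    ).fetchone()
    if row is not None and row[0] == mtime and row[1] == size:
        return row[2], True
    md5 = file_md5(path)  # the expensive part: reads the whole file
    conn.execute(
        "INSERT OR REPLACE INTO state (inode, mtime, size, md5) VALUES (?, ?, ?, ?)",
        (inode, mtime, size, md5),
    )
    return md5, False


def scan_directory(conn, root):
    """Walk root and return (files hashed from scratch, files served from cache)."""
    hashed = cached = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            _, hit = cached_md5(conn, os.path.join(dirpath, name))
            if hit:
                cached += 1
            else:
                hashed += 1
    return hashed, cached
```

Even in this toy version, a fully cached scan still stats every file and runs one query per file, so with ~1M files the bookkeeping alone is not free, and any cache miss adds a full read of that file.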
@Suor :
(.env) [ivan@ivan ~/Projects/test-imagenet]$ time dvc status
Computing md5 for a large number of files. This is only done once.
images.dvc:
changed deps:
deleted: images.tar
changed outs:
modified: images
images.tar.dvc:
changed outs:
deleted: images.tar
real 4m57.531s
user 3m59.674s
sys 3m8.965s
This is after modifying the directory (removing one file).
4m << 1h+
@shcheklein Indeed, that ticket doesn't seem related. Need to take a closer look.
@efiop have we resolved or ruled out the bug?
@shcheklein No, I don't think so. Didn't look into this at all yet.
The issue has disappeared for the user. Closing for now, until there is more info and/or a way to reproduce this.