Dvc: status: extremely slow on a million image uncached directory

Created on 19 Oct 2019 · 10 Comments · Source: iterative/dvc

Details:

$ dvc version
DVC version: 0.62.1
Python version: 3.6.5
Platform: Linux-3.10.0-693.11.1.el7.x86_64-x86_64-with-debian-stretch-sid
Binary: False
Filesystem type (workspace): ('xfs', '/dev/mapper/docker-259:1-2147484675-2dfd687cd3e711f6528c675a70265f8bef1522f96104edf9ae18beec60932fc5')

Context: https://opendatascience.slack.com/archives/CGGLZJ119/p1571500778020600

  • dvc run was used to unzip an archive of 1M images (~11 GB) into a directory, with cache: false:
md5: 1b59aa1f99b185cddbce8eccc10c86bb
cmd: unzip artifacts/input/resized_images_256.zip -d artifacts/input/resized_images_256
wdir: ../..
deps:
- md5: 7e1268ff3d039bd51ab49c3781e5819a
  path: artifacts/input/resized_images_256.zip
outs:
- md5: 635645a153f7531c2bd38faee65b1992.dir
  path: artifacts/input/resized_images_256
  cache: false
  metric: false
  persist: false
  • A few images were missing from the old archive, and the ZIP was removed with dvc remove.
  • dvc gc took a while to execute after that (why?)
  • dvc status after that:
dvc status -v
DEBUG: Trying to spawn '['/opt/conda/bin/python', '/opt/conda/bin/dvc', 'daemon', '-q', 'updater']'
DEBUG: Spawned '['/opt/conda/bin/python', '/opt/conda/bin/dvc', 'daemon', '-q', 'updater']'
DEBUG: PRAGMA user_version;
DEBUG: fetched: [(3,)]
DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
DEBUG: PRAGMA user_version = 3;

It hung for more than an hour and was killed.
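
The debug log stops right after the state database is set up. As a rough illustration of what a table with those columns could be used for, here is a minimal Python sketch, assuming the state table caches each file's md5 by inode and treats the entry as stale when mtime or size changes. Only the schema is copied from the log; `open_state`, `file_md5`, `cached_md5` and the in-memory database are hypothetical names for illustration, not DVC's actual implementation.

```python
import hashlib
import os
import sqlite3
import time

# Schema copied from the debug log above; everything else is hypothetical.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS state ("
    "inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, "
    "md5 TEXT NOT NULL, timestamp TEXT NOT NULL)"
)


def open_state(db_path=":memory:"):
    """Open (or create) a state database with the table from the log."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_md5(conn, path):
    """Return (md5, was_cached); rehash only if mtime or size changed."""
    st = os.stat(path)
    mtime, size = str(st.st_mtime_ns), str(st.st_size)
    row = conn.execute(
        "SELECT mtime, size, md5 FROM state WHERE inode = ?", (st.st_ino,)
    ).fetchone()
    if row is not None and row[0] == mtime and row[1] == size:
        return row[2], True
    md5 = file_md5(path)
    conn.execute(
        "INSERT OR REPLACE INTO state (inode, mtime, size, md5, timestamp) "
        "VALUES (?, ?, ?, ?, ?)",
        (st.st_ino, mtime, size, md5, str(time.time_ns())),
    )
    return md5, False
```

With a cache like this, an unchanged file costs one `stat` and one indexed lookup, whereas a cold cache over a million files means a million full reads. The sketch only shows the caching idea and does not explain the hang reported here.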

bug p0-critical performance research

All 10 comments

Caused by #2492

@efiop can you give a bit more detail? In this case the directory is not under DVC control, so how can it be related to checkout?

I tried to reproduce this using ImageNet (~0.5M images):

dvc add images.tar
dvc run -d images.tar -O images tar -xf images.tar
dvc remove images.tar.dvc
dvc status <--- takes a few minutes
dvc gc <--- fast
dvc status <--- takes a few minutes, seemingly longer than the previous run

output of the last dvc status:

Computing md5 for a large number of files. This is only done once.
images.dvc:
    changed deps:
        deleted:            images.tar
images.tar.dvc:
    changed outs:
        deleted:            images.tar

dvc version:

DVC version: 0.63.3+1e01ce
Python version: 3.7.3
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Cache: reflink - True, hardlink - True, symlink - True

No luck reproducing this so far.

Or something changed between 0.62.1 and the latest master. I will check that on the next run.

@shcheklein it's not clear whether the dir is modified or not. If it is modified, say a few files are missing, then it will be slow. That is an older issue - https://github.com/iterative/dvc/issues/1970

@Suor :

(.env) [ivan@ivan ~/Projects/test-imagenet]$ time dvc status
Computing md5 for a large number of files. This is only done once.
images.dvc:
    changed deps:
        deleted:            images.tar
    changed outs:
        modified:           images
images.tar.dvc:
    changed outs:
        deleted:            images.tar

real    4m57.531s
user    3m59.674s
sys     3m8.965s

This is after modifying the directory (removing one file).

4m << 1h+
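
For anyone following along, here is a rough sketch of why removing a single file can show the whole directory as `modified`, assuming the directory checksum (the `.dir` md5 in the stage file) is derived from the sorted list of per-file md5s. The JSON serialization and the names `file_md5` and `dir_md5` are assumptions for illustration and may differ from what DVC actually does.

```python
import hashlib
import json
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dir_md5(root):
    """Hypothetical directory checksum: md5 over sorted (relpath, md5) entries."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append({"md5": file_md5(full), "relpath": rel})
    # Sort so the result does not depend on filesystem listing order.
    entries.sort(key=lambda entry: entry["relpath"])
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ".dir"
```

Under this assumption, any added or removed file changes the entry list and therefore the directory checksum, and verifying the checksum means visiting every file in the directory at least once.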

@shcheklein Indeed, that ticket doesn't seem related. Need to take a closer look.

@efiop have we resolved or ruled out the bug?

@shcheklein No, I don't think so. Didn't look into this at all yet.

The issue has disappeared for the user. Closing for now, until there is more info and/or a way to reproduce this.
