Ubuntu: 18.04
dvc: 0.22.0
I added dvc to an already existing project like so:
dvc add data
dvc run -d data -d train.py -o model -f train.dvc --no-exec train.py ...
(where data and model are directories containing many image files and large ML models, respectively)
I've added an S3 remote and committed the dvc-files into git. When calling
dvc push
dvc hangs at "20% Collecting information".
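For context, a sketch of the setup steps described above (remote name, bucket, and commit message are placeholders, not my actual values):
dvc remote add -d myremote s3://mybucket/dvc-storage
git add data.dvc train.dvc .dvc/config
git commit -m "Track data and training pipeline with dvc"
dvc push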
Calling dvc push -v produces lots of output like:
Debug: SELECT * from state WHERE inode=1576004
Debug: fetched: [(1576004, '1544624726428000000', '449ec1c0d6beeadb55b6b4808ce41ec7', '1544630970012086784')]
Debug: Inode '1576004', mtime '1544624726428000000', actual mtime '1544624726428000000'.
Debug: UPDATE state SET timestamp = "1544630972341281280" WHERE inode = 1576004
... and the last log-output is:
Debug: File '/home/blabla/.dvc/cache/44/9ec1c0d6beeadb55b6b4808ce41ec7', md5 '449ec1c0d6beeadb55b6b4808ce41ec7', actual '449ec1c0d6beeadb55b6b4808ce41ec7'
Debug: Path /home/blabla/.dvc/cache/07/270488711c768df5ad02185cb81493 inode 1576007
Debug: SELECT * from state WHERE inode=1576007
Debug: fetched: [(1576007, '1544624727832000000', '07270488711c768df5ad02185cb81493', '1544630970013530368')]
Debug: Inode '1576007', mtime '1544624727832000000', actual mtime '1544624727832000000'.
Debug: UPDATE state SET timestamp = "1544630972343799552" WHERE inode = 1576007
Debug: File '/home/blabla/.dvc/cache/07/270488711c768df5ad02185cb81493', md5 '07270488711c768df5ad02185cb81493', actual '07270488711c768df5ad02185cb81493'
Debug: File '/home/blabla/.dvc/cache/ea/ad01785397a0a43a25cf21fb816b8e', md5 'eaad01785397a0a43a25cf21fb816b8e', actual 'None'
[###### ] 20% Collecting information
What's going wrong here?
Hi @stvogel!
How long has it been stuck? It looks like dvc is collecting information about the existing cache. Approximately how many directories with how many files do you have? If we are talking about directories with millions of files, it might take a while. It is a known performance limitation (see e.g. https://github.com/iterative/dvc/issues/1429) caused by the current way we store and process such directories. We are actively working on optimizing that.
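If you want a rough count of how many files dvc has to process, something like this should do (assuming your tracked directories are data and model):
find data model -type f | wc -l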
Thanks,
Ruslan
Ok ... maybe this was a false alarm ... after nearly an hour, dvc went on pushing the data to the S3 remote.
And yes, the data and model folders contain > 5 GB of data, so I assume this is worth waiting for ;-)
@stvogel Glad it worked eventually :) Let's keep this open until we at least improve the information messages (e.g. add an additional progress bar, or modify the current one to be more informative).
I had the same problem: paused for 2h+.
One workaround for now (thanks to Ruslan on Discord): mark the directory output as no-cache (the -O flag below), then compress it into a single file (which dvc is happy with), push, and decompress on the other end.
dvc run -d small_stuff -O image_dir python make_images_dir.py
dvc run -d image_dir -o images.tar tar -cf images.tar image_dir
tar still takes a while, of course, but it eventually works.
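For completeness, the remaining steps after the two dvc run stages above might look like this (dvc's default naming puts the stage in images.tar.dvc; adjust if yours differs):
dvc push
# later, on the machine that needs the images:
dvc pull images.tar.dvc
tar -xf images.tar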