Pushing a 1.5 GB DVC repository with 8K small data files takes more than 6 hours inside AWS.
There were 83K very small files, not 8K.
So, this issue's priority is not too high.
Data directories are a big part of this issue, since DVC mirrors the directory structure and creates a corresponding file for each file in the directory.
@efiop is there a way to create a single file to store the data directory structure?
I have a few thoughts on the alternative ways to store dirs. I'll take a closer look soon.
Actually, the main problem here is that we don't reuse AWS connections and instead establish a new one every time we push/pull a file. This is a known issue. I am already working on it.
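To illustrate why this matters: the sketch below is not DVC's actual code, just a minimal model where a hypothetical connect() stub stands in for an expensive handshake (e.g. setting up an S3 session). Counting calls shows the difference between reconnecting per file and reusing one connection.

```python
# Hedged sketch (hypothetical code, not DVC internals): connect() stands in
# for an expensive connection handshake such as S3 session setup.
CONNECT_CALLS = 0

def connect():
    """Pretend to establish a remote connection; count how often we do it."""
    global CONNECT_CALLS
    CONNECT_CALLS += 1
    return object()  # dummy connection handle

def push_naive(files):
    # Anti-pattern: a fresh connection for every single file.
    for f in files:
        conn = connect()
        # ... upload f over conn ...

def push_reused(files):
    # Fix: establish the connection once and reuse it for all files.
    conn = connect()
    for f in files:
        pass  # ... upload f over conn ...

files = [f"file_{i}" for i in range(100)]
push_naive(files)
naive_calls = CONNECT_CALLS

CONNECT_CALLS = 0
push_reused(files)
reused_calls = CONNECT_CALLS
```

With 83K files, paying the handshake cost once instead of 83K times is a large part of the slowdown.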
Ok, found the issue. Cached dirs are currently processed in a single thread in the cloud logic code. Already working on fixing that.
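The general shape of the fix can be sketched with a thread pool; this is a minimal illustration, not DVC's implementation, and upload_one is a hypothetical stand-in for the per-file push.

```python
# Hedged sketch: process a cached dir's files with a thread pool instead of
# one at a time. Network-bound uploads overlap well across threads.
from concurrent.futures import ThreadPoolExecutor

def upload_one(path):
    # Stand-in for a per-file network upload; returns the path on success.
    return path

def push_dir(paths, max_workers=8):
    # Submit all files at once; the pool keeps max_workers uploads in flight.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_one, paths))

uploaded = push_dir([f"cache/{i}" for i in range(20)])
```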
Managed to get a 7x speedup for a few hundred small files with https://github.com/dataversioncontrol/dvc/pull/504 . Still, DVC is not really optimized for an extreme number of small files, so you might want to consider packing them into a tar archive or something similar. Closing for now.
7x is awesome! Thank you.
Yeah, we need to invent some file packing tricks like Git does.
Note that the group shared between users has to be a primary group, not a secondary one. This can be changed by running: usermod -g groupName $USER
@mazzma12 Looks like you've mistaken dvc and dvc.org repos :) You've meant to link https://github.com/iterative/dvc.org/issues/497 . I'll copy your comment there. Thanks!