I am dealing with a large hierarchical data set, one where artifacts are pulled from various directories to generate contiguous data sets that are then fed to downstream ML processes. I don't want to use dvc to reproduce the pipeline, at least not yet. My need is rather to version the overall image dataset hierarchy, for the purpose of manually inspecting the whole hierarchy and moving images into groups or removing them altogether when necessary.
This lets folks with less ML expertise control the data set they want to build by grouping together the content they want picked up when generating the data set. The data set is not a list of images; rather, it is a list of lower-dimensional feature vectors extracted from those images.
I'm finding dvc taking a potentially unreasonable amount of time to just add and commit. Perhaps I don't understand what I'm doing or haven't set my expectations correctly.
I wanted to keep these operations small in order to ensure things were working well. I have done the following. I have approximately 300K files in total in this set right now.
I am just looking for some guidance on managing a dataset of this nature using dvc in a way that will not eat up so much time, disk, compute, etc. If I'm doing something suboptimal, I want to shine some light on that.
Here dvc is using the better part of 99% of system memory (13 GB) and appears to be causing disk thrashing. It's been running for nearly a day so far.
sounds like a bug to me.
you actually don't even need to run it if you didn't use dvc add --no-commit or something similar.
large hierarchical data set
so, why keep it under a single dvc add if you can use dvc add data/feature1, dvc add data/feature2, etc.?
I'm finding dvc taking a potentially unreasonable amount of time to just add and commit.
commit - see above, we'll try to reproduce and see what's happening here
add - adding more files requires analyzing the directory again for now, unfortunately, though we cache certain things. Let's see anyway what's happening here as well - could you please share dvc version first?
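To make the per-directory suggestion concrete, here is a rough sketch, assuming the feature subdirectories all live under a single data/ root (adjust the names to your actual layout):

# sketch only: track each feature subdirectory separately instead of the whole root
# each dvc add produces its own data/<subdir>.dvc file and only hashes that subdirectory
for d in data/*/ ; do
    dvc add "$d"
done
# commit the resulting .dvc files and .gitignore entries to git
git add data/*.dvc data/.gitignore
git commit -m "track feature subdirectories individually"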
Thanks @shcheklein, I am thinking that adding each individual directory is better as well, though I wasn't sure if it would take a very long time to do that as well. I was also trying to keep the command to sync down the dataset simple, so that only the root directory would need to be synchronized.
My thinking here is that perhaps I didn't understand that simply running add more than once on a root directory containing large numbers of files could be causing dvc to run through all of the files each time.
For instance, if I add and commit each subdirectory, should each add only be concerned with that directory? If so, then yes I will do that, but I will need to terminate the existing commit command that has been running so long.
Also, how would this impact when a contributor needs to sync down the full dataset? Will they need to dvc checkout every single dvc file to get every single subdirectory local?
dvc --version shows 0.63.2
@JoeyCarson thanks! could you run specifically dvc version please - it can show a bit more information. Also, could you please update to the latest version to avoid hitting some already-solved issues.
For instance, if I add and commit each subdirectory, should each add only be concerned with that directory?
Yes. It will be more granular, and it'll analyze only one subdirectory. So, if the nature of your datasets is hierarchical by design, it's better to manage them separately.
If so, then yes I will do that, but I will need to terminate the existing commit command that has been running so long.
trying to reproduce this ... though, like I mentioned, you probably don't even need dvc commit.
to dvc checkout every single dvc file to get every single subdirectory local?
no, by default dvc checkout does check out all the files. But you can pass it a target to do a granular checkout.
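For example (feature1 is just a placeholder name here), a single tracked subdirectory can be checked out or pulled on its own:

# check out only one tracked subdirectory from the local cache
dvc checkout data/feature1.dvc
# or fetch and check out just that piece from the remote
dvc pull data/feature1.dvc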
Also, in your workflow I would take a look at import/get - they were made specifically to manage feature and data registries. Let me know if you need more details here; we can discuss.
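Roughly, the registry-style workflow mentioned above looks like this - the repository URL and paths below are placeholders, not a real registry:

# download an artifact tracked in another DVC repository, without tracking it here
dvc get https://github.com/example/dataset-registry data/feature1
# or import it, which also records its origin in a .dvc file so it can be updated later
dvc import https://github.com/example/dataset-registry data/feature1 -o features/feature1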
dvc version shows the following output
DVC Version: 0.63.2
Python version: 3.6.8
Platform: Windows-10-10.0.18362-SP0
Binary: False
I figured that using the individual directories would be faster and easier. I will go ahead and do the following, hopefully without losing the existing cache.
Additionally I will have a look into the import workflow and come back with any questions I have. Thanks for the support!
@JoeyCarson sounds good! Please also take a look here - https://dvc.org/doc/user-guide/running-dvc-on-windows - those are the caveats we have found so far running DVC on Windows. Addressing some of them can significantly improve performance.
@JoeyCarson also, how large is your data? If you don't manually edit specific files, I would also consider turning on this optimization - https://dvc.org/doc/user-guide/large-dataset-optimization -
otherwise DVC copies files into the local cache on Windows by default.
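If you never edit the tracked files in place, a sketch of that optimization would look like the following; treat the exact cache.type values as an assumption for this DVC version and check the docs page above:

# prefer hardlinks, falling back to copying where the filesystem can't hardlink
dvc config cache.type "hardlink,copy"
# re-run checkout so files already in the workspace get re-linked from the cache
dvc checkout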
@JoeyCarson one more quick question. When you do dvc commit, does it print anything? Something like "Creating unpacked dir ..."?
Yes! It does, and it shows it only makes it to 77% of whatever that means. In fact, during the last commit it showed that twice, both times only getting to 77%.
@JoeyCarson thanks! Looks like one of our optimizations does not like Windows that much. We will test it here.
Happy to point that out. I am also finding that when running dvc add for one of those smaller subdirectories (only 500 files and one empty child directory inside), it still hangs a while without doing anything. I also created a dvc repo (ran dvc init separately), and several minutes after I added the subdirectory, I still see no .dvc/cache directory. This is the first add, so I expect it to show up, though it doesn't.
@JoeyCarson no messages at all? Could you check CPU load level during this? It's probably calculating md5s for the files. Btw, how large is that set - number of files, data size?
I have run it with -v for verbose logging. It shows that it spawned the daemon process, outputs a few lines that look like SQLite CREATE TABLE queries, and finally PRAGMA user_version = 3. It sat there a while and is now writing output to the command line. I signed out of Windows and back in before I did this, so maybe something was corrupt before.
@JoeyCarson how did you install DVC - pip or conda? and how large is your data - number of files and size per file?
I installed using pip, from an anaconda virtual environment.
conda activate myenv
python -m pip install dvc
Images are mostly JPEG and some PNG at source dimensions, i.e. not resized. Each subset being added has about 500 images on average, and there are about 30 subsets in this particular category.
@JoeyCarson and to completely replicate your environment, what is the Anaconda version and what shell do you use - Git Bash or Anaconda's one?
Also, why don't you install it with conda?
I have been able to reproduce it on Windows with the ImageNet dataset (~50GB).
dvc add is 8 times slower (compared to Mac).
Running dvc commit after that - the process was killed due to low memory (16GB machine).
I haven't applied any additional Windows or DVC configuration.
I'll create a separate ticket to investigate and deal with this.
@efiop I suggest we create a separate ticket for dvc commit consuming too much time and memory, and bump the priority of #2360.
@JoeyCarson and to completely replicate your environment, what is the Anaconda version and what shell do you use - Git Bash or Anaconda's one?
I always use Anaconda Prompt with a conda virtual environment. Machine is rather busy at the moment running dvc add, so I will update this comment with version info ASAP.
Also, why don't you install it with conda?
Generally, I use conda for the ease of virtual environments. When it comes to deployment to a pure-Python PaaS, it's usually easier if everything is installed with pip, as conda usually writes non-standard module declarations that can confuse pure-Python environments. I've observed non-standard declarations when running pip freeze after installing with conda. So I generally activate my virtual environment and then run python -m pip install package_name. That enforces using the virtual env's pip installation and, I believe, that packages are installed from PyPI.
For the record, OOM fix is https://github.com/iterative/dvc/pull/2735 .
@JoeyCarson OOM fix was released in a new version of dvc. Please upgrade and give it a try :) Thanks for the feedback!
5x slower (was 8x slower) - after adding Python and the project directory to the Windows security whitelist as described here.
Copy semantics are enabled by default. The next step is to enable hardlinks.
$ python -mcProfile -s cumtime -m dvc add images
ncalls tottime percall cumtime percall filename:lineno(function)
1326850/1 151.750 0.000 4919.023 4919.023 {built-in method builtins.exec}
1 0.000 0.000 4919.022 4919.022 runpy.py:195(run_module)
1 0.000 0.000 4918.861 4918.861 runpy.py:62(_run_code)
1 0.000 0.000 4918.861 4918.861 __main__.py:1(<module>)
1 0.000 0.000 4918.752 4918.752 main.py:21(main)
1 0.000 0.000 4918.015 4918.015 add.py:16(run)
1 0.000 0.000 4918.015 4918.015 __init__.py:32(wrapper)
1 0.000 0.000 4917.952 4917.952 scm_context.py:2(run)
1 0.000 0.000 4917.952 4917.952 add.py:20(add)
1 0.000 0.000 3356.946 3356.946 stage.py:748(commit)
1 0.000 0.000 3356.946 3356.946 base.py:255(commit)
1 0.000 0.000 3356.945 3356.945 base.py:477(save)
1 0.000 0.000 3344.306 3344.306 base.py:491(_save)
1 3.390 3.390 3344.305 3344.305 base.py:441(_save_dir)
440796 4.425 0.000 3199.029 0.007 base.py:421(_save_file)
440796 0.620 0.000 1675.638 0.004 base.py:378(link)
440796 3.038 0.000 1675.018 0.004 base.py:381(_link)
440801/440800 1.046 0.000 1580.231 0.004 base.py:295(get_checksum)
440796 5.254 0.000 1560.799 0.004 slow_link_detection.py:29(wrapper)
440796 1.243 0.000 1548.212 0.004 base.py:388(_try_links)
1 0.000 0.000 1548.144 1548.144 stage.py:711(save)
1 0.000 0.000 1548.143 1548.143 base.py:217(save)
2 0.000 0.000 1548.141 774.071 base.py:324(save_info)
440797 3.288 0.000 1546.690 0.004 base.py:404(_do_link)
1 0.000 0.000 1514.319 1514.319 base.py:202(changed)
1 0.000 0.000 1514.319 1514.319 base.py:187(status)
1 0.000 0.000 1514.319 1514.319 base.py:173(changed_checksum)
1 0.893 0.893 1428.897 1428.897 base.py:230(get_dir_checksum)
1 0.517 0.517 1357.461 1357.461 base.py:193(_collect_dir)
440796 0.870 0.000 1352.709 0.003 local.py:174(copy)
440796 3.527 0.000 1351.839 0.003 system.py:17(copy)
440796 67.413 0.000 1346.624 0.003 shutil.py:96(copyfile)
1 1.569 1.569 1022.543 1022.543 base.py:180(_calculate_checksums)
440796 7.258 0.000 1010.477 0.002 shutil.py:76(copyfileobj)
440797 0.381 0.000 1006.315 0.002 std.py:1051(__iter__)
440797 2.805 0.000 1005.934 0.002 _base.py:591(result_iterator)
440796 4.518 0.000 1002.828 0.002 _base.py:408(result)
362592 2.116 0.000 996.654 0.003 threading.py:264(wait)
725208 991.924 0.001 991.924 0.001 {method 'acquire' of '_thread.lock' objects}
So, it is 1346.624 to copy files, probably 1580.231 to calculate checksums, what else? @efiop @Suor could you help to interpret this?
It's better to sort it by tottime or provide longer output; there are not enough rows here
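For instance, something along these lines would give the self-time view, or a raw stats file to attach:

# same run as before, but sorted by self time instead of cumulative time
python -m cProfile -s tottime -m dvc add images
# or save the raw stats for later inspection with pstats
python -m cProfile -o dvc_add.prof -m dvc add images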
Also, are you adding 440k images at once?
And you are waiting a lot on thread lock.
https://github.com/iterative/dvc/blob/master/dvc/remote/slow_link_detection.py#L14
@casperdcl Btw, does tqdm walk away if the lock is taken or does it wait on it to update the progress bar?
@Suor yep, I saw the thread lock; not sure where it comes from and how it affects the total execution time. In general, I don't know how the profile is calculated when a method is called from multiple threads. Should we take into account only the time of the enclosing method?
Also, are you adding 440k images at once?
yes, it's ImageNet - one directory, split into one or two hundred dirs with images in them
@Suor @efiop :
Full dumps to analyze:
Windows, Hardlinks, 0.66.8, conda, Defender Off
Mac, Hardlinks, 0.66.9dev, pip
@efiop
Btw, does tqdm walk away if the lock is taken or does it wait on it to update the progress bar?
The lock is acquired purely when trying to write console output. It isn't dropped at all. Maybe we should change it to drop and skip outputting after maybe 0.1 seconds?
@casperdcl In the old progress implementation, we used to just walk away if it was locked on a regular update [1] and actually wait for the lock if the task was finished [2]. 0.1 sec is still quite a lot; maybe we could adopt our old approach in tqdm too, so it doesn't slow down multiple workers working in parallel (just to be clear, I'm not 100% sure it is tqdm that is at fault here)? What do you think?
[1] https://github.com/iterative/dvc/blob/0.35.6/dvc/progress.py#L51
[2] https://github.com/iterative/dvc/blob/0.35.6/dvc/progress.py#L70
@casperdcl Let me clarify: when, say, you are using a tqdm progress bar for writing a file, then for each chunk, that same thread is going to wait for the tqdm lock to be available, right? So tqdm would be slowing down workers?
@efiop yes, tqdm uses the blocking approach for both intermediate updates (so not [1]) and final output ([2]). I agree that (especially for multithreading) "walk away" should be an option.
opened upstream issue: https://github.com/tqdm/tqdm/issues/838
@casperdcl Or another option would be to make tqdm create a separate thread that deals with all progress bar updates and make workers just submit to a queue, so it is hopefully faster. Not sure which approach would be better for tqdm in general :slightly_smiling_face: Though the queue one could be implemented on top of tqdm, of course.
I do actually have a singleton monitor thread managing other bars, but it probably needs a redesign. tqdm was originally designed for efficient single-threaded use. Fixing this issue ASAP is a bigger priority for me than upgrading the monitor thread. Will happen soon.
For the record: I tried running dvc add on a dir with 25K files of 1M size generated from urandom. Ran on Windows and Linux on the same machine and on the same NTFS drive. Windows scored 686 sec and Linux 776 sec. So it looks like the issue reported by @shcheklein is caused by the SSD on the EC2 Windows instance being slower than the PCIe SSD on his MacBook (you can see from cProfile that I/O takes much longer on Windows).
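For reference, a sketch of how such a synthetic set can be generated on the Linux side, with the file count and size taken from the comment above:

# create 25K files of 1 MiB random data, then track the directory with dvc
mkdir -p data
for i in $(seq 1 25000); do
    dd if=/dev/urandom of="data/file_$i" bs=1M count=1 status=none
done
dvc add data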
@casperdcl do we still plan to use the deque-based lockless mechanism? Also, as I see it, you now use a non-blocking RLock.acquire() - what's the point? Does it prevent output from mixing up across threads?
yes, as with the previous (before tqdm) approach, it prevents multiple threads from writing progress output on top of each other. Since it's now non-blocking it should also have negligible overhead in multi-threaded cases. I don't see the point of a deque atm.
Also @Suor the main issue here seems to be dvc commit:
Here dvc is using the better part of 99% of system memory (13 GB) and appears to be causing disk thrashing. It's been running for nearly a day so far.
which doesn't look UI-related.
all file operations are freaking faster on Mac, like 5x faster at least
So cutting down on those duplicate exists/isdir calls might help. Not sure though - nt.stat takes more time than posix.stat does in the Mac bench; however, exists doesn't have that much cumtime.
One thing that bothers me is that you have 440796 calls to shutil.copyfile() and corresponding copyfileobj() calls on Windows, but not on Mac. It doesn't look like Mac is really copying anything.
@casperdcl but if it doesn't block, how does it prevent two threads from printing at the same time? Say the first thread acquires the lock and prints to stdout; meanwhile the second thread fails to acquire the lock, but still continues and prints to stdout. They mix outputs.
@casperdcl the issue you refer to was caused by a memory leak and was a show-stopper for sure. However, that one was fixed, yet Windows is still considerably slower than unix, which bothers us.
@Suor I feel this discussion should be had on #2767 <- #2766 but essentially failure to acquire a non-blocking lock means that output is skipped (rather than interleaved). In practice screen refresh could reduce from ~10/s to ~1/s per bar but the task will finish quicker :)
@casperdcl oh, I caused some confusion - I don't mean bar refreshes, I mean actual output, which we do through the logger, which calls Tqdm.write().
That ignores lock_args and hence blocks on the lock, which means a few things:
1. Two logger.*() calls won't mix.
True
2. If we have lots of concurrent logger output this will be slow.
Yes, but I assume logging output is never that much. If there is a lot of logging output then we have bigger problems (why is there so much logging output? excessive verbosity is hard to use for debugging; non-blocking logging would mean interleaving and/or dropped messages, neither of which is good)
When we run with the -v flag there are lots of messages logged. And queues should work: no blocking, no interleaving.
I don't know whether that is a real or only a potential issue though.
Yes, would need profiling. Based on https://github.com/iterative/dvc/pull/2767#issuecomment-552067441 there's no issue with dvc add, but I don't know about dvc commit...
@casperdcl That commit leak was fixed in https://github.com/iterative/dvc/pull/2735 . Sorry, forgot to link it here.
a quick update, btw. I tried to run it on the same instance type + storage type as I use for Windows, but on XFS Linux, and the same dvc add operation is 8-9x slower now (after the thread-locking fix). Still researching in the background what's causing this and whether there are workarounds when we deal with large numbers of files on Windows.