I am dealing with a large hierarchical data set, one where artifacts are pulled from various directories to generate contiguous data sets that are then fed to downstream ML processes. I don't want to use dvc to reproduce the pipeline, at least not yet. My need is rather to version the overall image dataset hierarchy, for the purpose of manually inspecting the whole hierarchy and moving images into groups or removing them altogether when necessary.
This lets folks with less ML expertise control the data set they want to build by grouping together the content they want picked up when generating the data set. The data set is not a list of images; rather, it is a list of lower-dimensional feature vectors extracted from those images.
I'm finding dvc taking a potentially unreasonable amount of time to just add and commit. Perhaps I don't understand what I'm doing or haven't set my expectations correctly.
I wanted to keep these operations small in order to ensure things were working well. I have done the following. I have approximately 300K files in total in this set right now.
I am just looking for some guidance on managing a dataset of this nature using dvc in a way that will not eat up so much time, disk, compute, etc. If I'm doing something suboptimal, I want to shine some light on that.
Here dvc is using the better part of 99% of system memory (13 GB) and appears to be causing disk thrashing. It's been running for nearly a day so far.
sounds like a bug to me.
you actually don't even need to run it if you didn't use dvc add --no-commit or something similar.
large hierarchical data set
so, why keep it under a single dvc add if you can use dvc add data/feature1, dvc add data/feature2, etc.?
I'm finding dvc taking a potentially unreasonable amount of time to just add and commit.
commit - see above, we'll try to reproduce and see what's happening here
add - adding more files requires analyzing the directory again for now, unfortunately, though we cache certain things. Let's see anyway what's happening here as well - could you please share dvc version first?
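To make the per-directory suggestion concrete, here is a rough sketch, assuming the feature subdirectories all live under a single data/ root (adjust the names to your actual layout):

# sketch only: track each feature subdirectory separately instead of the whole root
# each dvc add produces its own data/<subdir>.dvc file and only hashes that subdirectory
for d in data/*/ ; do
    dvc add "$d"
done
# commit the resulting .dvc files and .gitignore entries to git
git add data/*.dvc data/.gitignore
git commit -m "track feature subdirectories individually"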
Thanks @shcheklein, I am thinking that adding each individual directory is better as well, though I wasn't sure if it would take a very long time to do that as well. I was also trying to keep the command to sync down the dataset simple, so that only the root directory would need to be synchronized.
My thinking here is that perhaps I didn't understand that simply running add more than once on a root directory containing large numbers of files could be causing dvc to run through all of the files each time.
For instance, if I add and commit each subdirectory, should each add only be concerned with that directory? If so, then yes I will do that, but I will need to terminate the existing commit command that has been running so long.
Also, how would this impact when a contributor needs to sync down the full dataset? Will they need to dvc checkout every single dvc file to get every single subdirectory local?
dvc --version shows 0.63.2
@JoeyCarson thanks! could you run specifically dvc version please - it can show a bit more information. Also, could you please update to the latest version to avoid hitting some already-solved issues.
For instance, if I add and commit each subdirectory, should each add only be concerned with that directory?
Yes. It will be more granular, and it'll analyze only one subdirectory. So, if the nature of your datasets is hierarchical by design, it's better to manage them separately.
If so, then yes I will do that, but I will need to terminate the existing commit command that has been running so long.
trying to reproduce this ... though, like I mentioned, you probably don't even need dvc commit.
to dvc checkout every single dvc file to get every single subdirectory local?
no, by default dvc checkout does check out all the files. But you can pass it a target to do a granular checkout.
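For example (feature1 is just a placeholder name here), a single tracked subdirectory can be checked out or pulled on its own:

# check out only one tracked subdirectory from the local cache
dvc checkout data/feature1.dvc
# or fetch and check out just that piece from the remote
dvc pull data/feature1.dvc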
Also, in your workflow I would take a look at import/get - they were made specifically to manage feature and data registries. Let me know if you need more details here; we can discuss.
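Roughly, the registry-style workflow mentioned above looks like this - the repository URL and paths below are placeholders, not a real registry:

# download an artifact tracked in another DVC repository, without tracking it here
dvc get https://github.com/example/dataset-registry data/feature1
# or import it, which also records its origin in a .dvc file so it can be updated later
dvc import https://github.com/example/dataset-registry data/feature1 -o features/feature1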
dvc version shows the following output
DVC Version: 0.63.2
Python version: 3.6.8
Platform: Windows-10-10.0.18362-SP0
Binary: False
I figured that using the individual directories would be faster and easier. I will go ahead and do the following, hopefully without losing the existing cache.
Additionally I will have a look into the import workflow and come back with any questions I have. Thanks for the support!
@JoeyCarson sounds good! Please also take a look here - https://dvc.org/doc/user-guide/running-dvc-on-windows - those are the caveats we have found so far running DVC on Windows. Addressing some of them can significantly improve performance.
@JoeyCarson also, how large is your data? If you don't manually edit specific files, I would also consider turning on this optimization - https://dvc.org/doc/user-guide/large-dataset-optimization -
otherwise DVC copies files into the local cache on Windows by default.
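If you never edit the tracked files in place, a sketch of that optimization would look like the following; treat the exact cache.type values as an assumption for this DVC version and check the docs page above:

# prefer hardlinks, falling back to copying where the filesystem can't hardlink
dvc config cache.type "hardlink,copy"
# re-run checkout so files already in the workspace get re-linked from the cache
dvc checkout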
@JoeyCarson one more quick question. When you do dvc commit, does it print anything? Something like "Creating unpacked dir ..."?
Yes! It does, and it shows it only makes it to 77% of whatever that means. In fact, during the last commit it showed that twice, both times only getting to 77%.
@JoeyCarson thanks! Looks like one of our optimizations does not like Windows that much. We will test it here.
Happy to point that out. I am also finding that when running dvc add for one of those smaller subdirectories (only 500 files and one empty child directory inside), it still hangs a while without doing anything. I also created a dvc repo (ran dvc init separately), and several minutes after I added the subdirectory, I still see no .dvc/cache directory. This is the first add, so I expect it to show up, though it doesn't.
@JoeyCarson no messages at all? Could you check CPU load level during this? It's probably calculating md5s for the files. Btw, how large is that set - number of files, data size?
I have run it with -v for verbose logging. It shows that it spawned the daemon process, outputs a few lines that look like SQLite CREATE TABLE queries, and finally PRAGMA user_version = 3. It sat there a while and is now writing output to the command line. I signed out of Windows and back in before I did this, so maybe something was corrupt before.
@JoeyCarson how did you install DVC - pip or conda? and how large is your data - number of files and size per file?
I installed using pip, from an anaconda virtual environment.
conda activate myenv
python -m pip install dvc
Images are mostly JPEG and some PNG at source dimensions, i.e. not resized. Each subset being added has about 500 images on average, and there are about 30 subsets in this particular category.
@JoeyCarson and to completely replicate your environment, what is the Anaconda version and what shell do you use - Git Bash or Anaconda's one?
Also, why don't you install it with conda?
I have been able to reproduce it on Windows with the ImageNet dataset (~50GB).
dvc add is 8 times slower (compared to Mac).
Running dvc commit after that - the process was killed due to low memory (16GB machine).
I haven't applied any additional Windows or DVC configuration.
I'll create a separate ticket to investigate and deal with this.
@efiop I suggest we create a separate ticket for dvc commit consuming too much time and memory, and bump the priority of #2360.
@JoeyCarson and to completely replicate your environment, what is the Anaconda version and what shell do you use - Git Bash or Anaconda's one?
I always use Anaconda Prompt with a conda virtual environment. Machine is rather busy at the moment running dvc add, so I will update this comment with version info ASAP.
Also, why don't you install it with conda?
Generally, I use conda for the ease of virtual environments. When it comes to deployment to a pure-Python PaaS, it's usually easier if everything is installed with pip, as conda usually writes non-standard module declarations that can confuse pure-Python environments. I've observed non-standard declarations when running pip freeze after installing with conda. So I generally activate my virtual environment and then run python -m pip install package_name. That enforces using the virtual env's pip installation and, I believe, that packages are installed from PyPI.
For the record, OOM fix is https://github.com/iterative/dvc/pull/2735 .
@JoeyCarson OOM fix was released in a new version of dvc. Please upgrade and give it a try :) Thanks for the feedback!
5x slower (was 8x slower) - after adding Python and the project directory to the Windows security whitelist as described here.
Copy semantics are enabled by default. The next step is to enable hardlinks.
$ python -mcProfile -s cumtime -m dvc add images
ncalls tottime percall cumtime percall filename:lineno(function)
1326850/1 151.750 0.000 4919.023 4919.023 {built-in method builtins.exec}
1 0.000 0.000 4919.022 4919.022 runpy.py:195(run_module)
1 0.000 0.000 4918.861 4918.861 runpy.py:62(_run_code)
1 0.000 0.000 4918.861 4918.861 __main__.py:1(<module>)
1 0.000 0.000 4918.752 4918.752 main.py:21(main)
1 0.000 0.000 4918.015 4918.015 add.py:16(run)
1 0.000 0.000 4918.015 4918.015 __init__.py:32(wrapper)
1 0.000 0.000 4917.952 4917.952 scm_context.py:2(run)
1 0.000 0.000 4917.952 4917.952 add.py:20(add)
1 0.000 0.000 3356.946 3356.946 stage.py:748(commit)
1 0.000 0.000 3356.946 3356.946 base.py:255(commit)
1 0.000 0.000 3356.945 3356.945 base.py:477(save)
1 0.000 0.000 3344.306 3344.306 base.py:491(_save)
1 3.390 3.390 3344.305 3344.305 base.py:441(_save_dir)
440796 4.425 0.000 3199.029 0.007 base.py:421(_save_file)
440796 0.620 0.000 1675.638 0.004 base.py:378(link)
440796 3.038 0.000 1675.018 0.004 base.py:381(_link)
440801/440800 1.046 0.000 1580.231 0.004 base.py:295(get_checksum)
440796 5.254 0.000 1560.799 0.004 slow_link_detection.py:29(wrapper)
440796 1.243 0.000 1548.212 0.004 base.py:388(_try_links)
1 0.000 0.000 1548.144 1548.144 stage.py:711(save)
1 0.000 0.000 1548.143 1548.143 base.py:217(save)
2 0.000 0.000 1548.141 774.071 base.py:324(save_info)
440797 3.288 0.000 1546.690 0.004 base.py:404(_do_link)
1 0.000 0.000 1514.319 1514.319 base.py:202(changed)
1 0.000 0.000 1514.319 1514.319 base.py:187(status)
1 0.000 0.000 1514.319 1514.319 base.py:173(changed_checksum)
1 0.893 0.893 1428.897 1428.897 base.py:230(get_dir_checksum)
1 0.517 0.517 1357.461 1357.461 base.py:193(_collect_dir)
440796 0.870 0.000 1352.709 0.003 local.py:174(copy)
440796 3.527 0.000 1351.839 0.003 system.py:17(copy)
440796 67.413 0.000 1346.624 0.003 shutil.py:96(copyfile)
1 1.569 1.569 1022.543 1022.543 base.py:180(_calculate_checksums)
440796 7.258 0.000 1010.477 0.002 shutil.py:76(copyfileobj)
440797 0.381 0.000 1006.315 0.002 std.py:1051(__iter__)
440797 2.805 0.000 1005.934 0.002 _base.py:591(result_iterator)
440796 4.518 0.000 1002.828 0.002 _base.py:408(result)
362592 2.116 0.000 996.654 0.003 threading.py:264(wait)
725208 991.924 0.001 991.924 0.001 {method 'acquire' of '_thread.lock' objects}
So, it is 1346.624 to copy files, probably 1580.231 to calculate checksums, what else? @efiop @Suor could you help to interpret this?
It's better to sort it by tottime or provide longer output; there are not enough rows here
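For instance, something along these lines would give the self-time view, or a raw stats file to attach:

# same run as before, but sorted by self time instead of cumulative time
python -m cProfile -s tottime -m dvc add images
# or save the raw stats for later inspection with pstats
python -m cProfile -o dvc_add.prof -m dvc add images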
Also, are you adding 440k images at once?
And you are waiting a lot on thread lock.
https://github.com/iterative/dvc/blob/master/dvc/remote/slow_link_detection.py#L14
@casperdcl Btw, does tqdm walk away if the lock is taken or does it wait on it to update the progress bar?
@Suor yep, I saw the thread lock; not sure where it comes from and how it affects the total execution time. In general, I don't know how the profile is calculated when a method is called from multiple threads. Should we take into account only the time of the enclosing method?
Also, are you adding 440k images at once?
yes, it's ImageNet - one directory, split into one or two hundred dirs with images in them
@Suor @efiop :
Full dumps to analyze:
Windows, Hardlinks, 0.66.8, conda, Defender Off
Mac, Hardlinks, 0.66.9dev, pip
@efiop
Btw, does tqdm walk away if the lock is taken or does it wait on it to update the progress bar?
The lock is acquired purely when trying to write console output. It isn't dropped at all. Maybe we should change it to drop and skip outputting after maybe 0.1 seconds?
@casperdcl In the old progress implementation, we used to just walk away if it was locked on a regular update [1] and actually wait for the lock if the task was finished [2]. 0.1 sec is still quite a lot; maybe we could adopt our old approach in tqdm too, so it doesn't slow down multiple workers working in parallel (just to be clear, I'm not 100% sure it is tqdm that is at fault here)? What do you think?
[1] https://github.com/iterative/dvc/blob/0.35.6/dvc/progress.py#L51
[2] https://github.com/iterative/dvc/blob/0.35.6/dvc/progress.py#L70
@casperdcl Let me clarify: when, say, you are using a tqdm progress bar for writing a file, then for each chunk, that same thread is going to wait for the tqdm lock to be available, right? So tqdm would be slowing down workers?
@efiop yes, tqdm uses the blocking approach for both intermediate updates (so not [1]) and final output ([2]). I agree that (especially for multithreading) "walk away" should be an option.
opened upstream issue: https://github.com/tqdm/tqdm/issues/838
@casperdcl Or another option would be to make tqdm create a separate thread that deals with all progress bar updates and make workers just submit to a queue, so it is hopefully faster. Not sure which approach would be better for tqdm in general :slightly_smiling_face: Though the queue one could be implemented on top of tqdm, of course.
I do actually have a singleton monitor thread managing other bars, but it probably needs a redesign. tqdm was originally designed for efficient single-threaded use. Fixing this issue ASAP is a bigger priority for me than upgrading the monitor thread. Will happen soon.
For the record: I tried running dvc add on a dir with 25K files of 1M size generated from urandom. Ran on Windows and Linux on the same machine and on the same NTFS drive. Windows scored 686 sec and Linux 776 sec. So it looks like the issue reported by @shcheklein is caused by the SSD on the EC2 Windows instance being slower than the PCIe SSD on his MacBook (you can see from cProfile that I/O takes much longer on Windows).
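For reference, a sketch of how such a synthetic set can be generated on the Linux side, with the file count and size taken from the comment above:

# create 25K files of 1 MiB random data, then track the directory with dvc
mkdir -p data
for i in $(seq 1 25000); do
    dd if=/dev/urandom of="data/file_$i" bs=1M count=1 status=none
done
dvc add data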
@casperdcl do we still plan to use the deque-based lockless mechanism? Also, as I see it, you now use a non-blocking RLock.acquire() - what's the point? Does it prevent output from mixing up across threads?
yes, as with the previous (before tqdm) approach, it prevents multiple threads from writing progress output on top of each other. Since it's now non-blocking it should also have negligible overhead in multi-threaded cases. I don't see the point of a deque atm.
Also @Suor the main issue here seems to be dvc commit:
Here dvc is using the better part of 99% of system memory (13 GB) and appears to be causing disk thrashing. It's been running for nearly a day so far.
which doesn't look UI-related.
all file operations are freaking faster on Mac, like 5x faster at least
So cutting down on those duplicate exists/isdir calls might help. Not sure though - nt.stat takes more time than posix.stat does in the Mac bench; however, exists doesn't have that much cumtime.
One thing that bothers me is that you have 440796 calls to shutil.copyfile() and corresponding copyfileobj() calls on Windows, but not on Mac. It doesn't look like Mac is really copying anything.
@casperdcl but if it doesn't block, how does it prevent two threads from printing at the same time? Say the first thread acquires the lock and prints to stdout; meanwhile the second thread fails to acquire the lock, but still continues and prints to stdout. They mix outputs.
@casperdcl the issue you refer to was caused by a memory leak and was a show-stopper for sure. However, that one was fixed, yet Windows is still considerably slower than unix, which bothers us.
@Suor I feel this discussion should be had on #2767 <- #2766 but essentially failure to acquire a non-blocking lock means that output is skipped (rather than interleaved). In practice screen refresh could reduce from ~10/s to ~1/s per bar but the task will finish quicker :)
@casperdcl oh, I caused some confusion - I don't mean bar refreshes, I mean actual output, which we do through the logger, which calls Tqdm.write().
That ignores lock_args and hence blocks on the lock, which means a few things:
1. Two logger.*() calls won't mix.
True
2. If we have lots of concurrent logger output this will be slow.
Yes, but I assume logging output is never that much. If there is a lot of logging output then we have bigger problems (why is there so much logging output? excessive verbosity is hard to use for debugging; non-blocking logging would mean interleaving and/or dropped messages, neither of which is good)
When we run with the -v flag there are lots of messages logged. And queues should work: no blocking, no interleaving.
I don't know whether that is a real or only a potential issue though.
Yes, would need profiling. Based on https://github.com/iterative/dvc/pull/2767#issuecomment-552067441 there's no issue with dvc add, but I don't know about dvc commit...
@casperdcl That commit leak was fixed in https://github.com/iterative/dvc/pull/2735 . Sorry, forgot to link it here.
a quick update, btw. I tried to run it on the same instance type + storage type as I use for Windows, but on XFS Linux, and the same dvc add operation is 8-9x slower now (after the thread-locking fix). Still researching in the background what's causing this and whether there are workarounds when we deal with large numbers of files on Windows.