Windows 7 / 10, dvc 0.93.0+f5d89f (after merging #3645)
Running cProfile on `dvc repro -P` in a repository with around 200 steps, all of which are up to date. On my system this takes 65s, of which 42s is spent in `_edit_rwlock`.
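(For reference, a profile like this can be collected along these lines — the exact invocation isn't shown in the report, so treat the entry-point import as an assumption about dvc 0.x:)

```python
# Hedged sketch: profile `dvc repro -P` with cProfile and dump the
# hot spots. Assumes dvc 0.x exposes its CLI entry point as
# dvc.main.main, which it did around this version.
import cProfile
import pstats

from dvc.main import main

cProfile.runctx("main(['repro', '-P'])", globals(), locals(), "repro.prof")

stats = pstats.Stats("repro.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 by cumulative time
```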
I wonder, is it necessary to lock each individual file here? Couldn't the whole repository be locked instead?
Maybe a broader question: what are the respective purposes of `.dvc/tmp/rwlock`, `.dvc/lock` and `.dvc/updater.lock`?
> Running cProfile on `dvc repro -P` in a repository with around 200 steps, all of which are up to date. On my system this takes 65s, of which 42s is spent in `_edit_rwlock`.
Thanks for the investigation! That is indeed pretty strange, we need to take a closer look. Could you show your `dvc version` output, btw?
> I wonder, is it necessary to lock each individual file here? Couldn't the whole repository be locked instead?

Actually, that's what we used to do, but `rwlock` was introduced to support granular locking, so that you could run dvc in parallel.
> Maybe a broader question: what are the respective purposes of `.dvc/tmp/rwlock`, `.dvc/lock` and `.dvc/updater.lock`?
Useful link: https://dvc.org/doc/user-guide/dvc-files-and-directories :slightly_smiling_face: `rwlock` is for granular locking of specific deps and outs, `lock` is for locking the whole repo (when you need to push/pull/etc. or need to create/modify/delete a dvc-file), and `updater.lock` is for our updater worker, which checks for updates once a week.
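To make the granularity concrete, here is a heavily simplified sketch of the idea behind `rwlock` — a small JSON registry of which paths are being read or written by running dvc processes. Names and schema are illustrative, not DVC's actual implementation:

```python
import json
import os

RWLOCK_PATH = os.path.join(".dvc", "tmp", "rwlock")  # per-repo lock registry

def acquire_write(path, pid, cmd):
    """Record that `pid` intends to write `path`, failing if another
    process is already reading or writing it. Simplified sketch: the
    read-modify-write of the registry itself is not guarded here."""
    lock = {"read": {}, "write": {}}
    if os.path.exists(RWLOCK_PATH):
        with open(RWLOCK_PATH) as f:
            lock = json.load(f)
    if path in lock["write"] or path in lock["read"]:
        raise RuntimeError(f"'{path}' is in use by another dvc process")
    lock["write"][path] = {"pid": pid, "cmd": cmd}
    with open(RWLOCK_PATH, "w") as f:
        json.dump(lock, f)
```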
@charlesbaynham Btw, have you tried running that on Linux? Wonder if this is Windows-specific or if we have troubles everywhere.
Thanks for the rundown @efiop, that's helpful. The version that ran the above was `0.93.0+f5d89f`, which is v0.93.0 plus the changes in PR #3645. That's necessary because otherwise the effect is swamped by all the time spent reproducing duplicate steps.
I'll give it a try on Linux: annoyingly, the repository I tested on has some case-sensitivity problems which I've glossed over so far on Windows but which prevent it from working on Linux. I should fix that anyway.
The cProfile results for my repo on Windows:
repro_nowork.zip
Here are the cProfile results for `dvc repro -P` in the same repo, run on Linux and Windows. It looks like the same fraction of time is spent writing to `rwlock` in both (just via `posix.fsync` instead of `nt.fsync`).
Here's a screenshot of the profile for Windows:
Zooming in, you can see that the actual calls to `Stage.changed()` don't take much time at all, though that's where I'd have expected most of the load to be.
@charlesbaynham Sorry for the delay. Admittedly I didn't have much time to look deeply into it, but could you give my patch a try, please?

```
pip uninstall -y dvc
# make sure there is no old dvc version installed
pip install git+https://github.com/efiop/dvc@3653
```
Yes, that makes a lot of difference. Here are two profiles in the same repo on Windows 7, one at `v0.93.0+f5d89f` and the other at your patch:
Your patch cuts the time spent in `_reproduce_stages` from 62s to 23s.
But doesn't that just effectively disable the lock?
I also notice that we're making 1.4 million calls to `normpath`, which seems like a lot for a repo with ~200 files in it!
Took a quick look at the profiler results to see if the excessive `normpath` calls were pathinfo-related, and they are not, but we do make 220k calls to `relpath` in `dvc.utils` (with string paths).
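For anyone reproducing this, the standard `pstats` API can show where those calls originate (the profile filename here is illustrative):

```python
import pstats

stats = pstats.Stats("repro.prof")
stats.sort_stats("cumulative")
# Which functions call normpath, and how many times each?
stats.print_callers("normpath")
# And where the relpath calls in dvc.utils come from:
stats.print_callers("relpath")
```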
@charlesbaynham Sorry for the delay. It doesn't really disable it, it just doesn't try to be paranoid about syncing to disk. We'll need to take a closer look. We don't have the capacity to do it right now, but we'll try to get to it in the near future.
No need for any apology, the work you guys do is amazing.
Maybe a compromise could be to only flush the lock if dvc determines that actual work needs to be done to reproduce a stage. In that scenario it'd be a small fraction of the time taken.
> Maybe a compromise could be to only flush the lock if dvc determines that actual work needs to be done to reproduce a stage. In that scenario it'd be a small fraction of the time taken.
The thing is that it modifies `rwlock` _before_ checking that something has changed, so that no other stage is in the middle of modifying something that we will be checking. :slightly_frowning_face: It might be as simple as my patch that removes the paranoid fsync-ing, but we need to double-check that. If not, I'm sure we'll find another way to optimize this.
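To spell out what "paranoid fsync-ing" means here — the difference is roughly whether each lock update forces the data to disk before continuing, as in the `durable` branch of this illustrative sketch (not DVC's actual code):

```python
import os

def write_lockfile(path, data, durable=True):
    # Write the serialized lock state; with durable=True, force it to
    # disk with fsync so a crash can't leave behind data the OS never
    # flushed. That durability guarantee is what costs so much when
    # repeated for every one of ~200 stages.
    with open(path, "w") as f:
        f.write(data)
        if durable:
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # force the OS to write to disk
```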
You and @pmrowla have also noticed something that we know is problematic: `relpath` calls. The solution there is to stop trying to `abspath` every path we use and just use whatever paths we get. That would also fix the issues with our abspaths getting longer than the PATH_MAX limit. dvc doesn't really chdir at any point in its operation, so that should be very safe and efficient. The reason we didn't do that from day 0 is that we weren't really sure which direction dvc would go, so we just wanted to be safe with deterministic abspaths, but we can surely make the switch now.
@efiop, I think we can get rid of the `fsync()`, as `rwlock` is a temporary file and it's totally fine to prefer performance over durability here, even if it occasionally gets corrupted. Also, at any one time there can only be a single instance of `dvc` writing to `rwlock`.
In my synthetic benchmark, I see a 12x improvement (146s vs. 12s) just by removing `fsync()`.
Regarding the `relpath` calls, they're mostly coming from `Stage::__eq__` or `Stage::__hash__`, used by `networkx`.
We can assume `Stage` to be immutable and cache the `relpath` and `path_in_repo` calls (used by `__eq__` and `__hash__`) to further reduce the time to 7.5s (a ~20x improvement). Again, take this with a grain of salt, as it is a very silly/synthetic benchmark.
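A minimal sketch of that caching idea, assuming stages are immutable (names simplified and hypothetical; DVC's real `Stage` class is more involved):

```python
import os
from functools import cached_property  # Python 3.8+

class Stage:
    def __init__(self, root_dir, path):
        self.root_dir = root_dir
        self.path = path

    @cached_property
    def relpath(self):
        # Computed once per instance; safe only if the path never changes.
        return os.path.relpath(self.path, self.root_dir)

    def __eq__(self, other):
        # Simplified: real equality would also account for the repo.
        return isinstance(other, Stage) and self.relpath == other.relpath

    def __hash__(self):
        # networkx hashes nodes constantly while building the DAG, so
        # avoiding a fresh relpath() call here matters at ~200 stages.
        return hash(self.relpath)
```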
Also, it's interesting to see that the `fsync` calls on Windows are twice as fast as on Linux.
@skshetry Sounds good. Let's get rid of it for now.
BTW, we already discussed this `fsync()` here :)
You guys are great, thank you!
@charlesbaynham, can you give `dvc==1.0.0b5` a try? Would love to see the difference. :slightly_smiling_face:
Btw, for the record: we'll add a benchmark soon for this (or something related): https://github.com/iterative/dvc-bench/issues/73
Sorry for the absolute age it has taken me to respond here. You might be interested to see the performance on the same repository using a new version of the code: v1.6.0:
So that's 13s instead of the initial 65s, a 5x improvement. No obvious bottlenecks remaining either: