Windows 7 / 10, dvc 0.93.0+f5d89f (after merging #3645)
Running cProfile on `dvc repro -P` in a repository with around 200 steps, all of which are up to date. On my system this takes 65s, of which 42s is spent in `_edit_rwlock`.
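(For reference, a profile like this can be collected along these lines — the exact invocation isn't shown in the report, so treat the entry-point import as an assumption about dvc 0.x:)

```python
# Hedged sketch: profile `dvc repro -P` with cProfile and dump the
# hot spots. Assumes dvc 0.x exposes its CLI entry point as
# dvc.main.main, which it did around this version.
import cProfile
import pstats

from dvc.main import main

cProfile.runctx("main(['repro', '-P'])", globals(), locals(), "repro.prof")

stats = pstats.Stats("repro.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 by cumulative time
```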
I wonder, is it necessary to lock each individual file here? Couldn't the whole repository be locked instead?
Maybe a broader question: what are the respective purposes of `.dvc/tmp/rwlock`, `.dvc/lock` and `.dvc/updater.lock`?
> Running cProfile on `dvc repro -P` in a repository with around 200 steps, all of which are up to date. On my system this takes 65s, of which 42s is spent in `_edit_rwlock`.
Thanks for the investigation! That is indeed pretty strange, we need to take a closer look. Could you show your `dvc version` output, btw?
> I wonder, is it necessary to lock each individual file here? Couldn't the whole repository be locked instead?

Actually, that's what we used to do, but `rwlock` was introduced to support granular locking, so that you could run dvc in parallel.
> Maybe a broader question: what are the respective purposes of `.dvc/tmp/rwlock`, `.dvc/lock` and `.dvc/updater.lock`?
Useful link: https://dvc.org/doc/user-guide/dvc-files-and-directories :slightly_smiling_face: `rwlock` is for granular locking of specific deps and outs, `lock` is for locking the whole repo (when you need to push/pull/etc. or need to create/modify/delete a dvc-file), and `updater.lock` is for our updater worker, which checks for updates once a week.
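To make the granularity concrete, here is a heavily simplified sketch of the idea behind `rwlock` — a small JSON registry of which paths are being read or written by running dvc processes. Names and schema are illustrative, not DVC's actual implementation:

```python
import json
import os

RWLOCK_PATH = os.path.join(".dvc", "tmp", "rwlock")  # per-repo lock registry

def acquire_write(path, pid, cmd):
    """Record that `pid` intends to write `path`, failing if another
    process is already reading or writing it. Simplified sketch: the
    read-modify-write of the registry itself is not guarded here."""
    lock = {"read": {}, "write": {}}
    if os.path.exists(RWLOCK_PATH):
        with open(RWLOCK_PATH) as f:
            lock = json.load(f)
    if path in lock["write"] or path in lock["read"]:
        raise RuntimeError(f"'{path}' is in use by another dvc process")
    lock["write"][path] = {"pid": pid, "cmd": cmd}
    with open(RWLOCK_PATH, "w") as f:
        json.dump(lock, f)
```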
@charlesbaynham Btw, have you tried running that on Linux? Wonder if this is Windows-specific or if we have troubles everywhere.
Thanks for the rundown @efiop, that's helpful. The version that ran the above was `0.93.0+f5d89f`, which is v0.93.0 plus the changes in PR #3645. That's necessary because otherwise the effect is swamped by all the time spent reproducing duplicate steps.
I'll give it a try on Linux: annoyingly, the repository I tested on has some case-sensitivity problems which I've glossed over so far on Windows but which prevent it from working on Linux. I should fix that anyway.
The cProfile results for my repo on Windows:
repro_nowork.zip
Here are the cProfile results for `dvc repro -P` in the same repo, run on Linux and Windows. It looks like the same fraction of time is spent writing to `rwlock` in both (just via `posix.fsync` instead of `nt.fsync`).
Here's a screenshot of the profile for Windows:
Zooming in, you can see that the actual calls to `Stage.changed()` don't take much time at all, though that's where I'd have expected most of the load to be.
@charlesbaynham Sorry for the delay. Admittedly I didn't have much time to look deeply into it, but could you give my patch a try, please?

```
pip uninstall -y dvc
# make sure there is no old dvc version installed
pip install git+https://github.com/efiop/dvc@3653
```
Yes, that makes a lot of difference. Here are two profiles in the same repo on Windows 7, one at `v0.93.0+f5d89f` and the other at your patch:
Your patch cuts the time spent in `_reproduce_stages` from 62s to 23s.
But doesn't that just effectively disable the lock?
I also notice that we're making 1.4 million calls to `normpath`, which seems like a lot for a repo with ~200 files in it!
Took a quick look at the profiler results to see if the excessive `normpath` calls were pathinfo-related, and they are not, but we do make 220k calls to `relpath` in `dvc.utils` (with string paths).
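For anyone reproducing this, the standard `pstats` API can show where those calls originate (the profile filename here is illustrative):

```python
import pstats

stats = pstats.Stats("repro.prof")
stats.sort_stats("cumulative")
# Which functions call normpath, and how many times each?
stats.print_callers("normpath")
# And where the relpath calls in dvc.utils come from:
stats.print_callers("relpath")
```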
@charlesbaynham Sorry for the delay. It doesn't really disable it, it just doesn't try to be paranoid about syncing to disk. We'll need to take a closer look. We don't have the capacity to do it right now, but we'll try to get to it in the near future.
No need for any apology, the work you guys do is amazing.
Maybe a compromise could be to only flush the lock if dvc determines that actual work needs to be done to reproduce a stage. In that scenario it'd be a small fraction of the time taken.
> Maybe a compromise could be to only flush the lock if dvc determines that actual work needs to be done to reproduce a stage. In that scenario it'd be a small fraction of the time taken.
The thing is that it modifies `rwlock` _before_ checking that something has changed, so that no other stage is in the middle of modifying something that we will be checking. :slightly_frowning_face: It might be as simple as my patch that removes the paranoid fsync-ing, but we need to double-check that. If not, I'm sure we'll find another way to optimize this.
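To spell out what "paranoid fsync-ing" means here — the difference is roughly whether each lock update forces the data to disk before continuing, as in the `durable` branch of this illustrative sketch (not DVC's actual code):

```python
import os

def write_lockfile(path, data, durable=True):
    # Write the serialized lock state; with durable=True, force it to
    # disk with fsync so a crash can't leave behind data the OS never
    # flushed. That durability guarantee is what costs so much when
    # repeated for every one of ~200 stages.
    with open(path, "w") as f:
        f.write(data)
        if durable:
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # force the OS to write to disk
```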
You and @pmrowla have also noticed something that we know is problematic: `relpath` calls. The solution there is to stop trying to `abspath` every path we use and just use whatever paths we get. That would also fix the issues with our abspaths getting longer than the PATH_MAX limit. dvc doesn't really chdir at any point in its operation, so that should be very safe and efficient. The reason we didn't do that from day 0 is that we weren't really sure which direction dvc would go, so we just wanted to be safe with deterministic abspaths, but we can surely make the switch now.
@efiop, I think we can get rid of the `fsync()`, as `rwlock` is a temporary file and it's totally fine to prefer performance over durability here, even if it occasionally gets corrupted. Also, at any one time there can only be a single instance of `dvc` writing to `rwlock`.
In my synthetic benchmark, I see a 12x improvement (146s vs. 12s) just by removing `fsync()`.
Regarding the `relpath` calls, they're mostly coming from `Stage::__eq__` or `Stage::__hash__`, used by `networkx`.
We can assume `Stage` to be immutable and cache the `relpath` and `path_in_repo` calls (used by `__eq__` and `__hash__`) to further reduce the time to 7.5s (a ~20x improvement). Again, take this with a grain of salt, as it is a very silly/synthetic benchmark.
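A minimal sketch of that caching idea, assuming stages are immutable (names simplified and hypothetical; DVC's real `Stage` class is more involved):

```python
import os
from functools import cached_property  # Python 3.8+

class Stage:
    def __init__(self, root_dir, path):
        self.root_dir = root_dir
        self.path = path

    @cached_property
    def relpath(self):
        # Computed once per instance; safe only if the path never changes.
        return os.path.relpath(self.path, self.root_dir)

    def __eq__(self, other):
        # Simplified: real equality would also account for the repo.
        return isinstance(other, Stage) and self.relpath == other.relpath

    def __hash__(self):
        # networkx hashes nodes constantly while building the DAG, so
        # avoiding a fresh relpath() call here matters at ~200 stages.
        return hash(self.relpath)
```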
Also, it's interesting to see that the `fsync` calls on Windows are twice as fast as on Linux.
@skshetry Sounds good. Let's get rid of it for now.
BTW, we already discussed this `fsync()` here :)
You guys are great, thank you!
@charlesbaynham, can you give `dvc==1.0.0b5` a try? Would love to see the difference. :slightly_smiling_face:
Btw, for the record: we'll add a benchmark soon for this (or something related): https://github.com/iterative/dvc-bench/issues/73
Sorry for the absolute age it has taken me to respond here. You might be interested to see the performance on the same repository using a new version of the code: v1.6.0:
So that's 13s instead of the initial 65s, a 5x improvement. No obvious bottlenecks remaining either: