| Type | Version/Name |
| --- | --- |
| Distribution Name | CentOS |
| Distribution Version | 7.8 |
| Linux Kernel | 3.10.0-1127.18.2.el7 |
| Architecture | x86_64 |
| ZFS Version | 0.8.4-1 |
| SPL Version | 0.8.4-1 |
With prefetch enabled (the default) we see abysmal read performance on concurrent I/O; much of the hit seems to be locking:
# PREFETCH OFF
my_hostname:0# cat /sys/module/zfs/parameters/zfs_prefetch_disable
1
my_hostname:0# perf stat -r 5 bash -c "rg --files /my/repo/path | wc -l"
104142
104142
104142
104142
104142
Performance counter stats for 'bash -c rg --files /my/repo/path | wc -l' (5 runs):
2,339.59 msec task-clock # 4.357 CPUs utilized ( +- 2.83% )
15,833 context-switches # 0.007 M/sec ( +- 1.95% )
30 cpu-migrations # 0.013 K/sec ( +- 8.23% )
15,520 page-faults # 0.007 M/sec ( +- 1.06% )
6,772,466,857 cycles # 2.895 GHz ( +- 3.37% )
6,230,759,287 instructions # 0.92 insn per cycle ( +- 4.10% )
1,244,566,351 branches # 531.959 M/sec ( +- 4.75% )
19,297,028 branch-misses # 1.55% of all branches ( +- 0.62% )
0.5370 +- 0.0127 seconds time elapsed ( +- 2.37% )
my_hostname:0# perf stat -r 5 bash -c "rg --files /my/repo/path | wc -l"
104142
104142
104142
104142
104142
Performance counter stats for 'bash -c rg --files /my/repo/path | wc -l' (5 runs):
2,478.84 msec task-clock # 4.023 CPUs utilized ( +- 2.69% )
15,572 context-switches # 0.006 M/sec ( +- 2.45% )
23 cpu-migrations # 0.009 K/sec ( +- 6.94% )
15,664 page-faults # 0.006 M/sec ( +- 0.49% )
6,911,258,667 cycles # 2.788 GHz ( +- 1.15% )
6,556,781,111 instructions # 0.95 insn per cycle ( +- 6.12% )
1,321,973,475 branches # 533.304 M/sec ( +- 6.98% )
19,817,544 branch-misses # 1.50% of all branches ( +- 1.01% )
0.6162 +- 0.0252 seconds time elapsed ( +- 4.09% )
# PREFETCH ON
my_hostname:0# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
my_hostname:0# perf stat -r 5 bash -c "rg --files /my/repo/path | wc -l"
104142
104142
104142
104142
104142
Performance counter stats for 'bash -c rg --files /my/repo/path | wc -l' (5 runs):
3,013.93 msec task-clock # 1.806 CPUs utilized ( +- 1.32% )
50,054 context-switches # 0.017 M/sec ( +- 0.49% )
27 cpu-migrations # 0.009 K/sec ( +- 17.09% )
13,157 page-faults # 0.004 M/sec ( +- 0.39% )
7,425,112,564 cycles # 2.464 GHz ( +- 1.50% )
6,537,269,286 instructions # 0.88 insn per cycle ( +- 3.56% )
1,309,810,878 branches # 434.586 M/sec ( +- 4.11% )
25,634,267 branch-misses # 1.96% of all branches ( +- 0.90% )
1.6691 +- 0.0392 seconds time elapsed ( +- 2.35% )
Hopefully this is enough proof. I've run this test a hundred times, with cleared caches and hot caches; it doesn't matter. I've inspected it with perf, and the time is spent almost entirely in locking (futexes, mutex_lock). I used rg here because it is a simple multi-threaded program with open source code. I ran the prefetch-off example first to show that even with the benefit of warm caches, prefetch-on performance is degraded roughly 3x.
Which raises the question: why isn't prefetch off by default? I understand there is a read performance benefit for large sequential reads with prefetch on, but I think the cons outweigh the pros as things currently stand.
Ultimately, I'd like to see the locking improved so that we don't end up effectively serializing multi-threaded I/O.
This is pretty funny. I fixed this in FreeBSD ~12 years ago, though the change was eventually reverted by an update. The old OpenZFS folks didn't appear to understand the locking profiles. At least with that incarnation of prefetch it was pretty easy to convert the locks to trylocks.
I don't think trylock is a safe answer, since it may miss some read accounting, making sequential reads look random.
zfetch's own locking is per-file. I am very curious how exactly rg is reading files (in what block sizes and with how many threads) to cause contention there. Or it could be somewhere else that zfetch calls, but in that case we'd need a more detailed CPU profile. I personally have never seen lock contention in zfetch, barely even memory accesses.
It might be useful to see how rg does its parallelism: https://github.com/BurntSushi/ripgrep/blob/master/crates/core/main.rs#L224
@amotin points out: In ZoL zf->zf_lock is a mutex, while in FreeBSD zf->zf_rwlock is RW, taken for read in most cases.
And in Illumos it is also RW; only ZoL seems to have missed it.
I've just noticed that in ZoL zf->zf_lock is a mutex now, while in Illumos and old FreeBSD zf->zf_rwlock is RW, taken for read in most cases. It was changed by @ahrens a year ago. I don't know what is so wrong with rwlocks on Linux, but ask him. On FreeBSD, ZFS mutexes are implemented with sxlocks, so reader/writer semantics are available for free there.
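For readers following along, the difference under discussion amounts to roughly the following. This is only a sketch using the Solaris-style lock wrappers ZFS is built on; the struct and field names follow the comments above, not the exact source of any particular release.

```c
#include <sys/zfs_context.h>	/* kmutex_t, krwlock_t, mutex_*(), rw_*() */

/* Illumos / old FreeBSD style: per-file zfetch state behind a RW lock. */
typedef struct zfetch_rw {
	krwlock_t	zf_rwlock;	/* illustrative name, per the thread */
	/* ... list of prefetch streams ... */
} zfetch_rw_t;

static void
zfetch_lookup_rw(zfetch_rw_t *zf)
{
	/*
	 * Readers do not exclude each other, so concurrent reads of the
	 * same file can scan the stream list in parallel.
	 */
	rw_enter(&zf->zf_rwlock, RW_READER);
	/* ... look for a stream matching this access ... */
	rw_exit(&zf->zf_rwlock);
}

/* ZoL after PR #9062: the same state behind a plain mutex. */
typedef struct zfetch_mtx {
	kmutex_t	zf_lock;	/* illustrative name, per the thread */
} zfetch_mtx_t;

static void
zfetch_lookup_mtx(zfetch_mtx_t *zf)
{
	/* Every caller serializes here, readers included. */
	mutex_enter(&zf->zf_lock);
	/* ... same stream scan, one thread at a time ... */
	mutex_exit(&zf->zf_lock);
}
```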
There's some explanation of the change to a mutex in the commit message (for PR https://github.com/openzfs/zfs/pull/9062):
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex.
I don't recall the implementation details of Linux rwlocks that explain why this is the case, but it's what I measured. If there's new data available, we can make a change based on that.
@ahrens It seems, based on the PR, that the benefits of moving to mutexes weren't as significant as originally thought, and I've not seen any concrete examples demonstrating that mutexes over RW locks improved performance. Are the details of the referenced microbenchmarks available anywhere?
I don't mean to sound adversarial here, but it seems like the burden of proof is being placed on me, and I believe I've demonstrated the issue, so I am wondering what type of new data you would be looking for. I am not seeing any original data that would serve as a point of comparison for new data; the RW lock performance issues seem a bit nebulous to me based on the PR.
My understanding is that mutexes are more restrictive and simpler, so perhaps that simplicity buys some performance, but the ability to allow multiple readers is the very heart of the issue here, and is explicitly what RW locks buy us over mutexes. Based on that understanding, it doesn't make sense to me to trade one for the other in order to gain performance, because the performance cost for multi-threaded reads appears to overshadow any potential gains.
I don't mean to sound adversarial here, but it seems like the burden of proof is being placed on me, and I believe I've demonstrated the issue, so I am wondering what type of new data you would be looking for.
Don't worry too much about it. The way I'm reading it, @ahrens wasn't referring to your issue specifically but mostly to the general design choices behind moving to mutexes, as discussed by @amotin and @mattmacy.
@gleventhal You've definitely demonstrated the performance overhead of the zfetch code in this workload. It's a legitimate issue.
I'm just saying that if we want to change the implementation back to rwlocks, someone would need to measure the performance to show that it's an improvement over the current implementation, and then open a PR for that change. My educated guess is that simply changing the mutex to a rwlock will not substantially improve performance. We would need to make a more significant change to the zfetch code to address the issue.
Apologies, I misinterpreted what I'd read a bit. That said, I am happy to do some testing if someone wants to supply me with some preliminary changes to test.
What would help the most is a CPU flame graph. I am still not sure how zfetch could affect rg performance, unless it is written in some odd way that issues small parallel requests to the same file from different threads. If there are indeed such requests, but they are not sequential (i.e. from different zfetch streams), then the change in locking may indeed matter. Otherwise, it may just be an unrelated, unexpected observation.
@amotin One theory I had is that, with prefetch enabled, getdents calls are being serialized on a mutex. Is there any zfetch code that would do that, or perhaps some side effect of enabling zfetch?
I suspect the reader threads are waiting on directory traversal, not necessarily contending for the same file. This is just a theory, though I did observe threads in nanosleep while a single thread was doing readdir, so I wondered if the problem was actually that recursing from the root directory was serialized on a mutex.
I timed the longest futex in a getdents stack (from an rg run) with and without prefetch enabled, and found that when enabled it was ~890ms vs ~300ms with it off.
I'll get you some graphs.
I've got the flamegraphs, and while I am fairly sure that there is no proprietary data in them, I need to get company sign-off before I can send them. At first glance, it seems that perhaps zap_lockdir -> dmu_buf_hold might be involved? With prefetch it represents 10.12% of the time versus 5.45% without prefetch.
The flamegraphs should be attached, one run with prefetch, one without (labeled in file name)
I'm sorry, but only after looking at the profiles did I notice that you are running the 0.8.4 version of ZFS. So obviously Matt's locking change is not the cause of your problems, but it may actually be a solution, since looking at the profile, Linux indeed does some complicated things there. In your profile I see most of the time is spent around directory traversal and file open, not the actual file reads. Somehow I don't even see actual I/O in the prefetch case, while I do see some in the non-prefetch case; I guess it all comes from caches. In such a scenario the prefetcher by design cannot give you any benefit; it can only waste time on its bookkeeping, still trying to detect sequential accesses in case something turns out not to be in cache. So if running rg on precached data is your primary workload, then disabling prefetch may be right for you. Otherwise it would be interesting to see what the numbers would be on OpenZFS 2.0.
@amotin Ah, you're right, I guess I am on RW locks after all. Still, as you can see in the perf examples in my original post, when prefetch is enabled we see less than half the CPU utilization (1.8 vs ~4 CPUs utilized) and roughly three times the context switches. This leads me to believe it's an off-CPU issue, such as locking, and not prefetch overhead/bookkeeping, which would be an on-CPU activity.
Do you agree? I've tried profiling this after dropping caches completely (echo 3 > /proc/sys/vm/drop_caches) and see the same issue, so I don't think I/O vs. caching factors in much here. I'll try profiling with OpenZFS 2.0, but until then I am going to continue to troubleshoot this (and leave prefetch disabled in production).
I took a look at the flame graphs. They indicate a locking issue that is consuming CPU, and may also have off-cpu issues. The relevant part of the workload is when the getdents64() syscall is accessing the first block of the directory ZAP object, and it holds the dnode. We are calling the zfetch code on the dnode object (which holds the dnodes). I'm not familiar with the rg test program, but it would seem to be readdir-heavy. Can you describe the workload, including the directory structure that it's accessing? (e.g. how many files per directory)
I suspect that attempting to do predictive prefetch on the dnode object is not useful. We already get 32 dnodes with each read, plus when iterating over a directory we prefetch the dnodes of the directory entries. So it might make sense for dnode_hold_impl() to pass DB_RF_NOPREFETCH to dbuf_read().
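To make that concrete, the suggestion is roughly the one-line change sketched below. The existing flag shown at the call site (DB_RF_CANFAIL) is an assumption on my part; the real call lives in dnode_hold_impl() in dnode.c and may carry other flags.

```c
/*
 * Before (approximate): reading the dbuf that backs a block of dnodes goes
 * through dbuf_read() with default flags, so dmu_zfetch() runs on the dnode
 * object for what is effectively metadata traversal.
 */
err = dbuf_read(db, NULL, DB_RF_CANFAIL);

/*
 * After: skip speculative prefetch on the dnode object.  Dnodes already
 * arrive 32 to a block, and directory iteration prefetches the dnodes of
 * the entries it returns, so zfetch bookkeeping here is pure overhead.
 */
err = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_NOPREFETCH);
```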
Also, upgrading to master / 2.0rc may help a bit, e.g. by using a mutex instead of a rwlock, and also #8946 may help.
Possibly newly related: https://github.com/openzfs/zfs/pull/10900
It is not so much related, other than also being in the speculative prefetcher.
Is there any tool I can use to show contention for the dnodes, which I believe @ahrens was indicating may be the underlying issue? Perhaps something that gives a mapping of PID to held dnode (and preferably a mapping from that dnode to the underlying file/directory)?
I can see that rg is explicitly yielding the CPU a lot (sched_yield(2)) when prefetch is on, so I suspect that's why the CPU seconds are lower with prefetch (and the run takes longer). It seems to be using the Rust crossbeam-channel concurrency library, which performs a try_recv and then often yields, presumably because a held dnode makes the directory access unavailable and the consumer thread then sleeps.
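For illustration, the consumer pattern described above looks roughly like the C analogue below (rg itself is Rust and uses crossbeam-channel; the queue functions here are hypothetical stand-ins for try_recv(), not a real API):

```c
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the channel API; not a real library. */
typedef struct work work_t;
bool queue_try_pop(work_t **out);	/* non-blocking; false if nothing is ready */
bool queue_closed(void);
void process(work_t *w);		/* readdir/stat-heavy traversal work */

static void *
consumer(void *arg)
{
	(void)arg;
	work_t *w;

	while (!queue_closed()) {
		if (queue_try_pop(&w)) {
			/* May block inside the kernel on a contended lock. */
			process(w);
		} else {
			/*
			 * Nothing to do yet: give up the CPU.  This is why the
			 * prefetch-on runs show fewer CPU seconds but more
			 * wall time and context switches.
			 */
			sched_yield();
		}
	}
	return (NULL);
}
```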
@ahrens The directory structure is a monorepo. There are ~1,000 directories at the root level, with an average of ~6.7 directories inside each of those top-level dirs. Ultimately, it's around 37,280 directories and 185,000 files.
The workload here is a multithreaded program starting at the root of the repo, and descending into the repo, reading any hg/git ignore files, and then reporting on any files that aren't configured to be ignored by version control. There are many readdir_r calls, you are correct.
To answer my own question about the tooling, it seems like dbufstat comes the closest to doing what I wanted so far.
With the prefetch on, I see tons of this stack appearing:
rwsem_down_read_failed <- call_rwsem_down_read_failed <- down_read <- dmu_zfetch <- dbuf_read
It seems like dbuf_read is bumping up against the RW semaphore, and the rg thread yields/sleeps when it cannot do work.
With prefetch disabled I see none of those stacks.
I also see about twice the DMU_OT_DIRECTORY_CONTENTS entries in dbufstat when prefetch is enabled, but I haven't fully deciphered the significance of that (or if there is any) yet.
The prefetch path is lock-heavy. My past, never-finished attempt at improving it showed a high spread of latency on metadata-heavy workloads (see https://github.com/openzfs/zfs/pull/6568#issuecomment-325763049).
IIRC, just bailing out on repeated <dnode, blkid, nblks> calls, or on a sign of locking trouble (!mutex_tryenter()), already showed a benefit for this scenario.
I suspect that attempting to do predictive prefetch on the dnode object is not useful. We already get 32 dnodes with each read, plus when iterating over a directory we prefetch the dnodes of the directory entries. So it might make sense for dnode_hold_impl() to pass DB_RF_NOPREFETCH to dbuf_read().
You could make this patch and see if it improves your workload. I might have time for that tomorrow.
I've done a quickie patch for this @ https://github.com/adamdmoss/zfs/tree/dnode-no-prefetch (https://github.com/adamdmoss/zfs/commit/0715e3bfa0110ae63004d85664de1b269f892773)
I added DB_RF_NOPREFETCH to dnode_next_offset_level() too because that also seemed not-useful, though I could always be wrong. :grimacing:
@adamdmoss that change looks good to me.
I'm testing this patch and will report back, thanks
I've applied the patch and unfortunately, I am not seeing a notable improvement from what I can tell. Can someone please explain precisely what the patch is doing, so I can use that to better understand the problem? @adamdmoss
@gleventhal can you post new flame graphs with the patch applied?
@ahrens I've included 2 graphs, one with prefetch enabled, one without, both with the patch, both with clear caches (echo 3 > drop_caches)
The new flamegraph shows that dnode_hold_impl -> dbuf_read is still calling dmu_zfetch(), which shouldn't be possible with the patch applied. Can you double-check that you're running with the patch?
Cripes, I patched the wrong thing. I should have patched the kmod-zfs package, not the zfs one, or maybe both? Redeploying and retesting with the change in the kmod, sorry!
It works! The test now performs identically with prefetch on or off! Thanks, and sorry about my confusion on the last test.
myhostname:130# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
myhostname:0# perf stat -r 5 bash -c "rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l"
187191
187191
187191
187191
187191
Performance counter stats for 'bash -c rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l' (5 runs):
4,005.63 msec task-clock # 5.802 CPUs utilized ( +- 5.47% )
26,927 context-switches # 0.007 M/sec ( +- 1.70% )
31 cpu-migrations # 0.008 K/sec ( +- 4.56% )
22,491 page-faults # 0.006 M/sec ( +- 0.66% )
13,916,106,098 cycles # 3.474 GHz ( +- 6.41% )
10,066,874,247 instructions # 0.72 insn per cycle ( +- 3.77% )
1,941,982,736 branches # 484.814 M/sec ( +- 4.11% )
37,167,069 branch-misses # 1.91% of all branches ( +- 2.48% )
0.6904 +- 0.0169 seconds time elapsed ( +- 2.45% )
myhostname:0# perf stat -r 5 bash -c "rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l"
187191
187191
187191
187191
187191
Performance counter stats for 'bash -c rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l' (5 runs):
3,737.21 msec task-clock # 5.693 CPUs utilized ( +- 1.11% )
28,153 context-switches # 0.008 M/sec ( +- 3.42% )
29 cpu-migrations # 0.008 K/sec ( +- 10.96% )
22,566 page-faults # 0.006 M/sec ( +- 0.57% )
13,023,968,038 cycles # 3.485 GHz ( +- 0.82% )
9,613,969,469 instructions # 0.74 insn per cycle ( +- 1.95% )
1,850,275,334 branches # 495.096 M/sec ( +- 2.39% )
36,157,031 branch-misses # 1.95% of all branches ( +- 0.25% )
0.65651 +- 0.00422 seconds time elapsed ( +- 0.64% )
myhostname:0# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
myhostname:0# echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
myhostname:0# perf stat -r 5 bash -c "rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l"
187191
187191
187191
187191
187191
Performance counter stats for 'bash -c rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l' (5 runs):
3,761.64 msec task-clock # 5.650 CPUs utilized ( +- 1.51% )
29,151 context-switches # 0.008 M/sec ( +- 1.71% )
27 cpu-migrations # 0.007 K/sec ( +- 7.39% )
22,266 page-faults # 0.006 M/sec ( +- 0.92% )
13,092,983,882 cycles # 3.481 GHz ( +- 1.32% )
9,810,140,476 instructions # 0.75 insn per cycle ( +- 3.89% )
1,895,407,612 branches # 503.878 M/sec ( +- 4.66% )
36,177,389 branch-misses # 1.91% of all branches ( +- 0.33% )
0.66578 +- 0.00766 seconds time elapsed ( +- 1.15% )
Nice work, @adamdmoss! Thanks, @ahrens and everyone else!
In what upcoming release can we expect this patch to be included?
@adamdmoss it'd be great if you could open a PR with the change so we can get it reviewed.
@adamdmoss it'd be great if you could open a PR with the change so we can get it reviewed.
Will do - may take a couple of days due to the day job.