| Type | Version/Name |
| --- | --- |
| Distribution Name | CentOS |
| Distribution Version | 7.8 |
| Linux Kernel | 3.10.0-1127.18.2.el7 |
| Architecture | x86_64 |
| ZFS Version | 0.8.4-1 |
| SPL Version | 0.8.4-1 |
With prefetch enabled (the default) we see abysmal read performance on concurrent I/O; much of the hit seems to be locking:
# PREFETCH OFF
my_hostname:0# cat /sys/module/zfs/parameters/zfs_prefetch_disable
1
my_hostname:0# perf stat -r 5 bash -c "rg --files /my/repo/path | wc -l"
104142
104142
104142
104142
104142
Performance counter stats for 'bash -c rg --files /my/repo/path | wc -l' (5 runs):
2,339.59 msec task-clock # 4.357 CPUs utilized ( +- 2.83% )
15,833 context-switches # 0.007 M/sec ( +- 1.95% )
30 cpu-migrations # 0.013 K/sec ( +- 8.23% )
15,520 page-faults # 0.007 M/sec ( +- 1.06% )
6,772,466,857 cycles # 2.895 GHz ( +- 3.37% )
6,230,759,287 instructions # 0.92 insn per cycle ( +- 4.10% )
1,244,566,351 branches # 531.959 M/sec ( +- 4.75% )
19,297,028 branch-misses # 1.55% of all branches ( +- 0.62% )
0.5370 +- 0.0127 seconds time elapsed ( +- 2.37% )
my_hostname:0# perf stat -r 5 bash -c "rg --files /my/repo/path | wc -l"
104142
104142
104142
104142
104142
Performance counter stats for 'bash -c rg --files /my/repo/path | wc -l' (5 runs):
2,478.84 msec task-clock # 4.023 CPUs utilized ( +- 2.69% )
15,572 context-switches # 0.006 M/sec ( +- 2.45% )
23 cpu-migrations # 0.009 K/sec ( +- 6.94% )
15,664 page-faults # 0.006 M/sec ( +- 0.49% )
6,911,258,667 cycles # 2.788 GHz ( +- 1.15% )
6,556,781,111 instructions # 0.95 insn per cycle ( +- 6.12% )
1,321,973,475 branches # 533.304 M/sec ( +- 6.98% )
19,817,544 branch-misses # 1.50% of all branches ( +- 1.01% )
0.6162 +- 0.0252 seconds time elapsed ( +- 4.09% )
# PREFETCH ON
my_hostname:0# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
my_hostname:0# perf stat -r 5 bash -c "rg --files /my/repo/path | wc -l"
104142
104142
104142
104142
104142
Performance counter stats for 'bash -c rg --files /my/repo/path | wc -l' (5 runs):
3,013.93 msec task-clock # 1.806 CPUs utilized ( +- 1.32% )
50,054 context-switches # 0.017 M/sec ( +- 0.49% )
27 cpu-migrations # 0.009 K/sec ( +- 17.09% )
13,157 page-faults # 0.004 M/sec ( +- 0.39% )
7,425,112,564 cycles # 2.464 GHz ( +- 1.50% )
6,537,269,286 instructions # 0.88 insn per cycle ( +- 3.56% )
1,309,810,878 branches # 434.586 M/sec ( +- 4.11% )
25,634,267 branch-misses # 1.96% of all branches ( +- 0.90% )
1.6691 +- 0.0392 seconds time elapsed ( +- 2.35% )
Hopefully this is enough proof. I've run this test a hundred times, with cleared caches and hot caches; it doesn't matter. I've inspected it with perf, and the time is spent almost entirely in locking (futexes, mutex_lock). I used rg here because it is a simple multi-threaded program with open source code. I ran the prefetch-off example first to show that even with the benefit of warm caches, prefetch-on performance is degraded roughly 3x.
Which raises the question: why isn't prefetch off by default? I understand there is a read performance benefit for large sequential reads with prefetch on, but I think the cons outweigh the pros as things currently stand.
Ultimately, I'd like to see the locking improved so that we don't end up effectively serializing multi-threaded I/O.
This is pretty funny. I fixed this in FreeBSD ~12 years ago, though the change was eventually reverted by an update. The old OpenZFS folks didn't appear to understand the locking profiles. At least with that incarnation of prefetch it was pretty easy to convert the locks to trylocks.
I don't think trylock is a safe answer, since it may miss some read accounting, making sequential reads look random.
zfetch's own locking is per-file. I am very curious how exactly rg is reading files (in what block sizes and with how many threads) to cause contention there. Or it could be somewhere else that zfetch calls, but in that case we'd need a more detailed CPU profile. I personally have never seen lock contention in zfetch, barely even memory accesses.
It might be useful to see how rg does its parallelism: https://github.com/BurntSushi/ripgrep/blob/master/crates/core/main.rs#L224
@amotin points out: In ZoL zf->zf_lock is a mutex, while in FreeBSD zf->zf_rwlock is RW, taken for read in most cases.
And in Illumos it is also RW; only ZoL seems to have missed it.
I've just noticed that in ZoL zf->zf_lock is a mutex now, while in Illumos and old FreeBSD zf->zf_rwlock is RW, taken for read in most cases. It was changed by @ahrens a year ago. I don't know what is so wrong with rwlocks on Linux, but ask him. On FreeBSD, ZFS mutexes are implemented with sxlocks, so reader/writer semantics are available for free there.
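For readers following along, the difference under discussion amounts to roughly the following. This is only a sketch using the Solaris-style lock wrappers ZFS is built on; the struct and field names follow the comments above, not the exact source of any particular release.

```c
#include <sys/zfs_context.h>	/* kmutex_t, krwlock_t, mutex_*(), rw_*() */

/* Illumos / old FreeBSD style: per-file zfetch state behind a RW lock. */
typedef struct zfetch_rw {
	krwlock_t	zf_rwlock;	/* illustrative name, per the thread */
	/* ... list of prefetch streams ... */
} zfetch_rw_t;

static void
zfetch_lookup_rw(zfetch_rw_t *zf)
{
	/*
	 * Readers do not exclude each other, so concurrent reads of the
	 * same file can scan the stream list in parallel.
	 */
	rw_enter(&zf->zf_rwlock, RW_READER);
	/* ... look for a stream matching this access ... */
	rw_exit(&zf->zf_rwlock);
}

/* ZoL after PR #9062: the same state behind a plain mutex. */
typedef struct zfetch_mtx {
	kmutex_t	zf_lock;	/* illustrative name, per the thread */
} zfetch_mtx_t;

static void
zfetch_lookup_mtx(zfetch_mtx_t *zf)
{
	/* Every caller serializes here, readers included. */
	mutex_enter(&zf->zf_lock);
	/* ... same stream scan, one thread at a time ... */
	mutex_exit(&zf->zf_lock);
}
```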
There's some explanation of the change to a mutex in the commit message (for PR https://github.com/openzfs/zfs/pull/9062):
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex.
I don't recall the implementation details of Linux rwlocks that explain why this is the case, but it's what I measured. If there's new data available, we can make a change based on that.
@ahrens It seems, based on the PR, that the benefits of moving to mutexes weren't as significant as originally thought, and I've not seen any concrete examples demonstrating that mutexes over RW locks improved performance. Are the details of the referenced microbenchmarks available anywhere?
I don't mean to sound adversarial here, but it seems like the burden of proof is being placed on me, and I believe I've demonstrated the issue, so I am wondering what type of new data you would be looking for. I am not seeing any original data that would serve as a point of comparison for new data; the RW lock performance issues seem a bit nebulous to me based on the PR.
My understanding is that mutexes are more restrictive and simpler, so perhaps that simplicity buys some performance, but the ability to allow multiple readers is the very heart of the issue here, and is explicitly what RW locks buy us over mutexes. Based on that understanding, it doesn't make sense to me to trade one for the other in order to gain performance, because the performance cost for multi-threaded reads appears to overshadow any potential gains.
I don't mean to sound adversarial here, but it seems like the burden of proof is being placed on me, and I believe I've demonstrated the issue, so I am wondering what type of new data you would be looking for.
Don't worry too much about it. The way I'm reading it, @ahrens wasn't referring to your issue specifically but mostly to the general design choices behind moving to mutexes, as discussed by @amotin and @mattmacy.
@gleventhal You've definitely demonstrated the performance overhead of the zfetch code in this workload. It's a legitimate issue.
I'm just saying that if we want to change the implementation back to rwlocks, someone would need to measure the performance to show that it's an improvement over the current implementation, and then open a PR for that change. My educated guess is that simply changing the mutex to a rwlock will not substantially improve performance. We would need to make a more significant change to the zfetch code to address the issue.
Apologies, I misinterpreted what I'd read a bit. That said, I am happy to do some testing if someone wants to supply me with some preliminary changes to test.
What would help the most is a CPU flame graph. I am still not sure how zfetch could affect rg performance, unless it is written in some odd way that issues small parallel requests to the same file from different threads. If there are indeed such requests, but they are not sequential (i.e. from different zfetch streams), then the change in locking may indeed matter. Otherwise, it may just be an unrelated, unexpected observation.
@amotin One theory I had is that, with prefetch enabled, getdents calls are being serialized on a mutex. Is there any zfetch code that would do that, or perhaps some side effect of enabling zfetch?
I suspect the reader threads are waiting on directory traversal, not necessarily contending for the same file. This is just a theory, though I did observe threads in nanosleep while a single thread was doing readdir, so I wondered if the problem was actually that recursing from the root directory was serialized on a mutex.
I timed the longest futex in a getdents stack (from an rg run) with and without prefetch enabled, and found that when enabled it was ~890ms vs ~300ms with it off.
I'll get you some graphs.
I've got the flamegraphs, and while I am fairly sure that there is no proprietary data in them, I need to get company sign-off before I can send them. At first glance, it seems that perhaps zap_lockdir -> dmu_buf_hold might be involved? With prefetch it represents 10.12% of the time versus 5.45% without prefetch.
The flamegraphs should be attached, one run with prefetch, one without (labeled in file name)
I'm sorry, but only after looking at the profiles did I notice that you are running the 0.8.4 version of ZFS. So obviously Matt's locking change is not the cause of your problems, but it may actually be a solution, since looking at the profile, Linux indeed does some complicated things there. In your profile I see most of the time is spent around directory traversal and file open, not the actual file reads. Somehow I don't even see actual I/O in the prefetch case, while I do see some in the non-prefetch case; I guess it all comes from caches. In such a scenario the prefetcher by design cannot give you any benefit; it can only waste time on its bookkeeping, still trying to detect sequential accesses in case something turns out not to be in cache. So if running rg on precached data is your primary workload, then disabling prefetch may be right for you. Otherwise it would be interesting to see what the numbers would be on OpenZFS 2.0.
@amotin Ah, you're right, I guess I am on RW locks after all. Still, as you can see in the perf examples in my original post, when prefetch is enabled we see less than half the CPU utilization (1.8 vs ~4 CPUs utilized) and roughly three times the context switches. This leads me to believe it's an off-CPU issue, such as locking, and not prefetch overhead/bookkeeping, which would be an on-CPU activity.
Do you agree? I've tried profiling this after dropping caches completely (echo 3 > /proc/sys/vm/drop_caches) and see the same issue, so I don't think I/O vs. caching factors in much here. I'll try profiling with OpenZFS 2.0, but until then I am going to continue to troubleshoot this (and leave prefetch disabled in production).
I took a look at the flame graphs. They indicate a locking issue that is consuming CPU, and may also have off-cpu issues. The relevant part of the workload is when the getdents64() syscall is accessing the first block of the directory ZAP object, and it holds the dnode. We are calling the zfetch code on the dnode object (which holds the dnodes). I'm not familiar with the rg test program, but it would seem to be readdir-heavy. Can you describe the workload, including the directory structure that it's accessing? (e.g. how many files per directory)
I suspect that attempting to do predictive prefetch on the dnode object is not useful. We already get 32 dnodes with each read, plus when iterating over a directory we prefetch the dnodes of the directory entries. So it might make sense for dnode_hold_impl() to pass DB_RF_NOPREFETCH to dbuf_read().
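To make that concrete, the suggestion is roughly the one-line change sketched below. The existing flag shown at the call site (DB_RF_CANFAIL) is an assumption on my part; the real call lives in dnode_hold_impl() in dnode.c and may carry other flags.

```c
/*
 * Before (approximate): reading the dbuf that backs a block of dnodes goes
 * through dbuf_read() with default flags, so dmu_zfetch() runs on the dnode
 * object for what is effectively metadata traversal.
 */
err = dbuf_read(db, NULL, DB_RF_CANFAIL);

/*
 * After: skip speculative prefetch on the dnode object.  Dnodes already
 * arrive 32 to a block, and directory iteration prefetches the dnodes of
 * the entries it returns, so zfetch bookkeeping here is pure overhead.
 */
err = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_NOPREFETCH);
```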
Also, upgrading to master / 2.0rc may help a bit, e.g. by using a mutex instead of a rwlock, and also #8946 may help.
Possibly newly related: https://github.com/openzfs/zfs/pull/10900
It is not so much related, other than also being in the speculative prefetcher.
Is there any tool I can use to show contention for the dnodes, which I believe @ahrens was indicating may be the underlying issue? Perhaps something that gives a mapping of PID to held dnode (and preferably a mapping from that dnode to the underlying file/directory)?
I can see that rg is explicitly yielding the CPU a lot (sched_yield(2)) when prefetch is on, so I suspect that's why the CPU seconds are lower with prefetch (and the run takes longer). It seems to be using the Rust crossbeam-channel concurrency library, which performs a try_recv and then often yields, presumably because a held dnode makes the directory access unavailable and the consumer thread then sleeps.
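For illustration, the consumer pattern described above looks roughly like the C analogue below (rg itself is Rust and uses crossbeam-channel; the queue functions here are hypothetical stand-ins for try_recv(), not a real API):

```c
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the channel API; not a real library. */
typedef struct work work_t;
bool queue_try_pop(work_t **out);	/* non-blocking; false if nothing is ready */
bool queue_closed(void);
void process(work_t *w);		/* readdir/stat-heavy traversal work */

static void *
consumer(void *arg)
{
	(void)arg;
	work_t *w;

	while (!queue_closed()) {
		if (queue_try_pop(&w)) {
			/* May block inside the kernel on a contended lock. */
			process(w);
		} else {
			/*
			 * Nothing to do yet: give up the CPU.  This is why the
			 * prefetch-on runs show fewer CPU seconds but more
			 * wall time and context switches.
			 */
			sched_yield();
		}
	}
	return (NULL);
}
```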
@ahrens The directory structure is a monorepo. There are ~1,000 directories at the root level, with an average of ~6.7 directories inside each of those top-level dirs. Ultimately, it's around 37,280 directories and 185,000 files.
The workload here is a multithreaded program starting at the root of the repo, and descending into the repo, reading any hg/git ignore files, and then reporting on any files that aren't configured to be ignored by version control. There are many readdir_r calls, you are correct.
To answer my own question about the tooling, it seems like dbufstat comes the closest to doing what I wanted so far.
With the prefetch on, I see tons of this stack appearing:
rwsem_down_read_failed <- call_rwsem_down_read_failed <- down_read <- dmu_zfetch <- dbuf_read
It seems like dbuf_read is bumping up against the RW semaphore, and the rg thread yields/sleeps when it cannot do work.
With prefetch disabled I see none of those stacks.
I also see about twice the DMU_OT_DIRECTORY_CONTENTS entries in dbufstat when prefetch is enabled, but I haven't fully deciphered the significance of that (or if there is any) yet.
The prefetch path is lock-heavy. My past, never-finished attempt at improving it showed a high spread of latency on metadata-heavy workloads (see https://github.com/openzfs/zfs/pull/6568#issuecomment-325763049).
IIRC, just bailing out on repeated <dnode, blkid, nblks> calls, or on a sign of locking trouble (!mutex_tryenter()), already showed a benefit for this scenario.
I suspect that attempting to do predictive prefetch on the dnode object is not useful. We already get 32 dnodes with each read, plus when iterating over a directory we prefetch the dnodes of the directory entries. So it might make sense for dnode_hold_impl() to pass DB_RF_NOPREFETCH to dbuf_read().
You could make this patch and see if it improves your workload. I might have time for that tomorrow.
I've done a quickie patch for this @ https://github.com/adamdmoss/zfs/tree/dnode-no-prefetch (https://github.com/adamdmoss/zfs/commit/0715e3bfa0110ae63004d85664de1b269f892773)
I added DB_RF_NOPREFETCH to dnode_next_offset_level() too because that also seemed not-useful, though I could always be wrong. :grimacing:
@adamdmoss that change looks good to me.
I'm testing this patch and will report back, thanks
I've applied the patch and unfortunately, I am not seeing a notable improvement from what I can tell. Can someone please explain precisely what the patch is doing, so I can use that to better understand the problem? @adamdmoss
@gleventhal can you post new flame graphs with the patch applied?
@ahrens I've included 2 graphs, one with prefetch enabled, one without, both with the patch, both with clear caches (echo 3 > drop_caches)
The new flamegraph shows that dnode_hold_impl -> dbuf_read is still calling dmu_zfetch(), which shouldn't be possible with the patch applied. Can you double-check that you're running with the patch?
Cripes, I patched the wrong thing. I should have patched the kmod-zfs package, not the zfs one, or maybe both? Redeploying and retesting with the change in the kmod, sorry!
It works! The test now performs identically with prefetch on or off! Thanks, and sorry about my confusion on the last test.
myhostname:130# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
myhostname:0# perf stat -r 5 bash -c "rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l"
187191
187191
187191
187191
187191
Performance counter stats for 'bash -c rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l' (5 runs):
4,005.63 msec task-clock # 5.802 CPUs utilized ( +- 5.47% )
26,927 context-switches # 0.007 M/sec ( +- 1.70% )
31 cpu-migrations # 0.008 K/sec ( +- 4.56% )
22,491 page-faults # 0.006 M/sec ( +- 0.66% )
13,916,106,098 cycles # 3.474 GHz ( +- 6.41% )
10,066,874,247 instructions # 0.72 insn per cycle ( +- 3.77% )
1,941,982,736 branches # 484.814 M/sec ( +- 4.11% )
37,167,069 branch-misses # 1.91% of all branches ( +- 2.48% )
0.6904 +- 0.0169 seconds time elapsed ( +- 2.45% )
myhostname:0# perf stat -r 5 bash -c "rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l"
187191
187191
187191
187191
187191
Performance counter stats for 'bash -c rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l' (5 runs):
3,737.21 msec task-clock # 5.693 CPUs utilized ( +- 1.11% )
28,153 context-switches # 0.008 M/sec ( +- 3.42% )
29 cpu-migrations # 0.008 K/sec ( +- 10.96% )
22,566 page-faults # 0.006 M/sec ( +- 0.57% )
13,023,968,038 cycles # 3.485 GHz ( +- 0.82% )
9,613,969,469 instructions # 0.74 insn per cycle ( +- 1.95% )
1,850,275,334 branches # 495.096 M/sec ( +- 2.39% )
36,157,031 branch-misses # 1.95% of all branches ( +- 0.25% )
0.65651 +- 0.00422 seconds time elapsed ( +- 0.64% )
myhostname:0# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0
myhostname:0# echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
myhostname:0# perf stat -r 5 bash -c "rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l"
187191
187191
187191
187191
187191
Performance counter stats for 'bash -c rg --files /usr/local/home/gleventhal/workspaces/my_repo/|wc -l' (5 runs):
3,761.64 msec task-clock # 5.650 CPUs utilized ( +- 1.51% )
29,151 context-switches # 0.008 M/sec ( +- 1.71% )
27 cpu-migrations # 0.007 K/sec ( +- 7.39% )
22,266 page-faults # 0.006 M/sec ( +- 0.92% )
13,092,983,882 cycles # 3.481 GHz ( +- 1.32% )
9,810,140,476 instructions # 0.75 insn per cycle ( +- 3.89% )
1,895,407,612 branches # 503.878 M/sec ( +- 4.66% )
36,177,389 branch-misses # 1.91% of all branches ( +- 0.33% )
0.66578 +- 0.00766 seconds time elapsed ( +- 1.15% )
Nice work, @adamdmoss! Thanks, @ahrens and everyone else!
In what upcoming release can we expect this patch to be included?
@adamdmoss it'd be great if you could open a PR with the change so we can get it reviewed.
@adamdmoss it'd be great if you could open a PR with the change so we can get it reviewed.
Will do - may take a couple of days due to the day job.