Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | 8.8
Linux Kernel | 4.9.31
Architecture | x64
ZFS Version | 0.7.0+ (zfsonlinux/zfs@8264410)
SPL Version | 0.7.0+ (zfsonlinux/spl@ac48361)
We're experiencing high CPU load (spiking to 60-70, occasionally 90+) and soft lockups multiple times a day involving arc_prune, e.g.:
kernel:[2584566.976329] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [arc_prune:10762]
kernel:[2584567.126329] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [arc_prune:30403]
kernel:[2584568.636340] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [arc_prune:2218]
kernel:[2584568.786343] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [arc_prune:20348]
kernel:[2584568.936343] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [arc_prune:18709]
kernel:[2584569.086344] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [arc_prune:31076]
kernel:[2584571.276359] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [arc_prune:30756]
kernel:[2584571.426359] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 21s! [arc_prune:29186]
kernel:[2584571.586364] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [arc_prune:29185]
kernel:[2584571.736363] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [arc_prune:30272]
kernel:[2584571.886363] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [arc_prune:312]
kernel:[2584572.036363] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [arc_prune:15481]
kernel:[2584572.336366] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [migration/9:65]
kernel:[2584573.236374] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 24s! [arc_prune:30402]
The box has 192G RAM, Xeon E5620 x 2 (16 cores) and, possibly crucially, 2K+ mounted zfs filesystems (base filesystems + snapshots).
Non-default settings:
zfs_arc_max:120259084288
zfs_arc_meta_adjust_restarts:4
zfs_arc_meta_limit:120259084288
zfs_arc_meta_min:6442450944
zfs_arc_meta_prune:100000
zfs_arc_min:12884901888
zfs_dirty_data_max:4294967296
zfs_vdev_async_write_min_active:3
Some of these settings were put in place to see if they might help with the issue, e.g. zfs_arc_meta_prune and zfs_arc_meta_adjust_restarts (https://github.com/zfsonlinux/zfs/issues/3303#issuecomment-93598260).
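For reference, a minimal sketch of how settings like these are typically made persistent on ZFS on Linux, assuming the conventional /etc/modprobe.d/zfs.conf location (the values are the ones listed above):
````
# /etc/modprobe.d/zfs.conf -- zfs module options applied when the module loads
options zfs zfs_arc_min=12884901888
options zfs zfs_arc_max=120259084288
options zfs zfs_arc_meta_min=6442450944
options zfs zfs_arc_meta_limit=120259084288
options zfs zfs_arc_meta_prune=100000
options zfs zfs_arc_meta_adjust_restarts=4
options zfs zfs_dirty_data_max=4294967296
options zfs zfs_vdev_async_write_min_active=3
````
The same parameters can also be changed at runtime via /sys/module/zfs/parameters/, which is handy for experimenting before committing anything to modprobe.d.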
Using perf during times of trouble, I see the major call chains being:
+ ret_from_fork
  + kthread
    + taskq_thread
      + arc_prune_task
        + zpl_prune_sb
          + zfs_prune
            + igrab
            | + _raw_spin_lock
            |   + lock_acquire
            |
            + d_prune_aliases
              + _raw_spin_lock
                + lock_acquire
kthreadd
+ ret_from_fork
  + kthread
    + taskq_thread
      + arc_prune_task
        + zpl_prune_sb
          + zfs_prune
            + igrab
            | + _raw_spin_lock
            |   + lock_acquire
            |
            + d_prune_aliases
              + _raw_spin_lock
                + lock_acquire
The problem seems to lie in zfs_prune().
With this kernel, we have:
$ egrep -h 'HAVE_SHRINK|HAVE_SPLIT_SHRINKER_CALLBACK|SHRINK_CONTROL_HAS_NID|SHRINKER_NUMA_AWARE|HAVE_D_PRUNE_ALIASES' ../{spl/spl,zfs/zfs}_config.h ~/git/linux/include/linux/shrinker.h
#define HAVE_SHRINK_CONTROL_STRUCT 1
#define HAVE_SPLIT_SHRINKER_CALLBACK 1
#define HAVE_D_PRUNE_ALIASES 1
#define SHRINK_CONTROL_HAS_NID 1
#define SHRINKER_NUMA_AWARE (1 << 0)
...which means zfs_prune() becomes:
/*
 * The ARC has requested that the filesystem drop entries from the dentry
 * and inode caches. This can occur when the ARC needs to free meta data
 * blocks but can't because they are all pinned by entries in these caches.
 */
int
zfs_prune(struct super_block *sb, unsigned long nr_to_scan, int *objects)
{
	zfsvfs_t *zfsvfs = sb->s_fs_info;
	int error = 0;
	struct shrinker *shrinker = &sb->s_shrink;
	struct shrink_control sc = {
		.nr_to_scan = nr_to_scan,
		.gfp_mask = GFP_KERNEL,
	};

	ZFS_ENTER(zfsvfs);

	if (sb->s_shrink.flags & SHRINKER_NUMA_AWARE) {
		*objects = 0;
		for_each_online_node(sc.nid) {
			*objects += (*shrinker->scan_objects)(shrinker, &sc);
		}
	} else {
		*objects = (*shrinker->scan_objects)(shrinker, &sc);
	}

	/*
	 * Fall back to zfs_prune_aliases if the kernel's per-superblock
	 * shrinker couldn't free anything, possibly due to the inodes being
	 * allocated in a different memcg.
	 */
	if (*objects == 0)
		*objects = zfs_prune_aliases(zfsvfs, nr_to_scan);

	ZFS_EXIT(zfsvfs);

	dprintf_ds(zfsvfs->z_os->os_dsl_dataset,
	    "pruning, nr_to_scan=%lu objects=%d error=%d\n",
	    nr_to_scan, *objects, error);

	return (error);
}
(The *objects = 0 appears redundant as this function is only called from zpl_prune_sb() which sets *objects to zero first.)
Note that zfs_prune() is called for each zfs superblock (i.e. each zfs mount) on the system, and zfs_prune() in turn calls (*shrinker->scan_objects)(shrinker, &sc) once for each online node, via for_each_online_node.
Through the magic of SPL_SHRINKER_CALLBACK_WRAPPER and friends, the call to shrinker->scan_objects() ends up as a call to __arc_shrinker_func(NULL, &sc).
However for the life of me I can't see where __arc_shrinker_func() is either node-specific or superblock-specific. I.e. it seems for our 16 cores and 2K+ mounted file systems, we're calling __arc_shrinker_func() 32000+ times (16 * 2K) when once would have produced the same result?!
Looking at the issue from another direction, I used a systemtap script to log calls to zfs_prune() and to record which of those calls actually managed to prune something (i.e. *objects > 0). Over a 30 second period during "moderate" load average (~20-30) I observed 60886 calls to zfs_prune() (which amplifies to 970K+ calls to __arc_shrinker_func()), and only 1081 of those zfs_prune() calls returned with *objects set to non-zero, i.e. they actually pruned something. That supports the code analysis: the vast majority of the time, zfs_prune() is simply spinning its wheels.
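For anyone wanting a rough version of this measurement without systemtap, the sketch below uses only the kstats. It assumes the arc_prune counter in /proc/spl/kstat/zfs/arcstats counts prune callback invocations; per the analysis above, each of those fans out to one scan_objects call per online node.
````
#!/bin/bash
# Rough measure of ARC prune activity over a 30 second window, using only kstats.
mounts=$(grep -c ' zfs ' /proc/mounts)                             # mounted zfs filesystems
nodes=$(ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l)  # online NUMA nodes (approx)
before=$(awk '$1 == "arc_prune" { print $3 }' /proc/spl/kstat/zfs/arcstats)
sleep 30
after=$(awk '$1 == "arc_prune" { print $3 }' /proc/spl/kstat/zfs/arcstats)
echo "zfs mounts: $mounts, NUMA nodes: $nodes"
echo "arc_prune callbacks over 30s: $((after - before))"
````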
What am I missing, or is pruning really this badly in need of some TLC?
@chrisrd thanks for taking the time to provide a detailed analysis. Your configuration with 1000's of mount points is definitely an area which could benefit from additional scaling work. Let me try and summarize what the code is trying to accomplish here and we can see about optimizing or tuning it.
The key thing to keep in mind about the per-superblock zfs_prune() callback is that its purpose is to provide a mechanism to request that a superblock drop cached dentries and inodes. This is important because each of these cached objects can pin data buffers in the ARC, preventing them from being released. When the ARC determines that it can't free up enough memory by releasing unpinned buffers, it invokes the prune callback to request that the upper layers drop their references and unpin those buffers. With a large number of filesystems mounted it's possible to pin a large fraction of the ARC, which can cause the kind of issue you're seeing.
The good news is there are some additional tunings which may help. The first thing you'll want to determine is exactly which limit the ARC is exceeding that causes it to decide it needs to prune dentries. You should be able to determine this from /proc/spl/kstat/zfs/arcstats. Pruning can happen when dnode_size > arc_dnode_limit or arc_meta_used > arc_meta_limit.
Since you've already increased the arc_meta_limit, my guess is you're bumping up against the dnode limit with a metadata-heavy workload. You should be able to see this in the arcstats.
With more recent versions of the code you can adjust both of these thresholds with the zfs_arc_dnode_limit_percent (default 10%) and zfs_arc_meta_limit_percent (default 75%) module options. They behave exactly as you'd expect: increase them, up to 100%, until they're large enough that your workload no longer triggers constant pruning.
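For example, a quick way to compare those two pairs side by side (a sketch; the field names are exactly as they appear in /proc/spl/kstat/zfs/arcstats):
````
# Compare the two pairs that can trigger pruning.
awk '$1 ~ /^(dnode_size|arc_dnode_limit|arc_meta_used|arc_meta_limit)$/ { v[$1] = $3 }
    END {
        d = (v["dnode_size"] + 0 > v["arc_dnode_limit"] + 0) ? "OVER LIMIT" : "ok"
        m = (v["arc_meta_used"] + 0 > v["arc_meta_limit"] + 0) ? "OVER LIMIT" : "ok"
        printf("dnode_size    %s  arc_dnode_limit %s  -> %s\n", v["dnode_size"], v["arc_dnode_limit"], d)
        printf("arc_meta_used %s  arc_meta_limit  %s  -> %s\n", v["arc_meta_used"], v["arc_meta_limit"], m)
    }' /proc/spl/kstat/zfs/arcstats
````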
zfs_dirty_data_max:4294967296
By the way this is now the default so you shouldn't need to set it.
@behlendorf Thanks for the pointer towards arc_dnode_limit - I was definitely hitting that limit:
# grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size 4 9060926400
arc_dnode_limit 4 9019431321
The man page for zfs-module-parameters is actually misleading there, describing zfs_arc_dnode_limit as a "floor" i.e. minimum, when it's actually a maximum. I've submitted a pull request (#6228) to change the man page from floor to ceiling.
There's another trap for young and old there: per arc_tuning_update(), the only time zfs_arc_dnode_limit_percent is used is if you actually change zfs_arc_max. (And whilst we're here, if you change zfs_arc_meta_min then arc_dnode_limit is automatically set back to 10% of arc_meta_limit, regardless of zfs_arc_dnode_limit or zfs_arc_dnode_limit_percent.)
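If that quirk bites, one possible workaround, sketched here purely from the behaviour described above (arc_tuning_update() only recalculating the derived limits when zfs_arc_max changes) and untested, is to set the percent first and then nudge zfs_arc_max so the recalculation runs:
````
# Run as root; assumes zfs_arc_max has been explicitly set, as in this case.
echo 75 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
cur=$(cat /sys/module/zfs/parameters/zfs_arc_max)
echo $((cur - 1)) > /sys/module/zfs/parameters/zfs_arc_max   # "change" zfs_arc_max...
echo "$cur" > /sys/module/zfs/parameters/zfs_arc_max         # ...then restore it
grep dnode /proc/spl/kstat/zfs/arcstats                      # check arc_dnode_limit moved
````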
With my zfs_arc_dnode_limit_percent set to 75 the dnode_size/limit looks much healthier:
# grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size 4 38120264000
arc_dnode_limit 4 67645734912
And I've yet to see another arc_prune storm. Yippee!
Whilst it seems my immediate issue is resolved, I think zfs_prune() really needs some attention...
I suspect the SHRINKER_NUMA_AWARE stuff in there is completely off base (or I am!).
Our superblock currently always has SHRINKER_NUMA_AWARE set (and so that branch is always taken in zfs_prune()) because it's set by default in the proforma superblock created by linux/fs/super.c:alloc_super(). We have the opportunity to override that default in zfs_domount() if we wanted.
It looks to me that SHRINKER_NUMA_AWARE is supposed to be a flag for us to let linux know "yes, our shrinker uses NUMA" (or not) rather than a direction to our shrinker "you must do something about NUMA", e.g. see comments on linux/mm/vmscan.c:shrink_slab().
Given our shrinker->scan_objects() (i.e. __arc_shrinker_func()) doesn't really know anything about NUMA (e.g. the node id nid isn't used anywhere that I can see), it looks like we should be unsetting SHRINKER_NUMA_AWARE (and SHRINKER_MEMCG_AWARE whilst we're at it!) in zfs_domount() to tell linux our shrinker doesn't deal with those things. And remove the (misdirected) special casing in zfs_prune() for those things.
Overall I'm pretty confused by zfs_prune() and the mix between freeing fs-level stuff (znodes and the like) via zfs_prune_aliases() and freeing page-level stuff via shrinker->scan_objects() (i.e. __arc_shrinker_func()).
Why isn't zfs_prune() just calling zfs_prune_aliases() to free up fs stuff, and then we rely on linux/mm/vmscan.c:do_shrink_slab() to call our shrinker->scan_objects() to free up page stuff?
And I've yet to see another arc_prune storm. Yippee!
Great, I'm glad you were able to tune things to prevent the issue. And thanks for the PR to fix the documentation, I'll get it merged.
This is definitely an area which could benefit from some code cleanup and simplification. The reason for a lot of the complexity here is historical. These interfaces have changed significantly over various kernel versions, and the right thing to do for, say, 2.6.32 can be significantly different from 4.10.
For example, the older kernels don't even have the notion of per-superblock shrinkers. The newer kernels have definitely improved on the available infrastructure and we should take advantage of that. We've been slowly moving to integrate the ARC more tightly with the kernel's VM code as it becomes possible.
Why isn't zfs_prune() just calling zfs_prune_aliases() to free up fs stuff, and then we rely on linux/mm/vmscan.c:do_shrink_slab() to call our shrinker->scan_objects() to free up page stuff?
This is pretty close to how things are working today but I agree it's overly convoluted. As a next step I think it would be worth exploring how best to take advantage of the per-superblock shrinkers. One idea is to use these as the entry points to free additional dentries and inodes.
Normally this is left up to prune_icache_sb() and prune_dcache_sb(); unfortunately we can't call these functions directly and we can't rely on the VM to call them when the ARC needs it. But we could register these as the functions which are run as the ARC prune callbacks. We'd need to keep some compatibility code around for kernels without this support, but that shouldn't be too bad.
With my zfs_arc_dnode_limit_percent set to 75 the dnode_size/limit looks much healthier:
We may want to consider increasing the default value of 10% if it is too low for some common workloads.
The good news is I'm still arc_prune storm free after a full night and morning load cycle.
But it's odd, to my eyes, how the arc has responded overall to the arc_dnode_limit change (done at 14:15 yesterday) once the evening load started to ramp up from around 18:00. Witness:
[graphs: ARC data_size and dnode_size over time]
I don't know why the Data Size (arcstats:data_size) decreased in conjunction with the increasing Dnode Size (arcstats:dnode_size) between 18:00 and 00:00, let alone why it collapsed just before 12:30 with no identifiable changes on the system around that time.
Perhaps because, in the absence of pruning the dnodes (now that the limit is raised), the system is pruning the data bufs instead? And/or because "Target Cache Size" (arcstats:c) is telling it to reduce the arc size? But there's plenty of memory free, so why is c so low?:
# function show-arc-free
{
    local -i is_free=$(awk '$1 ~ /^(nr_free_pages|nr_inactive_file|nr_inactive_anon|nr_slab_reclaimable)$/ { tot += $2 } END { print tot * 4096; }' /proc/vmstat)
    local -i want_free=$(awk '$1 == "arc_sys_free" { print $3 }' /proc/spl/kstat/zfs/arcstats)
    local -i need_freed=$((want_free - is_free))
    printf '%s is_free=%dG want_free=%dG need_freed=%dG\n' "$(date +'%F %T')" $((is_free / 1024**3)) $((want_free / 1024**3)) $((need_freed / 1024**3))
}
# show-arc-free
2017-06-15 14:44:27 is_free=61G want_free=2G need_freed=-59G
Looks like #4869? Sigh.
Unfortunately my change to man5/zfs-module-parameters.5 (#6228, 627791f3c) wasn't quite right. It turns out zfs_arc_dnode_limit is both a ceiling (dnode pruning occurs above this limit, but only in response to ...something) and a floor (dnode pruning stops when dnode cache memory gets this low).
I guess this all just means I really don't understand the labyrinthian ZoL memory management at all. I wish I could take a sabbatical and work out what's really going on, fix anything that needs fixing, improve anything that needs improving, and write a thesis on it all!
I am getting similar behaviour when doing ncdu /path/to/toplevelpool/with/2000_filesytems_here, except I only see high load and a high number of arc_prune processes, with no dmesg complaints.
````
arc_meta_used 4 1813359592
arc_meta_limit 4 1485040128
arc_meta_max 4 1834266256
arc_meta_min 4 16777216
dnode_size 4 118869440
arc_dnode_limit 4 148504012
````
perf top -ag
[perf top output not captured]
So, rereading Brian's comment above, in my case it's arc_meta_used > arc_meta_limit.
So I've adjusted /sys/module/zfs/parameters/zfs_arc_meta_limit_percent from 75 to 100; the ncdu command is long finished and the system is doing NOTHING, but arc_prune processes keep on using all the CPU.
````
MemTotal: 3867292 kB
MemFree: 783140 kB
MemAvailable: 877000 kB
Buffers: 68496 kB
Cached: 488640 kB
SwapCached: 2072 kB
Active: 281692 kB
Inactive: 317456 kB
Active(anon): 157256 kB
Inactive(anon): 171376 kB
Active(file): 124436 kB
Inactive(file): 146080 kB
Unevictable: 96196 kB
Mlocked: 96196 kB
SwapTotal: 4194300 kB
SwapFree: 4139556 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 136672 kB
Mapped: 51560 kB
Shmem: 282376 kB
Slab: 769868 kB
SReclaimable: 48764 kB
SUnreclaim: 721104 kB
KernelStack: 37712 kB
PageTables: 6828 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6127944 kB
Committed_AS: 1165712 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 40960 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 92044 kB
DirectMap2M: 4067328 kB
````
````
12 1 0x01 92 4416 10499467133 82634254073046
name type data
hits 4 9455160
misses 4 740164278
demand_data_hits 4 1750243
demand_data_misses 4 340930
demand_metadata_hits 4 5275639
demand_metadata_misses 4 737643474
prefetch_data_hits 4 56
prefetch_data_misses 4 1750227
prefetch_metadata_hits 4 2429222
prefetch_metadata_misses 4 429647
mru_hits 4 2547060
mru_ghost_hits 4 75623
mfu_hits 4 5477864
mfu_ghost_hits 4 173621
deleted 4 4766434
mutex_miss 4 22251
evict_skip 4 10774970
evict_not_enough 4 680212
evict_l2_cached 4 3266544128
evict_l2_eligible 4 97994958336
evict_l2_ineligible 4 86747079168
evict_l2_skip 4 0
hash_elements 4 72462
hash_elements_max 4 276211
hash_collisions 4 18086624
hash_chains 4 4075
hash_chain_max 4 7
p 4 1562194258
c 4 1907107392
c_min 4 33554432
c_max 4 1980053504
size 4 1843323928
compressed_size 4 152079872
uncompressed_size 4 1414979072
overhead_size 4 1408504832
hdr_size 4 11725000
data_size 4 0
metadata_size 4 1560584704
dbuf_size 4 63612432
dnode_size 4 161811712
bonus_size 4 40733440
anon_size 4 67584
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 1463779840
mru_evictable_data 4 0
mru_evictable_metadata 4 0
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 96737280
mfu_evictable_data 4 0
mfu_evictable_metadata 4 0
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 65876
l2_misses 4 718477129
l2_feeds 4 77780
l2_rw_clash 4 0
l2_read_bytes 4 290393600
l2_write_bytes 4 763876352
l2_writes_sent 4 10190
l2_writes_done 4 10190
l2_writes_error 4 0
l2_writes_lock_retry 4 1
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 25
l2_free_on_write 4 1
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 860622848
l2_asize 4 209475072
l2_hdr_size 4 4856640
memory_throttle_count 4 0
memory_direct_count 4 768
memory_indirect_count 4 3198
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 16005833
arc_meta_used 4 1843323928
arc_meta_limit 4 1485040128
arc_dnode_limit 4 148504012
arc_meta_max 4 1930656256
arc_meta_min 4 16777216
sync_wait_for_async 4 7100
demand_hit_predictive_prefetch 4 51793
arc_need_free 4 0
arc_sys_free 4 61876672
````
I don't know what I'm doing but after making arc_meta_limit the same size as arc_meta_max, arc_meta_used has grown a little but the system load has dropped and arc_prune is gone
so apparently zfs_arc_meta_limit_percent had no effect
I don't know how it's handled exactly in the code but it sounds like it's mutually exclusive, so you either pick meta_limit_percent or meta_limit
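For reference, the runtime adjustment described above (making arc_meta_limit match arc_meta_max) would look roughly like this; the byte value is the arc_meta_max figure from the arcstats dump above, and it assumes, as speculated, that the absolute zfs_arc_meta_limit wins over the percent variant:
````
echo 1930656256 | sudo tee /sys/module/zfs/parameters/zfs_arc_meta_limit
grep -E '^arc_meta_(used|limit|max) ' /proc/spl/kstat/zfs/arcstats
````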
happy that you are seeing some improvement :)
@mailinglists35 you may not have noticed buried in my previous ravings:
the only time zfs_arc_dnode_limit_percent is used is if you actually change zfs_arc_max
I.e. there's a bug in the current code that prevents a change to zfs_arc_dnode_limit_percent from taking effect until you've changed zfs_arc_max.
See also #6269
Are there some known workarounds or ZFS versions unaffected by this bug?
The merged fix means it was supposed to be fixed as of August 28, 2017.
@drescherjm So is there a reason why this issue is left open? It's confusing...
I still have lockups under load and don't know why...
In which version of ZFS did this issue get fixed? I am running ZFS 0.7.12-1~bpo9+1 on Debian 9 and am currently seeing quite a few of these:
kernel:[965000.084000] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [arc_prune:4521]
with a high CPU load of 8 during an rm of many small files on a ZFS partition.
If that's of any help, here is the kernel stack I see in the kernel log:
[Tue Jul 2 21:09:45 2019] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [arc_prune:4521]
[Tue Jul 2 21:09:45 2019] Modules linked in: rpcsec_gss_krb5 bonding binfmt_misc zfs(PO) zunicode(PO) zavl(PO) icp(PO) iTCO_wdt iTCO_vendor_support ppdev evdev coretemp kvm irqbypass pcspkr zcommon(PO) znvpair(PO) amdkfd lpc_ich mfd_core spl(O) radeon ttm drm_kms_helper i5000_edac drm edac_core i5k_amb serio_raw i2c_algo_bit rng_core shpchp parport_pc parport ipmi_si ipmi_msghandler button sg nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 mbcache ses enclosure uas usb_storage sr_mod cdrom sd_mod ata_generic ata_piix uhci_hcd ehci_pci i2c_i801 i2c_smbus ehci_hcd ahci usbcore libahci usb_common smartpqi libata scsi_transport_sas scsi_mod e1000e ptp pps_core e1000 floppy [last unloaded: msr]
[Tue Jul 2 21:09:45 2019] CPU: 1 PID: 4521 Comm: arc_prune Tainted: P O L 4.9.0-9-amd64 #1 Debian 4.9.168-1+deb9u3
[Tue Jul 2 21:09:45 2019] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1c 07/04/2011
[Tue Jul 2 21:09:45 2019] task: ffff8e004483eec0 task.stack: ffff9e310cd70000
[Tue Jul 2 21:09:45 2019] RIP: 0010:[<ffffffff82ac5b2d>] [<ffffffff82ac5b2d>] native_queued_spin_lock_slowpath+0x5d/0x1a0
[Tue Jul 2 21:09:45 2019] RSP: 0018:ffff9e310cd73c10 EFLAGS: 00000202
[Tue Jul 2 21:09:45 2019] RAX: 0000000000000101 RBX: ffff8e005c8b9dd8 RCX: ffff8e00efc9267c
[Tue Jul 2 21:09:45 2019] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8e005c8b9dd8
[Tue Jul 2 21:09:45 2019] RBP: ffff8e0059355380 R08: ffff8e0059355390 R09: ffff9e310cd73d20
[Tue Jul 2 21:09:45 2019] R10: ffff8e0013dfd4f0 R11: ffff8dffd067b6b0 R12: ffff9e310cd73c68
[Tue Jul 2 21:09:45 2019] R13: ffff8e005c8b9dd8 R14: ffff8e0059355300 R15: ffff8e005c8b9dd8
[Tue Jul 2 21:09:45 2019] FS: 0000000000000000(0000) GS:ffff8e00efc80000(0000) knlGS:0000000000000000
[Tue Jul 2 21:09:45 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jul 2 21:09:45 2019] CR2: 00007f22b8b39160 CR3: 00000000c9608000 CR4: 0000000000040670
[Tue Jul 2 21:09:45 2019] Stack:
[Tue Jul 2 21:09:45 2019] ffffffff8301a87d ffffffff82d443ba ffff8e005c8b9d80 ffffffff82c247c4
[Tue Jul 2 21:09:45 2019] 000000000009ad8f 00000000000a78af ffff8e00e79e44c0 ffff8e00e79e4000
[Tue Jul 2 21:09:45 2019] 0000000000000000 0000000000000003 ffffffff82c261b2 ffff8e005258d440
[Tue Jul 2 21:09:45 2019] Call Trace:
[Tue Jul 2 21:09:45 2019] [<ffffffff8301a87d>] ? _raw_spin_lock+0x1d/0x20
[Tue Jul 2 21:09:45 2019] [<ffffffff82d443ba>] ? lockref_put_or_lock+0x5a/0x80
[Tue Jul 2 21:09:45 2019] [<ffffffff82c247c4>] ? shrink_dentry_list+0x154/0x300
[Tue Jul 2 21:09:45 2019] [<ffffffff82c261b2>] ? prune_dcache_sb+0x52/0x70
[Tue Jul 2 21:09:45 2019] [<ffffffff82c0f7cc>] ? super_cache_scan+0x10c/0x190
[Tue Jul 2 21:09:45 2019] [<ffffffffc108a3f7>] ? zfs_prune+0x137/0x350 [zfs]
[Tue Jul 2 21:09:45 2019] [<ffffffffc10ad774>] ? zpl_prune_sb+0x34/0x50 [zfs]
[Tue Jul 2 21:09:45 2019] [<ffffffffc0fe0d7e>] ? arc_prune_task+0x1e/0x30 [zfs]
[Tue Jul 2 21:09:45 2019] [<ffffffffc07b52cb>] ? taskq_thread+0x28b/0x460 [spl]
[Tue Jul 2 21:09:45 2019] [<ffffffff82aa5e00>] ? wake_up_q+0x70/0x70
[Tue Jul 2 21:09:45 2019] [<ffffffffc07b5040>] ? task_done+0x90/0x90 [spl]
[Tue Jul 2 21:09:45 2019] [<ffffffff82a9a969>] ? kthread+0xd9/0xf0
[Tue Jul 2 21:09:45 2019] [<ffffffff8301aa64>] ? __switch_to_asm+0x34/0x70
[Tue Jul 2 21:09:45 2019] [<ffffffff82a9a890>] ? kthread_park+0x60/0x60
[Tue Jul 2 21:09:45 2019] [<ffffffff8301aaf7>] ? ret_from_fork+0x57/0x70
[Tue Jul 2 21:09:45 2019] Code: 85 f6 75 42 f0 0f ba 2f 08 b8 00 01 00 00 0f 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 75 1b 85 f6 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 81 e6 00 ff 00 00
This is on a pool made out of 4 mirrors with 2 disks (ST4000NM0033) in each mirror.
There have been improvements, but the soft lockup issue isn't "fixed" as such. Removing a lot of small files means you may be running into arc_dnode_limit, e.g. Brian's comment above and my followup comment.
The easiest thing is to check, when the problem is occurring, if your dnode_size is near or over arc_dnode_limit, indicating arc_prune is busy trying to keep dnode_size below arc_dnode_limit:
$ grep dnode /proc/spl/kstat/zfs/arcstats
If so, increasing zfs_arc_dnode_limit_percent will likely help. I have it set to 70 (up from the default of 10) for my workload which does a lot of metadata work, e.g.:
$ echo 70 | sudo tee /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
@chrisrd thank you very much for your answer. You must be right about the dnode size being over its limit, because right now I can see the following:
dnode_size 4 175699136
arc_dnode_limit 4 155516467
So as you recommend I have now increased the zfs_arc_dnode_limit_percent from the 10% default to 50% and will observe. By the way in your comment above you mention zfs_arc_meta_limit_percent where you actually meant zfs_arc_dnode_limit_percent ;-)
I must mention here that this server has 4 GB of RAM and hence 2 GB of it is dedicated to the ARC. So part of the problem in my case might also be that I don't have enough RAM.
When this is resolved, please recommend changes to the "when to change" sections of https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_dnode_limit_percent
@chrisrd just wanted to get back to you that adapting the dnode percent limit from the 10% default to 50% has worked fine for me so far. I have not observed any soft lockups anymore.
I've been struggling with nfsd hung (soft lockups) issue for quite some time.
My system has 384GB of memory; I set arc max to 300G and dbuf max bytes to 32G. Lots of NFS clients (more than 2500 Linux hosts) talk to the server, which runs zfs 0.7.13, and at some point it's guaranteed to hit the following nfsd lockup:
[Mon Jul 1 07:01:47 2019] INFO: task nfsd:4073 blocked for more than 120 seconds.
[Mon Jul 1 07:01:47 2019] Tainted: G OE 5.0.0-20-generic #21mw
[Mon Jul 1 07:01:47 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Jul 1 07:01:47 2019] nfsd D 0 4073 2 0x80000000
[Mon Jul 1 07:01:47 2019] Call Trace:
[Mon Jul 1 07:01:47 2019] __schedule+0x2d0/0x840
[Mon Jul 1 07:01:47 2019] schedule+0x2c/0x70
[Mon Jul 1 07:01:47 2019] rwsem_down_write_failed+0x160/0x340
[Mon Jul 1 07:01:47 2019] ? security_prepare_creds+0x53/0xf0
[Mon Jul 1 07:01:47 2019] call_rwsem_down_write_failed+0x17/0x30
[Mon Jul 1 07:01:47 2019] down_write+0x2d/0x40
[Mon Jul 1 07:01:47 2019] nfsd_lookup_dentry+0xa0/0x380 [nfsd]
[Mon Jul 1 07:01:47 2019] ? cache_check+0x26c/0x2f0 [sunrpc]
[Mon Jul 1 07:01:47 2019] nfsd_lookup+0x82/0x130 [nfsd]
[Mon Jul 1 07:01:47 2019] nfsd3_proc_lookup+0xb5/0x110 [nfsd]
[Mon Jul 1 07:01:47 2019] nfsd_dispatch+0xa6/0x220 [nfsd]
[Mon Jul 1 07:01:47 2019] svc_process_common+0x360/0x750 [sunrpc]
[Mon Jul 1 07:01:47 2019] svc_process+0xbc/0x100 [sunrpc]
[Mon Jul 1 07:01:47 2019] nfsd+0xe9/0x150 [nfsd]
[Mon Jul 1 07:01:47 2019] kthread+0x120/0x140
[Mon Jul 1 07:01:47 2019] ? nfsd_destroy+0x60/0x60 [nfsd]
[Mon Jul 1 07:01:47 2019] ? __kthread_parkme+0x70/0x70
[Mon Jul 1 07:01:47 2019] ret_from_fork+0x1f/0x40
Once I set zfs_arc_dnode_limit_percent to 50, no more nfsd lockup.
I don't understand why this arc dnode limit was implemented. If it's there for a good reason, fine; but it seems to bring more trouble than benefit.
@richardelling
When this is resolved, please recommend changes to the "when to change" sections of https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_dnode_limit_percent
No idea how to update the wiki (GitHub help suggests an Edit button, but I don't see one), so...
commit 2e6aae68f72ed30b8e1ef2c1227ae045b276c3f6
Author: Chris Dunlop <[email protected]>
Date: Fri Aug 2 07:14:20 2019 +0000
update when to change suggestion for zfs_arc_dnode_limit_percent, zfs_arc_dnode_limit
diff --git a/ZFS-on-Linux-Module-Parameters.md b/ZFS-on-Linux-Module-Parameters.md
index 77dbf8aadc9f..668991fe87f2 100644
--- a/ZFS-on-Linux-Module-Parameters.md
+++ b/ZFS-on-Linux-Module-Parameters.md
@@ -1131,7 +1131,7 @@ The value calculated for `zfs_arc_dnode_limit_percent` can be overridden by
| zfs_arc_dnode_limit_percent | Notes
|---|---
| Tags | [ARC](#arc)
-| When to change | Testing dnode cache efficiency
+| When to change | Consider increasing if arc_prune is using excessive system time and /proc/spl/kstat/zfs/arcstats shows arc_dnode_size is near or over arc_dnode_limit
| Data Type | int
| Units | percent of arc_meta_limit
| Range | 0 to 100
@@ -1153,7 +1153,7 @@ purpose for metadata.
| zfs_arc_dnode_limit | Notes
|---|---
| Tags | [ARC](#arc)
-| When to change | Testing dnode cache efficiency
+| When to change | Consider increasing if arc_prune is using excessive system time and /proc/spl/kstat/zfs/arcstats shows arc_dnode_size is near or over arc_dnode_limit
| Data Type | uint64
| Units | bytes
| Range | 0 to MAX_UINT64
@chrisrd thanks. I've updated the wiki accordingly.
@behlendorf
With more recent versions of the code you can adjust both of these thresholds with the zfs_arc_dnode_limit_percent (default 10%) and zfs_arc_meta_limit_percent (default 75%) module options. They behave exactly as you'd expect and you should increase them, up to 100%, until they're large enough that with your workload you're no longer triggering constant pruning.
Why are such values used by default? Is there any drawback to increasing them?
wiki wording is slightly off
# cat /proc/spl/kstat/zfs/arcstats|grep dnode
dnode_size 4 2333270832
arc_dnode_limit 4 2505743155
I get the CPU stuck condition for umount every month when I reboot after updates. I believe I hit this issue after an unknown amount of uptime. Current uptime is 2 days.
# ps -e|grep arc_prune|wc -l
255
This is potentially related to https://github.com/openzfs/zfs/issues/7559
Also ran into this problem; can the default be changed?
When this is resolved, please recommend changes to the "when to change" sections of https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_dnode_limit_percent
where is the current link for that url?
clicking the link takes me to this:
[screenshot]
then clicking the "This page was moved" link takes me to this:
[screenshot]
@gmelikov wouldn't it be best practice, during a transition (or even after it's completed), to keep the old content available and, instead of deleting it, simply add a note saying "this content is no longer updated, please see the updated links at..."?
I've updated this link; the page was renamed earlier, and we should update links in the wiki in such cases too.
@mailinglists35 we've already migrated; I'm against keeping several "old/current" variants, and the absence of a proper redirect is a GitHub wiki problem, unfortunately.
Feel free to post docs issues at https://github.com/openzfs/openzfs-docs/issues.
I'll work on the problems with the 404 page soon.