Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | stretch
Linux Kernel | 4.16.0-0.bpo.2-amd64
Architecture | amd64
ZFS Version | 0.7.9-3~bpo9+1
SPL Version | 0.7.9-3~bpo9+1
System deadlocked while forcing a page out to a swap zvol. Unfortunately I do not have the rest of the backtraces.
rsync a filesystem with O(1M) files.
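For anyone trying to recreate that load, a local copy of roughly this shape (paths are placeholders, not taken from the report) produces the many-small-files write pattern described:
```
# Hypothetical reproduction load: copy a tree with ~1M small files.
rsync -aHAX --info=progress2 /tank/source/ /tank/copy/
```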

```
seraph:~# zfs get all rpool/swap
NAME PROPERTY VALUE SOURCE
rpool/swap type volume -
rpool/swap creation Tue Jun 26 17:52 2018 -
rpool/swap used 8.50G -
rpool/swap available 8.84T -
rpool/swap referenced 128K -
rpool/swap compressratio 1.09x -
rpool/swap reservation none default
rpool/swap volsize 8G local
rpool/swap volblocksize 4K -
rpool/swap checksum on default
rpool/swap compression zle local
rpool/swap readonly off default
rpool/swap createtxg 156102 -
rpool/swap copies 1 default
rpool/swap refreservation 8.50G local
rpool/swap guid 11434631848701244988 -
rpool/swap primarycache metadata local
rpool/swap secondarycache all default
rpool/swap usedbysnapshots 0B -
rpool/swap usedbydataset 128K -
rpool/swap usedbychildren 0B -
rpool/swap usedbyrefreservation 8.50G -
rpool/swap logbias throughput local
rpool/swap dedup off default
rpool/swap mlslabel none default
rpool/swap sync always local
rpool/swap refcompressratio 1.09x -
rpool/swap written 128K -
rpool/swap logicalused 46K -
rpool/swap logicalreferenced 46K -
rpool/swap volmode default default
rpool/swap snapshot_limit none default
rpool/swap snapshot_count none default
rpool/swap snapdev hidden default
rpool/swap context none default
rpool/swap fscontext none default
rpool/swap defcontext none default
rpool/swap rootcontext none default
rpool/swap redundant_metadata all default
rpool/swap com.sun:auto-snapshot false local
```
@jgallag88 and I have been observing this as well. Here's our bug report with a simple reproducer:
Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 18.04
Linux Kernel | 4.15.0-33-generic
Architecture | amd64
ZFS Version | Master
SPL Version | Master
With zvol swap devices, high-memory operations hang indefinitely despite plenty of swap space being available. The hang appears to occur in ZIL code while ZFS is performing an allocation for a zio as part of its work to swap out pages.
We also tested this scenario with an ext4-backed swap device and did not see any hangs.
This is the memory usage on one of our typical VMs:
```
              total        used        free      shared  buff/cache   available
Mem:           7.3G        3.6G        3.1G         32M        603M        3.4G
Swap:          4.0G          0B        4.0G
```
### Reproducing the problem
#### Configure a `zvol` as a swap device
```
zfs create -V 4G -b "$(getconf PAGESIZE)" -o logbias=throughput -o sync=always -o primarycache=metadata rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
```
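A quick sanity check that the zvol-backed swap is actually active (standard util-linux and ZFS commands; the dataset name follows the example above):
```
# Confirm the zvol swap device is registered and inspect its settings.
swapon --show
cat /proc/swaps
zfs get volsize,volblocksize,sync,primarycache rpool/swap
```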
#### Run a high memory operation
Running something that will use up all your memory, for example `stress`:
`stress --vm 10 --vm-keep --vm-bytes 512M`
Check that all memory is being used with
`watch free -h`
Eventually the system will start to swap but quickly hang and not recover.
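To capture backtraces before the console becomes unresponsive, it can help to enable SysRq ahead of time and dump blocked tasks to the kernel log; this is a generic kernel facility, not something specific to this reproducer:
```
# Enable all SysRq functions, then dump blocked (D-state) tasks to dmesg.
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
```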
### Examples
#### Example of the thread trying to swap out a page:
```
PID: 1 TASK: ffff8cadf6208000 CPU: 1 COMMAND: "systemd"
#0 [ffffb10840c671f0] __schedule at ffffffffb0594571
#1 [ffffb10840c67288] schedule at ffffffffb0594bac
#2 [ffffb10840c67298] schedule_preempt_disabled at ffffffffb0594e8e
#3 [ffffb10840c672a8] __mutex_lock at ffffffffb059654c
#4 [ffffb10840c67338] __mutex_lock_slowpath at ffffffffb05968a3
#5 [ffffb10840c67350] mutex_lock at ffffffffb05968df
#6 [ffffb10840c67368] zil_commit_writer at ffffffffc0723d21 [zfs]
#7 [ffffb10840c673a0] zil_commit_impl at ffffffffc0723e66 [zfs]
#8 [ffffb10840c673c0] zil_commit at ffffffffc0723ef0 [zfs]
#9 [ffffb10840c673e8] zvol_write at ffffffffc0746e20 [zfs]
RIP: 00007fcfe704d69a RSP: 00007ffe9178ead0 RFLAGS: 00010202
RAX: 0000000000000001 RBX: 000055ca1d2bd8c0 RCX: 00007fcfe753abb7
RDX: 0000000000000056 RSI: 00007ffe9178ead0 RDI: 0000000000000000
RBP: 00007ffe9178efe0 R8: 0000000000000000 R9: 61742e6369736162
R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000001
R13: ffffffffffffffff R14: 00007ffe9178ead0 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
```
#### Two places where zios were stuck
```
PID: 570 TASK: ffff91c660c32d80 CPU: 0 COMMAND: "z_wr_int"
#0 [fffffe0000008cc0] machine_kexec at ffffffff896631c3
#1 [fffffe0000008d20] __crash_kexec at ffffffff8972a479
#2 [fffffe0000008de8] panic at ffffffff8968c7b0
#3 [fffffe0000008e70] nmi_panic at ffffffff8968c329
#4 [fffffe0000008e80] unknown_nmi_error at ffffffff896318e7
#5 [fffffe0000008ea0] default_do_nmi at ffffffff89631a8e
#6 [fffffe0000008ec8] do_nmi at ffffffff89631bc9
#7 [fffffe0000008ef0] end_repeat_nmi at ffffffff8a0019eb
[exception RIP: putback_inactive_pages+501]
RIP: ffffffff897ea245 RSP: ffff9fc501e9f658 RFLAGS: 00000003
RAX: ffff91c677012800 RBX: ffffeb4ac7e52a80 RCX: 0000000000000001
RDX: ffffeb4ac7380720 RSI: 0000000000000000 RDI: ffff91c677012800
RBP: ffff9fc501e9f6d8 R8: 0000000000027f20 R9: 0000000000036164
R10: ffff91c67ffd2000 R11: 00000000000000e5 R12: 0000000000000000
R13: ffffeb4ac7e52aa0 R14: ffff91c677012800 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
---
#8 [ffff9fc501e9f658] putback_inactive_pages at ffffffff897ea245
#9 [ffff9fc501e9f6e0] shrink_inactive_list at ffffffff897ea74d
```
And the next place:
```
PID: 567 TASK: ffff97afe0e05b00 CPU: 0 COMMAND: "z_wr_iss"
#0 [fffffe0000008cc0] machine_kexec at ffffffffa64631c3
#1 [fffffe0000008d20] __crash_kexec at ffffffffa652a479
#2 [fffffe0000008de8] panic at ffffffffa648c7b0
#3 [fffffe0000008e70] nmi_panic at ffffffffa648c329
#4 [fffffe0000008e80] unknown_nmi_error at ffffffffa64318e7
#5 [fffffe0000008ea0] default_do_nmi at ffffffffa6431a8e
#6 [fffffe0000008ec8] do_nmi at ffffffffa6431bc9
#7 [fffffe0000008ef0] end_repeat_nmi at ffffffffa6e019eb
[exception RIP: page_referenced_one]
RIP: ffffffffa661e5f0 RSP: ffffa65681e5f3a8 RFLAGS: 00000246
RAX: ffffffffa661e5f0 RBX: 00000000ffc49000 RCX: ffffa65681e5f418
RDX: 00000000ffc49000 RSI: ffff97afc231cea0 RDI: ffffdde407197dc0
RBP: ffffa65681e5f3f8 R8: 0000000000080000 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000004 R12: ffffdde407197dc0
R13: ffff97afc231cea0 R14: ffffa65681e5f430 R15: ffff97afc104dd80
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
---
#8 [ffffa65681e5f3a8] page_referenced_one at ffffffffa661e5f0
#9 [ffffa65681e5f3a8] rmap_walk_anon at ffffffffa661e47a
```
Can reproduce this as well on:
Type | Version/Name
--- | ---
Distribution Name | Proxmox VE (based on Debian)
Distribution Version | 5.2 / stretch
Linux Kernel | 4.15.18-4-pve (based on ubuntu-bionic)
Architecture | amd64
ZFS Version | 0.7.9-pve3~bpo9
SPL Version | 0.7.9-1
The swap zvol was created as above, with `-o compression=zle -o secondarycache=none` as additional parameters.
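Folding those flags into the earlier example gives a create command of roughly this shape (pool/volume names are carried over from the example above and may differ from the actual setup):
```
# Combined sketch of the earlier example plus the extra flags mentioned.
zfs create -V 4G -b "$(getconf PAGESIZE)" \
    -o logbias=throughput -o sync=always \
    -o primarycache=metadata -o secondarycache=none \
    -o compression=zle rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
```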
The system is a small QEMU VM; using a QEMU disk without ZFS as swap does not lock up the machine.
I could reproduce the issue with plain Debian stretch: kernel 4.9.110-3+deb9u4 and ZFS 0.7.9 (from stretch-backports).
The problem does not occur on plain Debian stretch with kernel 4.9.110-3+deb9u4 and ZFS 0.6.5.9.
Please let me know if/how I can help in hunting this down or fixing it.
Console output with hung tasks on 4.15.18-4 (ZFS 0.7.9):
```
[ 846.823830] INFO: task systemd:1 blocked for more than 120 seconds.
[ 846.824978] Tainted: P O 4.15.18-4-pve #1
[ 846.825888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 846.826774] systemd D 0 1 0 0x00000000
[ 846.827409] Call Trace:
[ 846.827722] __schedule+0x3e0/0x870
[ 846.828145] schedule+0x36/0x80
[ 846.828541] cv_wait_common+0x11e/0x140 [spl]
[ 846.829088] ? wait_woken+0x80/0x80
[ 846.829477] __cv_wait+0x15/0x20 [spl]
[ 846.829968] txg_wait_open+0xb0/0x100 [zfs]
[ 846.830455] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 846.830992] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 846.831485] zvol_write+0x175/0x620 [zfs]
[ 846.831994] zvol_request+0x24a/0x300 [zfs]
[ 846.832513] ? SyS_madvise+0xa20/0xa20
[ 846.833124] generic_make_request+0x123/0x2f0
[ 846.833908] submit_bio+0x73/0x140
[ 846.834491] ? submit_bio+0x73/0x140
[ 846.835098] ? get_swap_bio+0xcf/0x100
[ 846.835780] __swap_writepage+0x345/0x3b0
[ 846.836463] ? __frontswap_store+0x73/0x100
[ 846.837190] swap_writepage+0x34/0x90
[ 846.838503] pageout.isra.53+0x1e5/0x330
[ 846.839874] shrink_page_list+0x955/0xb70
[ 846.841326] shrink_inactive_list+0x256/0x5e0
[ 846.842768] ? next_arg+0x80/0x110
[ 846.843851] shrink_node_memcg+0x365/0x780
[ 846.845052] shrink_node+0xe1/0x310
[ 846.846113] ? shrink_node+0xe1/0x310
[ 846.847153] do_try_to_free_pages+0xef/0x360
[ 846.848258] try_to_free_pages+0xf2/0x1b0
[ 846.849392] __alloc_pages_slowpath+0x401/0xf10
[ 846.850513] ? __page_cache_alloc+0x86/0x90
[ 846.851612] __alloc_pages_nodemask+0x25b/0x280
[ 846.852784] alloc_pages_current+0x6a/0xe0
[ 846.853903] __page_cache_alloc+0x86/0x90
[ 846.855011] filemap_fault+0x369/0x740
[ 846.856057] ? page_add_file_rmap+0xf7/0x150
[ 846.857482] ? filemap_map_pages+0x369/0x380
[ 846.858618] ext4_filemap_fault+0x31/0x44
[ 846.859708] __do_fault+0x24/0xe3
[ 846.860751] __handle_mm_fault+0xcd7/0x11e0
[ 846.861842] ? ep_read_events_proc+0xd0/0xd0
[ 846.863129] handle_mm_fault+0xce/0x1b0
[ 846.864175] __do_page_fault+0x25e/0x500
[ 846.865251] ? wake_up_q+0x80/0x80
[ 846.866175] do_page_fault+0x2e/0xe0
[ 846.867108] ? async_page_fault+0x2f/0x50
[ 846.868080] do_async_page_fault+0x1a/0x80
[ 846.869078] async_page_fault+0x45/0x50
[ 846.869985] RIP: 0033:0x7f551d2d30a3
[ 846.870874] RSP: 002b:00007ffdbd747e28 EFLAGS: 00010246
[ 846.872075] RAX: 0000000000000001 RBX: 0000558ebcbf3120 RCX: 00007f551d2d30a3
[ 846.873473] RDX: 0000000000000041 RSI: 00007ffdbd747e30 RDI: 0000000000000004
[ 846.874811] RBP: 00007ffdbd748240 R08: 431bde82d7b634db R09: 0000000000000100
[ 846.876098] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ffdbd747e30
[ 846.877594] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000003
[ 846.878882] INFO: task kswapd0:34 blocked for more than 120 seconds.
[ 846.880099] Tainted: P O 4.15.18-4-pve #1
[ 846.881379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 846.882873] kswapd0 D 0 34 2 0x80000000
[ 846.883989] Call Trace:
[ 846.884834] __schedule+0x3e0/0x870
[ 846.885771] schedule+0x36/0x80
[ 846.886722] cv_wait_common+0x11e/0x140 [spl]
[ 846.887853] ? wait_woken+0x80/0x80
[ 846.888889] __cv_wait+0x15/0x20 [spl]
[ 846.889927] txg_wait_open+0xb0/0x100 [zfs]
[ 846.891027] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 846.892132] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 846.893185] zvol_write+0x175/0x620 [zfs]
[ 846.894136] ? avl_add+0x74/0xa0 [zavl]
[ 846.895112] zvol_request+0x24a/0x300 [zfs]
[ 846.896124] ? SyS_madvise+0xa20/0xa20
[ 846.897118] generic_make_request+0x123/0x2f0
[ 846.898189] submit_bio+0x73/0x140
[ 846.899171] ? submit_bio+0x73/0x140
[ 846.900115] ? get_swap_bio+0xcf/0x100
[ 846.901183] __swap_writepage+0x345/0x3b0
[ 846.902212] ? __frontswap_store+0x73/0x100
[ 846.903221] swap_writepage+0x34/0x90
[ 846.904185] pageout.isra.53+0x1e5/0x330
[ 846.905207] shrink_page_list+0x955/0xb70
[ 846.906238] shrink_inactive_list+0x256/0x5e0
[ 846.907257] ? __wake_up+0x13/0x20
[ 846.908211] ? __cv_signal+0x2d/0x40 [spl]
[ 846.909286] ? next_arg+0x80/0x110
[ 846.910237] shrink_node_memcg+0x365/0x780
[ 846.911275] shrink_node+0xe1/0x310
[ 846.912253] ? shrink_node+0xe1/0x310
[ 846.913273] kswapd+0x386/0x770
[ 846.914213] kthread+0x105/0x140
[ 846.915168] ? mem_cgroup_shrink_node+0x180/0x180
[ 846.916326] ? kthread_create_worker_on_cpu+0x70/0x70
[ 846.917519] ret_from_fork+0x35/0x40
[ 846.918631] INFO: task jbd2/sda1-8:196 blocked for more than 120 seconds.
[ 846.920121] Tainted: P O 4.15.18-4-pve #1
[ 846.921514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 846.923138] jbd2/sda1-8 D 0 196 2 0x80000000
[ 846.924465] Call Trace:
[ 846.925372] __schedule+0x3e0/0x870
[ 846.926425] schedule+0x36/0x80
[ 846.927439] cv_wait_common+0x11e/0x140 [spl]
[ 846.928567] ? wait_woken+0x80/0x80
[ 846.929635] __cv_wait+0x15/0x20 [spl]
[ 846.930714] txg_wait_open+0xb0/0x100 [zfs]
[ 846.931826] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 846.932937] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 846.934063] zvol_write+0x175/0x620 [zfs]
[ 846.935090] ? avl_find+0x5f/0xa0 [zavl]
[ 846.936147] zvol_request+0x24a/0x300 [zfs]
[ 846.937292] ? SyS_madvise+0xa20/0xa20
[ 846.938274] generic_make_request+0x123/0x2f0
[ 846.939445] submit_bio+0x73/0x140
[ 846.940395] ? submit_bio+0x73/0x140
[ 846.941384] ? get_swap_bio+0xcf/0x100
[ 846.942427] __swap_writepage+0x345/0x3b0
[ 846.943393] ? __frontswap_store+0x73/0x100
[ 846.944395] swap_writepage+0x34/0x90
[ 846.945419] pageout.isra.53+0x1e5/0x330
[ 846.946424] shrink_page_list+0x955/0xb70
[ 846.947423] shrink_inactive_list+0x256/0x5e0
[ 846.948465] ? next_arg+0x80/0x110
[ 846.949443] shrink_node_memcg+0x365/0x780
[ 846.950518] ? _cond_resched+0x1a/0x50
[ 846.951533] shrink_node+0xe1/0x310
[ 846.952469] ? shrink_node+0xe1/0x310
[ 846.953446] do_try_to_free_pages+0xef/0x360
[ 846.954449] try_to_free_pages+0xf2/0x1b0
[ 846.955670] __alloc_pages_slowpath+0x401/0xf10
[ 846.956738] ? ext4_map_blocks+0x436/0x5d0
[ 846.957745] ? __switch_to_asm+0x34/0x70
[ 846.958753] __alloc_pages_nodemask+0x25b/0x280
[ 846.959807] alloc_pages_current+0x6a/0xe0
[ 846.960860] __page_cache_alloc+0x86/0x90
[ 846.961867] pagecache_get_page+0xab/0x2b0
[ 846.962968] __getblk_gfp+0x109/0x300
[ 846.964009] jbd2_journal_get_descriptor_buffer+0x5e/0xe0
[ 846.965220] journal_submit_commit_record+0x84/0x200
[ 846.966361] ? wait_woken+0x80/0x80
[ 846.967350] jbd2_journal_commit_transaction+0x1299/0x1720
[ 846.968574] ? __switch_to_asm+0x34/0x70
[ 846.969665] ? __switch_to_asm+0x40/0x70
[ 846.970672] ? finish_task_switch+0x74/0x200
[ 846.971787] kjournald2+0xc8/0x260
[ 846.972772] ? kjournald2+0xc8/0x260
[ 846.973770] ? wait_woken+0x80/0x80
[ 846.974696] kthread+0x105/0x140
[ 846.975568] ? commit_timeout+0x20/0x20
[ 846.976555] ? kthread_create_worker_on_cpu+0x70/0x70
[ 846.977751] ? do_syscall_64+0x73/0x130
[ 846.978746] ? SyS_exit_group+0x14/0x20
[ 846.979736] ret_from_fork+0x35/0x40
[ 846.980747] INFO: task systemd-journal:236 blocked for more than 120 seconds.
[ 846.982049] Tainted: P O 4.15.18-4-pve #1
[ 846.983182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 846.984707] systemd-journal D 0 236 1 0x00000120
[ 846.986238] Call Trace:
[ 846.987172] __schedule+0x3e0/0x870
[ 846.988109] ? spl_kmem_zalloc+0xa4/0x190 [spl]
[ 846.989188] schedule+0x36/0x80
[ 846.990073] cv_wait_common+0x11e/0x140 [spl]
[ 846.991083] ? wait_woken+0x80/0x80
[ 846.992044] __cv_wait+0x15/0x20 [spl]
[ 846.993046] txg_wait_open+0xb0/0x100 [zfs]
[ 846.994018] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 846.994985] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 846.996196] zvol_write+0x175/0x620 [zfs]
[ 846.997469] zvol_request+0x24a/0x300 [zfs]
[ 846.998562] ? SyS_madvise+0xa20/0xa20
[ 846.999542] generic_make_request+0x123/0x2f0
[ 847.000637] submit_bio+0x73/0x140
[ 847.001711] ? submit_bio+0x73/0x140
[ 847.002674] ? get_swap_bio+0xcf/0x100
[ 847.003655] __swap_writepage+0x345/0x3b0
[ 847.004788] ? __frontswap_store+0x73/0x100
[ 847.005816] swap_writepage+0x34/0x90
[ 847.006733] pageout.isra.53+0x1e5/0x330
[ 847.007692] shrink_page_list+0x955/0xb70
[ 847.008845] shrink_inactive_list+0x256/0x5e0
[ 847.010100] ? next_arg+0x80/0x110
[ 847.011214] shrink_node_memcg+0x365/0x780
[ 847.012477] shrink_node+0xe1/0x310
[ 847.013530] ? shrink_node+0xe1/0x310
[ 847.014567] do_try_to_free_pages+0xef/0x360
[ 847.015686] try_to_free_pages+0xf2/0x1b0
[ 847.016894] __alloc_pages_slowpath+0x401/0xf10
[ 847.017958] ? __page_cache_alloc+0x86/0x90
[ 847.018954] __alloc_pages_nodemask+0x25b/0x280
[ 847.020080] alloc_pages_current+0x6a/0xe0
[ 847.021240] __page_cache_alloc+0x86/0x90
[ 847.022334] filemap_fault+0x369/0x740
[ 847.023299] ? page_add_file_rmap+0xf7/0x150
[ 847.024324] ? filemap_map_pages+0x369/0x380
[ 847.025381] ext4_filemap_fault+0x31/0x44
[ 847.026388] __do_fault+0x24/0xe3
[ 847.027337] __handle_mm_fault+0xcd7/0x11e0
[ 847.028382] ? ep_read_events_proc+0xd0/0xd0
[ 847.029472] handle_mm_fault+0xce/0x1b0
[ 847.030504] __do_page_fault+0x25e/0x500
[ 847.031518] ? wake_up_q+0x80/0x80
[ 847.032505] do_page_fault+0x2e/0xe0
[ 847.033544] ? async_page_fault+0x2f/0x50
[ 847.034608] do_async_page_fault+0x1a/0x80
[ 847.035740] async_page_fault+0x45/0x50
[ 847.036859] RIP: 0033:0x7f5e14a430c3
[ 847.037888] RSP: 002b:00007ffde5fb7fd0 EFLAGS: 00010293
[ 847.039163] RAX: 0000000000000001 RBX: 000056164a08f1e0 RCX: 00007f5e14a430c3
[ 847.040736] RDX: 000000000000001c RSI: 00007ffde5fb7fe0 RDI: 0000000000000008
[ 847.042378] RBP: 00007ffde5fb8230 R08: 431bde82d7b634db R09: 000000f966b2e318
[ 847.043991] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007ffde5fb7fe0
[ 847.045640] R13: 0000000000000001 R14: ffffffffffffffff R15: 0005750c02d60440
[ 847.047266] INFO: task systemd-udevd:255 blocked for more than 120 seconds.
[ 847.048893] Tainted: P O 4.15.18-4-pve #1
[ 847.050532] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 847.052367] systemd-udevd D 0 255 1 0x00000120
[ 847.053924] Call Trace:
[ 847.054897] __schedule+0x3e0/0x870
[ 847.056031] schedule+0x36/0x80
[ 847.057129] cv_wait_common+0x11e/0x140 [spl]
[ 847.058391] ? wait_woken+0x80/0x80
[ 847.059493] __cv_wait+0x15/0x20 [spl]
[ 847.060683] txg_wait_open+0xb0/0x100 [zfs]
[ 847.061979] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 847.063203] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 847.064447] ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[ 847.065775] dmu_sync_late_arrival+0x53/0x150 [zfs]
[ 847.067099] ? dmu_write_policy+0xca/0x350 [zfs]
[ 847.068394] dmu_sync+0x3b4/0x490 [zfs]
[ 847.069756] ? dbuf_hold+0x33/0x60 [zfs]
[ 847.070998] ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[ 847.072266] zvol_get_data+0x164/0x180 [zfs]
[ 847.073532] zil_commit.part.14+0x451/0x8b0 [zfs]
[ 847.074855] zil_commit+0x17/0x20 [zfs]
[ 847.076040] zvol_write+0x5a2/0x620 [zfs]
[ 847.077285] zvol_request+0x24a/0x300 [zfs]
[ 847.078528] ? SyS_madvise+0xa20/0xa20
[ 847.079677] generic_make_request+0x123/0x2f0
[ 847.080963] submit_bio+0x73/0x140
[ 847.082058] ? submit_bio+0x73/0x140
[ 847.083093] ? get_swap_bio+0xcf/0x100
[ 847.084135] __swap_writepage+0x345/0x3b0
[ 847.085284] ? __frontswap_store+0x73/0x100
[ 847.086424] swap_writepage+0x34/0x90
[ 847.087471] pageout.isra.53+0x1e5/0x330
[ 847.088565] shrink_page_list+0x955/0xb70
[ 847.089688] shrink_inactive_list+0x256/0x5e0
[ 847.090715] ? next_arg+0x80/0x110
[ 847.091670] shrink_node_memcg+0x365/0x780
[ 847.092808] shrink_node+0xe1/0x310
[ 847.093789] ? shrink_node+0xe1/0x310
[ 847.094739] do_try_to_free_pages+0xef/0x360
[ 847.095766] try_to_free_pages+0xf2/0x1b0
[ 847.096834] __alloc_pages_slowpath+0x401/0xf10
[ 847.097863] ? security_file_open+0x90/0xa0
[ 847.098852] ? terminate_walk+0x91/0xf0
[ 847.099919] __alloc_pages_nodemask+0x25b/0x280
[ 847.101087] alloc_pages_current+0x6a/0xe0
[ 847.102101] __get_free_pages+0xe/0x30
[ 847.103044] pgd_alloc+0x1e/0x170
[ 847.103941] mm_init+0x197/0x280
[ 847.104935] copy_process.part.35+0xa50/0x1ab0
[ 847.106029] ? __seccomp_filter+0x49/0x540
[ 847.107024] ? _raw_spin_unlock_bh+0x1e/0x20
[ 847.108080] _do_fork+0xdf/0x3f0
[ 847.109080] ? __secure_computing+0x3f/0x100
[ 847.110242] ? syscall_trace_enter+0xca/0x2e0
[ 847.111348] SyS_clone+0x19/0x20
[ 847.112435] do_syscall_64+0x73/0x130
[ 847.113606] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 847.114968] RIP: 0033:0x7f7ae276a38b
[ 847.116137] RSP: 002b:00007ffea1b4b950 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 847.117728] RAX: ffffffffffffffda RBX: 00007ffea1b4b950 RCX: 00007f7ae276a38b
[ 847.119429] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 847.121162] RBP: 00007ffea1b4b9a0 R08: 00007f7ae39148c0 R09: 0000000000000210
[ 847.122879] R10: 00007f7ae3914b90 R11: 0000000000000246 R12: 0000000000000000
[ 847.124640] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[ 847.126237] INFO: task systemd-timesyn:2623 blocked for more than 120 seconds.
[ 847.127856] Tainted: P O 4.15.18-4-pve #1
[ 847.129293] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 847.130993] systemd-timesyn D 0 2623 1 0x00000120
[ 847.132390] Call Trace:
[ 847.133404] __schedule+0x3e0/0x870
[ 847.134513] ? spl_kmem_zalloc+0xa4/0x190 [spl]
[ 847.135783] schedule+0x36/0x80
[ 847.136876] cv_wait_common+0x11e/0x140 [spl]
[ 847.138128] ? wait_woken+0x80/0x80
[ 847.139215] __cv_wait+0x15/0x20 [spl]
[ 847.140503] txg_wait_open+0xb0/0x100 [zfs]
[ 847.141783] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 847.142980] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 847.144255] zvol_write+0x175/0x620 [zfs]
[ 847.145531] ? avl_find+0x5f/0xa0 [zavl]
[ 847.146818] zvol_request+0x24a/0x300 [zfs]
[ 847.166365] ? SyS_madvise+0xa20/0xa20
[ 847.167706] generic_make_request+0x123/0x2f0
[ 847.169118] submit_bio+0x73/0x140
[ 847.170263] ? submit_bio+0x73/0x140
[ 847.171385] ? get_swap_bio+0xcf/0x100
[ 847.172468] __swap_writepage+0x345/0x3b0
[ 847.173624] ? __frontswap_store+0x73/0x100
[ 847.174690] swap_writepage+0x34/0x90
[ 847.175697] pageout.isra.53+0x1e5/0x330
[ 847.176840] shrink_page_list+0x955/0xb70
[ 847.177948] shrink_inactive_list+0x256/0x5e0
[ 847.179101] ? next_arg+0x80/0x110
[ 847.180086] shrink_node_memcg+0x365/0x780
[ 847.181168] shrink_node+0xe1/0x310
[ 847.182126] ? shrink_node+0xe1/0x310
[ 847.183114] do_try_to_free_pages+0xef/0x360
[ 847.184216] try_to_free_pages+0xf2/0x1b0
[ 847.185362] __alloc_pages_slowpath+0x401/0xf10
[ 847.186524] ? dev_queue_xmit+0x10/0x20
[ 847.187576] ? br_dev_queue_push_xmit+0x7a/0x140
[ 847.188691] __alloc_pages_nodemask+0x25b/0x280
[ 847.189891] alloc_pages_vma+0x88/0x1c0
[ 847.190975] __read_swap_cache_async+0x147/0x200
[ 847.192166] read_swap_cache_async+0x2b/0x60
[ 847.193337] swapin_readahead+0x22f/0x2b0
[ 847.194369] ? radix_tree_lookup_slot+0x22/0x50
[ 847.195408] ? find_get_entry+0x1e/0x100
[ 847.196396] ? pagecache_get_page+0x2c/0x2b0
[ 847.197482] do_swap_page+0x52d/0x9b0
[ 847.198517] ? do_swap_page+0x52d/0x9b0
[ 847.199527] ? crypto_shash_update+0x47/0x130
[ 847.200539] __handle_mm_fault+0x88d/0x11e0
[ 847.201564] ? jbd2_journal_dirty_metadata+0x22d/0x290
[ 847.202735] handle_mm_fault+0xce/0x1b0
[ 847.203761] __do_page_fault+0x25e/0x500
[ 847.204766] ? __switch_to_asm+0x34/0x70
[ 847.205728] do_page_fault+0x2e/0xe0
[ 847.206648] do_async_page_fault+0x1a/0x80
[ 847.207715] async_page_fault+0x25/0x50
[ 847.208725] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[ 847.209764] RSP: 0000:ffffad16810c3d80 EFLAGS: 00010246
[ 847.210821] RAX: 0000000000000001 RBX: ffffad16810c3df8 RCX: 00007fffc09eaec0
[ 847.212099] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[ 847.213504] RBP: ffffad16810c3dd8 R08: 00000000000002c6 R09: ffff9d25fb9f5f98
[ 847.214822] R10: ffffad16810c3cc0 R11: 0000000000000038 R12: 0000000000000000
[ 847.216193] R13: ffffad16810c3e78 R14: ffff9d25fb9b89c0 R15: ffff9d25fb9f5f98
[ 847.217572] ? ep_read_events_proc+0xd0/0xd0
[ 847.218601] ep_scan_ready_list.constprop.18+0x9e/0x210
[ 847.219743] ep_poll+0x1f8/0x3b0
[ 847.220735] ? wake_up_q+0x80/0x80
[ 847.221659] SyS_epoll_wait+0xce/0xf0
[ 847.222683] do_syscall_64+0x73/0x130
[ 847.223681] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 847.224859] RIP: 0033:0x7fe0d40920c3
[ 847.225852] RSP: 002b:00007fffc09eaeb0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[ 847.227252] RAX: ffffffffffffffda RBX: 000056492a0138f0 RCX: 00007fe0d40920c3
[ 847.228575] RDX: 0000000000000006 RSI: 00007fffc09eaec0 RDI: 0000000000000004
[ 847.229952] RBP: 00007fffc09eb010 R08: 431bde82d7b634db R09: 00000000ffffffff
[ 847.231397] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007fffc09eaec0
[ 847.232823] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000000
[ 847.234199] INFO: task rpcbind:2624 blocked for more than 120 seconds.
[ 847.235669] Tainted: P O 4.15.18-4-pve #1
[ 847.237042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 847.242089] rpcbind D 0 2624 1 0x00000000
[ 847.244049] Call Trace:
[ 847.245324] __schedule+0x3e0/0x870
[ 847.246883] schedule+0x36/0x80
[ 847.247957] cv_wait_common+0x11e/0x140 [spl]
[ 847.249063] ? wait_woken+0x80/0x80
[ 847.250169] __cv_wait+0x15/0x20 [spl]
[ 847.251423] txg_wait_open+0xb0/0x100 [zfs]
[ 847.252706] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 847.254264] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 847.255518] zvol_write+0x175/0x620 [zfs]
[ 847.256706] zvol_request+0x24a/0x300 [zfs]
[ 847.258092] ? SyS_madvise+0xa20/0xa20
[ 847.259682] generic_make_request+0x123/0x2f0
[ 847.261515] submit_bio+0x73/0x140
[ 847.263119] ? submit_bio+0x73/0x140
[ 847.264676] ? get_swap_bio+0xcf/0x100
[ 847.266047] __swap_writepage+0x345/0x3b0
[ 847.267622] ? __frontswap_store+0x73/0x100
[ 847.269369] swap_writepage+0x34/0x90
[ 847.270991] pageout.isra.53+0x1e5/0x330
[ 847.272710] shrink_page_list+0x955/0xb70
[ 847.274195] shrink_inactive_list+0x256/0x5e0
[ 847.275618] ? next_arg+0x80/0x110
[ 847.276909] shrink_node_memcg+0x365/0x780
[ 847.278354] shrink_node+0xe1/0x310
[ 847.279636] ? shrink_node+0xe1/0x310
[ 847.280971] do_try_to_free_pages+0xef/0x360
[ 847.282448] try_to_free_pages+0xf2/0x1b0
[ 847.283825] __alloc_pages_slowpath+0x401/0xf10
[ 847.285329] ? __hrtimer_init+0xa0/0xa0
[ 847.286574] ? poll_freewait+0x4a/0xb0
[ 847.287577] __alloc_pages_nodemask+0x25b/0x280
[ 847.288654] alloc_pages_current+0x6a/0xe0
[ 847.289718] __page_cache_alloc+0x86/0x90
[ 847.290707] __do_page_cache_readahead+0x10e/0x2d0
[ 847.291834] ? radix_tree_lookup_slot+0x22/0x50
[ 847.293072] ? __intel_pmu_enable_all.constprop.19+0x4d/0x80
[ 847.294357] ? find_get_entry+0x1e/0x100
[ 847.295449] filemap_fault+0x571/0x740
[ 847.296492] ? filemap_fault+0x571/0x740
[ 847.297612] ? filemap_map_pages+0x180/0x380
[ 847.298682] ext4_filemap_fault+0x31/0x44
[ 847.299784] __do_fault+0x24/0xe3
[ 847.300829] __handle_mm_fault+0xcd7/0x11e0
[ 847.301910] handle_mm_fault+0xce/0x1b0
[ 847.302926] __do_page_fault+0x25e/0x500
[ 847.303946] ? ktime_get_ts64+0x51/0xf0
[ 847.305059] do_page_fault+0x2e/0xe0
[ 847.306133] ? async_page_fault+0x2f/0x50
[ 847.307158] do_async_page_fault+0x1a/0x80
[ 847.308292] async_page_fault+0x45/0x50
[ 847.309415] RIP: 0033:0x7fc96419d660
[ 847.310441] RSP: 002b:00007fff1212c788 EFLAGS: 00010246
[ 847.311917] RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007fc96419d660
[ 847.313598] RDX: 0000000000007530 RSI: 0000000000000007 RDI: 00007fff1212c9d0
[ 847.315185] RBP: 0000000000000007 R08: 00000000000000c3 R09: 000000000000000b
[ 847.316666] R10: 00007fff1212c700 R11: 0000000000000246 R12: 000000000000000c
[ 847.318208] R13: 000000000000000c R14: 0000558af8c4d480 R15: 00007fff1212c810
[ 847.319523] INFO: task systemd-logind:2759 blocked for more than 120 seconds.
[ 847.320812] Tainted: P O 4.15.18-4-pve #1
[ 847.322042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 847.323441] systemd-logind D 0 2759 1 0x00000120
[ 847.324575] Call Trace:
[ 847.325493] __schedule+0x3e0/0x870
[ 847.326482] schedule+0x36/0x80
[ 847.327482] cv_wait_common+0x11e/0x140 [spl]
[ 847.328683] ? wait_woken+0x80/0x80
[ 847.329778] __cv_wait+0x15/0x20 [spl]
[ 847.330970] txg_wait_open+0xb0/0x100 [zfs]
[ 847.332121] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 847.333402] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 847.334552] zvol_write+0x175/0x620 [zfs]
[ 847.335630] zvol_request+0x24a/0x300 [zfs]
[ 847.336841] ? SyS_madvise+0xa20/0xa20
[ 847.337971] generic_make_request+0x123/0x2f0
[ 847.339185] submit_bio+0x73/0x140
[ 847.340308] ? submit_bio+0x73/0x140
[ 847.341486] ? get_swap_bio+0xcf/0x100
[ 847.342568] __swap_writepage+0x345/0x3b0
[ 847.343733] ? __frontswap_store+0x73/0x100
[ 847.344923] swap_writepage+0x34/0x90
[ 847.345982] pageout.isra.53+0x1e5/0x330
[ 847.347115] shrink_page_list+0x955/0xb70
[ 847.348290] shrink_inactive_list+0x256/0x5e0
[ 847.349570] ? virtqueue_add_sgs+0x3b0/0x490
[ 847.350704] ? next_arg+0x80/0x110
[ 847.351733] shrink_node_memcg+0x365/0x780
[ 847.352902] shrink_node+0xe1/0x310
[ 847.353928] ? shrink_node+0xe1/0x310
[ 847.355200] do_try_to_free_pages+0xef/0x360
[ 847.356538] try_to_free_pages+0xf2/0x1b0
[ 847.357638] __alloc_pages_slowpath+0x401/0xf10
[ 847.358760] ? __update_load_avg_se.isra.38+0x1bc/0x1d0
[ 847.360134] ? __switch_to_asm+0x40/0x70
[ 847.361353] ? __switch_to_asm+0x34/0x70
[ 847.362495] ? __switch_to_asm+0x34/0x70
[ 847.363611] ? __switch_to_asm+0x40/0x70
[ 847.364668] __alloc_pages_nodemask+0x25b/0x280
[ 847.365826] alloc_pages_vma+0x88/0x1c0
[ 847.366881] __read_swap_cache_async+0x147/0x200
[ 847.368331] read_swap_cache_async+0x2b/0x60
[ 847.369577] swapin_readahead+0x22f/0x2b0
[ 847.370702] ? radix_tree_lookup_slot+0x22/0x50
[ 847.371913] ? find_get_entry+0x1e/0x100
[ 847.373118] ? pagecache_get_page+0x2c/0x2b0
[ 847.374463] do_swap_page+0x52d/0x9b0
[ 847.375610] ? do_swap_page+0x52d/0x9b0
[ 847.376802] ? __wake_up_sync_key+0x1e/0x30
[ 847.378066] __handle_mm_fault+0x88d/0x11e0
[ 847.379291] handle_mm_fault+0xce/0x1b0
[ 847.380417] __do_page_fault+0x25e/0x500
[ 847.381577] ? __switch_to_asm+0x34/0x70
[ 847.382513] do_page_fault+0x2e/0xe0
[ 847.383370] do_async_page_fault+0x1a/0x80
[ 847.384302] async_page_fault+0x25/0x50
[ 847.385225] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[ 847.386258] RSP: 0000:ffffad168108bd80 EFLAGS: 00010246
[ 847.387335] RAX: 0000000000000001 RBX: ffffad168108bdf8 RCX: 00007fffea7798b0
[ 847.388657] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[ 847.390106] RBP: ffffad168108bdd8 R08: 000000000000030a R09: ffff9d25f5dfc218
[ 847.391442] R10: ffffad168108bcc0 R11: 0000000000000000 R12: 0000000000000000
[ 847.392743] R13: ffffad168108be78 R14: ffff9d25faea20c0 R15: ffff9d25f5dfc218
[ 847.394189] ? ep_read_events_proc+0xd0/0xd0
[ 847.395244] ep_scan_ready_list.constprop.18+0x9e/0x210
[ 847.396415] ep_poll+0x1f8/0x3b0
[ 847.397387] ? wake_up_q+0x80/0x80
[ 847.398331] SyS_epoll_wait+0xce/0xf0
[ 847.399313] do_syscall_64+0x73/0x130
[ 847.400232] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 847.401329] RIP: 0033:0x7fc20e6080a3
[ 847.402292] RSP: 002b:00007fffea7798a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[ 847.403817] RAX: ffffffffffffffda RBX: 0000560a68ae1260 RCX: 00007fc20e6080a3
[ 847.405311] RDX: 000000000000000d RSI: 00007fffea7798b0 RDI: 0000000000000004
[ 847.406765] RBP: 00007fffea779a50 R08: 0000560a68aed6e0 R09: 0000000000000000
[ 847.408354] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007fffea7798b0
[ 847.410017] R13: 0000000000000001 R14: ffffffffffffffff R15: 00007fffea779b50
[ 847.411383] INFO: task qemu-ga:2842 blocked for more than 120 seconds.
[ 847.413009] Tainted: P O 4.15.18-4-pve #1
[ 847.414513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 847.416280] qemu-ga D 0 2842 1 0x00000000
[ 847.417779] Call Trace:
[ 847.418769] __schedule+0x3e0/0x870
[ 847.419916] schedule+0x36/0x80
[ 847.421044] cv_wait_common+0x11e/0x140 [spl]
[ 847.422126] ? wait_woken+0x80/0x80
[ 847.423130] __cv_wait+0x15/0x20 [spl]
[ 847.424295] zil_commit.part.14+0x86/0x8b0 [zfs]
[ 847.425726] ? spl_kmem_free+0x33/0x40 [spl]
[ 847.427098] ? zfs_range_unlock+0x1b3/0x2e0 [zfs]
[ 847.428477] zil_commit+0x17/0x20 [zfs]
[ 847.429738] zvol_write+0x5a2/0x620 [zfs]
[ 847.430951] ? avl_find+0x5f/0xa0 [zavl]
[ 847.432198] zvol_request+0x24a/0x300 [zfs]
[ 847.433506] ? SyS_madvise+0xa20/0xa20
[ 847.434707] generic_make_request+0x123/0x2f0
[ 847.435984] submit_bio+0x73/0x140
[ 847.437152] ? submit_bio+0x73/0x140
[ 847.438330] ? get_swap_bio+0xcf/0x100
[ 847.439487] __swap_writepage+0x345/0x3b0
[ 847.440700] ? __frontswap_store+0x73/0x100
[ 847.441983] swap_writepage+0x34/0x90
[ 847.443175] pageout.isra.53+0x1e5/0x330
[ 847.444443] shrink_page_list+0x955/0xb70
[ 847.445809] shrink_inactive_list+0x256/0x5e0
[ 847.447055] ? next_arg+0x80/0x110
[ 847.448177] shrink_node_memcg+0x365/0x780
[ 847.449443] shrink_node+0xe1/0x310
[ 847.450563] ? shrink_node+0xe1/0x310
[ 847.451733] do_try_to_free_pages+0xef/0x360
[ 847.452995] try_to_free_pages+0xf2/0x1b0
[ 847.454159] __alloc_pages_slowpath+0x401/0xf10
[ 847.455441] ? __page_cache_alloc+0x86/0x90
[ 847.456679] __alloc_pages_nodemask+0x25b/0x280
[ 847.457968] alloc_pages_current+0x6a/0xe0
[ 847.459194] __page_cache_alloc+0x86/0x90
[ 847.460374] filemap_fault+0x369/0x740
[ 847.461542] ? __switch_to_asm+0x40/0x70
[ 847.462696] ? __switch_to_asm+0x34/0x70
[ 847.463843] ? filemap_map_pages+0x180/0x380
[ 847.465061] ? __switch_to_asm+0x34/0x70
[ 847.466211] ext4_filemap_fault+0x31/0x44
[ 847.467346] __do_fault+0x24/0xe3
[ 847.468398] __handle_mm_fault+0xcd7/0x11e0
[ 847.469631] handle_mm_fault+0xce/0x1b0
[ 847.470750] __do_page_fault+0x25e/0x500
[ 847.471897] do_page_fault+0x2e/0xe0
[ 847.472997] ? async_page_fault+0x2f/0x50
[ 847.474089] do_async_page_fault+0x1a/0x80
[ 847.475205] async_page_fault+0x45/0x50
[ 847.476285] RIP: 0033:0x55c573934927
[ 847.477358] RSP: 002b:00007ffee4db6630 EFLAGS: 00010206
[ 847.478611] RAX: 0000000000000000 RBX: 000055c575723e20 RCX: 00007ff976bbc270
[ 847.480176] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffee4db6610
[ 847.481798] RBP: 00007ffee4db6640 R08: 000055c57572a0c0 R09: 0000000000000000
[ 847.483337] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c57572a9a0
[ 847.484893] R13: 000055c57572a2a0 R14: 000055c57572ad60 R15: 00007ff97766b610
[ 847.486492] INFO: task rrdcached:3267 blocked for more than 120 seconds.
[ 847.487976] Tainted: P O 4.15.18-4-pve #1
[ 847.489368] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 847.491033] rrdcached D 0 3267 1 0x00000000
[ 847.492429] Call Trace:
[ 847.493452] __schedule+0x3e0/0x870
[ 847.494556] schedule+0x36/0x80
[ 847.495609] cv_wait_common+0x11e/0x140 [spl]
[ 847.496841] ? wait_woken+0x80/0x80
[ 847.497976] __cv_wait+0x15/0x20 [spl]
[ 847.499154] txg_wait_open+0xb0/0x100 [zfs]
[ 847.500355] dmu_tx_wait+0x389/0x3a0 [zfs]
[ 847.501566] dmu_tx_assign+0x8b/0x4a0 [zfs]
[ 847.502758] zvol_write+0x175/0x620 [zfs]
[ 847.503976] zvol_request+0x24a/0x300 [zfs]
[ 847.505245] ? SyS_madvise+0xa20/0xa20
[ 847.506371] generic_make_request+0x123/0x2f0
[ 847.507549] submit_bio+0x73/0x140
[ 847.508608] ? submit_bio+0x73/0x140
[ 847.509801] ? get_swap_bio+0xcf/0x100
[ 847.510921] __swap_writepage+0x345/0x3b0
[ 847.512140] ? __frontswap_store+0x73/0x100
[ 847.513384] swap_writepage+0x34/0x90
[ 847.514525] pageout.isra.53+0x1e5/0x330
[ 847.515632] shrink_page_list+0x955/0xb70
[ 847.516772] shrink_inactive_list+0x256/0x5e0
[ 847.518015] ? syscall_return_via_sysret+0x1a/0x7a
[ 847.519363] ? next_arg+0x80/0x110
[ 847.520513] shrink_node_memcg+0x365/0x780
[ 847.521717] shrink_node+0xe1/0x310
[ 847.522846] ? shrink_node+0xe1/0x310
[ 847.523949] do_try_to_free_pages+0xef/0x360
[ 847.525182] try_to_free_pages+0xf2/0x1b0
[ 847.526436] __alloc_pages_slowpath+0x401/0xf10
[ 847.527735] ? __page_cache_alloc+0x86/0x90
[ 847.528986] __alloc_pages_nodemask+0x25b/0x280
[ 847.530283] alloc_pages_current+0x6a/0xe0
[ 847.531307] __page_cache_alloc+0x86/0x90
[ 847.532306] filemap_fault+0x369/0x740
[ 847.533308] ? filemap_map_pages+0x27a/0x380
[ 847.534323] ext4_filemap_fault+0x31/0x44
[ 847.535333] __do_fault+0x24/0xe3
[ 847.536304] __handle_mm_fault+0xcd7/0x11e0
[ 847.537400] ? update_cfs_group+0xc4/0xe0
[ 847.538423] handle_mm_fault+0xce/0x1b0
[ 847.539484] __do_page_fault+0x25e/0x500
[ 847.540584] ? ktime_get_ts64+0x51/0xf0
[ 847.541690] do_page_fault+0x2e/0xe0
[ 847.542714] ? async_page_fault+0x2f/0x50
[ 847.543785] do_async_page_fault+0x1a/0x80
[ 847.544885] async_page_fault+0x45/0x50
[ 847.545894] RIP: 0033:0x7f40f2d6667d
[ 847.546870] RSP: 002b:00007fffcec77460 EFLAGS: 00010293
[ 847.548075] RAX: 0000000000000000 RBX: 0000556934dc0c10 RCX: 00007f40f2d6667d
[ 847.549599] RDX: 00000000000003e8 RSI: 0000000000000001 RDI: 0000556934dae6e0
[ 847.551060] RBP: 0000000000000001 R08: 00007f40e9f24700 R09: 00007f40e9f24700
[ 847.552554] R10: 00000000000008ba R11: 0000000000000293 R12: 0000556934c28360
[ 847.554038] R13: 0000556934dae6e0 R14: 0000000000000000 R15: 0000000000000001
```
If possible, could you dump the stack from the `txg_sync` process when this happens?
Managed to get the following by running `sleep 60; echo t > /proc/sysrq-trigger` before starting `stress`:
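One way to arrange that (a sketch; the original comment does not give the exact invocation) is to schedule the trigger in the background before starting the memory hog:
```
# Schedule a SysRq task dump ~60s from now, then start stress.
(sleep 60; echo t > /proc/sysrq-trigger) &
stress --vm 10 --vm-keep --vm-bytes 512M
```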
```
[10981.102235] txg_sync R running task 0 606 2 0x80000000
[10981.103613] Call Trace:
[10981.104615] <IRQ>
[10981.105989] sched_show_task+0xfe/0x130
[10981.107636] show_state_filter+0x62/0xd0
[10981.108730] sysrq_handle_showstate+0x10/0x20
[10981.109813] __handle_sysrq+0x106/0x170
[10981.110859] sysrq_filter+0x9b/0x390
[10981.111841] input_to_handler+0x61/0x100
[10981.112901] input_pass_values.part.6+0x117/0x130
[10981.114128] input_handle_event+0x137/0x510
[10981.115171] input_event+0x54/0x80
[10981.116220] atkbd_interrupt+0x4b6/0x7a0
[10981.117176] serio_interrupt+0x4c/0x90
[10981.118211] i8042_interrupt+0x1f4/0x3b0
[10981.119180] __handle_irq_event_percpu+0x84/0x1a0
[10981.120324] handle_irq_event_percpu+0x32/0x80
[10981.121878] handle_irq_event+0x3b/0x60
[10981.123547] handle_edge_irq+0x78/0x1a0
[10981.124862] handle_irq+0x20/0x30
[10981.125899] do_IRQ+0x4e/0xd0
[10981.126730] common_interrupt+0x84/0x84
[10981.127697] </IRQ>
[10981.128422] RIP: 0010:page_vma_mapped_walk+0x49e/0x770
[10981.129488] RSP: 0000:ffffc3310106f188 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd6
[10981.130916] RAX: ffff9e7c795f6a80 RBX: ffffc3310106f1d8 RCX: ffff9e7c40000a80
[10981.132363] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 0000000000000000
[10981.133810] RBP: ffffc3310106f1b8 R08: 0000000000000067 R09: 0000000000000000
[10981.135262] R10: 0000000000000000 R11: 0000000000000091 R12: ffffe7bd40949440
[10981.136754] R13: ffff9e7c79468cc0 R14: 0000000000000988 R15: ffffc3310106f1d8
[10981.138596] ? page_vma_mapped_walk+0x204/0x770
[10981.140413] page_referenced_one+0x91/0x190
[10981.141780] rmap_walk_anon+0x113/0x270
[10981.142918] rmap_walk+0x48/0x60
[10981.143998] page_referenced+0x10d/0x170
[10981.145189] ? rmap_walk_anon+0x270/0x270
[10981.146331] ? page_get_anon_vma+0x80/0x80
[10981.147523] shrink_page_list+0x791/0xb70
[10981.148737] shrink_inactive_list+0x256/0x5e0
[10981.150007] shrink_node_memcg+0x365/0x780
[10981.151191] ? _cond_resched+0x1a/0x50
[10981.152316] shrink_node+0xe1/0x310
[10981.153384] ? shrink_node+0xe1/0x310
[10981.154739] do_try_to_free_pages+0xef/0x360
[10981.155776] try_to_free_pages+0xf2/0x1b0
[10981.156747] __alloc_pages_slowpath+0x401/0xf10
[10981.157910] __alloc_pages_nodemask+0x25b/0x280
[10981.159112] alloc_pages_current+0x6a/0xe0
[10981.160480] new_slab+0x317/0x690
[10981.161439] ? __blk_mq_get_tag+0x21/0x90
[10981.162485] ___slab_alloc+0x3c1/0x4e0
[10981.163458] ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.164520] ? update_load_avg+0x665/0x700
[10981.165545] ? update_load_avg+0x665/0x700
[10981.166545] __slab_alloc+0x20/0x40
[10981.167472] ? __enqueue_entity+0x5c/0x60
[10981.168405] ? __slab_alloc+0x20/0x40
[10981.169297] __kmalloc_node+0xe4/0x2b0
[10981.170362] ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.171331] spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.172380] dbuf_dirty+0x20c/0x880 [zfs]
[10981.173440] ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.174606] dbuf_dirty+0x74e/0x880 [zfs]
[10981.175719] dnode_setdirty+0xbe/0x100 [zfs]
[10981.176841] dbuf_dirty+0x46c/0x880 [zfs]
[10981.177809] dmu_buf_will_dirty+0x11c/0x130 [zfs]
[10981.178814] dsl_dataset_sync+0x26/0x230 [zfs]
[10981.179801] dsl_pool_sync+0x9f/0x430 [zfs]
[10981.180747] spa_sync+0x42d/0xd50 [zfs]
[10981.181653] txg_sync_thread+0x2d4/0x4a0 [zfs]
[10981.182655] ? finish_task_switch+0x74/0x200
[10981.183603] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[10981.184578] thread_generic_wrapper+0x74/0x90 [spl]
[10981.185537] kthread+0x105/0x140
[10981.186386] ? __thread_exit+0x20/0x20 [spl]
[10981.187333] ? kthread_create_worker_on_cpu+0x70/0x70
[10981.188470] ? do_syscall_64+0x73/0x130
[10981.189420] ? kthread_create_worker_on_cpu+0x70/0x70
[10981.190450] ret_from_fork+0x35/0x40
```
I decided to give this a try and, as advertised, the deadlock is quite easy to reproduce. First, in a 4.14 kernel, it's interesting to note that a bunch of the "page allocation stalls for ..." messages are generated. That particular message, however, was removed in later kernels because it was determined that merely trying to print the message could, itself, cause deadlocks.
In a 4.16 kernel, which does not have the message, the processes are in quite a few different states. Among other things, an interesting observation is that the "zvol" tasks are generally in the running (R) state, while most of the regular user processes are sitting in uninterruptible sleep (D). This sshd process is typical ("?" lines elided for brevity):
```
[ 6212.296211] sshd D 0 3018 2976 0x00000000
[ 6212.296211] Call Trace:
[ 6212.296211] schedule+0x32/0x80
[ 6212.296211] schedule_timeout+0x159/0x330
[ 6212.296211] io_schedule_timeout+0x19/0x40
[ 6212.296211] mempool_alloc+0x121/0x150
[ 6212.296211] bvec_alloc+0x86/0xe0
[ 6212.296211] bio_alloc_bioset+0x153/0x200
[ 6212.296211] get_swap_bio+0x40/0xd0
[ 6212.296211] __swap_writepage+0x292/0x360
[ 6212.296211] pageout.isra.51+0x1bb/0x300
[ 6212.296211] shrink_page_list+0x959/0xbd0
[ 6212.296211] shrink_inactive_list+0x2a8/0x630
[ 6212.296211] shrink_node_memcg+0x37f/0x7d0
[ 6212.296211] shrink_node+0xcc/0x300
[ 6212.296211] do_try_to_free_pages+0xdb/0x330
[ 6212.296211] try_to_free_pages+0xd5/0x1a0
[ 6212.296211] __alloc_pages_slowpath+0x37c/0xd80
[ 6212.296211] __alloc_pages_nodemask+0x227/0x260
[ 6212.296211] alloc_pages_vma+0x7c/0x1e0
[ 6212.296211] __read_swap_cache_async+0x149/0x1c0
[ 6212.296211] do_swap_page_readahead+0xc9/0x1a0
[ 6212.296211] do_swap_page+0x5a5/0x820
[ 6212.296211] __handle_mm_fault+0x91a/0x1040
[ 6212.296211] handle_mm_fault+0xe4/0x1f0
[ 6212.296211] __do_page_fault+0x242/0x4d0
[ 6212.296211] do_page_fault+0x2c/0x110
[ 6212.296211] page_fault+0x25/0x50
```
Other dormant user processes, such as agetty, all have pretty much identical stacks.
As far as ZFS is concerned, in the current deadlocked system I've got running, the only blocked process is "txg_quiesce" with a rather uninteresting stack of:
```
[ 5891.836341] txg_quiesce D 0 2457 2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341] schedule+0x32/0x80
[ 5891.836341] cv_wait_common+0x129/0x260 [spl]
[ 5891.836341] txg_quiesce_thread+0x2fc/0x500 [zfs]
[ 5891.836341] thread_generic_wrapper+0x76/0xb0 [spl]
[ 5891.836341] kthread+0xf5/0x130
[ 5891.836341] ret_from_fork+0x35/0x40
```
which is in the `cv_wait(&tc->tc_cv[g], &tc->tc_lock);` at the bottom of `txg_quiesce()`.
The most interesting ZFS-related thing is that all the zvol threads are spinning (in the "R" state). They all appear to be in their `while (uio.uio_resid > 0 ...)` loop calling `dmu_write_uio_dnode()` and show this typical stack:
```
[ 5891.836341] zvol R running task 0 3034 2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341] ? page_referenced+0xb9/0x160
[ 5891.836341] ? shrink_page_list+0x467/0xbd0
[ 5891.836341] ? shrink_inactive_list+0x2b5/0x630
[ 5891.836341] ? shrink_node_memcg+0x37f/0x7d0
[ 5891.836341] ? __list_lru_init+0x170/0x200
[ 5891.836341] ? shrink_node+0xcc/0x300
[ 5891.836341] ? shrink_node+0xcc/0x300
[ 5891.836341] ? do_try_to_free_pages+0xdb/0x330
[ 5891.836341] ? try_to_free_pages+0xd5/0x1a0
[ 5891.836341] ? __alloc_pages_slowpath+0x37c/0xd80
[ 5891.836341] ? get_page_from_freelist+0x145/0x1190
[ 5891.836341] ? __alloc_pages_nodemask+0x227/0x260
[ 5891.836341] ? abd_alloc+0x1e4/0x4d0 [zfs]
[ 5891.836341] ? arc_hdr_alloc_abd+0x5b/0x230 [zfs]
[ 5891.836341] ? arc_hdr_alloc+0x161/0x390 [zfs]
[ 5891.836341] ? arc_alloc_buf+0x38/0x110 [zfs]
[ 5891.836341] ? dmu_buf_will_fill+0x122/0x3b0 [zfs]
[ 5891.836341] ? dmu_write_uio_dnode+0xe1/0x1a0 [zfs]
[ 5891.836341] ? zvol_write+0x191/0x6c0 [zfs]
```
I'm going to try a couple of things: First, set `zvol_request_sync=1` to see what difference it might make. Second, add some instrumentation to the zvol path to see whether it might actually be making some forward progress, albeit very slowly.
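For anyone wanting to repeat the first experiment, `zvol_request_sync` is a module parameter and can be flipped at runtime (paths assume a standard ZFS-on-Linux install):
```
# Handle zvol requests synchronously in the submitting thread.
echo 1 > /sys/module/zfs/parameters/zvol_request_sync
# Or persistently across module reloads (file name is arbitrary):
echo "options zfs zvol_request_sync=1" > /etc/modprobe.d/zfs-zvol.conf
```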
Naively, I don't see how swap can be expected to work as things are; I think we should document/label this as such.
@behlendorf given the amount of code complexity and potentially allocations required is supporting swap even realistic?
@cwedgwood It's definitely been a challenging area to get working and is not heavily stress-tested on a wide range of kernels. It was working reliably with older ZFS releases and older kernels, but as you say, given the code complexity and the surface area we'd need to test, it's not something I'd recommend. I tend to agree we should make this clearer in the documentation.
I am using this (swap on zfs) on several production systems (home and work) with recent kernels. Thankfully it has not been an issue for me yet.
Same issue on Gentoo with ZFS 0.8.0-rc1 and native encryption.
@inpos Out of curiosity, how did you get 0.8.0-rc1 on Gentoo? I have not finished my review of it, so I haven't pushed it to the main tree yet. I might just push it unkeyworded if people would otherwise resort to building outside the package manager.
@behlendorf This was never 100% sane, although the return of the zvol threads certainly did not help. I started (but did not finish) writing a patch that could help with this if finished:
https://paste.pound-python.org/show/KWlvrHdBU2mA9ev2odXL/
At this point, it is fairly clear to me that my offline obligations prevent me from speculating on if/when I will finish that, but other developers should be able to see the idea there. It ought to help if/when finished. The current version will NOT compile, so I would prefer it if users did not attempt it. Trying to compile it would be a waste of their time.
Alternatively, there is a nuclear option for resolving this. If we treat swap like Illumos treats dump devices by disabling CoW and checksums, things will definitely work. That would be at the expense of protection against bitrot on the swap. This requires modifications to the code because the current codebase is not able to support a zvol in that configuration. Quite honestly, if we were going for the nuclear option, I'd prefer to create a new "dataset" type and implement extent based allocation.
@ahrens Not that I am seriously considering this, but it would be nice to hear your thoughts on the nuclear option for dealing with swap.
On second thought, given that swap devices are simultaneously dump devices on Linux, it might make sense to implement support for creating/writing to Illumos dump devices. That would probably be a good middle ground here.
I'm not an expert in this area of the code, but I think that swap on ZVOL is inherently unreliable due to writes to the swap ZVOL having to go through the normal TXG sync and ZIO write paths, which can require lots of memory allocations by design (and these memory allocations can stall due to a low memory situation). I believe this to be true for swap on ZVOL for illumos, as well as Linux, and presumably FreeBSD too (although I have no experience using it on FreeBSD, so I could be wrong).
I think the "proper" way to address this is to mimic the write path for a ZVOL dump device on illumos, for the ZVOL swap device. i.e. preallocate the ZVOL's blocks on disk, then do a "direct write" to the preallocated LBA such that it doesn't go through the normal TXG sync and ZIO write code paths. This would significantly reduce the amount of work (e.g. memory allocations) that's required to write a swap page to disk, thereby increasing the reliability of that write.
The drawback to this approach would be that we won't get the data consistency guarantees that we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).
I spent a couple hours reading the ZVOL dump device write codepaths during the OpenZFS dev summit last week, and I think this approach is viable. It'll require porting the code over to work on Linux, where we'll need to rework the code to work with the Linux block device layer, since (IIRC) that's what's used to issue the writes to disk (see zvol_dumpio_vdev() in illumos). Additionally, we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos); for illumos dump devices, this occurs via the DKIOCDUMPINIT ioctl (when zvol_dumpify() is called).
With all that said, I've only spent an hour or two looking into this and I don't have any prior experience with swap, so I may be overlooking something or be flat out wrong in my analysis so far. Also, I won't have any time in the near future to look into this in detail, but I could potentially help answer questions if somebody else wants to try to prototype the changes I'm proposing.
we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos);
I took a look at mkswap.c and aside from writing a unique signature to the device it doesn't do anything which would let ZFS easily identify it as a swap device. One solution would be to add a udev helper which detects ZFS volumes with the swap signature and calls zvol_prealloc() via an ioctl(). This _should_ allow things to "just work" on most Linux distributions, though obviously it's Linux specific. We should probably do something a bit more generic and reliable and add a new create-time-only property for volumes. This wouldn't even necessarily prevent us from adding the udev hook in the future.
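As a very rough illustration of the udev half of that idea: a rule keyed off the swap signature that blkid already exposes could hand matching zvols to a helper. The helper named below is hypothetical, and the preallocation ioctl it would call does not exist yet; this only shows where such a hook would sit.
```
# Illustrative only: zvol-swap-prealloc is a hypothetical helper.
cat > /etc/udev/rules.d/90-zvol-swap.rules <<'EOF'
ACTION=="add|change", KERNEL=="zd*", ENV{ID_FS_TYPE}=="swap", RUN+="/usr/local/sbin/zvol-swap-prealloc $env{DEVNAME}"
EOF
udevadm control --reload-rules
```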
@behlendorf that's unfortunate; given there's no good entry point to do the preallocation, your two suggestions seem reasonable at first glance.
Using a new ioctl (or create-time property) to preallocate and "convert" a ZVOL to a "direct write" ZVOL might be a neat feature, as it could enable this use case of swap, but also other use cases where the user would like to forego the COW semantics (e.g. a zpool layered on top of ZVOLs).
Maybe this wiki page https://github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-use-a-zvol-as-a-swap-device should be edited to warn about this issue?
I took the liberty of amending the wiki (I've been affected by this as well). It's quite reproducible in v0.6.5.9-5~bpo8+1 running on 3.16.0-7-amd64 (Debian 3.16.59-1), so I look forward to an eventual fix. I've found no sustainable workaround (other than to stop using zvols for swap).
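For anyone who wants to apply that workaround, backing out a zvol-based swap is straightforward (dataset name assumed from the earlier examples; also remove any matching /etc/fstab entry):
```
# Stop swapping to the zvol, then remove it.
swapoff /dev/zvol/rpool/swap
zfs destroy rpool/swap
```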
Thanks @MobyGamer, I'm running ZFS 0.7.12-1\~bpo9+1 on Debian Stretch (kernel 4.19.12-1\~bpo9+1) on a system with 2GB of RAM and I experienced this issue. I had followed the Debian Stretch Root on ZFS guide and was led to believe I should use swap on ZFS. Searching for the cause, there was really nothing in the wikis about potential downsides to doing it, and the only warning against it I managed to find was this issue. It's good that there is at least some documentation regarding it now 😄.
@mafredri I updated the wiki. Indeed, we didn't find problems with swap on the ZFS 0.7 branch for a while.
FYI, this issue is what made us switch from a swap zvol to a dedicated partition at the last minute, after user feedback and our own experience with our installer on Ubuntu 19.10 (https://bugs.launchpad.net/bugs/1847628).
This is quite an annoying issue: for ext4 we have been using a swapfile for quite a long time to avoid excessive partitioning of users' machines, and we would like our default ZFS experience to be similar.
With `multi_vdev_crash_dump` it should be possible to swap to a preallocated zvol or file, with the writes bypassing ZFS completely, right? But it is not implemented yet.
@didrocks I gave some information on a possible approach to address this long term, in my comment above. At this point, I think it's accepted that using a ZVOL for swap will result in instabilities, but we haven't had anybody with the time or motivation to step up and attempt to fix this.
I think at least partially, the lack of motivation stems from the fact that (I assume) _most_ folks using ZFS on Linux are not using it for the root filesystem (I use it personally, and so does my employer, so I don't mean this as nobody uses root on ZFS). Now that Ubuntu is officially supporting a root filesystem on ZFS configuration, perhaps that will change. I mean that, both in terms of more users of a root filesystem on ZFS, but also in terms of more developers with sufficient motivation to fix this issue.
Alternatively, there is a nuclear option for resolving this. If we treat swap like Illumos treats dump devices by disabling CoW and checksums, things will definitely work. That would be at the expense of protection against bitrot on the swap.
and:
The drawback to this approach would be that we won't get the data consistency guarantees that we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).
Not that my 2¢ means much as a non-contributor, but I'd personally _really_ prefer to see a design where at least checksums can be maintained functional so the possibility of end-to-end integrity can remain. Can anyone speak to the amount of complexity/effort that that single feature would add to this work?
I know Illumos dump devices were mentioned above. I am rather unfamiliar with Illumos, but I thought dump and swap devices were not the same thing, and that Illumos actually did use zvols for swap. Am I incorrect, or were the above comparisons not quite valid?
I'd personally really prefer to see a design where at least checksums can be maintained functional
I think we'd all prefer this, but I don't think it's currently possible if we were to write directly to pre-allocated blocks; due to the checksums being stored in the indirect blocks, and the indirect blocks being written during the pre-allocation rather than when we write out the swap data.
and that Illumos actually did use zvols for swap.
Yes, we used ZVOLs for swap on our illumos based appliance, and we often saw (likely the same) deadlocks for workloads that tried to actually use the swap space.
FYI, this issue is what made us switch from a swap zvol to a dedicated partition at the last minute, after user feedback and our own experience with our installer on Ubuntu 19.10 (https://bugs.launchpad.net/bugs/1847628).
This is quite an annoying issue: for ext4 we have been using a swapfile for quite a long time to avoid excessive partitioning of users' machines, and we would like our default ZFS experience to be similar.
The page out code is racing to get data out of memory by handing it to the block device, but ZFS is designed to write data asynchronously whenever possible. A problem occurs here because the attempts to do pageout counterintuitively increase memory pressure in the short term (between IO submission and txg commit) because ZFS must do book keeping on the data handed to it before it is actually sent to disk as part of the txg commit.
Creating your swap zvols like this should fix the problem (of pageout to a swap zvol being counterproductive) by using the ZIL to bypass the txg commit and make the ARC buffers used to store the data evictable before the txg commit is done:
```
zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap
```
That is suboptimal because single-queue-depth sync writeout performance with 4K IOs is fairly terrible. I recall seeing 10MB/sec on an Intel SSD 5 or 6 years ago. I have not been keeping track of its status, but I expect ZIL pipelining to make the writeout performance of a swap zvol set up this way better.
Someone probably should check to see if we take advantage of the ability to do immediate eviction when writes are done to a zvol setup this way. There might still be a lag (although it is a much better situation than going with the default settings).
I'd personally really prefer to see a design where at least checksums can be maintained functional
I think we'd all prefer this, but I don't think it's currently possible if we were to write directly to pre-allocated blocks; due to the checksums being stored in the indirect blocks, and the indirect blocks being written during the pre-allocation rather than when we write out the swap data
The best sketch of a solution that can preserve checksums that I have involves making a special swap property on zvols that alters several behaviors:
A few bonus improvements would be:
Some additional ideas that ignore suspend to disk (which is racy on a zvol, but that is for another issue):
This is a mix of my recollection of my past ideas on how to solve this with some new ones that occurred to me when (or shortly after) writing it. I had originally hoped that there might be a BIO flag that would avoid the need for an explicit property, but I am not aware of one. The ideas here probably need to be developed a bit more, but I believe that they would solve the problem.
That being said, without any code changes, the best that can be done to my knowledge is to force everything through ZIL like this:
zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap
This should do most of the things on the first list.
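One rough way to check that pageout traffic is really going through the ZIL with these settings is to watch log activity while the system swaps. This is only a sketch; the /proc path assumes a ZFS on Linux build that exposes the ZIL kstats there:

```sh
# Per-vdev I/O; dedicated log devices (if any) should show write activity while swapping
zpool iostat -v rpool 1
# ZIL intent-log transaction counters should climb while pages are being swapped out
grep itx /proc/spl/kstat/zfs/zil
```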
@ryao In what you're proposing, where we lean heavily on the ZIL to do the writes, we'll still need to do on-disk allocations "on demand" as we write out data to the swap ZVOL, right? Those on-disk allocations may require additional memory allocations (e.g. loading of metaslabs on heavily fragmented systems), which may be a cause for concern (e.g. hangs) when we're in a low (free) memory state. Forcing the use of the ZIL would likely improve the situation, but I'm not sure it would theoretically eliminate the problem.
Using rsync locally to move data from one dataset to another (send|recv was not an option in this specific case), the machine froze completely after a while (Linux 5.3, zfs/master @ g177c79df). Limiting zfs_arc_max to prevent swapping altogether helped.
That being said, without any code changes, the best that can be done to my knowledge is to force everything through ZIL like this:
zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap
[Update] Correction: Does not work here. See Post below
However, what about compression on the swap zvol as stated on the zvol swap wiki page?
Just to make sure, I disabled compression on the swap zvol and re-created the swap fs.
Question: How would compression on the swap zvol influence the deadlock behavior?
ZLE compression just does zero elimination. It is a clever idea, although it likely has no effect in practice because applications generally do not keep significant quantities of zeroed pages in their address spaces. Offhand, I cannot think of a way that it could possibly make things worse. Explicitly setting it could be an improvement by keeping the swap zvol from inheriting a compression setting from the root dataset that involves doing memory allocations.
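If the goal is simply to rule out an inherited compression setting, pinning it locally on the swap zvol (rpool/swap as used above) is enough; this is a sketch using the standard property commands:

```sh
# Pin zero-length-encoding compression locally so the zvol cannot inherit
# a heavier algorithm (e.g. gzip) from a parent dataset
zfs set compression=zle rpool/swap
zfs get compression rpool/swap   # SOURCE should now read "local"
```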
@ryao In what you're proposing, where we lean heavily on the ZIL to do the writes, we'll still need to do on-disk allocations "on demand" as we write out data to the swap ZVOL, right? Those on-disk allocations may require additional memory allocations (e.g. loading of metaslabs on heavily fragmented systems), which may be a cause for concern (e.g. hangs) when we're in a low (free) memory state. Forcing the use of the ZIL would likely improve the situation, but I'm not sure it would theoretically eliminate the problem.
@prakashsurya My general impression of the kernel’s existing facilities is that they are not intended to give theoretically perfect guarantees under low memory situations. You can see that in past tweaks that mainline has made to various behaviors (such as the OOM killer) to try to make behavior under memory pressure more desirable. Given that the kernel design is more of a best effort one than one that provides guarantees, I do not think we could be reasonably expected to do more than that.
With that in mind, we can take measures to incrementally improve swap zvols to the point where we have more confidence that the memory allocations required to allow pageout under memory pressure practically always succeed. The kernel has emergency memory reserves meant to be leveraged in such situations, which ought to help here.
That being said, the issue appears to be effectively resolved by forcing things through ZIL in practice. At least, that was my conclusion several years ago. I have not seen any empirical evidence to show otherwise in the time since then. This is likely because of a mix of factors:
Forcing pageout to go through ZIL in a way that avoids an amortized net increase in memory pressure until the txg commit finishes, at the very least, allows us to interact much more nicely with the kernel’s existing infrastructure for handling low memory conditions.
Ultimately, these are all best-effort measures, but I don't see a way to do better than that, and I suspect that this issue would never have been filed had such measures been made easily accessible via a zvol property.
I think that swap to a preallocated zvol is a viable solution, and much simpler than making the general write code paths work well when under extreme memory pressure. Checksum verification can be preserved by saving the checksums in memory only (not writing them to indirect blocks).
That said, we could also improve on the current situation by writing to the swap zvol with dmu_sync() and then evicting the data from the ARC. This is basically the proposed "directio" write code path, which I prototyped (https://github.com/ahrens/zfs/tree/directio) and I believe @behlendorf is working on integrating.
@ahrens Writing to preallocated space and keeping checksums in memory would work for most systems, but it causes memory requirements to scale with the size of stored data on the zvol. If the zvol space used this way is large relative to system memory, lockups can occur that would not be apparent until a system is in production.
Preallocated space would be seen by system administrators as a performance improvement, so it is quite likely that such a facility would be used for things other than swap, despite warnings. If we do this for swap, we would likely want to have swap mounted by ZFS at boot-time like non-legacy mountpoints and to restrict the facility that keeps checksums in memory to swap when it is mounted in that way. It would be giving people too much rope to use to hang themselves otherwise.
@ryao I disagree. Administrators should not be prevented from shooting themselves in the foot, so long as the warnings are clear. Nothing stops administrators from adding a single disk vdev to a pool for example. Nothing stops them from disabling sync on a dataset holding a database. Neither should they be prevented from setting a property that uses preallocated checksum space, even if it should only be used for swap.
Allowing people to do stupid things also allows them to do clever things.
@clinta If we implemented this functionality and restricted it, you would likely still have an unsupported way of using it outside of its intended usage. For example, we restrict mounting non-legacy mountpoints to using the ZFS utilities, although you can get around it via the zfsutil flag or telling the mount utility not to use the ZFS mount helper.
This would not be the first time we restricted the capabilities of the code for safety purposes. We do not allow certain ashift parameters despite the code supporting 9 through 17, because they can cause issues, the most obvious being pool corruption at ashift=17 following a crash when hardware is less than perfect. Disabling flushes is possible through an unsupported kernel module option, but it is not exposed via the typical property interface because of the severe risk of pool corruption. In the case of sync=disabled, it is intended for troubleshooting, but the amount of damage it can do is far inferior to disabling flushes, which is a feature that we restrict.
We also have restrictions on features that could pose runtime safety issues. One is records larger than 1MB, where you need to set a kernel module parameter to allow them because they are not well tested. Another is arbitrary nesting of vdev types, because it is a stack overflow risk and likely does not interact well with the current zpool utility.
If we implement Matt's idea, not making it a first-class feature outside of its intended use case would probably be for the best. There is just too much risk of random hangs in production from misuse of this by people who do not know better. ZFS is being used in critical / near-critical systems where hangs can affect people's well-being. Knowing that, I really don't feel comfortable with the idea of making it too easy for people to configure ZFS in a way that would randomly hang in production.
Furthermore, ZFS has traditionally avoided placing a hard limit on the amount of storage that can be used with a given amount of memory, and this would be a departure from that. Some would say data deduplication is such a thing, but the system just becomes slower with too much unique deduplicated data relative to system memory rather than locking up, although being too slow might be considered the same as a lockup, especially following a reboot during an async destroy.
We have had plenty of negative feedback about the loss of performance when unique deduplicated data reaches a certain difficult-to-predict level in relation to memory. Given how that much less severe issue has been received, I cannot see a general-purpose performance feature that can cause seemingly random hangs being received very well by the people bitten by it.
Two additional notes on this:
Providing safe swap is absolutely critical for a filesystem that is on its way to becoming more and more popular as a root fs (which is a great thing).
Would it be necessary, or could it be optional, to keep checksums only in memory when going the pre-allocated route?
That being said, without any code changes, the best that can be done to my knowledge is to force everything through ZIL like this:
zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap
Setting the swap zvol to the parameters shown above works for me.
Correction: The solution above does not work here. Systems freeze soon after swapping occurs. Tested on 2 machines with different zfs / kernel versions (26b60474 @ Linux 5.2 / 177c79df @ Linux 5.3) and different pool configurations (mirror / raidz).
Reproduce:
Machine with 32 GB RAM and an 8 GB swap zvol, freshly booted, ~31 GB RAM available, no additional load. Open 3 consoles:
console 1, monitor: top
console 2, fill up RAM: stress --vm 3 --vm-keep --vm-bytes 10G
console 3, produce fs I/O: stress --hdd 1
As soon as fs I/O is produced, top can be watched for a couple of seconds until the system freezes. (A consolidated sketch of these steps follows below.)
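For convenience, the steps above amount to roughly the following; run each command in its own terminal, with the sizes matching the 32 GB machine described above:

```sh
top                                      # terminal 1: monitor memory and swap usage
stress --vm 3 --vm-keep --vm-bytes 10G   # terminal 2: 3 x 10G workers, pushing past physical RAM into swap
stress --hdd 1                           # terminal 3: generate filesystem I/O
```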
Providing safe swap is absolutely critical for a filesystem that is on its way to becoming more and more popular as a root fs (which is a great thing).
No, it really isn't. Most sensible people simply partition a device and give it to the kernel.
I also used to go that way until reading the zfs wiki entry on swap/zvol (before the now-shown warning, of course). However, not needing to partition disks (except maybe for a compatible boot pool) is one of the greatest features of zfs. But that might be a matter of philosophy.
Correction: The solution above does not work here.
This is correct; there is currently no method that allows swap on a zvol to work without deadlock.
I understand. But wouldn't all zvol swap be impossible then, no matter the zfs version? Is it possible that such problems are just rare because systems have more and more RAM these days, so they simply don't get discovered / reported that often?
If that is the case, I sincerely ask that the zvol swap wiki page be modified to not present a HOWTO but instead clearly state that swap on a zvol is not possible at the moment and will deadlock systems.
Side note: swapon(8) in its current version shows a clear warning about CoW filesystems, but gives examples only for older versions of Linux / BTRFS. Also, BTRFS seems to have fixed this issue by implementing the no-CoW feature.
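For comparison only (this is not a ZFS workaround), the BTRFS no-CoW swapfile setup referred to above looks roughly like this on kernel 5.0+; the NOCOW attribute has to be set while the file is still empty:

```sh
truncate -s 0 /swapfile   # create an empty file
chattr +C /swapfile       # mark it no-CoW before it contains any data
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```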
Kernel 5.4 got some fixes for zram race conditions; maybe those issues are resolved now.
In what way is zram related to swap on a zvol?
Both zram and swap on a zvol mean that swap is located in space governed by an additional module. That might indicate problems in cooperation/locking between the swap code and these modules. Both the zram and the ZFS swap deadlocks seemed very similar in behaviour, but I am just guessing.
OK, both modules have in common that, in order to free memory, additional buffers are needed when memory is already low.
But first off, you refer to a fix in zram, which does not fix anything in zfsonlinux.
Second, zram looks very, very simple in comparison to zfsonlinux.
As a hackathon project, I've spent some time implementing Matt's idea to re-use the zvol dump device logic from Illumos. There's still a bunch of work to do but it seems promising. You can see a summary of the investigations here: https://docs.google.com/document/d/1y_zKQZJ8_RGYWwvx4k_QmR-PrFhrJUGAhWV-GT0lZRg
I haven't had this issue, but then again I haven't had extreme memory pressure during swapping. However, I don't use sync=always on my swap zvol. I keep it on the default sync=standard.
This led me to do an experiment. I disabled swap (swapoff -av) so I could hammer on the swap zvol. Yeah, I could've simply made a new zvol, but this seemed more expedient. Anyway, doing a simple "dd if=/dev/urandom of=/dev/zvol/pool/swap bs=4096 status=progress", it took about 2 minutes to fill the 4G zvol. Of course, it did buffer much of it, but the approx. 120 seconds was the total time to completion.
Next, I set sync=always and retried the test. This time, it ran up to 1.4G transferred relatively quickly, after which it slowed to a crawl once the memory buffer was filled. The first time I tried the test the "dd" process "froze", and quit updating the status for several minutes. In a separate terminal window, I set sync=standard on the swap zvol, and in short order the dd process took off again, eventually completing after about a minute and a half or so.
Once again, I set sync=always and re-ran the "dd". Once again, it rapidly ran up to 1.4G, and then nearly stalled out. This time, the dd process kept updating the status, but the rate of transfer slowed to a crawl. Watching the bytes transferred, which updated approx once per second, I was able to determine that writing had slowed to ~500KiB/sec. At that rate, the dd command would have required about 90 minutes to complete the transfer of 4G of random data.
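For reference, the two runs described above boil down to something like the following (pool/swap is the zvol name from the dd command; do not run this against a zvol that is currently enabled as swap, since it overwrites the device):

```sh
zfs set sync=standard pool/swap
dd if=/dev/urandom of=/dev/zvol/pool/swap bs=4096 status=progress   # completed in roughly 2 minutes here

zfs set sync=always pool/swap
dd if=/dev/urandom of=/dev/zvol/pool/swap bs=4096 status=progress   # stalled near 1.4G, then ~500KiB/s
```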
Synchronous writes are, of course, slower than async buffered writes, but the difference in that basic test was astonishing. This makes me wonder whether the reason I have never had my system lock up during swap activity is that I have always had sync=standard. I also use sync=standard on my VM zvols. Furthermore, my slog and l2arc are on zvols on an nvme drive, and they are also set for sync=standard (although I wonder if this may be unsafe for the slog?).
Any thoughts?
Yes I think we should not be recommending sync=always for swap zvols ( https://github.com/zfsonlinux/pkg-zfs/issues/229 )
I agree. I see no reason for using sync=always. The cache will be flushed in a few seconds anyway, and swap is non-persistent so there are no worries about corruption or inconsistency following power loss.
For the same performance reasons, I also use sync=standard on my VM zvols. Again, I have had no issues whatsoever, but I also haven't had a power loss during a VM write operation. In that case, the kernel running in the VM would believe the information had been flushed to disk even though it was still in RAM on the host; exactly what would happen if power were lost at that moment, I do not know.

For the most part, the way I set up and use my VMs, any user information is stored in a shared folder on the host filesystem, so it is no different from writing user data from an application on the host. The only time the guest filesystem really changes is when I install or remove software or make other system changes to the guest. To avoid catastrophe, I simply snapshot the VM zvol before doing any of those things; that way, if power is lost, or I screw something up, I can simply do a rollback and be on my merry way.

To date, the systems I have set up this way have, with one exception, been trouble-free. The problem system was caused by a bad hard drive. It was a single-drive system running copies=2, and despite continually failing and relocated sectors it kept running, with ZFS repairing the damaged data from the other copy located elsewhere on the drive (almost certainly on another platter). The HD got replaced with a new SSD as quickly as possible, and the system has been completely trouble-free since.
Has anyone replicated the swap deadlock when the zfs volume for swap is created with sync=standard? The examples I see above which replicated the deadlock have sync=always.
@johnalotoski I did reproduce it with sync=disabled, but have not tried standard.
Just tried with sync=standard and experienced no hang when the swap started to get used and the hdd got stressed. If this solved the whole issue, that would be incredible.
Unfortunately for me, using sync=standard is still a problem.
| Type | Version/Name |
|----------|:-------------:|
| Distribution | NixOS |
| Distribution Version | 20.09 |
| Linux Kernel | 5.8.16 x86_64 |
| ZFS Version | zfs-0.8.4-1 |
Testing on a ZFS system with plenty of SSD capacity and 64 GB physical RAM, a ZFS swap vol was created with:
$ sudo zfs create -V 64G -b $(getconf PAGESIZE) -o compression=zle \
-o logbias=throughput -o sync=standard \
-o primarycache=metadata -o secondarycache=none \
-o com.sun:auto-snapshot=false tank/swap
$ sudo mkswap -f /dev/zvol/tank/swap
$ sudo swapon /dev/zvol/tank/swap
Then a memory stressor from the stress-ng package was run so that approximately 50% of the swap space was used during a memory stress test (64 GB physical RAM + 32 GB swap = 96 GB target usage):
$ nix-shell -p stress-ng
$ sudo stress-ng --vm 1 --vm-bytes 96G --vm-method all --verify -t 10m -v
In another console window, I had a watch 'free -h' to see usage of the swap vol in real-time during the stress test. Shortly after the stress test started and physical RAM was consumed and the system started using the ZFS swap, the system froze and did not recover (even after the 10 minute stressor should have stopped with the -t 10m argument above). The swap vol was observed to use ~1 GB of the ZFS swap volume at the time the system froze.
After a system reboot, the 64 GB ZFS swap vol was destroyed, and a 64 GB standard swap from a separate partition was enabled. The same stress command above was re-run with the standard swap enabled and it completed successfully, without freezing. The standard swap used during the stress test was ~41 GB as seen by the watch 'free -h' command.
Happy to try again with different ZFS swap volume creation parameters if someone wants to suggest something they think might help.
I didn't use com.sun:auto-snapshot=false in my tests. When I did, it was an immediate hang. Can you try again without that?
User properties shouldn't have any effect on zfs behavior.
I will try again, but I have 2 swap zvols, with that property being the only difference, and one keeps running under swap load, and the other immediately deadlocks. I never thought to try earlier because it should have zero influence.
Yeah, I tried again without that snapshot parameter and received the same deadlock result. At least it's consistently reproducible for me so far :).