Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 7.5
Linux Kernel | 3.10.0-862.9.1.el7.x86_64
Architecture | x86_64
ZFS Version | 0.7.9-1
SPL Version | 0.7.9-1
Incrementally receiving a dataset containing a big sparse file is very slow due to a "txg_sync storm" on the receiving side, which causes a heavy synchronous write load on the receiving disk. This also leads to very noticeable write amplification on the receiving side.
The problem manifests itself when the sending side has some big sparse files which are partially allocated due to a random write load. For a practical example, think of a virtualization host with thin-provisioned VM disk images: the backing file will be mostly sparse, but with significant allocation due to pseudo-random writes coming from the guest.
I hope the following reproducer clarifies things a bit. Please note that I sometimes need multiple runs to trigger the issue.
SENDING SIDE:
[root@localhost ~]# zpool create tank /dev/sdb
[root@localhost ~]# zfs create tank/test
[root@localhost ~]# truncate --size=64G /tank/test/test.img
[root@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 117K 7.69G 25.5K /tank
tank/test 25.5K 7.69G 25.5K /tank/test
[root@localhost ~]# fio --name test --filename /tank/test/test.img --io_size=4M --refill_buffers --rw randwrite
[root@localhost ~]# du -hs /tank/test/test.img
129M /tank/test/test.img
[root@localhost ~]# zfs snapshot tank/test@snap1
[root@localhost ~]# time zfs send tank/test@snap1 | ssh [email protected] zfs recv tank/test
real 0m6.214s
user 0m1.102s
sys 0m1.117s
[root@localhost ~]# fio --name test --filename /tank/test/test.img --io_size=4M --refill_buffers --rw randwrite
[root@localhost ~]# du -hs /tank/test/test.img
129M /tank/test/test.img
[root@localhost ~]# time zfs send -i tank/test@snap1 tank/test@snap2 | ssh [email protected] zfs recv tank/test
real 0m37.843s
user 0m0.760s
sys 0m2.792s
[root@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 387M 7.31G 25.5K /tank
tank/test 386M 7.31G 129M /tank/test
tank/test@snap1 129M - 129M -
tank/test@snap2 129M - 129M -
Note how much time the incremental send/recv needs to complete. As written above, I sometimes need multiple fio + snapshot + send/recv passes to trigger the issue. During the very slow receive, the following happens on the receiving side...
RECEIVING SIDE:
[root@localhost ~]# dstat -d -f -n
--dsk/sda-----dsk/sdb-- -net/enp0s3
read writ: read writ| recv send
0 0 : 0 2245k| 21M 48k
0 0 : 0 7876k|3510k 7830B
0 0 : 0 9643k|4319k 10k
0 0 : 0 14M|3650k 7728B
0 0 : 0 8382k|4050k 10k
0 0 : 0 10M|3780k 7758B
0 0 : 0 7807k|3651k 8754B
0 0 : 0 8877k|4182k 9288B
0 0 : 0 8431k|3916k 7698B
0 0 : 0 13M|2970k 6960B
0 0 : 0 8725k|2700k 5832B
0 0 : 0 11M|3511k 8130B
0 0 : 0 14M|2567k 6726B
0 0 : 0 9815k|3243k 7890B
0 0 : 0 9342k|2563k 6756B
0 0 : 0 7917k|3930k 7224B
0 0 : 0 7245k|4171k 7722B
0 0 : 0 7325k|3779k 7962B
0 0 : 0 14M|3511k 8094B
0 26k: 0 7982k|3513k 8058B
0 0 : 0 7833k|3782k 8754B
0 0 : 0 10M|4053k 8454B
0 0 : 0 9437k|4050k 9714B
0 0 : 0 11M|4188k 7632B
0 0 : 0 16M|4322k 9246B
0 0 : 0 9638k|3239k 7122B
0 0 : 0 16M|4055k 8520B
0 0 : 0 11M|4186k 8378B
0 0 : 0 9707k|3915k 8826B
0 8192B: 0 9940k|4865k 11k
0 0 : 0 15M|3241k 8562B
0 0 : 0 11M|4053k 9552B
0 0 : 0 14M| 964k 3492B
[root@localhost ~]# iostat -x -k 1
avg-cpu: %user %nice %system %iowait %steal %idle
4.12 0.00 23.71 72.16 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 979.38 0.00 8203.61 16.75 1.06 1.07 0.00 1.07 0.95 92.89
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
3.19 0.00 28.72 68.09 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 1254.26 0.00 14247.34 22.72 1.11 0.87 0.00 0.87 0.75 94.04
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
4.17 0.00 27.08 68.75 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 962.50 0.00 6493.75 13.49 1.07 1.15 0.00 1.15 0.95 91.77
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[root@localhost ~]# cat /proc/spl/kstat/zfs/dbgmsg
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:542:txg_sync_thread(): txg=1790 quiesce_txg=1792 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1790 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1791 quiesce_txg=1792 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1791
1531863431 txg.c:680:txg_wait_open(): txg=1793 quiesce_txg=1793 sync_txg=1581
1531863431 txg.c:542:txg_sync_thread(): txg=1791 quiesce_txg=1793 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1791 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1792 quiesce_txg=1793 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1792
1531863431 txg.c:542:txg_sync_thread(): txg=1792 quiesce_txg=1793 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=891
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=81
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=77
1531863431 txg.c:680:txg_wait_open(): txg=1794 quiesce_txg=1794 sync_txg=1581
1531863431 txg.c:592:txg_quiesce_thread(): txg=1793 quiesce_txg=1794 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1793
1531863431 txg.c:680:txg_wait_open(): txg=1795 quiesce_txg=1795 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:542:txg_sync_thread(): txg=1793 quiesce_txg=1795 sync_txg=1581
1531863431 txg.c:592:txg_quiesce_thread(): txg=1794 quiesce_txg=1795 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1794
1531863431 txg.c:680:txg_wait_open(): txg=1796 quiesce_txg=1796 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=625
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=24
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:542:txg_sync_thread(): txg=1794 quiesce_txg=1796 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1794 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1795 quiesce_txg=1796 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1795
1531863431 txg.c:680:txg_wait_open(): txg=1797 quiesce_txg=1797 sync_txg=1581
1531863431 txg.c:542:txg_sync_thread(): txg=1795 quiesce_txg=1797 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1795 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1796 quiesce_txg=1797 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1796
1531863431 txg.c:542:txg_sync_thread(): txg=1796 quiesce_txg=1797 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=291
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=23
1531863431 txg.c:680:txg_wait_open(): txg=1798 quiesce_txg=1798 sync_txg=1581
1531863431 txg.c:592:txg_quiesce_thread(): txg=1797 quiesce_txg=1798 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1797
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:680:txg_wait_open(): txg=1799 quiesce_txg=1799 sync_txg=1581
[root@localhost ~]# while true; do tail -1 /proc/spl/kstat/zfs/tank/txgs ; sleep 1; done
1581 2923512183496 O 0 0 0 0 0 0 0 0 0
1592 2924869136301 O 0 0 0 0 0 0 0 0 0
1637 2925857136846 O 0 0 0 0 0 0 0 0 0
1677 2926805983996 O 0 0 0 0 0 0 0 0 0
1716 2927870517362 O 0 0 0 0 0 0 0 0 0
1762 2928892130959 O 0 0 0 0 0 0 0 0 0
1790 2929819190105 O 0 0 0 0 0 0 0 0 0
1822 2930925821686 O 0 0 0 0 0 0 0 0 0
1851 2931902851534 O 0 0 0 0 0 0 0 0 0
1887 2932932608960 O 0 0 0 0 0 0 0 0 0
1920 2933922961357 O 0 0 0 0 0 0 0 0 0
1945 2934919538181 O 0 0 0 0 0 0 0 0 0
1988 2935935767377 O 0 0 0 0 0 0 0 0 0
2021 2936943034285 O 0 0 0 0 0 0 0 0 0
2064 2937954241755 O 0 0 0 0 0 0 0 0 0
2102 2938961198837 O 0 0 0 0 0 0 0 0 0
2141 2939959538399 O 0 0 0 0 0 0 0 0 0
2180 2940931324571 O 0 0 0 0 0 0 0 0 0
2223 2942017723225 O 0 0 0 0 0 0 0 0 0
2269 2943023817670 O 0 0 0 0 0 0 0 0 0
2300 2944011508163 O 0 0 0 0 0 0 0 0 0
2334 2945034108817 O 0 0 0 0 0 0 0 0 0
2357 2945682134602 O 0 0 0 0 0 0 0 0 0
2382 2947040590210 O 0 0 0 0 0 0 0 0 0
2412 2948068644023 O 0 0 0 0 0 0 0 0 0
2453 2949081098557 O 0 0 0 0 0 0 0 0 0
2499 2950088899578 O 0 0 0 0 0 0 0 0 0
[root@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 388M 7.31G 25.5K /tank
tank/test 386M 7.31G 129M /tank/test
tank/test@snap1 129M - 129M -
tank/test@snap2 129M - 129M -
Focusing on the receiving side, please note how:
- dstat shows a very long writeout phase;
- iostat shows that very small writes are pushed to disk;
- dbgmsg shows frenetic txg activity;
- txgs confirms that hundreds of transactions are being synchronized to disk.

All this causes bad performance and very significant write amplification on the target host.
NOTE 1: I am reproducing the issue on two virtual machines, but I have multiple physical hosts which show the very same problem.
NOTE 2: I tried to add an SSD acting as a SLOG device and nothing changed (the SLOG is absolutely not touched by the long data writeout phase).
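For reference, attaching and then detaching the SLOG for that test looked roughly like the following (the device name is only illustrative):
[root@localhost ~]# zpool add tank log /dev/sdc
[root@localhost ~]# zpool remove tank /dev/sdc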
Speculating here: I wonder if we are generating a lot of FREE records (because hole birth time is disabled by default), and when we process them on the receiving side they quickly exceed zfs_per_txg_dirty_frees_percent, causing us to push out a txg. You could try setting zfs_per_txg_dirty_frees_percent to 0 to disable it, and see if that helps.
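On ZFS on Linux the tunable can be changed at runtime through the usual module-parameter path, roughly like this (a sketch; the default should be 30, i.e. frees may dirty up to 30% of zfs_dirty_data_max per txg):
# show the current setting
cat /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent
# disable the free throttle for testing
echo 0 > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent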
I just tried it on the VMs and it seems to work! I suspected it was related to some sort of threshold, but I couldn't figure out which one (I tried with the various metaslab/fragmentation thresholds, to no avail). Tomorrow (well, today...) I'll try it on a physical machine having the issue and report back.
EDIT: I finally tried it on a physical production machine and the high IO load is gone! sar stats show iowait decreased from ~10% to <1%, and receives are much faster now. Thank you so much.
Some questions:
- Is setting zfs_per_txg_dirty_frees_percent to 0 safe, or can it cause issues?
- I set send_holes_without_birth_time to 0 but nothing changed. Why?

Thanks.
@shodanshok
You can read about why zfs_per_txg_dirty_frees_percent is used (nonzero) in the PR that introduced it: https://github.com/openzfs/openzfs/pull/61
send_holes_without_birth_time=0 doesn't do anything for incremental sends if you have (or ever had) >~20 files in the filesystem, because object numbers may have been reused, which inhibits our ability to tell if the other side already has holes with birth time of 0. (note that send_holes_without_birth_time=0 does help with sparse zvols.)
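(send_holes_without_birth_time is also a module parameter, so for experiments it can be toggled the same way; a sketch, assuming the standard sysfs path, where 1 should be the default and means "always send holes as if their birth time were 0":)
# rely on recorded hole birth times instead of always sending birth time 0
echo 0 > /sys/module/zfs/parameters/send_holes_without_birth_time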
> is it now safe to use the hole_birth optimization, or does it remain bugged?
Hole birth (the zpool feature) is safe. I'm not aware of any bugs with send_holes_without_birth_time=0, but I'm also not sure how much testing that has received recently.
@alek-p Maybe the real fix would be to make dp_long_free_dirty_pertxg[] only include filled-in parts of the file that are being freed, not some huge range of already-free stuff.
@ahrens thanks for your insights. As you said, it would be great if the delete throttling logic accounted only for filled-in parts of the file rather than for already-sparse regions.
Regarding send_holes_without_birth_time, is the ~20 files limit a hard one? I can't find any reference to it. If so, I agree that it is basically a zvol-only feature. Anyway, on the testing pool (whose output is shown in the first post) I have a single file only, so it should make a difference, right? But nothing changed when I enabled it. Could it be due to the holes being generated at file creation time (i.e. via truncate)?
Thanks.
Hi, I can confirm we had the same issue and setting zfs_per_txg_dirty_frees_percent=0 fixed it. We are primarily replicating files with LUKS/ext4 filesystems in them (for unfortunate legacy reasons), and replication performance was less than expected: writes were very small and disk utilization sat at an almost constant 25%. After tuning to 0, disk utilization dropped to almost 0, write sizes increased, and transfer speed went up drastically. I'm very happy.
(Ubuntu 16.04.x, ZoL 0.7.5)
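For anyone who wants to keep the setting across reboots, the usual ZoL module-option route should work (a sketch; the file name just follows the common convention):
# /etc/modprobe.d/zfs.conf
options zfs zfs_per_txg_dirty_frees_percent=0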
As the problem is solved, I am going to close the issue. Special thanks to @ahrens for helping.