Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 7.5
Linux Kernel | 3.10.0-862.9.1.el7.x86_64
Architecture | x86_64
ZFS Version | 0.7.9-1
SPL Version | 0.7.9-1
Incrementally receiving a dataset containing a big sparse file is very slow due to a "txg_sync storm" on the receiving side, which causes a heavy synchronous write load on the receiving disk. This also leads to very noticeable write amplification on the receiving side.
The problem manifests itself when the sending side has some big sparse files which are partially allocated due to a random write load. For a practical example, think of a virtualization host with thin-provisioned VM disk images: the backing file will be mostly sparse, but with significant allocation due to pseudo-random writes coming from the guest.
I hope the following reproducer clarifies things a bit. Please note that I sometimes need multiple runs to trigger the issue.
SENDING SIDE:
[root@localhost ~]# zpool create tank /dev/sdb
[root@localhost ~]# zfs create tank/test
[root@localhost ~]# truncate --size=64G /tank/test/test.img
[root@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 117K 7.69G 25.5K /tank
tank/test 25.5K 7.69G 25.5K /tank/test
[root@localhost ~]# fio --name test --filename /tank/test/test.img --io_size=4M --refill_buffers --rw randwrite
[root@localhost ~]# du -hs /tank/test/test.img
129M /tank/test/test.img
[root@localhost ~]# zfs snapshot tank/test@snap1
[root@localhost ~]# time zfs send tank/test@snap1 | ssh [email protected] zfs recv tank/test
real 0m6.214s
user 0m1.102s
sys 0m1.117s
[root@localhost ~]# fio --name test --filename /tank/test/test.img --io_size=4M --refill_buffers --rw randwrite
[root@localhost ~]# du -hs /tank/test/test.img
129M /tank/test/test.img
[root@localhost ~]# time zfs send -i tank/test@snap1 tank/test@snap2 | ssh [email protected] zfs recv tank/test
real 0m37.843s
user 0m0.760s
sys 0m2.792s
[root@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 387M 7.31G 25.5K /tank
tank/test 386M 7.31G 129M /tank/test
tank/test@snap1 129M - 129M -
tank/test@snap2 129M - 129M -
Note how much time the incremental send/recv needs to complete. As written above, I sometimes need multiple fio + snapshot + send/recv passes to trigger the issue. During the very slow receive, the following happens on the receiving side...
RECEIVING SIDE:
[root@localhost ~]# dstat -d -f -n
--dsk/sda-----dsk/sdb-- -net/enp0s3
read writ: read writ| recv send
0 0 : 0 2245k| 21M 48k
0 0 : 0 7876k|3510k 7830B
0 0 : 0 9643k|4319k 10k
0 0 : 0 14M|3650k 7728B
0 0 : 0 8382k|4050k 10k
0 0 : 0 10M|3780k 7758B
0 0 : 0 7807k|3651k 8754B
0 0 : 0 8877k|4182k 9288B
0 0 : 0 8431k|3916k 7698B
0 0 : 0 13M|2970k 6960B
0 0 : 0 8725k|2700k 5832B
0 0 : 0 11M|3511k 8130B
0 0 : 0 14M|2567k 6726B
0 0 : 0 9815k|3243k 7890B
0 0 : 0 9342k|2563k 6756B
0 0 : 0 7917k|3930k 7224B
0 0 : 0 7245k|4171k 7722B
0 0 : 0 7325k|3779k 7962B
0 0 : 0 14M|3511k 8094B
0 26k: 0 7982k|3513k 8058B
0 0 : 0 7833k|3782k 8754B
0 0 : 0 10M|4053k 8454B
0 0 : 0 9437k|4050k 9714B
0 0 : 0 11M|4188k 7632B
0 0 : 0 16M|4322k 9246B
0 0 : 0 9638k|3239k 7122B
0 0 : 0 16M|4055k 8520B
0 0 : 0 11M|4186k 8378B
0 0 : 0 9707k|3915k 8826B
0 8192B: 0 9940k|4865k 11k
0 0 : 0 15M|3241k 8562B
0 0 : 0 11M|4053k 9552B
0 0 : 0 14M| 964k 3492B
[root@localhost ~]# iostat -x -k 1
avg-cpu: %user %nice %system %iowait %steal %idle
4.12 0.00 23.71 72.16 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 979.38 0.00 8203.61 16.75 1.06 1.07 0.00 1.07 0.95 92.89
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
3.19 0.00 28.72 68.09 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 1254.26 0.00 14247.34 22.72 1.11 0.87 0.00 0.87 0.75 94.04
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
4.17 0.00 27.08 68.75 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 962.50 0.00 6493.75 13.49 1.07 1.15 0.00 1.15 0.95 91.77
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[root@localhost ~]# cat /proc/spl/kstat/zfs/dbgmsg
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:542:txg_sync_thread(): txg=1790 quiesce_txg=1792 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1790 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1791 quiesce_txg=1792 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1791
1531863431 txg.c:680:txg_wait_open(): txg=1793 quiesce_txg=1793 sync_txg=1581
1531863431 txg.c:542:txg_sync_thread(): txg=1791 quiesce_txg=1793 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1791 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1792 quiesce_txg=1793 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1792
1531863431 txg.c:542:txg_sync_thread(): txg=1792 quiesce_txg=1793 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=891
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=81
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=77
1531863431 txg.c:680:txg_wait_open(): txg=1794 quiesce_txg=1794 sync_txg=1581
1531863431 txg.c:592:txg_quiesce_thread(): txg=1793 quiesce_txg=1794 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1793
1531863431 txg.c:680:txg_wait_open(): txg=1795 quiesce_txg=1795 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:542:txg_sync_thread(): txg=1793 quiesce_txg=1795 sync_txg=1581
1531863431 txg.c:592:txg_quiesce_thread(): txg=1794 quiesce_txg=1795 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1794
1531863431 txg.c:680:txg_wait_open(): txg=1796 quiesce_txg=1796 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=625
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=24
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:542:txg_sync_thread(): txg=1794 quiesce_txg=1796 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1794 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1795 quiesce_txg=1796 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1795
1531863431 txg.c:680:txg_wait_open(): txg=1797 quiesce_txg=1797 sync_txg=1581
1531863431 txg.c:542:txg_sync_thread(): txg=1795 quiesce_txg=1797 sync_txg=1581
1531863431 txg.c:509:txg_sync_thread(): waiting; tx_synced=1795 waiting=1581 dp=ffff8eccba5c9000
1531863431 txg.c:592:txg_quiesce_thread(): txg=1796 quiesce_txg=1797 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1796
1531863431 txg.c:542:txg_sync_thread(): txg=1796 quiesce_txg=1797 sync_txg=1581
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=291
1531863431 dnode_sync.c:120:free_blocks(): ds=ffff8eccba5fc000 obj=2 num=23
1531863431 txg.c:680:txg_wait_open(): txg=1798 quiesce_txg=1798 sync_txg=1581
1531863431 txg.c:592:txg_quiesce_thread(): txg=1797 quiesce_txg=1798 sync_txg=1581
1531863431 txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 1797
1531863431 dnode_sync.c:120:free_blocks(): ds= (null) obj=3e num=1
1531863431 txg.c:680:txg_wait_open(): txg=1799 quiesce_txg=1799 sync_txg=1581
[root@localhost ~]# while true; do tail -1 /proc/spl/kstat/zfs/tank/txgs ; sleep 1; done
1581 2923512183496 O 0 0 0 0 0 0 0 0 0
1592 2924869136301 O 0 0 0 0 0 0 0 0 0
1637 2925857136846 O 0 0 0 0 0 0 0 0 0
1677 2926805983996 O 0 0 0 0 0 0 0 0 0
1716 2927870517362 O 0 0 0 0 0 0 0 0 0
1762 2928892130959 O 0 0 0 0 0 0 0 0 0
1790 2929819190105 O 0 0 0 0 0 0 0 0 0
1822 2930925821686 O 0 0 0 0 0 0 0 0 0
1851 2931902851534 O 0 0 0 0 0 0 0 0 0
1887 2932932608960 O 0 0 0 0 0 0 0 0 0
1920 2933922961357 O 0 0 0 0 0 0 0 0 0
1945 2934919538181 O 0 0 0 0 0 0 0 0 0
1988 2935935767377 O 0 0 0 0 0 0 0 0 0
2021 2936943034285 O 0 0 0 0 0 0 0 0 0
2064 2937954241755 O 0 0 0 0 0 0 0 0 0
2102 2938961198837 O 0 0 0 0 0 0 0 0 0
2141 2939959538399 O 0 0 0 0 0 0 0 0 0
2180 2940931324571 O 0 0 0 0 0 0 0 0 0
2223 2942017723225 O 0 0 0 0 0 0 0 0 0
2269 2943023817670 O 0 0 0 0 0 0 0 0 0
2300 2944011508163 O 0 0 0 0 0 0 0 0 0
2334 2945034108817 O 0 0 0 0 0 0 0 0 0
2357 2945682134602 O 0 0 0 0 0 0 0 0 0
2382 2947040590210 O 0 0 0 0 0 0 0 0 0
2412 2948068644023 O 0 0 0 0 0 0 0 0 0
2453 2949081098557 O 0 0 0 0 0 0 0 0 0
2499 2950088899578 O 0 0 0 0 0 0 0 0 0
[root@localhost ~]# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 388M 7.31G 25.5K /tank
tank/test 386M 7.31G 129M /tank/test
tank/test@snap1 129M - 129M -
tank/test@snap2 129M - 129M -
Focusing on the receiving side, please note how:
- dstat shows a very long writeout phase;
- iostat shows that very small writes are pushed to disk;
- dbgmsg shows frenetic txg activity;
- txgs confirms that hundreds of transactions are being synchronized to disk.

All this causes bad performance and very significant write amplification on the target host.
NOTE 1: I am reproducing the issue on two virtual machines, but I have multiple physical hosts which show the very same problem.
NOTE 2: I tried to add an SSD acting as a SLOG device and nothing changed (the SLOG is absolutely not touched by the long data writeout phase).
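For reference, attaching and then detaching the SLOG for that test looked roughly like the following (the device name is only illustrative):
[root@localhost ~]# zpool add tank log /dev/sdc
[root@localhost ~]# zpool remove tank /dev/sdc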
Speculating here: I wonder if we are generating a lot of FREE records (because hole birth time is disabled by default), and when we process them on the receiving side they quickly exceed zfs_per_txg_dirty_frees_percent, causing us to push out a txg. You could try setting zfs_per_txg_dirty_frees_percent to 0 to disable it, and see if that helps.
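On ZFS on Linux the tunable can be changed at runtime through the usual module-parameter path, roughly like this (a sketch; the default should be 30, i.e. frees may dirty up to 30% of zfs_dirty_data_max per txg):
# show the current setting
cat /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent
# disable the free throttle for testing
echo 0 > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent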
I just tried it on the VMs and it seems to work! I suspected it was related to some sort of threshold, but I couldn't figure out which one (I tried with the various metaslab/fragmentation thresholds, to no avail). Tomorrow (well, today...) I'll try it on a physical machine having the issue and report back.
EDIT: I finally tried it on a physical production machine and the high IO load is gone! sar stats show iowait decreased from ~10% to <1%, and receives are much faster now. Thank you so much.
Some questions:
- Is setting zfs_per_txg_dirty_frees_percent to 0 safe, or can it cause issues?
- I set send_holes_without_birth_time to 0 but nothing changed. Why?

Thanks.
@shodanshok
You can read about why zfs_per_txg_dirty_frees_percent is used (nonzero) in the PR that introduced it: https://github.com/openzfs/openzfs/pull/61
send_holes_without_birth_time=0 doesn't do anything for incremental sends if you have (or ever had) >~20 files in the filesystem, because object numbers may have been reused, which inhibits our ability to tell if the other side already has holes with birth time of 0. (note that send_holes_without_birth_time=0 does help with sparse zvols.)
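(send_holes_without_birth_time is also a module parameter, so for experiments it can be toggled the same way; a sketch, assuming the standard sysfs path, where 1 should be the default and means "always send holes as if their birth time were 0":)
# rely on recorded hole birth times instead of always sending birth time 0
echo 0 > /sys/module/zfs/parameters/send_holes_without_birth_time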
> is it now safe to use the hole_birth optimization, or does it remain bugged?
Hole birth (the zpool feature) is safe. I'm not aware of any bugs with send_holes_without_birth_time=0, but I'm also not sure how much testing that has received recently.
@alek-p Maybe the real fix would be to make dp_long_free_dirty_pertxg[] only include filled-in parts of the file that are being freed, not some huge range of already-free stuff.
@ahrens thanks for your insights. As you said, it would be great if the delete throttling logic accounted only for filled-in parts of the file rather than for already-sparse regions.
Regarding send_holes_without_birth_time, is the ~20 files limit a hard one? I can't find any reference to it. If so, I agree that it is basically a zvol-only feature. Anyway, on the testing pool (whose output is shown in the first post) I have a single file only, so it should make a difference, right? But nothing changed when I enabled it. Could it be due to the holes being generated at file creation time (i.e. via truncate)?
Thanks.
Hi, I can confirm we had the same issue and setting zfs_per_txg_dirty_frees_percent=0 fixed it. We are primarily replicating files with LUKS/ext4 filesystems in them (for unfortunate legacy reasons), and replication performance was less than expected: writes were very small and disk utilization sat at an almost constant 25%. After tuning to 0, disk utilization dropped to almost 0, write sizes increased, and transfer speed went up drastically. I'm very happy.
(Ubuntu 16.04.x, ZoL 0.7.5)
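For anyone who wants to keep the setting across reboots, the usual ZoL module-option route should work (a sketch; the file name just follows the common convention):
# /etc/modprobe.d/zfs.conf
options zfs zfs_per_txg_dirty_frees_percent=0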
As the problem is solved, I am going to close the issue. Special thanks to @ahrens for helping.