| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 17.04 |
| Linux Kernel | 4.10.0-22-generic |
| Architecture | x86_64 |
| ZFS Version | 0.6.5.9 |
| SPL Version | 0.6.5.9 |
I have found a data corruption issue in zfs send. In pools using 1M recordsize, incremental sends without the -L flag sometimes silently zero out some files. The results are repeatable. Scrub does not find any errors.
Tested on Ubuntu Xenial with 0.6.5.6 and Zesty with 0.6.5.9 on the same systems.
Source pool:
root@oddity:~# modinfo zfs
filename: /lib/modules/4.10.0-22-generic/kernel/zfs/zfs/zfs.ko
version: 0.6.5.9-2
license: CDDL
author: OpenZFS on Linux
description: ZFS
srcversion: 42C4AB70887EA26A9970936
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 4.10.0-22-generic SMP mod_unload
...
root@oddity:~# zpool status
pool: tank
state: ONLINE
scan: scrub canceled on Sun Jun 11 09:52:38 2017
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-ST4000VN000-1H4168_XXXXXXXX ONLINE 0 0 0
ata-ST4000VN000-1H4168_XXXXXXXX ONLINE 0 0 0
ata-ST4000VN000-2AH166_XXXXXXXX ONLINE 0 0 0
ata-ST4000VN000-1H4168_XXXXXXXX ONLINE 0 0 0
ata-ST4000VN000-1H4168_XXXXXXXX ONLINE 0 0 0
ata-ST4000VN000-1H4168_XXXXXXXX ONLINE 0 0 0
ata-ST4000VN000-1H4168_XXXXXXXX ONLINE 0 0 0
ata-ST4000VN000-1H4168_XXXXXXXX ONLINE 0 0 0
errors: No known data errors
root@oddity:~# zpool get all tank
NAME PROPERTY VALUE SOURCE
tank size 29T -
tank capacity 57% -
tank altroot - default
tank health ONLINE -
tank guid 18319514605431597227 default
tank version - default
tank bootfs - default
tank delegation on default
tank autoreplace off default
tank cachefile - default
tank failmode wait default
tank listsnapshots off default
tank autoexpand off default
tank dedupditto 0 default
tank dedupratio 1.00x -
tank free 12.3T -
tank allocated 16.7T -
tank readonly off -
tank ashift 12 local
tank comment - default
tank expandsize - -
tank freeing 0 default
tank fragmentation 5% -
tank leaked 0 default
tank feature@async_destroy enabled local
tank feature@empty_bpobj active local
tank feature@lz4_compress active local
tank feature@spacemap_histogram active local
tank feature@enabled_txg active local
tank feature@hole_birth active local
tank feature@extensible_dataset active local
tank feature@embedded_data active local
tank feature@bookmarks enabled local
tank feature@filesystem_limits enabled local
tank feature@large_blocks active local
root@oddity:~# zfs get all tank
NAME PROPERTY VALUE SOURCE
tank type filesystem -
tank creation Fri May 13 19:22 2016 -
tank used 11.8T -
tank available 8.13T -
tank referenced 222K -
tank compressratio 1.03x -
tank mounted yes -
tank quota none default
tank reservation none default
tank recordsize 1M local
tank mountpoint /tank default
tank sharenfs off default
tank checksum on default
tank compression lz4 local
tank atime off local
tank devices on default
tank exec on default
tank setuid on default
tank readonly off default
tank zoned off default
tank snapdir hidden default
tank aclinherit passthrough local
tank canmount on default
tank xattr sa local
tank copies 1 default
tank version 5 -
tank utf8only off -
tank normalization none -
tank casesensitivity mixed -
tank vscan off default
tank nbmand off default
tank sharesmb off default
tank refquota none default
tank refreservation none default
tank primarycache all default
tank secondarycache all default
tank usedbysnapshots 154K -
tank usedbydataset 222K -
tank usedbychildren 11.8T -
tank usedbyrefreservation 0 -
tank logbias latency default
tank dedup off default
tank mlslabel none default
tank sync standard default
tank refcompressratio 1.00x -
tank written 0 -
tank logicalused 12.8T -
tank logicalreferenced 41K -
tank filesystem_limit none default
tank snapshot_limit none default
tank filesystem_count none default
tank snapshot_count none default
tank snapdev hidden default
tank acltype posixacl local
tank context none default
tank fscontext none default
tank defcontext none default
tank rootcontext none default
tank relatime off default
tank redundant_metadata all default
tank overlay off default
The target pool:
root@ubackup:~# modinfo zfs
filename: /lib/modules/4.10.0-22-generic/kernel/zfs/zfs/zfs.ko
version: 0.6.5.9-2
license: CDDL
author: OpenZFS on Linux
description: ZFS
srcversion: 42C4AB70887EA26A9970936
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 4.10.0-22-generic SMP mod_unload
...
root@ubackup:~# zpool status
pool: btank
state: ONLINE
scan: scrub repaired 0 in 3h36m with 0 errors on Tue Jun 13 13:34:08 2017
config:
NAME STATE READ WRITE CKSUM
btank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX ONLINE 0 0 0
errors: No known data errors
root@ubackup:~# zpool get all btank
NAME PROPERTY VALUE SOURCE
btank size 24.5T -
btank capacity 23% -
btank altroot - default
btank health ONLINE -
btank guid 14601555808903550874 default
btank version - default
btank bootfs - default
btank delegation on default
btank autoreplace off default
btank cachefile - default
btank failmode wait default
btank listsnapshots off default
btank autoexpand off default
btank dedupditto 0 default
btank dedupratio 1.00x -
btank free 18.9T -
btank allocated 5.64T -
btank readonly off -
btank ashift 12 local
btank comment - default
btank expandsize - -
btank freeing 0 default
btank fragmentation 9% -
btank leaked 0 default
btank feature@async_destroy enabled local
btank feature@empty_bpobj active local
btank feature@lz4_compress active local
btank feature@spacemap_histogram active local
btank feature@enabled_txg active local
btank feature@hole_birth active local
btank feature@extensible_dataset active local
btank feature@embedded_data active local
btank feature@bookmarks enabled local
btank feature@filesystem_limits enabled local
btank feature@large_blocks active local
root@ubackup:~# zfs get all btank
NAME PROPERTY VALUE SOURCE
btank type filesystem -
btank creation Mon Jun 12 18:41 2017 -
btank used 5.01T -
btank available 16.1T -
btank referenced 171K -
btank compressratio 1.03x -
btank mounted yes -
btank quota none default
btank reservation none default
btank recordsize 1M local
btank mountpoint /btank default
btank sharenfs off default
btank checksum on default
btank compression lz4 local
btank atime off local
btank devices on default
btank exec on default
btank setuid on default
btank readonly off default
btank zoned off default
btank snapdir hidden default
btank aclinherit passthrough local
btank canmount on default
btank xattr sa local
btank copies 1 default
btank version 5 -
btank utf8only on -
btank normalization formD -
btank casesensitivity mixed -
btank vscan off default
btank nbmand off default
btank sharesmb off default
btank refquota none default
btank refreservation none default
btank primarycache all default
btank secondarycache all default
btank usedbysnapshots 0 -
btank usedbydataset 171K -
btank usedbychildren 5.01T -
btank usedbyrefreservation 0 -
btank logbias latency default
btank dedup off default
btank mlslabel none default
btank sync disabled local
btank refcompressratio 1.00x -
btank written 171K -
btank logicalused 5.18T -
btank logicalreferenced 40K -
btank filesystem_limit none default
btank snapshot_limit none default
btank filesystem_count none default
btank snapshot_count none default
btank snapdev hidden default
btank acltype posixacl local
btank context none default
btank fscontext none default
btank defcontext none default
btank rootcontext none default
btank relatime off default
btank redundant_metadata all default
btank overlay off default
While the issue was observed with multiple datasets, I'll focus on a smaller one. First, the source (some uneventful snapshots omitted):
root@oddity:~# zfs list -o space -r tank/dataz/Backup
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
tank/dataz/Backup 8.13T 126G 461K 126G 0 0
root@oddity:~# zfs list -t snapshot -r tank/dataz/Backup
NAME USED AVAIL REFER MOUNTPOINT
tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 0 - 125G -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900 0 - 125G -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000 205K - 125G -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100 0 - 126G -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 0 - 126G -
The initial send to ubackup was performed with the -L and -e flags (not sure if this is relevant). Performing an incremental send with the -L flag then produces the expected result (size differences are due to different pool geometry):
root@oddity:~# zfs send -L -e -I tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 | ssh ubackup "zfs receive btank/oddity/tank/dataz/Backup"
root@ubackup:~# zfs list -o space -r btank/oddity/tank/dataz/Backup
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
btank/oddity/tank/dataz/Backup 16.0T 133G 754K 133G 0 0
root@ubackup:~# zfs list -t snapshot -r btank/oddity/tank/dataz/Backup
NAME USED AVAIL REFER MOUNTPOINT
btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 14.2K - 131G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900 14.2K - 131G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000 156K - 131G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100 14.2K - 133G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 0 - 133G -
But repeating the same send without the -L flag results in corruption:
root@ubackup:~# zfs rollback -r btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100
root@oddity:~# zfs send -e -I tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 | ssh ubackup "zfs receive btank/oddity/tank/dataz/Backup"
root@ubackup:~# zfs list -o space -r btank/oddity/tank/dataz/Backup
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
btank/oddity/tank/dataz/Backup 16.0T 133G 19.0G 114G 0 0
root@ubackup:~# zfs list -t snapshot -r btank/oddity/tank/dataz/Backup
NAME USED AVAIL REFER MOUNTPOINT
btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 14.2K - 131G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900 14.2K - 131G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000 156K - 112G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100 14.2K - 114G -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 0 - 114G -
Notice how the REFER size drops and the USEDSNAP increases between the two runs. I looked around to find where the data had gone missing and found several random files whose reported size on disk was reduced to 512 bytes. Reading these files seems to return the full original size, but with content that is all zeros.
I repeated the whole process multiple times, also recreating the target pool, with the same result. The affected files are always the same, but I haven't found a common characteristic among them. Scrubbing both pools finds no errors. Syslog shows nothing uncommon. I have never noticed anything similar before. The only notable change I made recently was that I started using sa-based xattrs in May, but that might be unrelated.
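For anyone else trying to locate affected files, a rough check (adjust the path to the received dataset you want to scan) is to look for files whose apparent size is substantial but which occupy only a single 512-byte sector on disk:
# print files larger than 4 KiB whose on-disk usage (%b = 512-byte blocks)
# has collapsed to a single sector
find /btank/oddity/tank/dataz/Backup -type f -printf '%s %b %p\n' \
    | awk '$1 > 4096 && $2 <= 1'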
I am happy to provide more info as far as I'm able.
I think I found the unifying characteristic of all corrupted files: All of them are in directories that have been updated via file creation or deletion between snapshots. Which of the files in the directory get corrupted seems to be random, though.
I'll try to put something together. I'm busy for the next week though, so it'll take a while.
I can reproduce the issue consistently with this script, tested on two different machines (0.6.5.6 and 0.6.5.9):
#!/bin/bash
BASE_DS="tank/issue6224"
zfs create -o recordsize=1M -o compression=lz4 $BASE_DS
zfs create $BASE_DS/source
cd /$BASE_DS/source
wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
zfs snapshot $BASE_DS/source@snap1
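# full send WITH -L: the 1M blocks are preserved on the target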
zfs send -L -e $BASE_DS/source@snap1 | zfs receive $BASE_DS/target
cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
zfs snapshot $BASE_DS/source@snap2
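# incremental send WITHOUT -L: toggling the flag relative to the initial send triggers the corruption on receive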
zfs send -e -i $BASE_DS/source@snap1 $BASE_DS/source@snap2 | zfs receive $BASE_DS/target
After running, check the size on disk of the files in the target dataset using du or ncdu. Note that the first file occupies only 512 bytes on disk and its content is all zeros.
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
tank/issue6224 8.13T 10.5M 0 205K 0 10.3M
tank/issue6224/source 8.13T 5.10M 136K 4.97M 0 0
tank/issue6224/target 8.13T 5.22M 2.52M 2.70M 0 0
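For anyone re-running the script, a minimal verification step (assuming the dataset names and default mountpoints from the script above) is:
# apparent sizes still match on both sides...
du --apparent-size -sh /tank/issue6224/{source,target}/zfs-0.6.5.10.tar.gz
# ...but the copy on the target occupies only ~512 bytes on disk
du -sh /tank/issue6224/{source,target}/zfs-0.6.5.10.tar.gz
# and the contents differ (the target copy reads back as zeros)
cmp /tank/issue6224/source/zfs-0.6.5.10.tar.gz /tank/issue6224/target/zfs-0.6.5.10.tar.gz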
@zrav thank you for providing a working reproducer!
I've not had the time to debug this properly yet, but after a cursory reading of the code I think we're seeing the result of the following block in receive_object():
/*
 * If we are losing blkptrs or changing the block size this must
 * be a new file instance. We must clear out the previous file
 * contents before we can change this type of metadata in the dnode.
 */
if (err == 0) {
	int nblkptr;

	nblkptr = deduce_nblkptr(drro->drr_bonustype,
	    drro->drr_bonuslen);

	if (drro->drr_blksz != doi.doi_data_block_size ||
	    nblkptr < doi.doi_nblkptr) {
		err = dmu_free_long_range(rwa->os, drro->drr_object,
		    0, DMU_OBJECT_END);
		if (err != 0)
			return (SET_ERROR(EINVAL));
	}
}
When we send the incremental stream, the DRR_OBJECT record for the corrupted file is sent with a block size that differs from the one already on disk (which was written by the initial send with -L), so we "clear out the previous file contents":
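(In the zstreamdump commands below, inum is a small local helper rather than a standard tool; judging from the script trace later in this thread it is essentially the following:)
inum() { stat -c %i "$1"; }  # print the inode / object number of a file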
First send (blksz = 1M)
[root@centos ~]# zfs send -L $POOLNAME/source@snap1 | zstreamdump -vv | grep "object = `inum /mnt/$POOLNAME/source/file1.dat`"
OBJECT object = 2 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 176
FREE object = 2 offset = 5242880 length = -1
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
FREE object = 2 offset = 5242880 length = 1068498944
Incremental send (blksz = 128K)
[root@centos ~]# zfs send -i $POOLNAME/source@snap1 $POOLNAME/source@snap2 | zstreamdump -vv | grep "object = `inum /mnt/$POOLNAME/source/file1.dat`"
OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 176
FREE object = 2 offset = 5242880 length = -1
The resulting file on disk is:
[root@centos ~]# zdb -ddddddd -bbbbbbb $POOLNAME/target `inum /mnt/$POOLNAME/target/file1.dat`
Dataset testpool/target [ZPL], ID 76, cr_txg 18, 5.03M, 12 objects, rootbp DVA[0]=<0:1527000:200> DVA[1]=<0:1527200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=62L/62P fill=12 cksum=dd1da4093:545ff5a1b20:10a4e0a522bd7:2427c83816b784
Object lvl iblk dblk dsize dnsize lsize %full type
2 2 128K 128K 0 512 128K 0.00 ZFS plain file (K=inherit) (Z=inherit)
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /file1.dat
uid 0
gid 0
atime Sat Jun 24 07:26:16 2017
mtime Sat Jun 12 07:26:16 2018
ctime Sat Jun 12 07:26:16 2018
crtime Sat Jun 12 07:26:15 2018
gen 11
mode 100644
size 5242880
parent 34
links 1
pflags 40800000004
xattr 3
Indirect blocks:
0 L1 HOLE [L1 ZFS plain file] size=20000L birth=43L
The code I was mentioning earlier was integrated with 6c59307 (_Illumos 3693 - restore_object uses at least two transactions to restore an object_), even before f1512ee (_Illumos 5027 - zfs large block support_): if all this is true, I think every ZFS version since Illumos 5027 is affected, possibly on other platforms too.
This reproduces on current master and Illumos, which is quite troubling. A cross-platform reproducer is available here: https://gist.github.com/loli10K/af70fb0f1aae7b174822aa5657e07c28.
ZoL:
root@linux:~# cat /sys/module/zfs/version
0.7.0-rc4_67_g7e35ea783
root@linux:~# bash -x ./issue-6224.sh
+ POOLNAME=testpool
+ is_linux
++ uname
+ [[ Linux == \L\i\n\u\x ]]
+ return 0
+ TMPDIR=/var/tmp
+ mountpoint -q /var/tmp
+ zpool destroy testpool
+ fallocate -l 128m /var/tmp/zpool.dat
+ zpool create -O mountpoint=/mnt/testpool testpool /var/tmp/zpool.dat
+ zfs create -o recordsize=1M testpool/source
+ dd if=/dev/urandom of=/mnt/testpool/source/file1.dat bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.552689 s, 9.5 MB/s
+ zfs snapshot testpool/source@snap1
+ zfs send -L testpool/source@snap1
+ cat /var/tmp/full.dat
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ zstreamdump -v
+ grep 'object = 2'
OBJECT object = 2 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 168
FREE object = 2 offset = 5242880 length = -1
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
FREE object = 2 offset = 5242880 length = 1068498944
+ cat /var/tmp/full.dat
+ zfs receive -F testpool/target
+ cp /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat
5.1M /mnt/testpool/source/file1.dat
5.1M /mnt/testpool/source/full.dat
5.1M /mnt/testpool/target/file1.dat
+ zfs snapshot testpool/source@snap2
+ zfs send -i testpool/source@snap1 testpool/source@snap2
+ cat /var/tmp/incr.dat
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 2'
+ zstreamdump -v
OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 168
FREE object = 2 offset = 5242880 length = -1
+ cat /var/tmp/incr.dat
+ zfs receive -F testpool/target
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat /mnt/testpool/target/full.dat
5.1M /mnt/testpool/source/file1.dat
5.1M /mnt/testpool/source/full.dat
512 /mnt/testpool/target/file1.dat
5.1M /mnt/testpool/target/full.dat
++ inum /mnt/testpool/target/file1.dat
++ stat -c %i /mnt/testpool/target/file1.dat
+ zdb -ddddddd -bbbbbbb testpool/target 2
Dataset testpool/target [ZPL], ID 268, cr_txg 16, 5.03M, 8 objects, rootbp DVA[0]=<0:14fa000:200> DVA[1]=<0:14fa200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=50L/50P fill=8 cksum=db27241b5:541d80cfdbd:10b6c0498b32d:2497961710b23a
Object lvl iblk dblk dsize dnsize lsize %full type
2 2 128K 128K 0 512 128K 0.00 ZFS plain file (K=inherit) (Z=inherit)
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /file1.dat
uid 0
gid 0
atime Mon Jun 26 19:38:50 2017
mtime Mon Jun 26 19:38:49 2017
ctime Mon Jun 26 19:38:49 2017
crtime Mon Jun 26 19:38:49 2017
gen 9
mode 100644
size 5242880
parent 34
links 1
pflags 40800000004
Indirect blocks:
0 L1 HOLE [L1 ZFS plain file] size=20000L birth=35L
root@linux:~#
Illumos (SmartOS):
[root@52-54-00-d3-7a-01 ~]# uname -v
joyent_20170622T212149Z
[root@52-54-00-d3-7a-01 ~]# bash -x ./issue-6224.sh
+ POOLNAME=testpool
+ is_linux
++ uname
+ [[ SunOS == \L\i\n\u\x ]]
+ return 1
+ TMPDIR=/tmp
+ zpool destroy testpool
+ mkfile 128m /tmp/zpool.dat
+ zpool create -O mountpoint=/mnt/testpool testpool /tmp/zpool.dat
+ zfs create -o recordsize=1M testpool/source
+ dd if=/dev/urandom of=/mnt/testpool/source/file1.dat bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes transferred in 0.078813 secs (66523111 bytes/sec)
+ zfs snapshot testpool/source@snap1
+ zfs send -L testpool/source@snap1
+ cat /tmp/full.dat
+ zstreamdump -v
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 8'
OBJECT object = 8 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 168
FREE object = 8 offset = 5242880 length = -1
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
FREE object = 8 offset = 5242880 length = 1068498944
+ cat /tmp/full.dat
+ zfs receive -F testpool/target
+ cp /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat
5.0M /mnt/testpool/source/file1.dat
5.0M /mnt/testpool/source/full.dat
5.0M /mnt/testpool/target/file1.dat
+ zfs snapshot testpool/source@snap2
+ zfs send -i testpool/source@snap1 testpool/source@snap2
+ cat /tmp/incr.dat
+ zstreamdump -v
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 8'
OBJECT object = 8 type = 19 bonustype = 44 blksz = 131072 bonuslen = 168
FREE object = 8 offset = 5242880 length = -1
+ cat /tmp/incr.dat
+ zfs receive -F testpool/target
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat /mnt/testpool/target/full.dat
5.0M /mnt/testpool/source/file1.dat
5.0M /mnt/testpool/source/full.dat
0K /mnt/testpool/target/file1.dat
5.0M /mnt/testpool/target/full.dat
++ inum /mnt/testpool/target/file1.dat
++ stat -c %i /mnt/testpool/target/file1.dat
+ zdb -ddddddd -bbbbbbb testpool/target 8
Dataset testpool/target [ZPL], ID 51, cr_txg 15, 5.03M, 9 objects, rootbp DVA[0]=<0:14f8200:200> DVA[1]=<0:14f8400:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=49L/49P fill=9 cksum=b47a0b8ce:44a5957f452:d88c20f8e067:1d7addab9d69db
Object lvl iblk dblk dsize lsize %full type
8 2 128K 128K 0 128K 0.00 ZFS plain file (K=inherit) (Z=inherit)
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path /file1.dat
uid 0
gid 0
atime Mon Jun 26 17:40:50 2017
mtime Mon Jun 26 17:40:50 2017
ctime Mon Jun 26 17:40:50 2017
crtime Mon Jun 26 17:40:50 2017
gen 8
mode 100644
size 5242880
parent 4
links 1
pflags 40800000004
Indirect blocks:
0 L1 HOLE [L1 ZFS plain file] size=20000L birth=34L
[root@52-54-00-d3-7a-01 ~]#
FWIW, although the cross-platform reproducer above shows this affects Illumos too, @lundman notes that it hasn't made it to the Slack and wonders if @ahrens has seen this.
(Reproduces in openzfsonosx too, and annoyingly I have a few affected files in backups that thankfully I haven't had to restore from.)
@behlendorf any particular reason this wasn't revisited?
Available developer resources. I'll add it to the 0.8 milestone so we don't lose track of it.
Just for reference, a similar issue was reported:
@loli10K that issue is related but different. I would guess the problem here is with turning -L on/off between incrementals. The incremental will report the drr_blksz in the OBJECT record as different from what is on the receiving system, so the recv assumes that the object must have been freed and reallocated, because the block size can't change on the sending system (if there are multiple blocks).
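A quick way to see what a given stream will ask the receiver to do (dataset and snapshot names are placeholders) is to compare the blksz reported in the OBJECT records with and without -L:
# with -L the OBJECT records carry the original (large) block size, e.g. blksz = 1048576;
# without -L the sender splits large blocks and reports blksz = 131072
zfs send -L -i pool/fs@snap1 pool/fs@snap2 | zstreamdump -v | grep OBJECT
zfs send    -i pool/fs@snap1 pool/fs@snap2 | zstreamdump -v | grep OBJECT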
The solution might be a bit tricky, but I think it will have to involve disallowing some kinds of zfs receive.
@pcd1193182 or @tcaputi also have some experience with zfs send/recv.
We are having the same kind of issue in https://github.com/openzfs/openzfs/pull/705... I don't have a solution right at this moment (other than simply forcing zfs or the user to use -L if the filesystem uses large blocks). Both of these issues stem from the fact that we are trying to infer whether an object was deleted from a few properties of the object, but this would be much more solid if we could somehow know for sure that this happened. I'll try to look into it more today.
I just tried this on an encrypted dataset with raw sending. Since I use a legacy mountpoint, I altered the script to:
#!/usr/bin/env bash
set -x
BASE_DS="tankSubi/encZFS/issue6224"
MNT="/tmp/test-6224"
zfs create -o recordsize=1M -o compression=lz4 $BASE_DS
zfs create $BASE_DS/source
mkdir -p $MNT
mount -t zfs $BASE_DS/source $MNT
cd $MNT
wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
zfs snapshot $BASE_DS/source@snap1
zfs send -L -e --raw $BASE_DS/source@snap1 | zfs receive $BASE_DS/target
cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
zfs snapshot $BASE_DS/source@snap2
zfs send -e --raw -i $BASE_DS/source@snap1 $BASE_DS/source@snap2 | zfs receive $BASE_DS/target
zfs list -t snapshot -r $BASE_DS
and it produced:
root@subi:/tmp# ./6224
+ BASE_DS=tankSubi/encZFS/issue6224
+ MNT=/tmp/test-6224
+ zfs create -o recordsize=1M -o compression=lz4 tankSubi/encZFS/issue6224
+ zfs create tankSubi/encZFS/issue6224/source
+ mkdir -p /tmp/test-6224
+ mount -t zfs tankSubi/encZFS/issue6224/source /tmp/test-6224
+ cd /tmp/test-6224
+ wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
--2018-11-07 20:13:24-- https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
Resolving github.com (github.com)... 140.82.118.4, 140.82.118.3
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/437011/79cf5bd6-5109-11e7-8255-038691a19c2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20181107%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20181107T191325Z&X-Amz-Expires=300&X-Amz-Signature=47b1a953908e5f4436159c482f08d2c65c2d417ae94afab36bea78134d24f80c&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dzfs-0.6.5.10.tar.gz&response-content-type=application%2Foctet-stream [following]
--2018-11-07 20:13:25-- https://github-production-release-asset-2e65be.s3.amazonaws.com/437011/79cf5bd6-5109-11e7-8255-038691a19c2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20181107%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20181107T191325Z&X-Amz-Expires=300&X-Amz-Signature=47b1a953908e5f4436159c482f08d2c65c2d417ae94afab36bea78134d24f80c&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dzfs-0.6.5.10.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)... 52.216.134.19
Connecting to github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)|52.216.134.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2597676 (2.5M) [application/octet-stream]
Saving to: ‘zfs-0.6.5.10.tar.gz’
zfs-0.6.5.10.tar.gz 100%[========================================================================================================================================>] 2.48M 3.26MB/s in 0.8s
2018-11-07 20:13:26 (3.26 MB/s) - ‘zfs-0.6.5.10.tar.gz’ saved [2597676/2597676]
+ zfs snapshot tankSubi/encZFS/issue6224/source@snap1
+ zfs send -L -e --raw tankSubi/encZFS/issue6224/source@snap1
+ zfs receive tankSubi/encZFS/issue6224/target
+ cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
+ zfs snapshot tankSubi/encZFS/issue6224/source@snap2
+ zfs receive tankSubi/encZFS/issue6224/target
+ zfs send -e --raw -i tankSubi/encZFS/issue6224/source@snap1 tankSubi/encZFS/issue6224/source@snap2
+ zfs list -t snapshot -r tankSubi/encZFS/issue6224
NAME USED AVAIL REFER MOUNTPOINT
tankSubi/encZFS/issue6224/source@snap1 136K - 2.69M -
tankSubi/encZFS/issue6224/source@snap2 0B - 5.21M -
tankSubi/encZFS/issue6224/target@snap1 120K - 2.67M -
tankSubi/encZFS/issue6224/target@snap2 0B - 5.20M -
Also, after mounting, the data seems to be fine:
root@subi:/tmp# mkdir snp2
root@subi:/tmp# zfs load-key tankSubi/encZFS/issue6224/target
Enter passphrase for 'tankSubi/encZFS/issue6224/target':
root@subi:/tmp# mount -t zfs tankSubi/encZFS/issue6224/target /tmp/snp2
root@subi:/tmp# cd snp2/
root@subi:/tmp/snp2# ls -al
total 5114
drwxr-xr-x 2 root root 4 Nov 7 20:13 .
drwxrwxrwt 24 root root 860 Nov 7 20:17 ..
-rw-r--r-- 1 root root 2597676 Jun 14 2017 zfs-0.6.5.10.tar.gz
-rw-r--r-- 1 root root 2597676 Nov 7 20:13 zfs-0.6.5.10.tar.gz.1
It seems that this bug is not triggered when using a raw send on an encrypted dataset.
Right, because --raw sends the data exactly as it is on disk. This implies not changing the block size, which is what -L does.
raw sends imply -L for large-block datasets. Encrypted data cannot be split into smaller blocks.
ok, thanks for the explanation.
@behlendorf Isn't this resolved by #8668?
@jwittlincohen I just tested 0.8.0-rc5 with my little reproducer from above, and unfortunately the problem isn't fixed. The difference is that now the zeroed files are shown with their correct size instead of 512 bytes.
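(That also means the du/ncdu check from the reproducer above no longer flags the affected file; comparing content directly still does, for example, with the paths from the script above:)
cmp /tank/issue6224/source/zfs-0.6.5.10.tar.gz /tank/issue6224/target/zfs-0.6.5.10.tar.gz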
Will this data corruption bug linger forever?
@scineram I guess time will tell. Would you like to provide input on prioritizing this bug, by describing the impact it has for you?
@ahrens Any user unaware of this issue could potentially experience data loss. I don't think we should have to provide arguments why a data corruption bug should be fixed, especially in ZFS.
@zrav We all agree it should be fixed. Adding your experience helps to motivate others to do that work.
I could imagine a solution which makes it an error to toggle -L, like the following:
- If the stream has DMU_BACKUP_FEATURE_LARGE_BLOCKS, always activate SPA_FEATURE_LARGE_BLOCKS for this new dataset.
- If an incremental stream's DMU_BACKUP_FEATURE_LARGE_BLOCKS does not match the dataset's SPA_FEATURE_LARGE_BLOCKS, fail the receive.
However, this wouldn't exactly work for pre-existing datasets, which might not have SPA_FEATURE_LARGE_BLOCKS activated even though they were received from a DMU_BACKUP_FEATURE_LARGE_BLOCKS stream (if they didn't happen to actually have any large blocks). We could fix this by adding a new flag to the dataset to indicate that SPA_FEATURE_LARGE_BLOCKS will really be set (and if the new flag is not set, we can't do the check when receiving an incremental).
An alternate solution would be to make it an error to send a dataset that has large blocks present without also specifying -L. This is the cleanest solution, since we could remove the questionable "split_large_blocks" code that's causing this bug. But if you've already done a receive that split large blocks, you'd essentially be forced to toggle -L on the next incremental to it, potentially triggering this bug.
@ahrens My experience is that I got lucky and narrowly avoided losing both personal and company data.
The last alternative solution you propose is clean, but has the downside of closing the only route we currently have to convert a large block dataset down to regular block sizes.
Another option I see is having large-block datasets be sent with -L by default (making -L redundant), and introducing another send flag to reduce the blocks to non-large if desired. That wouldn't fix the bug on the receive side, but it would at least stop data corruption caused by forgetting to set a parameter.
I still wonder what the technical reason is that zfs send/recv is unable to change the block size, which is effectively what causes this bug.
Why can't we specify that _this stream_ be recv'd into _this_ dataset with recordsize=x and have the data written as if the operation were made from userland using standard filesystem tools (like dd and truncate), meaning existing files keep the recordsize they already have while new files are created (and written) with the recordsize currently set for the dataset (or an override specified on recv, so that a full stream can be modified in this regard)?
The WRITE and FREE records in the stream address data using object ID, offset, and length.
Shouldn't it be _completely irrelevant_ what on-disk layout (or recordsize) the target dataset has?
@ahrens What am I missing?
Before this is solved, can someone mention it and its mitigations on https://github.com/zfsonlinux/zfs/wiki/FAQ?
I'm a bit confused about the state of encryption. This isn't a gripe, just a question: does ZoL consider encryption production ready? I only ask because the topic of Freenode's #zfsonlinux (last updated in Sep 2019) says no: "Native encryption is not production ready, see #6224."
Since it is a single-# channel on Freenode, it is fairly safe to assume it is an official channel with an official topic... but I get the impression from the activity on this ticket that it isn't considered a show-stopper for encryption being "production ready".
I'm asking because when people ask me about ZFS + Encryption, I point them to the topic of #zfsonlinux. If that isn't the thing to do, I'd like to stop :). Again, I have no gripe or annoyance -- I'm just looking for some clarification on what the team thinks.
@grahamc thank you for pointing this out. I can completely understand your confusion, we've gone ahead and updated the IRC topic to be more accurate.
There's also some unfortunate confusion surrounding this particular issue which is in fact not related to encryption. This is a send/recv issue which can occur when sending incremental datasets which contain large blocks and inconsistently setting the zfs send -L flag between incremental sends. In fact, raw encrypted send/recvs automatically imply the -L flag and are therefore unaffected.
To circle back to your original question. Yes, we consider the encryption feature to be ready and safe for production use. It's currently being used every day in a variety of environments. That said, there is still work underway to better integrate this feature with systemd and to optimize performance. To the best of our knowledge there are no critical encryption bugs. The open encryption issues are primarily feature requests, performance questions, or relatively minor uncommon issues.
The proposed fixes for this are a little tricky, but the issue can be avoided by consistent use of the zfs send -L flag. I hope that helps clear things up.
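For a replication chain started from scratch, that means using the same flags on the full send and on every later incremental; for example (dataset and host names are illustrative):
zfs send -L -e tank/data@snap1 | ssh backup zfs receive btank/data
zfs send -L -e -i tank/data@snap1 tank/data@snap2 | ssh backup zfs receive btank/data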
Could that make its way to the FAQ? It would be easier for new users to discover there than in this issue.
I suspect I have experienced this same bug on Illumos, but without using '-L' on zfs send. In my case, the sending system has 128k recordsize and the receiving system has 1M recordsize. When replicating the other direction the problem does not happen.
The problem doesn't reproduce as easily, but over the course of about 1 month, I had 4.7 million files get replaced by NULLs on the receiving zfs folder. Thank god the snapshot copies were good.
ZFS scrub doesn't detect any problems on either end of the replication. Live files will be different.
Files don't have to be created or modified on the sending system to be affected in my case.
@lschweiss-wustl If you are not changing the -L setting, I don't see how this would affect you. The target system's recordsize setting shouldn't have anything to do with it. You can verify that your files have the same actual block size on the source and target systems (both should be 128K). You can do this with stat <file> (look at IO Block: output), or zdb.
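For example (paths and dataset names are placeholders):
stat /pool/fs/file.dat | grep 'IO Block'            # block size the file reports through the ZPL
zdb -dddd pool/fs $(stat -c %i /pool/fs/file.dat)   # the dblk column shows the on-disk block size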
I still can't say if this is the same bug, but it definitely appears to be a zfs send/receive bug when block sizes are not the same also resulting in NULL filled files.
The block size apparently changed on the receiving end when the file was corrupted.
Walking through an example file that was corrupted this way:
On the receiving pool at the time of corruption, recordsize=1M. The content is all nulls. The block count and size changed.
# head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
# stat /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
File: /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
Size: 281070 Blocks: 1 IO Block: 131072 regular file
Device: 4350019h/70582297d Inode: 1043386 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 603/ UNKNOWN) Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
Birth: 2019-10-29 07:29:18.399569973 -0500
In a snapshot on the same pool the content is correct:
# head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
Size: 281070 Blocks: 35 IO Block: 281088 regular file
Device: 435011fh/70582559d Inode: 1043386 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 603/ UNKNOWN) Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
Birth: 2019-10-29 07:29:18.399569973 -0500
I'm not sure this is the same bug; however, it certainly appears to be a zfs send/receive bug when block sizes are not the same at each end.
On the recordsize=128K pool, the file never changed at the time the receiving pool corrupted the file:
# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
Size: 281070 Blocks: 117 IO Block: 131072 regular file
Device: 43509b5h/70584757d Inode: 1043386 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 603/ UNKNOWN) Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
Birth: 2019-10-29 07:29:18.399569973 -0500
# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-08_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-08_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
Size: 281070 Blocks: 117 IO Block: 131072 regular file
Device: 43509b6h/70584758d Inode: 1043386 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 603/ UNKNOWN) Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
Birth: 2019-10-29 07:29:18.399569973 -0500
2019-11-07 is a significant date in this dataset's history. On that day the active pool was switched and the replication reversed. There were millions of files corrupted between 2019-11-07 and 2019-12-17 when the replication was flipped again.
I'll be happy to file this as a different Illumos bug if there is good evidence this is truly a different bug.
@mailinglists35 added to FAQ, https://github.com/zfsonlinux/zfs/wiki/FAQ#sending-large-blocks
Would it make sense to change the default behavior to include -L?
Because we keep having people every so often report getting burned by this, astonished that ZFS mangled their data silently, and this doesn't seem to be getting fixed anytime soon.
@ahrens @behlendorf
Added to the FAQ that likely _no one_ memorizes back to front?
Come on guys, silent data corruption in the replication/backup mechanism of a filesystem: it can't get _much_ worse, can it? Why isn't this being treated as a showstopper?
What solution do you propose, @GregorKopka?
Not Gregor, but -L by default seems like a reasonable mitigation until the underlying bug is fixed, to me - a --no-L for people who know what they're doing might be needed, but IMO leaving the unsafe option as a default when there's known silent corruption if you use it in certain circumstances seems pretty bad. (See also: defaulting send_holes_without_birth_time to 1 versus 0.)
@rincebrain The current default is safe (even if the implementation is a bit clunky). The issue is with switching between --no-L and -L (or vice versa). So switching the default without any other mitigating changes seems like a very bad idea, since it would cause uses that previously were fine (no flags) to now hit this.
Oh, I see, I had misunderstood this and thought it was more general than that, from e.g. @lschweiss-wustl's report, though reading more, I guess that would warrant a distinct bug.
I'd still like a way to have a safe default so I can't accidentally forget a flag one time and silently mangle my data, but it seems like it might not be worth it to implement such a parameter, since most people are using management tools and can just flip the default sets of flags.
It'd be nice if this made the list for e.g. OpenZFS 2.0, since it's been around from the 0.6.5.X days.
What solution do you propose, @GregorKopka?
@ahrens One that is free of data loss. I'm still clueless why zfs cares about on-disk block sizes on send/recv; would you be so kind as to answer https://github.com/openzfs/zfs/issues/6224#issuecomment-548631814? That question somehow stands between me and a solution.
@GregorKopka well, for starters, it would involve reimplementing large chunks of the receive logic, and the performance would probably be substantially worse. And it would be pretty messy when interacting with encryption and raw sends. It's not really compatible with the design of redaction, either. It would remove large amounts of the benefits of compressed send/recv. Those are just off the top of my head.
How about rejecting the stream immediately if the dataset uses large blocks, but the send was without -L. Would that be difficult to implement?
This bug is terrible for ZFS' reputation.
The FAQ says "When sending incremental streams which contain large blocks (>128K) the --large-block flag must be specified." I use 1MB block size everywhere, so the FAQ makes me think I should just go add -L to all my "zfs sync" jobs, but wouldn't that actually cause me to be flipping -L off/on and cause corruption?
What happens when using 'zfs send -t'? According to the docs the -L option is not available there; will it just produce corrupted datasets if the original (interrupted) send had -L?
zfs send [-Penv] -t receive_resume_token
Creates a send stream which resumes an interrupted receive. The receive_resume_token is the value of this property on the filesystem or volume that was being
received into. See the documentation for zfs receive -s for more details.
I've tested that 'zfs send -L -t' works, so maybe this is just a documentation issue.
If using -L is the right thing to do all the time, even if one has dataset syncs that started without -L, then sounds like the right thing to do here is to add -L by default on zfs send all the time, or at least when the sender zfs recordsize is over 128K.
@scineram I think that would work, as a partial solution.
If the dataset does not have large blocks, but the stream does, we don't currently have a way of knowing whether that's because the earlier send split the large blocks (in which case we should reject this receive), or the earlier snapshot didn't happen to have any large blocks in it (in which case we should accept it).
See my earlier comment https://github.com/openzfs/zfs/issues/6224#issuecomment-548445813 for more thoughts on this.
@gerardba
What happens when using 'zfs send -t'
Unfortunately, it looks like zfs send -Lt will add the large-block flag even if the token doesn't have it set, thus allowing you to toggle -L and hit this bug.
wouldn't that actually cause me to be flipping -l Off/On and cause corruption?
Yes. One difficulty of dealing with this issue is that the current default is not the recommended way to do sends of filesystems with large blocks, but switching to the recommended way can hit this bug.
It seems like there is a lot of duplication of ideas and discussion going on here and elsewhere, so I'm going to attempt to consolidate and summarize all the various suggestions that have been made, and highlight the problems with each of them. This will hopefully help people understand why this issue hasn't been resolved yet; it's not just a matter of implementing "[a solution] that is free of data loss", since none of the proposed solutions avoids data loss in all cases for all users.
Personally, I would advocate that we immediately move forwards with #4; it is the only perfect long-term solution to this problem and others like it that we may encounter later on. The sooner we get the feature into zfs, the more data will have the creation_txg in it.
For all existing data, I think that the best behavior is probably a combination of 1b and 2, protecting existing users with large-block datasets who have used -L, users who have small-block datasets, and requiring large-block users who have not used -L thus far to restart their sends or requiring them to deliberately use a zstream split-blocks feature. Hopefully having to use a separate command line utility is sufficiently abnormal for the zfs send workflow that it will give users pause, and prevent them from encountering this bug, especially if adequate explanation is placed in the man pages/usage messages for various commands. Unfortunately, my guess is that there are a lot of users who fall into the large-block-without-dash-L camp, and this solution will inconvenience all of them. I don't know how to fix that.
If there are solutions that people proposed/like that they did not address, feel free to let me know and I will attempt to keep this up to date. If you see a way around one of the issues, or a new solution, again let me know and I will try to add/address it. Hopefully this helps people understand the current state of this bug and our attempts to come up with an ideal solution.
Thanks for the great write-up, @pcd1193182!
I realized there's a simpler way to do #4 (Improve/perfect detection of whether an object needs to be reallocated) which allows us to implement #2 perfectly (reject receives that toggle -L).
The ZPL already has a per-file generation number (unique for each actual file, even if they have the same object number), which it stores in the bonus buffer, which is included in the OBJECT record. We can use the generation number to tell if a file's block size changed due to toggling -L, and reject the receive in that case. Using the ZPL's generation number is a slight layering violation (send/receive operates at the DMU level and otherwise doesn't know about object contents / ZPL state), but I think that using it as an error-detection mechanism is fair, especially compared to the alternatives.
I think we can also orthogonally consider 1b (make -L the default), once we have the safety of the above solution, which I'm working on implementing.
@ahrens Do we want to reject the receive in this case? Or do we just want to correctly reallocate the dnode (as the heuristic should have done)?
After discussion with @tcaputi, I think I have a way to correctly process send streams that change to have -L, and reject send streams that change to not have -L. That would give us a better path forward towards having -L be the default (or only) option in a future release (likely after 2.0).
@pcd1193182 Thank you for taking the time to compile the options.
As far as I understand the on-disk format and code, the root of the issue is in the details of how objects are stored in ObjSets:
Each object within an object set is uniquely identified by a 64-bit integer called an object number. An object's "object number" identifies the array element, in the dnode array, containing this object's dnode_phys_t.
which means that removing objects from the set leaves gaps in the dnode array. These gaps are then (I presume) greedily reused by ZFS to avoid growing that array unless absolutely needed, so an "object number" is ambiguous in a zfs send stream, which led to the decision to have zfs recv detect recycling of an object number by looking for a changed record size.
And finally, toggling the zfs send -L option wrongly triggers that heuristic.
Did I understand that correctly so far?
If yes: Do we _really need_ that check for reuse of object numbers? Shouldn't the dead list when sending the snapshot take care of removing all stale data (that belonged to the removed object) from the destination?
Apart from that...
I still think that ZFS send/recv should have the ability to convert datasets from one record-/volblocksize to another. Machines get retired, and pools get transferred to new and different hardware as technology advances. If ZFS is meant to be the _last word in filesystems_... shouldn't it have the ability to remap the block size of the data in existing datasets (without sacrificing the snapshot history) to benefit from the abilities of the destination system's hardware as much as possible?
Wouldn't it make sense to be able to instruct a backup system using shingled drives to store data in 8MB (just to pull a number from thin air) records that align with the native write stripe size of the drives, regardless of the block size of the source? And in case of a transfer of such a backup to another system where a database daemon is to access and modify the dataset in the DB's native block size of (say) 16KiB... being able to recv that dataset with such a recordsize?
@GregorKopka
Did I understand that correctly so far?
That's right.
Do we really need that check for reuse of object numbers?
We need it so that the new object is set up correctly before we start getting WRITE records, e.g. so that it has the correct block size and dnode size (number of "slots").
I still think that ZFS send/recv should have the ability to convert datasets from one record-/volblocksize to another.
I hear that you would like us to implement that new feature. It's a reasonable thing to want, but not straightforward to implement. Right now I am prioritizing this data-loss bug.
I agree with your priorities.
My question is: will the ZPL's generation number be deterministic enough to completely solve this issue, _or_ will it just trade this issue for a more convoluted and obscure variation?
Sorry to hijack this but I think I've been bitten by this bug on a replication between two zfs pools (running FreeNAS). Is there _any_ recovery path once corruption is established or is the only option to destroy the replicated pool and start from scratch?
Also, pardon the necessarily stupid question, but how come a zeroed file doesn't trigger any checksum error upon reading? I was under the impression that zfs send/recv carries all the checksumming info necessary to confirm that what's written on the recv end exactly matches what was sent; am I mistaken?
Thanks
@f00b4r0 I'm not aware of any recovery path.
The send stream is itself checksummed, but it doesn't include checksums of every block of every file.
@f00b4r0 I'm not aware of any recovery path.
sigh, thanks. Not what I was hoping for, but not entirely unexpected.
The send stream is itself checksummed, but it doesn't include checksums of every block of every file.
This sounds like a design flaw: including them would have immediately exposed this bug. Or am I missing something?
@f00b4r0 If the receiving filesystem has snapshots from before the corruption, the data should still be in the snapshots. The trick is locating all the null-filled files and the clean snapshot. I recovered millions of files from my snapshot history with some scripting work.
It is best if you have a clean copy that wasn't affected by this bug and redo the replicated copy.
@lschweiss-wustl thanks. It looks like I'll have to start from scratch. Then again that bug bites so hard, I'm not sure I can ever trust send/recv again. At least rsync doesn't silently wipe data ;P
Sorry if asking something obvious, but I have a question: sending a dataset from a pool with large_block enabled, and with 1M recordsize datasets, to a non-large-block-enabled pool without ever using -L, am I right saying that I would not trigger the bug? What about the inverse scenario (sending from 128K block dataset/pool to a large_block enabled pool)?
In other words: if I never used -L I am safe, right?
Thanks.
@shodanshok
sending a dataset from a pool with large_block enabled, and with 1M recordsize datasets, to a non-large-block-enabled pool without ever using -L, am I right saying that I would not trigger the bug?
That's right. The bug is about changing the -L setting in an incremental send.
What about the inverse scenario (sending from 128K block dataset/pool to a large_block enabled pool)?
In this scenario you don't have any large blocks, so it doesn't matter. You can even toggle -L, because it's a no-op if you don't have large blocks.
if I never used -L I am safe, right?
Correct.