ZFS: File incorrectly zeroed when receiving incremental stream that toggles -L

Created on 13 Jun 2017  ·  59 Comments  ·  Source: openzfs/zfs

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 17.04 |
| Linux Kernel | 4.10.0-22-generic |
| Architecture | x86_64 |
| ZFS Version | 0.6.5.9 |
| SPL Version | 0.6.5.9 |

Describe the problem you're observing

I have found a data corruption issue in zfs send. In pools using 1M recordsize, incremental sends without the -L flag sometimes silently zero out some files. The results are repeatable. Scrub does not find any errors.

Tested on Ubuntu Xenial with 0.6.5.6 and Zesty with 0.6.5.9 on the same systems.

Describe how to reproduce the problem

Source pool:

root@oddity:~# modinfo zfs
filename:       /lib/modules/4.10.0-22-generic/kernel/zfs/zfs/zfs.ko
version:        0.6.5.9-2
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     42C4AB70887EA26A9970936
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       4.10.0-22-generic SMP mod_unload
...
root@oddity:~# zpool status
  pool: tank
 state: ONLINE
  scan: scrub canceled on Sun Jun 11 09:52:38 2017
config:

        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-2AH166_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0

errors: No known data errors

root@oddity:~# zpool get all tank
NAME  PROPERTY                    VALUE                       SOURCE
tank  size                        29T                         -
tank  capacity                    57%                         -
tank  altroot                     -                           default
tank  health                      ONLINE                      -
tank  guid                        18319514605431597227        default
tank  version                     -                           default
tank  bootfs                      -                           default
tank  delegation                  on                          default
tank  autoreplace                 off                         default
tank  cachefile                   -                           default
tank  failmode                    wait                        default
tank  listsnapshots               off                         default
tank  autoexpand                  off                         default
tank  dedupditto                  0                           default
tank  dedupratio                  1.00x                       -
tank  free                        12.3T                       -
tank  allocated                   16.7T                       -
tank  readonly                    off                         -
tank  ashift                      12                          local
tank  comment                     -                           default
tank  expandsize                  -                           -
tank  freeing                     0                           default
tank  fragmentation               5%                          -
tank  leaked                      0                           default
tank  feature@async_destroy       enabled                     local
tank  feature@empty_bpobj         active                      local
tank  feature@lz4_compress        active                      local
tank  feature@spacemap_histogram  active                      local
tank  feature@enabled_txg         active                      local
tank  feature@hole_birth          active                      local
tank  feature@extensible_dataset  active                      local
tank  feature@embedded_data       active                      local
tank  feature@bookmarks           enabled                     local
tank  feature@filesystem_limits   enabled                     local
tank  feature@large_blocks        active                      local

root@oddity:~# zfs get all tank
NAME  PROPERTY              VALUE                  SOURCE
tank  type                  filesystem             -
tank  creation              Fri May 13 19:22 2016  -
tank  used                  11.8T                  -
tank  available             8.13T                  -
tank  referenced            222K                   -
tank  compressratio         1.03x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            1M                     local
tank  mountpoint            /tank                  default
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           lz4                    local
tank  atime                 off                    local
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  zoned                 off                    default
tank  snapdir               hidden                 default
tank  aclinherit            passthrough            local
tank  canmount              on                     default
tank  xattr                 sa                     local
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       mixed                  -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       154K                   -
tank  usedbydataset         222K                   -
tank  usedbychildren        11.8T                  -
tank  usedbyrefreservation  0                      -
tank  logbias               latency                default
tank  dedup                 off                    default
tank  mlslabel              none                   default
tank  sync                  standard               default
tank  refcompressratio      1.00x                  -
tank  written               0                      -
tank  logicalused           12.8T                  -
tank  logicalreferenced     41K                    -
tank  filesystem_limit      none                   default
tank  snapshot_limit        none                   default
tank  filesystem_count      none                   default
tank  snapshot_count        none                   default
tank  snapdev               hidden                 default
tank  acltype               posixacl               local
tank  context               none                   default
tank  fscontext             none                   default
tank  defcontext            none                   default
tank  rootcontext           none                   default
tank  relatime              off                    default
tank  redundant_metadata    all                    default
tank  overlay               off                    default

The target pool:

root@ubackup:~# modinfo zfs
filename:       /lib/modules/4.10.0-22-generic/kernel/zfs/zfs/zfs.ko
version:        0.6.5.9-2
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     42C4AB70887EA26A9970936
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       4.10.0-22-generic SMP mod_unload
...
root@ubackup:~# zpool status
  pool: btank
 state: ONLINE
  scan: scrub repaired 0 in 3h36m with 0 errors on Tue Jun 13 13:34:08 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        btank                                         ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0

errors: No known data errors
root@ubackup:~# zpool get all btank
NAME   PROPERTY                    VALUE                       SOURCE
btank  size                        24.5T                       -
btank  capacity                    23%                         -
btank  altroot                     -                           default
btank  health                      ONLINE                      -
btank  guid                        14601555808903550874        default
btank  version                     -                           default
btank  bootfs                      -                           default
btank  delegation                  on                          default
btank  autoreplace                 off                         default
btank  cachefile                   -                           default
btank  failmode                    wait                        default
btank  listsnapshots               off                         default
btank  autoexpand                  off                         default
btank  dedupditto                  0                           default
btank  dedupratio                  1.00x                       -
btank  free                        18.9T                       -
btank  allocated                   5.64T                       -
btank  readonly                    off                         -
btank  ashift                      12                          local
btank  comment                     -                           default
btank  expandsize                  -                           -
btank  freeing                     0                           default
btank  fragmentation               9%                          -
btank  leaked                      0                           default
btank  feature@async_destroy       enabled                     local
btank  feature@empty_bpobj         active                      local
btank  feature@lz4_compress        active                      local
btank  feature@spacemap_histogram  active                      local
btank  feature@enabled_txg         active                      local
btank  feature@hole_birth          active                      local
btank  feature@extensible_dataset  active                      local
btank  feature@embedded_data       active                      local
btank  feature@bookmarks           enabled                     local
btank  feature@filesystem_limits   enabled                     local
btank  feature@large_blocks        active                      local
root@ubackup:~# zfs get all btank
NAME   PROPERTY              VALUE                  SOURCE
btank  type                  filesystem             -
btank  creation              Mon Jun 12 18:41 2017  -
btank  used                  5.01T                  -
btank  available             16.1T                  -
btank  referenced            171K                   -
btank  compressratio         1.03x                  -
btank  mounted               yes                    -
btank  quota                 none                   default
btank  reservation           none                   default
btank  recordsize            1M                     local
btank  mountpoint            /btank                 default
btank  sharenfs              off                    default
btank  checksum              on                     default
btank  compression           lz4                    local
btank  atime                 off                    local
btank  devices               on                     default
btank  exec                  on                     default
btank  setuid                on                     default
btank  readonly              off                    default
btank  zoned                 off                    default
btank  snapdir               hidden                 default
btank  aclinherit            passthrough            local
btank  canmount              on                     default
btank  xattr                 sa                     local
btank  copies                1                      default
btank  version               5                      -
btank  utf8only              on                     -
btank  normalization         formD                  -
btank  casesensitivity       mixed                  -
btank  vscan                 off                    default
btank  nbmand                off                    default
btank  sharesmb              off                    default
btank  refquota              none                   default
btank  refreservation        none                   default
btank  primarycache          all                    default
btank  secondarycache        all                    default
btank  usedbysnapshots       0                      -
btank  usedbydataset         171K                   -
btank  usedbychildren        5.01T                  -
btank  usedbyrefreservation  0                      -
btank  logbias               latency                default
btank  dedup                 off                    default
btank  mlslabel              none                   default
btank  sync                  disabled               local
btank  refcompressratio      1.00x                  -
btank  written               171K                   -
btank  logicalused           5.18T                  -
btank  logicalreferenced     40K                    -
btank  filesystem_limit      none                   default
btank  snapshot_limit        none                   default
btank  filesystem_count      none                   default
btank  snapshot_count        none                   default
btank  snapdev               hidden                 default
btank  acltype               posixacl               local
btank  context               none                   default
btank  fscontext             none                   default
btank  defcontext            none                   default
btank  rootcontext           none                   default
btank  relatime              off                    default
btank  redundant_metadata    all                    default
btank  overlay               off                    default

While the issue was observed with multiple datasets, I'll focus on a smaller one. First, the source (some uneventful snapshots omitted):

root@oddity:~# zfs list -o space -r tank/dataz/Backup
NAME               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank/dataz/Backup  8.13T   126G      461K    126G              0          0
root@oddity:~# zfs list -t snapshot -r tank/dataz/Backup
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100      0      -   125G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900       0      -   125G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000    205K      -   125G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100       0      -   126G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900       0      -   126G  -

The initial send to ubackup was performed with the -L and -e flags (not sure if this is relevant). Performing an incremental send with the -L flag then produces the expected result (size differences are due to different pool geometry):

root@oddity:~# zfs send -L -e -I tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 | ssh ubackup "zfs receive btank/oddity/tank/dataz/Backup"

root@ubackup:~# zfs list -o space -r btank/oddity/tank/dataz/Backup
NAME                            AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
btank/oddity/tank/dataz/Backup  16.0T   133G      754K    133G              0          0
root@ubackup:~# zfs list -t snapshot -r btank/oddity/tank/dataz/Backup
NAME                                                                   USED  AVAIL  REFER  MOUNTPOINT
btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100  14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900   14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000    156K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100   14.2K      -   133G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900       0      -   133G  -

But repeating the same send without the -L flag results in corruption:

root@ubackup:~# zfs rollback -r btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100

root@oddity:~# zfs send -e -I tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 | ssh ubackup "zfs receive btank/oddity/tank/dataz/Backup"

root@ubackup:~# zfs list -o space -r btank/oddity/tank/dataz/Backup
NAME                            AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
btank/oddity/tank/dataz/Backup  16.0T   133G     19.0G    114G              0          0
root@ubackup:~# zfs list -t snapshot -r btank/oddity/tank/dataz/Backup
NAME                                                                   USED  AVAIL  REFER  MOUNTPOINT
btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100  14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900   14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000    156K      -   112G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100   14.2K      -   114G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900       0      -   114G  -

Notice how the REFER size drops and the USEDSNAP increases between the two runs. I looked around to find where the data had gone missing and found several random files whose reported sizes on disk were reduced to 512 bytes. Reading these files returns the full original size, but the content is all zeros.

I repeated the whole process multiple times, also recreating the target pool, with the same result. The affected files are always the same, but I haven't found a common characteristic among them. Scrubbing both pools finds no errors. Syslog shows nothing unusual. I have never noticed anything similar before. The only notable change I made recently is that I started using SA-based xattrs in May, but that might be unrelated.

I am happy to provide more info as far as I'm able.

Send/Recv Defect

Most helpful comment

@grahamc thank you for pointing this out. I can completely understand your confusion, we've gone ahead and updated the IRC topic to be more accurate.

There's also some unfortunate confusion surrounding this particular issue which is in fact not related to encryption. This is a send/recv issue which can occur when sending incremental datasets which contain large blocks and inconsistently setting the zfs send -L flag between incremental sends. In fact, raw encrypted send/recvs automatically imply the -L flag and are therefore unaffected.

To circle back to your original question. Yes, we consider the encryption feature to be ready and safe for production use. It's currently being used every day in a variety of environments. That said, there is still work underway to better integrate this feature with systemd and to optimize performance. To the best of our knowledge there are no critical encryption bugs. The open encryption issues are primarily feature requests, performance questions, or relatively minor uncommon issues.

The proposed fixes for this are a little tricky, but the issue can be avoided by consistent use of the zfs send -L flag. I hope that helps clear things up.

All 59 comments

I think I found the unifying characteristic of all corrupted files: All of them are in directories that have been updated via file creation or deletion between snapshots. Which of the files in the directory get corrupted seems to be random, though.

I'll try to put something together. I'm busy for the next week though, so it'll take a while.

I can reproduce the issue consistently with this script, tested on two different machines (0.6.5.6 and 0.6.5.9):

#!/bin/bash

BASE_DS="tank/issue6224"

zfs create -o recordsize=1M -o compression=lz4 $BASE_DS
zfs create $BASE_DS/source
cd /$BASE_DS/source
wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
zfs snapshot $BASE_DS/source@snap1
# initial full send WITH -L (large blocks preserved)
zfs send -L -e $BASE_DS/source@snap1 | zfs receive $BASE_DS/target
cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
zfs snapshot $BASE_DS/source@snap2
# incremental send WITHOUT -L (large blocks split), which triggers the corruption
zfs send -e -i $BASE_DS/source@snap1 $BASE_DS/source@snap2 | zfs receive $BASE_DS/target

After running, check the size on disk of the files in the target dataset using du or ncdu. Note that the first file occupies only 512 bytes on disk and its content is all zeros.

NAME                   AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank/issue6224         8.13T  10.5M         0    205K              0      10.3M
tank/issue6224/source  8.13T  5.10M      136K   4.97M              0          0
tank/issue6224/target  8.13T  5.22M     2.52M   2.70M              0          0
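
To verify which received files were zeroed, a small sketch like the following can be used (it assumes the BASE_DS from the reproducer above; the check simply compares each file against an equal-sized run of zeros):

#!/bin/bash
# Sketch: list files in the received dataset that read back as all zeros.
# BASE_DS matches the reproducer above; adjust as needed.
BASE_DS="tank/issue6224"

for f in /$BASE_DS/target/*; do
    size=$(stat -c %s "$f")      # apparent file size in bytes
    blocks=$(stat -c %b "$f")    # 512-byte blocks actually allocated
    # a zeroed file compares equal to the same number of bytes from /dev/zero
    if [ "$size" -gt 0 ] && head -c "$size" /dev/zero | cmp -s - "$f"; then
        echo "ZEROED: $f (size=$size bytes, blocks=$blocks)"
    fi
done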

@zrav thank you for providing a working reproducer!

I've not had the time to debug this properly yet, but after a cursory reading of the code I think we're seeing the result of the following block in receive_object():

        /*
         * If we are losing blkptrs or changing the block size this must
         * be a new file instance.  We must clear out the previous file
         * contents before we can change this type of metadata in the dnode.
         */
        if (err == 0) {
                int nblkptr;

                nblkptr = deduce_nblkptr(drro->drr_bonustype,
                    drro->drr_bonuslen);

                if (drro->drr_blksz != doi.doi_data_block_size ||
                    nblkptr < doi.doi_nblkptr) {
                        err = dmu_free_long_range(rwa->os, drro->drr_object,
                            0, DMU_OBJECT_END);
                        if (err != 0)
                                return (SET_ERROR(EINVAL));
                }
        }

When we send the incremental stream, the DRR OBJECT record for object = inum($corrupted_file) is sent with a block size that differs from the one already on disk (which was received from the first send with -L), so we "clear out the previous file contents":

First send (blksz = 1M)

[root@centos ~]# zfs send -L $POOLNAME/source@snap1 | zstreamdump -vv | grep "object = `inum /mnt/$POOLNAME/source/file1.dat`"
OBJECT object = 2 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 176
FREE object = 2 offset = 5242880 length = -1
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
FREE object = 2 offset = 5242880 length = 1068498944

Incremental send (blksz = 128K)

[root@centos ~]# zfs send -i $POOLNAME/source@snap1 $POOLNAME/source@snap2 | zstreamdump -vv | grep "object = `inum /mnt/$POOLNAME/source/file1.dat`"
OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 176
FREE object = 2 offset = 5242880 length = -1

The resulting file on disk is:

[root@centos ~]# zdb -ddddddd -bbbbbbb $POOLNAME/target `inum /mnt/$POOLNAME/target/file1.dat`
Dataset testpool/target [ZPL], ID 76, cr_txg 18, 5.03M, 12 objects, rootbp DVA[0]=<0:1527000:200> DVA[1]=<0:1527200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=62L/62P fill=12 cksum=dd1da4093:545ff5a1b20:10a4e0a522bd7:2427c83816b784

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit)
                                               176   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /file1.dat
    uid     0
    gid     0
    atime   Sat Jun 24 07:26:16 2017
    mtime   Sat Jun 12 07:26:16 2018
    ctime   Sat Jun 12 07:26:16 2018
    crtime  Sat Jun 12 07:26:15 2018
    gen 11
    mode    100644
    size    5242880
    parent  34
    links   1
    pflags  40800000004
    xattr   3
Indirect blocks:
               0 L1  HOLE [L1 ZFS plain file] size=20000L birth=43L

The code I was mentioning earlier was integrated with 6c59307 (_Illumos 3693 - restore_object uses at least two transactions to restore an object_), even before f1512ee (_Illumos 5027 - zfs large block support_): if all this is true I think every ZFS version since Illumos 5027 is affected, possibly on other platforms too.
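
(To check which releases contain those commits, something like the following works in a clone of the zfs repository; the abbreviated hashes are the ones quoted above:)

# run inside a zfs git checkout: list the tags containing the two commits
git tag --contains 6c59307    # Illumos 3693 - restore_object uses at least two transactions
git tag --contains f1512ee    # Illumos 5027 - zfs large block support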

This reproduces on current master and Illumos, which is quite troubling. A cross-platform reproducer is here: https://gist.github.com/loli10K/af70fb0f1aae7b174822aa5657e07c28.

ZoL:

root@linux:~# cat /sys/module/zfs/version 
0.7.0-rc4_67_g7e35ea783
root@linux:~# bash -x ./issue-6224.sh 
+ POOLNAME=testpool
+ is_linux
++ uname
+ [[ Linux == \L\i\n\u\x ]]
+ return 0
+ TMPDIR=/var/tmp
+ mountpoint -q /var/tmp
+ zpool destroy testpool
+ fallocate -l 128m /var/tmp/zpool.dat
+ zpool create -O mountpoint=/mnt/testpool testpool /var/tmp/zpool.dat
+ zfs create -o recordsize=1M testpool/source
+ dd if=/dev/urandom of=/mnt/testpool/source/file1.dat bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.552689 s, 9.5 MB/s
+ zfs snapshot testpool/source@snap1
+ zfs send -L testpool/source@snap1
+ cat /var/tmp/full.dat
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ zstreamdump -v
+ grep 'object = 2'
OBJECT object = 2 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 168
FREE object = 2 offset = 5242880 length = -1
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
FREE object = 2 offset = 5242880 length = 1068498944
+ cat /var/tmp/full.dat
+ zfs receive -F testpool/target
+ cp /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat
5.1M    /mnt/testpool/source/file1.dat
5.1M    /mnt/testpool/source/full.dat
5.1M    /mnt/testpool/target/file1.dat
+ zfs snapshot testpool/source@snap2
+ zfs send -i testpool/source@snap1 testpool/source@snap2
+ cat /var/tmp/incr.dat
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 2'
+ zstreamdump -v
OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 168
FREE object = 2 offset = 5242880 length = -1
+ cat /var/tmp/incr.dat
+ zfs receive -F testpool/target
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat /mnt/testpool/target/full.dat
5.1M    /mnt/testpool/source/file1.dat
5.1M    /mnt/testpool/source/full.dat
512 /mnt/testpool/target/file1.dat
5.1M    /mnt/testpool/target/full.dat
++ inum /mnt/testpool/target/file1.dat
++ stat -c %i /mnt/testpool/target/file1.dat
+ zdb -ddddddd -bbbbbbb testpool/target 2
Dataset testpool/target [ZPL], ID 268, cr_txg 16, 5.03M, 8 objects, rootbp DVA[0]=<0:14fa000:200> DVA[1]=<0:14fa200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=50L/50P fill=8 cksum=db27241b5:541d80cfdbd:10b6c0498b32d:2497961710b23a

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit)
                                               168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /file1.dat
    uid     0
    gid     0
    atime   Mon Jun 26 19:38:50 2017
    mtime   Mon Jun 26 19:38:49 2017
    ctime   Mon Jun 26 19:38:49 2017
    crtime  Mon Jun 26 19:38:49 2017
    gen 9
    mode    100644
    size    5242880
    parent  34
    links   1
    pflags  40800000004
Indirect blocks:
               0 L1  HOLE [L1 ZFS plain file] size=20000L birth=35L


root@linux:~# 

Illumos (SmartOS):

[root@52-54-00-d3-7a-01 ~]# uname -v
joyent_20170622T212149Z
[root@52-54-00-d3-7a-01 ~]# bash -x ./issue-6224.sh
+ POOLNAME=testpool
+ is_linux
++ uname
+ [[ SunOS == \L\i\n\u\x ]]
+ return 1
+ TMPDIR=/tmp
+ zpool destroy testpool
+ mkfile 128m /tmp/zpool.dat
+ zpool create -O mountpoint=/mnt/testpool testpool /tmp/zpool.dat
+ zfs create -o recordsize=1M testpool/source
+ dd if=/dev/urandom of=/mnt/testpool/source/file1.dat bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes transferred in 0.078813 secs (66523111 bytes/sec)
+ zfs snapshot testpool/source@snap1
+ zfs send -L testpool/source@snap1
+ cat /tmp/full.dat
+ zstreamdump -v
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 8'
OBJECT object = 8 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 168
FREE object = 8 offset = 5242880 length = -1
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
FREE object = 8 offset = 5242880 length = 1068498944
+ cat /tmp/full.dat
+ zfs receive -F testpool/target
+ cp /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat
 5.0M   /mnt/testpool/source/file1.dat
 5.0M   /mnt/testpool/source/full.dat
 5.0M   /mnt/testpool/target/file1.dat
+ zfs snapshot testpool/source@snap2
+ zfs send -i testpool/source@snap1 testpool/source@snap2
+ cat /tmp/incr.dat
+ zstreamdump -v
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 8'
OBJECT object = 8 type = 19 bonustype = 44 blksz = 131072 bonuslen = 168
FREE object = 8 offset = 5242880 length = -1
+ cat /tmp/incr.dat
+ zfs receive -F testpool/target
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat /mnt/testpool/target/full.dat
 5.0M   /mnt/testpool/source/file1.dat
 5.0M   /mnt/testpool/source/full.dat
   0K   /mnt/testpool/target/file1.dat
 5.0M   /mnt/testpool/target/full.dat
++ inum /mnt/testpool/target/file1.dat
++ stat -c %i /mnt/testpool/target/file1.dat
+ zdb -ddddddd -bbbbbbb testpool/target 8
Dataset testpool/target [ZPL], ID 51, cr_txg 15, 5.03M, 9 objects, rootbp DVA[0]=<0:14f8200:200> DVA[1]=<0:14f8400:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=49L/49P fill=9 cksum=b47a0b8ce:44a5957f452:d88c20f8e067:1d7addab9d69db

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         8    2   128K   128K      0   128K    0.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    /file1.dat
        uid     0
        gid     0
        atime   Mon Jun 26 17:40:50 2017
        mtime   Mon Jun 26 17:40:50 2017
        ctime   Mon Jun 26 17:40:50 2017
        crtime  Mon Jun 26 17:40:50 2017
        gen     8
        mode    100644
        size    5242880
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  HOLE [L1 ZFS plain file] size=20000L birth=34L


[root@52-54-00-d3-7a-01 ~]# 

FWIW, although it says above that there is a cross-platform reproducer that works on Illumos, @lundman notes that it hasn't made it to the Slack and wonders if @ahrens has seen this.

(Reproduces in openzfsonosx too, and annoyingly I have a few affected files in backups that thankfully I haven't had to restore from.)

@behlendorf any particular reason this wasn't revisited?

Available developer resources. I'll add it to the 0.8 milestone so we don't lose track of it.

Just for reference, a similar issue was reported.

@loli10K that issue is related but different. I would guess the problem here is with turning -L on/off between incrementals. The incremental will report the drr_blksz in the OBJECT record as different from what is on the receiving system, so the recv assumes that the object must have been freed and reallocated, because the block size can't change on the sending system (if there are multiple blocks).

The solution might be a bit tricky, but I think it will have to involve disallowing some kinds of zfs receive.

@pcd1193182 or @tcaputi also have some experience with zfs send/recv.
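
For reference, the mismatch described above can be checked on any pair of systems by comparing the blksz carried in the incremental stream's OBJECT record against the block size of the copy already on the receiver, e.g. (dataset, snapshot and file names are illustrative):

# blksz carried in the incremental stream for the objects it touches
zfs send -i pool/src@snap1 pool/src@snap2 | zstreamdump -v | grep OBJECT
# block size of the already-received copy on the target system
stat -c 'IO Block: %o' /recvpool/tgt/file1.dat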

We are having the same kind of issue in https://github.com/openzfs/openzfs/pull/705... I don't have a solution right at this moment (other than simply forcing zfs or the user to use -L if the filesystem uses large blocks). Both of these issues stem from the fact that we are trying to infer whether an object was deleted from a few properties of the object, but this would be much more solid if we could somehow know for sure that this happened. I'll try to look into it more today.

I just tried this on an encrypted dataset with raw sending. Since I use legacy mountpoint I altered the script to:

#!/usr/bin/env bash
set -x

BASE_DS="tankSubi/encZFS/issue6224"
MNT="/tmp/test-6224"

zfs create -o recordsize=1M -o compression=lz4 $BASE_DS
zfs create $BASE_DS/source
mkdir -p $MNT
mount -t zfs $BASE_DS/source $MNT
cd $MNT
wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
zfs snapshot $BASE_DS/source@snap1
zfs send -L -e --raw $BASE_DS/source@snap1 | zfs receive $BASE_DS/target
cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
zfs snapshot $BASE_DS/source@snap2
zfs send -e --raw -i $BASE_DS/source@snap1 $BASE_DS/source@snap2 | zfs receive $BASE_DS/target
zfs list -t snapshot -r $BASE_DS

and it produced:

root@subi:/tmp# ./6224
+ BASE_DS=tankSubi/encZFS/issue6224
+ MNT=/tmp/test-6224
+ zfs create -o recordsize=1M -o compression=lz4 tankSubi/encZFS/issue6224
+ zfs create tankSubi/encZFS/issue6224/source
+ mkdir -p /tmp/test-6224
+ mount -t zfs tankSubi/encZFS/issue6224/source /tmp/test-6224
+ cd /tmp/test-6224
+ wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
--2018-11-07 20:13:24--  https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
Resolving github.com (github.com)... 140.82.118.4, 140.82.118.3
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/437011/79cf5bd6-5109-11e7-8255-038691a19c2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20181107%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20181107T191325Z&X-Amz-Expires=300&X-Amz-Signature=47b1a953908e5f4436159c482f08d2c65c2d417ae94afab36bea78134d24f80c&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dzfs-0.6.5.10.tar.gz&response-content-type=application%2Foctet-stream [following]
--2018-11-07 20:13:25--  https://github-production-release-asset-2e65be.s3.amazonaws.com/437011/79cf5bd6-5109-11e7-8255-038691a19c2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20181107%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20181107T191325Z&X-Amz-Expires=300&X-Amz-Signature=47b1a953908e5f4436159c482f08d2c65c2d417ae94afab36bea78134d24f80c&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dzfs-0.6.5.10.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)... 52.216.134.19                                                                                                
Connecting to github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)|52.216.134.19|:443... connected.                                                                            
HTTP request sent, awaiting response... 200 OK
Length: 2597676 (2.5M) [application/octet-stream]
Saving to: ‘zfs-0.6.5.10.tar.gz’

zfs-0.6.5.10.tar.gz                                         100%[========================================================================================================================================>]   2.48M  3.26MB/s    in 0.8s    

2018-11-07 20:13:26 (3.26 MB/s) - ‘zfs-0.6.5.10.tar.gz’ saved [2597676/2597676]

+ zfs snapshot tankSubi/encZFS/issue6224/source@snap1
+ zfs send -L -e --raw tankSubi/encZFS/issue6224/source@snap1
+ zfs receive tankSubi/encZFS/issue6224/target
+ cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
+ zfs snapshot tankSubi/encZFS/issue6224/source@snap2
+ zfs receive tankSubi/encZFS/issue6224/target
+ zfs send -e --raw -i tankSubi/encZFS/issue6224/source@snap1 tankSubi/encZFS/issue6224/source@snap2
+ zfs list -t snapshot -r tankSubi/encZFS/issue6224
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
tankSubi/encZFS/issue6224/source@snap1   136K      -  2.69M  -
tankSubi/encZFS/issue6224/source@snap2     0B      -  5.21M  -
tankSubi/encZFS/issue6224/target@snap1   120K      -  2.67M  -
tankSubi/encZFS/issue6224/target@snap2     0B      -  5.20M  -

Also, after mounting, it seems to be fine:

root@subi:/tmp# mkdir snp2
root@subi:/tmp# zfs load-key tankSubi/encZFS/issue6224/target
Enter passphrase for 'tankSubi/encZFS/issue6224/target': 
root@subi:/tmp# mount -t zfs tankSubi/encZFS/issue6224/target /tmp/snp2
root@subi:/tmp# cd snp2/
root@subi:/tmp/snp2# ls -al
total 5114
drwxr-xr-x  2 root root       4 Nov  7 20:13 .
drwxrwxrwt 24 root root     860 Nov  7 20:17 ..
-rw-r--r--  1 root root 2597676 Jun 14  2017 zfs-0.6.5.10.tar.gz
-rw-r--r--  1 root root 2597676 Nov  7 20:13 zfs-0.6.5.10.tar.gz.1

It seems that when using raw send on an encrypted dataset, this bug is not triggered.

Right, because --raw sends the data exactly as it is on disk. This implies not changing the block size, which is also what -L does.

Raw sends imply -L for large-block datasets. Encrypted data cannot be split into smaller blocks.

ok, thanks for the explanation.

@behlendorf Isn't this resolved by #8668?

@jwittlincohen I just tested 0.8.0-rc5 with my little reproducer from above, and unfortunately the problem isn't fixed. The difference is that now the zeroed files are shown with their correct size instead of 512 bytes.

Will this data corruption bug linger forever?

@scineram I guess time will tell. Would you like to provide input on prioritizing this bug, by describing the impact it has for you?

@ahrens Any user unaware of this issue could potentially experience data loss. I don't think we should have to provide arguments why a data corruption bug should be fixed, especially in ZFS.

@zrav We all agree it should be fixed. Adding your experience helps to motivate others to do that work.

I could imagine a solution which makes it an error to toggle -L, like the following:

  1. When receiving a non-incremental stream with DMU_BACKUP_FEATURE_LARGE_BLOCKS, always activate SPA_FEATURE_LARGE_BLOCKS for this new dataset.
  2. When receiving an incremental stream, if the stream's DMU_BACKUP_FEATURE_LARGE_BLOCKS does not match the dataset's SPA_FEATURE_LARGE_BLOCKS, fail the receive.

However, this wouldn't exactly work for pre-existing datasets which might not have SPA_FEATURE_LARGE_BLOCKS activated even though they were received from a DMU_BACKUP_FEATURE_LARGE_BLOCKS stream (if they didn't happen to actually have any large blocks). We could fix this by adding a new flag to the dataset to indicate that SPA_FEATURE_LARGE_BLOCKS will really be set (and if the new flag is not set, we can't do the check when receiving an incremental).

An alternate solution would be to make it an error to send a dataset that has large blocks present without also specifying -L. This is the cleanest solution since we could remove the questionable "split_large_blocks" code that's causing this bug. But if you've already done a receive that split large blocks, you'd essentially be forced to toggle -L on the next incremental to it, potentially triggering this bug.

@ahrens My experience is that I got lucky and narrowly avoided losing both personal and company data.
The last alternative solution you propose is clean, but has the downside of closing the only route we currently have to convert a large block dataset down to regular block sizes.
Another option I see is having large block datasets be sent with -L by default (making -L redundant), and introducing another send flag to reduce the blocks to non-large if desired. That wouldn't fix the bug on the receive side, but it would at least stop data corruption from happening because a parameter was forgotten.

I still wonder what the technical reason is that leads to zfs send/recv being unable to change the block size, effectively causing this bug.

Why can't we specify to have _this stream_ recv'd into _this dataset_ with recordsize=x and have the data written as if the operation were made from userland using standard filesystem tools (like dd and truncate), meaning existing files are written with the recordsize they already have while new files are created (and written) with the recordsize currently set for the dataset (or an override specified on recv, so a full stream can be modified in this regard)?

The WRITE and FREE records in the stream address data using object ID, offset, and length.
Shouldn't it be _completely irrelevant_ what on-disk layout (or recordsize) the target dataset has?

@ahrens What am I missing?

Before this is solved, can someone mention it and its mitigations on https://github.com/zfsonlinux/zfs/wiki/FAQ ?

I'm a bit confused about the state of encryption. This isn't a gripe, just a question: does ZoL consider encryption production ready? I only ask because the topic of Freenode's #zfsonlinux (last updated in Sep 2019) says no: "Native encryption is not production ready, see #6224."

Since it is a single-# channel on Freenode, it is fairly safe to assume it is an official channel with an official topic... but I get the impression from the activity on this ticket that it isn't considered a show-stopper for encryption being "production ready".

I'm asking because when people ask me about ZFS + Encryption, I point them to the topic of #zfsonlinux. If that isn't the thing to do, I'd like to stop :). Again, I have no gripe or annoyance -- I'm just looking for some clarification on what the team thinks.

@grahamc thank you for pointing this out. I can completely understand your confusion, we've gone ahead and updated the IRC topic to be more accurate.

There's also some unfortunate confusion surrounding this particular issue which is in fact not related to encryption. This is a send/recv issue which can occur when sending incremental datasets which contain large blocks and inconsistently setting the zfs send -L flag between incremental sends. In fact, raw encrypted send/recvs automatically imply the -L flag and are therefore unaffected.

To circle back to your original question. Yes, we consider the encryption feature to be ready and safe for production use. It's currently being used every day in a variety of environments. That said, there is still work underway to better integrate this feature with systemd and to optimize performance. To the best of our knowledge there are no critical encryption bugs. The open encryption issues are primarily feature requests, performance questions, or relatively minor uncommon issues.

The proposed fixes for this are a little tricky, but the issue can be avoided by consistent use of the zfs send -L flag. I hope that helps clear things up.
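
In practice, "consistent use" means that once a replication chain has been started with -L, every subsequent incremental must also pass it (and likewise, a chain started without -L must stay without it). A sketch with illustrative pool, dataset and host names:

# initial full send and every later incremental use the same flags
zfs send -L -e pool/data@snap1 | ssh backuphost "zfs receive backuppool/data"
zfs send -L -e -i pool/data@snap1 pool/data@snap2 | ssh backuphost "zfs receive backuppool/data"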

The proposed fixes for this are a little tricky, but the issue can be avoided by consistent use of the zfs send -L flag. I hope that helps clear things up.

Could that make its way to the FAQ? It would be discovered there by new users faster than in this issue.

I suspect I have experienced this same bug on Illumos, but without using '-L' on zfs send. In my case, the sending system has 128k recordsize and the receiving system has 1M recordsize. When replicating the other direction the problem does not happen.

The problem doesn't reproduce as easily, but over the course of about 1 month, I had 4.7 million files get replaced by NULLs on the receiving zfs folder. Thank god the snapshot copies were good.

ZFS scrub doesn't detect any problems on either end of the replication. Live files will be different.

Files don't have to be created or modified on the sending system to be affected in my case.

@lschweiss-wustl If you are not changing the -L setting, I don't see how this would affect you. The target system's recordsize setting shouldn't have anything to do with it. You can verify that your files have the same actual block size on the source and target systems (both should be 128K). You can do this with stat <file> (look at IO Block: output), or zdb.

@lschweiss-wustl If you are not changing the -L setting, I don't see how this would affect you. The target system's recordsize setting shouldn't have anything to do with it. You can verify that your files have the same actual block size on the source and target systems (both should be 128K). You can do this with stat <file> (look at IO Block: output), or zdb.

I still can't say if this is the same bug, but it definitely appears to be a zfs send/receive bug when block sizes are not the same, also resulting in NULL-filled files.

The block size apparently changed on the receiving end when the file was corrupted.

Walking through an example file that was corrupted this way:

On the receiving pool at the time of corruption, recordsize=1M. The content is all nulls. The block count and size changed.

# head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29

# stat /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070          Blocks: 1          IO Block: 131072 regular file
Device: 4350019h/70582297d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

In a snapshot on the same pool the content is correct:

# head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080

# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070          Blocks: 35         IO Block: 281088 regular file
Device: 435011fh/70582559d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

On the recordsize=128K pool, the file never changed at the time the receiving pool corrupted the file:

I'm not sure this is the same bug; however, it certainly appears to be a zfs send/receive bug when block sizes are not the same at each end.

# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070          Blocks: 117        IO Block: 131072 regular file
Device: 43509b5h/70584757d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-08_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-08_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070          Blocks: 117        IO Block: 131072 regular file
Device: 43509b6h/70584758d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

2019-11-07 is a significant date in this dataset's history. On that day the active pool was switched and the replication reversed. There were millions of files corrupted between 2019-11-07 and 2019-12-17 when the replication was flipped again.

I'll be happy to file this as a different Illumos bug if there is good evidence this is truly a different bug.

@mailinglists35 added to FAQ, https://github.com/zfsonlinux/zfs/wiki/FAQ#sending-large-blocks

Would it make sense to change the default behavior to include -L?

Because we keep having people every so often report getting burned by this, astonished that ZFS mangled their data silently, and this doesn't seem to be getting fixed anytime soon.

@ahrens @behlendorf
Added to the FAQ that likely _no one_ memorizes back to front?

Come on guys, silent data corruption in the replication/backup mechanism of a filesystem: it can't get _much_ worse, can it? Why isn't this being treated as a showstopper?

What solution do you propose, @GregorKopka?

Not Gregor, but -L by default seems like a reasonable mitigation until the underlying bug is fixed, to me - a --no-L for people who know what they're doing might be needed, but IMO leaving the unsafe option as a default when there's known silent corruption if you use it in certain circumstances seems pretty bad. (See also: defaulting send_holes_without_birth_time to 1 versus 0.)

@rincebrain The current default is safe (even if the implementation is a bit clunky). The issue is with switching between --no-L and -L (or vice versa). So switching the default without any other mitigating changes seems like a very bad idea, since it would cause uses that previously were fine (no flags) to now hit this.

Oh, I see, I had misunderstood this and thought it was more general than that, from e.g. @lschweiss-wustl's report, though reading more, I guess that would warrant a distinct bug.

I'd still like a way to have a safe default so I can't accidentally forget a flag one time and silently mangle my data, but it seems like it might not be worth it to implement such a parameter, since most people are using management tools and can just flip the default sets of flags.

It'd be nice if this made the list for e.g. OpenZFS 2.0, since it's been around since the 0.6.5.x days.

What solution do you propose, @GregorKopka?

@ahrens One that is free of data loss. I'm still clueless as to why zfs cares about on-disk block sizes on send/recv; would you be so kind as to answer https://github.com/openzfs/zfs/issues/6224#issuecomment-548631814 ? That question somehow stands between me and a solution.

@GregorKopka well, for starters, it would involve reimplementing large chunks of the receive logic, and the performance would probably be substantially worse. And it would be pretty messy when interacting with encryption and raw sends. It's not really compatible with the design of redaction, either. It would remove large amounts of the benefits of compressed send/recv. Those are just off the top of my head.

How about rejecting the stream immediately if the dataset uses large blocks but the send was made without -L? Would that be difficult to implement?

This bug is terrible for ZFS' reputation.

The FAQ says "When sending incremental streams which contain large blocks (>128K) the --large-block flag must be specified." I use 1MB block size everywhere, so the FAQ makes me think I should just go add -L to all my "zfs sync" jobs, but wouldn't that actually cause me to be flipping -L off/on and cause corruption?

What happens when using 'zfs send -t'? According to the docs the -L option is not available there; will it just produce corrupted datasets if the original, interrupted send had -L?

zfs send [-Penv] -t receive_resume_token Creates a send stream which resumes an interrupted receive. The receive_resume_token is the value of this property on the filesystem or volume that was being received into. See the documentation for zfs receive -s for more details.

I've tested that 'zfs send -L -t' works, so maybe this is just a documentation issue.

If using -L is the right thing to do all the time, even if one has dataset syncs that started without -L, then it sounds like the right thing to do here is to add -L by default on zfs send all the time, or at least when the sender's zfs recordsize is over 128K.
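
As a quick sanity check before deciding on flags, one can look at whether the sender actually uses large blocks at all (pool and dataset names are illustrative):

# a recordsize above 128K on the source dataset means -L matters
zfs get -H -o value recordsize pool/data
# "active" means large blocks have been written somewhere on the pool
zpool get -H -o value feature@large_blocks pool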

@scineram I think that would work, as a partial solution.

If the dataset does not have large blocks, but the stream does, we don't currently have a way of knowing whether that's because the earlier send split the large blocks (in which case we should reject this receive), or the earlier snapshot didn't happen to have any large blocks in it (in which case we should accept it).

See my earlier comment https://github.com/openzfs/zfs/issues/6224#issuecomment-548445813 for more thoughts on this.

@gerardba

What happens when using 'zfs send -t'

Unfortunately, it looks like zfs send -Lt will add the large-block flag even if the token doesn't have it set, thus allowing you to toggle -L and hit this bug.

wouldn't that actually cause me to be flipping -l Off/On and cause corruption?

Yes. One difficulty of dealing with this issue is that the current default is not the recommended way to do sends of filesystems with large blocks, but switching to the recommended way can hit this bug.
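
To make the failure mode concrete, here is a sketch with made-up dataset names; the point is only that the -L setting has to stay the same across every send in an incremental chain:

```bash
# Unsafe: the incremental toggles -L relative to the initial full send.
zfs send tank/data@snap1              | ssh backup zfs recv backup/data   # no -L
zfs send -L -i @snap1 tank/data@snap2 | ssh backup zfs recv backup/data   # -L added: files can be zeroed

# Safe: pick one setting and keep it for the whole chain.
zfs send -L tank/data@snap1           | ssh backup zfs recv backup/data
zfs send -L -i @snap1 tank/data@snap2 | ssh backup zfs recv backup/data
```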

It seems like there is a lot of duplication of ideas and discussion going on here and elsewhere, so I'm going to attempt to consolidate and summarize all the various suggestions that have been made, and highlight the problems with each of them. This will hopefully help people understand why this issue hasn't been resolved yet: it's not just a matter of implementing "[a solution] that is free of data loss"; none of the proposed solutions avoids data loss in all cases for all users.

  1. Make -L the default behavior...
    Users with existing large-block datasets who haven't been using -L are suddenly forcibly exposed to the bug. This can be mitigated by applying solution #2, but not totally, as @ahrens broke down; it forces users with existing large-block datasets to re-send from scratch. This is especially pernicious for the backup use case, where the user may not realize that their sends can no longer be received until the source material is gone. Option 1b helps mitigate this downside, but see its entry.
    a. Without a --no-large-blocks flag
    This approach has the downside of removing the only way we currently have to de-large-blockify datasets.
    b. With a --no-large-blocks flag / with zstream split-blocks
    This solution reopens the bug window through incorrect usage of the --no-large-blocks flag or the userland utility. It does require a more explicit step to cause yourself grief, but it by no means absolutely prevents it. Of the two, I would prefer the latter, since it would prevent the backup use case from being griefed when a user has mistakenly not passed --no-large-blocks.
  2. Reject sends that don't match the large-block-having state of the receiving system
    This solution works well moving forwards, but for systems that already have some received state, this solution does little to address their issues. Again, see @ahrens' previous comment.
  3. Reimplement zfs recv to not be block based
    This suggestion, in addition to being a substantial developer/QA workload, has numerous substantial downsides. See this comment for some of them; I'm sure others can think of more.
  4. Improve/perfect detection of whether an object needs to be reallocated, independent of large blocks
    To me, this is the ideal solution, but it is not at all simple to achieve. The bug we are faced with ultimately stems from the heuristics that attempt to determine if the object has been reallocated between the previous snapshot and this one. Because these heuristics are impossible to make perfect with the current set of available data, the only solution is to add a creation_txg or generation counter to the dnode_phys_t. This would give us perfect information about reallocation for datasets created or upgraded after this change lands, but would do nothing for other users. The heuristics would have to remain, and we would still need to come up with solutions for users with existing data, and users who can't/don't want to upgrade their pools to use the new dnode structure. This solution also needs an appropriate helping of dev and testing effort to ensure correctness in all the various edge cases that may arise.

Personally, I would advocate that we immediately move forwards with #4; it is the only perfect long-term solution to this problem and others like it that we may encounter later on. The sooner we get the feature into zfs, the more data will have the creation_txg in it.

For all existing data, I think the best behavior is probably a combination of 1b and 2: protecting existing users with large-block datasets who have used -L and users who have small-block datasets, while requiring large-block users who have not used -L thus far to either restart their sends or deliberately use a zstream split-blocks feature. Hopefully having to use a separate command line utility is sufficiently abnormal for the zfs send workflow that it will give users pause and prevent them from encountering this bug, especially if adequate explanation is placed in the man pages/usage messages for the various commands. Unfortunately, my guess is that there are a lot of users in the large-block-without-dash-L camp, and this solution will inconvenience all of them. I don't know how to fix that.
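
For what it's worth, the workflow being proposed might look roughly like the sketch below; note that zstream split-blocks is only a subcommand suggested in this thread, not something that exists today.

```bash
# Hypothetical: with proposal 1b, large blocks would be sent by default, and a
# user who must keep feeding an existing non-large-block receive chain would
# have to opt in explicitly by rewriting the stream in userland.
zfs send -i @snap1 tank/data@snap2 \
    | zstream split-blocks \
    | ssh backup zfs recv backup/data
```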

If there are solutions that people proposed or like that I did not address, feel free to let me know and I will attempt to keep this up to date. If you see a way around one of the issues, or a new solution, again let me know and I will try to add/address it. Hopefully this helps people understand the current state of this bug and our attempts to come up with an ideal solution.

Thanks for the great write-up, @pcd1193182!

I realized there's a simpler way to do #4 (Improve/perfect detection of whether an object needs to be reallocated) which allows us to implement #2 perfectly (reject receives that toggle -L).

The ZPL already has a per-file generation number (unique for each actual file, even if they have the same object number), which it stores in the bonus buffer, which is included in the OBJECT record. We can use the generation number to tell if a file's block size changed due to toggling -L, and reject the receive in that case. Using the ZPL's generation number is a slight layering violation (send/receive operates at the DMU level and otherwise doesn't know about object contents / ZPL state), but I think that using it as an error-detection mechanism is fair, especially compared to the alternatives.
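
As a rough illustration (made-up dataset name and object number; the exact output varies by version), that generation number is visible in a verbose zdb dump of a file's object, since it is stored in the znode's bonus buffer:

```bash
# Dump object 123 of tank/data in detail; the znode fields printed from the
# bonus buffer should include a "gen" line with the file's generation number.
zdb -dddd tank/data 123
```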

I think we can also orthogonally consider 1b (make -L the default), once we have the safety of the above solution, which I'm working on implementing.

@ahrens Do we want to reject the receive in this case? Or do we just want to correctly reallocate the dnode (as the heuristic should have done)?

After discussion with @tcaputi, I think I have a way to correctly process send streams that change to have -L, and reject send streams that change to not have -L. That would give us a better path forward towards having -L be the default (or only) option in a future release (likely after 2.0).

@pcd1193182 Thank you for taking the time to compile the options.

As far as I understand the on-disk format and the code, the root of the issue is in how objects are stored in object sets:

Each object within an object set is uniquely identified by a 64 bit integer called an object number. An object's “object number” identifies the array element, in the dnode array, containing this object's dnode_phys_t.

which means that removing objects from the set leaves gaps in the dnode array. Those gaps are then (I presume) greedily reused by ZFS to avoid growing that array unless absolutely needed, so an "object number" is ambiguous in a zfs send stream, which led to the decision to have zfs recv detect recycling of an object number by looking for a changed record size.
And finally, toggling the zfs send -L option wrongly triggers that heuristic.

Did I understand that correctly so far?

If yes: do we _really need_ that check for reuse of object numbers? Shouldn't the deadlist, when sending the snapshot, take care of removing all stale data (that belonged to the removed object) from the destination?

Apart from that...

I still think that ZFS send/recv should have the ability to convert datasets from one record-/volblocksize to another. Machines get retired, and pools get transferred to new and different hardware as technology advances. If ZFS is meant to be the _last word in filesystems_... shouldn't it have the ability to remap the block size of the data in existing datasets (without sacrificing the snapshot history), to benefit from the destination system's hardware as much as possible?

Wouldn't it make sense to be able to instruct a backup system using shingled drives to store data in 8MB records (just to pull a number from thin air) that align with the native write stripe size of the drives, regardless of the block size of the source? And in the case of a transfer of such a backup to another system where a database daemon is to access and modify the dataset, being able to recv that dataset with the DB's native block size of (say) 16KiB?

@GregorKopka

Did I understand that correctly so far?

That's right.

Do we really need that check for reuse of object numbers?

We need it so that the new object is set up correctly before we start getting WRITE records, e.g. with the correct block size and dnode size (number of "slots").
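
For illustration (made-up dataset name; the column layout differs between versions), the per-object state that has to be correct before WRITE records arrive can be inspected with zdb:

```bash
# zdb -dd lists the dataset's dnode array, one row per object number, including
# the data block size (dblk) and, on builds with large dnodes, the dnode size;
# these are the fields the receive side must set up before applying WRITEs.
zdb -dd tank/data
```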

I still think that ZFS send/recv should have the ability to convert datasets from one record-/volblocksize to another.

I hear that you would like us to implement that new feature. It's a reasonable thing to want, but not straightforward to implement. Right now I am prioritizing this data-loss bug.

I agree with your priorities.

My question is: will the ZPL's generation number be deterministic enough to completely solve this issue, _or_ will it just trade this issue for a more convoluted and obscure variation?

Sorry to hijack this but I think I've been bitten by this bug on a replication between two zfs pools (running FreeNAS). Is there _any_ recovery path once corruption is established or is the only option to destroy the replicated pool and start from scratch?

Also, pardon the necessarily stupid question, but how come a zeroed file doesn't trigger any checksum error upon reading? I was under the impression that zfs send/recv carries all the checksum information necessary to confirm that what's written on the recv end exactly matches what was sent; am I mistaken?

Thanks

@f00b4r0 I'm not aware of any recovery path.

The send stream is itself checksummed, but it doesn't include checksums of every block of every file.

@f00b4r0 I'm not aware of any recovery path.

sigh, thanks. Not what I was hoping for, but not entirely unexpected.

The send stream is itself checksummed, but it doesn't include checksums of every block of every file.

This sounds like a design flaw; such per-block checksums would have immediately exposed this bug, or am I missing something?

@f00b4r0 If the receiving filesystem has snapshots from before the corruption, the data should still be in those snapshots. The trick is locating all the null-filled files and a clean snapshot. I recovered millions of files from my snapshot history with some scripting work.

It is best if you have a clean copy that wasn't affected by this bug and redo the replicated copy.
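
For anyone facing the same cleanup, a rough sketch of that kind of recovery, assuming the received dataset is mounted and a snapshot from before the corruption still exists (the mountpoint and snapshot name below are placeholders):

```bash
#!/bin/bash
# Restore files that are entirely NUL-filled from a known-good snapshot.
# /backup/data and "clean-snap" are placeholders; adjust to your layout.
mnt=/backup/data
snap=clean-snap

find "$mnt" -type f -size +0c -print0 |
while IFS= read -r -d '' f; do
    # If stripping NUL bytes leaves nothing, the whole file is zeroed.
    # (A file containing only newlines would also match; treat this as a start.)
    if [ -z "$(tr -d '\0' < "$f" | head -c 1)" ]; then
        rel=${f#"$mnt"/}
        cp -a "$mnt/.zfs/snapshot/$snap/$rel" "$f"
    fi
done
```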

@lschweiss-wustl thanks. It looks like I'll have to start from scratch. Then again that bug bites so hard, I'm not sure I can ever trust send/recv again. At least rsync doesn't silently wipe data ;P

Sorry if I'm asking something obvious, but I have a question: sending a dataset from a pool with large_blocks enabled, with 1M-recordsize datasets, to a pool without large_blocks enabled and without ever using -L, am I right in saying that I would not trigger the bug? What about the inverse scenario (sending from a 128K-recordsize dataset/pool to a large_blocks-enabled pool)?

In other words: if I never used -L I am safe, right?

Thanks.

@shodanshok

sending a dataset from a pool with large_blocks enabled, with 1M-recordsize datasets, to a pool without large_blocks enabled and without ever using -L, am I right in saying that I would not trigger the bug?

That's right. The bug is about changing the -L setting in an incremental send.

What about the inverse scenario (sending from a 128K-recordsize dataset/pool to a large_blocks-enabled pool)?

In this scenario you don't have any large blocks, so it doesn't matter. You can even toggle -L, because it's a no-op if you don't have large blocks.

if I never used -L I am safe, right?

Correct.
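
One quick way to check whether large blocks are even in play for your datasets (and therefore whether the -L question matters for you at all); note that blocks written while recordsize was above 128K stay large even if the property is later lowered:

```bash
# Is the pool feature enabled/active, and which datasets use a big recordsize?
zpool get feature@large_blocks tank
zfs get -r recordsize tank
```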
