ZFS: Unlistable and disappearing files

Created on 6 Apr 2018 · 108 comments · Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | Scientific Linux
Distribution Version | 6.8
Linux Kernel | 2.6.32-696.23.1.el6.x86_64
Architecture | x86_64
ZFS Version | 0.7.7
SPL Version | 0.7.7

Describe the problem you're observing

Data loss when copying a directory with a large-ish number of files. For example, cp -r SRC DST with 10000 files in SRC is likely to result in a couple of "cp: cannot create regular file `DST/XXX': No space left on device" error messages, and a few thousand files missing from the listing of the DST directory. (Needless to say, the filesystem being full is not the actual problem.)

The missing files are missing in the sense that they don't appear in the directory listing, but can be accessed using their name (except for the couple of files for which cp generated "No space left on device" error). For example:

# ls -l DST | grep FOO | wc -l
0
# ls -l DST/FOO
-rw-r--r-- 1 root root 5 Apr  6 14:59 DST/FOO

The contents of DST/FOO are accessible by path (e.g. cat DST/FOO works) and are the same as SRC/FOO. If caches are dropped (echo 3 > /proc/sys/vm/drop_caches) or the machine is rebooted, opening FOO directly by path fails.
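
A quick way to scan for such phantom entries in bulk (a hedged sketch; assumes file names without whitespace, as in the reproduction below):

# ls DST > /tmp/dst-listing
# for n in $(ls SRC); do [ -e "DST/$n" ] && ! grep -qx -- "$n" /tmp/dst-listing && echo "phantom: DST/$n"; done

Each reported name still resolves by path but is absent from the directory listing; after a cache drop, the same names stop resolving at all.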

ls -ld DST reports N fewer hard links than SRC, where N is the number of files for which cp reported "No space left on device" error.

Names of missing files are mostly predictable if SRC is small.

Scrub does not find any errors.

I think the problem appeared in 0.7.7, but I am not sure.

Describe how to reproduce the problem

# mkdir SRC
# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
# cp -r SRC DST
cp: cannot create regular file `DST/8442': No space left on device
cp: cannot create regular file `DST/2629': No space left on device
# ls -l
total 3107
drwxr-xr-x 2 root root 10000 Apr  6 15:28 DST
drwxr-xr-x 2 root root 10002 Apr  6 15:27 SRC
# find DST -type f | wc -l 
8186
# ls -l DST | grep 8445 | wc -l
0
# ls -l DST/8445
-rw-r--r-- 1 root root 5 Apr  6 15:28 DST/8445
# cat DST/8445
8445
# echo 3 > /proc/sys/vm/drop_caches
# cat DST/8445
cat: DST/8445: No such file or directory
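
To enumerate exactly which names vanished from the listing, one hedged approach (requires bash for process substitution; assumes the plain numeric names above) is to diff the two listings:

# comm -23 <(ls SRC | sort) <(ls DST | sort) | wc -l

This counts entries present in SRC's listing but absent from DST's; dropping the final wc -l prints the missing names themselves.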

Include any warning/errors/backtraces from the system logs

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 87h47m with 0 errors on Sat Mar 31 07:09:27 2018
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x5000c50085ac4c0f  ONLINE       0     0     0
            wwn-0x5000c50085acda77  ONLINE       0     0     0
            wwn-0x5000c500858db3d7  ONLINE       0     0     0
            wwn-0x5000c50085ac9887  ONLINE       0     0     0
            wwn-0x5000c50085aca6df  ONLINE       0     0     0
          raidz1-1                  ONLINE       0     0     0
            wwn-0x5000c500858db743  ONLINE       0     0     0
            wwn-0x5000c500858db347  ONLINE       0     0     0
            wwn-0x5000c500858db4a7  ONLINE       0     0     0
            wwn-0x5000c500858dbb0f  ONLINE       0     0     0
            wwn-0x5000c50085acaa97  ONLINE       0     0     0
          raidz1-2                  ONLINE       0     0     0
            wwn-0x5000c50085accb4b  ONLINE       0     0     0
            wwn-0x5000c50085acab9f  ONLINE       0     0     0
            wwn-0x5000c50085ace783  ONLINE       0     0     0
            wwn-0x5000c500858db67b  ONLINE       0     0     0
            wwn-0x5000c50085acb983  ONLINE       0     0     0
          raidz1-3                  ONLINE       0     0     0
            wwn-0x5000c50085ac4fd7  ONLINE       0     0     0
            wwn-0x5000c50085acb24b  ONLINE       0     0     0
            wwn-0x5000c50085ace13b  ONLINE       0     0     0
            wwn-0x5000c500858db43f  ONLINE       0     0     0
            wwn-0x5000c500858db61b  ONLINE       0     0     0
          raidz1-4                  ONLINE       0     0     0
            wwn-0x5000c500858dbbb7  ONLINE       0     0     0
            wwn-0x5000c50085acce7f  ONLINE       0     0     0
            wwn-0x5000c50085acd693  ONLINE       0     0     0
            wwn-0x5000c50085ac3d87  ONLINE       0     0     0
            wwn-0x5000c50085acc89b  ONLINE       0     0     0
          raidz1-5                  ONLINE       0     0     0
            wwn-0x5000c500858db28b  ONLINE       0     0     0
            wwn-0x5000c500858db68f  ONLINE       0     0     0
            wwn-0x5000c500858dbadf  ONLINE       0     0     0
            wwn-0x5000c500858db623  ONLINE       0     0     0
            wwn-0x5000c500858db48b  ONLINE       0     0     0
          raidz1-6                  ONLINE       0     0     0
            wwn-0x5000c500858db6ef  ONLINE       0     0     0
            wwn-0x5000c500858db39b  ONLINE       0     0     0
            wwn-0x5000c500858db47f  ONLINE       0     0     0
            wwn-0x5000c500858dbb23  ONLINE       0     0     0
            wwn-0x5000c500858db803  ONLINE       0     0     0
        logs
          zfs-slog                  ONLINE       0     0     0
        spares
          wwn-0x5000c500858db463    AVAIL   

errors: No known data errors
# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   254T   159T  94.3T         -    27%    62%  1.00x  ONLINE  -




# zfs list -t all
NAME           USED  AVAIL  REFER  MOUNTPOINT
tank           127T  69.0T  11.5T  /mnt/tank
tank/jade      661G  69.0T   661G  /mnt/tank/jade
tank/simprod   115T  14.8T   115T  /mnt/tank/simprod




# zfs get all tank
NAME  PROPERTY              VALUE                  SOURCE
tank  type                  filesystem             -
tank  creation              Sat Jan 20 12:11 2018  -
tank  used                  127T                   -
tank  available             68.9T                  -
tank  referenced            11.6T                  -
tank  compressratio         1.00x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            128K                   default
tank  mountpoint            /mnt/tank              local
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           off                    default
tank  atime                 off                    local
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  zoned                 off                    default
tank  snapdir               hidden                 default
tank  aclinherit            restricted             default
tank  createtxg             1                      -
tank  canmount              on                     default
tank  xattr                 sa                     local
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       sensitive              -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  guid                  2271746520743372128    -
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       0B                     -
tank  usedbydataset         11.6T                  -
tank  usedbychildren        116T                   -
tank  usedbyrefreservation  0B                     -
tank  logbias               latency                default
tank  dedup                 off                    default
tank  mlslabel              none                   default
tank  sync                  standard               default
tank  dnodesize             legacy                 default
tank  refcompressratio      1.00x                  -
tank  written               11.6T                  -
tank  logicalused           128T                   -
tank  logicalreferenced     11.6T                  -
tank  volmode               default                default
tank  filesystem_limit      none                   default
tank  snapshot_limit        none                   default
tank  filesystem_count      none                   default
tank  snapshot_count        none                   default
tank  snapdev               hidden                 default
tank  acltype               off                    default
tank  context               none                   default
tank  fscontext             none                   default
tank  defcontext            none                   default
tank  rootcontext           none                   default
tank  relatime              off                    default
tank  redundant_metadata    all                    default
tank  overlay               off                    default




# zpool get all tank   
NAME  PROPERTY                       VALUE                          SOURCE
tank  size                           254T                           -
tank  capacity                       62%                            -
tank  altroot                        -                              default
tank  health                         ONLINE                         -
tank  guid                           7056741522691970971            -
tank  version                        -                              default
tank  bootfs                         -                              default
tank  delegation                     on                             default
tank  autoreplace                    on                             local
tank  cachefile                      -                              default
tank  failmode                       wait                           default
tank  listsnapshots                  off                            default
tank  autoexpand                     off                            default
tank  dedupditto                     0                              default
tank  dedupratio                     1.00x                          -
tank  free                           94.2T                          -
tank  allocated                      160T                           -
tank  readonly                       off                            -
tank  ashift                         0                              default
tank  comment                        -                              default
tank  expandsize                     -                              -
tank  freeing                        0                              -
tank  fragmentation                  27%                            -
tank  leaked                         0                              -
tank  multihost                      off                            default
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
Label: Regression

Most helpful comment

Thank you everyone for your help with this unfortunate regression. As described above by @tuxoko, the root cause of this issue is understood and a complete fix is currently being worked on. In the meanwhile, commit cc63068e95ee725cce03b1b7ce50179825a6cda5, which introduced this issue, will shortly be reverted from the master and release branches, and v0.7.8 will be tagged. We'll open a new PR with the full fix for review and feedback when it's ready.

All 108 comments

I can confirm the same behavior on a minimal CentOS 7.4 installation (running inside VirtualBox) with the latest ZFS 0.7.7. Please note that it does not happen when copying somewhat bigger files (ie: kernel source), so it looks like a race condition...

; the only changed property was xattr=sa
[root@localhost ~]# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.94G  25.3M  7.91G         -     0%     0%  1.00x  ONLINE  -
[root@localhost ~]# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank       24.5M  7.67G  3.36M  /tank
tank/test  21.0M  7.67G  21.0M  /tank/test

; creating the source dir on an XFS filesystem
[root@localhost ~]# cd /root/
[root@localhost ~]# mkdir test
[root@localhost ~]# cd test
[root@localhost test]# mkdir SRC
[root@localhost test]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done

; copying from XFS to ZFS: no problem at all
[root@localhost ~]# cd /tank/test
[root@localhost test]# cp -r /root/test/SRC/ DST1
[root@localhost test]# cp -r /root/test/SRC/ DST2
[root@localhost test]# cp -r /root/test/SRC/ DST3
[root@localhost test]# find DST1/ | wc -l
10001
[root@localhost test]# find DST2/ | wc -l
10001
[root@localhost test]# find DST3/ | wc -l
10001

; copying from the ZFS dataset itself: big trouble!
[root@localhost test]# rm -rf SRC DST1 DST2 DST3
[root@localhost test]# cp -r /root/test/SRC .
[root@localhost test]# cp -r SRC DST1
cp: cannot create regular file ‘DST1/8809’: No space left on device
[root@localhost test]# cp -r SRC DST2
[root@localhost test]# cp -r SRC DST3
cp: cannot create regular file ‘DST3/6507’: No space left on device
[root@localhost test]# find DST1/ | wc -l
10000
[root@localhost test]# find DST2/ | wc -l
10001
[root@localhost test]# find DST3/ | wc -l
8189

; disabling cache: nothing changes (we continue to "lose" files)
[root@localhost test]# zfs set primarycache=none tank
[root@localhost test]# zfs set primarycache=none tank/test
[root@localhost test]# echo 3 > /proc/sys/vm/drop_caches
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001

The problem does NOT appear on ZoL 0.7.6:

; creating the dataset and copying the SRC dir
[root@localhost ~]# zfs create tank/test
[root@localhost ~]# zfs set xattr=sa tank
[root@localhost ~]# zfs set xattr=sa tank/test
[root@localhost ~]# cp -r /root/test/SRC/ /tank/test/
[root@localhost ~]# cd /tank/test/
[root@localhost test]# find SRC/ | wc -l
10001

; more copies
[root@localhost test]# cp -r SRC/ DST
[root@localhost test]# cp -r SRC/ DST1
[root@localhost test]# cp -r SRC/ DST2
[root@localhost test]# cp -r SRC/ DST3
[root@localhost test]# cp -r SRC/ DST4
[root@localhost test]# cp -r SRC/ DST5
[root@localhost test]# find DST | wc -l
10001
[root@localhost test]# find DST1 | wc -l
10001
[root@localhost test]# find DST2 | wc -l
10001
[root@localhost test]# find DST3 | wc -l
10001
[root@localhost test]# find DST4 | wc -l
10001
[root@localhost test]# find DST5 | wc -l
10001

Maybe this can help. Here is the output of zdb -dddddddd tank/test 192784 (a "good" DST directory):

Dataset tank/test [ZPL], ID 74, cr_txg 13, 26.5M, 190021 objects, rootbp DVA[0]=<0:5289e00:200> DVA[1]=<0:65289e00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=123L/123P fill=190021 cksum=d622b78d2:50c053a50d0:fca8cd4455d7:2216d160ee7f7d

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    192784    2   128K    16K   909K     512  1.02M  100.00  ZFS directory (K=inherit) (Z=inherit)
                                               272   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 64
        path    /DST16
        uid     0
        gid     0
        atime   Sat Apr  7 01:11:29 2018
        mtime   Sat Apr  7 01:11:31 2018
        ctime   Sat Apr  7 01:11:31 2018
        crtime  Sat Apr  7 01:11:29 2018
        gen     97
        mode    40755
        size    10002
        parent  34
        links   2
        pflags  40800000144
        SA xattrs: 96 bytes, 1 entries

                security.selinux = unconfined_u:object_r:unlabeled_t:s0\000
        Fat ZAP stats:
                Pointer table:
                        1024 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 10
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 10000
                Leaf blocks: 64
                Total blocks: 65
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x13c18a19
                Leafs with 2^n pointers:
                          4:     64 ****************************************
                Blocks with n*5 entries:
                          9:     64 ****************************************
                Blocks n/10 full:
                          6:      4 ****
                          7:     43 ****************************************
                          8:     16 ***************
                          9:      1 *
                Entries with n chunks:
                          3:  10000 ****************************************
                Buckets with n entries:
                          0:  24119 ****************************************
                          1:   7414 *************
                          2:   1126 **
                          3:    102 *
                          4:      7 *

... and zdb -dddddddd tank/test 202785 (a "bad" DST directory):

Dataset tank/test [ZPL], ID 74, cr_txg 13, 26.5M, 190021 objects, rootbp DVA[0]=<0:5289e00:200> DVA[1]=<0:65289e00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=123L/123P fill=190021 cksum=d622b78d2:50c053a50d0:fca8cd4455d7:2216d160ee7f7d

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    202785    2   128K    16K   766K     512   896K  100.00  ZFS directory (K=inherit) (Z=inherit)
                                               272   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 55
        path    /DST17
        uid     0
        gid     0
        atime   Sat Apr  7 01:12:49 2018
        mtime   Sat Apr  7 01:11:33 2018
        ctime   Sat Apr  7 01:11:33 2018
        crtime  Sat Apr  7 01:11:32 2018
        gen     98
        mode    40755
        size    10001
        parent  34
        links   2
        pflags  40800000144
        SA xattrs: 96 bytes, 1 entries

                security.selinux = unconfined_u:object_r:unlabeled_t:s0\000
        Fat ZAP stats:
                Pointer table:
                        1024 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 10
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 8259
                Leaf blocks: 55
                Total blocks: 56
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x1bf8e8a3
                Leafs with 2^n pointers:
                          4:     50 ****************************************
                          5:      3 ***
                          6:      2 **
                Blocks with n*5 entries:
                          9:     55 ****************************************
                Blocks n/10 full:
                          5:      6 ******
                          6:      7 *******
                          7:     32 ********************************
                          8:      6 ******
                          9:      4 ****
                Entries with n chunks:
                          3:   8259 ****************************************
                Buckets with n entries:
                          0:  20964 ****************************************
                          1:   6217 ************
                          2:    904 **
                          3:     66 *
                          4:      9 *

We are also seeing similar behavior since the install of 0.7.7

I have a hand-built ZoL 0.7.7 on a stock Ubuntu 16.04 server (currently with Ubuntu kernel version '4.4.0-109-generic') and I can't reproduce this problem on it, following the reproduction here and some variants (eg using 'seq -w' to make all of the filenames the same size). The pool I'm testing against has a single mirrored vdev.
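
For reference, the 'seq -w' variant mentioned above pads all names to the same width; a hedged version of that test (directory names illustrative) looks like:

$ mkdir SRC-W
$ for i in $(seq -w 1 10000); do echo $i > SRC-W/$i ; done
$ cp -r SRC-W DST-W && find DST-W -type f | wc -l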

One more data point, with the hope that it helps narrow down the issue.

I cannot reproduce the issue on the few machines I have here, neither with 10k files, nor with 100k or even 1M. They all have very similar configurations. They use a single 2-drive mirrored vdev. The drives are Samsung SSD 950 PRO 512GB (NVMe, quite fast).

$ uname -a
Linux pat 4.9.90-gentoo #1 SMP PREEMPT Tue Mar 27 00:19:59 CEST 2018 x86_64 Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz GenuineIntel GNU/Linux

$ qlist -I -v zfs-kmod
sys-fs/zfs-kmod-0.7.7

$ qlist -I -v spl
sys-kernel/spl-0.7.7

$ zpool status
  pool: pat:pool
 state: ONLINE
  scan: scrub repaired 0B in 0h1m with 0 errors on Sat Apr  7 03:35:12 2018
config:

        NAME                                                 STATE     READ WRITE CKSUM
        pat:pool                                             ONLINE       0     0     0
          mirror-0                                           ONLINE       0     0     0
            nvme0n1p4                                        ONLINE       0     0     0
            nvme1n1p4                                        ONLINE       0     0     0
        spares
          ata-Samsung_SSD_850_EVO_1TB_S2RFNXAH118721D-part8  AVAIL   

errors: No known data errors

$ zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pat:pool   408G   110G   298G         -    18%    26%  1.00x  ONLINE  -

$ zpool get all pat:pool
NAME      PROPERTY                       VALUE                          SOURCE
pat:pool  size                           408G                           -
pat:pool  capacity                       26%                            -
pat:pool  altroot                        -                              default
pat:pool  health                         ONLINE                         -
pat:pool  guid                           16472389984482033769           -
pat:pool  version                        -                              default
pat:pool  bootfs                         -                              default
pat:pool  delegation                     on                             default
pat:pool  autoreplace                    on                             local
pat:pool  cachefile                      -                              default
pat:pool  failmode                       wait                           default
pat:pool  listsnapshots                  off                            default
pat:pool  autoexpand                     off                            default
pat:pool  dedupditto                     0                              default
pat:pool  dedupratio                     1.00x                          -
pat:pool  free                           298G                           -
pat:pool  allocated                      110G                           -
pat:pool  readonly                       off                            -
pat:pool  ashift                         12                             local
pat:pool  comment                        -                              default
pat:pool  expandsize                     -                              -
pat:pool  freeing                        0                              -
pat:pool  fragmentation                  18%                            -
pat:pool  leaked                         0                              -
pat:pool  multihost                      off                            default
pat:pool  feature@async_destroy          enabled                        local
pat:pool  feature@empty_bpobj            active                         local
pat:pool  feature@lz4_compress           active                         local
pat:pool  feature@multi_vdev_crash_dump  enabled                        local
pat:pool  feature@spacemap_histogram     active                         local
pat:pool  feature@enabled_txg            active                         local
pat:pool  feature@hole_birth             active                         local
pat:pool  feature@extensible_dataset     active                         local
pat:pool  feature@embedded_data          active                         local
pat:pool  feature@bookmarks              enabled                        local
pat:pool  feature@filesystem_limits      enabled                        local
pat:pool  feature@large_blocks           enabled                        local
pat:pool  feature@large_dnode            enabled                        local
pat:pool  feature@sha512                 enabled                        local
pat:pool  feature@skein                  enabled                        local
pat:pool  feature@edonr                  enabled                        local
pat:pool  feature@userobj_accounting     active                         local

$ zfs list
NAME                                          USED  AVAIL  REFER  MOUNTPOINT
(...)
pat:pool/home/joe/tmp                        27.9G   285G  27.9G  /home/joe/tmp
(...)

$ zfs get all pat:pool/home/joe/tmp
NAME                   PROPERTY               VALUE                  SOURCE
pat:pool/home/joe/tmp  type                   filesystem             -
pat:pool/home/joe/tmp  creation               Sat Mar 12 17:32 2016  -
pat:pool/home/joe/tmp  used                   27.9G                  -
pat:pool/home/joe/tmp  available              285G                   -
pat:pool/home/joe/tmp  referenced             27.9G                  -
pat:pool/home/joe/tmp  compressratio          1.16x                  -
pat:pool/home/joe/tmp  mounted                yes                    -
pat:pool/home/joe/tmp  quota                  none                   default
pat:pool/home/joe/tmp  reservation            none                   default
pat:pool/home/joe/tmp  recordsize             128K                   default
pat:pool/home/joe/tmp  mountpoint             /home/joe/tmp          inherited from pat:pool/home
pat:pool/home/joe/tmp  sharenfs               off                    default
pat:pool/home/joe/tmp  checksum               on                     default
pat:pool/home/joe/tmp  compression            lz4                    inherited from pat:pool
pat:pool/home/joe/tmp  atime                  off                    inherited from pat:pool
pat:pool/home/joe/tmp  devices                on                     default
pat:pool/home/joe/tmp  exec                   on                     default
pat:pool/home/joe/tmp  setuid                 on                     default
pat:pool/home/joe/tmp  readonly               off                    default
pat:pool/home/joe/tmp  zoned                  off                    default
pat:pool/home/joe/tmp  snapdir                hidden                 default
pat:pool/home/joe/tmp  aclinherit             restricted             default
pat:pool/home/joe/tmp  createtxg              507                    -
pat:pool/home/joe/tmp  canmount               on                     default
pat:pool/home/joe/tmp  xattr                  sa                     inherited from pat:pool
pat:pool/home/joe/tmp  copies                 1                      default
pat:pool/home/joe/tmp  version                5                      -
pat:pool/home/joe/tmp  utf8only               off                    -
pat:pool/home/joe/tmp  normalization          none                   -
pat:pool/home/joe/tmp  casesensitivity        sensitive              -
pat:pool/home/joe/tmp  vscan                  off                    default
pat:pool/home/joe/tmp  nbmand                 off                    default
pat:pool/home/joe/tmp  sharesmb               off                    default
pat:pool/home/joe/tmp  refquota               none                   default
pat:pool/home/joe/tmp  refreservation         none                   default
pat:pool/home/joe/tmp  guid                   10274125767907263189   -
pat:pool/home/joe/tmp  primarycache           all                    default
pat:pool/home/joe/tmp  secondarycache         all                    default
pat:pool/home/joe/tmp  usedbysnapshots        0B                     -
pat:pool/home/joe/tmp  usedbydataset          27.9G                  -
pat:pool/home/joe/tmp  usedbychildren         0B                     -
pat:pool/home/joe/tmp  usedbyrefreservation   0B                     -
pat:pool/home/joe/tmp  logbias                latency                default
pat:pool/home/joe/tmp  dedup                  off                    default
pat:pool/home/joe/tmp  mlslabel               none                   default
pat:pool/home/joe/tmp  sync                   standard               default
pat:pool/home/joe/tmp  dnodesize              legacy                 default
pat:pool/home/joe/tmp  refcompressratio       1.16x                  -
pat:pool/home/joe/tmp  written                27.9G                  -
pat:pool/home/joe/tmp  logicalused            31.6G                  -
pat:pool/home/joe/tmp  logicalreferenced      31.6G                  -
pat:pool/home/joe/tmp  volmode                default                default
pat:pool/home/joe/tmp  filesystem_limit       none                   default
pat:pool/home/joe/tmp  snapshot_limit         none                   default
pat:pool/home/joe/tmp  filesystem_count       none                   default
pat:pool/home/joe/tmp  snapshot_count         none                   default
pat:pool/home/joe/tmp  snapdev                hidden                 default
pat:pool/home/joe/tmp  acltype                posixacl               inherited from pat:pool
pat:pool/home/joe/tmp  context                none                   default
pat:pool/home/joe/tmp  fscontext              none                   default
pat:pool/home/joe/tmp  defcontext             none                   default
pat:pool/home/joe/tmp  rootcontext            none                   default
pat:pool/home/joe/tmp  relatime               off                    default
pat:pool/home/joe/tmp  redundant_metadata     all                    default
pat:pool/home/joe/tmp  overlay                off                    default
pat:pool/home/joe/tmp  net.c-space:snapshots  keep=1M                inherited from pat:pool/home/joe
pat:pool/home/joe/tmp  net.c-space:root       0                      inherited from pat:pool

I get a worse situation on the latest CentOS 7 with kmod:

[root@zirconia test]# mkdir SRC
[root@zirconia test]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
[root@zirconia test]# cp -r SRC DST
cp: cannot create regular file ‘DST/5269’: No space left on device
cp: cannot create regular file ‘DST/9923’: No space left on device
[root@zirconia test]# cat DST/5269
cat: DST/5269: No such file or directory
[root@zirconia test]# cat DST/9923
cat: DST/9923: No such file or directory
[root@zirconia test]# cat DST/9924
9924
[root@zirconia test]# cat DST/9923
cat: DST/9923: No such file or directory
[root@zirconia test]# ls -l DST/9923
ls: cannot access DST/9923: No such file or directory

[root@zirconia test]# zpool status
pool: storage
state: ONLINE
scan: none requested
config:

NAME                                            STATE     READ WRITE CKSUM
storage                                         ONLINE       0     0     0
  raidz1-0                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30KPM0D  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJDDD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJAHD  ONLINE       0     0     0
  raidz1-1                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NGXDD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJ91D  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30LN7GD  ONLINE       0     0     0
  raidz1-2                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJM5D  ONLINE       0     0     0
    ata-HGST_HUS724020ALA640_PN2134P5GAY9PX     ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJD5D  ONLINE       0     0     0
  raidz1-3                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJD8D  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJHVD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30K5PMD  ONLINE       0     0     0
  raidz1-4                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NLZLD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30MVW4D  ONLINE       0     0     0
    ata-HGST_HUS724020ALA640_PN2134P5GBBL9X     ONLINE       0     0     0
logs
  mirror-5                                      ONLINE       0     0     0
    nvme0n1p1                                   ONLINE       0     0     0
    nvme1n1p1                                   ONLINE       0     0     0
cache
  nvme0n1p2                                     ONLINE       0     0     0
  nvme1n1p2                                     ONLINE       0     0     0


@rblank Did you use empty files? Please try the following:

  • cd into your ZFS dataset
  • execute mkdir SRC; for i in $(seq 1 10000); do echo -n > SRC/$i; done; find SRC | wc -l
  • now issue for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done

Thanks.

I used the exact commands from the OP (which create non-empty files), only changing 10000 to 100000 and 1000000. But for completeness, I tried yours as well.

$ mkdir SRC; for i in $(seq 1 10000); do echo -n > SRC/$i; done; find SRC | wc -l
10001
$ for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done
10001
10001
10001
10001
10001
10001
10001
10001
10001
10001

The few data points above weakly hint at raidz, since no one was able to reproduce on mirrors so far.

On one of my datasets this works fine; on another it exhibits the problem. Both datasets belong to the same pool.

bash-4.2$ mkdir SRC
bash-4.2$ for i in $(seq 1 10000); do echo $i > SRC/$i ; done
bash-4.2$ cp -r SRC DST
cp: cannot create regular file ‘DST/222’: No space left on device
cp: cannot create regular file ‘DST/6950’: No space left on device

On beast/engineering the above commands run without issue. On beast/dataio they fail.

bash-4.2$ zfs get all beast/engineering
NAME               PROPERTY               VALUE                  SOURCE
beast/engineering  type                   filesystem             -
beast/engineering  creation               Sun Nov  5 17:53 2017  -
beast/engineering  used                   1.85T                  -
beast/engineering  available              12.0T                  -
beast/engineering  referenced             1.85T                  -
beast/engineering  compressratio          1.04x                  -
beast/engineering  mounted                yes                    -
beast/engineering  quota                  none                   default
beast/engineering  reservation            none                   default
beast/engineering  recordsize             1M                     inherited from beast
beast/engineering  mountpoint             /beast/engineering     default
beast/engineering  sharenfs               on                     inherited from beast
beast/engineering  checksum               on                     default
beast/engineering  compression            lz4                    inherited from beast
beast/engineering  atime                  off                    inherited from beast
beast/engineering  devices                on                     default
beast/engineering  exec                   on                     default
beast/engineering  setuid                 on                     default
beast/engineering  readonly               off                    default
beast/engineering  zoned                  off                    default
beast/engineering  snapdir                hidden                 default
beast/engineering  aclinherit             restricted             default
beast/engineering  createtxg              20615173               -
beast/engineering  canmount               on                     default
beast/engineering  xattr                  sa                     inherited from beast
beast/engineering  copies                 1                      default
beast/engineering  version                5                      -
beast/engineering  utf8only               off                    -
beast/engineering  normalization          none                   -
beast/engineering  casesensitivity        sensitive              -
beast/engineering  vscan                  off                    default
beast/engineering  nbmand                 off                    default
beast/engineering  sharesmb               off                    inherited from beast
beast/engineering  refquota               none                   default
beast/engineering  refreservation         none                   default
beast/engineering  guid                   18311947624891459017   -
beast/engineering  primarycache           metadata               local
beast/engineering  secondarycache         all                    default
beast/engineering  usedbysnapshots        151M                   -
beast/engineering  usedbydataset          1.85T                  -
beast/engineering  usedbychildren         0B                     -
beast/engineering  usedbyrefreservation   0B                     -
beast/engineering  logbias                latency                default
beast/engineering  dedup                  off                    default
beast/engineering  mlslabel               none                   default
beast/engineering  sync                   disabled               inherited from beast
beast/engineering  dnodesize              auto                   inherited from beast
beast/engineering  refcompressratio       1.04x                  -
beast/engineering  written                0                      -
beast/engineering  logicalused            1.92T                  -
beast/engineering  logicalreferenced      1.92T                  -
beast/engineering  volmode                default                default
beast/engineering  filesystem_limit       none                   default
beast/engineering  snapshot_limit         none                   default
beast/engineering  filesystem_count       none                   default
beast/engineering  snapshot_count         none                   default
beast/engineering  snapdev                hidden                 default
beast/engineering  acltype                posixacl               inherited from beast
beast/engineering  context                none                   default
beast/engineering  fscontext              none                   default
beast/engineering  defcontext             none                   default
beast/engineering  rootcontext            none                   default
beast/engineering  relatime               off                    default
beast/engineering  redundant_metadata     all                    default
beast/engineering  overlay                off                    default
beast/engineering  com.sun:auto-snapshot  true                   inherited from beast
bash-4.2$ zfs get all beast/dataio
NAME          PROPERTY               VALUE                  SOURCE
beast/dataio  type                   filesystem             -
beast/dataio  creation               Fri Oct 13 11:13 2017  -
beast/dataio  used                   45.0T                  -
beast/dataio  available              12.0T                  -
beast/dataio  referenced             45.0T                  -
beast/dataio  compressratio          1.09x                  -
beast/dataio  mounted                yes                    -
beast/dataio  quota                  none                   default
beast/dataio  reservation            none                   default
beast/dataio  recordsize             1M                     inherited from beast
beast/dataio  mountpoint             /beast/dataio          default
beast/dataio  sharenfs               on                     inherited from beast
beast/dataio  checksum               on                     default
beast/dataio  compression            lz4                    inherited from beast
beast/dataio  atime                  off                    inherited from beast
beast/dataio  devices                on                     default
beast/dataio  exec                   on                     default
beast/dataio  setuid                 on                     default
beast/dataio  readonly               off                    default
beast/dataio  zoned                  off                    default
beast/dataio  snapdir                hidden                 default
beast/dataio  aclinherit             restricted             default
beast/dataio  createtxg              19156147               -
beast/dataio  canmount               on                     default
beast/dataio  xattr                  sa                     inherited from beast
beast/dataio  copies                 1                      default
beast/dataio  version                5                      -
beast/dataio  utf8only               off                    -
beast/dataio  normalization          none                   -
beast/dataio  casesensitivity        sensitive              -
beast/dataio  vscan                  off                    default
beast/dataio  nbmand                 off                    default
beast/dataio  sharesmb               off                    inherited from beast
beast/dataio  refquota               none                   default
beast/dataio  refreservation         none                   default
beast/dataio  guid                   7216940837685529084    -
beast/dataio  primarycache           all                    default
beast/dataio  secondarycache         all                    default
beast/dataio  usedbysnapshots        0B                     -
beast/dataio  usedbydataset          45.0T                  -
beast/dataio  usedbychildren         0B                     -
beast/dataio  usedbyrefreservation   0B                     -
beast/dataio  logbias                latency                default
beast/dataio  dedup                  off                    default
beast/dataio  mlslabel               none                   default
beast/dataio  sync                   disabled               inherited from beast
beast/dataio  dnodesize              auto                   inherited from beast
beast/dataio  refcompressratio       1.09x                  -
beast/dataio  written                45.0T                  -
beast/dataio  logicalused            49.3T                  -
beast/dataio  logicalreferenced      49.3T                  -
beast/dataio  volmode                default                default
beast/dataio  filesystem_limit       none                   default
beast/dataio  snapshot_limit         none                   default
beast/dataio  filesystem_count       none                   default
beast/dataio  snapshot_count         none                   default
beast/dataio  snapdev                hidden                 default
beast/dataio  acltype                posixacl               inherited from beast
beast/dataio  context                none                   default
beast/dataio  fscontext              none                   default
beast/dataio  defcontext             none                   default
beast/dataio  rootcontext            none                   default
beast/dataio  relatime               off                    default
beast/dataio  redundant_metadata     all                    default
beast/dataio  overlay                off                    default
beast/dataio  com.sun:auto-snapshot  false                  local

I think the issue is related to primarycache=all. If I set a dataset to primarycache=metadata, there are no errors.
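
For anyone wanting to test the same theory, the property can be inspected and changed per dataset before re-running the copy (a hedged sketch; the dataset name is illustrative):

# zfs get primarycache tank/test
# zfs set primarycache=metadata tank/test
# cp -r SRC DST-meta && find DST-meta | wc -l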

@rblank I replicated the issue with a simple, single-vdev pool. I'll try a mirror and report back, anyway.

@alatteri What pool/vdev layout do you use? Can you show zpool status on both machines? I tried with primarycache=none and it failed, albeit with much lower frequency (ie: it failed after the 5th copy). I'll try with primarycache=metadata.

Same machine, different datasets on the same pool.

beast: /nfs/beast/home/alan % zpool status
  pool: beast
 state: ONLINE
  scan: scrub canceled on Fri Mar  2 16:47:01 2018
config:

    NAME                                   STATE     READ WRITE CKSUM
    beast                                  ONLINE       0     0     0
      raidz2-0                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHN5M1X  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHN5NPX  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHNP9BX  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHN6M4Y  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHNPBLX  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHKY7PX  ONLINE       0     0     0
      raidz2-1                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG1G8SL  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG1BVVL  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG13K0L  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG1GA9L  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG1G9YL  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG6D9ZS  ONLINE       0     0     0
      raidz2-2                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG68U3S  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG2WW7S  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHMHVGY  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHKRYUX  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NAHKXMKX  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCG5ZYKS  ONLINE       0     0     0
      raidz2-3                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGSM01S  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGSY9HS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGTHJUS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGTKV1S  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGTMN4S  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGTGTLS  ONLINE       0     0     0
      raidz2-4                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGTKUWS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGTG3YS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGTLYZS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGSZ2GS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGSV93S  ONLINE       0     0     0
        ata-HGST_HDN726060ALE610_NCGT04NS  ONLINE       0     0     0
      raidz2-5                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HHZGSB  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1GTE6HD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1GU06VD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1GS5KNF  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_NCHA3DZS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_NCHAE5JS  ONLINE       0     0     0
      raidz2-6                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HJ21DB  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_NCH9WUXS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_NCHAXNTS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_NCHA0DLS  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HJG72B  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HHX19B  ONLINE       0     0     0
    cache
      nvme0n1                              ONLINE       0     0     0

errors: No known data errors

  pool: pimplepaste
 state: ONLINE
  scan: scrub repaired 0B in 2h38m with 0 errors on Mon Mar 19 00:17:45 2018
config:

    NAME                                   STATE     READ WRITE CKSUM
    pimplepaste                            ONLINE       0     0     0
      raidz2-0                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVHTBD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVHVSD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVHT1D  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HUYA5D  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVDPMD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZAZDD  ONLINE       0     0     0
      raidz2-1                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVATKD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZB0ND  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HY6LYD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JT32KD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVAGVD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZBL5D  ONLINE       0     0     0
      raidz2-2                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HWZ1AD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZAYJD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZ8YMD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVDN8D  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZAKPD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HWZ2ZD  ONLINE       0     0     0
      raidz2-3                             ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZAX7D  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVHD8D  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVG6ND  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HW7VBD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1HZBHMD  ONLINE       0     0     0
        ata-HGST_HDN726060ALE614_K1JVB2SD  ONLINE       0     0     0

errors: No known data errors

@vbrik what's the HW config of this system - how much RAM, what model of x86_64 CPU?

I can confirm this bug on a mirrored zpool. It is a production system so I didn't do much testing before downgrading to 0.7.6:

pool: ssdzfs-array
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable. [it is at the 0.6.5.11 features level]
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0h16m with 0 errors on Sun Apr  1 01:46:59 2018
config:

    NAME                                     STATE     READ WRITE CKSUM
    ssdzfs-array                             ONLINE       0     0     0
      mirror-0                               ONLINE       0     0     0
        ata-XXXX-enc  ONLINE       0     0     0
        ata-YYYY-enc  ONLINE       0     0     0
      mirror-1                               ONLINE       0     0     0
        ata-ZZZZ-enc  ONLINE       0     0     0
        ata-QQQQ-enc  ONLINE       0     0     0

errors: No known data errors
$zfs create ssdzfs-array/tmp
$(run test as previously described; fails about 1/2 the time)
$uname -a
Linux MASKED 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I have attempted to reproduce the bug on 0.7.6 without success. Here is an excerpt of one of the processors' feature levels:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
stepping    : 5
microcode   : 0x19
cpu MHz     : 1600.000
cache size  : 8192 KB
physical id : 0
siblings    : 4
core id     : 3
cpu cores   : 4
apicid      : 6
initial apicid  : 6
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid dtherm ida
bogomips    : 5333.51
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
[    1.121288] microcode: CPU3 sig=0x106a5, pf=0x2, revision=0x19

I still get it with primarycache=metadata, on the first attempt to cp:
[root@zirconia ~]# zfs set primarycache=metadata storage/rhev
[root@zirconia ~]# cd /storage/rhev/
[root@zirconia rhev]# ls
export  test
[root@zirconia rhev]# cd test/
[root@zirconia test]# rm -rf DST
[root@zirconia test]# rm -rf SRC/*
[root@zirconia test]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
[root@zirconia test]# cp -r SRC DST
cp: cannot create regular file ‘DST/5269’: No space left on device
cp: cannot create regular file ‘DST/3759’: No space left on device

For those that have upgraded to the 0.7.7 branch - is it advisable to downgrade back to 0.7.6 until this regression is resolved?

What is the procedure to downgrade ZFS on CentOS 7.4?

For reverts, I usually do:

$ yum history  (identify transaction that installed 0.7.7 over 0.7.6; yum history info XXX can be used to confirm)
$ yum history undo XXX (where XXX is the transaction number identified in the previous step)

Note that with dkms installs, after reverts, I usually find I need to:

$ dkms remove zfs/0.7.6 -k `uname -r`
$ dkms remove spl/0.7.6 -k `uname -r`
$ dkms install spl/0.7.6 -k `uname -r` --force
$ dkms install zfs/0.7.6 -k `uname -r` --force

To make sure all modules are actually happy and loadable on reboot.
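
After the reboot, a quick sanity check that the expected module version is actually loaded (a hedged sketch):

$ cat /sys/module/zfs/version
$ dkms status | grep -E 'spl|zfs'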

Is this seen with rsync instead of cp?
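
For anyone who wants to check, a hedged rsync variant of the reproducer (untested against this bug; DST-rsync is an illustrative name) would be:

# rsync -a SRC/ DST-rsync/
# find DST-rsync -type f | wc -l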

I'm not able to reproduce this, and I have several machines (Debian unstable; 0.7.7, Linux 4.15). Can people also include uname -srvmo? Maybe the kernel version is playing a role?

Linux 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) x86_64 GNU/Linux

Ok, I've done some more tests.
System is CentOS 7.4 x86-64 with latest available kernel:

  • single vdev pool: reproduced
  • mirrored pool: reproduced
  • kmod and dkms: reproduced
  • compiled from source [1]: reproduced
  • compression lz4 and off: reproduced
  • primary cache all, metadata and none: reproduced

On an Ubuntu Server 16.04 LTS with compiled 0.7.7 spl+zfs (so not using the repository version), I cannot reproduce the error. As a side note, compiling on Ubuntu does not give any warnings.

So, the problem seems confined to CentOS/RHEL territory. To me, it looks like a timing/race problem (possibly related to the ARC): anything which increases copy time lowers the error probability/frequency. Some examples of actions which lower the failure rate:

  • cp -a (it copies file attributes)
  • disabling the cache
  • copying from a SRC on another filesystem (e.g. the root XFS). Note: this seems to completely avoid the problem.

[1] compilation gives the following warnings:

/usr/src/zfs-0.7.7/module/zcommon/zfs_fletcher_avx512.o: warning: objtool: fletcher_4_avx512f_byteswap()+0x4e: can't find jump dest instruction at .text+0x171
/usr/src/zfs-0.7.7/module/zfs/vdev_raidz_math_avx512f.o: warning: objtool: mul_x2_2()+0x24: can't find jump dest instruction at .text+0x39
/usr/src/zfs-0.7.7/module/zfs/vdev_raidz_math_avx512bw.o: warning: objtool: raidz_zero_abd_cb()+0x33: can't find jump dest instruction at .text+0x3d

@shodanshok I'm sorry, I'm having a lot of trouble tracking this piece of information down. What Linux kernel version is CentOS 7.4 on? I assume this is with kernel-3.10.0-693.21.1.el7.x86_64.

Is anyone experiencing this issue with "recent" mainline kernels (like 4.x)?

Greetings,
I have mirrors with the same problem.
Scientific Linux 7.4 (fully updated)
zfs-0.7.7 from zfsonlinux.org repos

$ uname -srvmo
Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 13:12:24 CST 2018 x86_64 GNU/Linux

The output of my yum install:

Running transaction
  Installing : kernel-devel-3.10.0-693.21.1.el7.x86_64                                                                         1/10 
  Installing : kernel-headers-3.10.0-693.21.1.el7.x86_64                                                                       2/10 
  Installing : glibc-headers-2.17-196.el7_4.2.x86_64                                                                           3/10 
  Installing : glibc-devel-2.17-196.el7_4.2.x86_64                                                                             4/10 
  Installing : gcc-4.8.5-16.el7_4.2.x86_64                                                                                     5/10 
  Installing : dkms-2.4.0-1.20170926git959bd74.el7.noarch                                                                      6/10 
  Installing : spl-dkms-0.7.7-1.el7_4.noarch                                                                                   7/10 
Loading new spl-0.7.7 DKMS files...
Building for 3.10.0-693.21.1.el7.x86_64
Building initial module for 3.10.0-693.21.1.el7.x86_64
Done.

spl:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/spl/spl/

splat.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/splat/splat/
Adding any weak-modules

depmod....

DKMS: install completed.
  Installing : zfs-dkms-0.7.7-1.el7_4.noarch                                                                                   8/10 
Loading new zfs-0.7.7 DKMS files...
Building for 3.10.0-693.21.1.el7.x86_64
Building initial module for 3.10.0-693.21.1.el7.x86_64
Done.

zavl:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/avl/avl/

znvpair.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/nvpair/znvpair/

zunicode.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/unicode/zunicode/

zcommon.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zcommon/zcommon/

zfs.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zfs/zfs/

zpios.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zpios/zpios/

icp.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/icp/icp/
Adding any weak-modules

depmod....

DKMS: install completed.
  Installing : spl-0.7.7-1.el7_4.x86_64                                                                                        9/10 
  Installing : zfs-0.7.7-1.el7_4.x86_64                                                                                       10/10 
  Verifying  : dkms-2.4.0-1.20170926git959bd74.el7.noarch                                                                      1/10 
  Verifying  : zfs-dkms-0.7.7-1.el7_4.noarch                                                                                   2/10 
  Verifying  : zfs-0.7.7-1.el7_4.x86_64                                                                                        3/10 
  Verifying  : spl-0.7.7-1.el7_4.x86_64                                                                                        4/10 
  Verifying  : kernel-devel-3.10.0-693.21.1.el7.x86_64                                                                         5/10 
  Verifying  : glibc-devel-2.17-196.el7_4.2.x86_64                                                                             6/10 
  Verifying  : kernel-headers-3.10.0-693.21.1.el7.x86_64                                                                       7/10 
  Verifying  : gcc-4.8.5-16.el7_4.2.x86_64                                                                                     8/10 
  Verifying  : spl-dkms-0.7.7-1.el7_4.noarch                                                                                   9/10 
  Verifying  : glibc-headers-2.17-196.el7_4.2.x86_64                                                                          10/10 

Installed:
  zfs.x86_64 0:0.7.7-1.el7_4                                                                                                        

Dependency Installed:
  dkms.noarch 0:2.4.0-1.20170926git959bd74.el7                      gcc.x86_64 0:4.8.5-16.el7_4.2                                   
  glibc-devel.x86_64 0:2.17-196.el7_4.2                             glibc-headers.x86_64 0:2.17-196.el7_4.2                         
  kernel-devel.x86_64 0:3.10.0-693.21.1.el7                         kernel-headers.x86_64 0:3.10.0-693.21.1.el7                     
  spl.x86_64 0:0.7.7-1.el7_4                                        spl-dkms.noarch 0:0.7.7-1.el7_4                                 
  zfs-dkms.noarch 0:0.7.7-1.el7_4                                  

Complete!

I am using rsnapshot to do backups. It is when it runs the equivalent of the command below that the issues come up.

$ /usr/bin/cp -al /bkpfs/Rsnapshot/hourly.0 /bkpfs/Rsnapshot/hourly.1
/usr/bin/cp: cannot create hard link ‘/bkpfs/Rsnapshot/hourly.1/System/home/user/filename’ to ‘/bkpfs/Rsnapshot/hourly.0/System/home/user/filename’: No space left on device

There's plenty of space

$ df -h /bkpfs/
Filesystem      Size  Used Avail Use% Mounted on
bkpfs           5.0T  4.2T  776G  85% /bkpfs
$ df -i /bkpfs/
Filesystem         Inodes   IUsed      IFree IUse% Mounted on
bkpfs          1631487194 5614992 1625872202    1% /bkpfs
$ zpool iostat -v bkpfs
                                                  capacity     operations     bandwidth 
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
bkpfs                                           4.52T   950G      9      5  25.4K   117K
  mirror                                        1.84T   912G      4      3  22.0K  94.7K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  11.2K  47.4K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  10.8K  47.4K
  mirror                                        2.68T  37.3G      4      2  3.46K  22.2K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  1.71K  11.1K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  1.75K  11.1K
cache                                               -      -      -      -      -      -
  ata-INTEL_SSDSC2BW120H6                       442M   111G     17      0  9.48K  10.0K
----------------------------------------------  -----  -----  -----  -----  -----  -----
$ zpool status
  pool: bkpfs
 state: ONLINE
  scan: scrub repaired 0B in 11h17m with 0 errors on Sun Apr  1 05:34:09 2018
config:

    NAME                                            STATE     READ WRITE CKSUM
    bkpfs                                           ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
        ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
      mirror-1                                      ONLINE       0     0     0
        ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
        ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
    cache
      ata-INTEL_SSDSC2BW120H6                      ONLINE       0     0     0

errors: No known data errors
$ zfs get all bkpfs
NAME   PROPERTY              VALUE                  SOURCE
bkpfs  type                  filesystem             -
bkpfs  creation              Fri Dec 22 10:34 2017  -
bkpfs  used                  4.52T                  -
bkpfs  available             776G                   -
bkpfs  referenced            4.19T                  -
bkpfs  compressratio         1.00x                  -
bkpfs  mounted               yes                    -
bkpfs  quota                 none                   default
bkpfs  reservation           none                   default
bkpfs  recordsize            128K                   default
bkpfs  mountpoint            /bkpfs                 default
bkpfs  sharenfs              off                    default
bkpfs  checksum              on                     default
bkpfs  compression           off                    default
bkpfs  atime                 on                     default
bkpfs  devices               on                     default
bkpfs  exec                  on                     default
bkpfs  setuid                on                     default
bkpfs  readonly              off                    default
bkpfs  zoned                 off                    default
bkpfs  snapdir               hidden                 default
bkpfs  aclinherit            restricted             default
bkpfs  createtxg             1                      -
bkpfs  canmount              on                     default
bkpfs  xattr                 on                     default
bkpfs  copies                1                      default
bkpfs  version               5                      -
bkpfs  utf8only              off                    -
bkpfs  normalization         none                   -
bkpfs  casesensitivity       sensitive              -
bkpfs  vscan                 off                    default
bkpfs  nbmand                off                    default
bkpfs  sharesmb              off                    default
bkpfs  refquota              none                   default
bkpfs  refreservation        none                   default
bkpfs  guid                  8662648373298485368    -
bkpfs  primarycache          all                    default
bkpfs  secondarycache        all                    default
bkpfs  usedbysnapshots       334G                   -
bkpfs  usedbydataset         4.19T                  -
bkpfs  usedbychildren        234M                   -
bkpfs  usedbyrefreservation  0B                     -
bkpfs  logbias               latency                default
bkpfs  dedup                 off                    default
bkpfs  mlslabel              none                   default
bkpfs  sync                  standard               default
bkpfs  dnodesize             legacy                 default
bkpfs  refcompressratio      1.00x                  -
bkpfs  written               1.38T                  -
bkpfs  logicalused           4.51T                  -
bkpfs  logicalreferenced     4.18T                  -
bkpfs  volmode               default                default
bkpfs  filesystem_limit      none                   default
bkpfs  snapshot_limit        none                   default
bkpfs  filesystem_count      none                   default
bkpfs  snapshot_count        none                   default
bkpfs  snapdev               hidden                 default
bkpfs  acltype               off                    default
bkpfs  context               none                   default
bkpfs  fscontext             none                   default
bkpfs  defcontext            none                   default
bkpfs  rootcontext           none                   default
bkpfs  relatime              off                    default
bkpfs  redundant_metadata    all                    default
bkpfs  overlay               off                    default

For those who want to know my hardware: the system is an AMD X2 255 processor with 8GB of memory (far more than enough for my home backup system).

I can revert today, or I can help test if someone needs me to try something. Just let me know.

Thanks!

Can someone who can repro this try bisecting the changes between 0.7.6 and 0.7.7 so we can see which commit breaks people?

Most likely https://github.com/zfsonlinux/zfs/commit/cc63068e95ee725cce03b1b7ce50179825a6cda5; it seems to be a race condition in the mzap->fzap upgrade phase.

@loli10K this, uh, seems horrendous enough that unless someone volunteers a fix for the race Real Fast, a revert and cutting a point release for this alone seems like it would be merited, to me at least.

@rincebrain I can try later today. I'm meeting some friends for lunch and will be gone for a few hours but I'm happy to help how I can when I get back.
[Edit] To try to bisect the changes that is. :-)

@cstackpole if you do, it's probably worth trying with and without the commit @loli10K pointed to, rather than letting the bisect naturally find it.

From what we have seen so far it certainly seems to only affect older (by which I mean lower-versioned) kernels. I have not been able to reproduce the issue on Linux 4.15 (Fedora).

@aerusso

[root@localhost test]# uname -a
Linux localhost.localdomain 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

@loli10K any clue on why it affects 3.x kernels only, while 4.x seems immune?

BTW, I bisected it, and couldn't repro it on CentOS 7 with 3.10.0-693.21.1 on eb9c453 but could on cc63068, so that does appear to be the cause.

I haven't done any testing yet, but I very much appreciate the speed at which you've found the commit, rincebrain! Since seeing this issue raised, I've been quite nervous, and I don't yet know if I'm affected.

% uname -srvmo
Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 GNU/Linux

Since this seems to be FRAME_POINTER-specific (unless anyone's got a counter-example), I would guess this is #5041 2.0: Electric Boogaloo

Thanks @rincebrain for confirming!
Since this is just my personal-at-home system, I don't mind leaving it in its reproducible state if anyone wants me to test something later in the week.

@kpande Yes, I've been following this one but haven't looked into it at all. Has this for sure been narrowed down to cc63068e95ee725cce03b1b7ce50179825a6cda5? This is clearly something that has to get fixed right away.

@dweeezil I couldn't readily repro it on CentOS 7 x86_64 on the commit before cc63068, and could easily repro it on cc63068, same SPL both times.

cc63068 sets a limit on the number of times zap_add will try to expand (split) a zap leaf block for a directory when adding a new entry would overflow an existing leaf.

The limit (2) is sufficient for handling a colliding name when casesensitivity=sensitive is set, but it appears to bail out too early (with ENOSPC) when the zap for the directory grows past a certain size (possibly also due to leaf hash collisions). When zap_add fails, it rolls back the transaction, so the znode for the new file is removed.

So far, this is undesirable but doesn't result in data loss per se, since the system just refuses to create new files with "No space left on device".

My hypothesis is that subsequent zap_adds are successful because the directory's zap has already grown (as long as one or two additional leaf splits are sufficient to fit the new entry), but the subsequent zap expansions are being discarded due to a side effect of the previous rollback (possibly closing the transaction there). The VFS page cache still reflects the new files, but they're not present in the ARC (or committed to disk), hence flushing the page cache makes them go away. It's not clear whether the znodes for the files are leaked as a result (unlinked from the directory but still present) or whether they're also being discarded.
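
To make that failure mode concrete, here is a runnable userland reduction of the retry loop being described. This is a sketch only: zap_entry_create() and zap_expand_leaf() below are stubs standing in for the real zap.c functions, modeling a leaf that stays full no matter how often it is expanded, so only the control flow of the guard mirrors the 0.7.7 code.

/* retrycap.c -- the shape of the capped retry loop in 0.7.7's fzap_add_cd().
 * Everything below is a userland stub for illustration; only the control
 * flow of the guard mirrors the real code.  Build: cc -o retrycap retrycap.c
 */
#include <stdio.h>
#include <errno.h>

#define MAX_EXPAND_RETRIES 2

/* Stub: the target leaf is still full, so entry creation keeps failing. */
static int zap_entry_create(void) { return (EAGAIN); }

/* Stub: "expanding" hands back the same leaf, i.e. the split did not help. */
static int zap_expand_leaf(int *l) { (void) l; return (0); }

int main(void) {
        int l = 0, prev_l = l;
        int expand_retries = 0, err;

        for (;;) {
                err = zap_entry_create();
                if (err != EAGAIN)
                        break;
                /* The guard cc63068 added: if the last expansions did not
                 * change the leaf, stop retrying and fail with ENOSPC. */
                if (expand_retries > MAX_EXPAND_RETRIES && prev_l == l) {
                        err = ENOSPC;
                        break;
                }
                prev_l = l;
                (void) zap_expand_leaf(&l);
                expand_retries++;
                printf("expand_retries=%d\n", expand_retries);
        }
        printf("err=%d (%s)\n", err, err == ENOSPC ? "ENOSPC" : "ok");
        return (0);
}

Running it prints three expansion attempts followed by ENOSPC, the same pattern the stap trace further down captures on a real system.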

I have masked 0.7.7 in Gentoo based on this issue.

https://bugs.gentoo.org/652828

I have cleared my schedule for tomorrow so that I have time to spend on this. I'd say more, but this blindsided me and it is too late at night for me to start looking into it now.

Ok, so the expand retry limit of 2 is not enough. In fact, there shouldn't be a limit at all until we hit the limits of the ZAP itself.

The reason you can create a ZAP with a lot of files but cannot copy them is that when you create files, you create them in random order in terms of hash value. However, if you copy files from one directory to another directory, you create them sequentially in terms of hash value. That means if the source directory expanded its leaves 6 times, you need to expand the destination leaves 6 times in one go.

One thing to note is that we do use a different salt for each directory, so theoretically a strong enough salt should prevent this from happening. This shows that the current salt is not strong enough.

To remove the expand limit, try removing this if block.
https://github.com/zfsonlinux/zfs/blob/cc63068e95ee725cce03b1b7ce50179825a6cda5/module/zfs/zap.c#L861

The files going missing afterward is a strange issue. I'll have to investigate to see what happened. I don't think there's any transaction rollback in the error path.
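
To make the hash-ordering argument concrete, here is a small self-contained simulation (toy code, not ZFS: the leaf capacity, trie layout, and hash generator are all invented for illustration). It caps leaf splits per insert the way cc63068 does, and inserts the same 10000 fake hashes twice: once in random order, as ordinary file creation does, and once in ascending order, which is roughly what cp -r does because ZFS readdir walks a directory's ZAP in hash order.

/* zapsim.c -- toy model of the fzap_add_cd() expand-retry cap.
 * Not ZFS code: leaf capacity, the trie, and the hash generator are
 * all made up for illustration.  Build: cc -O2 -o zapsim zapsim.c
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define LEAF_CAP        64      /* toy leaf capacity */
#define MAX_RETRIES     2       /* mirrors the 0.7.7 retry cap */
#define NFILES          10000

typedef struct node {
        struct node *child[2];  /* non-NULL => interior node */
        uint64_t key[LEAF_CAP];
        int nkeys;
} node_t;

static node_t *new_node(void) { return (calloc(1, sizeof (node_t))); }

/* Split a full leaf one bit deeper, like zap_expand_leaf()/zap_leaf_split(). */
static void split(node_t *n, int depth) {
        n->child[0] = new_node();
        n->child[1] = new_node();
        for (int i = 0; i < n->nkeys; i++) {
                node_t *c = n->child[(n->key[i] >> (63 - depth)) & 1];
                c->key[c->nkeys++] = n->key[i];
        }
        n->nkeys = 0;
}

/* Returns 0 on success, -1 for the simulated ENOSPC. */
static int insert(node_t *n, uint64_t k) {
        int depth = 0, retries = 0;

        for (;;) {
                while (n->child[0] != NULL) {   /* walk down to the leaf */
                        n = n->child[(k >> (63 - depth)) & 1];
                        depth++;
                }
                if (n->nkeys < LEAF_CAP) {
                        n->key[n->nkeys++] = k;
                        return (0);
                }
                if (retries > MAX_RETRIES || depth > 62)
                        return (-1);            /* give up, as 0.7.7 did */
                retries++;
                split(n, depth);
        }
}

static int cmp64(const void *a, const void *b) {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return ((x > y) - (x < y));
}

static int run(const uint64_t *h) {
        node_t *root = new_node();
        int fails = 0;

        for (int i = 0; i < NFILES; i++)        /* drop keys that "ENOSPC" */
                fails += (insert(root, h[i]) != 0);
        return (fails);                         /* trie leaked; it's a demo */
}

int main(void) {
        static uint64_t h[NFILES];

        srand(42);
        for (int i = 0; i < NFILES; i++)        /* fake 64-bit ZAP hashes */
                h[i] = ((uint64_t)rand() << 33) ^ ((uint64_t)rand() << 11) ^
                    (uint64_t)rand();

        printf("random create order: %d ENOSPC\n", run(h));
        qsort(h, NFILES, sizeof (h[0]), cmp64); /* cp -r: readdir hash order */
        printf("sorted create order: %d ENOSPC\n", run(h));
        return (0);
}

With these toy parameters the random pass should report zero failures, while the sorted pass reports some, mirroring the "creating 10000 files works, cp -r of the same files fails" pattern from the original report.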

Getting rid of the limit doesn't panic the box when running the casenorm ZTS group and seems to prevent this issue:

@@ -855,15 +855,6 @@ retry:
        if (err == 0) {
                zap_increment_num_entries(zap, 1, tx);
        } else if (err == EAGAIN) {
-               /*
-                * If the last two expansions did not help, there is no point
-                * trying to expand again
-                */
-               if (expand_retries > MAX_EXPAND_RETRIES && prev_l == l) {
-                       err = SET_ERROR(ENOSPC);
-                       goto out;
-               }
-
                err = zap_expand_leaf(zn, l, tag, tx, &l);
                zap = zn->zn_zap;       /* zap_expand_leaf() may change zap */
                if (err == 0) {
[root@centos ~]# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:    7.4.1708
Codename:   Core
[root@centos ~]# uname -r
3.10.0-693.21.1.el7.x86_64
[root@centos ~]# cat /sys/module/zfs/version 
0.7.7-1
[root@centos ~]# while :; do
>    zpool destroy testpool
>    zpool create testpool -f -O xattr=dir -O atime=off -O mountpoint=none -O recordsize=1M /dev/vdb
>    zfs create testpool/src -o mountpoint=/mnt
>    zfs create testpool/dst -o mountpoint=/mnt/DST
>    mkdir /mnt/SRC; for i in $(seq 1 10000); do echo -n > /mnt/SRC/$i; done;
>    printf "$(find /mnt/SRC -type f | wc -l) -> "
>    cp -r /mnt/SRC /mnt/DST
>    echo "$(find /mnt/DST -type f | wc -l)"
> done
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
^C
[root@centos ~]#
...
[root@centos ~]# sudo -u nobody -s /usr/share/zfs/zfs-tests.sh -d /var/tmp -T casenorm
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/case_all_values (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/norm_all_values (run as root) [00:01] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/mixed_create_failure (run as root) [00:10] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/cleanup (run as root) [00:00] [PASS]

Results Summary
PASS       5

Running Time:   00:00:12
Percent passed: 100.0%
Log directory:  /var/tmp/test_results/20180401T016189

[root@centos ~]# 

Now testing kernel 3.10.x on Debian 8 with the same Kconfig from the previous CentOS7 box ... EDIT: Debian stays strong and does not seem to be affected running 3.10.108.

I can confirm the ENOSPC (No space left on device) is coming from fzap_add_cd when we hit the retry limit, running the reproducer under the following stap script:

probe
module("zfs").function("zap_leaf_split").call,
module("zfs").function("fzap_add_cd").call,
module("zfs").function("mzap_upgrade").call,
module("zfs").function("zap_entry_create").call,
module("zfs").function("zap_expand_leaf").call
{
   printf(" %s -> %s\n", symname(caller_addr()), ppfunc());
}
probe
module("zfs").function("zap_leaf_split").return,
module("zfs").function("fzap_add_cd").return,
module("zfs").function("mzap_upgrade").return,
module("zfs").function("zap_entry_create").return,
module("zfs").function("zap_expand_leaf").return
{
   printf(" %s <- %s %s\n", symname(caller_addr()), ppfunc(), $$return$);
}
probe
module("zfs").statement("fzap_add_cd@module/zfs/zap.c:867")
{
   printf(" * %s <- %s expand_retries=%s\n", symname(caller_addr()), ppfunc(), $expand_retries$$);
}
Relevant output:

 fzap_add_cd -> zap_entry_create
 0xffffffff816b9459 <- zap_entry_create return=11
 * 0xffffffff816b9459 <- fzap_add_cd expand_retries=0
 fzap_add_cd -> zap_expand_leaf
 zap_expand_leaf -> zap_leaf_split
 0xffffffff816b9459 <- zap_leaf_split
 0xffffffff816b9459 <- zap_expand_leaf return=0
 fzap_add_cd -> zap_entry_create
 0xffffffff816b9459 <- zap_entry_create return=11
 * 0xffffffff816b9459 <- fzap_add_cd expand_retries=1
 fzap_add_cd -> zap_expand_leaf
 zap_expand_leaf -> zap_leaf_split
 0xffffffff816b9459 <- zap_leaf_split
 0xffffffff816b9459 <- zap_expand_leaf return=0
 fzap_add_cd -> zap_entry_create
 0xffffffff816b9459 <- zap_entry_create return=11
 * 0xffffffff816b9459 <- fzap_add_cd expand_retries=2
 fzap_add_cd -> zap_expand_leaf
 zap_expand_leaf -> zap_leaf_split
 0xffffffff816b9459 <- zap_leaf_split
 0xffffffff816b9459 <- zap_expand_leaf return=0
 fzap_add_cd -> zap_entry_create
 0xffffffff816b9459 <- zap_entry_create return=11
 * 0xffffffff816b9459 <- fzap_add_cd expand_retries=3
 zap_add_impl <- fzap_add_cd return=28 (ENOSPC)

Well, I could not reproduce this by running the CentOS7 kernel on Debian8, but I could by using its cp:

On CentOS7, testing also with cp from Debian8:

[root@centos ~]# while :; do
>    zpool destroy testpool
>    zpool create testpool -f -O xattr=dir -O atime=off -O mountpoint=none -O recordsize=1M /dev/vdb
>    zfs create testpool/src -o mountpoint=/mnt
>    zfs create testpool/dst -o mountpoint=/mnt/DST
>    mkdir /mnt/SRC; for i in $(seq 1 10000); do echo -n > /mnt/SRC/$i; done;
>    ./debian-cp -r /mnt/SRC /mnt/DST-debian
>    cp -r /mnt/SRC /mnt/DST-centos
> done
cp: cannot create regular file ‘/mnt/DST-centos/4143’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/1970’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/5654’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/5945’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/2740’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/3659’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/2070’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/5183’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/7715’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/8593’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/9654’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/1064’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/2862’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/6636’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/865’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/6090’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/6066’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/9233’: No space left on device
^C
[root@centos ~]# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:    7.4.1708
Codename:   Core
[root@centos ~]# rpm -qa coreutils
coreutils-8.22-18.el7.x86_64
[root@centos ~]# 

On Debian8, with cp from CentOS7:

root@linux:~# while :; do
>    zpool destroy testpool
>    zpool create testpool -f -O xattr=dir -O atime=off -O mountpoint=none -O recordsize=1M /dev/vdb
>    zfs create testpool/src -o mountpoint=/mnt
>    zfs create testpool/dst -o mountpoint=/mnt/DST
>    mkdir /mnt/SRC; for i in $(seq 1 10000); do echo -n > /mnt/SRC/$i; done;
>    cp -r /mnt/SRC /mnt/DST-debian
>    ./centos-cp -r /mnt/SRC /mnt/DST-centos
> done
./centos-cp: cannot create regular file ‘/mnt/DST-centos/5423’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/8558’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/4338’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/3524’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/4601’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/9311’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/7348’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/3211’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/8768’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/6951’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/4538’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/7596’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/7539’: No space left on device
^C
root@linux:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.0 (jessie)
Release:    8.0
Codename:   jessie
root@linux:~# dpkg -l coreutils
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                           Version                      Architecture                 Description
+++-==============================================-============================-============================-==================================================================================================
ii  coreutils                                      8.23-4                       amd64                        GNU core utilities
root@linux:~# 

We may need to find a better reproducer than "cp" for the regression test proposed in #7411.

This update is still being offered when RHEL-based systems do a "yum update"; given the serious nature of this bug, should the update not be pulled, leaving 0.7.6 as the latest available version?

Today is a day when I'm EXTREMELY glad I have ZFS/SPL updates blocked and do them manually during designated downtime windows or otherwise more convenient times.

@flynnjFIU @behlendorf is the maintainer for RHEL-based systems and he just got into the office. He likely does not even know about this yet. I'll give him a call to let him know so that he can take the update out of the RPM repository. Thanks for pointing it out.

@perfinion pointed out to me in IRC that this may be reproducible mainly on RHEL-based systems because they use xattr=sa to speed up SELinux's handling of filesystem labels. xattr=sa might be related. I had a late start on this today, so I am not certain either way at this point, but I think he made a good point that the interaction with xattr=sa should be considered.

@ryao the same problem occurs with xattr=on.

@flynnjFIU I spoke to Brian. He just learned about this in something like the past hour. The tentative plan is to pull 0.7.7 from the RPM repository and push out 0.7.8 with a revert of cc63068e95ee725cce03b1b7ce50179825a6cda5. He is going to have a chat with @tonyhutter before he finalizes the plan to deal with this.

@vbrik Thanks for that information. That helps narrow things down. :)

@ryao Is there any risk that data created with 0.7.7 on CentOS will be corrupted/disappear with the fix in 0.7.8??

@alatteri My tentative understanding is that if ENOSPC did not occur, the data should be fine. I suggest downgrading to 0.7.6 for the time being, though.

Would people who can/cannot reproduce this issue please post the following information about the systems they tested?

  1. Reproducibility (yes or no)
  2. ZoL version
  3. Distribution name and version
  4. Kernel Version
  5. Coreutils Version
  6. SELinux status (enforcing, permissive, off/unused)

For those who need them, here are links to the RPM packages for coreutils on CentOS 6 and CentOS 7:

https://centos.pkgs.org/6/centos-x86_64/coreutils-8.4-46.el6.x86_64.rpm.html
https://centos.pkgs.org/7/centos-x86_64/coreutils-8.22-18.el7.x86_64.rpm.html

They contain the cp used on CentOS. Instructions on how to extract them are here:

https://www.cyberciti.biz/tips/how-to-extract-an-rpm-package-without-installing-it.html

Compiler: gcc version 6.4.0 (Gentoo Hardened 6.4.0-r1 p1.3)
uname -a: Linux baraddur 4.16.0-gentoo #1 SMP PREEMPT Wed Apr 4 12:18:23 +08 2018 x86_64 AMD Ryzen Threadripper 1950X 16-Core Processor AuthenticAMD GNU/Linux
distro: gentoo hardened selinux
ZFS kmod from HEAD: Loaded module v0.7.0-403_g1724eb62
SELinux enforcing and permissive both hit it

gentoo cp 8.28-r1 binary: can't repro even with 100k files
debian 8 8.26 binary: also can't repro
centos7 8.22 binary: hits it instantly

Reproducibility: yes
ZoL version: zfs-0.7.7-1.el6.x86_64
Distribution name and version: Scientific Linux 6.8
Kernel Version: 2.6.32-696.23.1.el6.x86_64
Coreutils Version: coreutils-8.4-46.el6.x86_64
SELinux status: off

Reproducibility: no

Distribution name and version: Arch Linux
ZoL version:

local/spl-linux-git 2018.04.04.r1070.581bc01.4.15.15.1-1 (archzfs-linux-git)
local/spl-utils-common-git 2018.04.04.r1070.581bc01-1 (archzfs-linux-git)
local/zfs-linux-git 2018.04.04.r3402.533ea0415.4.15.15.1-1 (archzfs-linux-git)
local/zfs-utils-common-git 2018.04.04.r3402.533ea0415-1 (archzfs-linux-git)

This is a ZFS build from commit 533ea0415.

Kernel Version: Linux kiste 4.15.15-1-ARCH #1 SMP PREEMPT Sat Mar 31 23:59:25 UTC 2018 x86_64 GNU/Linux
Coreutils Version: local/coreutils 8.29-1
SELinux status (enforcing, permissive, off/unused): off

Unable to test CentOS 7 cp due to dependency on SELinux libraries (Arch doesn't support SELinux).

@tuxoko Nice analysis!

The reason you can create a ZAP with a lot of files but cannot copy them is that when you create files, you create them in random order in terms of hash value. However, if you copy files from one directory to another directory, you create them sequentially in terms of hash value. That means if the source directory expanded its leaves 6 times, you need to expand the destination leaves 6 times in one go.

One thing to note is that we do use a different salt for each directory, so theoretically a strong enough salt should prevent this from happening. This shows that the current salt is not strong enough.

The salt is pretty weak (see mzap_create_impl()); I'm not sure why we didn't just use random_get_pseudo_bytes(). I wonder if they are actually getting the same exact hash, or if there's some weakness in the way that the salt is used in zap_hash()? zdb can dump the salt to see if they are the same.
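
For anyone wondering what "pretty weak" means here: if memory serves, the 0.7.x mzap_create_impl() derives the salt roughly as ((uintptr_t)db ^ (uintptr_t)tx) | 1ULL, i.e. from two kernel pointers. The toy below mimics that pattern in userland (the variable names are stand-ins, not the real code) to show how such salts cluster, which is exactly the kind of structure that drawing the salt from random_get_pseudo_bytes() would avoid.

/* weaksalt.c -- illustrate how pointer-derived salts cluster.
 * Userland mimicry of the pattern; the real values are kernel pointers,
 * which are at least as structured as heap pointers.
 * Build: cc -o weaksalt weaksalt.c
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void) {
        for (int i = 0; i < 8; i++) {
                void *db = malloc(64);  /* stand-in for the dbuf pointer */
                void *tx = malloc(64);  /* stand-in for the tx pointer */
                uint64_t salt = ((uintptr_t)db ^ (uintptr_t)tx) | 1ULL;

                /* allocations are deliberately leaked; it's a demo */
                printf("salt %d: 0x%016llx\n", i,
                    (unsigned long long)salt);
        }
        /* On a typical glibc run the two allocations are adjacent, so the
         * XOR -- and hence the salt -- is a small number that barely
         * changes between iterations.  A PRNG-drawn salt would not. */
        return (0);
}

If two directories created close together end up with salts this similar, names that land near each other in one directory's hash space are likely to land near each other in the other's too, which would fit the sequential-hash-copy theory above.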

  1. Reproducible = yes
  2. zfs.x86_64 0.7.7-1.el7_4 @zfs-kmod
  3. CentOS 7.4
  4. Linux 3.10.0-693.21.1.el7.x86_64 x86_64 GNU/Linux
  5. coreutils.x86_64 8.22-18.el7
  6. SELINUX=disabled. SELINUXTYPE=targeted

We're working to get an 0.7.8 release out with https://github.com/zfsonlinux/zfs/commit/cc63068e95ee725cce03b1b7ce50179825a6cda5 reverted ASAP.

Before anyone starts bindiffing binaries: CentOS cp's open(O_CREAT) order is randomized, Debian's is not: random file order = random hash values = more likely to hit zap_expand_leaf()/zap_leaf_split(), I guess ...

[root@centos ~]# grep DST /tmp/debian.txt | head -n 100
execve("./debian-cp", ["./debian-cp", "-r", "/mnt/SRC", "/mnt/DST-debian"], [/* 18 vars */]) = 0
stat("/mnt/DST-debian", 0x7ffc25990cb0) = -1 ENOENT (No such file or directory)
lstat("/mnt/DST-debian", 0x7ffc25990a40) = -1 ENOENT (No such file or directory)
mkdir("/mnt/DST-debian", 0755)          = 0
lstat("/mnt/DST-debian", {st_mode=S_IFDIR|0755, st_size=2, ...}) = 0
open("/mnt/DST-debian/3357", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3358", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3359", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3360", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3361", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3362", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3363", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3364", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3365", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3366", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3367", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3368", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3369", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3370", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3371", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3372", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3373", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3374", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3375", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3376", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3377", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3378", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3379", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3380", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3381", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3382", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3383", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3384", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3385", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3386", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3387", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3388", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3389", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3390", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3391", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3392", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3393", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3394", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3395", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/1", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/2", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/4", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/5", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/6", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/7", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/8", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/9", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/10", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/11", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/12", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/13", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/14", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/15", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/16", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/17", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/18", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/19", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/20", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/21", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/22", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/23", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/24", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/25", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/26", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/27", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/28", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/29", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/30", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/31", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/32", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/33", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/34", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/35", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/36", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/37", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/38", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/39", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/40", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/41", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/42", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/43", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/44", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/45", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/46", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/47", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/48", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/49", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/50", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/51", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/52", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/53", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/54", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/55", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/56", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
[root@centos ~]# grep DST /tmp/centos.txt | head -n 100
execve("/bin/cp", ["cp", "-r", "/mnt/SRC", "/mnt/DST-centos"], [/* 18 vars */]) = 0
stat("/mnt/DST-centos", 0x7ffc6299e1d0) = -1 ENOENT (No such file or directory)
lstat("/mnt/DST-centos", 0x7ffc6299df30) = -1 ENOENT (No such file or directory)
mkdir("/mnt/DST-centos", 0755)          = 0
lstat("/mnt/DST-centos", {st_mode=S_IFDIR|0755, st_size=2, ...}) = 0
open("/mnt/DST-centos/6667", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4153", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8772", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2455", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8691", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6784", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2422", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8705", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2878", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4124", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6610", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2558", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2896", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2902", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2975", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8608", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4029", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6689", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9017", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5636", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/688", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1590", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7102", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9183", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1404", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7096", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3330", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3347", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1473", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7175", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5641", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1829", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9060", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/611", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1509", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1953", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/785", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7078", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1924", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/666", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2065", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4939", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4563", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6257", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8342", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8335", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6220", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2186", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4514", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2012", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4480", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2168", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4834", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4843", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8238", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4419", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7968", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3700", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5392", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1034", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9427", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7532", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3694", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5206", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5271", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7545", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9450", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1043", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3777", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3799", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7865", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/221", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9970", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1139", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/256", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9907", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7812", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7448", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9893", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7986", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4944", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2018", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4933", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4569", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8348", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4849", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2115", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4587", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8232", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6327", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2081", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4413", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4464", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6350", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8245", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
[root@centos ~]# 

Would someone with a CentOS-family system please install gdb and coreutils-debuginfo. Then run gdb -ex 'info sources' $(which cp) and post the output for me? It will save me some trouble of getting my hands on a system so that I can try to figure out what is different between CentOS's cp and Gentoo's cp.

I ran this on Gentoo's cp, which is coreutils 8.28, to get the files that were used to build cp, and after some command-line foo I have tentatively identified these as the patches relevant to cp on CentOS:

./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/copy.c coreutils-8.21/src/copy.c
./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/cp.c coreutils-8.21/src/cp.c
./coreutils-8.22-selinux-optionsseparate.patch:diff -urNp coreutils-8.22-orig/src/cp.c coreutils-8.22/src/cp.c
./coreutils-8.22-mv-hardlinksrace.patch:diff -urNp coreutils-8.22-orig/src/copy.c coreutils-8.22/src/copy.c
./coreutils-8.22-cp-sparsecorrupt.patch:diff --git a/src/copy.c b/src/copy.c
./coreutils-8.22-cp-selinux.patch:diff --git a/src/selinux.c b/src/selinux.c

The files that are touched are included.

Unfortunately, the files used between coreutils versions could have changed, so I need to rerun that analysis on the output from a system running CentOS 6 or CentOS 7 to get a true list. I plan to review and test these patches on Gentoo to see if I can track down the issue from the user-space side. Enough people are scrutinizing the kernel side that I'll delay tackling that until after I've figured out what makes CentOS's cp special.

I could set up a CentOS 7.4 VM but that could take an hour. Let me know if I should go on or if someone else has a system ready for testing.

On 2018-04-09 14:05, Richard Yao wrote:

Would someone with a CentOS-family system please install gdb and
coreutils-debuginfo. Then run gdb -ex 'info sources' $(which cp) and
post the output for me? It will save me some trouble of getting my
hands on a system so that I can try to figure out what is different
between CentOS's cp and Gentoo's cp.

I ran this on Gentoo's cp, which is coreutils 8.28 to get the files
that were used to build cp and after some commandline-foo, I have
tentatively identified these as the patches relevant to cp on CentOS:

./coreutils-selinux.patch
./coreutils-8.22-selinux-optionsseparate.patch
./coreutils-8.22-non-defaulttests.patch
./coreutils-8.22-mv-hardlinksrace.patch
./coreutils-8.22-failingtests.patch
./coreutils-8.22-cp-sparsecorrupt.patch
./coreutils-8.22-cp-selinux.patch

Unfortunately, the files used between coreutils versions could have
changed, so I need to rerun that analysis on the output from a system
using CentOS 6 or CentOS 7 to get a true list. I plan to review / test
on Gentoo these patches to see if I can track down the issue from the
user space side. Enough people are scrutinizing the kernel side that
I'll delay tackling that until after I figured out what makes CentOS'
cp special.

CentOS 7.4:

[root@nas ~]# gdb --ex 'info sources' /usr/bin/cp
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show
copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/bin/cp...Reading symbols from
/usr/bin/cp...(no debugging symbols found)...done.
(no debugging symbols found)...done.
No symbol table is loaded. Use the "file" command.
Missing separate debuginfos, use: debuginfo-install
coreutils-8.22-18.el7.x86_64

@dswartz You are missing the debuginfo. Do debuginfo-install coreutils-8.22-18.el7.x86_64 and try again. Output should look something like this:

https://paste.pound-python.org/raw/xNxJ6p2mHLj3LZVsW4Qr/

Disregard my last: wrong package...

Source files for which symbols have been read in:

Source files for which symbols will be read in on demand:

/usr/src/debug/coreutils-8.22/src/cp.c, /usr/include/sys/stat.h,
/usr/include/bits/string3.h, /usr/include/bits/stdio2.h,
/usr/src/debug/coreutils-8.22/src/system.h,
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/include/stddef.h,
/usr/include/bits/types.h,
/usr/include/stdio.h, /usr/include/libio.h, /usr/include/sys/types.h,
/usr/include/time.h, /usr/include/getopt.h,
/usr/include/selinux/selinux.h, /usr/include/bits/stat.h,
/usr/src/debug/coreutils-8.22/lib/argmatch.h,
/usr/src/debug/coreutils-8.22/lib/hash.h,
/usr/src/debug/coreutils-8.22/lib/backupfile.h,
/usr/src/debug/coreutils-8.22/src/copy.h,
/usr/src/debug/coreutils-8.22/lib/stat-time.h,
/usr/src/debug/coreutils-8.22/src/version.h,
/usr/src/debug/coreutils-8.22/lib/exitfail.h,
/usr/src/debug/coreutils-8.22/lib/progname.h,
/usr/src/debug/coreutils-8.22/,
/usr/src/debug/coreutils-8.22/lib/xalloc.h,
/usr/src/debug/coreutils-8.22/lib/quote.h, /usr/include/libintl.h,
/usr/include/stdlib.h, /usr/src/debug/coreutils-8.22/lib/error.h,
/usr/include/string.h, /usr/include/bits/errno.h,
/usr/src/debug/coreutils-8.22/lib/dirname.h,
/usr/src/debug/coreutils-8.22/lib/utimens.h, /usr/include/unistd.h,
/usr/src/debug/coreutils-8.22/lib/acl.h, /usr/include/locale.h,
/usr/src/debug/coreutils-8.22/lib/filenamecat.h,
/usr/src/debug/coreutils-8.22/lib/propername.h,
/usr/src/debug/coreutils-8.22/lib/version-etc.h,
/usr/src/debug/coreutils-8.22/src/cp-hash.h,
/usr/src/debug/coreutils-8.22/src/copy.c, /usr/include/bits/unistd.h,
/usr/include/bits/stdio.h,
/usr/src/debug/coreutils-8.22/src/ioblksize.h,
/usr/src/debug/coreutils-8.22/src/extent-scan.h,
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/include/stdarg.h,
/usr/include/stdint.h, /usr/src/debug/coreutils-8.22/lib/fadvise.h,
/usr/src/debug/coreutils-8.22/lib/utimecmp.h,
/usr/include/attr/error_context.h,
/usr/src/debug/coreutils-8.22/src/selinux.h,
/usr/src/debug/coreutils-8.22/lib/write-any-file.h,
/usr/src/debug/coreutils-8.22/lib/full-write.h,
/usr/include/attr/libattr.h,
/usr/src/debug/coreutils-8.22/lib/verror.h,
/usr/src/debug/coreutils-8.22/lib/unistd.h,
/usr/src/debug/coreutils-8.22/lib/filemode.h,
/usr/src/debug/coreutils-8.22/lib/same.h,
/usr/src/debug/coreutils-8.22/lib/yesno.h,
/usr/src/debug/coreutils-8.22/lib/file-set.h,
/usr/src/debug/coreutils-8.22/lib/areadlink.h,
/usr/src/debug/coreutils-8.22/lib/savedir.h,
/usr/src/debug/coreutils-8.22/lib/fcntl-safer.h,
/usr/src/debug/coreutils-8.22/lib/buffer-lcm.h,
/usr/include/sys/ioctl.h, /usr/include/assert.h,
/usr/src/debug/coreutils-8.22/src/cp-hash.c,
/usr/src/debug/coreutils-8.22/src/extent-scan.c,
/usr/src/debug/coreutils-8.22/src/fiemap.h,
/usr/src/debug/coreutils-8.22/src/selinux.c, /usr/include/bits/fcntl2.h,
/usr/include/selinux/context.h,
/usr/src/debug/coreutils-8.22/lib/canonicalize.h,
/usr/src/debug/coreutils-8.22/lib/i-ring.h,
/usr/src/debug/coreutils-8.22/lib/fts_.h, /usr/include/dirent.h,
/usr/src/debug/coreutils-8.22/lib/xfts.h,
/usr/src/debug/coreutils-8.22/src/version.c,
/usr/src/debug/coreutils-8.22/lib/copy-acl.c,
/usr/src/debug/coreutils-8.22/lib/set-acl.c,
/usr/src/debug/coreutils-8.22/lib/areadlink-with-size.c,
/usr/src/debug/coreutils-8.22/lib/argmatch.c,
/usr/src/debug/coreutils-8.22/lib/quotearg.h,
/usr/src/debug/coreutils-8.22/lib/backupfile.c,
/usr/include/bits/dirent.h,
/usr/src/debug/coreutils-8.22/lib/dirent-safer.h,
/usr/include/bits/confname.h,
/usr/src/debug/coreutils-8.22/lib/buffer-lcm.c,
/usr/src/debug/coreutils-8.22/lib/canonicalize.c,
/usr/include/bits/string2.h,
/usr/src/debug/coreutils-8.22/lib/xgetcwd.h,
/usr/src/debug/coreutils-8.22/lib/closein.c,
/usr/src/debug/coreutils-8.22/lib/freadahead.h,
/usr/src/debug/coreutils-8.22/lib/close-stream.h,
/usr/src/debug/coreutils-8.22/lib/stdio.h,
/usr/src/debug/coreutils-8.22/lib/closeout.h,
/usr/src/debug/coreutils-8.22/lib/closeout.c,
/usr/src/debug/coreutils-8.22/lib/opendir-safer.c,
/usr/src/debug/coreutils-8.22/lib/unistd-safer.h,
/usr/src/debug/coreutils-8.22/lib/dirname.c,
/usr/src/debug/coreutils-8.22/lib/dirname-lgpl.c,
/usr/src/debug/coreutils-8.22/lib/basename-lgpl.c,
/usr/src/debug/coreutils-8.22/lib/stripslash.c,
/usr/src/debug/coreutils-8.22/lib/exitfail.c,
/usr/src/debug/coreutils-8.22/lib/fadvise.c, /usr/include/fcntl.h,
/usr/src/debug/coreutils-8.22/lib/open-safer.c,
/usr/src/debug/coreutils-8.22/lib/file-set.c,
/usr/src/debug/coreutils-8.22/lib/hash-triple.h,
/usr/src/debug/coreutils-8.22/lib/filemode.c,
/usr/src/debug/coreutils-8.22/lib/filenamecat.c,
/usr/src/debug/coreutils-8.22/lib/filenamecat-lgpl.c,
/usr/src/debug/coreutils-8.22/lib/full-write.c,
/usr/src/debug/coreutils-8.22/lib/safe-write.h,
/usr/src/debug/coreutils-8.22/lib/hash.c,
/usr/src/debug/coreutils-8.22/lib/bitrotate.h,
/usr/src/debug/coreutils-8.22/lib/hash-triple.c,
/usr/src/debug/coreutils-8.22/lib/hash-pjw.h,
/usr/src/debug/coreutils-8.22/lib/progname.c, /usr/include/errno.h,
/usr/src/debug/coreutils-8.22/lib/propername.c,
/usr/src/debug/coreutils-8.22/lib/mbuiter.h,
/usr/src/debug/coreutils-8.22/lib/mbchar.h, /usr/include/wchar.h,
/usr/src/debug/coreutils-8.22/lib/strnlen1.h,
/usr/include/wctype.h, /usr/include/ctype.h,
/usr/src/debug/coreutils-8.22/lib/string.h,
/usr/src/debug/coreutils-8.22/lib/trim.h,
/usr/src/debug/coreutils-8.22/lib/xstriconv.h,
/usr/src/debug/coreutils-8.22/lib/localcharset.h,
/usr/src/debug/coreutils-8.22/lib/c-strcase.h,
/usr/src/debug/coreutils-8.22/lib/qcopy-acl.c, /usr/include/sys/acl.h,
/usr/src/debug/coreutils-8.22/lib/acl-internal.h,
/usr/src/debug/coreutils-8.22/lib/qset-acl.c, /usr/include/acl/libacl.h,
/usr/src/debug/coreutils-8.22/lib/quotearg.c,
/usr/src/debug/coreutils-8.22/lib/c-strcaseeq.h,
/usr/src/debug/coreutils-8.22/lib/safe-read.c,
/usr/src/debug/coreutils-8.22/lib/same.c,
/usr/src/debug/coreutils-8.22/lib/savedir.c,
/usr/src/debug/coreutils-8.22/lib/strnlen1.c,
/usr/src/debug/coreutils-8.22/lib/trim.c,
/usr/src/debug/coreutils-8.22/lib/mbiter.h,
/usr/src/debug/coreutils-8.22/lib/dup-safer.c,
/usr/src/debug/coreutils-8.22/lib/fcntl.h,
/usr/src/debug/coreutils-8.22/lib/fd-safer.c,
/usr/src/debug/coreutils-8.22/lib/utimecmp.c,
/usr/src/debug/coreutils-8.22/lib/utimens.c, /usr/include/bits/time.h,
/usr/src/debug/coreutils-8.22/lib/timespec.h,
/usr/src/debug/coreutils-8.22/lib/sys/stat.h, /usr/include/sys/time.h,
/usr/src/debug/coreutils-8.22/lib/verror.c,
/usr/src/debug/coreutils-8.22/lib/xvasprintf.h,
/usr/src/debug/coreutils-8.22/lib/version-etc.c,
/usr/src/debug/coreutils-8.22/lib/version-etc-fsf.c,
/usr/src/debug/coreutils-8.22/lib/write-any-file.c,
/usr/src/debug/coreutils-8.22/lib/xmalloc.c,
/usr/src/debug/coreutils-8.22/lib/xalloc-die.c,
/usr/src/debug/coreutils-8.22/lib/xfts.c,
/usr/src/debug/coreutils-8.22/lib/xgetcwd.c,
/usr/src/debug/coreutils-8.22/lib/xstriconv.c, /usr/include/iconv.h,
/usr/src/debug/coreutils-8.22/lib/striconv.h,
/usr/src/debug/coreutils-8.22/lib/xvasprintf.c,
/usr/src/debug/coreutils-8.22/lib/xsize.h,
/usr/src/debug/coreutils-8.22/lib/yesno.c,
/usr/src/debug/coreutils-8.22/lib/fcntl.c,
/usr/src/debug/coreutils-8.22/lib/fflush.c,
/usr/include/stdio_ext.h,
/usr/src/debug/coreutils-8.22/lib/freadahead.c,
/usr/src/debug/coreutils-8.22/lib/fseeko.c,
/usr/src/debug/coreutils-8.22/lib/fts-cycle.c,
/usr/src/debug/coreutils-8.22/lib/fts.c,
/usr/src/debug/coreutils-8.22/lib/cycle-check.h,
/usr/src/debug/coreutils-8.22/lib/dev-ino.h, /usr/include/bits/statfs.h,
/usr/src/debug/coreutils-8.22/lib/cloexec.h, /usr/include/sys/statfs.h,
/usr/src/debug/coreutils-8.22/lib/getfilecon.c,
/usr/src/debug/coreutils-8.22/lib/linkat.c,
/usr/src/debug/coreutils-8.22/lib/at-func.c,
/usr/src/debug/coreutils-8.22/lib/utimensat.c,
/usr/src/debug/coreutils-8.22/lib/save-cwd.h,
/usr/src/debug/coreutils-8.22/lib/openat-priv.h,
/usr/src/debug/coreutils-8.22/lib/openat.h,
/usr/src/debug/coreutils-8.22/lib/vasprintf.c,
/usr/src/debug/coreutils-8.22/lib/vasnprintf.h,
/usr/src/debug/coreutils-8.22/lib/areadlinkat.c,
/usr/src/debug/coreutils-8.22/lib/careadlinkat.h,
/usr/src/debug/coreutils-8.22/lib/c-strcasecmp.c,
/usr/src/debug/coreutils-8.22/lib/careadlinkat.c,
/usr/src/debug/coreutils-8.22/lib/allocator.h,
/usr/src/debug/coreutils-8.22/lib/cloexec.c,
/usr/src/debug/coreutils-8.22/lib/close-stream.c,
/usr/src/debug/coreutils-8.22/lib/cycle-check.c,
/usr/src/debug/coreutils-8.22/lib/gettime.c,
/usr/src/debug/coreutils-8.22/lib/hash-pjw.c,
/usr/src/debug/coreutils-8.22/lib/i-ring.c,
/usr/src/debug/coreutils-8.22/lib/localcharset.c,
/usr/include/nl_types.h,
/usr/include/langinfo.h, /usr/src/debug/coreutils-8.22/lib/mbchar.c,
/usr/src/debug/coreutils-8.22/lib/str-kmp.h,
/usr/src/debug/coreutils-8.22/lib/mbsstr.c,
/usr/src/debug/coreutils-8.22/lib/malloca.h,
/usr/src/debug/coreutils-8.22/lib/openat-die.c,
/usr/src/debug/coreutils-8.22/lib/openat-safer.c,
/usr/src/debug/coreutils-8.22/lib/acl-errno-valid.c,
/usr/src/debug/coreutils-8.22/lib/file-has-acl.c,
/usr/src/debug/coreutils-8.22/lib/save-cwd.c,
/usr/src/debug/coreutils-8.22/lib/chdir-long.h,
/usr/src/debug/coreutils-8.22/lib/striconv.c,
/usr/src/debug/coreutils-8.22/lib/chdir-long.c,
/usr/src/debug/coreutils-8.22/lib/fclose.c,
/usr/src/debug/coreutils-8.22/lib/openat-proc.c,
/usr/src/debug/coreutils-8.22/lib/vasnprintf.c,
/usr/src/debug/coreutils-8.22/lib/printf-args.h,
/usr/src/debug/coreutils-8.22/lib/printf-parse.h,
/usr/src/debug/coreutils-8.22/lib/fpucw.h,
/usr/src/debug/coreutils-8.22/lib/isnanl-nolibm.h,
/usr/src/debug/coreutils-8.22/lib/allocator.c,
/usr/src/debug/coreutils-8.22/lib/malloca.c,
/usr/src/debug/coreutils-8.22/lib/mbslen.c,
/usr/src/debug/coreutils-8.22/lib/isnan.c,
/usr/src/debug/coreutils-8.22/lib/printf-args.c,
/usr/src/debug/coreutils-8.22/lib/printf-parse.c

@dswartz Would you edit your post to use a pastebin? Also, why is there nothing under `Source files for which symbols have been read in:`? Did you edit the output?

On 2018-04-09 14:17, Richard Yao wrote:

@dswartz [1] Would you edit your post to use a pastebin?

Sure.

Going by the source files gdb claims cp uses (and excluding patches that edit only test cases, which do not apply to the cp binary), the set of patches that apply to cp is the same as what I got after processing the gdb output from Gentoo's cp, which is:

./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/copy.c coreutils-8.21/src/copy.c
./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/cp.c coreutils-8.21/src/cp.c
./coreutils-8.22-selinux-optionsseparate.patch:diff -urNp coreutils-8.22-orig/src/cp.c coreutils-8.22/src/cp.c
./coreutils-8.22-mv-hardlinksrace.patch:diff -urNp coreutils-8.22-orig/src/copy.c coreutils-8.22/src/copy.c
./coreutils-8.22-cp-sparsecorrupt.patch:diff --git a/src/copy.c b/src/copy.c
./coreutils-8.22-cp-selinux.patch:diff --git a/src/selinux.c b/src/selinux.c

The changes in ./coreutils-8.22-mv-hardlinksrace.patch look questionable to me, but I don't see a smoking gun. Testing on Gentoo after applying these patches should let us figure out which one makes the issue reproducible on CentOS.

On 2018-04-09 14:17, Richard Yao wrote:

@dswartz [1] Would you edit your post to use a pastebin?

https://pastebin.com/raw/TNCNJRau

Reproducibility: no
Distribution name and version: Fedora 27
Kernel Version: 4.15.10-300.fc27.x86_64
Coreutils Version: 8.27-20.fc27
SELinux status: off

EDIT: This machine's cp is copying in alphanumeric order (verified using strace).

Not reproducible using the archzfs repo on Arch Linux (thanks, @demizer).

■ mkdir SRC
■ for i in $(seq 1 10000); do echo $i > SRC/$i ; done
■ cp -r SRC DST
■ uname -srvmo
Linux 4.15.15-1-ARCH #1 SMP PREEMPT Sat Mar 31 23:59:25 UTC 2018 x86_64 GNU/Linux
■ LC_ALL=C pacman -Qi coreutils spl-linux spl-utils-common zfs-linux zfs-utils-common | grep '^Version '
Version         : 8.29-1
Version         : 0.7.7.4.15.15.1-1
Version         : 0.7.7-1
Version         : 0.7.7.4.15.15.1-1
Version         : 0.7.7-1
■ zpool get all | sed '2,$s/^..../tank/g'
NAME  PROPERTY                       VALUE                          SOURCE
tank  size                           43.5T                          -
tank  capacity                       81%                            -
tank  altroot                        -                              default
tank  health                         ONLINE                         -
tank  guid                           xxxxxxxxxxxxxxxxxxx            -
tank  version                        -                              default
tank  bootfs                         -                              default
tank  delegation                     on                             default
tank  autoreplace                    off                            default
tank  cachefile                      -                              default
tank  failmode                       wait                           default
tank  listsnapshots                  off                            default
tank  autoexpand                     off                            default
tank  dedupditto                     0                              default
tank  dedupratio                     1.00x                          -
tank  free                           8.00T                          -
tank  allocated                      35.5T                          -
tank  readonly                       off                            -
tank  ashift                         12                             local
tank  comment                        -                              default
tank  expandsize                     -                              -
tank  freeing                        0                              -
tank  fragmentation                  34%                            -
tank  leaked                         0                              -
tank  multihost                      off                            default
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  disabled                       local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     enabled                        local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            disabled                       local
tank  feature@sha512                 disabled                       local
tank  feature@skein                  disabled                       local
tank  feature@edonr                  disabled                       local
tank  feature@userobj_accounting     disabled                       local

The 0.7.7 release has been removed from the CentOS and Fedora RPM repositories.

@rincebrain confirmed that this is reproducible using touch to create files in the right order (to inflate the ZAP with hash collisions). I'll post a minimal testcase.

@trisk you might want to look at the testcase in https://github.com/zfsonlinux/zfs/pull/7411 first

@Ringdingcoder Nice find. That would explain things nicely if tests with and without that change confirm it is the difference.

@Ringdingcoder I just reproduced this in an old Gentoo VM that uses coreutils 8.21. It is affected. No Red Hat patches are in place there. I'll try reproducing with the patches that you linked and see what happens. I expect one of them to make the issue disappear.

You'll need this:
```
--- a/src/copy.c
+++ b/src/copy.c
@@ -717,7 +717,7 @@ copy_dir (char const *src_name_in, char const *dst_name_in, bool new_dst,
   struct cp_options non_command_line_options = *x;
   bool ok = true;
 
-  name_space = savedir (src_name_in, SAVEDIR_SORT_FASTREAD);
+  name_space = savedir (src_name_in, SAVEDIR_SORT_NONE);
   if (name_space == NULL)
     {
       /* This diagnostic is a bit vague because savedir can fail in
```

Shell script to reproduce (cp not needed): https://gist.github.com/trisk/9966159914d9d5cd5772e44885112d30
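For readers who don't want to open the gist, a minimal sketch of the shape of such a reproducer (the real script hard-codes the sequence of colliding names; `names.txt` below is a hypothetical stand-in for that list):

```
#!/bin/sh
# Sketch only: create files in a ZAP-hash-colliding order, no cp involved.
# names.txt is a placeholder for the precomputed name sequence in the gist.
mkdir -p DST
while read -r name; do
    # An ENOSPC failure from touch here indicates the bug has triggered.
    touch "DST/$name" || echo "touch failed at $name" >&2
done < names.txt
# Compare what readdir sees against what was created:
ls -1A DST | wc -l
wc -l < names.txt
```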

No, actually you're going the other way around; for that you would need to patch gnulib. It's easier to take a recent version and go back to unsorted.

@Ringdingcoder I just realized that when I found that lib/savedir.c didn't exist in the source files I have. I picked an old VM that just happened to have 8.21. I'll update it to 8.23 and then revert the savedir change to verify things.

I can reproduce it immediately by going with SAVEDIR_SORT_NONE. This explains why only old distros experience this. IIRC, tar is unsorted, so a `tar cf - SRC | tar -C DST -xf -` should be able to trigger this everywhere (untested).
Obviously the hard-coded sequence from trisk's script also does it.

With @trisk's script I can immediately reproduce this on 64-bit Ubuntu 16.04 (kernel 4.4.0-109-generic) with ZFS 0.7.7. It fails as expected:

touch: cannot touch 'DST/9259': No space left on device

@Ringdingcoder I think we have now satisfactorily explained why, among the various distributions, only the RHEL-family ones are affected. I am going to switch to understanding what is going wrong inside the kernel.

I observed a link-count corruption issue that persisted across umounts when I reproduced this, but I have had trouble reproducing the problem reliably enough to have a reproducer for the link-count issue.

I have now reproduced the bug on Arch Linux, using a corrected version of @trisk's script (it had an unexpected-token error on line 6). I am unable to reproduce the bug consistently:

■ export LC_ALL=C
■ rm -r DST
■ ./zap-collision-test.sh
■ rm -r DST
■ ./zap-collision-test.sh
touch: cannot touch 'DST/9259': No space left on device
■ rm -r DST
■ ./zap-collision-test.sh
touch: cannot touch 'DST/9259': No space left on device
■ rm -r DST
■ ./zap-collision-test.sh
■ rm -r DST
■ ./zap-collision-test.sh
touch: cannot touch 'DST/9259': No space left on device
■ rm -r DST
■ ./zap-collision-test.sh
■

@NoSuck Can confirm on Arch, too, using @trisk's script. This invalidates my previous comment.

local/spl-linux-git 2018.04.04.r1070.581bc01.4.15.15.1-1 (archzfs-linux-git)
local/spl-utils-common-git 2018.04.04.r1070.581bc01-1 (archzfs-linux-git)
local/zfs-linux-git 2018.04.04.r3402.533ea0415.4.15.15.1-1 (archzfs-linux-git)
local/zfs-utils-common-git 2018.04.04.r3402.533ea0415-1 (archzfs-linux-git)

Using @trisk's script (well, a slightly corrected version) I can reproduce this on an almost current version of the git tip, g1724eb62d, on Fedora 27 on a simple mirrored vdev in a VM. It doesn't happen on every run, but it happens reasonably frequently (at least half the time, I think).

(This git version is the most recent version I've built for my own use. I can test with the very latest git tip, but I don't see anything there that would change this, if the identified cause is right. I'd be happy to test updates in the VM.)

It might be significant that we hit the ZAP expansion limit at 2048 files (though it is unclear whether this reflects a property of the coreutils sorting or of the ZAP hash function).
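For anyone who wants to check where a given directory's ZAP stands, a hedged sketch (tank/fs and the object number are placeholders; on ZFS a directory's inode number is its object number, and zdb's per-object dump includes the ZAP layout):

```
# Get the directory's object number (its inode number on ZFS):
ls -di DST
# Dump that object with zdb; the output shows whether the directory is
# still a microzap or has expanded into a fat ZAP, plus its statistics:
zdb -dddd tank/fs <object-number>
```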

The original reproducer creates orphaned files when it triggers, while @trisk's reproducer does not. After running the original reproducer, I observed a failure on 1 file, 8186 files in the directory according to `ls -l DST | wc`, and a directory size of 10001. After unlinking all of the files, attempts to stat them to see whether any were still accessible failed, despite the directory size still being 1816. Here is zdb output from a testpool that I used to reproduce the issue:

https://bpaste.net/show/d9f2f0de6c61

I forget how many times I ran the reproducer on this (likely twice), but the orphaned files are clearly visible. Here is a compressed image of the pool:

https://dev.gentoo.org/~ryao/7401-pool-orphaned-files.xz

It has sha256 5bf54d804f0cd6cd155cc781efeefdabaa6e0ddddc500695eb24061d802474ac. The pool itself is just a 1GB sparse file. The compressed version is 1938032 bytes (~2MB) in size. Others can use zdb on it and poke around to observe the orphaned files.
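For those who want to poke around, a sketch of one way to inspect the image with zdb (assumptions: the decompressed file is placed in /tmp, and testpool is a placeholder; read the actual pool name from the vdev label first):

```
# Decompress the image, keeping the .xz file around:
xz -dk 7401-pool-orphaned-files.xz
mv 7401-pool-orphaned-files /tmp/
# Read the vdev label to confirm the pool name:
zdb -l /tmp/7401-pool-orphaned-files
# Dump datasets and objects from the exported, file-backed pool; orphaned
# files appear as allocated objects with no directory entry referencing them:
zdb -e -p /tmp -dddd testpool
```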

I am stepping out for a bit due to an appointment that I cannot preempt, but I just want to point out that those who lost files might still have them around as orphans. We'll need to examine a pool where this happened with files storing actual data to confirm that the data is there. If it is, the data could be recoverable.

Thank you everyone for your help with this unfortunate regression. As described above by @tuxoko, the root cause of this issue is understood, and a complete fix is currently being worked on. In the meanwhile, commit cc63068e95ee725cce03b1b7ce50179825a6cda5, which introduced this issue, will shortly be reverted from the master and release branches, and v0.7.8 will be tagged. We'll open a new PR with the full fix for review and feedback when it's ready.

@behlendorf There are still some loose ends. In particular, how are we going to deal with those affected by this? There could be orphan files in their datasets.

At present, we could tell people to back up the changes between what they have now and the last snapshot taken before the issue happened, roll back, and then restore, provided that they have snapshots at all. If not, the solution at the moment would be to create a new dataset, copy the files over to it, and then destroy the old one.

Neither is as clean a solution as doing something like `zfs lost+found -r tank` and having the orphaned files put into lost+found directories. It gets messier when we consider that orphaned files could be in recently made snapshots.
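A hedged sketch of the first (snapshot-based) recovery path described above, with all names as placeholders (@pre-bug standing for the last snapshot taken before the bug struck):

```
# 1. Copy out the current state (zfs diff tank/data@pre-bug can narrow
#    this down to only the files that changed since the snapshot):
rsync -a /tank/data/ /backup/data-changes/
# 2. Roll the dataset back, discarding the orphan-damaged state
#    (-r also destroys any snapshots made after @pre-bug):
zfs rollback -r tank/data@pre-bug
# 3. Restore the rescued changes:
rsync -a /backup/data-changes/ /tank/data/
```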

The difficulty of reproducing this on non-RHEL-family systems had been a loose end, but it has just been tied up. A change in the bundled gnulib between coreutils 8.22 and 8.23 switched the order in which cp copies directory entries from the pseudo-random readdir order to a sequential one, which is why only distributions shipping the older coreutils hit this in normal use.

Finally, we had something like a dozen people around the world drop everything to work on this. Not all of us are on the same page yet, and it will take some time to sync our understanding so that we can all review the final fix.

I should add that we also need a way to check for the presence of orphans. I have confirmed that zdb can show them, but I have not yet determined what zdb would show in all cases (mainly, for non-zero-length files) to allow reliable detection.
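In the meantime, a rough heuristic sketch for spotting suspect directories, based on the numbers reported earlier in this thread (assumption: on ZFS a directory's st_size is its entry count plus 2, which matches the figures above; this is not the reliable detector discussed here):

```
# Flag directories whose ZFS st_size disagrees with the number of
# entries readdir actually returns (expected: st_size == entries + 2).
# Sketch only: it will misbehave on file names containing newlines.
find /tank -xdev -type d | while read -r d; do
    size=$(stat -c %s "$d")
    entries=$(ls -1A "$d" | wc -l)
    if [ "$size" -ne $((entries + 2)) ]; then
        echo "possible orphan damage: $d (st_size=$size, entries=$entries)"
    fi
done
```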

Our analysis so far has not determined how the additional files, whose zap_add completes after a prior ZAP expansion failure on the directory, end up orphaned.

Our analysis is not finished. I am reopening this pending the completion of our analysis.

Right, I didn't mean to suggest that this issue should be closed or that reverting the change was all that was needed. There's still clearly careful investigation to be done, which we can now focus on.

@ryao when possible, rolling back to a snapshot would be the cleanest way to recover these files. However, since that won't always be an option, let's investigate implementing a generic orphan recovery mechanism. Adding this functionality initially to zdb would allow us to check existing datasets, and it would be nice additional test coverage for ztest to leverage. We could potentially follow this up with support for a .zfs/lost+found directory.

Given the improved understanding of the cause of this regression, can anything be said about the behaviour of rsync? If it reports no errors, are the data fine?

What about mv? And what if mv is from one dataset to another, on the same pool?

@darrenfreeman The mailing list or IRC chatroom would probably be a better place to ask, but

  • rsync should be fine, since _I think_ it bails out on e.g. `rsync -a src/ dst/` as soon as it gets ENOSPC, and does not try any additional files
  • mv across datasets in a pool is just like mv across any other pair of filesystems (a cp followed by an rm), so I would guess it might be subject to the same caveats about version peculiarities as cp above, but I haven't tested that; see the sketch below
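An untested sketch of that cross-dataset check (assumptions: a pool named tank with two datasets a and b; the file count mirrors the reproducer used throughout this thread):

```
# Untested sketch: does a cross-dataset mv (cp + rm under the hood) lose entries?
zfs create tank/a
zfs create tank/b
mkdir /tank/a/SRC
for i in $(seq 1 10000); do echo $i > /tank/a/SRC/$i; done
mv /tank/a/SRC /tank/b/
find /tank/b/SRC -type f | wc -l   # expect 10000 on an unaffected system
```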

Also, one final caveat:

  • knowledge, particularly about how much exposure exists for files that get lost in the metaphorical shuffle after ENOSPC comes back, is incomplete, so it's safest to revert versions (or bump once 0.7.8 is cut) if at all possible; everything above is based on that incomplete information.

rsync always sorts files, so it should be fine. And as long as you don't receive errors, you should be fine.
Since data is not silently lost, this is not a worst-case catastrophic bug, just a major annoyance. The most inconvenient aspect is the orphaned files, but fortunately they are tied to their respective datasets, not to the entire pool, and can be gotten rid of by rolling back or re-creating individual datasets.

Reproducibility: yes
ZoL version: git, recent commit, 10adee27ced279c381816e1321226fce5834340c
Distribution: Ubuntu 17.10
Kernel Version: 4.13.0-38-generic
Coreutils Version: 8.26-3ubuntu4
SELinux status: not installed AFAICT

Reproduced using: ./zap-collision-test.sh

Furthermore, this didn't look good:

rm -Rf DST
Segmentation fault (core dumped)

The pool was freshly created as:

zfs create rpool/test -o recordsize=4k
truncate -s 1G /rpool/test/file
zpool create test /rpool/test/file -o ashift=12

I am trying to install the debug symbols for rm; however, I am now also getting segfaults without even touching this zpool (apt-key is segfaulting when trying to trust the debug repo), so I fear I'd better push the comment button now and reboot :/

Update: I can't reproduce the segfault on rm -Rf DST after rebooting and installing debug symbols.

Thanks for the solutions and the quick efforts to fix this.
Are there any methods for checking a complete filesystem for affected files? I do have backups; can anyone give me a one-liner to list them?

Given this bug has now been listed on The Register (https://www.theregister.co.uk/2018/04/10/zfs_on_linux_data_loss_fixed/), it might be wise to have an FAQ article on the wiki (with a link in this ticket). The FAQ article should clearly state which versions of ZoL are affected and which distros/kernel versions (similar to the hole_birth bug). This would hopefully limit any panicked concerns about the reliability of ZoL as a storage layer.

Given this bug has now been listed on The Register (https://www.theregister.co.uk/2018/04/10/zfs_on_linux_data_loss_fixed/)

From that article (emphasis mine):
"So even though three reviewers signed off on the cruddy commit, the speedy response may mean it’s possible to consider this a triumph of sorts for open source."

Ouch.

I agree with @markdesouza that there should be an FAQ article, so that we ZFS apologists can point anyone who questions us about this to it. I would also like to suggest that the ZFS sign-off procedure be reviewed to prevent (or at least make it far less probable for) such a "cruddy commit" to make it into a ZFS stable release, and that notice of this review also be added to that same FAQ article.

In #7411, the random_creation test looks like it may be a more robust reproducer (especially for future bugs) because it naturally relies on the ordering of the ZAP hashes. Also, if there are other reproducers, it might be a good idea to centralize discussion of them in that PR so they can be easily included.

Answering my earlier question. Debian 9.3 as above.

rsync doesn't hit the bug; it creates files in lexical order (i.e., file 999 is followed by 9990). In a very small number of tests, I didn't find a combination of switches that would fail.
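Purely to illustrate what "lexical order" means here (a snippet added for clarity, not part of the original test):

```
# After lexical sorting, 999 is immediately followed by 9990:
seq 1 10000 | sort | sed -n '/^999$/,+1p'
# 999
# 9990
```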

So anyone who prefers rsync should have a pretty good chance of having missed the bug.

Something similar to mv /pool/dataset1/SRC /pool/dataset2/ (a move between datasets within the same pool) also didn't fail. Although on the same box cp doesn't fail either, so that doesn't prove much.

FYI - you probably all saw it already, but we released zfs-0.7.8 with the reverted patch last night.

@ort163 We do not have a one-liner yet. People are continuing to analyze the issue, and we will have a proper fix in the near future. That will include a way to detect and correct the wrong directory sizes, list the affected snapshots, and place the orphaned files in some kind of lost+found directory. I am leaning toward extending scrub to do it.

@markdesouza I have spent a fair amount of time explaining things to end users on Hacker News, Reddit and Phoronix. I do not think that our understanding is sufficient to post a final FAQ yet, but we could post an interim FAQ.

I think the interim FAQ entry should advise users to upgrade ASAP, to avoid having to deal with orphaned files if nothing has happened yet (or with more orphaned files if something already has), and not to change how they do things after upgrading, unless they deem it necessary, until we finish our analysis, make a proper fix, and issue proper repair instructions in the release notes. I do not think there is any harm to pools whose datasets have incorrect directory sizes and orphaned files while people wait for us to release a proper fix with instructions on how to completely address the issue, so telling them to wait after upgrading should be fine. The orphaned files should stay around and persist through send/recv unless a snapshot rollback is done or the dataset is destroyed.

Until that is up, you could point users to my Hacker News post:

https://news.ycombinator.com/item?id=16797932

Specifically, we need to nail down whether existing files' directory entries can be lost, what (if any) other side effects occur when this is triggered on new file creation, what course of events leads to directory entries disappearing after ENOSPC, how system administrators can detect it, and how they will repair it. Then we should be able to write a proper FAQ entry.

Edit: The first 3 questions are answered satisfactorily in #7421.
