Type | Version/Name
--- | ---
Distribution Name | Scientific Linux
Distribution Version | 6.8
Linux Kernel | 2.6.32-696.23.1.el6.x86_64
Architecture | x86_64
ZFS Version | 0.7.7
SPL Version | 0.7.7
Data loss when copying a directory with a large-ish number of files. For example, cp -r SRC DST with 10000 files in SRC is likely to result in a couple of "cp: cannot create regular file `DST/XXX': No space left on device" error messages, and a few thousand files missing from the listing of the DST directory. (Needless to say, the filesystem being full is not the problem.)
The missing files are missing in the sense that they don't appear in the directory listing, but can be accessed using their name (except for the couple of files for which cp generated "No space left on device" error). For example:
# ls -l DST | grep FOO | wc -l
0
# ls -l DST/FOO
-rw-r--r-- 1 root root 5 Apr 6 14:59 DST/FOO
The contents of DST/FOO are accessible by path (e.g. cat DST/FOO works) and are the same as SRC/FOO. If caches are dropped (echo 3 > /proc/sys/vm/drop_caches) or the machine is rebooted, opening FOO directly by path fails.
ls -ld DST reports N fewer hard links than SRC, where N is the number of files for which cp reported the "No space left on device" error.
Names of missing files are mostly predictable if SRC is small.
Scrub does not find any errors.
I think the problem appeared in 0.7.7, but I am not sure.
# mkdir SRC
# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
# cp -r SRC DST
cp: cannot create regular file `DST/8442': No space left on device
cp: cannot create regular file `DST/2629': No space left on device
# ls -l
total 3107
drwxr-xr-x 2 root root 10000 Apr 6 15:28 DST
drwxr-xr-x 2 root root 10002 Apr 6 15:27 SRC
# find DST -type f | wc -l
8186
# ls -l DST | grep 8445 | wc -l
0
# ls -l DST/8445
-rw-r--r-- 1 root root 5 Apr 6 15:28 DST/8445
# cat DST/8445
8445
# echo 3 > /proc/sys/vm/drop_caches
# cat DST/8445
cat: DST/8445: No such file or directory
# zpool status
pool: tank
state: ONLINE
scan: scrub repaired 0B in 87h47m with 0 errors on Sat Mar 31 07:09:27 2018
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
wwn-0x5000c50085ac4c0f ONLINE 0 0 0
wwn-0x5000c50085acda77 ONLINE 0 0 0
wwn-0x5000c500858db3d7 ONLINE 0 0 0
wwn-0x5000c50085ac9887 ONLINE 0 0 0
wwn-0x5000c50085aca6df ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
wwn-0x5000c500858db743 ONLINE 0 0 0
wwn-0x5000c500858db347 ONLINE 0 0 0
wwn-0x5000c500858db4a7 ONLINE 0 0 0
wwn-0x5000c500858dbb0f ONLINE 0 0 0
wwn-0x5000c50085acaa97 ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
wwn-0x5000c50085accb4b ONLINE 0 0 0
wwn-0x5000c50085acab9f ONLINE 0 0 0
wwn-0x5000c50085ace783 ONLINE 0 0 0
wwn-0x5000c500858db67b ONLINE 0 0 0
wwn-0x5000c50085acb983 ONLINE 0 0 0
raidz1-3 ONLINE 0 0 0
wwn-0x5000c50085ac4fd7 ONLINE 0 0 0
wwn-0x5000c50085acb24b ONLINE 0 0 0
wwn-0x5000c50085ace13b ONLINE 0 0 0
wwn-0x5000c500858db43f ONLINE 0 0 0
wwn-0x5000c500858db61b ONLINE 0 0 0
raidz1-4 ONLINE 0 0 0
wwn-0x5000c500858dbbb7 ONLINE 0 0 0
wwn-0x5000c50085acce7f ONLINE 0 0 0
wwn-0x5000c50085acd693 ONLINE 0 0 0
wwn-0x5000c50085ac3d87 ONLINE 0 0 0
wwn-0x5000c50085acc89b ONLINE 0 0 0
raidz1-5 ONLINE 0 0 0
wwn-0x5000c500858db28b ONLINE 0 0 0
wwn-0x5000c500858db68f ONLINE 0 0 0
wwn-0x5000c500858dbadf ONLINE 0 0 0
wwn-0x5000c500858db623 ONLINE 0 0 0
wwn-0x5000c500858db48b ONLINE 0 0 0
raidz1-6 ONLINE 0 0 0
wwn-0x5000c500858db6ef ONLINE 0 0 0
wwn-0x5000c500858db39b ONLINE 0 0 0
wwn-0x5000c500858db47f ONLINE 0 0 0
wwn-0x5000c500858dbb23 ONLINE 0 0 0
wwn-0x5000c500858db803 ONLINE 0 0 0
logs
zfs-slog ONLINE 0 0 0
spares
wwn-0x5000c500858db463 AVAIL
errors: No known data errors
# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 254T 159T 94.3T - 27% 62% 1.00x ONLINE -
# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
tank 127T 69.0T 11.5T /mnt/tank
tank/jade 661G 69.0T 661G /mnt/tank/jade
tank/simprod 115T 14.8T 115T /mnt/tank/simprod
# zfs get all tank
NAME PROPERTY VALUE SOURCE
tank type filesystem -
tank creation Sat Jan 20 12:11 2018 -
tank used 127T -
tank available 68.9T -
tank referenced 11.6T -
tank compressratio 1.00x -
tank mounted yes -
tank quota none default
tank reservation none default
tank recordsize 128K default
tank mountpoint /mnt/tank local
tank sharenfs off default
tank checksum on default
tank compression off default
tank atime off local
tank devices on default
tank exec on default
tank setuid on default
tank readonly off default
tank zoned off default
tank snapdir hidden default
tank aclinherit restricted default
tank createtxg 1 -
tank canmount on default
tank xattr sa local
tank copies 1 default
tank version 5 -
tank utf8only off -
tank normalization none -
tank casesensitivity sensitive -
tank vscan off default
tank nbmand off default
tank sharesmb off default
tank refquota none default
tank refreservation none default
tank guid 2271746520743372128 -
tank primarycache all default
tank secondarycache all default
tank usedbysnapshots 0B -
tank usedbydataset 11.6T -
tank usedbychildren 116T -
tank usedbyrefreservation 0B -
tank logbias latency default
tank dedup off default
tank mlslabel none default
tank sync standard default
tank dnodesize legacy default
tank refcompressratio 1.00x -
tank written 11.6T -
tank logicalused 128T -
tank logicalreferenced 11.6T -
tank volmode default default
tank filesystem_limit none default
tank snapshot_limit none default
tank filesystem_count none default
tank snapshot_count none default
tank snapdev hidden default
tank acltype off default
tank context none default
tank fscontext none default
tank defcontext none default
tank rootcontext none default
tank relatime off default
tank redundant_metadata all default
tank overlay off default
# zpool get all tank
NAME PROPERTY VALUE SOURCE
tank size 254T -
tank capacity 62% -
tank altroot - default
tank health ONLINE -
tank guid 7056741522691970971 -
tank version - default
tank bootfs - default
tank delegation on default
tank autoreplace on local
tank cachefile - default
tank failmode wait default
tank listsnapshots off default
tank autoexpand off default
tank dedupditto 0 default
tank dedupratio 1.00x -
tank free 94.2T -
tank allocated 160T -
tank readonly off -
tank ashift 0 default
tank comment - default
tank expandsize - -
tank freeing 0 -
tank fragmentation 27% -
tank leaked 0 -
tank multihost off default
tank feature@async_destroy enabled local
tank feature@empty_bpobj active local
tank feature@lz4_compress active local
tank feature@multi_vdev_crash_dump enabled local
tank feature@spacemap_histogram active local
tank feature@enabled_txg active local
tank feature@hole_birth active local
tank feature@extensible_dataset active local
tank feature@embedded_data active local
tank feature@bookmarks enabled local
tank feature@filesystem_limits enabled local
tank feature@large_blocks enabled local
tank feature@large_dnode enabled local
tank feature@sha512 enabled local
tank feature@skein enabled local
tank feature@edonr enabled local
tank feature@userobj_accounting active local
I can confirm the same behavior on a minimal CentOS 7.4 installation (running inside VirtualBox) and the latest ZFS 0.7.7. Please note that it does not happen when copying somewhat bigger files (e.g. the kernel source), so it seems something like a race condition...
; the only changed property was xattr=sa
[root@localhost ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 7.94G 25.3M 7.91G - 0% 0% 1.00x ONLINE -
[root@localhost ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 24.5M 7.67G 3.36M /tank
tank/test 21.0M 7.67G 21.0M /tank/test
; creating the source dir on a XFS filesystem
[root@localhost ~]# cd /root/
[root@localhost ~]# mkdir test
[root@localhost ~]# cd test
[root@localhost ~]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
; copying from XFS to ZFS: no problem at all
[root@localhost ~]# cd /tank/test
[root@localhost test]# cp -r /root/test/SRC/ DST1
[root@localhost test]# cp -r /root/test/SRC/ DST2
[root@localhost test]# cp -r /root/test/SRC/ DST3
[root@localhost test]# find DST1/ | wc -l
10001
[root@localhost test]# find DST2/ | wc -l
10001
[root@localhost test]# find DST3/ | wc -l
10001
; copying from ZFS dataset itself: big troubles!
[root@localhost test]# rm -rf SRC DST1 DST2 DST3
[root@localhost test]# cp -r /root/test/SRC .
[root@localhost test]# cp -r SRC DST1
cp: cannot create regular file ‘DST1/8809’: No space left on device
[root@localhost test]# cp -r SRC DST2
[root@localhost test]# cp -r SRC DST3
cp: cannot create regular file ‘DST3/6507’: No space left on device
[root@localhost test]# find DST1/ | wc -l
10000
[root@localhost test]# find DST2/ | wc -l
10001
[root@localhost test]# find DST3/ | wc -l
8189
; disabling cache: nothing changes (we continue to "lose" files)
[root@localhost test]# zfs set primarycache=none tank
[root@localhost test]# zfs set primarycache=none tank/test
[root@localhost test]# echo 3 > /proc/sys/vm/drop_caches
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001
The problem does NOT appear on ZoL 0.7.6:
; creating the dataset and copying the SRC dir
[root@localhost ~]# zfs create tank/test
[root@localhost ~]# zfs set xattr=sa tank
[root@localhost ~]# zfs set xattr=sa tank/test
[root@localhost ~]# cp -r /root/test/SRC/ /tank/test/
[root@localhost ~]# cd /tank/test/
[root@localhost test]# find SRC/ | wc -l
10001
; more copies
[root@localhost test]# cp -r SRC/ DST
[root@localhost test]# cp -r SRC/ DST1
[root@localhost test]# cp -r SRC/ DST2
[root@localhost test]# cp -r SRC/ DST3
[root@localhost test]# cp -r SRC/ DST4
[root@localhost test]# cp -r SRC/ DST5
[root@localhost test]# find DST | wc -l
10001
[root@localhost test]# find DST1 | wc -l
10001
[root@localhost test]# find DST2 | wc -l
10001
[root@localhost test]# find DST3 | wc -l
10001
[root@localhost test]# find DST4 | wc -l
10001
[root@localhost test]# find DST5 | wc -l
10001
Maybe it can help: here is the output of zdb -dddddddd tank/test 192784 (a "good" DST directory):
Dataset tank/test [ZPL], ID 74, cr_txg 13, 26.5M, 190021 objects, rootbp DVA[0]=<0:5289e00:200> DVA[1]=<0:65289e00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=123L/123P fill=190021 cksum=d622b78d2:50c053a50d0:fca8cd4455d7:2216d160ee7f7d
Object lvl iblk dblk dsize dnsize lsize %full type
192784 2 128K 16K 909K 512 1.02M 100.00 ZFS directory (K=inherit) (Z=inherit)
272 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 64
path /DST16
uid 0
gid 0
atime Sat Apr 7 01:11:29 2018
mtime Sat Apr 7 01:11:31 2018
ctime Sat Apr 7 01:11:31 2018
crtime Sat Apr 7 01:11:29 2018
gen 97
mode 40755
size 10002
parent 34
links 2
pflags 40800000144
SA xattrs: 96 bytes, 1 entries
security.selinux = unconfined_u:object_r:unlabeled_t:s0\000
Fat ZAP stats:
Pointer table:
1024 elements
zt_blk: 0
zt_numblks: 0
zt_shift: 10
zt_blks_copied: 0
zt_nextblk: 0
ZAP entries: 10000
Leaf blocks: 64
Total blocks: 65
zap_block_type: 0x8000000000000001
zap_magic: 0x2f52ab2ab
zap_salt: 0x13c18a19
Leafs with 2^n pointers:
4: 64 ****************************************
Blocks with n*5 entries:
9: 64 ****************************************
Blocks n/10 full:
6: 4 ****
7: 43 ****************************************
8: 16 ***************
9: 1 *
Entries with n chunks:
3: 10000 ****************************************
Buckets with n entries:
0: 24119 ****************************************
1: 7414 *************
2: 1126 **
3: 102 *
4: 7 *
... and zdb -dddddddd tank/test 202785 (a "bad" DST directory):
Dataset tank/test [ZPL], ID 74, cr_txg 13, 26.5M, 190021 objects, rootbp DVA[0]=<0:5289e00:200> DVA[1]=<0:65289e00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=123L/123P fill=190021 cksum=d622b78d2:50c053a50d0:fca8cd4455d7:2216d160ee7f7d
Object lvl iblk dblk dsize dnsize lsize %full type
202785 2 128K 16K 766K 512 896K 100.00 ZFS directory (K=inherit) (Z=inherit)
272 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 55
path /DST17
uid 0
gid 0
atime Sat Apr 7 01:12:49 2018
mtime Sat Apr 7 01:11:33 2018
ctime Sat Apr 7 01:11:33 2018
crtime Sat Apr 7 01:11:32 2018
gen 98
mode 40755
size 10001
parent 34
links 2
pflags 40800000144
SA xattrs: 96 bytes, 1 entries
security.selinux = unconfined_u:object_r:unlabeled_t:s0\000
Fat ZAP stats:
Pointer table:
1024 elements
zt_blk: 0
zt_numblks: 0
zt_shift: 10
zt_blks_copied: 0
zt_nextblk: 0
ZAP entries: 8259
Leaf blocks: 55
Total blocks: 56
zap_block_type: 0x8000000000000001
zap_magic: 0x2f52ab2ab
zap_salt: 0x1bf8e8a3
Leafs with 2^n pointers:
4: 50 ****************************************
5: 3 ***
6: 2 **
Blocks with n*5 entries:
9: 55 ****************************************
Blocks n/10 full:
5: 6 ******
6: 7 *******
7: 32 ********************************
8: 6 ******
9: 4 ****
Entries with n chunks:
3: 8259 ****************************************
Buckets with n entries:
0: 20964 ****************************************
1: 6217 ************
2: 904 **
3: 66 *
4: 9 *
We have also been seeing similar behavior since installing 0.7.7.
I have a hand-built ZoL 0.7.7 on a stock Ubuntu 16.04 server (currently with Ubuntu kernel version '4.4.0-109-generic') and I can't reproduce this problem on it, following the reproduction here and some variants (eg using 'seq -w' to make all of the filenames the same size). The pool I'm testing against has a single mirrored vdev.
One more data point, with the hope that it helps narrow down the issue.
I cannot reproduce the issue on the few machines I have here, neither with 10k files, nor with 100k or even 1M. They all have very similar configurations. They use a single 2-drive mirrored vdev. The drives are Samsung SSD 950 PRO 512GB (NVMe, quite fast).
$ uname -a
Linux pat 4.9.90-gentoo #1 SMP PREEMPT Tue Mar 27 00:19:59 CEST 2018 x86_64 Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz GenuineIntel GNU/Linux
$ qlist -I -v zfs-kmod
sys-fs/zfs-kmod-0.7.7
$ qlist -I -v spl
sys-kernel/spl-0.7.7
$ zpool status
pool: pat:pool
state: ONLINE
scan: scrub repaired 0B in 0h1m with 0 errors on Sat Apr 7 03:35:12 2018
config:
NAME STATE READ WRITE CKSUM
pat:pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme0n1p4 ONLINE 0 0 0
nvme1n1p4 ONLINE 0 0 0
spares
ata-Samsung_SSD_850_EVO_1TB_S2RFNXAH118721D-part8 AVAIL
errors: No known data errors
$ zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pat:pool 408G 110G 298G - 18% 26% 1.00x ONLINE -
$ zpool get all pat:pool
NAME PROPERTY VALUE SOURCE
pat:pool size 408G -
pat:pool capacity 26% -
pat:pool altroot - default
pat:pool health ONLINE -
pat:pool guid 16472389984482033769 -
pat:pool version - default
pat:pool bootfs - default
pat:pool delegation on default
pat:pool autoreplace on local
pat:pool cachefile - default
pat:pool failmode wait default
pat:pool listsnapshots off default
pat:pool autoexpand off default
pat:pool dedupditto 0 default
pat:pool dedupratio 1.00x -
pat:pool free 298G -
pat:pool allocated 110G -
pat:pool readonly off -
pat:pool ashift 12 local
pat:pool comment - default
pat:pool expandsize - -
pat:pool freeing 0 -
pat:pool fragmentation 18% -
pat:pool leaked 0 -
pat:pool multihost off default
pat:pool feature@async_destroy enabled local
pat:pool feature@empty_bpobj active local
pat:pool feature@lz4_compress active local
pat:pool feature@multi_vdev_crash_dump enabled local
pat:pool feature@spacemap_histogram active local
pat:pool feature@enabled_txg active local
pat:pool feature@hole_birth active local
pat:pool feature@extensible_dataset active local
pat:pool feature@embedded_data active local
pat:pool feature@bookmarks enabled local
pat:pool feature@filesystem_limits enabled local
pat:pool feature@large_blocks enabled local
pat:pool feature@large_dnode enabled local
pat:pool feature@sha512 enabled local
pat:pool feature@skein enabled local
pat:pool feature@edonr enabled local
pat:pool feature@userobj_accounting active local
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
(...)
pat:pool/home/joe/tmp 27.9G 285G 27.9G /home/joe/tmp
(...)
$ zfs get all pat:pool/home/joe/tmp
NAME PROPERTY VALUE SOURCE
pat:pool/home/joe/tmp type filesystem -
pat:pool/home/joe/tmp creation Sat Mar 12 17:32 2016 -
pat:pool/home/joe/tmp used 27.9G -
pat:pool/home/joe/tmp available 285G -
pat:pool/home/joe/tmp referenced 27.9G -
pat:pool/home/joe/tmp compressratio 1.16x -
pat:pool/home/joe/tmp mounted yes -
pat:pool/home/joe/tmp quota none default
pat:pool/home/joe/tmp reservation none default
pat:pool/home/joe/tmp recordsize 128K default
pat:pool/home/joe/tmp mountpoint /home/joe/tmp inherited from pat:pool/home
pat:pool/home/joe/tmp sharenfs off default
pat:pool/home/joe/tmp checksum on default
pat:pool/home/joe/tmp compression lz4 inherited from pat:pool
pat:pool/home/joe/tmp atime off inherited from pat:pool
pat:pool/home/joe/tmp devices on default
pat:pool/home/joe/tmp exec on default
pat:pool/home/joe/tmp setuid on default
pat:pool/home/joe/tmp readonly off default
pat:pool/home/joe/tmp zoned off default
pat:pool/home/joe/tmp snapdir hidden default
pat:pool/home/joe/tmp aclinherit restricted default
pat:pool/home/joe/tmp createtxg 507 -
pat:pool/home/joe/tmp canmount on default
pat:pool/home/joe/tmp xattr sa inherited from pat:pool
pat:pool/home/joe/tmp copies 1 default
pat:pool/home/joe/tmp version 5 -
pat:pool/home/joe/tmp utf8only off -
pat:pool/home/joe/tmp normalization none -
pat:pool/home/joe/tmp casesensitivity sensitive -
pat:pool/home/joe/tmp vscan off default
pat:pool/home/joe/tmp nbmand off default
pat:pool/home/joe/tmp sharesmb off default
pat:pool/home/joe/tmp refquota none default
pat:pool/home/joe/tmp refreservation none default
pat:pool/home/joe/tmp guid 10274125767907263189 -
pat:pool/home/joe/tmp primarycache all default
pat:pool/home/joe/tmp secondarycache all default
pat:pool/home/joe/tmp usedbysnapshots 0B -
pat:pool/home/joe/tmp usedbydataset 27.9G -
pat:pool/home/joe/tmp usedbychildren 0B -
pat:pool/home/joe/tmp usedbyrefreservation 0B -
pat:pool/home/joe/tmp logbias latency default
pat:pool/home/joe/tmp dedup off default
pat:pool/home/joe/tmp mlslabel none default
pat:pool/home/joe/tmp sync standard default
pat:pool/home/joe/tmp dnodesize legacy default
pat:pool/home/joe/tmp refcompressratio 1.16x -
pat:pool/home/joe/tmp written 27.9G -
pat:pool/home/joe/tmp logicalused 31.6G -
pat:pool/home/joe/tmp logicalreferenced 31.6G -
pat:pool/home/joe/tmp volmode default default
pat:pool/home/joe/tmp filesystem_limit none default
pat:pool/home/joe/tmp snapshot_limit none default
pat:pool/home/joe/tmp filesystem_count none default
pat:pool/home/joe/tmp snapshot_count none default
pat:pool/home/joe/tmp snapdev hidden default
pat:pool/home/joe/tmp acltype posixacl inherited from pat:pool
pat:pool/home/joe/tmp context none default
pat:pool/home/joe/tmp fscontext none default
pat:pool/home/joe/tmp defcontext none default
pat:pool/home/joe/tmp rootcontext none default
pat:pool/home/joe/tmp relatime off default
pat:pool/home/joe/tmp redundant_metadata all default
pat:pool/home/joe/tmp overlay off default
pat:pool/home/joe/tmp net.c-space:snapshots keep=1M inherited from pat:pool/home/joe
pat:pool/home/joe/tmp net.c-space:root 0 inherited from pat:pool
I get a worse situation on the latest CentOS 7 with kmod:
[root@zirconia test]# mkdir SRC
[root@zirconia test]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
[root@zirconia test]# cp -r SRC DST
cp: cannot create regular file ‘DST/5269’: No space left on device
cp: cannot create regular file ‘DST/9923’: No space left on device
[root@zirconia test]# cat DST/5269
cat: DST/5269: No such file or directory
[root@zirconia test]# cat DST/9923
cat: DST/9923: No such file or directory
[root@zirconia test]# cat DST/9924
9924
[root@zirconia test]# cat DST/9923
cat: DST/9923: No such file or directory
[root@zirconia test]# ls -l DST/9923
ls: cannot access DST/9923: No such file or directory
[root@zirconia test]# zpool status
pool: storage
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30KPM0D ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NJDDD ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NJAHD ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NGXDD ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NJ91D ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30LN7GD ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NJM5D ONLINE 0 0 0
ata-HGST_HUS724020ALA640_PN2134P5GAY9PX ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NJD5D ONLINE 0 0 0
raidz1-3 ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NJD8D ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NJHVD ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30K5PMD ONLINE 0 0 0
raidz1-4 ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30NLZLD ONLINE 0 0 0
ata-Hitachi_HDS723020BLA642_MN1220F30MVW4D ONLINE 0 0 0
ata-HGST_HUS724020ALA640_PN2134P5GBBL9X ONLINE 0 0 0
logs
mirror-5 ONLINE 0 0 0
nvme0n1p1 ONLINE 0 0 0
nvme1n1p1 ONLINE 0 0 0
cache
nvme0n1p2 ONLINE 0 0 0
nvme1n1p2 ONLINE 0 0 0
@rblank Did you use empty files? Please try the following:
mkdir SRC; for i in $(seq 1 10000); do echo -n > SRC/$i; done; find SRC | wc -l
for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done
Thanks.
I used the exact commands from the OP (which create non-empty files), only changing 10000 to 100000 and 1000000. But for completeness, I tried yours as well.
$ mkdir SRC; for i in $(seq 1 10000); do echo -n > SRC/$i; done; find SRC | wc -l
10001
$ for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done
10001
10001
10001
10001
10001
10001
10001
10001
10001
10001
The few data points above weakly hint at raidz, since no one was able to reproduce on mirrors so far.
On one of my pools this works fine, on another it exhibits the problems. Both datasets belong to the same pool.
bash-4.2$ mkdir SRC
bash-4.2$ for i in $(seq 1 10000); do echo $i > SRC/$i ; done
bash-4.2$ cp -r SRC DST
cp: cannot create regular file ‘DST/222’: No space left on device
cp: cannot create regular file ‘DST/6950’: No space left on device
On beast/engineering the above commands run without issue. On beast/dataio they fail.
bash-4.2$ zfs get all beast/engineering
NAME PROPERTY VALUE SOURCE
beast/engineering type filesystem -
beast/engineering creation Sun Nov 5 17:53 2017 -
beast/engineering used 1.85T -
beast/engineering available 12.0T -
beast/engineering referenced 1.85T -
beast/engineering compressratio 1.04x -
beast/engineering mounted yes -
beast/engineering quota none default
beast/engineering reservation none default
beast/engineering recordsize 1M inherited from beast
beast/engineering mountpoint /beast/engineering default
beast/engineering sharenfs on inherited from beast
beast/engineering checksum on default
beast/engineering compression lz4 inherited from beast
beast/engineering atime off inherited from beast
beast/engineering devices on default
beast/engineering exec on default
beast/engineering setuid on default
beast/engineering readonly off default
beast/engineering zoned off default
beast/engineering snapdir hidden default
beast/engineering aclinherit restricted default
beast/engineering createtxg 20615173 -
beast/engineering canmount on default
beast/engineering xattr sa inherited from beast
beast/engineering copies 1 default
beast/engineering version 5 -
beast/engineering utf8only off -
beast/engineering normalization none -
beast/engineering casesensitivity sensitive -
beast/engineering vscan off default
beast/engineering nbmand off default
beast/engineering sharesmb off inherited from beast
beast/engineering refquota none default
beast/engineering refreservation none default
beast/engineering guid 18311947624891459017 -
beast/engineering primarycache metadata local
beast/engineering secondarycache all default
beast/engineering usedbysnapshots 151M -
beast/engineering usedbydataset 1.85T -
beast/engineering usedbychildren 0B -
beast/engineering usedbyrefreservation 0B -
beast/engineering logbias latency default
beast/engineering dedup off default
beast/engineering mlslabel none default
beast/engineering sync disabled inherited from beast
beast/engineering dnodesize auto inherited from beast
beast/engineering refcompressratio 1.04x -
beast/engineering written 0 -
beast/engineering logicalused 1.92T -
beast/engineering logicalreferenced 1.92T -
beast/engineering volmode default default
beast/engineering filesystem_limit none default
beast/engineering snapshot_limit none default
beast/engineering filesystem_count none default
beast/engineering snapshot_count none default
beast/engineering snapdev hidden default
beast/engineering acltype posixacl inherited from beast
beast/engineering context none default
beast/engineering fscontext none default
beast/engineering defcontext none default
beast/engineering rootcontext none default
beast/engineering relatime off default
beast/engineering redundant_metadata all default
beast/engineering overlay off default
beast/engineering com.sun:auto-snapshot true inherited from beast
bash-4.2$ zfs get all beast/dataio
NAME PROPERTY VALUE SOURCE
beast/dataio type filesystem -
beast/dataio creation Fri Oct 13 11:13 2017 -
beast/dataio used 45.0T -
beast/dataio available 12.0T -
beast/dataio referenced 45.0T -
beast/dataio compressratio 1.09x -
beast/dataio mounted yes -
beast/dataio quota none default
beast/dataio reservation none default
beast/dataio recordsize 1M inherited from beast
beast/dataio mountpoint /beast/dataio default
beast/dataio sharenfs on inherited from beast
beast/dataio checksum on default
beast/dataio compression lz4 inherited from beast
beast/dataio atime off inherited from beast
beast/dataio devices on default
beast/dataio exec on default
beast/dataio setuid on default
beast/dataio readonly off default
beast/dataio zoned off default
beast/dataio snapdir hidden default
beast/dataio aclinherit restricted default
beast/dataio createtxg 19156147 -
beast/dataio canmount on default
beast/dataio xattr sa inherited from beast
beast/dataio copies 1 default
beast/dataio version 5 -
beast/dataio utf8only off -
beast/dataio normalization none -
beast/dataio casesensitivity sensitive -
beast/dataio vscan off default
beast/dataio nbmand off default
beast/dataio sharesmb off inherited from beast
beast/dataio refquota none default
beast/dataio refreservation none default
beast/dataio guid 7216940837685529084 -
beast/dataio primarycache all default
beast/dataio secondarycache all default
beast/dataio usedbysnapshots 0B -
beast/dataio usedbydataset 45.0T -
beast/dataio usedbychildren 0B -
beast/dataio usedbyrefreservation 0B -
beast/dataio logbias latency default
beast/dataio dedup off default
beast/dataio mlslabel none default
beast/dataio sync disabled inherited from beast
beast/dataio dnodesize auto inherited from beast
beast/dataio refcompressratio 1.09x -
beast/dataio written 45.0T -
beast/dataio logicalused 49.3T -
beast/dataio logicalreferenced 49.3T -
beast/dataio volmode default default
beast/dataio filesystem_limit none default
beast/dataio snapshot_limit none default
beast/dataio filesystem_count none default
beast/dataio snapshot_count none default
beast/dataio snapdev hidden default
beast/dataio acltype posixacl inherited from beast
beast/dataio context none default
beast/dataio fscontext none default
beast/dataio defcontext none default
beast/dataio rootcontext none default
beast/dataio relatime off default
beast/dataio redundant_metadata all default
beast/dataio overlay off default
beast/dataio com.sun:auto-snapshot false local
I think the issue is related to primarycache=all. If I set a pool to have primarycache=metadata there are no errors.
@rblank I replicated the issue with a simple, single-vdev pool. I'll try and report back with mirror, anyway.
@alatteri What pool/vdev layout do you use? Can you show zpool status on both machines? I tried with primarycache=none and it failed, albeit with much lower frequency (i.e. it failed after the 5th copy). I'll try with primarycache=metadata.
Same machine, different datasets on the same pool.
beast: /nfs/beast/home/alan % zpool status
pool: beast
state: ONLINE
scan: scrub canceled on Fri Mar 2 16:47:01 2018
config:
NAME STATE READ WRITE CKSUM
beast ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHN5M1X ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHN5NPX ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHNP9BX ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHN6M4Y ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHNPBLX ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHKY7PX ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG1G8SL ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG1BVVL ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG13K0L ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG1GA9L ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG1G9YL ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG6D9ZS ONLINE 0 0 0
raidz2-2 ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG68U3S ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG2WW7S ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHMHVGY ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHKRYUX ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NAHKXMKX ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCG5ZYKS ONLINE 0 0 0
raidz2-3 ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGSM01S ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGSY9HS ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGTHJUS ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGTKV1S ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGTMN4S ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGTGTLS ONLINE 0 0 0
raidz2-4 ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGTKUWS ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGTG3YS ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGTLYZS ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGSZ2GS ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGSV93S ONLINE 0 0 0
ata-HGST_HDN726060ALE610_NCGT04NS ONLINE 0 0 0
raidz2-5 ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HHZGSB ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1GTE6HD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1GU06VD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1GS5KNF ONLINE 0 0 0
ata-HGST_HDN726060ALE614_NCHA3DZS ONLINE 0 0 0
ata-HGST_HDN726060ALE614_NCHAE5JS ONLINE 0 0 0
raidz2-6 ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HJ21DB ONLINE 0 0 0
ata-HGST_HDN726060ALE614_NCH9WUXS ONLINE 0 0 0
ata-HGST_HDN726060ALE614_NCHAXNTS ONLINE 0 0 0
ata-HGST_HDN726060ALE614_NCHA0DLS ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HJG72B ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HHX19B ONLINE 0 0 0
cache
nvme0n1 ONLINE 0 0 0
errors: No known data errors
pool: pimplepaste
state: ONLINE
scan: scrub repaired 0B in 2h38m with 0 errors on Mon Mar 19 00:17:45 2018
config:
NAME STATE READ WRITE CKSUM
pimplepaste ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVHTBD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVHVSD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVHT1D ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HUYA5D ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVDPMD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZAZDD ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVATKD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZB0ND ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HY6LYD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JT32KD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVAGVD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZBL5D ONLINE 0 0 0
raidz2-2 ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HWZ1AD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZAYJD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZ8YMD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVDN8D ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZAKPD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HWZ2ZD ONLINE 0 0 0
raidz2-3 ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZAX7D ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVHD8D ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVG6ND ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HW7VBD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1HZBHMD ONLINE 0 0 0
ata-HGST_HDN726060ALE614_K1JVB2SD ONLINE 0 0 0
errors: No known data errors
@vbrik what's the HW config of this system - how much RAM, what model of x86_64 CPU?
I can confirm this bug on a mirrored zpool. It is a production system so I didn't do much testing before downgrading to 0.7.6:
pool: ssdzfs-array
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable. [it is at the 0.6.5.11 features level]
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub repaired 0B in 0h16m with 0 errors on Sun Apr 1 01:46:59 2018
config:
NAME STATE READ WRITE CKSUM
ssdzfs-array ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-XXXX-enc ONLINE 0 0 0
ata-YYYY-enc ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
ata-ZZZZ-enc ONLINE 0 0 0
ata-QQQQ-enc ONLINE 0 0 0
errors: No known data errors
$zfs create ssdzfs-array/tmp
$(run test as previously described; fails about 1/2 the time)
$uname -a
Linux MASKED 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I have attempted to reproduce the bug on 0.7.6 without success. Here is an excerpt of one of the processor feature levels:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
stepping : 5
microcode : 0x19
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid dtherm ida
bogomips : 5333.51
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
[ 1.121288] microcode: CPU3 sig=0x106a5, pf=0x2, revision=0x19
I still get it with primarycache=metadata, on the first attempt to cp:
[root@zirconia ~]# zfs set primarycache=metadata storage/rhev
[root@zirconia ~]# cd /storage/rhev/
[root@zirconia rhev]# ls
export test
[root@zirconia rhev]# cd test/
[root@zirconia test]# rm -rf DST
[root@zirconia test]# rm -rf SRC/*
[root@zirconia test]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
[root@zirconia test]# cp -r SRC DST
cp: cannot create regular file ‘DST/5269’: No space left on device
cp: cannot create regular file ‘DST/3759’: No space left on device
For those that have upgraded to the 0.7.7 branch - is it advisable to downgrade back to 0.7.6 until this regression is resolved?
What is the procedure to downgrade ZFS on CentOS 7.4?
For reverts, I usually do:
$ yum history (identify transaction that installed 0.7.7 over 0.7.6; yum history info XXX can be used to confirm)
$ yum history undo XXX (where XXX is the transaction number identified in the previous step)
Note that with dkms installs, after reverts, I usually find I need to:
$ dkms remove zfs/0.7.6 -k `uname -r`
$ dkms remove spl/0.7.6 -k `uname -r`
$ dkms install spl/0.7.6 -k `uname -r` --force
$ dkms install zfs/0.7.6 -k `uname -r` --force
To make sure all modules are actually happy and loadable on reboot.
Is this seen with rsync instead of cp?
I'm not able to reproduce this, and I have several machines (Debian unstable; 0.7.7, Linux 4.15). Can people also include uname -srvmo? Maybe the kernel version is playing a role?
Linux 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) x86_64 GNU/Linux
Ok, I've done some more tests.
System is CentOS 7.4 x86-64 with the latest available kernel.
On an Ubuntu Server 16.04 LTS with self-compiled 0.7.7 spl+zfs (so not using the repository version), I cannot reproduce the error. As a side note, compiling on Ubuntu does not give any warnings.
So, the problem seems confined to CentOS/RHEL territory. To me, it looks like a timing/race problem (possibly related to the ARC): anything which increases copy time lowers the error probability/frequency. Some examples of actions which lower the failure rate:
- cp -a (it copies file attributes)

[1] Compilation gives the following warnings:
/usr/src/zfs-0.7.7/module/zcommon/zfs_fletcher_avx512.o: warning: objtool: fletcher_4_avx512f_byteswap()+0x4e: can't find jump dest instruction at .text+0x171
/usr/src/zfs-0.7.7/module/zfs/vdev_raidz_math_avx512f.o: warning: objtool: mul_x2_2()+0x24: can't find jump dest instruction at .text+0x39
/usr/src/zfs-0.7.7/module/zfs/vdev_raidz_math_avx512bw.o: warning: objtool: raidz_zero_abd_cb()+0x33: can't find jump dest instruction at .text+0x3d
@shodanshok I'm sorry, I'm having a lot of trouble tracking this piece of information down. What Linux kernel version is CentOS 7.4 on? I assume this is with kernel-3.10.0-693.21.1.el7.x86_64.
Is anyone experiencing this issue with "recent" mainline kernels (like 4.x)?
Greetings,
I have mirrors with the same problem.
Scientific Linux 7.4 (fully updated)
zfs-0.7.7 from zfsonlinux.org repos
$ uname -srvmo
Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 13:12:24 CST 2018 x86_64 GNU/Linux
The output of my yum install:
Running transaction
Installing : kernel-devel-3.10.0-693.21.1.el7.x86_64 1/10
Installing : kernel-headers-3.10.0-693.21.1.el7.x86_64 2/10
Installing : glibc-headers-2.17-196.el7_4.2.x86_64 3/10
Installing : glibc-devel-2.17-196.el7_4.2.x86_64 4/10
Installing : gcc-4.8.5-16.el7_4.2.x86_64 5/10
Installing : dkms-2.4.0-1.20170926git959bd74.el7.noarch 6/10
Installing : spl-dkms-0.7.7-1.el7_4.noarch 7/10
Loading new spl-0.7.7 DKMS files...
Building for 3.10.0-693.21.1.el7.x86_64
Building initial module for 3.10.0-693.21.1.el7.x86_64
Done.
spl:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/spl/spl/
splat.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/splat/splat/
Adding any weak-modules
depmod....
DKMS: install completed.
Installing : zfs-dkms-0.7.7-1.el7_4.noarch 8/10
Loading new zfs-0.7.7 DKMS files...
Building for 3.10.0-693.21.1.el7.x86_64
Building initial module for 3.10.0-693.21.1.el7.x86_64
Done.
zavl:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/avl/avl/
znvpair.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/nvpair/znvpair/
zunicode.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/unicode/zunicode/
zcommon.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zcommon/zcommon/
zfs.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zfs/zfs/
zpios.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zpios/zpios/
icp.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/icp/icp/
Adding any weak-modules
depmod....
DKMS: install completed.
Installing : spl-0.7.7-1.el7_4.x86_64 9/10
Installing : zfs-0.7.7-1.el7_4.x86_64 10/10
Verifying : dkms-2.4.0-1.20170926git959bd74.el7.noarch 1/10
Verifying : zfs-dkms-0.7.7-1.el7_4.noarch 2/10
Verifying : zfs-0.7.7-1.el7_4.x86_64 3/10
Verifying : spl-0.7.7-1.el7_4.x86_64 4/10
Verifying : kernel-devel-3.10.0-693.21.1.el7.x86_64 5/10
Verifying : glibc-devel-2.17-196.el7_4.2.x86_64 6/10
Verifying : kernel-headers-3.10.0-693.21.1.el7.x86_64 7/10
Verifying : gcc-4.8.5-16.el7_4.2.x86_64 8/10
Verifying : spl-dkms-0.7.7-1.el7_4.noarch 9/10
Verifying : glibc-headers-2.17-196.el7_4.2.x86_64 10/10
Installed:
zfs.x86_64 0:0.7.7-1.el7_4
Dependency Installed:
dkms.noarch 0:2.4.0-1.20170926git959bd74.el7 gcc.x86_64 0:4.8.5-16.el7_4.2
glibc-devel.x86_64 0:2.17-196.el7_4.2 glibc-headers.x86_64 0:2.17-196.el7_4.2
kernel-devel.x86_64 0:3.10.0-693.21.1.el7 kernel-headers.x86_64 0:3.10.0-693.21.1.el7
spl.x86_64 0:0.7.7-1.el7_4 spl-dkms.noarch 0:0.7.7-1.el7_4
zfs-dkms.noarch 0:0.7.7-1.el7_4
Complete!
I am using rsnapshot to do backups. It is when it runs the equivalent of the command below that issues come up.
$ /usr/bin/cp -al /bkpfs/Rsnapshot/hourly.0 /bkpfs/Rsnapshot/hourly.1
/usr/bin/cp: cannot create hard link ‘/bkpfs/Rsnapshot/hourly.1/System/home/user/filename’ to ‘/bkpfs/Rsnapshot/hourly.0/System/home/user/filename’: No space left on device
There's plenty of space:
$ df -h /bkpfs/
Filesystem Size Used Avail Use% Mounted on
bkpfs 5.0T 4.2T 776G 85% /bkpfs
$ df -i /bkpfs/
Filesystem Inodes IUsed IFree IUse% Mounted on
bkpfs 1631487194 5614992 1625872202 1% /bkpfs
zpool iostat -v bkpfs
capacity operations bandwidth
pool alloc free read write read write
---------------------------------------------- ----- ----- ----- ----- ----- -----
bkpfs 4.52T 950G 9 5 25.4K 117K
mirror 1.84T 912G 4 3 22.0K 94.7K
ata-Hitachi_HUA723030ALA640 - - 2 1 11.2K 47.4K
ata-Hitachi_HUA723030ALA640 - - 2 1 10.8K 47.4K
mirror 2.68T 37.3G 4 2 3.46K 22.2K
ata-Hitachi_HUA723030ALA640 - - 2 1 1.71K 11.1K
ata-Hitachi_HUA723030ALA640 - - 2 1 1.75K 11.1K
cache - - - - - -
ata-INTEL_SSDSC2BW120H6 442M 111G 17 0 9.48K 10.0K
---------------------------------------------- ----- ----- ----- ----- ----- -----
zpool status
pool: bkpfs
state: ONLINE
scan: scrub repaired 0B in 11h17m with 0 errors on Sun Apr 1 05:34:09 2018
config:
NAME STATE READ WRITE CKSUM
bkpfs ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-Hitachi_HUA723030ALA640 ONLINE 0 0 0
ata-Hitachi_HUA723030ALA640 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
ata-Hitachi_HUA723030ALA640 ONLINE 0 0 0
ata-Hitachi_HUA723030ALA640 ONLINE 0 0 0
cache
ata-INTEL_SSDSC2BW120H6 ONLINE 0 0 0
errors: No known data errors
zfs get all bkpfs
NAME PROPERTY VALUE SOURCE
bkpfs type filesystem -
bkpfs creation Fri Dec 22 10:34 2017 -
bkpfs used 4.52T -
bkpfs available 776G -
bkpfs referenced 4.19T -
bkpfs compressratio 1.00x -
bkpfs mounted yes -
bkpfs quota none default
bkpfs reservation none default
bkpfs recordsize 128K default
bkpfs mountpoint /bkpfs default
bkpfs sharenfs off default
bkpfs checksum on default
bkpfs compression off default
bkpfs atime on default
bkpfs devices on default
bkpfs exec on default
bkpfs setuid on default
bkpfs readonly off default
bkpfs zoned off default
bkpfs snapdir hidden default
bkpfs aclinherit restricted default
bkpfs createtxg 1 -
bkpfs canmount on default
bkpfs xattr on default
bkpfs copies 1 default
bkpfs version 5 -
bkpfs utf8only off -
bkpfs normalization none -
bkpfs casesensitivity sensitive -
bkpfs vscan off default
bkpfs nbmand off default
bkpfs sharesmb off default
bkpfs refquota none default
bkpfs refreservation none default
bkpfs guid 8662648373298485368 -
bkpfs primarycache all default
bkpfs secondarycache all default
bkpfs usedbysnapshots 334G -
bkpfs usedbydataset 4.19T -
bkpfs usedbychildren 234M -
bkpfs usedbyrefreservation 0B -
bkpfs logbias latency default
bkpfs dedup off default
bkpfs mlslabel none default
bkpfs sync standard default
bkpfs dnodesize legacy default
bkpfs refcompressratio 1.00x -
bkpfs written 1.38T -
bkpfs logicalused 4.51T -
bkpfs logicalreferenced 4.18T -
bkpfs volmode default default
bkpfs filesystem_limit none default
bkpfs snapshot_limit none default
bkpfs filesystem_count none default
bkpfs snapshot_count none default
bkpfs snapdev hidden default
bkpfs acltype off default
bkpfs context none default
bkpfs fscontext none default
bkpfs defcontext none default
bkpfs rootcontext none default
bkpfs relatime off default
bkpfs redundant_metadata all default
bkpfs overlay off default
For those who want to know my hardware, the system is an AMD X2 255 processor with 8GB of memory (so far more than enough for my home backup system).
I can revert today, or I can help test if someone needs me to try something. Just let me know.
Thanks!
Can someone who can repro this try bisecting the changes between 0.7.6 and 0.7.7 so we can see which commit breaks people?
Most likely https://github.com/zfsonlinux/zfs/commit/cc63068e95ee725cce03b1b7ce50179825a6cda5, seems to be a race condition in the mzap->fzap upgrade phase.
@loli10K this, uh, seems horrendous enough that unless someone volunteers a fix for the race Real Fast, a revert and cutting a point release for this alone seems like it would be merited, to me at least.
@rincebrain I can try later today. I'm meeting some friends for lunch and will be gone for a few hours but I'm happy to help how I can when I get back.
[Edit] To try to bisect the changes that is. :-)
@cstackpole if you do, it's probably worth trying with and without the commit @loli10K pointed to, rather than letting the bisect naturally find it.
From what we have seen so far it certainly seems to only affect older (by which I mean lower-versioned) kernels. I have not been able to reproduce the issue on Linux 4.15 (Fedora).
@aerusso
[root@localhost test]# uname -a
Linux localhost.localdomain 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
@loli10K any clue why it affects 3.x kernels only, while 4.x seems immune?
BTW, I bisected it, and couldn't repro it on CentOS 7 with 3.10.0-693.21.1 on eb9c453 but could on cc63068, so that does appear to be the cause.
I haven't done any testing yet, but I very much appreciate the speed at which you've found the commit, rincebrain! Since seeing this issue raised, I've been quite nervous, and I don't yet know if I'm affected.
% uname -srvmo
Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 GNU/Linux
Since this seems to be FRAME_POINTER-specific (unless anyone's got a counter-example), I would guess this is #5041 2.0: Electric Boogaloo
Thanks @rincebrain for confirming!
Since this is just my personal-at-home system, I don't mind leaving it in its reproducible state if anyone wants me to test something later in the week.
@kpande Yes, I've been following this one but haven't looked into it at all. Has this for sure been narrowed down to cc63068e95ee725cce03b1b7ce50179825a6cda5? This is clearly something that has to get fixed right away.
@dweeezil I couldn't readily repro it on CentOS 7 x86_64 on the commit before cc63068, and could easily repro it on cc63068, same SPL both times.
cc63068 sets a limit on the number of times zap_add will try to expand (split) a directory's zap leaf blocks when adding a new entry would overflow an existing leaf.
The limit (2) is sufficient for handling a colliding name when casesensitivity=sensitive is set, but it appears to bail out too early (with ENOSPC) when the zap for the directory grows past a certain size (possibly also due to leaf hash collisions). When zap_add fails, it rolls back the transaction, so the znode for the new file is removed.
So far, this is undesirable but doesn't result in data loss per se, since the system just refuses to create new files with "No space left on device".
My hypothesis is that subsequent zap_adds are successful because the directory's zap has already grown (as long as one or two additional leaf splits are sufficient to fit the new entry), but the subsequent zap expansions are being discarded due to a side effect of the previous rollback (possibly closing the transaction there). The VFS page cache still reflects the new files, but they're not present in the ARC (or committed to disk), hence flushing the page cache makes them go away. It's not clear whether the znodes for the files are leaked as a result (unlinked from the directory but still present) or whether they're also being discarded.
I have masked 0.7.7 in Gentoo based on this issue.
https://bugs.gentoo.org/652828
I have cleared my schedule for tomorrow so that I have time to spend on this. I'd say more, but this blind sided me and it is too late at night for me to start looking into it now.
Ok, so the expand retry limit of 2 is not enough. In fact, there shouldn't be a limit at all until we hit the limit of the ZAP itself.
The reason you can create a directory ZAP with a lot of files but cannot copy it is that, when you create files, you create them randomly in terms of hash value. However, if you copy files from one directory to another, you create them sequentially in terms of hash value. That means that if the source directory expanded its leaves 6 times, you need to expand the destination leaves 6 times in one go.
One thing to note is that we do use a different salt for each directory, so theoretically a strong enough salt should prevent this from happening. This shows that the current salt is not strong enough.
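To make the ordering effect concrete, here is a toy simulation (not ZFS code: the 16-bit hashes, 4-entry leaves, and one-bit split rule are simplified stand-ins for the fat ZAP). It inserts the same set of distinct hashes once in random order and once in hash-sorted order, and reports the worst number of consecutive leaf splits any single insert required. The sorted pass should show a noticeably higher worst case, well past the 2 splits allowed by the MAX_EXPAND_RETRIES limit from cc63068, which is where the ENOSPC comes from:

```c
/* toy_zap_split.c -- illustrate why hash-ordered inserts can need many
 * consecutive leaf splits.  Toy model only, NOT ZFS code. */
#include <stdio.h>
#include <stdlib.h>

#define HASH_BITS 16  /* toy hash width; real ZAP hashes are 64-bit   */
#define LEAF_CAP   4  /* toy capacity; real leaves hold many entries  */
#define NKEYS    256

/* A leaf covers every hash sharing its first 'depth' bits. */
struct leaf {
	int depth;
	int count;
	unsigned keys[LEAF_CAP];
	struct leaf *child[2];
};

static struct leaf *leaf_new(int depth)
{
	struct leaf *l = calloc(1, sizeof (*l));
	l->depth = depth;
	return (l);
}

static struct leaf *descend(struct leaf *l, unsigned hash)
{
	while (l->child[0] != NULL)
		l = l->child[(hash >> (HASH_BITS - 1 - l->depth)) & 1];
	return (l);
}

/* Insert one hash; return how many splits this single insert forced. */
static int insert(struct leaf *root, unsigned hash)
{
	struct leaf *l = descend(root, hash);
	int splits = 0;

	while (l->count == LEAF_CAP) {
		/* Full leaf: split on the next hash bit (cf. zap_expand_leaf)
		 * and push the resident entries down one level. */
		l->child[0] = leaf_new(l->depth + 1);
		l->child[1] = leaf_new(l->depth + 1);
		for (int i = 0; i < LEAF_CAP; i++) {
			int bit = (l->keys[i] >> (HASH_BITS - 1 - l->depth)) & 1;
			struct leaf *c = l->child[bit];
			c->keys[c->count++] = l->keys[i];
		}
		l->count = 0;
		splits++;
		l = descend(l, hash);	/* may still be full: split again */
	}
	l->keys[l->count++] = hash;
	return (splits);
}

static int cmp(const void *a, const void *b)
{
	unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
	return ((x > y) - (x < y));
}

static int run(unsigned *h)
{
	struct leaf *root = leaf_new(0);	/* leaks; throwaway demo */
	int worst = 0;

	for (int i = 0; i < NKEYS; i++) {
		int s = insert(root, h[i]);
		if (s > worst)
			worst = s;
	}
	return (worst);
}

int main(void)
{
	unsigned h[NKEYS];

	/* Distinct pseudo-random hashes: random high byte, unique low byte. */
	srandom(42);
	for (int i = 0; i < NKEYS; i++)
		h[i] = ((unsigned)(random() & 0xff) << 8) | (unsigned)i;

	/* Random arrival order: how new files are usually created. */
	printf("random order: worst splits for one insert = %d\n", run(h));

	/* Hash-sorted arrival order: what replaying readdir() order does. */
	qsort(h, NKEYS, sizeof (h[0]), cmp);
	printf("sorted order: worst splits for one insert = %d\n", run(h));
	return (0);
}
```

In the sorted pass the first leaf fills with the smallest hashes, which share many leading bits, so a single insert has to split the same region over and over before the new entry fits; with random arrival the residents usually scatter after one split.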
To remove the expand limit, try removing this if block.
https://github.com/zfsonlinux/zfs/blob/cc63068e95ee725cce03b1b7ce50179825a6cda5/module/zfs/zap.c#L861
The files going missing afterward is a strange issue. I'll have to investigate to see what happened. I don't think there's any transaction rollback in the error path.
Getting rid of the limit doesn't panic the box when running the casenorm ZTS group and seems to prevent this issue:
@@ -855,15 +855,6 @@ retry:
if (err == 0) {
zap_increment_num_entries(zap, 1, tx);
} else if (err == EAGAIN) {
- /*
- * If the last two expansions did not help, there is no point
- * trying to expand again
- */
- if (expand_retries > MAX_EXPAND_RETRIES && prev_l == l) {
- err = SET_ERROR(ENOSPC);
- goto out;
- }
-
err = zap_expand_leaf(zn, l, tag, tx, &l);
zap = zn->zn_zap; /* zap_expand_leaf() may change zap */
if (err == 0) {
[root@centos ~]# lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.4.1708 (Core)
Release: 7.4.1708
Codename: Core
[root@centos ~]# uname -r
3.10.0-693.21.1.el7.x86_64
[root@centos ~]# cat /sys/module/zfs/version
0.7.7-1
[root@centos ~]# while :; do
> zpool destroy testpool
> zpool create testpool -f -O xattr=dir -O atime=off -O mountpoint=none -O recordsize=1M /dev/vdb
> zfs create testpool/src -o mountpoint=/mnt
> zfs create testpool/dst -o mountpoint=/mnt/DST
> mkdir /mnt/SRC; for i in $(seq 1 10000); do echo -n > /mnt/SRC/$i; done;
> printf "$(find /mnt/SRC -type f | wc -l) -> "
> cp -r /mnt/SRC /mnt/DST
> echo "$(find /mnt/DST -type f | wc -l)"
> done
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
10000 -> 10000
^C
[root@centos ~]#
...
[root@centos ~]# sudo -u nobody -s /usr/share/zfs/zfs-tests.sh -d /var/tmp -T casenorm
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/setup (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/case_all_values (run as root) [00:00] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/norm_all_values (run as root) [00:01] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/mixed_create_failure (run as root) [00:10] [PASS]
Test: /usr/share/zfs/zfs-tests/tests/functional/casenorm/cleanup (run as root) [00:00] [PASS]
Results Summary
PASS 5
Running Time: 00:00:12
Percent passed: 100.0%
Log directory: /var/tmp/test_results/20180401T016189
[root@centos ~]#
Now testing kernel 3.10.x on Debian 8 with the same Kconfig from the previous CentOS 7 box ... EDIT: Debian stays strong and does not seem to be affected running 3.10.108.
I can confirm the ENOSPC (No space left on device) is coming from fzap_add_cd when we hit the retry limit, running the reproducer under the following stap script:
probe
module("zfs").function("zap_leaf_split").call,
module("zfs").function("fzap_add_cd").call,
module("zfs").function("mzap_upgrade").call,
module("zfs").function("zap_entry_create").call,
module("zfs").function("zap_expand_leaf").call
{
printf(" %s -> %s\n", symname(caller_addr()), ppfunc());
}
probe
module("zfs").function("zap_leaf_split").return,
module("zfs").function("fzap_add_cd").return,
module("zfs").function("mzap_upgrade").return,
module("zfs").function("zap_entry_create").return,
module("zfs").function("zap_expand_leaf").return
{
printf(" %s <- %s %s\n", symname(caller_addr()), ppfunc(), $$return$);
}
probe
module("zfs").statement("fzap_add_cd@module/zfs/zap.c:867")
{
printf(" * %s <- %s expand_retries=%s\n", symname(caller_addr()), ppfunc(), $expand_retries$$);
}
Relevant output:
fzap_add_cd -> zap_entry_create
0xffffffff816b9459 <- zap_entry_create return=11
Well, I could not reproduce this running the CentOS 7 kernel on Debian 8, but I could when using the CentOS cp:
On CentOS7, testing also with cp from Debian8:
[root@centos ~]# while :; do
> zpool destroy testpool
> zpool create testpool -f -O xattr=dir -O atime=off -O mountpoint=none -O recordsize=1M /dev/vdb
> zfs create testpool/src -o mountpoint=/mnt
> zfs create testpool/dst -o mountpoint=/mnt/DST
> mkdir /mnt/SRC; for i in $(seq 1 10000); do echo -n > /mnt/SRC/$i; done;
> ./debian-cp -r /mnt/SRC /mnt/DST-debian
> cp -r /mnt/SRC /mnt/DST-centos
> done
cp: cannot create regular file ‘/mnt/DST-centos/4143’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/1970’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/5654’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/5945’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/2740’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/3659’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/2070’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/5183’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/7715’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/8593’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/9654’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/1064’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/2862’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/6636’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/865’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/6090’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/6066’: No space left on device
cp: cannot create regular file ‘/mnt/DST-centos/9233’: No space left on device
^C
[root@centos ~]# lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.4.1708 (Core)
Release: 7.4.1708
Codename: Core
[root@centos ~]# rpm -qa coreutils
coreutils-8.22-18.el7.x86_64
[root@centos ~]#
On Debian8, with cp from CentOS7:
root@linux:~# while :; do
> zpool destroy testpool
> zpool create testpool -f -O xattr=dir -O atime=off -O mountpoint=none -O recordsize=1M /dev/vdb
> zfs create testpool/src -o mountpoint=/mnt
> zfs create testpool/dst -o mountpoint=/mnt/DST
> mkdir /mnt/SRC; for i in $(seq 1 10000); do echo -n > /mnt/SRC/$i; done;
> cp -r /mnt/SRC /mnt/DST-debian
> ./centos-cp -r /mnt/SRC /mnt/DST-centos
> done
./centos-cp: cannot create regular file ‘/mnt/DST-centos/5423’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/8558’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/4338’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/3524’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/4601’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/9311’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/7348’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/3211’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/8768’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/6951’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/4538’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/7596’: No space left on device
./centos-cp: cannot create regular file ‘/mnt/DST-centos/7539’: No space left on device
^C
root@linux:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 8.0 (jessie)
Release: 8.0
Codename: jessie
root@linux:~# dpkg -l coreutils
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============================================-============================-============================-==================================================================================================
ii coreutils 8.23-4 amd64 GNU core utilities
root@linux:~#
We may need to find a better reproducer than "cp" for the regression test proposed in #7411.
This update is still being offered when RHEL-based systems do a "yum update"; given the serious nature of this bug, should the update not be pulled, leaving 0.7.6 as the latest available version?
Today is a day when I'm EXTREMELY glad I have ZFS/SPL updates blocked and do them manually during designated downtime windows or otherwise more convenient times.
@flynnjFIU @behlendorf is the maintainer for RHEL-based systems and he just got into the office. He likely does not even know about this yet. I'll give him a call to let him know so that he can take the update out of the RPM repository. Thanks for pointing it out.
@perfinion pointed out to me in IRC that this may be reproducible mainly on RHEL-based systems because they use xattr=sa to speed up SELinux's handling of filesystem labels, so xattr=sa might be related. I had a late start on this today, so I am not certain either way at this point, but he made a good point that the interaction with xattr=sa should be considered.
@ryao same problem occurs with xattr=on.
@flynnjFIU I spoke to Brian. He just learned about this in something like the past hour. The tentative plan is to pull 0.7.7 from the RPM repository and push out 0.7.8 with a revert of cc63068e95ee725cce03b1b7ce50179825a6cda5. He is going to have a chat with @tonyhutter before he finalizes the plan to deal with this.
@vbrik Thanks for that information. That helps narrow things down. :)
@ryao Is there any risk that data created with 0.7.7 on CentOS will be corrupted/disappear with the fix in 0.7.8??
@alatteri My tentative understanding is that if ENOSPC did not occur, the data should be fine. I suggest downgrading to 0.7.6 for the time being though.
Would people who can/cannot reproduce this issue post the following information about the systems tested?
For those who need them, here are links to the RPM packages for coreutils on CentOS 6 and CentOS 7:
https://centos.pkgs.org/6/centos-x86_64/coreutils-8.4-46.el6.x86_64.rpm.html
https://centos.pkgs.org/7/centos-x86_64/coreutils-8.22-18.el7.x86_64.rpm.html
They contain the cp used on CentOS. Instructions on how to extract them are here:
https://www.cyberciti.biz/tips/how-to-extract-an-rpm-package-without-installing-it.html
Compiler: gcc version 6.4.0 (Gentoo Hardened 6.4.0-r1 p1.3)
uname -a: Linux baraddur 4.16.0-gentoo #1 SMP PREEMPT Wed Apr 4 12:18:23 +08 2018 x86_64 AMD Ryzen Threadripper 1950X 16-Core Processor AuthenticAMD GNU/Linux
distro: gentoo hardened selinux
ZFS kmod from HEAD: Loaded module v0.7.0-403_g1724eb62
SELinux enforcing and permissive both hit it
gentoo cp 8.28-r1 binary: can't repro even with 100k files
debian 8 8.26 binary: also can't repro
centos7 8.22 binary: hits it instantly
Reproducibility: yes
ZoL version: zfs-0.7.7-1.el6.x86_64
Distribution name and version: Scientific Linux 6.8
Kernel Version: 2.6.32-696.23.1.el6.x86_64
Coreutils Version: coreutils-8.4-46.el6.x86_64
SELinux status: off
Reproducibility: no
Distribution name and version: Arch Linux
ZoL version:
local/spl-linux-git 2018.04.04.r1070.581bc01.4.15.15.1-1 (archzfs-linux-git)
local/spl-utils-common-git 2018.04.04.r1070.581bc01-1 (archzfs-linux-git)
local/zfs-linux-git 2018.04.04.r3402.533ea0415.4.15.15.1-1 (archzfs-linux-git)
local/zfs-utils-common-git 2018.04.04.r3402.533ea0415-1 (archzfs-linux-git)
This is a ZFS build from commit 533ea0415.
Kernel Version: Linux kiste 4.15.15-1-ARCH #1 SMP PREEMPT Sat Mar 31 23:59:25 UTC 2018 x86_64 GNU/Linux
Coreutils Version: local/coreutils 8.29-1
SELinux status (enforcing, permissive, off/unused): off
Unable to test CentOS 7 cp due to dependency on SELinux libraries (Arch doesn't support SELinux).
@tuxoko Nice analysis!
The reason you can create a ZAP with a lot of files but cannot copy them is that, when you create files, you create them in random order in terms of hash value. However, if you copy files from one directory to another directory, you create them sequentially in terms of hash value. That means if the source directory expanded its leaves 6 times, you need to expand the destination's leaves 6 times in one go.
One thing to note is that we do use a different salt for each directory, so theoretically a strong enough salt should prevent this from happening. This shows that the current salt is not strong enough.
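To make the ordering effect concrete, here is a toy model of an extendible-hash directory (a sketch only: the leaf capacity and hash width are made up, and this is not the real ZAP code). Inserting the same hashes in sorted order forces long chains of consecutive leaf splits to service a single insert, because every entry already in the overflowing leaf shares a long hash prefix with the incoming one; in random order a single insert almost never needs more than a split or two. A bounded number of expansion retries per insert (the expand_retries seen in the stap probe earlier) would then be exhausted mainly in the sorted case:
```
import random

# Toy extendible hash: a leaf at depth d holds every entry whose top d
# hash bits equal its prefix. Capacity/width are made up for the demo.
LEAF_CAP = 8
BITS = 32

def worst_split_chain(hashes):
    leaves = {(0, 0): []}   # (depth, prefix) -> entries
    worst = 0
    for h in hashes:
        chain = 0
        while True:
            depth, prefix = next((d, p) for (d, p) in leaves
                                 if h >> (BITS - d) == p)
            entries = leaves[(depth, prefix)]
            if len(entries) < LEAF_CAP:
                entries.append(h)
                break
            # Leaf full: split one more prefix bit, then retry the insert.
            # If the existing entries all share the next bit with h, the
            # target half is immediately full again and the chain grows.
            del leaves[(depth, prefix)]
            leaves[(depth + 1, prefix << 1)] = []
            leaves[(depth + 1, (prefix << 1) | 1)] = []
            for e in entries:
                leaves[(depth + 1, e >> (BITS - depth - 1))].append(e)
            chain += 1
        worst = max(worst, chain)
    return worst

random.seed(42)
hashes = random.sample(range(1 << BITS), 10000)
print("random creation order:", worst_split_chain(hashes))            # typically 1-3
print("copy (hash-sorted) order:", worst_split_chain(sorted(hashes))) # typically ~10
```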
The salt is pretty weak (see mzap_create_impl()); I'm not sure why we didn't just use random_get_pseudo_bytes(). I wonder if they are actually getting the same exact hash, or if there's some weakness in the way that the salt is used in zap_hash()? zdb can dump the salt to see if they are the same.
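If, as in a CRC-style hash, the salt only seeds the initial register value, it is weak no matter which value it holds: CRCs are linear over GF(2), so for any two salts the hashes of equal-length names differ by a single constant XOR, which merely relabels the leaf prefixes without changing which names cluster together. A self-contained demonstration with a generic bitwise CRC-64 (the polynomial and salts are illustrative; this is not the actual zap_hash() code):
```
POLY = 0xC96C5795D7870F42  # a reflected CRC-64 polynomial, for illustration

def crc64(data, init):
    # Plain bitwise CRC-64, starting from the given register value (salt).
    h = init
    for byte in data:
        h ^= byte
        for _ in range(8):
            h = (h >> 1) ^ (POLY if h & 1 else 0)
    return h

# Equal-length names, like the 4-digit file names in the reproducers.
names = [str(i).encode() for i in range(1000, 10000)]
salt_a = 0x1234567890ABCDEF
salt_b = 0xFEDCBA0987654321

deltas = {crc64(n, salt_a) ^ crc64(n, salt_b) for n in names}
print(len(deltas))  # -> 1: every hash shifts by the same constant, so two
                    # directories share the same collision/cluster structure
```
Whether the real zap_hash() mixes the salt in a way that escapes this linearity is exactly what comparing zdb's salt dumps (and the resulting hashes) would answer.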
We're working to get an 0.7.8 release out with https://github.com/zfsonlinux/zfs/commit/cc63068e95ee725cce03b1b7ce50179825a6cda5 reverted ASAP.
Before anyone starts bindiffing binaries: CentOS cp's open(O_CREAT) order is randomized, Debian's is not: random file order = random hash values = more likely to zap_expand_leaf()/zap_leaf_split(), I guess...
[root@centos ~]# grep DST /tmp/debian.txt | head -n 100
execve("./debian-cp", ["./debian-cp", "-r", "/mnt/SRC", "/mnt/DST-debian"], [/* 18 vars */]) = 0
stat("/mnt/DST-debian", 0x7ffc25990cb0) = -1 ENOENT (No such file or directory)
lstat("/mnt/DST-debian", 0x7ffc25990a40) = -1 ENOENT (No such file or directory)
mkdir("/mnt/DST-debian", 0755) = 0
lstat("/mnt/DST-debian", {st_mode=S_IFDIR|0755, st_size=2, ...}) = 0
open("/mnt/DST-debian/3357", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3358", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3359", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3360", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3361", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3362", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3363", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3364", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3365", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3366", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3367", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3368", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3369", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3370", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3371", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3372", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3373", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3374", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3375", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3376", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3377", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3378", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3379", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3380", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3381", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3382", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3383", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3384", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3385", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3386", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3387", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3388", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3389", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3390", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3391", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3392", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3393", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3394", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3395", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/1", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/2", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/3", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/4", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/5", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/6", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/7", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/8", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/9", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/10", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/11", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/12", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/13", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/14", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/15", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/16", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/17", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/18", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/19", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/20", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/21", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/22", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/23", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/24", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/25", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/26", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/27", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/28", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/29", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/30", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/31", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/32", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/33", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/34", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/35", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/36", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/37", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/38", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/39", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/40", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/41", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/42", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/43", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/44", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/45", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/46", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/47", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/48", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/49", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/50", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/51", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/52", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/53", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/54", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/55", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-debian/56", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
[root@centos ~]# grep DST /tmp/centos.txt | head -n 100
execve("/bin/cp", ["cp", "-r", "/mnt/SRC", "/mnt/DST-centos"], [/* 18 vars */]) = 0
stat("/mnt/DST-centos", 0x7ffc6299e1d0) = -1 ENOENT (No such file or directory)
lstat("/mnt/DST-centos", 0x7ffc6299df30) = -1 ENOENT (No such file or directory)
mkdir("/mnt/DST-centos", 0755) = 0
lstat("/mnt/DST-centos", {st_mode=S_IFDIR|0755, st_size=2, ...}) = 0
open("/mnt/DST-centos/6667", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4153", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8772", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2455", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8691", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6784", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2422", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8705", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2878", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4124", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6610", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2558", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2896", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2902", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2975", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8608", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4029", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6689", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9017", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5636", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/688", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1590", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7102", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9183", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1404", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7096", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3330", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3347", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1473", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7175", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5641", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1829", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9060", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/611", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1509", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1953", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/785", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7078", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1924", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/666", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2065", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4939", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4563", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6257", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8342", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8335", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6220", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2186", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4514", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2012", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4480", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2168", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4834", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4843", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8238", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4419", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7968", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3700", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5392", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1034", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9427", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7532", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3694", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5206", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/5271", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7545", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9450", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1043", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3777", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/3799", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7865", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/221", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9970", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/1139", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/256", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9907", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7812", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7448", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/9893", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/7986", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4944", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2018", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4933", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4569", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8348", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4849", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2115", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4587", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8232", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6327", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/2081", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4413", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/4464", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/6350", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
open("/mnt/DST-centos/8245", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
[root@centos ~]#
Would someone with a CentOS-family system please install gdb and coreutils-debuginfo, then run gdb -ex 'info sources' $(which cp) and post the output for me? It will save me the trouble of getting my hands on a system so that I can try to figure out what is different between CentOS's cp and Gentoo's cp.
I ran this on Gentoo's cp (coreutils 8.28) to get the files that were used to build cp, and after some command-line foo I have tentatively identified these as the patches relevant to cp on CentOS:
./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/copy.c coreutils-8.21/src/copy.c
./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/cp.c coreutils-8.21/src/cp.c
./coreutils-8.22-selinux-optionsseparate.patch:diff -urNp coreutils-8.22-orig/src/cp.c coreutils-8.22/src/cp.c
./coreutils-8.22-mv-hardlinksrace.patch:diff -urNp coreutils-8.22-orig/src/copy.c coreutils-8.22/src/copy.c
./coreutils-8.22-cp-sparsecorrupt.patch:diff --git a/src/copy.c b/src/copy.c
./coreutils-8.22-cp-selinux.patch:diff --git a/src/selinux.c b/src/selinux.c
The files each patch touches are listed above.
Unfortunately, the files used could have changed between coreutils versions, so I need to rerun that analysis on the output from a system running CentOS 6 or CentOS 7 to get a true list. I plan to review and test these patches on Gentoo to see if I can track down the issue from the user-space side. Enough people are scrutinizing the kernel side that I'll delay tackling that until after I have figured out what makes CentOS's cp special.
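A rough reconstruction of that command-line foo, for anyone who wants to repeat the analysis on a CentOS box (the binary path, patch locations, and regex are assumptions, not the exact commands used):
```
import pathlib
import re
import subprocess

# Ask gdb which sources went into the cp binary (needs coreutils-debuginfo).
out = subprocess.run(
    ["gdb", "-batch", "-ex", "info sources", "/usr/bin/cp"],
    capture_output=True, text=True, check=True).stdout

# Keep the coreutils src/ and lib/ files, as repo-relative paths.
srcs = sorted(set(re.findall(r"coreutils-[\d.]+/((?:src|lib)/[\w.-]+\.c)", out)))

# Report which distro patches (e.g. from the unpacked SRPM) touch them.
for patch in sorted(pathlib.Path(".").glob("*.patch")):
    text = patch.read_text(errors="replace")
    hits = [s for s in srcs if s in text]
    if hits:
        print(f"{patch.name}: {', '.join(hits)}")
```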
I could set up a CentOS 7.4 VM but that could take an hour. Let me know if I should go on or if someone else has a system ready for testing.
On 2018-04-09 14:05, Richard Yao wrote:
> Would someone with a CentOS-family system please install gdb and coreutils-debuginfo, then run gdb -ex 'info sources' $(which cp) and post the output for me?
CentOS 7.4:
[root@nas ~]# gdb --ex 'info sources' /usr/bin/cp
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show
copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/bin/cp...Reading symbols from
/usr/bin/cp...(no debugging symbols found)...done.
(no debugging symbols found)...done.
No symbol table is loaded. Use the "file" command.
Missing separate debuginfos, use: debuginfo-install
coreutils-8.22-18.el7.x86_64
@dswartz You are missing the debuginfo. Do debuginfo-install coreutils-8.22-18.el7.x86_64 and try again. Output should look something like this:
Disregard my last: wrong package...
Source files for which symbols have been read in:
Source files for which symbols will be read in on demand:
/usr/src/debug/coreutils-8.22/src/cp.c, /usr/include/sys/stat.h,
/usr/include/bits/string3.h, /usr/include/bits/stdio2.h,
/usr/src/debug/coreutils-8.22/src/system.h,
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/include/stddef.h,
/usr/include/bits/types.h,
/usr/include/stdio.h, /usr/include/libio.h, /usr/include/sys/types.h,
/usr/include/time.h, /usr/include/getopt.h,
/usr/include/selinux/selinux.h, /usr/include/bits/stat.h,
/usr/src/debug/coreutils-8.22/lib/argmatch.h,
/usr/src/debug/coreutils-8.22/lib/hash.h,
/usr/src/debug/coreutils-8.22/lib/backupfile.h,
/usr/src/debug/coreutils-8.22/src/copy.h,
/usr/src/debug/coreutils-8.22/lib/stat-time.h,
/usr/src/debug/coreutils-8.22/src/version.h,
/usr/src/debug/coreutils-8.22/lib/exitfail.h,
/usr/src/debug/coreutils-8.22/lib/progname.h,
/usr/src/debug/coreutils-8.22/
/usr/src/debug/coreutils-8.22/lib/xalloc.h,
/usr/src/debug/coreutils-8.22/lib/quote.h, /usr/include/libintl.h,
/usr/include/stdlib.h, /usr/src/debug/coreutils-8.22/lib/error.h,
/usr/include/string.h, /usr/include/bits/errno.h,
/usr/src/debug/coreutils-8.22/lib/dirname.h,
/usr/src/debug/coreutils-8.22/lib/utimens.h, /usr/include/unistd.h,
/usr/src/debug/coreutils-8.22/lib/acl.h, /usr/include/locale.h,
/usr/src/debug/coreutils-8.22/lib/filenamecat.h,
/usr/src/debug/coreutils-8.22/lib/propername.h,
/usr/src/debug/coreutils-8.22/lib/version-etc.h,
/usr/src/debug/coreutils-8.22/src/cp-hash.h,
/usr/src/debug/coreutils-8.22/src/copy.c, /usr/include/bits/unistd.h,
/usr/include/bits/stdio.h,
/usr/src/debug/coreutils-8.22/src/ioblksize.h,
/usr/src/debug/coreutils-8.22/src/extent-scan.h,
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/include/stdarg.h,
/usr/include/stdint.h, /usr/src/debug/coreutils-8.22/lib/fadvise.h,
/usr/src/debug/coreutils-8.22/lib/utimecmp.h,
/usr/include/attr/error_context.h,
/usr/src/debug/coreutils-8.22/src/selinux.h,
/usr/src/debug/coreutils-8.22/lib/write-any-file.h,
/usr/src/debug/coreutils-8.22/lib/full-write.h,
/usr/include/attr/libattr.h,
/usr/src/debug/coreutils-8.22/lib/verror.h,
/usr/src/debug/coreutils-8.22/lib/unistd.h,
/usr/src/debug/coreutils-8.22/lib/filemode.h,
/usr/src/debug/coreutils-8.22/lib/same.h,
/usr/src/debug/coreutils-8.22/lib/yesno.h,
/usr/src/debug/coreutils-8.22/lib/file-set.h,
/usr/src/debug/coreutils-8.22/lib/areadlink.h,
/usr/src/debug/coreutils-8.22/lib/savedir.h,
/usr/src/debug/coreutils-8.22/lib/fcntl-safer.h,
/usr/src/debug/coreutils-8.22/lib/buffer-lcm.h,
/usr/include/sys/ioctl.h, /usr/include/assert.h,
/usr/src/debug/coreutils-8.22/src/cp-hash.c,
/usr/src/debug/coreutils-8.22/src/extent-scan.c,
/usr/src/debug/coreutils-8.22/src/fiemap.h,
/usr/src/debug/coreutils-8.22/src/selinux.c, /usr/include/bits/fcntl2.h,
/usr/include/selinux/context.h,
/usr/src/debug/coreutils-8.22/lib/canonicalize.h,
/usr/src/debug/coreutils-8.22/lib/i-ring.h,
/usr/src/debug/coreutils-8.22/lib/fts_.h, /usr/include/dirent.h,
/usr/src/debug/coreutils-8.22/lib/xfts.h,
/usr/src/debug/coreutils-8.22/src/version.c,
/usr/src/debug/coreutils-8.22/lib/copy-acl.c,
/usr/src/debug/coreutils-8.22/lib/set-acl.c,
/usr/src/debug/coreutils-8.22/lib/areadlink-with-size.c,
/usr/src/debug/coreutils-8.22/lib/argmatch.c,
/usr/src/debug/coreutils-8.22/lib/quotearg.h,
/usr/src/debug/coreutils-8.22/lib/backupfile.c,
/usr/include/bits/dirent.h,
/usr/src/debug/coreutils-8.22/lib/dirent-safer.h,
/usr/include/bits/confname.h,
/usr/src/debug/coreutils-8.22/lib/buffer-lcm.c,
/usr/src/debug/coreutils-8.22/lib/canonicalize.c,
/usr/include/bits/string2.h,
/usr/src/debug/coreutils-8.22/lib/xgetcwd.h,
/usr/src/debug/coreutils-8.22/lib/closein.c,
/usr/src/debug/coreutils-8.22/lib/freadahead.h,
/usr/src/debug/coreutils-8.22/lib/close-stream.h,
/usr/src/debug/coreutils-8.22/lib/stdio.h,
/usr/src/debug/coreutils-8.22/lib/closeout.h,
/usr/src/debug/coreutils-8.22/lib/closeout.c,
/usr/src/debug/coreutils-8.22/lib/opendir-safer.c,
/usr/src/debug/coreutils-8.22/lib/unistd-safer.h,
/usr/src/debug/coreutils-8.22/lib/dirname.c,
/usr/src/debug/coreutils-8.22/lib/dirname-lgpl.c,
/usr/src/debug/coreutils-8.22/lib/basename-lgpl.c,
/usr/src/debug/coreutils-8.22/lib/stripslash.c,
/usr/src/debug/coreutils-8.22/lib/exitfail.c,
/usr/src/debug/coreutils-8.22/lib/fadvise.c, /usr/include/fcntl.h,
/usr/src/debug/coreutils-8.22/lib/open-safer.c,
/usr/src/debug/coreutils-8.22/lib/file-set.c,
/usr/src/debug/coreutils-8.22/lib/hash-triple.h,
/usr/src/debug/coreutils-8.22/lib/filemode.c,
/usr/src/debug/coreutils-8.22/lib/filenamecat.c,
/usr/src/debug/coreutils-8.22/lib/filenamecat-lgpl.c,
/usr/src/debug/coreutils-8.22/lib/full-write.c,
/usr/src/debug/coreutils-8.22/lib/safe-write.h,
/usr/src/debug/coreutils-8.22/lib/hash.c,
/usr/src/debug/coreutils-8.22/lib/bitrotate.h,
/usr/src/debug/coreutils-8.22/lib/hash-triple.c,
/usr/src/debug/coreutils-8.22/lib/hash-pjw.h,
/usr/src/debug/coreutils-8.22/lib/progname.c, /usr/include/errno.h,
/usr/src/debug/coreutils-8.22/lib/propername.c,
/usr/src/debug/coreutils-8.22/lib/mbuiter.h,
/usr/src/debug/coreutils-8.22/lib/mbchar.h, /usr/include/wchar.h,
/usr/src/debug/coreutils-8.22/lib/strnlen1.h,
/usr/include/wctype.h, /usr/include/ctype.h,
/usr/src/debug/coreutils-8.22/lib/string.h,
/usr/src/debug/coreutils-8.22/lib/trim.h,
/usr/src/debug/coreutils-8.22/lib/xstriconv.h,
/usr/src/debug/coreutils-8.22/lib/localcharset.h,
/usr/src/debug/coreutils-8.22/lib/c-strcase.h,
/usr/src/debug/coreutils-8.22/lib/qcopy-acl.c, /usr/include/sys/acl.h,
/usr/src/debug/coreutils-8.22/lib/acl-internal.h,
/usr/src/debug/coreutils-8.22/lib/qset-acl.c, /usr/include/acl/libacl.h,
/usr/src/debug/coreutils-8.22/lib/quotearg.c,
/usr/src/debug/coreutils-8.22/lib/c-strcaseeq.h,
/usr/src/debug/coreutils-8.22/lib/safe-read.c,
/usr/src/debug/coreutils-8.22/lib/same.c,
/usr/src/debug/coreutils-8.22/lib/savedir.c,
/usr/src/debug/coreutils-8.22/lib/strnlen1.c,
/usr/src/debug/coreutils-8.22/lib/trim.c,
/usr/src/debug/coreutils-8.22/lib/mbiter.h,
/usr/src/debug/coreutils-8.22/lib/dup-safer.c,
/usr/src/debug/coreutils-8.22/lib/fcntl.h,
/usr/src/debug/coreutils-8.22/lib/fd-safer.c,
/usr/src/debug/coreutils-8.22/lib/utimecmp.c,
/usr/src/debug/coreutils-8.22/lib/utimens.c, /usr/include/bits/time.h,
/usr/src/debug/coreutils-8.22/lib/timespec.h,
/usr/src/debug/coreutils-8.22/lib/sys/stat.h, /usr/include/sys/time.h,
/usr/src/debug/coreutils-8.22/lib/verror.c,
/usr/src/debug/coreutils-8.22/lib/xvasprintf.h,
/usr/src/debug/coreutils-8.22/lib/version-etc.c,
/usr/src/debug/coreutils-8.22/lib/version-etc-fsf.c,
/usr/src/debug/coreutils-8.22/lib/write-any-file.c,
/usr/src/debug/coreutils-8.22/lib/xmalloc.c,
/usr/src/debug/coreutils-8.22/lib/xalloc-die.c,
/usr/src/debug/coreutils-8.22/lib/xfts.c,
/usr/src/debug/coreutils-8.22/lib/xgetcwd.c,
/usr/src/debug/coreutils-8.22/lib/xstriconv.c, /usr/include/iconv.h,
/usr/src/debug/coreutils-8.22/lib/striconv.h,
/usr/src/debug/coreutils-8.22/lib/xvasprintf.c,
/usr/src/debug/coreutils-8.22/lib/xsize.h,
/usr/src/debug/coreutils-8.22/lib/yesno.c,
/usr/src/debug/coreutils-8.22/lib/fcntl.c,
/usr/src/debug/coreutils-8.22/lib/fflush.c,
/usr/include/stdio_ext.h,
/usr/src/debug/coreutils-8.22/lib/freadahead.c,
/usr/src/debug/coreutils-8.22/lib/fseeko.c,
/usr/src/debug/coreutils-8.22/lib/fts-cycle.c,
/usr/src/debug/coreutils-8.22/lib/fts.c,
/usr/src/debug/coreutils-8.22/lib/cycle-check.h,
/usr/src/debug/coreutils-8.22/lib/dev-ino.h, /usr/include/bits/statfs.h,
/usr/src/debug/coreutils-8.22/lib/cloexec.h, /usr/include/sys/statfs.h,
/usr/src/debug/coreutils-8.22/lib/getfilecon.c,
/usr/src/debug/coreutils-8.22/lib/linkat.c,
/usr/src/debug/coreutils-8.22/lib/at-func.c,
/usr/src/debug/coreutils-8.22/lib/utimensat.c,
/usr/src/debug/coreutils-8.22/lib/save-cwd.h,
/usr/src/debug/coreutils-8.22/lib/openat-priv.h,
/usr/src/debug/coreutils-8.22/lib/openat.h,
/usr/src/debug/coreutils-8.22/lib/vasprintf.c,
/usr/src/debug/coreutils-8.22/lib/vasnprintf.h,
/usr/src/debug/coreutils-8.22/lib/areadlinkat.c,
/usr/src/debug/coreutils-8.22/lib/careadlinkat.h,
/usr/src/debug/coreutils-8.22/lib/c-strcasecmp.c,
/usr/src/debug/coreutils-8.22/lib/careadlinkat.c,
/usr/src/debug/coreutils-8.22/lib/allocator.h,
/usr/src/debug/coreutils-8.22/lib/cloexec.c,
/usr/src/debug/coreutils-8.22/lib/close-stream.c,
/usr/src/debug/coreutils-8.22/lib/cycle-check.c,
/usr/src/debug/coreutils-8.22/lib/gettime.c,
/usr/src/debug/coreutils-8.22/lib/hash-pjw.c,
/usr/src/debug/coreutils-8.22/lib/i-ring.c,
/usr/src/debug/coreutils-8.22/lib/localcharset.c,
/usr/include/nl_types.h,
/usr/include/langinfo.h, /usr/src/debug/coreutils-8.22/lib/mbchar.c,
/usr/src/debug/coreutils-8.22/lib/str-kmp.h,
/usr/src/debug/coreutils-8.22/lib/mbsstr.c,
/usr/src/debug/coreutils-8.22/lib/malloca.h,
/usr/src/debug/coreutils-8.22/lib/openat-die.c,
/usr/src/debug/coreutils-8.22/lib/openat-safer.c,
/usr/src/debug/coreutils-8.22/lib/acl-errno-valid.c,
/usr/src/debug/coreutils-8.22/lib/file-has-acl.c,
/usr/src/debug/coreutils-8.22/lib/save-cwd.c,
/usr/src/debug/coreutils-8.22/lib/chdir-long.h,
/usr/src/debug/coreutils-8.22/lib/striconv.c,
/usr/src/debug/coreutils-8.22/lib/chdir-long.c,
/usr/src/debug/coreutils-8.22/lib/fclose.c,
/usr/src/debug/coreutils-8.22/lib/openat-proc.c,
/usr/src/debug/coreutils-8.22/lib/vasnprintf.c,
/usr/src/debug/coreutils-8.22/lib/printf-args.h,
/usr/src/debug/coreutils-8.22/lib/printf-parse.h,
/usr/src/debug/coreutils-8.22/lib/fpucw.h,
/usr/src/debug/coreutils-8.22/lib/isnanl-nolibm.h,
/usr/src/debug/coreutils-8.22/lib/allocator.c,
/usr/src/debug/coreutils-8.22/lib/malloca.c,
/usr/src/debug/coreutils-8.22/lib/mbslen.c,
/usr/src/debug/coreutils-8.22/lib/isnan.c,
/usr/src/debug/coreutils-8.22/lib/printf-args.c,
/usr/src/debug/coreutils-8.22/lib/printf-parse.c
@dswartz Would you edit your post to use a pastebin? Also, why is there nothing under "Source files for which symbols have been read in:"? Did you edit the output?
On 2018-04-09 14:17, Richard Yao wrote:
> @dswartz Would you edit your post to use a pastebin?
Sure.
The set of patches that apply to cp, going by what gdb claims its source files are (after removing patches that only edit test cases and so do not affect the cp binary), is the same as what I got after processing the gdb output from Gentoo's cp, which is:
./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/copy.c coreutils-8.21/src/copy.c
./coreutils-selinux.patch:diff -urNp coreutils-8.21-orig/src/cp.c coreutils-8.21/src/cp.c
./coreutils-8.22-selinux-optionsseparate.patch:diff -urNp coreutils-8.22-orig/src/cp.c coreutils-8.22/src/cp.c
./coreutils-8.22-mv-hardlinksrace.patch:diff -urNp coreutils-8.22-orig/src/copy.c coreutils-8.22/src/copy.c
./coreutils-8.22-cp-sparsecorrupt.patch:diff --git a/src/copy.c b/src/copy.c
./coreutils-8.22-cp-selinux.patch:diff --git a/src/selinux.c b/src/selinux.c
The changes in ./coreutils-8.22-mv-hardlinksrace.patch look questionable to me, but I don't see a smoking gun. Testing it on Gentoo after applying these patches should allow us to figure out which one is making it reproducible on CentOS.
Reproducibility: no
Distribution name and version: Fedora 27
Kernel Version: 4.15.10-300.fc27.x86_64
Coreutils Version: 8.27-20.fc27
SELinux status: off
EDIT: This machine's cp is copying in alphanumeric order (verified using strace).
Not reproducible using archzfs repo of Arch Linux (thanks, @demizer).
■ mkdir SRC
■ for i in $(seq 1 10000); do echo $i > SRC/$i ; done
■ cp -r SRC DST
■ uname -srvmo
Linux 4.15.15-1-ARCH #1 SMP PREEMPT Sat Mar 31 23:59:25 UTC 2018 x86_64 GNU/Linux
■ LC_ALL=C pacman -Qi coreutils spl-linux spl-utils-common zfs-linux zfs-utils-common | grep '^Version '
Version : 8.29-1
Version : 0.7.7.4.15.15.1-1
Version : 0.7.7-1
Version : 0.7.7.4.15.15.1-1
Version : 0.7.7-1
■ zpool get all | sed '2,$s/^..../tank/g'
NAME PROPERTY VALUE SOURCE
tank size 43.5T -
tank capacity 81% -
tank altroot - default
tank health ONLINE -
tank guid xxxxxxxxxxxxxxxxxxx -
tank version - default
tank bootfs - default
tank delegation on default
tank autoreplace off default
tank cachefile - default
tank failmode wait default
tank listsnapshots off default
tank autoexpand off default
tank dedupditto 0 default
tank dedupratio 1.00x -
tank free 8.00T -
tank allocated 35.5T -
tank readonly off -
tank ashift 12 local
tank comment - default
tank expandsize - -
tank freeing 0 -
tank fragmentation 34% -
tank leaked 0 -
tank multihost off default
tank feature@async_destroy enabled local
tank feature@empty_bpobj active local
tank feature@lz4_compress active local
tank feature@multi_vdev_crash_dump disabled local
tank feature@spacemap_histogram active local
tank feature@enabled_txg active local
tank feature@hole_birth active local
tank feature@extensible_dataset enabled local
tank feature@embedded_data active local
tank feature@bookmarks enabled local
tank feature@filesystem_limits enabled local
tank feature@large_blocks enabled local
tank feature@large_dnode disabled local
tank feature@sha512 disabled local
tank feature@skein disabled local
tank feature@edonr disabled local
tank feature@userobj_accounting disabled local
The 0.7.7 release has been removed from the CentOS and Fedora RPM repositories.
@rincebrain confirmed that this is reproducible using touch to create files in the right order (to inflate the zap with hash collisions). I'll post a minimal testcase.
@trisk you might want to look at the testcase in https://github.com/zfsonlinux/zfs/pull/7411 first
@Ringdingcoder Nice find. That could explain things nicely if some tests with/without that confirm it is the difference.
@Ringdingcoder I just reproduced this in an old Gentoo VM that uses coreutils 8.21. It is affected. No redhat patches are in place there. I'll try reproducing with the patches that you linked and see what happens. I expect one of them to make the issue disappear.
You'll need this:
```
--- a/src/copy.c
+++ b/src/copy.c
@@ -717,7 +717,7 @@ copy_dir (char const *src_name_in, char const *dst_name_in, bool new_dst,
   struct cp_options non_command_line_options = *x;
   bool ok = true;
```
Shell script to reproduce (cp not needed): https://gist.github.com/trisk/9966159914d9d5cd5772e44885112d30
No, actually you're going the other way around. Then you would need to patch gnulib. It's easier to go back to unsorted from a recent version.
@Ringdingcoder I just realized that when I found that lib/savedir.c didn't exist in the source files that I have. I picked an old VM that just happened to have 8.21. I'll update it to 8.23 and then revert to verify things.
I can reproduce it immediately by going with SAVEDIR_SORT_NONE. This explains why only old distros experience this. IIRC, tar is unsorted, so a "tar cf - SRC | tar -C DST -xf -" should be able to trigger this everywhere (untested).
Obviously the hard-coded sequence from trisk's script also does it.
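In the same spirit, the trigger can be expressed directly without tar or cp; a minimal, untested sketch (script name and invocation hypothetical) that replays SRC's readdir order, which on ZFS is ZAP hash order, into a fresh DST:
```
import os
import sys

src, dst = sys.argv[1], sys.argv[2]
os.mkdir(dst)
# os.listdir() returns names in readdir order, which on ZFS is ZAP hash
# order, so this replays the hash-sequential creation pattern of unsorted
# tools such as old cp or tar.
for name in os.listdir(src):
    with open(os.path.join(dst, name), "x"):
        pass
```
Run as python3 replay.py /mnt/SRC /mnt/DST after populating SRC as in the transcripts above.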
With @trisk 's script I can immediately reproduce this on 64-bit Ubuntu 16.04 (kernel 4.4.0-109-generic) with ZFS 0.7.7. It fails as expected:
touch: cannot touch 'DST/9259': No space left on device
@ryao Yes, the actual code change is this: http://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=be7d73709d2b3bceb987f1be00a049bb7021bf87
@Ringdingcoder I think we have satisfactorily explained what is different between the various distributions such that only the RHEL ones are affected. I am going to switch to understanding what is going wrong inside the kernel.
I observed a link-count corruption issue that persisted across unmounts when I reproduced this, but I have had trouble reproducing the problem reliably enough to have a reproducer for the link-count issue.
I have now reproduced the bug on Arch Linux, using a corrected version of @trisk's script (unexpected token error, line 6). I am unable to reproduce the bug consistently:
■ export LC_ALL=C
■ rm -r DST
■ ./zap-collision-test.sh
■ rm -r DST
■ ./zap-collision-test.sh
touch: cannot touch 'DST/9259': No space left on device
■ rm -r DST
■ ./zap-collision-test.sh
touch: cannot touch 'DST/9259': No space left on device
■ rm -r DST
■ ./zap-collision-test.sh
■ rm -r DST
■ ./zap-collision-test.sh
touch: cannot touch 'DST/9259': No space left on device
■ rm -r DST
■ ./zap-collision-test.sh
■
@NoSuck Can confirm on Arch, too, using @trisk's script. This invalidates my previous comment.
local/spl-linux-git 2018.04.04.r1070.581bc01.4.15.15.1-1 (archzfs-linux-git)
local/spl-utils-common-git 2018.04.04.r1070.581bc01-1 (archzfs-linux-git)
local/zfs-linux-git 2018.04.04.r3402.533ea0415.4.15.15.1-1 (archzfs-linux-git)
local/zfs-utils-common-git 2018.04.04.r3402.533ea0415-1 (archzfs-linux-git)
Using @trisk's script (well, a slightly corrected version) I can reproduce this on an almost current version of the git tip, g1724eb62d, on Fedora 27 on a simple mirrored vdev in a VM. It doesn't happen on every run, but it happens reasonably frequently (at least half the time, I think).
(This git version is the most recent version I've built for my own use. I can test with the very latest git tip, but I don't see anything there that would change this, if the identified cause is right. I'd be happy to test updates in the VM.)
It might be significant that we hit the zap expansion limit at 2048 files (unclear if this reflects a property of the coreutils sorting, or the zap hash function, though).
The original reproducer creates orphaned files when it triggers, while @trisk's reproducer does not. After running the original reproducer, I observed a failure on 1 file, 8186 files in the directory according to ls -l DST | wc, and a directory size of 10001. After unlinking all of the files, trying to stat them to see if any were accessible failed, despite a directory size of 1816. Here is zdb output from a testpool that I used to reproduce the issue:
https://bpaste.net/show/d9f2f0de6c61
I forget how many times I ran the reproducer on this (likely twice), but the orphaned files are clearly visible. Here is a compressed image of the pool:
https://dev.gentoo.org/~ryao/7401-pool-orphaned-files.xz
It has sha256 5bf54d804f0cd6cd155cc781efeefdabaa6e0ddddc500695eb24061d802474ac. The pool itself is just a 1GB sparse file. The compressed version is 1938032 bytes (~2MB) in size. Others can use zdb on it and poke around to observe the orphaned files.
I am stepping out for a bit due to an appointment that I cannot preempt, but I just want to point out that those who lost files might still have them around as orphans. We'll need to examine a pool where this happened with files storing actual data to confirm that the data is there. If it is, the data could be recoverable.
Thank you everyone for your help with this unfortunate regression. As described above by @tuxoko the root cause of this issue is understood and a complete fix is currently being worked on. In the meanwhile commit cc63068e95ee725cce03b1b7ce50179825a6cda5 which introduced this issue will be shortly reverted from the master branch, release branch, and v0.7.8 will be tagged. We'll open a new PR with the full fix for review and feedback when it's ready.
@behlendorf There are still some loose ends. In particular, how are we going to deal with those affected by this? There could be orphan files in their datasets.
At present, we could tell people to back up the changes between what they have now and the snapshot before the issue happened, roll back, and then restore, provided that they have snapshots at all. If not, the solution at the moment would be to make a new dataset, copy the files over to it and then destroy the old one.
Neither is as clean a solution as doing something like zfs lost+found -r tank and having the orphaned files put into lost+found directories. It gets messier when we consider that orphaned files could be in recently made snapshots.
This being hard to reproduce on non-RHEL-family systems had been a loose end, but it has just been tied up: a change in the bundled gnulib between coreutils 8.22 and 8.23 switched the order in which files are copied from sequential (in ZAP hash terms) to a pseudo-random one.
Finally, we had something like a dozen people around the world drop everything to work on this. Not all of us are on the same page yet and it will take some time to sync our understandings, so that we can all review the final fix.
I should add that we also need a way to check for the presence of orphans. I have confirmed that zdb can show them, but I have not yet determined what zdb would show in all cases (mainly, files with actual data) to allow reliable detection.
Our analysis so far has not determined how the additional files whose zap_add completes after a prior zap expansion failure on the directory end up orphaned.
Our analysis is not finished. I am reopening this pending the completion of our analysis.
Right I didn't mean to suggest this issue should be closed, and reverting the change was all that was needed. There's still clearly careful investigation to be done, which we can now focus on.
@ryao when possible, rolling back to a snapshot would be the cleanest way to recover these files. However, since that won't always be an option, let's investigate implementing a generic orphan recovery mechanism. Adding this functionality initially to zdb would allow us to check existing datasets, and would be nice additional test coverage for ztest to leverage. We could potentially follow this up with support for a .zfs/lost+found directory.
Given the improved understanding of the cause of this regression, can anything be said about the behaviour of rsync? If it reports no errors, are the data fine?
What about mv? And what if mv is from one dataset to another, on the same pool?
@darrenfreeman The mailing list or IRC chatroom would probably be a better place to ask, but: rsync always sorts files, so it should be fine. And as long as you don't receive errors, you should be fine.
Also, one final caveat: since data is not silently lost, this is not the worst-case catastrophic bug, just a major annoyance. The most inconvenient issue about it is the orphaned files, but fortunately they are tied to their respective datasets, not to the entire pool, and can be gotten rid of by rolling back or re-creating individual datasets.
Reproducibility: yes
ZoL version: git, recent commit, 10adee27ced279c381816e1321226fce5834340c
Distribution: Ubuntu 17.10
Kernel Version: 4.13.0-38-generic
Coreutils Version: 8.26-3ubuntu4
SELinux status: not installed AFAICT
Reproduced using: ./zap-collision-test.sh
Furthermore, this didn't look good:
rm -Rf DST
Segmentation fault (core dumped)
The pool was freshly created as:
zfs create rpool/test -o recordsize=4k
truncate -s 1G /rpool/test/file
zpool create test /rpool/test/file -o ashift=12
I am trying to install the debug symbols for rm, however I am now also getting segfaults when not even touching this zpool. (apt-key is segfaulting when trying to trust the debug repo.) So I fear I better push the comment button now and reboot :/
Update: can't reproduce the segfault on rm -Rf DST, after rebooting and installing debug symbols.
Thanks for the solutions and quick efforts to fix.
Are there any methods to check a complete filesystem for affected files? I do have backups - can anyone give me a one-liner to list them?
Given this bug has now been listed on The Register (https://www.theregister.co.uk/2018/04/10/zfs_on_linux_data_loss_fixed/), it might be wise to have an FAQ article on the wiki page (with a link in this ticket). The FAQ article should clearly state which versions of ZoL are affected and which distros/kernel versions (similar to the hole_birth bug). This would hopefully limit any panic about the reliability of ZoL as a storage layer.
Given this bug has now been listed on The Register (https://www.theregister.co.uk/2018/04/10/zfs_on_linux_data_loss_fixed/)
From that article (emphasis mine):
"So even though three reviewers signed off on the cruddy commit, the speedy response may mean it’s possible to consider this a triumph of sorts for open source."
Ouch.
I agree with @markdesouza that there should be an FAQ article for that, so we ZFS apologists can point anyone who questions us about this to it. I would also like to suggest that the ZFS sign-off procedure be reviewed to prevent (or at least make it far less likely for) such a "cruddy commit" to make it into a stable ZFS release, and that notice of this review also be added to that same FAQ article.
In #7411, the random_creation test looks like it may be a more robust reproducer (especially for future bugs) because it naturally relies on the ordering of the ZAP hashes. Also, if there are other reproducers, it might be a good idea to centralize discussion of them in that PR so they can be easily included.
Answering my earlier question. Debian 9.3 as above.
rsync doesn't hit the bug, it creates files in lexical order. (I.e. file 999 is followed by 9990.) In a very small number of tests, I didn't find a combination of switches that would fail.
So anyone who prefers rsync should have a pretty good chance of having missed the bug.
Something similar to mv /pool/dataset1/SRC /pool/dataset2/ also didn't fail. (Move between datasets within the same pool.) Although, on the same box, cp doesn't fail either, so that doesn't prove much.
FYI - you probably all saw it already, but we released zfs-0.7.8 with the reverted patch last night.
@ort163 We do not have a one liner yet. People are continuing to analyze the issue and we will have a proper fix in the near future. That will include a way to detect+correct the wrong directory sizes, list snapshots affected and place the orphaned files in some kind of lost+found directory. I am leaning toward extending scrub to do it.
@markdesouza I have spent a fair amount of time explaining things to end users on Hacker News, Reddit and Phoronix. I do not think that our understanding is sufficient to post a final FAQ yet, but we could post an interim FAQ.
I think the interim FAQ entry should advise users to upgrade ASAP to avoid having to possibly deal with orphaned files if nothing has happened yet, or more orphaned files if something has already happened; and not to change how they do things after upgrading unless they deem it necessary until we finish our analysis, make a proper fix, and issue proper instructions on how to repair the damage in the release notes. I do not think there is any harm to pools if datasets have incorrect directory sizes and orphaned files while people wait for us to release a proper fix with instructions on how to completely address the issue, so telling them to wait after upgrading should be fine. The orphan files should stay around and persist through send/recv unless snapshot rollback is done or the dataset is destroyed.
Until that is up, you could point users to my hacker news post:
https://news.ycombinator.com/item?id=16797932
Specifically, we need to nail down whether existing files' directory entries could be lost, what (if any) other side effects happen when this is triggered on new file creation, what course of events leads to directory entries disappearing after ENOSPC, how system administrators can detect it, and how they can repair it. Then we should be able to make a proper FAQ entry.
Edit: The first 3 questions are answered satisfactorily in #7421.