Zfs: Move to ZFS volume creates correctly sized file filled with \0

Created on 20 Feb 2015  ·  70 comments  ·  Source: openzfs/zfs

Test case:

[sun@ethan ~ :) 4]$ cp a.txt b.txt
[sun@ethan ~ :) 6]$ sha1sum a.txt b.txt
3d8bb1556fd9d5d2c43d97011d7bf6bef8e8c295  a.txt
3d8bb1556fd9d5d2c43d97011d7bf6bef8e8c295  b.txt
[sun@ethan ~ :) 7]$ cp b.txt /tank/share/media/audio/
[sun@ethan ~ :) 8]$ sha1sum a.txt b.txt /tank/share/media/audio/b.txt
3d8bb1556fd9d5d2c43d97011d7bf6bef8e8c295  a.txt
3d8bb1556fd9d5d2c43d97011d7bf6bef8e8c295  b.txt
3d8bb1556fd9d5d2c43d97011d7bf6bef8e8c295  /tank/share/media/audio/b.txt
[sun@ethan ~ :) 9]$ rm /tank/share/media/audio/b.txt
[sun@ethan ~ :) 10]$ mv b.txt /tank/share/media/audio/
[sun@ethan ~ :) 11]$ sha1sum a.txt /tank/share/media/audio/b.txt
3d8bb1556fd9d5d2c43d97011d7bf6bef8e8c295  a.txt
92382bb65cc7b6b5293356054e9a146a96043336  /tank/share/media/audio/b.txt
[sun@ethan ~ :) 12]$ hexdump -C /tank/share/media/audio/b.txt
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000009e0  00 00 00 00 00 00 00 00  00                       |.........|
000009e9

This seems to only happen on a single dataset so far.
Tracing through mv with gdb suggests this is somehow timing related, as stepping through
seems to prevent the effect from happening.

NAME                    PROPERTY               VALUE                    SOURCE
tank/share/media/audio  type                   filesystem               -
tank/share/media/audio  creation               Thu Oct  3 12:02 2013    -
tank/share/media/audio  used                   372G                     -
tank/share/media/audio  available              3.13T                    -
tank/share/media/audio  referenced             366G                     -
tank/share/media/audio  compressratio          1.00x                    -
tank/share/media/audio  mounted                yes                      -
tank/share/media/audio  quota                  none                     default
tank/share/media/audio  reservation            none                     default
tank/share/media/audio  recordsize             128K                     default
tank/share/media/audio  mountpoint             /tank/share/media/audio  inherited from tank/share
tank/share/media/audio  sharenfs               rw=10.200.254.1,ro       inherited from tank/share
tank/share/media/audio  checksum               on                       default
tank/share/media/audio  compression            off                      inherited from tank/share/media
tank/share/media/audio  atime                  off                      inherited from tank/share
tank/share/media/audio  devices                off                      inherited from tank
tank/share/media/audio  exec                   on                       default
tank/share/media/audio  setuid                 off                      inherited from tank
tank/share/media/audio  readonly               off                      default
tank/share/media/audio  zoned                  off                      default
tank/share/media/audio  snapdir                hidden                   default
tank/share/media/audio  aclinherit             restricted               default
tank/share/media/audio  canmount               on                       default
tank/share/media/audio  xattr                  sa                       inherited from tank
tank/share/media/audio  copies                 1                        default
tank/share/media/audio  version                5                        -
tank/share/media/audio  utf8only               off                      -
tank/share/media/audio  normalization          none                     -
tank/share/media/audio  casesensitivity        sensitive                -
tank/share/media/audio  vscan                  off                      default
tank/share/media/audio  nbmand                 off                      default
tank/share/media/audio  sharesmb               off                      default
tank/share/media/audio  refquota               none                     default
tank/share/media/audio  refreservation         none                     default
tank/share/media/audio  primarycache           all                      default
tank/share/media/audio  secondarycache         all                      default
tank/share/media/audio  usedbysnapshots        6.05G                    -
tank/share/media/audio  usedbydataset          366G                     -
tank/share/media/audio  usedbychildren         0                        -
tank/share/media/audio  usedbyrefreservation   0                        -
tank/share/media/audio  logbias                latency                  default
tank/share/media/audio  dedup                  off                      default
tank/share/media/audio  mlslabel               none                     default
tank/share/media/audio  sync                   standard                 default
tank/share/media/audio  refcompressratio       1.00x                    -
tank/share/media/audio  written                0                        -
tank/share/media/audio  logicalused            366G                     -
tank/share/media/audio  logicalreferenced      363G                     -
tank/share/media/audio  snapdev                hidden                   default
tank/share/media/audio  acltype                posixacl                 local
tank/share/media/audio  context                none                     default
tank/share/media/audio  fscontext              none                     default
tank/share/media/audio  defcontext             none                     default
tank/share/media/audio  rootcontext            none                     default
tank/share/media/audio  relatime               off                      default
tank/share/media/audio  com.sun:auto-snapshot  true                     inherited from tank/share

Labels: Regression

All 70 comments

Note: source and destination are not on the same filesystem, so an actual copy/unlink takes place.

Works for me. Is this some old zol version, perhaps?

$ dpkg -l zfs-dkms spl-dkms
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version                 Architecture            Description
+++-====================================-=======================-=======================-==============================================================================
ii  spl-dkms                             0.6.3-16~a3c1eb~wheezy  all                     Solaris Porting Layer kernel modules for Linux
ii  zfs-dkms                             0.6.3-26~2d9d57~wheezy  all                     Native ZFS filesystem kernel modules for Linux
$ ./test.sh 
+ df -h . ../d2
Filesystem      Size  Used Avail Use% Mounted on
rpool/test/d1    47G     0   47G   0% /rpool/test/d1
rpool/test/d2    47G     0   47G   0% /rpool/test/d2
+ dd if=/dev/urandom of=a.txt bs=32 count=1
1+0 records in
1+0 records out
32 bytes (32 B) copied, 0.000312707 s, 102 kB/s
+ cp a.txt b.txt
+ sha1sum a.txt b.txt
2ad172f4e6de75e974df4b00bd97ddeb5e1135e0  a.txt
2ad172f4e6de75e974df4b00bd97ddeb5e1135e0  b.txt
+ cp b.txt ../d2
+ sha1sum a.txt b.txt ../d2/b.txt
2ad172f4e6de75e974df4b00bd97ddeb5e1135e0  a.txt
2ad172f4e6de75e974df4b00bd97ddeb5e1135e0  b.txt
2ad172f4e6de75e974df4b00bd97ddeb5e1135e0  ../d2/b.txt
+ rm ../d2/b.txt
+ mv b.txt ../d2/b.txt
+ sha1sum a.txt ../d2/b.txt
2ad172f4e6de75e974df4b00bd97ddeb5e1135e0  a.txt
2ad172f4e6de75e974df4b00bd97ddeb5e1135e0  ../d2/b.txt
+ hexdump -C ../d2/b.txt
00000000  f7 e3 a6 5c 86 e9 40 b3  58 1f 0c cd a3 0c 15 07  |...\..@.X.......|
00000010  01 1b c6 3c 6c 6c 55 1a  f1 59 ea 4e c5 21 f7 09  |...<llU..Y.N.!..|
00000020

This happens only on one of the datasets on this pool, others are fine. On this single one it's reproducible, though.

Versions (forgot about those):
SPL: Loaded module v0.6.3-1.2
ZFS: Loaded module v0.6.3-1.2, ZFS pool version 5000, ZFS filesystem version 5
3.18.7-200.fc21.x86_64

@Lalufu that's very troubling. Is the problem with the new file persistent? That is, if you export/import the pool, does the empty file now have good contents, or is it still empty? Can you reliably reproduce this for that dataset?

Is a reboot equivalent to an export/import for this purpose? That would be easier to test than exporting the pool in a running system.

On this particular dataset the issue is reliably reproducible.

Looking at an strace of a (working) cp and a (not working) mv, the main difference seems to be that mv (after copying the file over to the destination) does the following syscalls:

utimensat() (on the destination)
flistxattr(NULL) (on the source)
flistxattr("security.selinux") (on the source)
fgetxattr("system.posix_acl_access") (on the source, fails with ENODATA)
fstat() (on the source)
fsetxattr("system.posix_acl_access") (on the destination)

@Lalufu your strace output plus the xattr=sa point at an SA corruption problem. The 0.6.3-1.2 tag is missing a62d1b0 which fixes the last known SA problem. It would be interesting to see all of the xattr-related calls from the strace of the mv command. That said, if your system has selinux enabled and you're running a ZFS-enabled selinux policy, you'll not see all of the xattr calls from strace because some of them are made internally from the kernel's selinux code.
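
One way to capture just those calls for the mv is something like the following (the syscall list is only illustrative; a full strace -f works just as well):

# trace only the attribute-related syscalls during the mv, logging to a file
# so it can be diffed against the equivalent cp trace
strace -f -o mv.xattr.strace -e trace=utimensat,flistxattr,fgetxattr,fsetxattr,fstat mv b.txt /tank/share/media/audio/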

Also, if you have used older versions of ZoL on this pool, they were likely missing other SA fixes which opens the possibility the filesystem and/or the pool is permanently corrupted.

It might be worth trying current master code or cherry-picking a62d1b0 into 0.6.3-1.2 (it applies cleanly) and see what happens. It would also be a good idea to run zdb -b <pool> (while it's exported) to make sure it can be traversed cleanly and that there's no missing space.

Yes, I thought about xattr as well. Unfortunately most of the preconditions you mentioned are there: the system is running selinux, and the pool was created on 0.6.2 as far as I can remember (is there an easy way to check this?).

I'll run the zdb call later to see what it gives, and try to build a zfs version with the patch applied.

zdb -b seems to be happy:

[root@ethan ~]# zdb -b tank

Traversing all blocks to verify nothing leaked ...

        No leaks (block sum matches space maps exactly)

        bp count:        86093402
        bp logical:    7577584837632      avg:  88015
        bp physical:   7483290201088      avg:  86920     compression:   1.01
        bp allocated:  11350897614848      avg: 131843     compression:   0.67
        bp deduped:             0    ref>1:      0   deduplication:   1.00
        SPA allocated: 11350896795648     used: 63.53%

Still need to cherry pick the patch and see if the zero filled file is still empty after a reboot.

The file is still empty after a clean reboot.

I've cherry-picked a62d1b0, and it does not make a difference (I've only rebuilt the kernel module, not user space with the patch).

I've uploaded the strace of a cp and a mv of the same file to the same destination, for comparison.

http://tara.camperquake.de/temp/cp.strace
http://tara.camperquake.de/temp/mv.strace

Those links will expire after ca. 14 days.

Hi, @Lalufu
The strace links are dead. Could you repost them?
I would like to look into it a bit.

I've reuploaded the files.

For the record:

I've upgraded to 0.6.4.1 recently, and the problem has disappeared.

I can reproduce this problem on 0.7.1. I use a gentoo derivative, and the packages are compiled on a different ZFS filesystem (mounted at /var/tmp) than the root. When I install packages, there is a random chance that some of the files will be zeroed. It seems bursty -- I can install a package a few times with it corrupting each time, and then spend a while unable to reproduce it. When it does corrupt, that too seems bursty, insofar as often it'll be just a bunch of man pages corrupted, or just a bunch of shared libraries, not a random selection of files.

I have a working theory that I'm testing. Can you set /var/tmp/portage to sync=always? It's gonna be slower than usual but I haven't seen this since.
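
For anyone who wants to try the same workaround, it's a one-liner; the dataset name below is only an example:

# assuming /var/tmp/portage is its own dataset, here called tank/var-tmp-portage
zfs set sync=always tank/var-tmp-portage
# undo later by re-inheriting the default
zfs inherit sync tank/var-tmp-portage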

edit: here's a little test script I banged out to find these corrupted files.

#!/bin/bash
# A zeroed-out library or man page loses its magic bytes, so file(1)
# reports it as plain "data" instead of an ELF object or bzip2 archive.

find /lib64/*.so* | xargs file | grep " data"
find /lib32/*.so* | xargs file | grep " data"
find /usr/lib64/*.so* | xargs file | grep " data"
find /usr/lib32/*.so* | xargs file | grep " data"

find /usr/share/man | grep bz2 | xargs file | grep "  data"

I set it to sync=always, set up a loop that recompiles nss and checks for the presence of corrupted files after each iteration, and went to bed. It ran for... probably a bit over 12 hours, without detecting any corruption. That said, with how dramatically it was slowed, it's unclear how many times it actually completed. Still, I expect this is evidence that it works.
The script I ran was:

while [[ -z $stop ]]; do emerge -1 dev-libs/nss; for f in $(equery --no-color files --filter=obj,conf,cmd,doc,man,info dev-libs/nss 2>/dev/null); do tr -d '\0' < $f | read -n 1 || stop=1; done; done

This is slower than yours, but more precise: it will only detect files that are fully zeroed. When doing broader testing using find, it's useful to pass find ... '!' -size 0 to exclude empty files, but that isn't necessary here because the test package doesn't happen to install any.

Hmm, at least I'm not the only one who sees this. genlop -t nss should tell you how many times it ran.

Question now is, what causes these files to get zero filled? @behlendorf @tuxoko any ideas?

@klkblake Since you seem to be able to reproduce this problem, could you perform a zdb -ddddd <pool>/<fs> <inum> (<inum> is the inode number from "ls -i") on one of the bogus files and post the result. It would be interesting to see whether the file is sparse. It might be useful to know your kernel version as well. The most obvious place I can think of where ZFS would produce zeroes would be fallocate.
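
For reference, the inode number comes from ls -i, so the invocation would look roughly like this (pool, dataset, and path below are placeholders):

# the first column of ls -i is the inode number of the suspect file
ls -i /usr/lib64/suspect.so
# pass it to zdb along with the dataset the file lives on
zdb -ddddd tank/rootfs <inum>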

Currently running kernel version 4.12.13.
output of ls -lsh /usr/lib64/debug/usr/lib64/libnssckbi.so.debug:

512 -rw-r--r-- 1 root root 557K Sep 22 04:39 /usr/lib64/debug/usr/lib64/libnssckbi.so.debug

lsattr -l and getfattr -d both give nothing
the complete output of zdb -ddddd mjolnir/mjolnir 4300130:

Dataset mjolnir/mjolnir [ZPL], ID 75, cr_txg 16, 271G, 2049944 objects, rootbp DVA[0]=<0:959ccfc000:1000> DVA[1]=<0:f1b5900000:1000> [L0 DMU objset] sha512 uncompressed LE contiguous unique double size=800L/800P birth=2274219L/2274219P fill=2049944 cksum=57d50f64245d8bb3:a6ce60c66e1d4807:80c60b8ed12fae6b:286c78b732cf6303

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
   4300130    1   128K   128K      0     512   128K    0.00  ZFS plain file
                                               168   bonus  System attributes
        dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    /usr/lib64/debug/usr/lib64/libnssckbi.so.debug
        uid     0
        gid     0
        atime   Fri Sep 22 04:39:32 2017
        mtime   Fri Sep 22 04:39:16 2017
        ctime   Fri Sep 22 04:39:17 2017
        crtime  Fri Sep 22 04:39:17 2017
        gen     2267470
        mode    100644
        size    569984
        parent  1015946
        links   1
        pflags  40800000004
Indirect blocks:


@klkblake Somewhat as expected, the file is completely sparse. Have you got any of the following enabled on the dataset: compression, dedup, non-standard checksumming? Are the corrupted files always ".so.debug" files?

Compression is on, deduplication is off, checksum is set to sha512. They aren't always .so.debug. I also have seen .so and .bz2 (compressed man pages). I might have seen executables at one point? But also, I've been testing this by installing packages repeatedly, so this is more or less what you would expect.

@klkblake The reason I asked about those specific properties is because they open other avenues (than traditional lseek and hole-punching) for creating sparse files or otherwise affecting the write path (nop-write, zero suppression, etc.). Could you please try the same test with checksum=on (fletcher4) to rule out nop-write. Also, without compression enabled.

FWIW I still use default checksumming. I've only ever seen this on libraries and man pages.

Let's decisively rule out NOP writes as a possible cause by disabling them entirely. This can be done by setting the module option zfs_nopwrite_enabled=0 at any time and they'll be disabled immediately.
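
Concretely, the tunable can be flipped at runtime without reloading the module:

# disable nop-write immediately
echo 0 > /sys/module/zfs/parameters/zfs_nopwrite_enabled
# verify the current value
cat /sys/module/zfs/parameters/zfs_nopwrite_enabled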

@bunder2015
Does this only happen on the 0.7 branch? If possible, can you try this on the 0.6.5 branch?
Since the only reports we have are on 0.6.3 and 0.7, I wonder if there's a regression.

My laptop is currently running 08de8c16. I don't ever remember seeing it on 0.6.5.x, but I could probably try downgrading on my laptop. I spent an hour compiling nss and an hour with nfs-utils the other day and couldn't get it to trigger. It's very sporadic. edit: I disabled nopwrite and rebuilt qtgui for an hour tonight and couldn't get it to trigger either.

I set zfs_nopwrite_enabled=0 and had zeroed files on the third try compiling nss (mostly header files, some .so.debug).

Just my luck, I turned off sync for /var/tmp on my fw/router box running 0.7.0 release and genkernel went bananas. Turns out the .patch file got zero-filled.

I wonder if this package/system combination might be a good test candidate for me for loop testing since I haven't had much luck with my other tests recently. I'll have to do more testing over the weekend. I'll start with nopwrite, then move on to disabling compression.

firewall ~ # genkernel initramfs --kernel-config=/usr/src/linux/.config
* Gentoo Linux Genkernel; Version 3.5.2.1
* Running with options: initramfs --kernel-config=/usr/src/linux/.config

* Using genkernel.conf from /etc/genkernel.conf
* Sourcing arch-specific config.sh from /usr/share/genkernel/arch/x86_64/config.sh ..
* Sourcing arch-specific modules_load from /usr/share/genkernel/arch/x86_64/modules_load ..
* Linux Kernel 4.9.16-gentoo for x86_64...
* .. with config file /usr/src/linux-4.9.16-gentoo/.config
* busybox: >> Applying patches...
patching file console-tools/openvt.c
Hunk #1 succeeded at 147 (offset 23 lines).
*           - 1.18.1-openvt.diff
patching file modutils/modprobe.c
Hunk #1 succeeded at 444 (offset 31 lines).
Hunk #2 succeeded at 492 (offset 32 lines).
*           - busybox-1.20.2-modprobe.patch
patching file util-linux/mdStart.c
*           - busybox-1.26.0-mdstart.patch
patch: **** Only garbage was found in the patch input.
patch: **** Only garbage was found in the patch input.
patch: **** Only garbage was found in the patch input.
patch: **** Only garbage was found in the patch input.
patch: **** Only garbage was found in the patch input.
* ERROR: could not apply patch /usr/share/genkernel/patches/busybox/1.27.2/busybox-1.7.4-signal-hack.patch for busybox-1.27.2

Just upgraded from kernel 4.9.34 with zfs 0.7.1 to kernel 4.4.12 (gentoo btw) with zfs 0.7.2. The first install I did created /opt/bin/skypeforlinux with \0's in it. As far as I know I never ran into this before.
dedup is off
compression is lz4
checksum is on

The 2nd install was correct:
quad3400 ~ # file /opt/bin/skypeforlinux
POSIX shell script, ASCII text executable

I'll try recreating the issue and follow the options mentioned above.

Setting compression=off (was inherited lz4) on the source and destination filesystem did not help.
Setting sync=always (was inherited standard) on both filesystems did not help.
Setting checksum=sha256 (was inherited on) on both filesystems did not help.
echo 0 > /sys/module/zfs/parameters/zfs_nopwrite_enabled did not help.
Setting xattr=off (was inherited sa) on both filesystems did not help.

It's easy for me to reproduce. In each case it took about 2 times emerging net-im/skypeforlinux to trigger it.

Since I don't think anyone has a concrete idea about this issue, can someone try to bisect it? I know it's going to be a long and tedious process, but we should really try to move this issue forward.

We might be able to use the fsx test utility to make bisecting this easier. For those unfamiliar with it, fsx is a torture test which is part of xfstests. It generates a mixed workload of IO operations (read, write, mmap-read, mmap-write, fallocate, hole punching, fsync, etc) and validates the resulting file. It's proven to be useful in the past in uncovering problems.

Having a test case other than emerge would go a long way towards identifying the root cause. You can run fsx directly from the xfstests source tree.

# Build xfstests
$ git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
$ cd xfstests-dev
$ make

# Create a test dataset and set 'sync=disabled'.  Disabling sync is not required
# to trigger the issue but it has been reported in #6768 to make it more likely.
$ zfs create tank/fsx-test
$ zfs set sync=disabled tank/fsx-test
$ ltp/fsx -l 262144 -o 65536 -S 0 -N 100000 /tank/fsx-test/fsx_std_mmap 

How did sync=disabled ever become a precursor to this particular issue? The original poster (@Lalufu) clearly had sync=standard.

@dweeezil I only included sync=disabled in the example above because based on the comments I've seen it makes it easier to hit the issue. My understanding is it's not required, I'll update my comment above to make that clear.

I let it run for a while on my laptop but I couldn't see any problems... but I see this so rarely that it might not mean anything. Just a note for the other gentoo testers, this requires libaio and xfsprogs.

@behlendorf OK, that makes sense. Presumably based on the already-referenced #6768. I suppose it would be worth asking the author of that issue whether they could confirm the corrupted ELF files are completely sparse, in which case the issues likely have the same root cause.
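
A quick way to check that without zdb is to compare a file's apparent size with the blocks actually allocated for it (the path below is a placeholder):

# a fully sparse file reports its normal size but few or no allocated blocks
stat -c '%n: %s bytes, %b blocks allocated' /path/to/suspect.so
du -h /path/to/suspect.so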

@dweeezil

Dataset gtank/gentoo/sysroot [ZPL], ID 49, cr_txg 89, 60.3G, 636966 objects, rootbp DVA[0]=<0:58678c7000:1000> DVA[1]=<0:70000d0000:1000> [L0 DMU objset] fletcher4 uncompressed LE contiguous unique double size=800L/800P birth=2648617L/2648617P fill=636966 cksum=ca7f83c5b:f39e02bcf32:b2a36f536f471:605e83d8c260015

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    209312    1   128K   128K      0     512   128K    0.00  ZFS plain file
                                               168   bonus  System attributes
        dnode flags: USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    /root/libgwenviewlib.so.4.97.0
        uid     0
        gid     0
        atime   Wed Oct 18 17:06:18 2017
        mtime   Fri Oct 13 16:56:45 2017
        ctime   Fri Oct 13 16:56:45 2017
        crtime  Fri Oct 13 16:56:45 2017
        gen     2544908
        mode    100755
        size    1383992
        parent  11
        links   1
        pflags  40800000104
Indirect blocks:

Hope it helps. I also verified it's zeroed. Should I close #6768 as a dup?

It would seem that way. Have you had any luck with fsx? I haven't yet, I let it run for another half hour again this morning.

Using the given commands for fsx:

Seed set to 17995
main: filesystem does not support fallocate mode 0, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
skipping zero size read
truncating to largest ever: 0x18b4d
truncating to largest ever: 0x199f8
truncating to largest ever: 0x2bf78
truncating to largest ever: 0x3e504
truncating to largest ever: 0x3e5fd
truncating to largest ever: 0x3fd5c
truncating to largest ever: 0x3ff79
truncating to largest ever: 0x3ffdc
All 100000 operations completed A-OK!

Looks like I wasn't able to detect any issue with it? I'm not terribly surprised -- the key thing is copying across volumes AFAICT and fsx doesn't do that.

Ran it for 109589400 ops and still didn't find anything, which makes me strongly suspect that this is not able to reproduce the issue.

My updates:

  1. It doesn't only affect executables, I've had this happen to a text file (/usr/lib64/cmake/KF5ItemModels/KF5ItemModelsConfig.cmake) as well.
  2. It also happens with sync=standard. I get the impression that it doesn't happen as often as with sync=disabled, but I haven't properly measured the occurrence rate.

Does it make sense for me to try fsx as well? It does sound like this only happens when files are moved across datasets and if fsx doesn't test this case I don't see any point in my trying it too.

Has anyone ever run into this by having copy/move operations not handled by portage? This is a long shot, of course, but could it be related to something specific used by the "qmerge" function/stage in portage (the one implementing the merge of the package image into the live filesystem)?

@gedia
It's probably not worth trying fsx anymore at this point.

@klkblake

the key thing is copying across volumes AFAICT and fsx doesn't do that.

Is this confirmed? Do you know if it only shows up when copying from a certain source filesystem type, like only from zfs to zfs?

It's not confirmed that that is the cause, that was a premature assumption on my part, sorry. With regard to filesystem types, I only have ZFS (and a fat32 EFI partition) on my machines currently, so I can't easily test other configurations. Unless you think a ZFS volume with a different filesystem on top would be worth trying?

It's probably not worth trying fsx anymore at this point.

Right, I was hoping a stress test would expose the issue but it clearly hasn't.

the key thing is copying across volumes AFAICT and fsx doesn't do that.

That does match the original description of the issue. Has anyone tried reproducing this with the original reproducer from the top comment? A condensed version is sketched below.
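
For convenience, the top-comment reproducer boiled down to a script looks roughly like this (the two directories must be mountpoints of different datasets in the same pool):

#!/bin/bash
# cross-dataset mv reproducer, condensed from the top comment
set -e
src=/tank/d1
dst=/tank/d2
dd if=/dev/urandom of="$src/a.txt" bs=32 count=1 2>/dev/null
cp "$src/a.txt" "$src/b.txt"
mv "$src/b.txt" "$dst/"
sha1sum "$src/a.txt" "$dst/b.txt"      # differing checksums indicate a hit
hexdump -C "$dst/b.txt"                # a hit shows nothing but zero bytes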

Do we know anything else for certain? The initial report of this issue was from Fedora 21 using zfs-0.6.3 and a 3.18 kernel. After updating to zfs-0.6.4 the issue "went away" without any clear understanding of why. Since the symptoms are the same we've been assuming these are the same issue.

Then from what I can tell _all_ of the recent reports, which started in September, are from Gentoo (or a derivative distribution) using the 0.7.0 release or newer. Which is a tad suspicious. Am I right in assuming the reports of this issue were from the Gentoo-provided packages? If so it's probably worth taking a look at what kernel config options are being used and what, if any, patches are being carried against their kernel and zfs versions.

On Fri, Oct 20, 2017 at 10:45:26AM -0700, Brian Behlendorf wrote:

Do we know anything else for certain? The initial report of this
issue was from Fedora 21 using zfs-0.6.3 and a 3.18 kernel. After
updating to zfs-0.6.4 the issue "went away" without any clear
understanding of why. Since the symptoms are the same we've been
assuming these are the same issue.

I still run the same system that showed the issue all the way back, now
upgraded to Fedora 25 and 0.7.2. As far as I can tell the issue has
never reappeared for me.

I decided to bang out another test script, I think it's faithful enough to the original test case.

https://gist.github.com/bunder2015/6913c0895e25656788703c2ebc1b00f0

I'm going to let it run on my laptop overnight and see what happens.

@behlendorf I tried the reproducer in the OP and was unable to reproduce it.

AFAICT, all the gentoo reports are from people who have /var/tmp and / on different ZFS filesystems. During package installation, portage extracts and compiles the package in /var/tmp and then copies all the files across to their final installed locations. I suspect that we are hitting a race condition when copying multiple files close together in time, possibly to the same directory? I don't recall seeing this happen to a file that was the only file installed in that directory, but I may be misremembering.
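
A crude way to exercise that theory outside of portage might be a loop that writes a batch of files on one dataset, immediately moves them to another, and verifies the checksums afterwards (paths and counts below are arbitrary):

#!/bin/bash
# write a batch of files, mv them across datasets right away, then verify
src=/tank/d1
dst=/tank/d2
for run in $(seq 1 500); do
    rm -rf "$dst/batch"
    mkdir -p "$dst/batch"
    for f in $(seq 1 50); do
        dd if=/dev/urandom of="$src/f$f" bs=64K count=2 2>/dev/null
    done
    (cd "$src" && sha1sum f* > sums)
    mv "$src"/f* "$dst/batch/"
    (cd "$dst/batch" && sha1sum -c --quiet "$src/sums") || echo "mismatch in run $run"
done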

w.r.t patches, the userland utilities are only patched to make paths of a few utilities match where they are installed on gentoo. The kernel module forces the following settings:

!DEBUG_LOCK_ALLOC
EFI_PARTITION
IOSCHED_NOOP
MODULES
!PAX_KERNEXEC_PLUGIN_METHOD_OR
ZLIB_DEFLATE
ZLIB_INFLATE

(I have the module built into the kernel, so my kernel has MODULES disabled.)
Looks like the ebuild removes the zpios directory from the Makefile unless the debug useflag is set.
SPL forces a similar set of settings:

!DEBUG_LOCK_ALLOC
MODULES
KALLSYMS
!PAX_KERNEXEC_PLUGIN_METHOD_OR
!PAX_SIZE_OVERFLOW
ZLIB_DEFLATE
ZLIB_INFLATE

It also patches it to fix a path to a utility and removes the splat directory from the Makefile unless debugging.

The set of patches for the standard gentoo kernel is found here for the release I'm using: https://dev.gentoo.org/~mpagano/genpatches/patches-4.12-13.htm
(and here more generally: https://dev.gentoo.org/~mpagano/genpatches/index.htm)

Bisecting pointed me to https://github.com/zfsonlinux/zfs/commit/66aca24730adfb2e3875e5148a03dd1fb435d438

I used
while [[ -z $stop ]]; do emerge -1 net-im/skypeforlinux; for f in $(equery --no-color files --filter=obj,conf,cmd,doc,man,info net-im/skypeforlinux 2>/dev/null); do tr -d '\0' < $f | read -n 1 || stop=1; done; done
and assumed a version was good if it compiled skype successfully 10 times in a row.
The bad ones always failed within 3 times.

I thought it might have been that patch, but adjusting the tunable it provided didn't seem to make a difference. See #6144.

I hit this bug after upgrading to 0.7.0 (I don't remember if 0.7.0-rc3 had it because I had bigger problems back then #5742 ). Downgrading to 0.6.5.11 fixes the problem (which I did after hitting this bug too often). Yesterday I upgraded to 0.7.3 hoping that this had been fixed, but immediately after an emerge -anDNu @world, google-chrome and nvidia-drivers got corrupted.

/opt/google/chrome/chrome_200_percent.pak: only nulls
/opt/google/chrome/libwidevinecdmadapter.so: only nulls
/usr/lib64/libnvidia-ml.so.384.90: only nulls

@behlendorf I think you were right in https://github.com/zfsonlinux/zfs/issues/3125#issuecomment-338276027: it seems to be a gentoo-only issue.
After reading this bug report I dug in to portage code and found these:
https://github.com/gentoo/portage/blob/83b44fc48c07a4d8f0b6cb361f61b1493965fcb2/pym/portage/util/file_copy/__init__.py
https://github.com/gentoo/portage/blob/83b44fc48c07a4d8f0b6cb361f61b1493965fcb2/src/portage_util_file_copy_reflink_linux.c
I think this issue only occurs on Gentoo because of the way portage copies files from /var/tmp/portage to /: it uses the in-kernel copy operations from the reflink_linux module linked above.

I made a little tool to help me detect \0 only files: https://gist.github.com/hunbalazs/0fcedfaa2a63ce5e1bb0953641a0e76c
I hope it will be useful to you too until this bug is fixed.

@hunbalazs This doesn't affect me, but I am still on the Linux 4.1.y series. I'll see if I hit it when I upgrade my kernel sometime in the next few days.

On Sat, Oct 21, 2017 at 08:16:47AM -0700, hunbalazs wrote:

After reading this bug report I dug in to portage code and found these:
https://github.com/gentoo/portage/blob/83b44fc48c07a4d8f0b6cb361f61b1493965fcb2/pym/portage/util/file_copy/__init__.py
https://github.com/gentoo/portage/blob/83b44fc48c07a4d8f0b6cb361f61b1493965fcb2/src/portage_util_file_copy_reflink_linux.c
I think this issue only occurs on Gentoo because the way portage copies files from /var/tmp/portage to /, it is using in-kernel reflink_linux copy operations.

This would be very easy to check by just patching portage to always use shutil.copyfile and running it.

I created a branch from 0.7.3 with https://github.com/zfsonlinux/zfs/commit/66aca24730adfb2e3875e5148a03dd1fb435d438 reverted at https://github.com/janlam7/zfs/tree/issue/3125

On my machine it prevents the issue from occurring.

@ryao has asked me to file a gentoo bug on this issue, https://bugs.gentoo.org/635002

@Lalufu indeed

--- /usr/lib64/python3.4/site-packages/portage/util/file_copy/__init__.py.orig  2017-10-21 17:49:36.707482499 +0200
+++ /usr/lib64/python3.4/site-packages/portage/util/file_copy/__init__.py   2017-10-21 17:50:13.168483047 +0200
@@ -30,7 +30,8 @@
        _file_copy(src_file.fileno(), dst_file.fileno())


-if _file_copy is None:
-   copyfile = shutil.copyfile
-else:
-   copyfile = _optimized_copyfile
+#if _file_copy is None:
+#  copyfile = shutil.copyfile
+#else:
+#  copyfile = _optimized_copyfile
+copyfile = shutil.copyfile

I'll try that. I tried to reproduce by using portage's _optimized_copyfile over and over but my short stress test didn't give any result. I'll try to continue testing overnight.

News from the gentoo side...

Looking at the sendfile documentation, it looks like portage's interpretation of the *offset parameter is incorrect, since that variable represents the input file offset rather than the output file offset. I'll fix it to use sendfile's return value to measure bytes written to the output file.

Since I can't reproduce this very easily, can someone test this portage patch and report any changes? https://bugs.gentoo.org/635002#c6

On my system the patch in https://bugs.gentoo.org/635002#c6 does not prevent this issue from occurring.

Using shutil.copyfile ( @Lalufu ) and https://bugs.gentoo.org/635002#c6 ( @bunder2015 ) did not help. Mounting /var/tmp/portage as tmpfs fixes it.
The interesting thing is that with @janlam7 's method I could not reproduce the issue by installing skypeforlinux multiple times, only by using nvidia-drivers.
I'll try with linux 4.1.x ( @ryao ) and with @janlam7 's branch

With kernel 4.1.43-gentoo-r1 and zfs v0.7.2-r0-gentoo I can reproduce it.

The gentoo folks have pushed out a revision to the patch I linked yesterday, same URL.

@bunder2015 I'm still able to reproduce this with version 5 of said patch and sync=disabled

Would reverting the dirty check fix the issue?

diff --git a/module/zfs/dmu.c b/module/zfs/dmu.c
index b3cf10c..6569131 100644
--- a/module/zfs/dmu.c
+++ b/module/zfs/dmu.c
@@ -2054,12 +2054,10 @@ dmu_offset_next(objset_t *os, uint64_t object, boolean_t hole, uint64_t *off)
    /*
     * Check if dnode is dirty
     */
-   if (dn->dn_dirtyctx != DN_UNDIRTIED) {
-       for (i = 0; i < TXG_SIZE; i++) {
-           if (!list_is_empty(&dn->dn_dirty_records[i])) {
-               clean = B_FALSE;
-               break;
-           }
+   for (i = 0; i < TXG_SIZE; i++) {
+       if (list_link_active(&dn->dn_dirty_link[i])) {
+           clean = B_FALSE;
+           break;
        }
    }

Sorry for not keeping an eye on this during the week... Got word back from gentoo again yesterday after their last round of portage patches.

A problem in the lseek SEEK_DATA/SEEK_HOLE implementation might cause this.

Has anyone had a chance to test @tuxoko's patch?

@tuxoko 's patch works for me.

@tuxoko good thought, that would explain things. Better to solely check the dnode for dirty status. It would be great if we could get a few other users to verify the fix and a PR opened.

@tuxoko With tuxoko's patch I haven't been able to reproduce this bug so far either. Thanks!

@behlendorf
To be honest, I don't know what exact check we should use. The new check obviously would cause this regression. However, there might be some problem with the old one that warranted a change. Do you know what the reason for this change was in the first place? Also, do you think we should consider mmap here?

@tuxoko we should use the one you proposed, which is what was here prior to 66aca24730adfb2e3875e5148a03dd1fb435d438 for SEEK_HOLE. This was a case of over-optimizing the dirty check, which slipped through since it's _almost_ always right. As for mmap, we should be OK here since zpl_writepage() will dirty the object with dmu_write() like any normal write.

@behlendorf
But wouldn't zpl_writepage only happen when writing back dirty pages?
There would still be a window between dirtying a page and zpl_writepage. Or do we not need to care about that for SEEK_HOLE/SEEK_DATA?

@tuxoko my understanding from the msync(2) man page is there's no guarantee made to applications about this. We could go the extra mile here and conservatively force an msync when zp->z_is_mapped is set.

