Needs further analysis:
_porter 28 not complete
cp: cannot create regular file '/var/tmp/check-gocryptfs/scratchdir/sub_28/origin/file_548': No such file or directory
https://github.com/rfjakob/gocryptfs/blob/master/Documentation/XFSTESTS.md#generic273
273.full test log: https://gist.github.com/rfjakob/0d677814be13b6992ddd04829630767f
Same failure with gocryptfs v1.6.1
$ sudo ./check-gocryptfs generic/273
gocryptfs v1.6.1; go-fuse v20170619-66-g6df8ddc; 2019-01-05 go1.11.4
fuse-xfstests gocryptfs-2018-08-18/d8111119
Sat Jan 5 15:54:13 UTC 2019
FSTYP -- fuse.gocryptfs
PLATFORM -- Linux/x86_64 brikett 4.19.12-301.fc29.x86_64
MKFS_OPTIONS -- /var/tmp/check-gocryptfs/scratchdev
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /var/tmp/check-gocryptfs/scratchdev /var/tmp/check-gocryptfs/scratchdir
generic/273 - output mismatch (see /home/jakob.donotbackup/code/fuse-xfstests/results//generic/273.out.bad)
--- tests/generic/273.out 2018-01-20 14:29:39.092452016 +0100
+++ /home/jakob.donotbackup/code/fuse-xfstests/results//generic/273.out.bad 2019-01-05 16:55:48.660593249 +0100
@@ -2,3 +2,70 @@
------------------------------
start the workload
------------------------------
+_porter 8 not complete
+_porter 28 not complete
+_porter 65 not complete
+_porter 33 not complete
...
(Run 'diff -u tests/generic/273.out /home/jakob.donotbackup/code/fuse-xfstests/results//generic/273.out.bad' to see the entire diff)
Ran: generic/273
Failures: generic/273
Failed 1 of 1 tests
Runtime was 95 seconds
Oh, I see, this is the first time this test runs. Last time it was skipped:
generic/273 [not run] Filesystem fuse.gocryptfs not supported in _scratch_mkfs_sized
The core of xfstests generic/273 is 100 parallel "cp -r" invocations. Attached.
mass_cp.tar.xz.zip
We seem to trigger the ".deleted" condition in go-fuse
Added some debug output:
Jan 05 18:03:46 brikett gocryptfs[11665]: go-fuse: GetPath: deleted node? path=""
Glad you were able to track this down. However, I do not see much similarity between the comment in go-fuse and the stress test:
if walkUp != n.pathFs.root.Inode() {
	// This might happen if the node has been removed from
	// the tree using unlink, but we are forced to run
	// some file system operation, because the file is
	// still opened.
	// TODO - add a deterministic disambiguating suffix.
	return ".deleted"
}
The test does not involve any deletion operation, as far as I can see?!
How does the stress test in commit fb705f99782f222e72c42b23be9263bbd302a4d4 fail for you? I executed it a couple of times without any error.
Fails like this:
0 jakob@brikett:/var/tmp/b$ /home/jakob/go/src/github.com/rfjakob/gocryptfs/tests/stress_tests/parallel_cp.sh
deleting old files
creating files with dd
cp starting: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
cp finished: cp: cannot open 'origin/file_106' for reading: No such file or directory
cp: cannot open 'origin/file_218' for reading: No such file or directory
cp: cannot open 'origin/file_249' for reading: No such file or directory
[...]
cp: cannot open 'origin/file_716' for reading: No such file or directory
cp: cannot open 'origin/file_716' for reading: No such file or directory
cp: cannot create regular file 'sub_51/origin/file_608': No such file or directory
85 39 76 25 41 68 69 97 2 40 24 90 54 20 44 35 18 29 26 46 36 8 91 7 87 21 83 63 52 67 1 93 9 65 47 5 81 77 4 60 98 82 32 75 14 80 79 57 49 11 96 95 70 56 59 13 17 64 33 88 72 73 100 94 53 55 50 42 34 6 74 62 10 78 15 23 99 38 28 43 86 92 22 31 71 84 89 58 27 3 19 45 30 61 12 37 48 51 66 16
Runtime was 35 seconds
However, it does not always fail; it seems to somehow depend on system load.
I think the "delete operation" in this case is actually a FORGET. The kernel sends a FORGET when it evicts a file from the inode table, so maybe it only happens under memory pressure.
Running this during the tests seems to make reproducing the problem easier:
while sleep 1 ; do echo 3 > /proc/sys/vm/drop_caches ; done
Still no luck so far. I have tested both with tmpfs and ext4 file systems, and with and without dropping caches in parallel.
I am using go-fuse commit 7954c2167d2a0ef04a28577311f3b5698639a385 at the moment, but there are two more cache-related commits in the upstream branch. Did they maybe cause a regression?
Ah, I think I know what's going on.
go-fuse commit
7954c2167d2a0ef04a28577311f3b5698639a385 -> works
9d9c4ddf2e89545226f2f9cbe6dd5b9fbaf913e6 -> doesn't work
f520193c2228d18ac98d8d83792c470803f26be5 (latest) -> works
Are you maybe using the broken commit?
I was on 6df8ddc and have now updated to f520193c:
gocryptfs v1.7-rc1-17-gbd055ed-dirty; go-fuse v20170619-75-gf520193; 2019-01-05 go1.11.4
After a few tries, I can still repro:
0 jakob@brikett:/tmp/tmp.mhwVVvfXd1/b$ ~/go/src/github.com/rfjakob/gocryptfs/tests/stress_tests/parallel_cp.sh
deleting old files
creating files with dd
cp starting: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
cp finished: cp: cp: cannot open 'origin/file_310' for readingcannot open 'origin/file_310' for readingcp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
[...]
Kernel probably also matters, I'm on 4.19.12.
I'm on 4.18.0-3-amd64 at the moment (Debian Buster). On closer inspection, the "No such file or directory" regression I encountered with 9d9c4ddf said that the file was not found during stat, not during open, so it might be unrelated. For me, the latest go-fuse works pretty reliably. Just to be sure, let me run it 100 times in a loop and check if there are any random failures.
With lots of retries and a smaller delay of 0.1 s between cache flushes, I was also able to reproduce it on my machine. At the same time, gocryptfs complained with:
2019/01/05 19:13:02 Unimplemented opcode INTERRUPT
2019/01/05 19:13:02 writer: Write/Writev failed, err: 2=no such file or directory. opcode: INTERRUPT
Do you also get the same error?
I get these "INTERRUPT" messages every now and then, yes, always have. I think they are for aborting a running request, which is not implemented in go-fuse, but otherwise harmless.
The INTERRUPT messages seem to be unrelated and probably were just caused by programs aborting due to the failed cp operation. For me the key to reproduce it was the smaller delay between cache flushes. The error occurs both with the latest go-fuse and the version two commits ago.
BTW, because I only found it recently: Figure 2 in this paper is great:
https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
Are you able to reproduce the issue with -sharedstorage? I was not able to reproduce it with that switch yet, but it is also faster by a factor of ~7 (when using tmpfs).
With -sharedstorage it runs... Faster!?
@rfjakob Thanks for the USENIX paper. Also, thanks to you and @slackner for your continuous improvements. Never before have I used encrypted home folders with so little effort, and so few problems. Great work!
With -sharedstorage it runs... Faster!?
Yes, at least for tmpfs and for this specific stress-test.
Did you already try to reproduce this issue with the loopback filesystem of go-fuse?
Not this time, no. But I hit a similar issue some time ago, and it reproduced on both loopback and gocryptfs. It was ultimately fixed with changes in go-fuse: https://github.com/hanwen/go-fuse/issues/99
That -sharedstorage runs faster could be because it disables hard link tracking, which saves some overhead:
https://github.com/rfjakob/gocryptfs/blob/fcaca5fc94d981aa637beb752edc8cb3c2265e96/mount.go#L301
However, the massive factor of 7 speedup suggests that there is some optimization potential.
Interesting section of gocryptfs -fusedebug -debug: https://gist.github.com/rfjakob/f2cdcc121a41088cb1e9075185ca57bc
I don't understand where things go south.
2019/01/20 17:51:49 rx 83920: LOOKUP i1 ["sub_42"] 7b
Lookup "sub_42"
2019/01/20 17:51:49 tx 83920: OK, {i100 g22998 tE=1s tA=1s {M040775 SZ=760 L=2 1026:1026 B0*4096 i0:116781415 A 1548003103.035988 M 1548003109.476006 C 1548003109.476006}}
Found "sub_42", assigned id 100
2019/01/20 17:51:49 rx 84206: LOOKUP i100 ["file_130"] 9b
Lookup "sub_42/file_130"
FS.GetAttr(".deleted/file_130")
go-fuse forgot about "sub_42" and now calls it ".deleted"
Complete log this time, and I patched go-fuse so it does not recycle ids: https://gist.githubusercontent.com/rfjakob/ee4fbe522f7ae90ca4369a10bc0535bb/raw/a08212b601d0cd8590e58838a6a8cd602a9a9eb7/parallel_cp_norecycle.log
Output was:
$ ( cd b && ~/go/src/github.com/rfjakob/gocryptfs/tests/stress_tests/parallel_cp.sh )
deleting old files
creating files with dd
cp starting: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
cp finished: cp: cannot create regular file 'sub_56/file_116': No such file or directory
cp: cannot create regular file 'sub_56/file_117': No such file or directory
cp: cannot create regular file 'sub_56/file_118': No such file or directory
cp: cannot create regular file 'sub_56/file_119': No such file or directory
cp: cannot create regular file 'sub_56/file_12': No such file or directory
cp: cannot create regular file 'sub_56/file_120': No such file or directory
cp: cannot create regular file 'sub_56/file_121': No such file or directory
cp: cannot create regular file 'sub_56/file_122': No such file or directory
cp: cannot create regular file 'sub_56/file_123': No such file or directory
cp: cannot create regular file 'sub_56/file_124': No such file or directory
cp: cannot create regular file 'sub_56/file_125': No such file or directory
cp: cannot create regular file 'sub_56/file_126': No such file or directory
^C
Patch to go-fuse was:
$ git diff
diff --git a/fuse/nodefs/handle.go b/fuse/nodefs/handle.go
index 8bad21d..6a45e23 100644
--- a/fuse/nodefs/handle.go
+++ b/fuse/nodefs/handle.go
@@ -83,7 +83,7 @@ func (m *portableHandleMap) Register(obj *handled) (handle, generation uint64) {
return obj.handle, obj.generation
}
// Create a new handle number or recycle one on from the free list
- if len(m.freeIds) == 0 {
+ if true || len(m.freeIds) == 0 {
obj.handle = uint64(len(m.handles))
m.handles = append(m.handles, obj)
} else {
Yesterday I spent some more time looking at this bug, and I still suspect that you are seeing a different error than what I get. While I was able to reproduce similar output, I never saw any access to .deleted. Does the following debug code maybe reveal anything useful for you? http://ix.io/1z6u
BTW, did you already check whether disabling ClientInodes has the same effect for you (improving performance and getting rid of the errors)? In the beginning I suspected it had to do with a couple of places in go-fuse checking conditions outside of locks (see http://ix.io/1z6K, which adds a couple of panics), but I was never able to trigger any of them.
I added the patch from http://ix.io/1z6u , but the panics do not trigger. Also, adding -sharedstorage did NOT fix the problem.
Whoops, just hit one:
2019/01/28 18:08:46 Forgot the wrong node? node{36184} vs node{0}
panic: Forgot the wrong node? node{36184} vs node{0}
goroutine 222580 [running]:
log.Panicf(0x685219, 0x1f, 0xc0055c4bb8, 0x2, 0x2)
/usr/local/go/src/log/log.go:333 +0xda
github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).forgetUpdate(0xc00007acc0, 0x8d1b, 0x1)
github.com/hanwen/go-fuse/fuse/nodefs/fsconnector.go:159 +0x3c4
github.com/hanwen/go-fuse/fuse/nodefs.(*rawBridge).Forget(0xc00007acc0, 0x8d1b, 0x1)
github.com/hanwen/go-fuse/fuse/nodefs/fsops.go:124 +0x3f
github.com/hanwen/go-fuse/fuse.doBatchForget(0xc0000ac000, 0xc0047acfc0)
github.com/hanwen/go-fuse/fuse/opcode.go:322 +0xff
github.com/hanwen/go-fuse/fuse.(*Server).handleRequest(0xc0000ac000, 0xc0047acfc0, 0xc0047acfc0)
github.com/hanwen/go-fuse/fuse/server.go:431 +0x26b
github.com/hanwen/go-fuse/fuse.(*Server).loop(0xc0000ac000, 0xc004afc901)
github.com/hanwen/go-fuse/fuse/server.go:403 +0x18f
created by github.com/hanwen/go-fuse/fuse.(*Server).readRequest
github.com/hanwen/go-fuse/fuse/server.go:291 +0x2d8
https://github.com/rfjakob/go-fuse/commit/edb435ebd4a2513e6057f330fa62f83f9bf22a08 seems to fix the problem here.
If the problem is that we return forgotten nodes to the kernel, how can we be sure that it doesn't happen anymore with this patch? There could be an arbitrary delay between receiving a message and processing it. Also, there are batch-forgets, which will always take more time to process.
xfstests looks good: https://gist.github.com/rfjakob/0537715c1264a6f494680e77d83f3882, generic/273 passes.
@slackner the problem is that LOOKUP and FORGET operate on two data structures (the nodeid table and the directory tree). Between updating the first and updating the second, the state is inconsistent.
hanwen posted an example here: https://github.com/hanwen/go-fuse/issues/168#issuecomment-459740379
Fixed by https://github.com/hanwen/go-fuse/commit/d0fca860a5759d17592becfa1b8e5b1bd354b24a , merged into go-fuse.