Needs further analysis:
_porter 28 not complete
cp: cannot create regular file '/var/tmp/check-gocryptfs/scratchdir/sub_28/origin/file_548': No such file or directory
https://github.com/rfjakob/gocryptfs/blob/master/Documentation/XFSTESTS.md#generic273
273.full test log: https://gist.github.com/rfjakob/0d677814be13b6992ddd04829630767f
Same failure with gocryptfs v1.6.1
$ sudo ./check-gocryptfs generic/273
gocryptfs v1.6.1; go-fuse v20170619-66-g6df8ddc; 2019-01-05 go1.11.4
fuse-xfstests gocryptfs-2018-08-18/d8111119
Sat Jan 5 15:54:13 UTC 2019
FSTYP -- fuse.gocryptfs
PLATFORM -- Linux/x86_64 brikett 4.19.12-301.fc29.x86_64
MKFS_OPTIONS -- /var/tmp/check-gocryptfs/scratchdev
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /var/tmp/check-gocryptfs/scratchdev /var/tmp/check-gocryptfs/scratchdir
generic/273 - output mismatch (see /home/jakob.donotbackup/code/fuse-xfstests/results//generic/273.out.bad)
--- tests/generic/273.out 2018-01-20 14:29:39.092452016 +0100
+++ /home/jakob.donotbackup/code/fuse-xfstests/results//generic/273.out.bad 2019-01-05 16:55:48.660593249 +0100
@@ -2,3 +2,70 @@
------------------------------
start the workload
------------------------------
+_porter 8 not complete
+_porter 28 not complete
+_porter 65 not complete
+_porter 33 not complete
...
(Run 'diff -u tests/generic/273.out /home/jakob.donotbackup/code/fuse-xfstests/results//generic/273.out.bad' to see the entire diff)
Ran: generic/273
Failures: generic/273
Failed 1 of 1 tests
Runtime was 95 seconds
Oh, I see, this is the first time this test runs. Last time it was skipped:
generic/273 [not run] Filesystem fuse.gocryptfs not supported in _scratch_mkfs_sized
The core of xfstests generic/273 is 100 parallel "cp -r" invocations. Attached.
mass_cp.tar.xz.zip
We seem to trigger the ".deleted" condition in go-fuse
Added some debug output:
Jan 05 18:03:46 brikett gocryptfs[11665]: go-fuse: GetPath: deleted node? path=""
Glad you were able to track this down. However, I do not see much similarity between the comment in go-fuse and the stress test:
if walkUp != n.pathFs.root.Inode() {
	// This might happen if the node has been removed from
	// the tree using unlink, but we are forced to run
	// some file system operation, because the file is
	// still opened.
	// TODO - add a deterministic disambiguating suffix.
	return ".deleted"
}
The test does not involve any deletion operation, as far as I can see?!
How does the stress test in commit fb705f99782f222e72c42b23be9263bbd302a4d4 fail for you? I executed it a couple of times without any error.
Fails like this:
0 jakob@brikett:/var/tmp/b$ /home/jakob/go/src/github.com/rfjakob/gocryptfs/tests/stress_tests/parallel_cp.sh
deleting old files
creating files with dd
cp starting: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
cp finished: cp: cannot open 'origin/file_106' for reading: No such file or directory
cp: cannot open 'origin/file_218' for reading: No such file or directory
cp: cannot open 'origin/file_249' for reading: No such file or directory
[...]
cp: cannot open 'origin/file_716' for reading: No such file or directory
cp: cannot open 'origin/file_716' for reading: No such file or directory
cp: cannot create regular file 'sub_51/origin/file_608': No such file or directory
85 39 76 25 41 68 69 97 2 40 24 90 54 20 44 35 18 29 26 46 36 8 91 7 87 21 83 63 52 67 1 93 9 65 47 5 81 77 4 60 98 82 32 75 14 80 79 57 49 11 96 95 70 56 59 13 17 64 33 88 72 73 100 94 53 55 50 42 34 6 74 62 10 78 15 23 99 38 28 43 86 92 22 31 71 84 89 58 27 3 19 45 30 61 12 37 48 51 66 16
Runtime was 35 seconds
However, it does not always fail; it seems to somehow depend on system load.
I think the "delete operation" in this case is actually a FORGET. The kernel sends a FORGET when it evicts a file from the inode table, so maybe it only happens under memory pressure.
Running this during the tests seems to make reproducing the problem easier:
while sleep 1 ; do echo 3 > /proc/sys/vm/drop_caches ; done
Still no luck so far. I have tested both with tmpfs and ext4 file systems, and with and without dropping caches in parallel.
I am using go-fuse commit 7954c2167d2a0ef04a28577311f3b5698639a385 at the moment, but there are two more cache-related commits in the upstream branch. Did they maybe cause a regression?
Ah, I think I know what's going on.
go-fuse commit
7954c2167d2a0ef04a28577311f3b5698639a385 -> works
9d9c4ddf2e89545226f2f9cbe6dd5b9fbaf913e6 -> doesn't work
f520193c2228d18ac98d8d83792c470803f26be5 (latest) -> works
Are you maybe using the broken commit?
I was on 6df8ddc and have now updated to f520193c:
gocryptfs v1.7-rc1-17-gbd055ed-dirty; go-fuse v20170619-75-gf520193; 2019-01-05 go1.11.4
After a few tries, I can still repro:
0 jakob@brikett:/tmp/tmp.mhwVVvfXd1/b$ ~/go/src/github.com/rfjakob/gocryptfs/tests/stress_tests/parallel_cp.sh
deleting old files
creating files with dd
cp starting: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
cp finished: cp: cp: cannot open 'origin/file_310' for readingcannot open 'origin/file_310' for readingcp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
cp: cannot open 'origin/file_310' for reading: No such file or directory
[...]
Kernel probably also matters, I'm on 4.19.12.
I'm on 4.18.0-3-amd64 at the moment (Debian Buster). On closer inspection, the "No such file or directory" regression I encountered with 9d9c4ddf said that the file was not found during stat, not during open, so it might be unrelated. For me, the latest go-fuse works pretty reliably. Just to be sure, let me run it 100 times in a loop and check if there are any random failures.
With lots of retries and a smaller delay of 0.1 s between cache flushes, I was also able to reproduce it on my machine. At the same time, gocryptfs complained with:
2019/01/05 19:13:02 Unimplemented opcode INTERRUPT
2019/01/05 19:13:02 writer: Write/Writev failed, err: 2=no such file or directory. opcode: INTERRUPT
Do you also get the same error?
I get these "INTERRUPT" messages every now and then, yes, always have. I think they are for aborting a running request, which is not implemented in go-fuse, but otherwise harmless.
The INTERRUPT messages seem to be unrelated and probably were just caused by programs aborting due to the failed cp operation. For me the key to reproduce it was the smaller delay between cache flushes. The error occurs both with the latest go-fuse and the version two commits ago.
BTW, because I only found it recently: Figure 2 in this paper is great:
https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
Are you able to reproduce the issue with -sharedstorage? I was not able to reproduce it with that switch yet, but it is also faster by a factor of ~7 (when using tmpfs).
With -sharedstorage it runs... Faster!?
@rfjakob Thanks for the USENIX paper. Also, thanks to you and @slackner for your continuous improvements. Never before have I used encrypted home folders with so little effort, and so few problems. Great work!
With -sharedstorage it runs... Faster!?
Yes, at least for tmpfs and for this specific stress-test.
Did you already try to reproduce this issue with the loopback filesystem of go-fuse?
Not this time, no. But I hit a similar issue some time ago, and it reproduced on both loopback and gocryptfs. It was ultimately fixed with changes in go-fuse: https://github.com/hanwen/go-fuse/issues/99
That -sharedstorage runs faster could be because it disables hard link tracking, which saves some overhead:
https://github.com/rfjakob/gocryptfs/blob/fcaca5fc94d981aa637beb752edc8cb3c2265e96/mount.go#L301
However, the massive factor of 7 speedup suggests that there is some optimization potential.
Interesting section of gocryptfs -fusedebug -debug: https://gist.github.com/rfjakob/f2cdcc121a41088cb1e9075185ca57bc
I don't understand where things go south.
2019/01/20 17:51:49 rx 83920: LOOKUP i1 ["sub_42"] 7b
Lookup "sub_42"
2019/01/20 17:51:49 tx 83920: OK, {i100 g22998 tE=1s tA=1s {M040775 SZ=760 L=2 1026:1026 B0*4096 i0:116781415 A 1548003103.035988 M 1548003109.476006 C 1548003109.476006}}
Found "sub_42", assigned id 100
2019/01/20 17:51:49 rx 84206: LOOKUP i100 ["file_130"] 9b
Lookup "sub_42/file_130"
FS.GetAttr(".deleted/file_130")
go-fuse forgot about "sub_42" and now calls it ".deleted"
Complete log this time, and I patched go-fuse so it does not recycle ids: https://gist.githubusercontent.com/rfjakob/ee4fbe522f7ae90ca4369a10bc0535bb/raw/a08212b601d0cd8590e58838a6a8cd602a9a9eb7/parallel_cp_norecycle.log
Output was:
$ ( cd b && ~/go/src/github.com/rfjakob/gocryptfs/tests/stress_tests/parallel_cp.sh )
deleting old files
creating files with dd
cp starting: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
cp finished: cp: cannot create regular file 'sub_56/file_116': No such file or directory
cp: cannot create regular file 'sub_56/file_117': No such file or directory
cp: cannot create regular file 'sub_56/file_118': No such file or directory
cp: cannot create regular file 'sub_56/file_119': No such file or directory
cp: cannot create regular file 'sub_56/file_12': No such file or directory
cp: cannot create regular file 'sub_56/file_120': No such file or directory
cp: cannot create regular file 'sub_56/file_121': No such file or directory
cp: cannot create regular file 'sub_56/file_122': No such file or directory
cp: cannot create regular file 'sub_56/file_123': No such file or directory
cp: cannot create regular file 'sub_56/file_124': No such file or directory
cp: cannot create regular file 'sub_56/file_125': No such file or directory
cp: cannot create regular file 'sub_56/file_126': No such file or directory
^C
Patch to go-fuse was:
$ git diff
diff --git a/fuse/nodefs/handle.go b/fuse/nodefs/handle.go
index 8bad21d..6a45e23 100644
--- a/fuse/nodefs/handle.go
+++ b/fuse/nodefs/handle.go
@@ -83,7 +83,7 @@ func (m *portableHandleMap) Register(obj *handled) (handle, generation uint64) {
return obj.handle, obj.generation
}
// Create a new handle number or recycle one on from the free list
- if len(m.freeIds) == 0 {
+ if true || len(m.freeIds) == 0 {
obj.handle = uint64(len(m.handles))
m.handles = append(m.handles, obj)
} else {
Yesterday I spent some more time looking at this bug, and I still suspect that you are seeing a different error than what I get. While I was able to reproduce similar output, I never saw any access to .deleted. Does the following debug code maybe reveal anything useful for you? http://ix.io/1z6u
BTW, did you already check whether disabling ClientInodes has the same effect for you (improving performance and getting rid of the errors)? In the beginning I suspected it had to do with a couple of places in go-fuse checking conditions outside of locks (see http://ix.io/1z6K, which adds a couple of panics), but I was never able to trigger any of them.
I added the patch from http://ix.io/1z6u , but the panics do not trigger. Also, adding -sharedstorage did NOT fix the problem.
Whoops, just hit one:
2019/01/28 18:08:46 Forgot the wrong node? node{36184} vs node{0}
panic: Forgot the wrong node? node{36184} vs node{0}
goroutine 222580 [running]:
log.Panicf(0x685219, 0x1f, 0xc0055c4bb8, 0x2, 0x2)
/usr/local/go/src/log/log.go:333 +0xda
github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).forgetUpdate(0xc00007acc0, 0x8d1b, 0x1)
github.com/hanwen/go-fuse/fuse/nodefs/fsconnector.go:159 +0x3c4
github.com/hanwen/go-fuse/fuse/nodefs.(*rawBridge).Forget(0xc00007acc0, 0x8d1b, 0x1)
github.com/hanwen/go-fuse/fuse/nodefs/fsops.go:124 +0x3f
github.com/hanwen/go-fuse/fuse.doBatchForget(0xc0000ac000, 0xc0047acfc0)
github.com/hanwen/go-fuse/fuse/opcode.go:322 +0xff
github.com/hanwen/go-fuse/fuse.(*Server).handleRequest(0xc0000ac000, 0xc0047acfc0, 0xc0047acfc0)
github.com/hanwen/go-fuse/fuse/server.go:431 +0x26b
github.com/hanwen/go-fuse/fuse.(*Server).loop(0xc0000ac000, 0xc004afc901)
github.com/hanwen/go-fuse/fuse/server.go:403 +0x18f
created by github.com/hanwen/go-fuse/fuse.(*Server).readRequest
github.com/hanwen/go-fuse/fuse/server.go:291 +0x2d8
https://github.com/rfjakob/go-fuse/commit/edb435ebd4a2513e6057f330fa62f83f9bf22a08 seems to fix the problem here.
If the problem is that we return forgotten nodes to the kernel, how can we be sure that it doesn't happen anymore with this patch? There could be an arbitrary delay between receiving a message and processing it. Also, there are batch-forgets, which will always take more time to process.
xfstests looks good: https://gist.github.com/rfjakob/0537715c1264a6f494680e77d83f3882, generic/273 passes.
@slackner the problem is that LOOKUP and FORGET operate on two data structures (the nodeid table and the directory tree). Between updating the first and updating the second, the state is inconsistent.
hanwen posted an example here: https://github.com/hanwen/go-fuse/issues/168#issuecomment-459740379
Fixed by https://github.com/hanwen/go-fuse/commit/d0fca860a5759d17592becfa1b8e5b1bd354b24a , merged into go-fuse.