When running containers on Azure Kubernetes Service (AKS), the only practical storage option for persistent volumes is Azure Files, which is CIFS/SMB-based. So, Nextcloud has the data directory mounted over SMBv3. In this configuration, when using Alpine FPM containers and performing a chunked upload that has more than 62 chunks, the chunked upload will fail with a 403 error at the end. The issue does not appear to affect Nextcloud running on Ubuntu-based containers -- just Alpine.
Based on my analysis, the root cause of this issue lies somewhere inside the standard libraries for Alpine, though Nextcloud could implement a workaround for the issue until such time as the issue is fixed in Alpine's standard libraries.
Mount the Nextcloud data folder on a CIFS/SMB v3 share, such that chunked uploads for users get saved to the file share. For example, given a test user named TestUser, the folder /var/www/html/data/TestUser/uploads must be a folder that lives on an SMBv3 share.
_We are using the following CIFS mount options: rw,relatime,vers=3.0,cache=strict,uid=33,forceuid,gid=33,forcegid,file_mode=0770,dir_mode=0770,soft,persistenthandles,nounix,serverino,mapposix,nostrictsync,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=5._
_Adjusting these values -- including setting actimeo to 1 or 0 -- does not appear to affect this bug._
TestUser.TestUser.
dd if=/dev/urandom of=./test-630 count=63 bs=10M
test-630 file that was created in step 5 by dragging-and-dropping it on the files area of the Nextcloud interface.Shell-in to the VM/container/host running Nextcloud PHP-FPM.
_How you do this depends on how you're running Nextcloud. We're using Azure Kubernetes Service, so we do this with kubectl exec -it nextcloud-POD-SUFFIX sh._
/var/www/html/data/TestUser/uploads folder in the host running Nextcloud PHP FPM.ls -l to locate the most-recent web-file-upload folder.ls -l to see what files are left.web-file-upload folder (it should have been deleted).Error when assembling chunks, status code 403 appears.web-file-upload folder corresponding to the upload.The 403 error message comes from \OCA\DAV\Connector\Sabre\Directory::delete() on line 314, but the root cause is that not all of the files in the chunked upload source folder got removed before Nextcloud attempted to delete the folder with rmdir. Stepping through \OC\Files\Storage\Local::rmdir(), the RecursiveDirectoryIterator is not getting a full list of the files in the folder, and so that's why it's not deleting all the files there. This behavior
The root cause _is not_ Nextcloud's fault -- on Alpine Linux, even rm -rf cannot remove the folder on a single try if it contains _more_ than 62 files. I do not know why 62 is significant other than that, if you count . and .., it would contain 64 files, but it's more nuanced than that.
The point of my filing this ticket is to come up with a mitigation/fix for this issue on the Nextcloud side since we can't do much about the Alpine bug. Presumably, the mitigation in Nextcloud is to keep trying to remove the files in the folder until either the folder is removed or Nextcloud notices that the number of files in the folder isn't decreasing.
See "Reproducing the Root Cause without Nextcloud" below to understand why I think this is an appropriate mitigation, and to see how I know the root cause lies outside Nextcloud.
Operating system:
4.15.0-1063-azure #68-Ubuntu SMP Fri Nov 8 09:30:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linuxnginx/1.15.10 with PHP FPMBrowser: Google Chrome 78.0.3904.97 (Official Build) (64-bit)
Operating system: Windows 10.0.18362 Build 18362
N/A
{
"reqId": "yRypcPE7j3zummFkYoSb",
"level": 3,
"time": "2019-11-17T06:50:38+00:00",
"remoteAddr": "123.45.67.123",
"user": "TestUser",
"app": "PHP",
"method": "MOVE",
"url": "/remote.php/dav/uploads/TestUser/web-file-upload-5514056e9ca8e7927f2255f80a3085d1-1573971043550/.file",
"message": "rmdir(/var/www/html/data/TestUser/uploads/web-file-upload-5514056e9ca8e7927f2255f80a3085d1-1573971043550): Directory not empty at /var/www/html/lib/private/Files/Storage/Local.php#119",
"userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
"version": "16.0.6.1"
}
<?xml version="1.0" encoding="utf-8"?>
<d:error xmlns:d="DAV:" xmlns:s="http://sabredav.org/ns">
<s:exception>Sabre\DAV\Exception\Forbidden</s:exception>
<s:message/>
</d:error>
I created the following script as test.sh inside the /var/www/html/data/TestUser/uploads folder:
#!/usr/bin/env sh
starting_file_count="$1"
mkdir "test"
echo "Creating '${starting_file_count}' test files..."
for file_index in $(seq 1 "${starting_file_count}"); do
filename="test/test_${file_index}"
touch "${filename}"
done
echo ""
echo "Trying to delete test files..."
while [ -d "test" ]; do
count_before=$(ls test/ | wc -l)
count_deleted=$(rm -vrf test/ 2>/dev/null | wc -l)
count_after=$( (ls test/ | wc -l) 2>/dev/null || echo 0)
echo "DELETED: ${count_deleted} BEFORE: ${count_before} AFTER: ${count_after}"
done
echo ""
This script basically creates a bunch of zero-length files in a new folder under the current working directory (presumably, an SMBv3-mounted share), and then tries to remove the folder with "rm -rf". It counts how many files are in the folder before it removes the folder vs. after attempting to remove the folder, as well as how many files that a verbose "rm -rf" indicates it's removing. The script then loops until the folder is _actually_ removed, printing counts along the way.
With a properly-functioning storage driver, kernel, and standard library, the script should be able to remove all files in a single pass. But, with Alpine Linux 3.9 and 3.10, the script has to make several passes as soon as it is creating more than 62 files in the test folder. This demonstrates the issue we are seeing with Nextcloud, but at a lower-level of the system.
What follows are various results of running it. At first it seems like the way Alpine behaves always forces the last few batches of files to be powers of 62 files, but that pattern breaks down as soon as there are more than 250 files (see the results for powers of 3).
My script is not able to reproduce the issue when running an Ubuntu-based PHP container like php:7.3.11-cli even though the container is using the same file share, same mount settings, and same Kubernetes node (so exact same kernel).
for i in $(seq 50); do count=$(expr $i \* 5); test.sh "${count}"; done
Creating '5' test files...
Trying to delete test files...
DELETED: 6 BEFORE: 5 AFTER: 0
Creating '10' test files...
Trying to delete test files...
DELETED: 11 BEFORE: 10 AFTER: 0
Creating '15' test files...
Trying to delete test files...
DELETED: 16 BEFORE: 15 AFTER: 0
Creating '20' test files...
Trying to delete test files...
DELETED: 21 BEFORE: 20 AFTER: 0
Creating '25' test files...
Trying to delete test files...
DELETED: 26 BEFORE: 25 AFTER: 0
Creating '30' test files...
Trying to delete test files...
DELETED: 31 BEFORE: 30 AFTER: 0
Creating '35' test files...
Trying to delete test files...
DELETED: 36 BEFORE: 35 AFTER: 0
Creating '40' test files...
Trying to delete test files...
DELETED: 41 BEFORE: 40 AFTER: 0
Creating '45' test files...
Trying to delete test files...
DELETED: 46 BEFORE: 45 AFTER: 0
Creating '50' test files...
Trying to delete test files...
DELETED: 51 BEFORE: 50 AFTER: 0
Creating '55' test files...
Trying to delete test files...
DELETED: 56 BEFORE: 55 AFTER: 0
Creating '60' test files...
Trying to delete test files...
DELETED: 61 BEFORE: 60 AFTER: 0
Creating '65' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 65 AFTER: 3
DELETED: 4 BEFORE: 3 AFTER: 0
Creating '70' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 70 AFTER: 8
DELETED: 9 BEFORE: 8 AFTER: 0
Creating '75' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 75 AFTER: 13
DELETED: 14 BEFORE: 13 AFTER: 0
Creating '80' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 80 AFTER: 18
DELETED: 19 BEFORE: 18 AFTER: 0
Creating '85' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 85 AFTER: 23
DELETED: 24 BEFORE: 23 AFTER: 0
Creating '90' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 90 AFTER: 28
DELETED: 29 BEFORE: 28 AFTER: 0
Creating '95' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 95 AFTER: 33
DELETED: 34 BEFORE: 33 AFTER: 0
Creating '100' test files...
DELETED: 62 BEFORE: 105 AFTER: 43
DELETED: 44 BEFORE: 43 AFTER: 0
Creating '110' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 110 AFTER: 48
DELETED: 49 BEFORE: 48 AFTER: 0
Creating '115' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 115 AFTER: 53
DELETED: 54 BEFORE: 53 AFTER: 0
Creating '120' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 120 AFTER: 58
DELETED: 59 BEFORE: 58 AFTER: 0
Creating '125' test files...
Trying to delete test files...
DELETED: 63 BEFORE: 125 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '130' test files...
Trying to delete test files...
DELETED: 68 BEFORE: 130 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '135' test files...
Trying to delete test files...
DELETED: 73 BEFORE: 135 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '140' test files...
Trying to delete test files...
DELETED: 78 BEFORE: 140 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '145' test files...
Trying to delete test files...
DELETED: 83 BEFORE: 145 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '150' test files...
Trying to delete test files...
DELETED: 88 BEFORE: 150 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '155' test files...
Trying to delete test files...
DELETED: 93 BEFORE: 155 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '160' test files...
Trying to delete test files...
DELETED: 98 BEFORE: 160 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '165' test files...
Trying to delete test files...
DELETED: 103 BEFORE: 165 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '170' test files...
Trying to delete test files...
DELETED: 108 BEFORE: 170 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '175' test files...
Trying to delete test files...
DELETED: 113 BEFORE: 175 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '180' test files...
Trying to delete test files...
DELETED: 118 BEFORE: 180 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '185' test files...
Trying to delete test files...
DELETED: 123 BEFORE: 185 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '190' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 190 AFTER: 64
DELETED: 62 BEFORE: 64 AFTER: 2
DELETED: 3 BEFORE: 2 AFTER: 0
Creating '195' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 195 AFTER: 69
DELETED: 62 BEFORE: 69 AFTER: 7
DELETED: 8 BEFORE: 7 AFTER: 0
Creating '200' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 200 AFTER: 74
DELETED: 62 BEFORE: 74 AFTER: 12
DELETED: 13 BEFORE: 12 AFTER: 0
Creating '205' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 205 AFTER: 79
DELETED: 62 BEFORE: 79 AFTER: 17
DELETED: 18 BEFORE: 17 AFTER: 0
Creating '210' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 210 AFTER: 84
DELETED: 62 BEFORE: 84 AFTER: 22
DELETED: 23 BEFORE: 22 AFTER: 0
Creating '215' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 215 AFTER: 89
DELETED: 62 BEFORE: 89 AFTER: 27
DELETED: 28 BEFORE: 27 AFTER: 0
Creating '220' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 220 AFTER: 94
DELETED: 62 BEFORE: 94 AFTER: 32
DELETED: 33 BEFORE: 32 AFTER: 0
Creating '225' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 225 AFTER: 99
DELETED: 62 BEFORE: 99 AFTER: 37
DELETED: 38 BEFORE: 37 AFTER: 0
Creating '230' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 230 AFTER: 104
DELETED: 62 BEFORE: 104 AFTER: 42
DELETED: 43 BEFORE: 42 AFTER: 0
Creating '235' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 235 AFTER: 109
DELETED: 62 BEFORE: 109 AFTER: 47
DELETED: 48 BEFORE: 47 AFTER: 0
Creating '240' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 240 AFTER: 114
DELETED: 62 BEFORE: 114 AFTER: 52
DELETED: 53 BEFORE: 52 AFTER: 0
Creating '245' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 245 AFTER: 119
DELETED: 62 BEFORE: 119 AFTER: 57
DELETED: 58 BEFORE: 57 AFTER: 0
Creating '250' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 250 AFTER: 124
DELETED: 62 BEFORE: 124 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
for i in $(seq 10); do count=$((2**$i)); test.sh "${count}"; done
Creating '2' test files...
Trying to delete test files...
DELETED: 3 BEFORE: 2 AFTER: 0
Creating '4' test files...
Trying to delete test files...
DELETED: 5 BEFORE: 4 AFTER: 0
Creating '8' test files...
Trying to delete test files...
DELETED: 9 BEFORE: 8 AFTER: 0
Creating '16' test files...
Trying to delete test files...
DELETED: 17 BEFORE: 16 AFTER: 0
Creating '32' test files...
Trying to delete test files...
DELETED: 33 BEFORE: 32 AFTER: 0
Creating '64' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 64 AFTER: 2
DELETED: 3 BEFORE: 2 AFTER: 0
Creating '128' test files...
Trying to delete test files...
DELETED: 66 BEFORE: 128 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '256' test files...
Trying to delete test files...
DELETED: 130 BEFORE: 256 AFTER: 126
DELETED: 64 BEFORE: 126 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '512' test files...
Trying to delete test files...
DELETED: 280 BEFORE: 512 AFTER: 232
DELETED: 126 BEFORE: 232 AFTER: 106
DELETED: 62 BEFORE: 106 AFTER: 44
DELETED: 45 BEFORE: 44 AFTER: 0
Creating '1024' test files...
Trying to delete test files...
DELETED: 558 BEFORE: 1024 AFTER: 466
DELETED: 234 BEFORE: 466 AFTER: 232
DELETED: 126 BEFORE: 232 AFTER: 106
DELETED: 62 BEFORE: 106 AFTER: 44
DELETED: 45 BEFORE: 44 AFTER: 0
for i in $(seq 8); do count=$((3**$i)); test.sh "${count}"; done
Creating '3' test files...
Trying to delete test files...
DELETED: 4 BEFORE: 3 AFTER: 0
Creating '9' test files...
Trying to delete test files...
DELETED: 10 BEFORE: 9 AFTER: 0
Creating '27' test files...
Trying to delete test files...
DELETED: 28 BEFORE: 27 AFTER: 0
Creating '81' test files...
Trying to delete test files...
DELETED: 62 BEFORE: 81 AFTER: 19
DELETED: 20 BEFORE: 19 AFTER: 0
Creating '243' test files...
Trying to delete test files...
DELETED: 126 BEFORE: 243 AFTER: 117
DELETED: 62 BEFORE: 117 AFTER: 55
DELETED: 56 BEFORE: 55 AFTER: 0
Creating '729' test files...
Trying to delete test files...
DELETED: 402 BEFORE: 729 AFTER: 327
DELETED: 201 BEFORE: 327 AFTER: 126
DELETED: 64 BEFORE: 126 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
Creating '2187' test files...
Trying to delete test files...
DELETED: 1101 BEFORE: 2187 AFTER: 1086
DELETED: 552 BEFORE: 1086 AFTER: 534
DELETED: 285 BEFORE: 534 AFTER: 249
DELETED: 126 BEFORE: 249 AFTER: 123
DELETED: 62 BEFORE: 123 AFTER: 61
DELETED: 62 BEFORE: 61 AFTER: 0
Creating '6561' test files...
Trying to delete test files...
DELETED: 3304 BEFORE: 6561 AFTER: 3257
DELETED: 1633 BEFORE: 3257 AFTER: 1624
DELETED: 857 BEFORE: 1624 AFTER: 767
DELETED: 388 BEFORE: 767 AFTER: 379
DELETED: 221 BEFORE: 379 AFTER: 158
DELETED: 96 BEFORE: 158 AFTER: 62
DELETED: 63 BEFORE: 62 AFTER: 0
I've prepared a patch (linked above) that works for us, but may not be the best way to solve it long term. I would expect any place in the Nextcloud codebase where RecursiveDirectoryIterator is used to be affected as well, but right now I'm just fixing the deletion issue.
It looks like this is actually caused by a bug in a standard library in Alpine Linux. I am going to update this issue to reflect it.
I am digging further... will update when I have more info.
After a LOT of digging, it seems as though the musl C-library in Alpine (which stands-in for what would be GNU LibC on an Ubuntu system) has a bug in the way it navigates through folders with the readdir() library function when on NFS or SMB.
The bug actually seems to exactly match a bug from 2012 on other distros, such as this one from CentOS: https://bugs.centos.org/view.php?id=5496
It seems like readdir() (which PHP's RecursiveDirectoryIterator depends on under the hood) loads 64 file entities at a time and maintains a cookie/pointer in the overall result set for NFS/SMB so it knows when to load more data. The issue is that if the contents of the directory change (e.g. via rmdir()), the pointer to the next file is relative to where the file was in the original result set instead of where it now is now that files have been removed. This is why some files get skipped, and the amount of files that get skipped is roughly 50% of the total number of files.
Since I cannot reproduce this issue with rmdir on an Ubuntu-based container with the same kernel, I must conclude that GNU LibC has a better implementation than musl in this scenario.
Given that the defect presents itself because the contents of the folder are changing as files are being removed, I believe that another workaround within Nextcloud to this issue would be to load the listing of all files to be removed before removing them, then remove based on the list. That implementation would only fail if new files get added to the folder in the interim.
I've filed the upstream ticket here: https://gitlab.alpinelinux.org/alpine/aports/issues/10960
Working with the musl library team on IRC, I now have a workaround that involves patching and recompiling the musl library that's used in the Nextcloud container.
See this commit:
https://github.com/GuyPaddock/inveniem-nextcloud-azure/commit/7f699301b11a9524d7e0685513f49e32f35720fd
There's also been some movement on this from the Azure Files kernel team:
https://github.com/Azure/AKS/issues/1325