Time was measured with the news app but other apps – like for notes – show similar behavior. Web access via browser seems normal.
Syncing takes 1-2 seconds. 1-3 threads of php-fpm utilize 10-20% CPU.
(php-fpm <= 7.3.7)
Syncing takes ca. 10-12 seconds. 1-3 threads of php-fpm utilize 50+ % CPU, at
least one close to 100% in that time.
(php-fpm >= 7.3.8)
Hardware: Raspberry Pi 3 Model B
Operating system: Arch Linux ARM
Web server: Lighttpd
Database: MariaDB
PHP version: 7.3.9
Nextcloud version: 16.0.4
Updated from an older Nextcloud/ownCloud or fresh install: Updated
Where did you install Nextcloud from: Arch package manager – pacman
Signing status:
Signing status
No errors have been found.
List of activated apps:
App list
Enabled:
- accessibility: 1.2.0
- calendar: 1.7.1
- cloud_federation_api: 0.2.0
- contacts: 3.1.3
- dav: 1.9.2
- federatedfilesharing: 1.6.0
- files: 1.11.0
- files_rightclick: 0.15.1
- files_trashbin: 1.6.0
- files_versions: 1.9.0
- gallery: 18.3.0
- logreader: 2.1.0
- lookup_server_connector: 1.4.0
- news: 13.1.6
- notes: 3.0.3
- notifications: 2.4.1
- oauth2: 1.4.2
- provisioning_api: 1.6.0
- serverinfo: 1.6.0
- tasks: 0.11.3
- twofactor_backupcodes: 1.5.0
- updatenotification: 1.6.0
- workflowengine: 1.6.0
Disabled:
- activity
- admin_audit
- comments
- encryption
- federation
- files_external
- files_pdfviewer
- files_sharing
- files_texteditor
- files_videoplayer
- firstrunwizard
- nextcloud_announcements
- password_policy
- privacy
- recommendations
- sharebymail
- support
- survey_client
- systemtags
- theming
- user_ldap
- viewer
Nextcloud configuration:
{
"system": {
"instanceid": "***REMOVED SENSITIVE VALUE***",
"passwordsalt": "***REMOVED SENSITIVE VALUE***",
"secret": "***REMOVED SENSITIVE VALUE***",
"trusted_domains": [
"192.168.10.5",
"localhost",
"picloud.local",
"picloud",
"::1",
"127.0.1.1"
],
"datadirectory": "***REMOVED SENSITIVE VALUE***",
"dbtype": "mysql",
"version": "16.0.4.1",
"dbname": "***REMOVED SENSITIVE VALUE***",
"dbhost": "***REMOVED SENSITIVE VALUE***",
"dbport": "",
"dbtableprefix": "oc_",
"mysql.utf8mb4": true,
"dbuser": "***REMOVED SENSITIVE VALUE***",
"dbpassword": "***REMOVED SENSITIVE VALUE***",
"installed": true,
"overwrite.cli.url": "http:\/\/192.168.10.5",
"memcache.local": "\\OC\\Memcache\\APCu",
"maintenance": false,
"loglevel": 2,
"app_install_overwrite": [
"calendar",
"tasks"
],
"mail_smtpmode": "smtp",
"mail_smtpsecure": "ssl",
"mail_sendmailmode": "smtp",
"mail_from_address": "***REMOVED SENSITIVE VALUE***",
"mail_domain": "***REMOVED SENSITIVE VALUE***",
"mail_smtphost": "***REMOVED SENSITIVE VALUE***",
"mail_smtpport": "465",
"mail_smtpauthtype": "LOGIN",
"mail_smtpauth": 1,
"mail_smtpname": "***REMOVED SENSITIVE VALUE***",
"mail_smtppassword": "***REMOVED SENSITIVE VALUE***",
"data-fingerprint": "f6ae37315645dadf1abd0b748a32218a"
}
}
Are you using external storage, if yes which one: No
Are you using encryption: No
Are you using an external user-backend, if yes which one: No
Browser: Firefox 69.0.1 (64-bit)
Operating system: 5.3.1-arch1-1-ARCH x86_64 GNU/Linux
Web server error log
2019-09-22 00:00:02: (server.c.969) [note] graceful shutdown started
2019-09-22 00:00:02: (server.c.2091) server stopped by UID = 0 PID = 91765
2019-09-22 00:00:02: (server.c.1521) server started (lighttpd/1.4.54)
2019-09-22 04:20:05: (server.c.969) [note] graceful shutdown started
2019-09-22 04:20:05: (server.c.2091) server stopped by UID = 0 PID = 1
2019-09-22 04:21:03: (server.c.1521) server started (lighttpd/1.4.54)
2019-09-22 16:53:15: (server.c.969) [note] graceful shutdown started
2019-09-22 16:53:15: (server.c.2091) server stopped by UID = 0 PID = 1
2019-09-22 16:53:17: (server.c.1521) server started (lighttpd/1.4.54)
2019-09-22 22:00:06: (server.c.969) [note] graceful shutdown started
2019-09-22 22:00:06: (server.c.2091) server stopped by UID = 0 PID = 1
2019-09-22 22:00:09: (server.c.1521) server started (lighttpd/1.4.54)
2019-09-22 22:49:46: (server.c.969) [note] graceful shutdown started
2019-09-22 22:49:46: (server.c.2091) server stopped by UID = 0 PID = 1
2019-09-22 22:49:47: (server.c.1521) server started (lighttpd/1.4.54)
2019-09-23 04:20:03: (server.c.969) [note] graceful shutdown started
2019-09-23 04:20:03: (server.c.2091) server stopped by UID = 0 PID = 1
2019-09-23 04:20:54: (server.c.1521) server started (lighttpd/1.4.54)
Nextcloud log
No entries at the time of the event.
Thank you for reporting this issue :+1:
You narrowed down the problem to php-fpm. I'm not sure how nextcloud can fix a regression (if there is one) in php-fpm?
It was asked if I already got in touch with the Nextcloud team, so I did. Any help on further narrowing down the issue would most likely be appreciated. I think info on how php-fpm is involved during the snycing process could help.
Issue @ PHP
Hmm. We can do some profiling with xdebug and check if there is a difference. Do you see the same with the nextcloud app?
No. Performance with the Nextcloud app is about the same through the different php-fpm versions. So far I can confirm the problem with Nextcloud news and notes (android app I use). I did not test other apps yet since I don't use those.
Can't reproduce here.
# php -v
PHP 7.3.9-1+ubuntu18.04.1+deb.sury.org+1 (cli) (built: Sep 2 2019 12:54:24) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.3.9, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.3.9-1+ubuntu18.04.1+deb.sury.org+1, Copyright (c) 1999-2018, by Zend Technologies
I have no clue what fixed it, but it is working as expected now. Closing...
Thanks for your time.
Facing a similar performance issue after upgrading to php 7.2.24 (see the spikes after Nov 4), running NC 16 also in fpm mode with about 60 active users. No clue yet, optimized all child and spawn parameters.
We found a infinite loop with search dav requests. Check https://github.com/nextcloud/server/issues/17447 for more details.
@kesselb Any idea when it will be released?
Thankks @kesselb
Is just copying 3rdparty/icewind
from master to current 17.0.1 installation should fix it? I wonder if I could use it as a solution in the meantime.
Is just copying
3rdparty/icewind
from master to current 17.0.1 installation should fix it?
Yes. I would use the 1.0.2 version from releases. Probably a bit easier: https://github.com/nextcloud/server/issues/17447#issuecomment-549267179. This is the necessary code change without the tests.
damn. @kesselb seems like this wasn't it. CPU load goes through the roof (on two separate instances) ever since the php update to 7.3.8 and higher so I was hoping this was it. I updated searchdav to 1.0.2 but that did not seem to help. Debug logs do not produce any visible errors that could suggest anything. Switching to version prior php7.3.8 works fine. However with the ransomware on the loose even with updated nginx config, it's too much of a gamble. Any ideas what to check, what to do?
I also applied the patch in Literal.php https://github.com/nextcloud/server/issues/17447#issuecomment-549815417 one week ago with no effect, still having extreme CPU load. Sometimes we have one php-fpm process which never stops and eats all CPU resources, needs to be killed manually. As mentioned above, this behaviour started the same time when we upgraded to php 7.2.24.
Waiting for backport to 16.0.7 with the complete patch https://github.com/nextcloud/server/pull/17905 , maybe I missed something.
@rasos it does not look like there is anything to miss tbh. I tired all kind of stuff even though I know it does not really matter (cloned 3rdparty repo master directly, copied only the searchDAV part, copied searchDAV1.0.2 from icewind repo, copied manually only the files changed etc). On two instances nothing has changed. I know it's for sure due to php upgrade as first instance had issue right after weekly OS update (thus reverted back to previous version until security issue), second one was rather neglected (so OS not updated regulary to say the least) but when security issue was posted on 2nd october, even though the php packages with the fix were not published yet, I updated the system and nginx resulting in spikes right away.
While the second instance is smaller so I can clearly see the spikes, the first instance is rather huge and it results in all cpu cores being 100% busy at all times.
I tried to disable most of the apps last night to see if one of them could be the cause, and I deployed nextcloud code from scratch to make sure there is no leftovers from previous updates, but it does not seem to fix anything.
I have not yet see anything pointing to any issue in the logs. Though since it's monday the smaller instance is used more but little enough to track this which is what I'm going to do today (it only spikes as oppose to constant cpu grind, so with debug logs I could pin point the times it jumps and hopefully see what exactly is causing it).
@NaofumiHonda Could you share how to debug such issues with gdb?
Assume php is 7.3.11.
extract .gdbinit from php-7.3.11.tar.xz which is in the root dir of the source tree.
Under root privilege, issue the command
gdb --init-command=.gdbinit -p "process no eating much cpu time"
Now you are in the gdb console.
(gdb) zbacktrace
Then gdb prints a function call tree currently executed.
Now issue "c" command in the gdb console (The process runs again).
After a few seconds, hit "^C" (the process stops).
Issue zbacktrace again and check the currently executed function.
Do these procedure several times until you find a root cause of an infinite loop.
Good luck!
So far I have not found anything pointing at any issue, just to give small update on things. I am trying to look elsewhere to maybe trigger/reproduce the issue so that we could see where to focus.
@rasos Do you by any chance use LDAP auth on your instance?
@muppeth yes, we use LDAP.
We have doubled memory and php workers today, load is now between 3 and 8, no more dead locked fpm workers which caused load of 25 the days before.
adb debugging showed often:
#0 0xb7eeed09 in __kernel_vsyscall ()
#1 0xb75f5ee2 in __libc_accept (fd=9, addr=..., len=0xbfa5cbf4) at ../sysdeps/unix/sysv/linux/accept.c:30
"accept" syscall means, that the process is waiting.
The accept() system call is used with connection-based socket types
(SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection
request on the queue of pending connections for the listening socket,
sockfd, creates a new connected socket, and returns a new file descrip‐
tor referring to that socket. The newly created socket is not in the
listening state. The original socket sockfd is unaffected by this
call.
@rasos Hmmm... Are you on nginx or apache2?
@muppeth we are on apache2.
I switched to apache2 and mod_php this evening instead of fpm+nginx to see if its any different, and nope it isnt. Which means the issue is not dependent on fpm looks like. Atm I'm back to nginx and php-fpm set 120 workers (static) on fpm and they are all active (running) which means that the load is high and of course the performance is horrible, but at least stuff does load.
I disabled most of the apps to see if there is any difference and also negative:
Enabled:
- accessibility: 1.3.0
- activity: 2.10.1
- announcementcenter: 3.6.1
- apporder: 0.8.0
- checksum: 0.4.3
- cloud_federation_api: 1.0.0
- comments: 1.7.0
- dav: 1.13.0
- encryption: 2.5.0
- federatedfilesharing: 1.7.0
- files: 1.12.0
- files_sharing: 1.9.0
- files_trashbin: 1.7.0
- files_versions: 1.10.0
- files_videoplayer: 1.6.0
- firstrunwizard: 2.6.0
- lookup_server_connector: 1.5.0
- nextcloud_announcements: 1.6.0
- notes: 3.0.3
- notifications: 2.5.0
- oauth2: 1.5.0
- ownpad: 0.6.13
- password_policy: 1.7.0
- privacy: 1.1.0
- provisioning_api: 1.7.0
- ransomware_protection: 1.5.1
- theming: 1.8.0
- twofactor_backupcodes: 1.6.0
- twofactor_totp: 4.0.0
- updatenotification: 1.7.0
- user_ldap: 1.7.0
- viewer: 1.2.0
- workflowengine: 1.7.0
Also just to check I've disabled /remote.php/dav
locations on reverse proxy. The load obviously dropped but still seen 1-2 php workers spinning at 100%, so something I will look into still (check why they are high).
When checking processes with strace I see lot of /var/log/nextcloud/apps/calendar/composer/autoload.php -1 ENOENT (No such file or directory)
.
Interesting to see the waiting processes, but none eats a lot of CPU (third row).
ps -u www-data -o pid,pcpu,cputime,etimes,stat,wchan:26,comm --sort cputime
PID %CPU TIME ELAPSED STAT WCHAN COMMAND
545 0.0 00:00:00 719 S hrtimer_nanosleep php-fpm7.2
4681 0.2 00:00:00 326 S hrtimer_nanosleep php-fpm7.2
5547 0.3 00:00:00 230 Sl futex_wait_queue_me apache2
6100 0.3 00:00:00 166 S hrtimer_nanosleep php-fpm7.2
6797 0.2 00:00:00 115 S hrtimer_nanosleep php-fpm7.2
7723 2.6 00:00:00 36 S skb_wait_for_more_packets php-fpm7.2
3508 0.2 00:00:01 422 S hrtimer_nanosleep php-fpm7.2
6428 1.5 00:00:01 126 Sl pipe_wait apache2
954 0.4 00:00:03 706 S hrtimer_nanosleep php-fpm7.2
1757 0.0 00:00:04 166919 Ss poll_schedule_timeout htcacheclean
3450 0.9 00:00:04 426 S hrtimer_nanosleep php-fpm7.2
31268 0.5 00:00:04 932 S hrtimer_nanosleep php-fpm7.2
3481 1.3 00:00:05 424 S skb_wait_for_more_packets php-fpm7.2
3506 1.3 00:00:05 422 S skb_wait_for_more_packets php-fpm7.2
6143 3.4 00:00:05 165 S hrtimer_nanosleep php-fpm7.2
7200 6.1 00:00:05 91 S skb_wait_for_more_packets php-fpm7.2
3602 1.4 00:00:06 420 S skb_wait_for_more_packets php-fpm7.2
31279 0.7 00:00:07 931 S hrtimer_nanosleep php-fpm7.2
522 1.1 00:00:08 721 Sl futex_wait_queue_me php-fpm7.2
3441 2.1 00:00:09 427 S hrtimer_nanosleep php-fpm7.2
4863 3.1 00:00:09 310 S skb_wait_for_more_packets php-fpm7.2
4885 2.9 00:00:09 307 S skb_wait_for_more_packets php-fpm7.2
29726 0.8 00:00:09 1062 S hrtimer_nanosleep php-fpm7.2
533 1.4 00:00:10 720 S skb_wait_for_more_packets php-fpm7.2
32407 1.3 00:00:10 772 Sl pipe_wait apache2
3498 2.7 00:00:11 423 S skb_wait_for_more_packets php-fpm7.2
3520 2.8 00:00:11 421 S skb_wait_for_more_packets php-fpm7.2
934 1.6 00:00:12 708 S hrtimer_nanosleep php-fpm7.2
3603 2.8 00:00:12 420 S hrtimer_nanosleep php-fpm7.2
532 1.8 00:00:13 720 S hrtimer_nanosleep php-fpm7.2
1124 2.1 00:00:14 686 S poll_schedule_timeout php-fpm7.2
27914 1.0 00:00:14 1293 Sl pipe_wait apache2
302 2.0 00:00:15 741 S hrtimer_nanosleep php-fpm7.2
930 2.1 00:00:15 709 S skb_wait_for_more_packets php-fpm7.2
951 2.1 00:00:15 707 S hrtimer_nanosleep php-fpm7.2
546 2.2 00:00:16 719 S skb_wait_for_more_packets php-fpm7.2
2415 2.9 00:00:16 558 S skb_wait_for_more_packets php-fpm7.2
3507 3.8 00:00:16 422 S hrtimer_nanosleep php-fpm7.2
303 2.4 00:00:18 741 S hrtimer_nanosleep php-fpm7.2
29734 1.7 00:00:18 1061 S hrtimer_nanosleep php-fpm7.2
620 2.8 00:00:20 717 S hrtimer_nanosleep php-fpm7.2
31278 2.1 00:00:20 931 S skb_wait_for_more_packets php-fpm7.2
32761 2.7 00:00:20 742 Sl futex_wait_queue_me php-fpm7.2
3510 5.4 00:00:22 421 S skb_wait_for_more_packets php-fpm7.2
31277 2.3 00:00:22 931 S hrtimer_nanosleep php-fpm7.2
3497 5.4 00:00:23 423 S hrtimer_nanosleep php-fpm7.2
935 3.4 00:00:24 708 S skb_wait_for_more_packets php-fpm7.2
343 4.1 00:00:30 737 S hrtimer_nanosleep php-fpm7.2
31353 3.3 00:00:30 920 S hrtimer_nanosleep php-fpm7.2
I configured slowdebug
in: /etc/php/7.2/fpm/pool.d/www.conf
throwing a trace if a script takes longer than 10s. The password hashing sequence is often shown as delayed:
[20-Nov-2019 20:12:31] [pool www] pid 8734
script_filename = /var/www/nextcloud/remote.php
[0xb541ca60] password_verify() /var/www/nextcloud/lib/private/Security/Hasher.php:156
[0xb541c9f0] verifyHashV2() /var/www/nextcloud/lib/private/Security/Hasher.php:178
[0xb541c960] verify() /var/www/nextcloud/apps/user_ldap/lib/Connection.php:650
[0xb541c8c0] bind() /var/www/nextcloud/apps/user_ldap/lib/Access.php:1593
[0xb541c830] areCredentialsValid() /var/www/nextcloud/apps/user_ldap/lib/User_LDAP.php:208
[0xb541c780] checkPassword() /var/www/nextcloud/apps/user_ldap/lib/User_Proxy.php:108
[0xb541c730] call_user_func_array() /var/www/nextcloud/apps/user_ldap/lib/User_Proxy.php:108
[0xb541c650] callOnLastSeenOn() /var/www/nextcloud/apps/user_ldap/lib/Proxy.php:150
[0xb541c5c0] handleRequest() /var/www/nextcloud/apps/user_ldap/lib/User_Proxy.php:196
[0xb541c560] checkPassword() /var/www/nextcloud/lib/private/User/Manager.php:207
[0xb541c4d0] checkPasswordNoLogging() /var/www/nextcloud/lib/private/User/Manager.php:184
[0xb541c440] checkPassword() /var/www/nextcloud/lib/private/User/Session.php:719
[0xb541c390] checkTokenCredentials() /var/www/nextcloud/lib/private/User/Session.php:753
[0xb541c300] validateToken() /var/www/nextcloud/lib/private/User/Session.php:260
[0xb541c290] validateSession() /var/www/nextcloud/lib/private/User/Session.php:235
[0xb541c240] getUser() /var/www/nextcloud/lib/private/legacy/app.php:345
[0xb541c1b0] getEnabledApps() /var/www/nextcloud/lib/private/legacy/app.php:113
[0xb541c120] loadApps() /var/www/nextcloud/remote.php:148
@rasos i can confirm that password hashing was the thing I saw pretty much at all times when running gdb.
update
Checking slow logs confirm no1 is password hasher as well. I wonder if it has anything to do with ldap since not everyone sees this issue in the loose.
Just caught it on the small instance that only spikes every now and then. I set slow log to 5s since it's small instance of about 25 people and atm apart from few clients, its not being used. It does correspond to the cpu peak
I switched to php7.4 on the smaller instance but looks like there is no change (I was hoping this would be low hanging fruit fix). @kesselb who should we talk to further regarding this? is blizzz the person to pull into here (did not mention him on purpose yet to not bother if not needed). My wild guess is that it might have something to do with ldap since very few people experience the issue.
When checking processes with strace I see lot of /var/log/nextcloud/apps/calendar/composer/autoload.php -1 ENOENT (No such file or directory).
Do you see this for other apps or only calendar? Very confusing log line.
i can confirm that password hashing was the thing I saw pretty much at all times when running gdb.
I think we use a newer password hashing way for php > 7.3 Probably unrelated if happening with 7.3.11 but not 7.3.8.
who should we talk to further regarding this? i
cc @blizzz @rullzer regarding ldap and password hashing.
Do you see this for other apps or only calendar? Very confusing log line.
Yes I see it on many apps (I would say all). For instance, when checking a process with strace and grepping for it:
access("/var/www/nextcloud/apps/serverinfo/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/spreed/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/tasks/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/text/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/theming/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/twofactor_totp/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/viewer/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
stat("/var/www/nextcloud/themes/default", 0x7ffe7835f2a0) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/themes//defaults.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/lib/l10n/en.json", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/settings", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/activity/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/apporder/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/bookmarks/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/bruteforcesettings/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/calendar/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/checksum/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/circles/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/contacts/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/deck/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/files_ebookreader/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/files_external/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/files_pdfviewer/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/files_rightclick/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/files_videoplayer/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/firstrunwizard/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/gallery/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/keeporsweep/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/news/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/nextcloud_announcements/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/notes/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/notifications/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/ownpad/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/password_policy/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/privacy/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/qownnotesapi/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/rainloop/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/ransomware_protection/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
access("/var/www/nextcloud/apps/recommendations/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
Here is the example with lines prior.
access("/var/www/nextcloud.clean/apps/deck", F_OK) = 0
access("/var/www/nextcloud.clean/apps/deck/appinfo/info.xml", F_OK) = 0
stat("/var/www/nextcloud.clean/apps/deck/appinfo/info.xml", {st_mode=S_IFREG|0644, st_size=2792, ...}) = 0
access("/var/www/nextcloud.clean/apps/deck/composer/autoload.php", F_OK) = -1 ENOENT (No such file or directory)
I think we use a newer password hashing way for php > 7.3 Probably unrelated if happening with 7.3.11 but not 7.3.8.
Thanks for the info. Rasos looks like uses php 7.2
I have this problem too, on Debian10 with php-fpm 7.3.11.
With slowlog enabled, I see password_verify() /var/www/nextcloud/lib/private/Security/Hasher.php:156 quite often. I have an app that console.log the round trip time, so I can measure what's happening. I modified Hasher.php:verifyHasherV2() to always return true to skip that suspicious password_verify() function in line 156, but the response time remains high, so apparently the delay is before that.
PostgreSQL and LDAP are ruled out: pgsql response time 8ms, ldap bind 15-20ms.
My smaller instance (the situation on the big instance is rather dramatic with constant load grinding away) also logs in slowlog additionally:
[0x00007f9c28e204d0] password_hash() /var/www/nextcloud/lib/private/Security/Hasher.php:80
[0x00007f9c28e20450] hash() /var/www/nextcloud/apps/user_ldap/lib/Connection.php:666
[0x00007f9c28e20390] bind() /var/www/nextcloud/apps/user_ldap/lib/Connection.php:576
[0x00007f9c28e202c0] establishConnection() /var/www/nextcloud/apps/user_ldap/lib/Connection.php:195
[0x00007f9c28e20260] init() /var/www/nextcloud/apps/user_ldap/lib/Connection.php:203
[0x00007f9c28e201f0] getConnectionResource() /var/www/nextcloud/apps/user_ldap/lib/Connection.php:643
[0x00007f9c28e20130] bind() /var/www/nextcloud/apps/user_ldap/lib/Access.php:1624
[0x00007f9c28e20080] areCredentialsValid() /var/www/nextcloud/apps/user_ldap/lib/User_LDAP.php:208
[0x00007f9c28e1ffb0] checkPassword() /var/www/nextcloud/apps/user_ldap/lib/User_Proxy.php:108
[0x00007f9c28e1ff40] call_user_func_array() /var/www/nextcloud/apps/user_ldap/lib/User_Proxy.php:108
[0x00007f9c28e1fe40] callOnLastSeenOn() /var/www/nextcloud/apps/user_ldap/lib/Proxy.php:150
[0x00007f9c28e1fd90] handleRequest() /var/www/nextcloud/apps/user_ldap/lib/User_Proxy.php:196
[0x00007f9c28e1fd10] checkPassword() /var/www/nextcloud/lib/private/User/Manager.php:212
[0x00007f9c28e1fc60] checkPasswordNoLogging() /var/www/nextcloud/lib/private/User/Session.php:588
[0x00007f9c28e1fbd0] loginWithPassword() /var/www/nextcloud/lib/private/User/Session.php:343
[0x00007f9c28e1fb50] login() /var/www/nextcloud/lib/private/User/Session.php:438
[0x00007f9c28e1fa50] logClientIn() /var/www/nextcloud/apps/dav/lib/Connector/Sabre/Auth.php:130
[0x00007f9c28e1f9c0] validateUserPass() /var/www/nextcloud/3rdparty/sabre/dav/lib/DAV/Auth/Backend/AbstractBasic.php:105
[0x00007f9c28e1f8f0] check() /var/www/nextcloud/apps/dav/lib/Connector/Sabre/Auth.php:253
[0x00007f9c28e1f810] auth() /var/www/nextcloud/apps/dav/lib/Connector/Sabre/Auth.php:155
@andreas-p thank you for sharing your analysis. If I understand you correctly, even when skipping verifyHasherV2 the response time remains slow. In my test set-up that is longer than 30s if there are about 20 clients polling in parallel, but what could be the root cause? LDAP here is also answering fast as expected. Is SabreAuth not getting the session and trying again and again?
I created a test app, see https://github.com/andreas-p/ncping which will check the web server response time (ping), php response time (version) and nextcloud response time (check).
The NC instance's response time is never below 700ms, while php will respond within 30-35ms (which is the icmp ping rtt as well), so it's not a network/apache/php-fpm response problem.
I installed a fresh NC17.0.1 instance on that same server (6-core Xeon, no HT), backed by postgresql, but no ldap. The response time is the 700ms+ as well.
Interestingly, I have another installation on an Atom board (also Debian10/php7.3-fpm), NC response is about 130ms. This machine used to be a lot slower (x3) than that Xeon machine with NC15.
From the observation, having one fast machine performing slow, and a slow machine performing fast, I copied the fast-performing php configuration to the slow, and checked the difference. It turns out that having the xdebug extension enabled (but not configured in any way) is the culprit. removing php/conf.d/20-xdebug.ini made my system fast again.
@andreas-p well that is somehow to be expected,. that the debug extension slows down a system. Here is the list of extensions we use.
ls fpm/conf.d/
10-mysqlnd.ini 20-calendar.ini 20-ftp.ini 20-imagick.ini 20-pdo_mysql.ini 20-shmop.ini 20-sysvshm.ini 20-zip.ini
10-opcache.ini 20-ctype.ini 20-gd.ini 20-intl.ini 20-phar.ini 20-simplexml.ini 20-tokenizer.ini
10-pdo.ini 20-curl.ini 20-gettext.ini 20-json.ini 20-posix.ini 20-soap.ini 20-wddx.ini
15-xml.ini 20-dom.ini 20-gmp.ini 20-ldap.ini 20-pspell.ini 20-sockets.ini 20-xmlreader.ini
20-apcu.ini 20-exif.ini 20-iconv.ini 20-mbstring.ini 20-readline.ini 20-sysvmsg.ini 20-xmlwriter.ini
20-bz2.ini 20-fileinfo.ini 20-igbinary.ini 20-mysqli.ini 20-redis.ini 20-sysvsem.ini 20-xsl.ini
I rather think we need to further grab in the NC or sabredav code and find out, where LDAP / fpm based system have a race condition. I guess it is close to the authentication procedure, which seems to loop somewhere.
As mentioned above, a freshly installed test system without ldap suffered from the same problem until I removed xdebug.
We are observing that nextcloud is firing up to 17 connections at once to the ldap host with watch -n 1 netstat -tepn
. Those fires correlate to the high nextcloud CPU usage.
I've seen only single connections to openldap.
@rasos I can confirm but I dont know if its just yet another symptom. Since the password hash function is slow logged it makes sense it checks with ldap. I see multiple connection firing up upon the peak, which also corresponds to clients sync (calendars mainly via evolution) on ~40 user instance when checking logs. On the big instance i see ton of continuous ldap connection (atm between 700-800 constant) but thats also because there I'm not dealing with spikes but rather constant issue.
Here is one of the short spikes on the small instance: https://bin.disroot.org/?1834918990e4c7b5#EJQ7rnu3QEYi6uad2oS8ZQoxqJmyeSZZPFy6iFT5nwMM
@blizzz @kesselb I'm trying to deploy a test instance and somehow force the issue occurence. Would you find it useful that if I manage to achieve that, I give you root access to the server to check and investigate? The situation on the big instance is very bad at the moment and users as much patience as they have, start to complain (which I'm not surprised), so I would like to help with finding the root cause as much as possible. If there is anything else that you guys require please also just say the word. If you have in-house stress tests that could be used, please let me know so I can check on test instance.
To sum up several tests having been made by people in this thread:
We tested again the LDAP connection, which responses within <1 ms. So our OpenLDAP server is not the bottleneck.
Did anybody try to downgrade to php 7.3.7 to narrow down the version change as a trigger? @muppeth @andreas-p
@rasos With ransomware on the loose even though I have updated nginx config I dont dare to downgrade. I have kept it on old (check which one it was) version for quite a while with no issues, only when upgrading the issue pops up, so I'm quite certain the upgrade triggered the issue.
Edit:
Definitely the last working version of php was 7.3.7
@rasos have you perhaps upgraded to 7.3.12?
I find it very odd that this issue is not reported by more users, as it looks like it should affect anyone running nextcloud with ldap auth and about 30 users who use calendar/contact sync or mount webdav on their machines.
I have setup a separate from scratch test instance. It's just nextcloud 17.0.1 on top of php7.3-fpm and nginx using mariadb.
I've created 100 users, set them all up on a laptop running gnome using "online accounts". I then enabled them all, or at least tried. While enabling / mounting webdav links on nautilus, around user50 the nextcloud has started timming out. Checking the server it has had all set workers active and slowlog started logging the Hasher function as mentioned above.
The test users have no data besides the default 'skel' and personal calendars (with no data on them). So it is not even about the amount of files or other data, that could trigger it.
@blizzz @kesselb @rullzer if you want. I can give you ssh access to the machine so you can check whats going on and maybe it will be easier for you to spot the issue. I would really appreciate someone from dev team looking into it as it is very critical imo and it can be easily reproducible
Update
The way to trigger the issue. Go to Gnome Calendar app and hit, "Syncronize calendars". Even if the calendars have no data, all php workers will get active exhausting the resources. Probably if there is data it takes longer. Also depending on amount of users and how often they sync, either spikes or constant load will be experienced.
I can see the issue is not triggered by searchDAV. To test I have simply removed icewind/searchdav (just to check) and upon hitting the "Synchronize calendars" I can observe the symptoms (load jumps, amount of active php workers maxes out, slow log logs the above posted).
@muppeth most of our users sync files only, and we observe the same effect as you describe. The headline of the issue should probably be rephrased by omitting "apps from phone".
As you now have a test instance, would you dare to step back the php version to check, which bugfix in php could have triggered the behaviour https://www.php.net/ChangeLog-7.php ? If you do not assign PATH_INFO then you should not face a risk with nginx https://bugs.php.net/bug.php?id=78599
@rasos Yes one of the reasons I did it was to do exactly this. I currently have knife on my throat regarding this issue so really want to fix it asap. I'll go through the change log, commit after commit to try to find which one is triggering it.
Also I hope for nextcloud devs to pop over coming week over here. I am going to create VBox image with all the test accounts created (just like I did on my laptop) so the issue can be triggered at will which should make it easy to find whats going.
Could this be somehow related to the bug fix https://bugs.php.net/bug.php?id=78269 ?
Change log on php Version 7.3.8 points to a bug fix related to password hashing
Place your custom configuration in the above field. Possible values are: https://www.php.net/manual/en/password.constants.php
Unfortunately options are not changeable via config.php.
@kesselb Setting the options to 'memory_cost' => 2048, 'time_cost' => 4, 'threads' => 3
seems to lower the load to values before the php update. slow log doesn't show any slow queries related to password hashing as posted previously by @rasos
If we use the current default values 64<<10
(65536) for memory_cost, as introduced in fix #78269, the number of workers maxes out and we experience the above symptoms.
We don't know what is the exact cause - are the values to high or is it related to Nextcloud hasher. The same happens on two different servers, different hardware with different amount of users and load.
Edit: currently we set the values to 'memory_cost' => 16<<10, 'time_cost' => 4
which works OK load wise, still not ideal.
To develop: Make memory_cost, time_cost, threads
configurable (probably expose options). Show warning to users with hashingCost
.
@kesselb can you clarify on the warning part what do you mean?
Does hashingCost
option in nextcloud is set to to 10
by default and does it mean it multiply everything by 10?
I still wonder (I will try to run some tests this week) if I get the same load issues if running similar setup (nextcloud with 100 test users syncing calendars) but without Ldap auth. I find it very weird that so little people (I did ask couple of other admins I know) experience this issue if it affects everyone out there.
can you clarify on the warning part what do you mean?
hashingCost
option is applicable for bcrypt but not for aragon2i. If $options is configurable via config.php we should show a warning to user with hashingCost
that this value is deprecated. < 7.3 cost
otherwise memory_cost, time_cost, threads
. No warning if hashingCost
is not set in config.php
.
in nextcloud is set to to 10 by default
No. It's the default for password_hash
and bcrypt.
Patching in nextcloud/lib/private/Security/Hasher.php
the hashing load parameters has tremendously reduced CPU load on a production environment.
/** private $options = []; */
private $options = [
'memory_cost' => 2048,
'time_cost' => 2,
'threads' => 1
];
The parameters should become configurable in php or nextcloud. Or even better, threads
should be something like CPU cores available / 2
and memory cost
should be a fraction (/32?) of max memory assigned to fpm workers.
Default threads = 2
does not work on small / cheap virtual machines which have assigned only one core.
EDIT - adding a monitoring screenshot - we added a 2nd CPU core Dec3 and hashing patch was applied Dec 17
Tried this patch on my low end device. Until now it is working perfectly.
Without it the system was unusable with the Android app. Web ui and desktop client worked fine and never had problems with high cpu usage on the server.
Something must have been changed with server version 16.0.5 causing the Android app to cause this.
Tried this patch on my low end device. Until now it is working perfectly.
Without it the system was unusable with the Android app. Web ui and desktop client worked fine and never had problems with high cpu usage on the server.
Something must have been changed with server version 16.0.5 causing the Android app to cause this.
Sounds like https://github.com/nextcloud/server/issues/17447
Sounds like #17447
I upgraded to 16.0.7 in hope for resolving this. But the problem was still there. #17447 should be fixed in this release, shouldn't it?
I've also switched to 16.0.7 in Dec 2019 but the high CPU usage problem persisted. I experience the problem with the Nextcloud Notes Android app and with Thunderbird Lightning on my Desktop PC.
A few days ago I have set up a my system from scratch with Armbian Buster and Nextcloud 17.0.2 but I still have the high CPU load problem.
Have you tried patching _nextcloud/lib/private/Security/Hasher.php_ as in rasos last post above?
Worked for me perfectly.
I've obviously overseen the most important post ;-) After patching Hasher.php both the Notes app sync and the Lightning sync work perfectly for me now :-) Thanks a lot
Hopefully this setting becomes a configurable option in one of the next releases.
With maybe an automatic default setting based on available cpu cores and memory as suggested by rasos.
hashing configuration will be exposed with #19023 (will be backported)
Hm, the patch allows tuning the config to solve the problem manually if you know what you're doing, I wonder if the typical wants/can do that. Even identifying where the problem is ain't easy, so IMHO the automatic configuration is quite important (maybe in the 1st-time installer only).
That's something to discuss with the php developers. I'm sure there is a good reason for this default.
Most helpful comment
@kesselb Setting the options to
'memory_cost' => 2048, 'time_cost' => 4, 'threads' => 3
seems to lower the load to values before the php update. slow log doesn't show any slow queries related to password hashing as posted previously by @rasosIf we use the current default values
64<<10
(65536) for memory_cost, as introduced in fix #78269, the number of workers maxes out and we experience the above symptoms.We don't know what is the exact cause - are the values to high or is it related to Nextcloud hasher. The same happens on two different servers, different hardware with different amount of users and load.
Edit: currently we set the values to
'memory_cost' => 16<<10, 'time_cost' => 4
which works OK load wise, still not ideal.