We are running Oracle Linux 6.9. Since upgrading icinga2 today, ps crashes when it is run as the icinga user.
abrt report
abrt_version: 2.0.8
cgroup:
cmdline: /bin/ps -eo 's uid pid ppid vsz rss pcpu etime comm args'
event_log:
executable: /bin/ps
hostname: silo2
kernel: 3.8.13-118.11.2.el6uek.x86_64
last_occurrence: 1524644388
machineid: sosreport_uploader-dmidecode=3e09daeef311ed180ecdce08b9798954e1b07b24b7a91ae57195bf48c0f82fa9
pid: 2219
pkg_arch: x86_64
pkg_epoch: 0
pkg_fingerprint: 72F9 7B74 EC55 1F03
pkg_name: procps
pkg_release: 45.0.1.el6_9.1
pkg_vendor: Oracle America
pkg_version: 3.2.8
pwd: /
time: Wed 25 Apr 2018 09:15:18 AM CEST
uid: 498
username: icinga
sosreport.tar.xz: Binary file, 1256500 bytes
core_backtrace:
:{ "signal": 11
:, "executable": "/bin/ps"
:, "stacktrace":
: [ { "crash_thread": true
: , "frames":
: [ { "address": 4206748
: , "build_id": "2ab2498a96e7cfc4942207da4da8376443d1d7ba"
: , "build_id_offset": 12444
: , "file_name": "/bin/ps"
: }
: , { "address": 4203318
: , "build_id": "2ab2498a96e7cfc4942207da4da8376443d1d7ba"
: , "build_id_offset": 9014
: , "file_name": "/bin/ps"
: } ]
: } ]
:}
dso_list:
:/lib64/ld-2.12.so glibc-2.12-1.209.0.3.el6_9.2.x86_64 (Oracle America) 1497950689
:/lib64/libproc-3.2.8.so procps-3.2.8-45.0.1.el6_9.1.x86_64 (Oracle America) 1499866591
:/bin/ps procps-3.2.8-45.0.1.el6_9.1.x86_64 (Oracle America) 1499866591
:/lib64/libc-2.12.so glibc-2.12-1.209.0.3.el6_9.2.x86_64 (Oracle America) 1497950689
:/lib64/libselinux.so.1 libselinux-2.0.94-7.el6.x86_64 (Oracle America) 1475744335
:/lib64/libdl-2.12.so glibc-2.12-1.209.0.3.el6_9.2.x86_64 (Oracle America) 1497950689
environ:
:TERM=screen
:PATH=/sbin:/usr/sbin:/bin:/usr/bin
:PWD=/
:LANG=en_US.UTF-8
:SHLVL=1
:LC_NUMERIC=C
:LC_ALL=C
limits:
:Limit Soft Limit Hard Limit Units
:Max cpu time unlimited unlimited seconds
:Max file size unlimited unlimited bytes
:Max data size unlimited unlimited bytes
:Max stack size 262144 unlimited bytes
:Max core file size 0 unlimited bytes
:Max resident set unlimited unlimited bytes
:Max processes 16384 16384 processes
:Max open files 16384 16384 files
:Max locked memory 65536 65536 bytes
:Max address space unlimited unlimited bytes
:Max file locks unlimited unlimited locks
:Max pending signals 63680 63680 signals
:Max msgqueue size 819200 819200 bytes
:Max nice priority 0 0
:Max realtime priority 0 0
:Max realtime timeout unlimited unlimited us
maps:
:00400000-00414000 r-xp 00000000 fc:00 1839 /bin/ps
:00614000-00615000 rw-p 00014000 fc:00 1839 /bin/ps
:00615000-00635000 rw-p 00000000 00:00 0
:00cbb000-00cdc000 rw-p 00000000 00:00 0 [heap]
:7f877f7bc000-7f877f7be000 r-xp 00000000 fc:00 24469 /lib64/libdl-2.12.so
:7f877f7be000-7f877f9be000 ---p 00002000 fc:00 24469 /lib64/libdl-2.12.so
:7f877f9be000-7f877f9bf000 r--p 00002000 fc:00 24469 /lib64/libdl-2.12.so
:7f877f9bf000-7f877f9c0000 rw-p 00003000 fc:00 24469 /lib64/libdl-2.12.so
:7f877f9c0000-7f877fb4a000 r-xp 00000000 fc:00 3016 /lib64/libc-2.12.so
:7f877fb4a000-7f877fd4a000 ---p 0018a000 fc:00 3016 /lib64/libc-2.12.so
:7f877fd4a000-7f877fd4e000 r--p 0018a000 fc:00 3016 /lib64/libc-2.12.so
:7f877fd4e000-7f877fd50000 rw-p 0018e000 fc:00 3016 /lib64/libc-2.12.so
:7f877fd50000-7f877fd54000 rw-p 00000000 00:00 0
:7f877fd54000-7f877fd62000 r-xp 00000000 fc:00 4212 /lib64/libproc-3.2.8.so
:7f877fd62000-7f877ff62000 ---p 0000e000 fc:00 4212 /lib64/libproc-3.2.8.so
:7f877ff62000-7f877ff63000 rw-p 0000e000 fc:00 4212 /lib64/libproc-3.2.8.so
:7f877ff63000-7f877ff77000 rw-p 00000000 00:00 0
:7f877ff77000-7f877ff94000 r-xp 00000000 fc:00 18559 /lib64/libselinux.so.1
:7f877ff94000-7f8780193000 ---p 0001d000 fc:00 18559 /lib64/libselinux.so.1
:7f8780193000-7f8780194000 r--p 0001c000 fc:00 18559 /lib64/libselinux.so.1
:7f8780194000-7f8780195000 rw-p 0001d000 fc:00 18559 /lib64/libselinux.so.1
:7f8780195000-7f8780196000 rw-p 00000000 00:00 0
:7f8780196000-7f87801b6000 r-xp 00000000 fc:00 3008 /lib64/ld-2.12.so
:7f87803a3000-7f87803a7000 rw-p 00000000 00:00 0
:7f87803b5000-7f87803b6000 rw-p 00000000 00:00 0
:7f87803b6000-7f87803b7000 r--p 00020000 fc:00 3008 /lib64/ld-2.12.so
:7f87803b7000-7f87803b8000 rw-p 00021000 fc:00 3008 /lib64/ld-2.12.so
:7f87803b8000-7f87803b9000 rw-p 00000000 00:00 0
:7ffcc5fd9000-7ffcc5ffa000 rw-p 00000000 00:00 0 [stack]
:7ffcc5ffd000-7ffcc5fff000 r-xp 00000000 00:00 0 [vdso]
:ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
open_fds:
:0:/dev/null
:pos: 0
:flags: 0100002
:1:pipe:[188294541]
:pos: 0
:flags: 01
:2:pipe:[188294542]
:pos: 0
:flags: 01
var_log_messages:
:Apr 25 09:15:18 silo2 kernel: ps[2219]: segfault at 7ffcc5f77ef8 ip 000000000040309c sp 00007ffcc5f77f00 error 6 in ps[400000+14000]
:Apr 25 09:15:18 silo2 abrt[2220]: Saved core dump of pid 2219 (/bin/ps) to /var/spool/abrt/ccpp-2018-04-25-09:15:18-2219 (503808 bytes)
:Apr 25 09:15:22 silo2 kernel: ps[2468]: segfault at 7ffdff0e49e8 ip 000000000040309c sp 00007ffdff0e49f0 error 6 in ps[400000+14000]
:Apr 25 09:15:22 silo2 abrt[2469]: Not saving repeating crash in '/bin/ps'
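For reference, the numbers in this report can be checked against each other. The sketch below is only an illustration, assuming the crash is stack-related: it takes the faulting address from the first kernel segfault line, the upper end of the [stack] mapping from maps, and the soft "Max stack size" from limits, and computes how far below the top of the stack the faulting write landed. The constant and function names are made up for this example.

```python
# Values copied from the abrt report above.
STACK_TOP = 0x7FFCC5FFA000      # upper end of the [stack] mapping in "maps"
FAULT_ADDR = 0x7FFCC5F77EF8     # "segfault at ..." address in var_log_messages
STACK_SOFT_LIMIT = 262144       # soft "Max stack size" in "limits" (bytes)


def stack_depth_at_fault(stack_top: int, fault_addr: int) -> int:
    """Bytes between the top of the stack mapping and the faulting address."""
    return stack_top - fault_addr


depth = stack_depth_at_fault(STACK_TOP, FAULT_ADDR)
exceeds_limit = depth > STACK_SOFT_LIMIT

print(f"depth at fault: {depth} bytes")
print(f"soft stack limit: {STACK_SOFT_LIMIT} bytes")
print(f"fault lies beyond the allowed stack: {exceeds_limit}")
```

The faulting address sits further below the top of the stack than the 262144-byte soft limit allows, which would be consistent with ps trying to grow its stack past the limit. On x86, "error 6" denotes a user-mode write to a page that is not mapped, which fits that reading.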
We seem to have the same problem.
This command, run from a Perl script:
$msg_count = `$path_to_sudo $path_to_exim -bpc`;
returns error code 11.
The same command, run directly from a shell as the icinga user, returns code 0.
After downgrading to 2.8.2-1, everything works as before.
@olegy89 Also on Oracle?
@Crunsher centos6, centos7
I'm not able to reproduce this on CentOS 7. Can you share your exact Host, Service and CheckCommand object definitions?
This works fine:
object Host "c" {
check_command = "c"
check_interval = 5s
retry_interval = 5s
}
object CheckCommand "c" {
command = [ "/bin/ps", "-eo", "s uid pid ppid vsz rss pcpu etime comm args" ]
}
Hi,
In our case it is the mailq command that fails. It did not fail at all with earlier icinga2 versions. These checks have been running for months, the nagios user is allowed to run the command, and so on.
nagios$ mailq
nagios$
nagios$ '/usr/lib/nagios/plugins/check_mailq' '-M' 'exim' '-c' '5' '-w' '2'
OK: exim mailq (0) is below threshold (2/5)|unsent=0;2;5;0
CRITICAL: Error code 0 returned from /usr/bin/mailq
[2018-04-25 09:36:34 +0200] notice/Process: Running command '/usr/lib/nagios/plugins/check_mailq' '-M' 'exim' '-c' '5' '-w' '2': PID 3365
[2018-04-25 09:36:34 +0200] notice/Process: PID 3365 ('/usr/lib/nagios/plugins/check_mailq' '-M' 'exim' '-c' '5' '-w' '2') terminated with exit code 2
Apr 25 09:36:34 lnv-2065 kernel: [ 974.883826] mailq[3366]: segfault at 7fff49eca968 ip 0000559f3db94463 sp 00007fff49eca810 error 6 in exim4[559f3db80000+f3000]
````
apply Service "mailq" {
check_command = "mailq"
max_check_attempts = "5"
check_period = "always"
check_interval = 1m
retry_interval = 1m
check_timeout = 10s
enable_notifications = false
enable_active_checks = true
enable_passive_checks = true
enable_event_handler = true
enable_perfdata = true
volatile = false
assign where "Linux Agent via Icinga 2 Core" in host.templates
command_endpoint = host_name
vars.mailq_critical = "5"
vars.mailq_servertype = "exim"
vars.mailq_warning = "2"
import DirectorOverrideTemplate
}
````
It was fine before and has been happening since we installed icinga2-2.8.3-1.
We are using Ubuntu 16.04 LTS and Ubuntu 14.04 LTS.
Cheers,
Marianne
We have noticed the problem only with the external command 'sudo exim -bpc' and the 'check_ipmi_sensor' plugin.
'ps' works fine. However, 'eximq' does not fail on every host, even though the hosts run the same versions of icinga and exim.
object CheckCommand "eximq" {
import "ipv4-or-ipv6"
command = [ PluginDir + "/base/" + "check_eximq" ]
arguments = {
"--critical" = "$critical$"
"--warning" = "$warning$"
}
timeout = "60"
}
cat ./check_eximq
#!/usr/bin/env perl
$msg_count = `sudo exim -bpc`;
print $?;
exit;
sudo -u icinga ./check_eximq
0
Result displayed in icinga web:
Plugin Output
11
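A note on the value 11: Perl's $? holds the raw wait status of the last backtick command rather than a plain exit code. On Linux, a raw status of 11 would mean the child was killed by signal 11 (SIGSEGV) without a core dump, which would be consistent with the segfaults reported above. A minimal Python sketch of that decoding, using only the standard os and signal modules (the function name describe_wait_status is made up for this example):

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Decode a raw wait() status word, like the value Perl stores in $?."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        # signal.Signals maps the number to its symbolic name on this platform.
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return f"unrecognised status {status}"


print(describe_wait_status(11))  # the value shown in Icinga Web
print(describe_wait_status(0))   # the value seen when run from a shell
```

The usual Perl idiom for getting the exit code is $? >> 8, and $? & 127 gives the signal number, so printing $? directly mixes the two.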
@olegy89 Could you run uname -srvmo on the machine? The problem might be kernel-specific.
@Crunsher
Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 GNU/Linux
Linux 3.10.0-714.10.2.lve1.5.12.el7.x86_64 #1 SMP Fri Feb 2 00:27:48 EST 2018 x86_64 GNU/Linux
Linux 3.10.0-693.21.1.vz7.46.3 #1 SMP Mon Apr 2 18:21:35 MSK 2018 x86_64 GNU/Linux
@Crunsher affected examples:
Linux 4.4.0-121-generic #145-Ubuntu SMP Fri Apr 13 13:47:23 UTC 2018 x86_64 GNU/Linux
Linux 3.13.0-144-generic #193-Ubuntu SMP Thu Mar 15 17:03:53 UTC 2018 x86_64 GNU/Linux
Thanks! So it has nothing to do with the kernel. *sigh*
I am able to reproduce this using @sysadmama's config example.
@dnsmichi
we are using check_procs.
apply Service "procs" {
import "generic-service"
check_command = "procs"
assign where host.name == NodeName
}
The commit at fault is bf959371c4505bfe27b0682611c035d64b90efd3
Tickets: #6119 #6215
We've isolated the problem and are preparing 2.8.4 which reverts the regression.
Backported to support/2.8
Release is in progress: https://github.com/Icinga/icinga2/blob/master/RELEASE.md
Thanks for the reports everyone 💪
2.8.4 is published to our package repos.
[root@608c145dffda /]# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.8.4-1)
Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
System information:
Platform: CentOS Linux
Platform version: 7 (Core)
Kernel: Linux
Kernel version: 4.9.87-linuxkit-aufs
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: unknown
I can confirm the patch is working for check_ipmi_sensor. Thanks!
Yup, the patch is working for me too. Thanks! :tada:
Hi,
just for information: I have had the same issue with my nagios-plugins-ceph (check_ceph_*) and check_ipmi plugins. It took me some hours to track down, but going back to 2.8.2-1.stretch solved the problem.
@linuxmail 2.8.4 has this fixed
I did a little reading yesterday evening on the faulty patch. As a technical reference, it appears to have changed the way the default stack size is set and handled later on. This resulted in a stack size that was too low, so specific applications/plugins started from this process/thread space would crash.
We've seen a similar thing with the stack guard patches in the RHEL kernel, where setting the stack size also failed and made applications crash. That experience, together with the fact that the only known change was located in application.cpp, justifies the immediate revert for production. Future patches in this area will be reviewed over the long term and, if not properly proven with test protocols, will likely not get merged.
Cheers,
Michael
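To illustrate why a stack limit set in the daemon affects plugins: resource limits are inherited by child processes, so a limit applied before exec is what the check command runs with. The following is a minimal Python sketch on Linux, not Icinga code. The helper name child_stack_limit_kib is made up, and /bin/sh is assumed to provide the ulimit builtin.

```python
import resource
import subprocess


def child_stack_limit_kib(soft_bytes: int) -> str:
    """Start /bin/sh with a lowered soft stack limit and return what it reports."""

    def lower_stack_limit() -> None:
        # Runs in the child between fork and exec; the hard limit is left as is.
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK, (soft_bytes, hard))

    result = subprocess.run(
        ["/bin/sh", "-c", "ulimit -s"],   # ulimit -s reports KiB
        preexec_fn=lower_stack_limit,
        env={"PATH": "/bin:/usr/bin"},
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(child_stack_limit_kib(256 * 1024))
```

The child reports the lowered limit even though the parent's own limit is unchanged. A program that needs more stack than that, like ps in the first report here, would then fault when it tries to grow its stack.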
Hello @dnsmichi
Sorry for this regression. I believe it happens because bf95937 fixes the rlimit stack resetting feature, which lets the default rlimit value of 256 * 1024 (hardcoded at https://github.com/Icinga/icinga2/blob/v2.8.4/lib/base/application.cpp#L1503) take effect, and that value is too low for some specific check commands.
I think we can just fix the logic there (https://github.com/Icinga/icinga2/blob/v2.8.4/lib/base/application.cpp#L249): if the user didn't set the RLimitStack config, we simply don't reset the rlimit value.
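The proposed rule can be sketched as follows. This is a hypothetical illustration in Python, not the code in application.cpp: stack_limit_to_apply and its parameters are invented here, and None stands for "RLimitStack was not set by the user".

```python
import resource
from typing import Optional, Tuple


def stack_limit_to_apply(
    configured_soft: Optional[int],
    current: Tuple[int, int],
) -> Optional[Tuple[int, int]]:
    """Return the (soft, hard) pair to set, or None to leave the limit alone."""
    if configured_soft is None:
        # No explicit setting: keep whatever limit the process inherited.
        return None
    _, hard = current
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot raise the soft limit above the hard one.
        configured_soft = min(configured_soft, hard)
    return (configured_soft, hard)


current = resource.getrlimit(resource.RLIMIT_STACK)
print(stack_limit_to_apply(None, current))          # nothing to change
print(stack_limit_to_apply(1024 * 1024, current))   # explicit setting wins
```

With this rule the hardcoded 256 * 1024 default would never be applied implicitly. Only an explicit setting would change the limit that check commands inherit.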
@tclh123 Feel free to open another PR. We would like to have this fixed, but our hands are full with 2.9.0.
Such a PR must include a test protocol that tests all the edge cases with and without the patch, and it requires long-term tests. As can be seen, breaking things here has wider implications.