oc v1.2.0-1-g7386b49
kubernetes v1.2.0-36-g4a3f9c5
tail -n 100 /var/log/messages
Jun 17 08:25:45 os_master_01 journal: time="2016-06-17T08:25:45.11x737815+02:00" level=error msg="Error reading loginuid: open /proc/32654/loginuid: no such file or directory"
Jun 17 08:25:46 os_master_01 journal: time="2016-06-17T08:25:46.115x25135+02:00" level=error msg="Error reading loginuid: open /proc/32654/loginuid: no such file or directory"
Jun 17 08:25:46 os_master_01 journal: time="2016-06-17T08:25:46.582x95897+02:00" level=error msg="Error reading loginuid: open /proc/32654/loginuid: no such file or directory"
Jun 17 08:25:47 os_master_01 journal: time="2016-06-17T08:25:47.115x30728+02:00" level=error msg="Error reading loginuid: open /proc/32654/loginuid: no such file or directory"
Jun 17 08:25:48 os_master_01 journal: time="2016-06-17T08:25:48.116x68137+02:00" level=error msg="Error reading loginuid: open /proc/32654/loginuid: no such file or directory"
Jun 17 08:25:48 os_master_01 journal: time="2016-06-17T08:25:48.582x07583+02:00" level=error msg="Error reading loginuid: open /proc/32654/loginuid: no such file or directory"
Jun 17 08:25:48 os_master_01 journal: time="2016-06-17T08:25:48.58257x671+02:00" level=error msg="Error reading loginuid: open /proc/32654/loginuid: no such file or directory"
The logs are shown when executing sudo service docker status.
Can someone explain this behaviour? The logs disappear when I perform sudo service origin-node stop.
@runcom does this ring a bell?
@lvthillo what OS? What docker version? If you have docker running and origin-node is not running, do you see additional error messages like this when you run commands like docker ps?
@ncdc is that with Docker 1.10.3?
@runcom does this ring a bell?
Btw, yes, that's coming from a patch we're carrying (the audit one) - @nalind wrote it, not sure what's going on honestly (except the pid isn't there anymore and we can't read loginuid)
@ncdc @runcom
The logs only appear on the servers that are running my nodes
Centos 7.2
Client:
Version: 1.9.1
API version: 1.21
Package version: docker-common-1.9.1-40.el7.centos.x86_64
Go version: go1.4.2
Git commit: ab77bde/1.9.1
Built:
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Package version: docker-common-1.9.1-40.el7.centos.x86_64
Go version: go1.4.2
Git commit: ab77bde/1.9.1
Built:
OS/Arch: linux/amd64
sudo service origin-node status gave me:
state running:
9739 container_manager_linux.go:267] failed to detect process id for "docker" - failed to find pid o...it status 1
sudo service docker status:
time="2016-06-21T08:48:24.971651035+02:00" level=error msg="Error reading loginuid: open /proc/xxxx/loginuid: no such fil... directory"
Now I stop my node (docker keeps running):
docker status: level=error msg="Error reading loginuid: open /proc/9739/loginuid: no such fil... directory"
restart docker:
Jun 21 08:49:32 forward-journal[13338]: time="2016-06-21T08:49:32.791468938+02:00" level=info msg="Docker daemon" commit="ab77bde/1.9.1" execdriver=native-0.2 g...rsion=1.9.1
Jun 21 08:49:32 forward-journal[13338]: time="2016-06-21T08:49:32.801115851+02:00" level=info msg="API listen on /var/run/docker.sock"
Jun 21 08:49:32 systemd[1]: Started Docker Application Container Engine.
When I start my openshift-node again, docker shows me again:
time="2016-06-21T08:51:21.581770368+02:00" level=error msg="Error reading loginuid: open /proc/13504/loginuid: no such f... directory"
the node is showing me:
I0621 08:50:10.993842 13504 manager.go:261] Recovery completed
Jun 21 08:50:11 origin-node[13504]: I0621 08:50:11.031174 13504 kubelet.go:2770] Recording NodeHasSufficientDisk event message for node
Jun 21 08:50:11 [13504]: I0621 08:50:11.031203 13504 kubelet.go:2770] Recording NodeReady event message for node
Jun 21 08:50:15 origin-node[13504]: I0621 08:50:15.654958 13504 kubelet.go:2404] SyncLoop (ADD, "api"): ""
Hello guys, the same is happening here.
A lot of spam messages from origin-node.service are filling up /var/log/messages.
GCE
OS: CentOS Linux release 7.2.1511 (Core)
OpenShift Origin
oc v1.2.0
kubernetes v1.2.0-36-g4a3f9c5
Same client & server version of the docker daemon as described in this thread.
If you need any more info or help on this matter, I will be glad to provide it.
Best regards
Gabriel Scheffer
@lvthillo @gabrielscheffer please stop origin-node, then run some docker commands (docker images, docker ps, etc) - do you continue to see the loginuid error messages showing up for the docker daemon?
@ncdc (in my case) the logs aren't shown anymore. They only show up when the origin node is also running. After restarting the node, the logs are back
Same as here, the logs only show up when origin-node is running.
And I noticed the following logs on the service status:
origin-node[49497]: E0621 13:40:46.056433 49497 container_manager_linux.go:267] failed to detect process id for "docker" - failed to find pid of "docker": exit status 1
origin-node[49497]: E0621 13:45:46.063560 49497 container_manager_linux.go:267] failed to detect process id for "docker" - failed to find pid of "docker": exit status 1
I don't know if it is in some way related to this issue.
@gabrielscheffer did you maybe try it on CentOS 7.1 or 7.0? I don't know if 7.2 is recommended (we are also using it).
@lvthillo Hmm, I didn't know that. I just got the default CentOS 7 image from my cloud provider. I will give an older version a try.
So, I created a small cluster from scratch.
But I have some bad news: unfortunately, even on CentOS Linux release 7.1.1503 (Core) I got those spam logs.
I reinstalled it multiple times, but every time the same issue appears on each node.
@gabrielscheffer which cloud provider are you using? We want to rule out that it is our provider or something.
@lvthillo Google Compute Engine.
But a friend of mine has the same problem on AWS.
@gabrielscheffer Okay, good to know. We are using OVH and have the same problem too, so it's an OpenShift issue.
Are you all running OpenShift containerized?
@ncdc No, using the ansible installation. I have reinstalled it every day since Friday and always get the same issue.
@ncdc Nope mate, running the default install from Ansible master branch.
I'm spinning up a Centos VM to check things out.
@ncdc Were you able to reproduce the issue?
Not yet, at least not with the latest origin master code. I didn't try an ansible install - I was just building and running from source. I'm checking out v1.2.0 and will try that tag to see if the behavior is any different.
@ncdc I've done multiple installs and my oc version sometimes seems a bit different, but the issue always comes up.
Are you running pods on the node, or is the node empty of containers?
@ncdc I'm able to run pods (router & registry) on my node, but the errors are there after the playbook is finished, so even when the node is empty of pods/containers
Trying an ansible install now. Will report back.
I get this, which I assume has a similar root cause: level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
Actually, that's https://bugzilla.redhat.com/show_bug.cgi?id=1335635
@lvthillo @gabrielscheffer are you able to look for /proc/<pid>/loginuid - does it exist?
@ncdc for me it does not exist (/proc/<pid> exists, but not the loginuid file). What steps are you following to install openshift origin without ansible?
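For anyone following along, the check can be scripted like this (a sketch; pid 1 is only a placeholder — substitute the pid from the error message):

```shell
#!/bin/sh
# Check whether the audit loginuid file exists for a given pid.
# pid=1 is only a placeholder; use the pid from the error message instead.
pid=1
if [ -e "/proc/$pid/loginuid" ]; then
    echo "loginuid for pid $pid: $(cat "/proc/$pid/loginuid")"
else
    echo "no loginuid file for pid $pid (kernel likely lacks CONFIG_AUDIT)"
fi
```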
@lvthillo I'm using an ansible-based install now. And I'm hitting the bz I linked above. But I can't reproduce your error locally.
@ncdc I want to install openshift v1.2.0 without using ansible so that the issue doesn't come up (you didn't see the issue when you weren't using the ansible install)
I generally do something like this (assuming you have a working go environment):
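(a sketch of the likely steps, since the exact commands weren't captured in this thread — the repo layout, tag, and make target are assumptions; check the origin repository's README for the authoritative version)

```shell
# Sketch: build and run OpenShift Origin from source (assumed layout and
# targets; consult the origin repository's README for the exact steps).
mkdir -p "$GOPATH/src/github.com/openshift"
cd "$GOPATH/src/github.com/openshift"
git clone https://github.com/openshift/origin.git
cd origin
git checkout v1.2.0          # or master
make build                   # builds binaries under _output/local/bin
sudo ./_output/local/bin/linux/amd64/openshift start
```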
@ncdc so, /proc/
My ansible installation uses deployment_type=origin from the master branch of openshift-ansible on GitHub.
Steps to reproduce:
Maybe that is the problem? Should I open an issue with the ansible team?
Can you paste the current pid with the loginuid error along with the output of ps auxwf?
Sure, but after the last reinstall I'm just getting these errors:
------/var/log/messages------
Jun 22 14:39:45 master-node-1 journal: time="2016-06-22T14:39:45.829214266-03:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
Jun 22 14:39:46 master-node-1 journal: time="2016-06-22T14:39:46.167511935-03:00" level=error msg="Failed to get pwuid struct: user: unknown userid 4294967295"
-------------ps auxwf------------------------
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 125964 6680 ? Ss 12:08 0:01 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
root 234 0.1 0.3 53628 14492 ? Ss 12:08 0:10 /usr/lib/systemd/systemd-journald
root 245 0.0 0.1 202876 4080 ? Ssl 12:08 0:00 /usr/sbin/lvmetad -f
root 259 0.0 0.1 45884 4636 ? Ss 12:08 0:00 /usr/lib/systemd/systemd-udevd
root 334 0.0 0.0 51188 1612 ? S<sl 12:08 0:00 /sbin/auditd -n
dbus 374 0.0 0.0 35160 2132 ? Ssl 12:08 0:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 376 0.0 0.0 203368 1240 ? Ssl 12:08 0:00 /usr/sbin/gssproxy -D
avahi 388 0.0 0.0 28120 1788 ? Ss 12:08 0:00 avahi-daemon: running [master-node-1.local]
avahi 400 0.0 0.0 27988 236 ? S 12:08 0:00 \_ avahi-daemon: chroot helper
root 393 0.0 0.3 251760 11788 ? Ssl 12:08 0:03 /usr/sbin/rsyslogd -n
root 395 0.0 0.2 534036 9676 ? Ssl 12:08 0:00 /usr/sbin/NetworkManager --no-daemon
root 824 0.0 0.4 110524 15864 ? S 12:08 0:00 \_ /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkMan
root 397 0.0 0.0 26400 1724 ? Ss 12:08 0:00 /usr/lib/systemd/systemd-logind
root 405 0.0 0.0 4340 548 ? Ss 12:08 0:00 /usr/sbin/acpid
root 568 0.0 0.0 45488 716 ? S<s 12:08 0:00 ovsdb-server: monitoring pid 569 (healthy)
root 569 0.0 0.0 45888 2384 ? S< 12:08 0:01 \_ ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openv
root 579 0.0 0.0 49152 804 ? S<s 12:08 0:00 ovs-vswitchd: monitoring pid 580 (healthy)
root 580 0.0 0.9 270996 35384 ? S<Ll 12:08 0:09 \_ ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdi
root 614 0.0 0.0 53064 2652 ? Ss 12:08 0:00 /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplicant.log -c /etc/wpa_supplicant/wpa_supplicant.conf -u -f /va
polkitd 615 0.0 0.3 527460 13044 ? Ssl 12:08 0:00 /usr/lib/polkit-1/polkitd --no-debug
root 617 0.0 0.5 133564 20948 ? S<Lsl 12:08 0:01 /usr/sbin/dmeventd -f
root 885 0.0 0.2 208400 10640 ? Ss 12:08 0:00 /usr/bin/python /usr/share/google/google_daemon/manage_clock_sync.py
root 888 0.0 0.2 213384 11272 ? Ss 12:08 0:00 /usr/bin/python /usr/share/google/google_daemon/manage_accounts.py
root 13817 0.0 0.2 215452 8920 ? S 14:43 0:00 \_ /usr/bin/python /usr/share/google/google_daemon/manage_accounts.py
root 895 2.6 8.0 885352 303944 ? Ssl 12:08 4:05 /usr/bin/openshift start master --config=/etc/origin/master/master-config.yaml --loglevel=2
root 899 0.0 0.4 553164 16508 ? Ssl 12:08 0:01 /usr/bin/python -Es /usr/sbin/tuned -l -P
root 900 0.0 0.2 206876 11100 ? Ss 12:08 0:00 /usr/bin/python /usr/share/google/google_daemon/manage_addresses.py
root 969 0.0 0.0 126332 1700 ? Ss 12:08 0:00 /usr/sbin/crond -n
ntp 975 0.0 0.0 29408 2064 ? Ss 12:08 0:00 /usr/sbin/ntpd -u ntp:ntp -g
root 977 0.0 0.0 25964 964 ? Ss 12:08 0:00 /usr/sbin/atd -f
root 1089 0.0 0.0 110036 840 tty1 Ss+ 12:08 0:00 /sbin/agetty --noclear tty1 linux
root 1091 0.0 0.0 110036 860 ttyS0 Ss+ 12:08 0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
root 1125 0.0 0.7 682248 27580 ? Ssl 12:08 0:06 /opt/stackdriver/collectd/sbin/stackdriver-collectd -C /opt/stackdriver/collectd/etc/collectd.conf -P /var/run/
root 1179 0.0 0.0 82560 3572 ? Ss 12:08 0:00 /usr/sbin/sshd -D
root 10149 0.0 0.1 140788 5076 ? Ss 13:59 0:00 \_ sshd: gabrielscheffer [priv]
centos 10153 0.0 0.0 140788 2368 ? S 13:59 0:00 \_ sshd: gabrielscheffer@pts/0
centos 10154 0.0 0.3 127184 13960 pts/0 Ss 13:59 0:00 \_ -bash
centos 13878 0.0 0.0 151168 1960 pts/0 R+ 14:44 0:00 \_ ps auxwf
root 1262 0.0 0.2 203796 7580 ? S 12:08 0:00 python /opt/stackdriver/extractor/bin/extractd.py -c /opt/stackdriver/extractor/etc/extractor.conf -P /var/run/
root 1319 0.0 0.0 91140 2160 ? Ss 12:08 0:00 /usr/libexec/postfix/master -w
postfix 1321 0.0 0.1 91420 3932 ? S 12:08 0:00 \_ qmgr -l -t unix -u
postfix 9229 0.0 0.1 91244 3916 ? S 13:46 0:00 \_ pickup -l -t unix -u
root 1325 0.9 1.6 728192 60956 ? Ssl 12:08 1:25 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
root 1539 0.0 0.0 115244 1432 ? Ss 12:08 0:00 /bin/sh -c /usr/bin/docker-current daemon $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWOR
root 1541 0.5 1.0 822044 38596 ? Sl 12:08 0:56 \_ /usr/bin/docker-current daemon --selinux-enabled --insecure-registry=172.30.0.0/16 --log-opt max-size=10M -
root 1799 0.0 0.0 2972 780 ? Ssl 12:08 0:00 | \_ /pod
root 1828 0.0 0.0 2972 744 ? Ssl 12:08 0:00 | \_ /pod
root 1921 0.0 0.7 360752 27712 ? Ssl 12:08 0:01 | \_ /usr/bin/openshift-router
root 13544 0.0 0.1 52864 6700 ? Ss 14:39 0:00 | | \_ /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -sf 1
1000000+ 1934 0.0 0.3 35268 13032 ? Ssl 12:08 0:00 | \_ /dockerregistry /config.yml
root 1542 0.0 0.0 101728 2016 ? Sl 12:08 0:02 \_ /usr/bin/forward-journald -tag docker
Is that it?
Failed to get pwuid struct: user: unknown userid 4294967295 is the bz listed above
@ncdc
level=error msg="Error reading loginuid: open /proc/11976/loginuid: no such file or directory"
root 11976 1.7 0.1 833564 46292 ? Ssl 19:51 0:01 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
@lvthillo does /proc/11976 exist?
@ncdc
yes but no loginuid:
$ ls
auxv cmdline coredump_filter cwd exe fdinfo io limits mem mounts net numa_maps oom_score personality root stat status task
cgroup comm cpuset environ fd gid_map ipaddr maps mountinfo mountstats ns oom_adj oom_score_adj projid_map setgroups statm syscall uid_map
@gabrielscheffer How did you perform your new install? (I performed a new git clone but same issue)
@lvthillo So, as described above
@lvthillo can you grep loginuid /etc/pam.d/*?
This is not an OpenShift issue. The kernel needs to be compiled with audit support enabled, and you may need to have pam_loginuid configured. Once you do have /proc/pid/loginuid files in place for the OpenShift processes, they will be MAXINT, and you'll end up with the Failed to get pwuid struct: user: unknown userid 4294967295 "error" until a newer version of Docker is available that doesn't have this log spam.
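For context, 4294967295 is simply (uint32)-1, i.e. 0xFFFFFFFF — the sentinel the kernel reports when a process's loginuid was never stamped (e.g. by pam_loginuid). A one-liner confirms the arithmetic:

```shell
# The "unknown userid" in the error is the 32-bit all-ones sentinel that the
# kernel returns for a loginuid that was never set.
printf 'unset-loginuid sentinel: %u\n' "$(( 0xFFFFFFFF ))"
# prints: unset-loginuid sentinel: 4294967295
```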
@ncdc this has nothing to do with pam.
This is NOT a CentOS kernel. This is NOT a RHEL kernel. The kernel was built without CONFIG_AUDIT and/or CONFIG_AUDIT_SYSCALL.
This is a bug in docker. It should support such kernels.
Please file a BZ against 'docker' at bugzilla.redhat.com. Tell them that they need to support kernels without CONFIG_AUDIT.
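A quick way to verify whether a kernel was built with audit support (a sketch; it assumes the distro ships the kernel build config in /boot, as CentOS and RHEL do):

```shell
#!/bin/sh
# has_audit CONFIG_FILE -> exits 0 when the kernel config enables CONFIG_AUDIT
has_audit() {
    grep -q '^CONFIG_AUDIT=y' "$1"
}

# Typical usage on CentOS/RHEL (the config file location is an assumption):
#   has_audit "/boot/config-$(uname -r)" && echo "audit: yes" || echo "audit: no"
```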
it's not against upstream docker though, the issue should be filed under projectatomic/docker since it's our patch causing this
@lvthillo ^^
@runcom okay, it was an older issue which I wanted to close.
Thank you all for the support =)
@runcom Any idea when this will be fixed?
Is it possible to start an OpenShift environment with the 'broken' docker, and after the fix remove docker and reinstall it (will it remain 1.9.1)? Also, are there functional problems with this docker version, or only the 'error' bug? Sorry for the big number of questions, but we're a bit stuck here.
@lvthillo Hello mate, based on this patch
I released my own docker-1.9.1-40.
For me it's fixed, you can dig up my changes at my repo https://packagecloud.io/gabrielscheffer/centos7
Steps to install fixed docker:
1. curl -s https://packagecloud.io/install/repositories/gabrielscheffer/centos7/script.rpm.sh | sudo bash
2. sudo yum install docker docker-common docker-forward-journald docker-logrotate docker-selinux
3. sudo systemctl restart docker
Cheers