The projects that are running CronJobs seem to eventually lose internal network connectivity.
i.e.:
PHP + MySQL setup
A CronJob runs every minute, executing a PHP script that connects to MySQL
By the end of the day, the PHP web server is unable to connect to the MySQL server (connection timeout)
A similar case is happening with Redis and NodeJS, so it doesn't appear to be technology-specific.
These are the same ScheduledJobs that were being run on 1.4 (exported and re-imported as CronJobs).
Even after cleaning up the completed jobs (jobs older than 2 hours are deleted) the issue still occurs.
Projects without CronJobs don't appear to be affected.
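For reference, the CronJobs look roughly like the sketch below. The name, image, and script path are placeholders (the real jobs were exported from 1.4 ScheduledJobs and re-imported), and it assumes the batch/v2alpha1 CronJob API available on this release:

```bash
# Hypothetical sketch of the every-minute CronJob described above.
oc create -f - <<'EOF'
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: db-healthcheck
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: healthcheck
            image: php:7.0-cli
            command: ["php", "/opt/app/health.php"]
EOF
```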
openshift v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4
Can you get the node logs please from the node where the failing pod is running?
The issue only happens after a day or two of running; I had suspended all CronJobs when it started bringing down their services.
I'm re-enabling some and will post back the logs when it happens again.
I was able to replicate it again. The issue doesn't seem to relate specifically to CronJobs, but it happens mostly on projects that are running them, most of the time right after a build succeeds and a new deployment is created.
Below are the errors:
May 13 09:23:05 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:23:05.875583 14242 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f1124b4b-3769-11e7-ba45-06579dce22e7-default-token-zdk92" (spec.Name: "default-token-zdk92") pod "f1124b4b-3769-11e7-ba45-06579dce22e7" (UID: "f1124b4b-3769-11e7-ba45-06579dce22e7").
May 13 09:23:05 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:23:05.875583 14242 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f1124b4b-3769-11e7-ba45-06579dce22e7-default-token-zdk92" (spec.Name: "default-token-zdk92") pod "f1124b4b-3769-11e7-ba45-06579dce22e7" (UID: "f1124b4b-3769-11e7-ba45-06579dce22e7").
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127682]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth00735fc5
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device veth00735fc5 left promiscuous mode
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[127114]: DHCPDISCOVER on veth58a37e85 to 255.255.255.255 port 67 interval 11 (xid=0x268e1510)
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127701-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127701-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127712]: ovs|00001|db_ctl_base|ERR|no row "veth00735fc5" in table Port
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127717]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth00735fc5
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127713-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: W0513 09:23:06.195544 14242 multitenant.go:156] refcounting error on vnid 14177148
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.195544 14242 multitenant.go:156] refcounting error on vnid 14177148
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127713-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal oci-systemd-hook[127730]: systemdhook <debug>: Skipping as container command is /pod, not init or systemd
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127732-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127732-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-10): Unmounting Filesystem
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[127397]: receive_packet failed on veth00735fc5: Network is down
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3021] device (veth00735fc5): driver 'veth' does not support carrier detection.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3027] device (veth00735fc5): state change: ip-config -> unmanaged (reason 'unmanaged') [70 10 3]
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3097] dhcp4 (veth00735fc5): canceled DHCP transaction, DHCP client pid 127397
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3098] dhcp4 (veth00735fc5): state changed unknown -> done
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: time="2017-05-13T09:23:06.373001459+10:00" level=error msg="Handler for POST /v1.24/containers/53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48/stop returned error: Container 53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 is already stopped"
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.373203 14242 docker_manager.go:1582] No ref for pod '"53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 gg/pi-1494631380-5286k"'
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: W0513 09:23:06.373203 14242 docker_manager.go:1582] No ref for pod '"53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 gg/pi-1494631380-5286k"'
To clarify: running CronJobs on their own doesn't seem to break the project networking. It seems to happen mostly when a build finishes and a new deployment starts, but only on projects that are running CronJobs, so perhaps it's some sort of race condition when the two happen at the same time?
I have the full log dump of one of the nodes where this happened.
More relevant logs
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.0812] device (veth77b313c0): link connected
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.0827] manager: (veth77b313c0): new Veth device (/org/freedesktop/NetworkManager/Devices/4663)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1029] device (veth77b313c0): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1073] ifcfg-rh: add connection in-memory (87e3c85c-92b0-3cd4-bed2-a5d37ba0b0da,"Wired connection 3")
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1082] settings: (veth77b313c0): created default wired connection 'Wired connection 3'
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1099] device (veth77b313c0): state change: unavailable -> disconnected (reason 'none') [20 30 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1114] policy: auto-activating connection 'Wired connection 3'
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1133] device (veth77b313c0): Activation: starting connection 'Wired connection 3' (87e3c85c-92b0-3cd4-bed2-a5d37ba0b0da)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1135] device (veth77b313c0): state change: disconnected -> prepare (reason 'none') [30 40 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1144] device (veth77b313c0): state change: prepare -> config (reason 'none') [40 50 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1157] device (veth77b313c0): state change: config -> ip-config (reason 'none') [50 70 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1162] dhcp4 (veth77b313c0): activation: beginning transaction (timeout in 45 seconds)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[101569]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 veth77b313c0
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device veth77b313c0 entered promiscuous mode
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1790] dhcp4 (veth77b313c0): dhclient started with pid 101571
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1809] device (veth77b313c0): enslaved to non-master-type device ovs-system; ignoring
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[101571]: DHCPDISCOVER on veth77b313c0 to 255.255.255.255 port 67 interval 7 (xid=0x74555c58)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:01:26.441869 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-worker-110-c8x7z_(ec154037-3766-11e7-ba45-06579dce22e7)": "10.131.10.130"
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:01:26.441869 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-worker-110-c8x7z_(ec154037-3766-11e7-ba45-06579dce22e7)": "10.131.10.130"
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Unmounting Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Unmounting Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101641-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101641-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: nf_conntrack: falling back to vmalloc.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: nf_conntrack: falling back to vmalloc.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Started docker container 30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Starting docker container 30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal oci-systemd-hook[101678]: systemdhook <debug>: Skipping as container command is /pod, not init or systemd
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101679-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101679-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: E0513 09:01:27.284821 14242 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)" container "POD": symlink /var/log/containers/auth-139-ljc8r__POD-30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.log: no such file or directory
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: E0513 09:01:27.284821 14242 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)" container "POD": symlink /var/log/containers/auth-139-ljc8r__POD-30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.log: no such file or directory
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Unmounting Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: E0513 09:01:27.370479 14242 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'auth-139-ljc8r' - Unexpected command output Device "eth0" does not exist.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: with error: exit status 1
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: E0513 09:01:27.370479 14242 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'auth-139-ljc8r' - Unexpected command output Device "eth0" does not exist.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: with error: exit status 1
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4540] manager: (vethf0daeab9): new Veth device (/org/freedesktop/NetworkManager/Devices/4664)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4562] device (vethf0daeab9): link connected
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4563] device (vethf0daeab9): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4600] ifcfg-rh: add connection in-memory (c9cf445a-4141-3e70-a65c-87a3588f6c49,"Wired connection 11")
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4608] settings: (vethf0daeab9): created default wired connection 'Wired connection 11'
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4637] device (vethf0daeab9): state change: unavailable -> disconnected (reason 'none') [20 30 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4677] policy: auto-activating connection 'Wired connection 11'
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4704] device (vethf0daeab9): Activation: starting connection 'Wired connection 11' (c9cf445a-4141-3e70-a65c-87a3588f6c49)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4706] device (vethf0daeab9): state change: disconnected -> prepare (reason 'none') [30 40 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4714] device (vethf0daeab9): state change: prepare -> config (reason 'none') [40 50 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[101727]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vethf0daeab9
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4724] device (vethf0daeab9): state change: config -> ip-config (reason 'none') [50 70 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4727] dhcp4 (vethf0daeab9): activation: beginning transaction (timeout in 45 seconds)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device vethf0daeab9 entered promiscuous mode
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.5142] dhcp4 (vethf0daeab9): dhclient started with pid 101729
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.5176] device (vethf0daeab9): enslaved to non-master-type device ovs-system; ignoring
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[101729]: DHCPDISCOVER on vethf0daeab9 to 255.255.255.255 port 67 interval 5 (xid=0xe7661df)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Unmounting Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:01:27.921616 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)": "10.131.10.131"
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:01:27.921616 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)": "10.131.10.131"
auth-worker-110-c8x7z and auth-139-ljc8r are two of the pods that recently came up and errored. Other pods came up without a problem.
I have been able to repeatedly replicate this issue on two different clusters. Steps:
1. Run php health.php every minute via a CronJob (this just pings the MySQL database); a sketch of an equivalent check is shown after this list.
2. Keep triggering builds. The issue seems to happen when a new deployment starts at the same time the CronJob executes; it happens every 2-3 builds.
3. Delete the CronJob and all completed Jobs and keep triggering builds. The issue no longer occurs.
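For illustration, a minimal shell equivalent of that check plus the repro loop. The service name, credentials, and app name are placeholders based on the cakephp example used later in this thread:

```bash
# Rough shell equivalent of health.php: just ping the MySQL service.
# Assumes the service is named "mysql" and credentials come from env vars.
mysqladmin ping -h mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" || exit 1

# Repro loop run from a workstation: keep triggering builds (each successful
# build rolls out a new deployment) while the every-minute CronJob keeps firing.
while true; do
  oc start-build cakephp-mysql-example --follow
  sleep 120
done
```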
After further tests, it doesn't appear to be related to CronJobs specifically; it was just occurring more frequently in projects that had CronJobs.
I have experienced very similar issues after upgrading to 1.5. I don't have any CronJobs though, so I can confirm that it is probably unrelated to CronJobs.
Occasionally, when a Pod starts up, it does not have network connectivity. It happens about once in 5 newly created pods. Restarting (deleting) the Pod, or restarting the origin-node or openvswitch service on the node that the Pod is running on fixes the problem.
I have tried to investigate a bit further, and I noticed that in this setup:
(project A / pod1) <-> Node1 <-> Node2 <-> (project A / pod2)
If pod1 is experiencing this issue and I ping pod2 from pod1, the packets actually arrive at pod2, and it sends ICMP replies, which come back encapsulated in VXLAN to Node1, but they never show up on pod1's virtual interface. So the issue seemed to be with the VXLAN termination of incoming traffic to the pod. At this point I started looking at OpenFlow rules, but I couldn't find anything obviously missing, so I restarted origin-node on Node1, which resolved the issue. It's still an annoyance though, as it happens fairly frequently.
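For anyone trying to reproduce that observation, the captures involved look roughly like this. The interface names are examples (veth77b313c0 is taken from the logs above); substitute the node uplink and the broken pod's host-side veth:

```bash
# On Node1, confirm the encapsulated ICMP replies are arriving (VXLAN uses UDP 4789):
tcpdump -ni eth0 'udp port 4789'

# On Node1, watch pod1's host-side veth: in the broken state the replies never
# appear here even though they arrived through the VXLAN tunnel above.
tcpdump -ni veth77b313c0 icmp
```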
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.195544 14242 multitenant.go:156] refcounting error on vnid 14177148
That's the problem; something is going wrong in the networking code and causing it to drop OVS rules for a namespace when, apparently, the namespace is still in use.
Could you restart the node with --loglevel=5, run until the error shows up, and then post the logs (from just the atomic-openshift-node service) somewhere? (It will probably be way too large to post here but you could use a pastebin and post just a link, or you could email it to me (danw at redhat))
from just the atomic-openshift-node service
where by "atomic-openshift-node" I probably mean "origin-node" I guess...
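For anyone gathering the same data, the steps amount to roughly the following. The paths assume an RPM-based origin install; containerized nodes keep these options elsewhere:

```bash
# Bump the node log level to 5 (assumes OPTIONS lives in /etc/sysconfig/origin-node).
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/origin-node
systemctl restart origin-node

# Once the error shows up, capture just the node service's journal:
journalctl -u origin-node --since "2 hours ago" > origin-node.log
```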
I'll try to replicate this again on a test cluster so I don't need to update all the nodes to loglevel=5
@mbalazs curious if you are also running on AWS?
@andrewklau No, it's on a private OpenNebula installation. Some additional info that might help in reproducing:
@danwinship I am emailing the logs to you now
I was able to replicate this fairly quickly:
Continued to oc rollout on cakephp-ex. By deployment 6 it crashed.
Please note that we're having the exact same issue, and it's not related to CronJobs.
I hope we'll have a patch soon, otherwise we'll have to downgrade to 1.4 :(
thanks
Can you tell me if the broken pod has an IP address? Can it ping the node it is on? Can the node ping it? (I'm wondering if this is an ARP cache problem)
@knobunc In my case, the broken pod (php) has an IP address, and it can ping other nodes in the cluster. It can also ping the docker registry pod. And I can also ping the broken pod from the cluster nodes and from the registry pod.
However, it can not ping the other pod (mysql) in the same project, and vice-versa. The other pod can also ping all the nodes and the registry too. So the only thing that seems to be broken is the communication between these two pods.
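In case it is useful, the checks being discussed amount to something like the following. Pod names and IPs are placeholders, and ping has to be available in the images:

```bash
# Does the broken pod have an IP, and which node is it on?
oc get pod <php-pod> -o wide

# Node -> pod and pod -> pod pings:
ping -c3 <php-pod-ip>                          # from the node the pod runs on
oc rsh <php-pod> ping -c3 <mysql-pod-ip>       # pod -> the other pod in the project
oc rsh <mysql-pod> ping -c3 <php-pod-ip>       # and the reverse direction

# If an ARP cache problem is suspected, check the node's neighbour table:
ip neigh show | grep <php-pod-ip>
```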
@knobunc I have a 2-node test cluster running that I created to replicate the problem. It has hit the problem, if you would like access. It's also running at --loglevel=5.
@knobunc @danwinship are there any additional logs or access I can provide to further identify what might be the cause?
Can you ping pods from that pod that are on the same node?
What plugin are you using for networking?
It seems to only happen with multitenant plugin.
Yes: multitenant
Can you ping from the target pod back to the source pod? (i.e. reverse the broken ping)
Trying to check now; I was using a healthcheck to pick up the failure, but that would put the pod into a CrashLoopBackOff.
From a pod that was working I was able to ping a failed pod. I then tried to start up a debug container from the failed pod; either that, or the ping, managed to bring all the pods back online.
Trying to reliably confirm now.
Ok, to answer your question:
For the MySQL pod which has always been running:
It is unable to ping the newly deployed pod
The newly deployed pod is able to ping the MySQL pod
So to clarify the scenario:
PHP deployment 1
MySQL deployment 1
(Everything OK)
PHP deployment 2
MySQL deployment 1
(PHP can ping MySQL, but PHP can't query the MySQL server, and MySQL can't ping PHP)
Since posting this, PHP is no longer able to ping MySQL either
And again, if I start a debug container, all the pod inter-connectivity magically starts working again.
Power off the debug container and inter-connectivity dies again.
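For the record, the "debug container" here is just a standard oc debug copy; the deployment name below is an example, and this only describes the observed workaround, not why it works:

```bash
# Start a debug copy of the affected deployment; while it runs, pod-to-pod
# connectivity in the project appears to come back.
oc debug dc/cakephp-mysql-example
# ... connectivity works again while the debug pod exists ...
exit   # once the debug pod is removed, inter-connectivity dies again
```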
Please run the debug.sh referenced at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help
And put that somewhere I can get at, and tell me the pods (or ips) of a pair of things that are not working.
GitHub doesn't support the .tgz filetype, so it's renamed to .gz
openshift-sdn-debug-2017-05-24.gz
The two pods which have failed are 192.0.0.142 (cakephp-mysql-example-34-jzvq8) and 192.0.0.143 (cakephp-mysql-example-34-w5mcn)
192.1.0.10 (mysql-3-dhsl0) is the long running pod which has always been working.
All the pods on the host were in the NotReady state due to a healthcheck on database connectivity, hence the message:
No pods on x , so no connectivity tests
I updated the debug script to include NotReady pods and attached the output again.
openshift-sdn-debug-2017-05-25.gz
Any news on this? It's pretty critical to us.
Thanks
Unfortunately the logs captured there don't have the startup. I'm trying to reproduce it locally using the steps @andrewklau posted earlier.
@knobunc were you able to reproduce it, or do you need more accurate steps?
We have not managed it yet... I think it is the same as: https://bugzilla.redhat.com/show_bug.cgi?id=1454948 and in comment #8 you can see the steps @weliang1 tried (but the "Can reproduce" should read "Can't reproduce").
If you can suggest something else to try, or get us OpenShift logs at log level 5 or higher from the source and destination nodes, that might help get to the root of the problem.
Thanks!
@knobunc how many nodes does your test cluster have? I believe the issue has something to do with pods being started/stopped concurrently at the same time on the same host. Maybe try adding a node-selector to the project to limit it to one or two nodes if the test cluster is big. (The difference between a project-level node selector and a deployment-level node selector is that with the project-level one the deployer pod would also run on one of the two nodes.)
@knobunc I had sent logs to Dan a while back, here is the attached logs (log level 5)
restart.txt
journalctl -f > restart.log
systemctl restart openvswitch origin-node
oc rollout latest dc/cakephp-mysql-persistent
oc start-build cakephp-mysql-persistent
oc rollout latest dc/cakephp-mysql-persistent
Error occurred at around 22:04
I think @mbalazs is correct in assuming the error occurs when the pods start on the same host.
I was able to reliably replicate this in a 2 node cluster:
multitenant plugin, containerized
My testing env: AWS, multitenant plugin, containerized, one master, two nodes.
This time I reproduced the pod connectivity issue in my env by running a checking script instead of testing manually.
I had sent logs to Dan a while back, here is the attached logs
I've been working on a bunch of other things and didn't get a chance to look at these until today, but unfortunately the logs aren't useful. It looks like bumping up loglevel ended up triggering a lot of:
systemd-journal[340]: Suppressed 2252 messages from /system.slice/docker.service
So the information we wanted isn't there. That could be fixed by setting RateLimitIntervalSec=0 in /etc/systemd/journald.conf and then restarting systemd-journald.
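Concretely, that amounts to something like the following (on older systemd the key is spelled RateLimitInterval instead; check your journald.conf):

```bash
# Disable journald rate limiting so the loglevel=5 output isn't dropped.
echo 'RateLimitIntervalSec=0' >> /etc/systemd/journald.conf
systemctl restart systemd-journald
```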
Are we able to get the necessary logs from @weliang1's environment?
This is starting to happen more frequently. Were you able to get the logs from @weliang1's environment, or should I reproduce a cluster and gather logs again?
@weliang1 has the logs we need... we're trying to track down the race. Thanks!
Is the problem in openvswitch or the node? I'm wondering if it'd be possible to run a downgraded openvswitch v1.4.1 just so this stops causing app outages in the interim.
I'm wondering if it'd be possible to run a downgraded openvswitch v1.4.1 just so this stops causing app outages in the interim.
Someone can do the test? :)
Are you close to have a fix, or is it worth downgrading now?
Thanks
it's not in openvswitch
@andrewklau We have downgraded our nodes to 1.4.1 and the issue seems to be gone. All pods are starting correctly now. We're waiting a few days before declaring victory, but so far so good.
ps: We downgraded _only_ the nodes (origin-node), not ovs nor the master(s).
@gravis did you roll out new nodes or just switch the node container image?
@knobunc when/if(?) we get a fix, will origin get a minor version bump? Or would we need to wait for 3.6?
I just switched the node container image. We haven't had any network loss so far.
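For anyone wanting the same interim workaround on a containerized install, it is roughly the following. IMAGE_VERSION is an assumption about how this kind of install selects the node image tag; verify against your own /etc/sysconfig/origin-node:

```bash
# Pin the containerized node back to the v1.4.1 image, leaving OVS and the
# masters untouched.
sed -i 's/^IMAGE_VERSION=.*/IMAGE_VERSION=v1.4.1/' /etc/sysconfig/origin-node
systemctl restart origin-node
```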
I'm not sure if this is related, but I noticed this pop up in one of the events for a 1.5 node:
"SetupNetwork" for "wordpress-5-deploy_retirement" with SetupNetworkError: "Failed to setup network for pod \"wordpress-5-deploy_retirement(ecaeb6ee-4ead-11e7-8646-06579dce22e7)\" using network plugins \"cni\": CNI request failed with status 400: 'error running network setup script:\nhostVethName vethab45e589, contVethMac 6a:35:8d:9a:31:80, podIP 10.131.10.17, podConfig &plugin.PodConfig{vnid:0xe8a77f, ingressBandwidth:\"\", egressBandwidth:\"\", wantMacvlan:false}\n + ovs-ofctl -O OpenFlow13 add-flow br0 'table=2, priority=100, in_port=-1, arp, nw_src=10.131.10.17, arp_sha=6a:35:8d:9a:31:80, actions=load:15247231->NXM_NX_REG0[], goto_table:5'\n2017-06-11T13:57:38Z|00001|ofp_util|WARN|Negative value -1 is not a valid port number.\novs-ofctl: -1: port value out of range for in_port\n\n'; Skipping pod"
Is this going to be backported to 1.5?
Backporting this to 1.5 would be great!
Any news regarding a backport? This issue is still blocking our nodes from upgrading to 1.5 :(
Same here.
Btw, is anybody building non-official packages/images?
The fix was merged in https://github.com/openshift/origin/issues/14801
yeah, the fix is on the release-1.5 branch but AFAIK we don't do regular point releases of origin
There was already a point release for origin, so why not do another? With this issue fixed, it would be great to have another point release, no?
Hi guys, it would be really nice to have a backport of this in the 1.5 branch.
This bug justifies a point release all by itself.
Thanks
We've been waiting for a patch version for almost a month now, anyone?
I'm pretty sure there won't be any point release for 1.5. But it's very easy to build a release on your own: https://github.com/openshift/origin/blob/release-1.5/HACKING.md -> just use the release-1.5 branch.
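For anyone who does want to build their own, it is roughly the following outline; the exact targets and prerequisites are described in HACKING.md:

```bash
# Build origin from the release-1.5 branch, which contains the fix referenced above.
git clone https://github.com/openshift/origin.git
cd origin
git checkout release-1.5
make release   # builds the binaries and local container images
```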
Yes, we can build our own images, but if we can't use official images anymore, that pretty much means the end of openshift for us :(
but if we can't use official images anymore, that pretty much means the end of openshift for us :(
Why? Origin is not formally supported by Red Hat anyway. Literally the only difference between "official" images and images you built yourself is who typed "make release".
Yes, but some of us trust CentOS (which provides the origin packages), not some randomly built packages.