The projects that are running CronJobs seem to eventually lose internal network connectivity.
i.e.:
PHP + MySQL setup
A CronJob runs every minute, executing a PHP script that connects to MySQL
By the end of the day, the PHP web server is unable to connect to the MySQL server (connection timeout)
A similar case is happening with Redis and NodeJS, so it doesn't appear to be technology-specific.
These are the same ScheduledJobs that were being run on 1.4 (exported and re-imported as CronJobs).
Even after cleaning up the completed jobs (jobs older than 2 hours are deleted) the issue still occurs.
Projects without CronJobs don't appear to be affected.
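For reference, the CronJobs look roughly like the sketch below. The name, image, and script path are placeholders (the real jobs were exported from 1.4 ScheduledJobs and re-imported), and it assumes the batch/v2alpha1 CronJob API available on this release:

```bash
# Hypothetical sketch of the every-minute CronJob described above.
oc create -f - <<'EOF'
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: db-healthcheck
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: healthcheck
            image: php:7.0-cli
            command: ["php", "/opt/app/health.php"]
EOF
```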
openshift v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4
Can you get the node logs please from the node where the failing pod is running?
The issue only happens after a day or two of running; I had suspended all CronJobs when it started bringing down their services.
I'm re-enabling some and will post back the logs when it happens again.
I was able to replicate it again. The issue doesn't seem to relate specifically to CronJobs, but it happens mostly on projects that are running them, most of the time right after a build succeeds and a new deployment is created.
Below are the errors:
May 13 09:23:05 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:23:05.875583 14242 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f1124b4b-3769-11e7-ba45-06579dce22e7-default-token-zdk92" (spec.Name: "default-token-zdk92") pod "f1124b4b-3769-11e7-ba45-06579dce22e7" (UID: "f1124b4b-3769-11e7-ba45-06579dce22e7").
May 13 09:23:05 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:23:05.875583 14242 operation_executor.go:917] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/f1124b4b-3769-11e7-ba45-06579dce22e7-default-token-zdk92" (spec.Name: "default-token-zdk92") pod "f1124b4b-3769-11e7-ba45-06579dce22e7" (UID: "f1124b4b-3769-11e7-ba45-06579dce22e7").
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127682]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth00735fc5
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device veth00735fc5 left promiscuous mode
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[127114]: DHCPDISCOVER on veth58a37e85 to 255.255.255.255 port 67 interval 11 (xid=0x268e1510)
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127701-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127701-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127712]: ovs|00001|db_ctl_base|ERR|no row "veth00735fc5" in table Port
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[127717]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth00735fc5
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127713-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: W0513 09:23:06.195544 14242 multitenant.go:156] refcounting error on vnid 14177148
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.195544 14242 multitenant.go:156] refcounting error on vnid 14177148
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127713-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal oci-systemd-hook[127730]: systemdhook <debug>: Skipping as container command is /pod, not init or systemd
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127732-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-127732-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-10): Unmounting Filesystem
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[127397]: receive_packet failed on veth00735fc5: Network is down
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3021] device (veth00735fc5): driver 'veth' does not support carrier detection.
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3027] device (veth00735fc5): state change: ip-config -> unmanaged (reason 'unmanaged') [70 10 3]
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3097] dhcp4 (veth00735fc5): canceled DHCP transaction, DHCP client pid 127397
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494631386.3098] dhcp4 (veth00735fc5): state changed unknown -> done
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: time="2017-05-13T09:23:06.373001459+10:00" level=error msg="Handler for POST /v1.24/containers/53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48/stop returned error: Container 53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 is already stopped"
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.373203 14242 docker_manager.go:1582] No ref for pod '"53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 gg/pi-1494631380-5286k"'
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: W0513 09:23:06.373203 14242 docker_manager.go:1582] No ref for pod '"53417d8e1b62ed7c3e9856c320ee0113d6eddda3ce3ea05fb28444676064bb48 gg/pi-1494631380-5286k"'
To clarify: running CronJobs on their own doesn't seem to break the project networking. It seems to happen mostly when a build finishes and a new deployment starts, but only on projects that are running CronJobs, so perhaps it's some sort of race condition when the two happen at the same time?
I have the full log dump of one of the nodes where this happened.
More relevant logs
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.0812] device (veth77b313c0): link connected
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.0827] manager: (veth77b313c0): new Veth device (/org/freedesktop/NetworkManager/Devices/4663)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1029] device (veth77b313c0): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1073] ifcfg-rh: add connection in-memory (87e3c85c-92b0-3cd4-bed2-a5d37ba0b0da,"Wired connection 3")
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1082] settings: (veth77b313c0): created default wired connection 'Wired connection 3'
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1099] device (veth77b313c0): state change: unavailable -> disconnected (reason 'none') [20 30 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1114] policy: auto-activating connection 'Wired connection 3'
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1133] device (veth77b313c0): Activation: starting connection 'Wired connection 3' (87e3c85c-92b0-3cd4-bed2-a5d37ba0b0da)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1135] device (veth77b313c0): state change: disconnected -> prepare (reason 'none') [30 40 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1144] device (veth77b313c0): state change: prepare -> config (reason 'none') [40 50 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1157] device (veth77b313c0): state change: config -> ip-config (reason 'none') [50 70 0]
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1162] dhcp4 (veth77b313c0): activation: beginning transaction (timeout in 45 seconds)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[101569]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 veth77b313c0
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device veth77b313c0 entered promiscuous mode
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1790] dhcp4 (veth77b313c0): dhclient started with pid 101571
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630086.1809] device (veth77b313c0): enslaved to non-master-type device ovs-system; ignoring
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[101571]: DHCPDISCOVER on veth77b313c0 to 255.255.255.255 port 67 interval 7 (xid=0x74555c58)
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:01:26.441869 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-worker-110-c8x7z_(ec154037-3766-11e7-ba45-06579dce22e7)": "10.131.10.130"
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:01:26.441869 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-worker-110-c8x7z_(ec154037-3766-11e7-ba45-06579dce22e7)": "10.131.10.130"
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Unmounting Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Unmounting Filesystem
May 13 09:01:26 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-27): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101641-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101641-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: nf_conntrack: falling back to vmalloc.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: nf_conntrack: falling back to vmalloc.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Started docker container 30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Starting docker container 30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal oci-systemd-hook[101678]: systemdhook <debug>: Skipping as container command is /pod, not init or systemd
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101679-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal systemd[1]: Scope libcontainer-101679-systemd-test-default-dependencies.scope has no PIDs. Refusing.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: E0513 09:01:27.284821 14242 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)" container "POD": symlink /var/log/containers/auth-139-ljc8r__POD-30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.log: no such file or directory
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: E0513 09:01:27.284821 14242 docker_manager.go:1741] Failed to create symbolic link to the log file of pod "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)" container "POD": symlink /var/log/containers/auth-139-ljc8r__POD-30e7d4718535645141a0bff09e42f80710fd88ede1a2f93adaedcda61cc4dd28.log: no such file or directory
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Unmounting Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: E0513 09:01:27.370479 14242 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'auth-139-ljc8r' - Unexpected command output Device "eth0" does not exist.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: with error: exit status 1
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: E0513 09:01:27.370479 14242 docker_manager.go:378] NetworkPlugin cni failed on the status hook for pod 'auth-139-ljc8r' - Unexpected command output Device "eth0" does not exist.
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: with error: exit status 1
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4540] manager: (vethf0daeab9): new Veth device (/org/freedesktop/NetworkManager/Devices/4664)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4562] device (vethf0daeab9): link connected
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4563] device (vethf0daeab9): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4600] ifcfg-rh: add connection in-memory (c9cf445a-4141-3e70-a65c-87a3588f6c49,"Wired connection 11")
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4608] settings: (vethf0daeab9): created default wired connection 'Wired connection 11'
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4637] device (vethf0daeab9): state change: unavailable -> disconnected (reason 'none') [20 30 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4677] policy: auto-activating connection 'Wired connection 11'
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4704] device (vethf0daeab9): Activation: starting connection 'Wired connection 11' (c9cf445a-4141-3e70-a65c-87a3588f6c49)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4706] device (vethf0daeab9): state change: disconnected -> prepare (reason 'none') [30 40 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4714] device (vethf0daeab9): state change: prepare -> config (reason 'none') [40 50 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal ovs-vsctl[101727]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vethf0daeab9
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4724] device (vethf0daeab9): state change: config -> ip-config (reason 'none') [50 70 0]
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.4727] dhcp4 (vethf0daeab9): activation: beginning transaction (timeout in 45 seconds)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: device vethf0daeab9 entered promiscuous mode
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.5142] dhcp4 (vethf0daeab9): dhclient started with pid 101729
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal NetworkManager[753]: <info> [1494630087.5176] device (vethf0daeab9): enslaved to non-master-type device ovs-system; ignoring
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dhclient[101729]: DHCPDISCOVER on vethf0daeab9 to 255.255.255.255 port 67 interval 5 (xid=0xe7661df)
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Mounting V5 Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Ending clean mount
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal kernel: XFS (dm-28): Unmounting Filesystem
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: I0513 09:01:27.921616 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)": "10.131.10.131"
May 13 09:01:27 ip-10-0-200-127.ap-southeast-2.compute.internal dockerd-current[6361]: I0513 09:01:27.921616 14242 docker_manager.go:2206] Determined pod ip after infra change: "auth-139-ljc8r_(ecc3537b-3766-11e7-ba45-06579dce22e7)": "10.131.10.131"
auth-worker-110-c8x7z and auth-139-ljc8r are two of the pods that recently came up and errored. Other pods came up without a problem.
I have been able to repeatedly replicate this issue on two different clusters. Steps:
1. Run php health.php every minute via a CronJob (this just pings the MySQL database); a sketch of an equivalent check is shown after this list.
2. Keep triggering builds. The issue seems to happen when a new deployment starts at the same time the CronJob executes; it happens every 2-3 builds.
3. Delete the CronJob and all completed Jobs and keep triggering builds. The issue no longer occurs.
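For illustration, a minimal shell equivalent of that check plus the repro loop. The service name, credentials, and app name are placeholders based on the cakephp example used later in this thread:

```bash
# Rough shell equivalent of health.php: just ping the MySQL service.
# Assumes the service is named "mysql" and credentials come from env vars.
mysqladmin ping -h mysql -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" || exit 1

# Repro loop run from a workstation: keep triggering builds (each successful
# build rolls out a new deployment) while the every-minute CronJob keeps firing.
while true; do
  oc start-build cakephp-mysql-example --follow
  sleep 120
done
```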
After further tests, it doesn't appear to be related to CronJobs specifically; it was just occurring more frequently in projects that had CronJobs.
I have experienced very similar issues after upgrading to 1.5. I don't have any CronJobs though, so I can confirm that it is probably unrelated to CronJobs.
Occasionally, when a Pod starts up, it does not have network connectivity. It happens about once in 5 newly created pods. Restarting (deleting) the Pod, or restarting the origin-node or openvswitch service on the node that the Pod is running on fixes the problem.
I have tried to investigate a bit further, and I noticed that in this setup:
(project A / pod1) <-> Node1 <-> Node2 <-> (project A / pod2)
If pod1 is experiencing this issue and I ping pod2 from pod1, the packets actually arrive at pod2, and it sends ICMP replies, which come back encapsulated in VXLAN to Node1, but they never show up on pod1's virtual interface. So the issue seemed to be with the VXLAN termination of incoming traffic to the pod. At this point I started looking at OpenFlow rules, but I couldn't find anything obviously missing, so I restarted origin-node on Node1, which resolved the issue. It's still an annoyance though, as it happens fairly frequently.
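For anyone trying to reproduce that observation, the captures involved look roughly like this. The interface names are examples (veth77b313c0 is taken from the logs above); substitute the node uplink and the broken pod's host-side veth:

```bash
# On Node1, confirm the encapsulated ICMP replies are arriving (VXLAN uses UDP 4789):
tcpdump -ni eth0 'udp port 4789'

# On Node1, watch pod1's host-side veth: in the broken state the replies never
# appear here even though they arrived through the VXLAN tunnel above.
tcpdump -ni veth77b313c0 icmp
```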
May 13 09:23:06 ip-10-0-200-127.ap-southeast-2.compute.internal origin-node[14164]: W0513 09:23:06.195544 14242 multitenant.go:156] refcounting error on vnid 14177148
That's the problem; something is going wrong in the networking code and causing it to drop OVS rules for a namespace when, apparently, the namespace is still in use.
Could you restart the node with --loglevel=5, run until the error shows up, and then post the logs (from just the atomic-openshift-node service) somewhere? (It will probably be way too large to post here but you could use a pastebin and post just a link, or you could email it to me (danw at redhat))
from just the atomic-openshift-node service
where by "atomic-openshift-node" I probably mean "origin-node" I guess...
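For anyone gathering the same data, the steps amount to roughly the following. The paths assume an RPM-based origin install; containerized nodes keep these options elsewhere:

```bash
# Bump the node log level to 5 (assumes OPTIONS lives in /etc/sysconfig/origin-node).
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/origin-node
systemctl restart origin-node

# Once the error shows up, capture just the node service's journal:
journalctl -u origin-node --since "2 hours ago" > origin-node.log
```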
I'll try to replicate this again on a test cluster so I don't need to update all the nodes to loglevel=5
@mbalazs curious if you are also running on AWS?
@andrewklau No, it's on a private OpenNebula installation. Some additional info that might help in reproducing:
@danwinship I am emailing the logs to you now
I was able to replicate this fairly quickly:
Continued to oc rollout on cakephp-ex. By deployment 6 it crashed.
Please note that we're having the exact same issue, and it's not related to CronJobs.
I hope we'll have a patch soon, otherwise we'll have to downgrade to 1.4 :(
thanks
Can you tell me if the broken pod has an IP address? Can it ping the node it is on? Can the node ping it? (I'm wondering if this is an ARP cache problem)
@knobunc In my case, the broken pod (php) has an IP address, and it can ping other nodes in the cluster. It can also ping the docker registry pod. And I can also ping the broken pod from the cluster nodes and from the registry pod.
However, it can not ping the other pod (mysql) in the same project, and vice-versa. The other pod can also ping all the nodes and the registry too. So the only thing that seems to be broken is the communication between these two pods.
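In case it is useful, the checks being discussed amount to something like the following. Pod names and IPs are placeholders, and ping has to be available in the images:

```bash
# Does the broken pod have an IP, and which node is it on?
oc get pod <php-pod> -o wide

# Node -> pod and pod -> pod pings:
ping -c3 <php-pod-ip>                          # from the node the pod runs on
oc rsh <php-pod> ping -c3 <mysql-pod-ip>       # pod -> the other pod in the project
oc rsh <mysql-pod> ping -c3 <php-pod-ip>       # and the reverse direction

# If an ARP cache problem is suspected, check the node's neighbour table:
ip neigh show | grep <php-pod-ip>
```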
@knobunc I have a 2-node test cluster running that I created to replicate the problem. It has hit the problem, if you would like access. It's also running at --loglevel=5.
@knobunc @danwinship are there any additional logs or access I can provide to further identify what might be the cause?
Can you ping pods from that pod that are on the same node?
What plugin are you using for networking?
It seems to only happen with multitenant plugin.
Yes: multitenant
Can you ping from the target pod back to the source pod? (i.e. reverse the broken ping)
Trying to check now; I was using a healthcheck to pick up the failure, but that would put the pod into a CrashLoopBackOff.
From a pod that was working I was able to ping a failed pod. I then tried to start up a debug container from the failed pod; either that, or the ping, managed to bring all the pods back online.
Trying to reliably confirm now.
Ok, to answer your question:
For the MySQL pod which has always been running:
It is unable to ping the newly deployed pod
The newly deployed pod is able to ping the MySQL pod
So to clarify the scenario:
PHP deployment 1
MySQL deployment 1
(Everything OK)
PHP deployment 2
MySQL deployment 1
(PHP can ping MySQL, but PHP can't query the MySQL server, and MySQL can't ping PHP)
Since posting this, PHP is no longer able to ping MySQL either
And again, if I start a debug container, all the pod inter-connectivity magically starts working again.
Power off the debug container and inter-connectivity dies again.
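For the record, the "debug container" here is just a standard oc debug copy; the deployment name below is an example, and this only describes the observed workaround, not why it works:

```bash
# Start a debug copy of the affected deployment; while it runs, pod-to-pod
# connectivity in the project appears to come back.
oc debug dc/cakephp-mysql-example
# ... connectivity works again while the debug pod exists ...
exit   # once the debug pod is removed, inter-connectivity dies again
```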
Please run the debug.sh referenced at https://docs.openshift.com/enterprise/3.2/admin_guide/sdn_troubleshooting.html#further-help
And put that somewhere I can get at, and tell me the pods (or ips) of a pair of things that are not working.
GitHub doesn't support the .tgz filetype, so it's renamed to .gz
openshift-sdn-debug-2017-05-24.gz
The two pods which have failed are 192.0.0.142 (cakephp-mysql-example-34-jzvq8) and 192.0.0.143 (cakephp-mysql-example-34-w5mcn)
192.1.0.10 (mysql-3-dhsl0) is the long running pod which has always been working.
All the pods on the host were in the NotReady state due to a healthcheck on database connectivity, hence the message:
No pods on x , so no connectivity tests
I updated the debug script to include NotReady pods and attached the output again.
openshift-sdn-debug-2017-05-25.gz
Any news on this? It's pretty critical to us.
Thanks
Unfortunately the logs captured there don't have the startup. I'm trying to reproduce it locally using the steps @andrewklau posted earlier.
@knobunc were you able to reproduce it, or do you need more accurate steps?
We have not managed it yet... I think it is the same as: https://bugzilla.redhat.com/show_bug.cgi?id=1454948 and in comment #8 you can see the steps @weliang1 tried (but the "Can reproduce" should read "Can't reproduce").
If you can suggest something else to try, or get us OpenShift logs at log level 5 or higher from the source and destination nodes, that might help get to the root of the problem.
Thanks!
@knobunc how many nodes does your test cluster have? I believe the issue has something to do with pods being started/stopped concurrently at the same time on the same host. Maybe try adding a node-selector to the project to limit it to one or two nodes if the test cluster is big. (The difference between a project-level node selector and a deployment-level node selector is that with the project-level one the deployer pod would also run on one of the two nodes.)
@knobunc I had sent logs to Dan a while back, here is the attached logs (log level 5)
restart.txt
journalctl -f > restart.log
systemctl restart openvswitch origin-node
oc rollout latest dc/cakephp-mysql-persistent
oc start-build cakephp-mysql-persistent
oc rollout latest dc/cakephp-mysql-persistent
Error occurred at around 22:04
I think @mbalazs is correct in assuming the error occurs when the pods start on the same host.
I was able to reliably replicate this in a 2 node cluster:
multitenant plugin, containerized
My testing env: AWS, multitenant plugin, containerized, one master, two nodes.
This time I reproduced the pod connectivity issue in my env by running a checking script instead of testing manually.
I had sent logs to Dan a while back, here is the attached logs
I've been working on a bunch of other things and didn't get a chance to look at these until today, but unfortunately the logs aren't useful. It looks like bumping up loglevel ended up triggering a lot of:
systemd-journal[340]: Suppressed 2252 messages from /system.slice/docker.service
So the information we wanted isn't there. That could be fixed by setting RateLimitIntervalSec=0 in /etc/systemd/journald.conf and then restarting systemd-journald.
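Concretely, that amounts to something like the following (on older systemd the key is spelled RateLimitInterval instead; check your journald.conf):

```bash
# Disable journald rate limiting so the loglevel=5 output isn't dropped.
echo 'RateLimitIntervalSec=0' >> /etc/systemd/journald.conf
systemctl restart systemd-journald
```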
Are we able to get the necessary logs from @weliang1's environment?
This is starting to happen more frequently. Were you able to get the logs from @weliang1's environment, or should I reproduce a cluster and gather logs again?
@weliang1 has the logs we need... we're trying to track down the race. Thanks!
Is the problem in openvswitch or the node? I'm wondering if it'd be possible to run a downgraded openvswitch v1.4.1 just so this stops causing app outages in the interim.
I'm wondering if it'd be possible to run a downgraded openvswitch v1.4.1 just so this stops causing app outages in the interim.
Someone can do the test? :)
Are you close to have a fix, or is it worth downgrading now?
Thanks
it's not in openvswitch
@andrewklau We have downgraded our nodes to 1.4.1 and the issue seems to be gone. All pods are starting correctly now. We're waiting a few days before declaring victory, but so far so good.
ps: We downgraded _only_ the nodes (origin-node), not ovs nor the master(s).
@gravis did you roll out new nodes or just switch the node container image?
@knobunc when/if(?) we get a fix, will origin get a minor version bump? Or would we need to wait for 3.6?
I just switched the node container image. We haven't had any network loss so far.
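For anyone wanting the same interim workaround on a containerized install, it is roughly the following. IMAGE_VERSION is an assumption about how this kind of install selects the node image tag; verify against your own /etc/sysconfig/origin-node:

```bash
# Pin the containerized node back to the v1.4.1 image, leaving OVS and the
# masters untouched.
sed -i 's/^IMAGE_VERSION=.*/IMAGE_VERSION=v1.4.1/' /etc/sysconfig/origin-node
systemctl restart origin-node
```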
I'm not sure if this is related, but I noticed this pop up in one of the events for a 1.5 node:
"SetupNetwork" for "wordpress-5-deploy_retirement" with SetupNetworkError: "Failed to setup network for pod \"wordpress-5-deploy_retirement(ecaeb6ee-4ead-11e7-8646-06579dce22e7)\" using network plugins \"cni\": CNI request failed with status 400: 'error running network setup script:\nhostVethName vethab45e589, contVethMac 6a:35:8d:9a:31:80, podIP 10.131.10.17, podConfig &plugin.PodConfig{vnid:0xe8a77f, ingressBandwidth:\"\", egressBandwidth:\"\", wantMacvlan:false}\n + ovs-ofctl -O OpenFlow13 add-flow br0 'table=2, priority=100, in_port=-1, arp, nw_src=10.131.10.17, arp_sha=6a:35:8d:9a:31:80, actions=load:15247231->NXM_NX_REG0[], goto_table:5'\n2017-06-11T13:57:38Z|00001|ofp_util|WARN|Negative value -1 is not a valid port number.\novs-ofctl: -1: port value out of range for in_port\n\n'; Skipping pod"
Is this going to be backported to 1.5?
Backporting this to 1.5 would be great!
Any news regarding a backport? This issue is still blocking our nodes from upgrading to 1.5 :(
Same here.
Btw, is anybody building non-official packages/images?
The fix was merged in https://github.com/openshift/origin/issues/14801
yeah, the fix is on the release-1.5 branch but AFAIK we don't do regular point releases of origin
There was already a point release for origin, so why not do another? With this issue fixed, it would be great to have another point release, no?
Hi guys, it would be really nice to have a backport of this in the 1.5 branch.
This bug justifies a point release all by itself.
Thanks
We've been waiting for a patch version for almost a month now, anyone?
I'm pretty sure there won't be any point release for 1.5. But it's very easy to build a release on your own: https://github.com/openshift/origin/blob/release-1.5/HACKING.md -> just use the release-1.5 branch.
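For anyone who does want to build their own, it is roughly the following outline; the exact targets and prerequisites are described in HACKING.md:

```bash
# Build origin from the release-1.5 branch, which contains the fix referenced above.
git clone https://github.com/openshift/origin.git
cd origin
git checkout release-1.5
make release   # builds the binaries and local container images
```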
Yes, we can build our own images, but if we can't use official images anymore, that pretty much means the end of openshift for us :(
but if we can't use official images anymore, that pretty much means the end of openshift for us :(
Why? Origin is not formally supported by Red Hat anyway. Literally the only difference between "official" images and images you built yourself is who typed "make release".
Yes, but some of us trust CentOS (which provides the origin packages), not some randomly built packages.