My OpenShift installation fails when AWS is configured as the cloud provider: the node service fails to start on each node. If I don't configure any cloud provider, the installation appears to succeed.
I am using the RPM installation.
- ansible 2.3.2.0
- oc v3.6.0+c4dd4cf
- kubernetes v1.6.1+5115d708d7
- features: Basic-Auth GSSAPI Kerberos SPNEGO
Expected result: the node service starts successfully and the output of oc get nodes shows the nodes as Ready rather than NotReady.
fatal: [osnode04.bdteam.local]: FAILED! => {
    "attempts": 3,
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "daemon_reload": false,
            "enabled": null,
            "masked": null,
            "name": "origin-node",
            "no_block": false,
            "state": "restarted",
            "user": false
        }
    },
    "msg": "Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See \"systemctl status origin-node.service\" and \"journalctl -xe\" for details.\n"
}
RUNNING HANDLER [openshift_node : reload systemd units] ************************************************************************************************************************************************************************************************************************
META: ran handlers
to retry, use: --limit @/root/openshift-ansible/playbooks/byo/config.retry
PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************
localhost : ok=21 changed=0 unreachable=0 failed=0
openshift-etcd.bdteam.local : ok=97 changed=34 unreachable=0 failed=0
osmaster01.bdteam.local : ok=365 changed=110 unreachable=0 failed=1
osmaster02.bdteam.local : ok=314 changed=95 unreachable=0 failed=1
osnode01.bdteam.local : ok=146 changed=38 unreachable=0 failed=1
osnode02.bdteam.local : ok=146 changed=38 unreachable=0 failed=1
osnode03.bdteam.local : ok=146 changed=38 unreachable=0 failed=1
osnode04.bdteam.local : ok=146 changed=38 unreachable=0 failed=1
INSTALLER STATUS ***************************************************************************************************************************************************************************************************************************************************************
Initialization : Complete
etcd Install : Complete
NFS Install : Not Started
Load balancer Install : Not Started
Master Install : Complete
Master Additional Install : Complete
Node Install : In Progress
This phase can be restarted by running: playbooks/byo/openshift-node/config.yml
GlusterFS Install : Not Started
Hosted Install : Not Started
Metrics Install : Not Started
Logging Install : Not Started
Service Catalog Install : Not Started
Failure summary:
1. Hosts: osmaster01.bdteam.local, osmaster02.bdteam.local, osnode01.bdteam.local, osnode02.bdteam.local, osnode03.bdteam.local, osnode04.bdteam.local
Play: Configure nodes
Task: restart node
Message: Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details.
[root@osmaster01 ~]# packet_write_wait: Connection to 10.X.X.X port 22: Broken pipe
The node service fails to restart on every node and master.
[root@osmaster01 centos]# oc get nodes
NAME STATUS AGE VERSION
ip-10-30-1-200.us-west-1.compute.internal NotReady 2h v1.6.1+5115d708d7
ip-10-30-1-27.us-west-1.compute.internal NotReady 2h v1.6.1+5115d708d7
ip-10-30-1-43.us-west-1.compute.internal NotReady 2h v1.6.1+5115d708d7
ip-10-30-2-109.us-west-1.compute.internal NotReady 2h v1.6.1+5115d708d7
ip-10-30-2-182.us-west-1.compute.internal NotReady 2h v1.6.1+5115d708d7
ip-10-30-2-251.us-west-1.compute.internal NotReady 2h v1.6.1+5115d708d7
Kubectl describe node output
[root@osmaster01 centos]# kubectl describe node ip-10-30-2-251.us-west-1.compute.internal
Name: ip-10-30-2-251.us-west-1.compute.internal
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m4.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-west-1
failure-domain.beta.kubernetes.io/zone=us-west-1a
kubernetes.io/hostname=osnode04.bdteam.local
region=primary
zone=west
Annotations: volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Fri, 06 Oct 2017 18:10:56 +0000
Phase:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 06 Oct 2017 20:37:50 +0000 Fri, 06 Oct 2017 18:10:56 +0000 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 06 Oct 2017 20:37:50 +0000 Fri, 06 Oct 2017 18:10:56 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 06 Oct 2017 20:37:50 +0000 Fri, 06 Oct 2017 18:10:56 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready False Fri, 06 Oct 2017 20:37:50 +0000 Fri, 06 Oct 2017 18:10:56 +0000 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Addresses: 10.30.2.251,10.30.2.251,ip-10-30-2-251.bdteam.local,osnode04.bdteam.local
Capacity:
cpu: 4
memory: 16266720Ki
pods: 40
Allocatable:
cpu: 4
memory: 16164320Ki
pods: 40
System Info:
Machine ID: 8bd05758fdfc1903174c9fcaf82b71ca
System UUID: EC2798A7-3C88-0538-2A95-D28F2BCCDF96
Boot ID: 5d7f71a8-95f8-4ed6-a7ba-07977e2dc926
Kernel Version: 3.10.0-693.2.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.6.1+5115d708d7
Kube-Proxy Version: v1.6.1+5115d708d7
ExternalID: i-08ae279780695c5f7
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
0 (0%) 0 (0%) 0 (0%) 0 (0%)
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 1h 1 kubelet, ip-10-30-2-251.us-west-1.compute.internal Warning ImageGCFailed unable to find data for container /
1h 1h 1 kubelet, ip-10-30-2-251.us-west-1.compute.internal Normal NodeHasSufficientDisk Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasSufficientDisk
1h 1h 1 kubelet, ip-10-30-2-251.us-west-1.compute.internal Normal NodeHasSufficientMemory Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasSufficientMemory
1h 1h 1 kubelet, ip-10-30-2-251.us-west-1.compute.internal Normal NodeHasNoDiskPressure Node ip-10-30-2-251.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
1h 1h 1 kubelet, ip-10-30-2-251.us-west-1.compute.internal Normal Starting Starting kubelet.
If I don't use any cloud provider in my Ansible config.yml file, the installation works fine, but I need to resolve this for AWS (or any other cloud provider).
systemctl output for the node service on one particular node:
[root@osnode01 centos]# systemctl status origin-node.service
● origin-node.service - OpenShift Node
Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/origin-node.service.d
└─openshift-sdn-ovs.conf
Active: activating (start) since Fri 2017-10-06 20:39:54 UTC; 8s ago
Docs: https://github.com/openshift/origin
Process: 56362 ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string: (code=exited, status=0/SUCCESS)
Process: 56360 ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf (code=exited, status=0/SUCCESS)
Process: 56368 ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1 (code=exited, status=0/SUCCESS)
Process: 56365 ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/ (code=exited, status=0/SUCCESS)
Main PID: 56370 (openshift)
Memory: 42.2M
CGroup: /system.slice/origin-node.service
├─56370 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
└─56415 journalctl -k -f
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546363 56370 pod_container_deletor.go:77] Container "2f4c53551f7b6e654cc1de1159d44856f81b6d16f4ed5d1eb580c9cb3a9bc575" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546425 56370 pod_container_deletor.go:77] Container "851f6503d78acd135e3a4b87009d4163a808856f14757f6123c1cf625123504d" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546448 56370 pod_container_deletor.go:77] Container "88a45a9147f05a0bd9e05ed712069f10b4cea6c2af3ccd0eb1601166f3ccf679" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546460 56370 pod_container_deletor.go:77] Container "a3ef9c2922877e2f25bd4814fd1f4e371fd98a19ad36b54371fd0b1bc51e255b" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546472 56370 pod_container_deletor.go:77] Container "c5102f50c2e01a2100e1dcb025096967e31134c43ffdb1655827b908e5b29f77" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546483 56370 pod_container_deletor.go:77] Container "d68f9392b34c6410e6154c95febcfb55dac109725750ae5c20671c39279c9730" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.546494 56370 pod_container_deletor.go:77] Container "eb04adc0b544c64e20ac3c847e03de048f7c7a26ce4d4a6b46282817d0df8e10" not found in pod's containers
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: W1006 20:39:59.710842 56370 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 06 20:39:59 osnode01.bdteam.local origin-node[56370]: E1006 20:39:59.710981 56370 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Oct 06 20:40:00 osnode01.bdteam.local origin-node[56370]: W1006 20:40:00.816290 56370 sdn_controller.go:38] Could not find an allocated subnet for node: osnode01.bdteam.local, Waiting...
[root@osnode01 centos]#
Log output from one of the nodes (/var/log/messages):
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5434] dhcp4 (eth0): address 10.30.1.43
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5434] dhcp4 (eth0): plen 24 (255.255.255.0)
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5434] dhcp4 (eth0): gateway 10.30.1.1
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5434] dhcp4 (eth0): lease time 3600
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5434] dhcp4 (eth0): hostname 'ip-10-30-1-43'
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5435] dhcp4 (eth0): nameserver '10.21.0.251'
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5435] dhcp4 (eth0): domain name 'bdteam.local'
Oct 6 20:41:15 osnode01 NetworkManager[18586]: <info> [1507322475.5435] dhcp4 (eth0): state changed bound -> bound
Oct 6 20:41:15 osnode01 dbus-daemon: dbus[632]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Oct 6 20:41:15 osnode01 dbus[632]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Oct 6 20:41:15 osnode01 systemd: Starting Network Manager Script Dispatcher Service...
Oct 6 20:41:15 osnode01 dhclient[18622]: bound to 10.30.1.43 -- renewal in 1686 seconds.
Oct 6 20:41:15 osnode01 dbus[632]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 6 20:41:15 osnode01 dbus-daemon: dbus[632]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 6 20:41:15 osnode01 systemd: Started Network Manager Script Dispatcher Service.
Oct 6 20:41:15 osnode01 nm-dispatcher: req:1 'dhcp4-change' [eth0]: new request (6 scripts)
Oct 6 20:41:15 osnode01 nm-dispatcher: req:1 'dhcp4-change' [eth0]: start running ordered scripts...
Oct 6 20:41:15 osnode01 nm-dispatcher: + cd /etc/sysconfig/network-scripts
Oct 6 20:41:15 osnode01 nm-dispatcher: + . ./network-functions
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ export PATH
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ hostname
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ HOSTNAME=osnode01.bdteam.local
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ '[' -z '' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ . /etc/init.d/functions
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ TEXTDOMAIN=initscripts
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ umask 022
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ export PATH
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' 56720 -ne 1 -a -z '' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -d /run/systemd/system ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ case "$0" in
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ COLUMNS=80
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -c /dev/stderr -a -r /dev/stderr ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ CONSOLETYPE=serial
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -f /etc/sysconfig/i18n -o -f /etc/locale.conf ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ . /etc/profile.d/lang.sh
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ unset LANGSH_SOURCED
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -z '' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' -f /etc/sysconfig/init ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ . /etc/sysconfig/init
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ BOOTUP=color
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ RES_COL=60
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ MOVE_TO_COL='echo -en \033[60G'
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_FAILURE='echo -en \033[0;31m'
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_WARNING='echo -en \033[0;33m'
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ SETCOLOR_NORMAL='echo -en \033[0;39m'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' serial = serial ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ BOOTUP=serial
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ MOVE_TO_COL=
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_SUCCESS=
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_FAILURE=
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_WARNING=
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ SETCOLOR_NORMAL=
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ __sed_discard_ignored_files='/\(~\|\.bak\|\.orig\|\.rpmnew\|\.rpmorig\|\.rpmsave\)$/d'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' '' = 1 ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: ++++ cat /proc/cmdline
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ strstr 'BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.2.2.el7.x86_64 root=UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8' rc.debug
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ '[' 'BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.2.2.el7.x86_64 root=UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8' = 'BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.2.2.el7.x86_64 root=UUID=29342a0b-e20f-4676-9ecf-dfdf02ef6683 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto console=ttyS0,115200 LANG=en_US.UTF-8' ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ return 1
Oct 6 20:41:15 osnode01 nm-dispatcher: +++ return 0
Oct 6 20:41:15 osnode01 nm-dispatcher: + '[' -f ../network ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: + . ../network
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ NETWORKING=yes
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ NOZEROCONF=yes
Oct 6 20:41:15 osnode01 nm-dispatcher: + [[ dhcp4-change =~ ^(up|dhcp4-change|dhcp6-change)$ ]]
Oct 6 20:41:15 osnode01 nm-dispatcher: + NEEDS_RESTART=0
Oct 6 20:41:15 osnode01 nm-dispatcher: + UPSTREAM_DNS=/etc/dnsmasq.d/origin-upstream-dns.conf
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct 6 20:41:15 osnode01 nm-dispatcher: + UPSTREAM_DNS_TMP=/tmp/tmp.5DzdaQo1tn
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct 6 20:41:15 osnode01 nm-dispatcher: + UPSTREAM_DNS_TMP_SORTED=/tmp/tmp.Ie4FFsjAgL
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct 6 20:41:15 osnode01 nm-dispatcher: + CURRENT_UPSTREAM_DNS_SORTED=/tmp/tmp.0ZlG7MgcgO
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct 6 20:41:15 osnode01 nm-dispatcher: + NEW_RESOLV_CONF=/tmp/tmp.293w7YIsqD
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ mktemp
Oct 6 20:41:15 osnode01 nm-dispatcher: + NEW_NODE_RESOLV_CONF=/tmp/tmp.D9exxlKVYt
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ /sbin/ip route list match 0.0.0.0/0
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ awk '{print $3 }'
Oct 6 20:41:15 osnode01 nm-dispatcher: + def_route=10.30.1.1
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ /sbin/ip route get to 10.30.1.1
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ awk '{print $3}'
Oct 6 20:41:15 osnode01 nm-dispatcher: + def_route_int=eth0
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ /sbin/ip route get to 10.30.1.1
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ awk '{print $5}'
Oct 6 20:41:15 osnode01 nm-dispatcher: + def_route_ip=10.30.1.43
Oct 6 20:41:15 osnode01 nm-dispatcher: + [[ eth0 == eth0 ]]
Oct 6 20:41:15 osnode01 nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: + grep -q 99-origin-dns.sh /etc/resolv.conf
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ systemctl -q is-active dnsmasq.service
Oct 6 20:41:15 osnode01 nm-dispatcher: + '[' 0 -eq 1 ']'
Oct 6 20:41:15 osnode01 nm-dispatcher: ++ systemctl -q is-active dnsmasq.service
Oct 6 20:41:15 osnode01 nm-dispatcher: + grep -q 99-origin-dns.sh /etc/resolv.conf
Oct 6 20:41:15 osnode01 nm-dispatcher: + sed -e '/^nameserver.*$/d' /etc/resolv.conf
Oct 6 20:41:15 osnode01 nm-dispatcher: + echo 'nameserver 10.30.1.43'
Oct 6 20:41:15 osnode01 nm-dispatcher: + grep -q 'search.*cluster.local' /tmp/tmp.293w7YIsqD
Oct 6 20:41:15 osnode01 nm-dispatcher: + grep -qw search /tmp/tmp.293w7YIsqD
Oct 6 20:41:15 osnode01 nm-dispatcher: + cp -Z /tmp/tmp.293w7YIsqD /etc/resolv.conf
Oct 6 20:41:15 osnode01 nm-dispatcher: + rm -f /tmp/tmp.5DzdaQo1tn /tmp/tmp.Ie4FFsjAgL /tmp/tmp.0ZlG7MgcgO /tmp/tmp.293w7YIsqD
Oct 6 20:41:18 osnode01 origin-node: I1006 20:41:18.210035 56657 aws.go:936] Could not determine public DNS from AWS metadata.
Oct 6 20:41:18 osnode01 origin-node: W1006 20:41:18.246426 56657 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Oct 6 20:41:18 osnode01 origin-node: E1006 20:41:18.246581 56657 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Oct 6 20:41:20 osnode01 origin-node: W1006 20:41:20.737092 56657 sdn_controller.go:38] Could not find an allocated subnet for node: osnode01.bdteam.local, Waiting...
Oct 6 20:41:20 osnode01 origin-node: F1006 20:41:20.737146 56657 node.go:309] error: SDN node startup failed: failed to get subnet for this host: osnode01.bdteam.local, error: timed out waiting for the condition
Oct 6 20:41:20 osnode01 systemd: origin-node.service: main process exited, code=exited, status=255/n/a
Oct 6 20:41:20 osnode01 dnsmasq[18837]: setting upstream servers from DBus
Oct 6 20:41:20 osnode01 dnsmasq[18837]: using nameserver 10.21.0.251#53
Oct 6 20:41:20 osnode01 dbus-daemon: dbus[632]: [system] Rejected send message, 0 matched rules; type="method_return", sender=":1.7943" (uid=0 pid=18837 comm="/usr/sbin/dnsmasq -k ") interface="(unset)" member="(unset)" error name="(unset)" requested_reply="0" destination=":1.9458" (uid=0 pid=56795 comm="/usr/bin/dbus-send --system --dest=uk.org.thekelle")
Oct 6 20:41:20 osnode01 dbus[632]: [system] Rejected send message, 0 matched rules; type="method_return", sender=":1.7943" (uid=0 pid=18837 comm="/usr/sbin/dnsmasq -k ") interface="(unset)" member="(unset)" error name="(unset)" requested_reply="0" destination=":1.9458" (uid=0 pid=56795 comm="/usr/bin/dbus-send --system --dest=uk.org.thekelle")
Oct 6 20:41:20 osnode01 systemd: Failed to start OpenShift Node.
Oct 6 20:41:20 osnode01 systemd: Unit origin-node.service entered failed state.
Oct 6 20:41:20 osnode01 systemd: origin-node.service failed.
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=origin
openshift_master_cluster_method=native
openshift_master_cluster_hostname=osmasterelb.bdteam.local
openshift_master_cluster_public_hostname=osmasterelb.bdteam.local
openshift_clock_enabled=true
openshift_master_default_subdomain= apps.bdteam.local
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=XXXXXXX
openshift_cloudprovider_aws_secret_key=XXXXXXXXX
# host group for masters
[masters]
osmaster01.bdteam.local openshift_hostname=osmaster01.bdteam.local
osmaster02.bdteam.local openshift_hostname=osmaster02.bdteam.local
[etcd]
openshift-etcd.bdteam.local openshift_hostname=openshift-etcd.bdteam.local
[nodes]
osmaster01.bdteam.local openshift_hostname=osmaster01.bdteam.local
osmaster02.bdteam.local openshift_hostname=osmaster02.bdteam.local
osnode01.bdteam.local openshift_node_labels="{'region': 'infra', 'zone': 'west'}" openshift_hostname=osnode01.bdteam.local
osnode03.bdteam.local openshift_node_labels="{'region': 'infra', 'zone': 'west'}" openshift_hostname=osnode03.bdteam.local
osnode02.bdteam.local openshift_node_labels="{'region': 'primary', 'zone': 'west'}" openshift_hostname=osnode02.bdteam.local
osnode04.bdteam.local openshift_node_labels="{'region': 'primary', 'zone': 'west'}" openshift_hostname=osnode04.bdteam.local
Same error here since end of August with RHEL7.4 + Openshift Enterprise on AWS. (See also #5691)
Nodes are registered in the cluster with AWS DNS domain suffix (
Support request opened with Red Hat (case 01937377), still waiting for resolution.
$ oc get nodes
NAME STATUS AGE VERSION
ip-10-0-132-148.eu-west-1.compute.internal NotReady 15d v1.6.1+5115d708d7
ip-10-0-132-201.eu-west-1.compute.internal NotReady 15d v1.6.1+5115d708d7
ip-10-0-132-38.eu-west-1.compute.internal NotReady 15d v1.6.1+5115d708d7
ip-10-0-133-100.eu-west-1.compute.internal NotReady 15d v1.6.1+5115d708d7
ip-10-0-133-173.eu-west-1.compute.internal NotReady 15d v1.6.1+5115d708d7
ip-10-0-134-180.eu-west-1.compute.internal NotReady 15d v1.6.1+5115d708d7
ip-10-0-134-31.eu-west-1.compute.internal NotReady 15d v1.6.1+5115d708d7
Same for networks.
$ oc get hostsubnets
NAME HOST HOST IP SUBNET
ip-10-0-132-148.eu-west-1.compute.internal ip-10-0-132-148.eu-west-1.compute.internal 10.0.132.148 172.16.14.0/23
ip-10-0-132-201.eu-west-1.compute.internal ip-10-0-132-201.eu-west-1.compute.internal 10.0.132.201 172.16.10.0/23
ip-10-0-132-38.eu-west-1.compute.internal ip-10-0-132-38.eu-west-1.compute.internal 10.0.132.38 172.16.0.0/23
ip-10-0-133-100.eu-west-1.compute.internal ip-10-0-133-100.eu-west-1.compute.internal 10.0.133.100 172.16.12.0/23
ip-10-0-133-173.eu-west-1.compute.internal ip-10-0-133-173.eu-west-1.compute.internal 10.0.133.173 172.16.16.0/23
ip-10-0-134-180.eu-west-1.compute.internal ip-10-0-134-180.eu-west-1.compute.internal 10.0.134.180 172.16.6.0/23
ip-10-0-134-31.eu-west-1.compute.internal ip-10-0-134-31.eu-west-1.compute.internal 10.0.134.31 172.16.8.0/23
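For anyone comparing these listings with the "Could not find an allocated subnet for node: osnode01.bdteam.local" message in the original report, here is a minimal sketch of a check for whether a given node name has a host subnet at all. It assumes the SDN looks the subnet up by the name the node is configured with, which the log message suggests but this thread does not confirm. The function names and the `SAMPLE` text are hypothetical; the parser only handles the plain four-column output shown above.

```python
def parse_hostsubnets(output: str) -> dict:
    """Map each HOST value to its SUBNET from `oc get hostsubnets` text output."""
    subnets = {}
    for line in output.strip().splitlines():
        fields = line.split()
        # The header splits into 5 tokens ("HOST IP" has a space), data rows into 4.
        if len(fields) != 4 or fields[0] == "NAME":
            continue
        _name, host, _host_ip, subnet = fields
        subnets[host] = subnet
    return subnets


def has_subnet(output: str, node_name: str) -> bool:
    """Return True if a host subnet is registered under exactly this node name."""
    return node_name in parse_hostsubnets(output)


# Hypothetical sample in the same shape as the listing above.
SAMPLE = """\
NAME HOST HOST IP SUBNET
ip-10-0-132-148.eu-west-1.compute.internal ip-10-0-132-148.eu-west-1.compute.internal 10.0.132.148 172.16.14.0/23
ip-10-0-132-38.eu-west-1.compute.internal ip-10-0-132-38.eu-west-1.compute.internal 10.0.132.38 172.16.0.0/23
"""

if __name__ == "__main__":
    # A subnet exists under the AWS-style name but not under a custom hostname.
    print(has_subnet(SAMPLE, "ip-10-0-132-38.eu-west-1.compute.internal"))
    print(has_subnet(SAMPLE, "node1.example.internal"))
```

If the custom hostname is missing from the list while the AWS-style name is present, that would be consistent with the name mismatch discussed in this thread.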
It seems to be a timing issue; I get similar error messages when installing on AWS. After the installation failed, I was able to start the origin-node service on the machine by waiting several minutes. After that, running the installation a second time seems to work.
@j00p34 I don't think I have a timing issue in my setup. I tried manually restarting the node service on each machine after a while and it threw the same error.
@j00p34 / @poonia0arun Same for me. Restarting the installation doesn't help.
@poonia0arun Sorry to hear that. It would have been easy to work around otherwise. I should say that I am using the 3.7 alpha version of OpenShift; I haven't tried it with 3.6 yet. Another big difference I see in your config is that you're specifying AWS keys. I am using IAM roles for my instances, so they have rights to the AWS API without specifying keys:
```
# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
[OSEv3:vars]
ansible_ssh_user=centos
ansible_become=true
debug_level=5
openshift_deployment_type=origin
use_manageiq=true
openshift_cfme_install_app=True
openshift_repos_enable_testing=True
openshift_disable_check=memory_availability,docker_storage,disk_availability,docker_image_availability
enable_excluders=false
openshift_hosted_logging_deploy=true
openshift_hosted_logging_storage_kind=dynamic
openshift_cloudprovider_kind=aws
openshift_master_default_subdomain=pub.lic.ip.here.xip.io
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]
openshift_master_htpasswd_users={'username': '$apr1$incrediblysecrethash/'}
[masters]
master.openshift.local
[etcd]
master.openshift.local
[nodes]
master.openshift.local openshift_node_labels="{'region': 'infra', 'zone': 'default'}" openshift_schedulable=true
node1.openshift.local openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
node2.openshift.local openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
```
@patlachance I guess your setup is quite different, as you are on the Enterprise version.
I used Terraform to set up the machines and configure everything in AWS. I've got OpenShift running except for the registry. The registry can't start because it's trying to use a base image from Docker Hub that doesn't exist. I did find this:
aws-ansible
That seems to configure your complete environment so it could be a better option to know everything is configured ok. I think I'll look into that this week.
This is also an interesting read: reference architecture 3.6
@j00p34 You're right, I'm trying to install the Enterprise version, following the instructions from the link you provided. The only difference is that I'm trying to deploy OpenShift in a private VPC behind custom proxy/reverse proxy instances.
@poonia0arun There's one thing I remember from a previous installation: when I provided openshift_hostname, my cluster couldn't start either. I can't remember exactly what the problem was, but it had something to do with Kubernetes resolving the hostname while the node names were different. Maybe you should try it without the openshift_hostname=osmaster01.bdteam.local settings. You get AWS names then, but it worked for me.
@j00p34 If I run my Ansible playbook without the openshift_hostname value, the API on the master doesn't restart because it tries to resolve the hostname ip-10-30-1-248.bdteam.local, which has no DNS record on my DNS server, so the API service on the master fails:
[root@osmaster01 centos]# systemctl status origin-master-api.service
● origin-master-api.service - Atomic OpenShift Master API
Loaded: loaded (/usr/lib/systemd/system/origin-master-api.service; enabled; vendor preset: disabled)
Active: activating (start) since Tue 2017-10-10 16:53:28 UTC; 23s ago
Docs: https://github.com/openshift/origin
Main PID: 31254 (openshift)
Memory: 25.4M
CGroup: /system.slice/origin-master-api.service
└─31254 /usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8443 --master=https://ip-10-30-1-27.bdteam.local:8443
Oct 10 16:53:38 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:42 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:43 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:45 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:51 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Hint: Some lines were ellipsized, use -l to show in full.
[root@osmaster01 centos]#
I am using an ELB in front of my HA pair of masters.
I fixed my problem; it seems unrelated after all. I am running 3.7 and noticed that origin-master-controllers.service was crash looping because in this version you need to set a ClusterID when running on AWS. While the playbook was running I added
[Global]
KubernetesClusterTag=mytestcluster
KubernetesClusterID=mytestcluster
to /etc/origin/cloudprovider/aws.conf
After that the install proceeded without a problem. The reason it worked after a while before was probably that I happened to start the service at the right moment.
@j00p34 Oh, okay. I can't find the right solution for this; I am still waiting for one.
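For reference, a minimal sketch of generating and checking that `[Global]` section from a script. The two key names and the `mytestcluster` value come from the comment above; the helper names are hypothetical, and parsing the file with Python's `configparser` is an assumption that only holds for simple INI-style content like the snippet shown.

```python
import configparser


def render_aws_conf(cluster_id: str) -> str:
    """Return the [Global] section from the comment above for one cluster ID."""
    if not cluster_id or any(ch.isspace() for ch in cluster_id):
        raise ValueError("cluster_id must be non-empty and contain no whitespace")
    lines = [
        "[Global]",
        f"KubernetesClusterTag={cluster_id}",
        f"KubernetesClusterID={cluster_id}",
    ]
    return "\n".join(lines) + "\n"


def has_cluster_id(conf_text: str) -> bool:
    """Return True if the text has a KubernetesClusterID entry under [Global]."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.has_option("Global", "KubernetesClusterID")


if __name__ == "__main__":
    text = render_aws_conf("mytestcluster")
    print(text, end="")
    print(has_cluster_id(text))
```

The sketch only builds and inspects text; writing it to /etc/origin/cloudprovider/aws.conf and restarting the services is left to whoever runs it.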
@abutcher @sdodson any pointers for this issue?
@sdodson do you happen to have any pointer on this issue ?
Hello,
I can confirm that the problem exists using OpenShift Origin v3.6 and openshift-ansible with git tag
openshift-ansible-3.6.173.0.9-1 using Amazon Web Services (AWS).
The problem occurs when you have custom host names or a custom domain configured, e.g. mymaster1.example.internal and so on.
The AWS cloud provider works only when the hostname/domain in your Ansible inventory *.hosts file is the same as the one displayed in the AWS instance's Private DNS field (in the EC2 instance description), e.g. ip-10-212-31-117.eu-west-1.compute.internal.
To do so you must have the VPC DHCP options configured with an empty domain-name, e.g.:
{
    "DhcpOptions": [
        {
            "DhcpConfigurations": [
                {
                    "Values": [
                        {
                            "Value": "AmazonProvidedDNS"
                        }
                    ],
                    "Key": "domain-name-servers"
                }
            ],
            "DhcpOptionsId": "dopt-<lkjlkfdj>"
        }
    ]
}
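A rough way to check the "empty domain-name" condition is to look for a domain-name key in JSON of this shape. This is only a sketch: it assumes the input has the same structure as the listing above (presumably from the AWS CLI's describe-dhcp-options call), and custom_domain_names is a made-up helper name.

```python
import json


def custom_domain_names(describe_output: str) -> dict:
    """Map each DhcpOptionsId to its configured domain-name values, if any.

    An empty result means no option set carries a domain-name entry,
    which is the situation described in the comment above.
    """
    found = {}
    for opts in json.loads(describe_output).get("DhcpOptions", []):
        for cfg in opts.get("DhcpConfigurations", []):
            if cfg.get("Key") == "domain-name":
                values = [v.get("Value") for v in cfg.get("Values", [])]
                found[opts.get("DhcpOptionsId")] = values
    return found
```

If the function returns anything, that option set hands out a custom domain and the instance hostnames will probably not match the Private DNS names.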
The hostname in CentOS Linux must be the same as above: ip-10-212-31-117.eu-west-1.compute.internal.
The following commands also must return ip-10-212-31-117.eu-west-1.compute.internal:
A similar problem is also mentioned in this issue: https://github.com/kubernetes/kubernetes/issues/11543
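The hostname requirement can be sanity-checked with a small script. This is a sketch under a few assumptions: it uses the IP-based Private DNS naming seen in this thread (ip-A-B-C-D.REGION.compute.internal, with us-east-1 using ec2.internal instead), it compares against socket.getfqdn(), and both function names are invented for illustration.

```python
import ipaddress
import socket


def ec2_private_dns(ip: str, region: str) -> str:
    """Build the IP-based EC2 Private DNS name for a private IPv4 address."""
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    # us-east-1 is the one region that uses the short "ec2.internal" suffix.
    suffix = "ec2.internal" if region == "us-east-1" else f"{region}.compute.internal"
    return "ip-" + ip.replace(".", "-") + "." + suffix


def hostname_matches(ip: str, region: str, actual: str = None) -> bool:
    """Compare a hostname (default: this machine's FQDN) with the expected name."""
    actual = socket.getfqdn() if actual is None else actual
    return actual == ec2_private_dns(ip, region)
```

Running hostname_matches on each master and node with its own private IP and region gives a quick signal before starting the playbook.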
I'm looking forward to a fix or workaround that allows custom domains and hostnames when using the AWS cloud provider.
Regards,
Pawel
One of my colleagues spent some time on this issue. He suggested creating an A record in Route53 for each host, named ip-X-X-X-X.local.domain, and pointing each record at the corresponding master or node IP. In my setup I am using an ELB in front of the masters, so create a classic load balancer listening on port 8443 of each master.
I made three changes to make it work on my current setup, even though I can't use proper custom hostnames:
[root@osmaster01 master]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.30.1.121 ip-10-30-1-121.us-west-1.compute.internal ip-10-30-1-121.bdteam.local ip-10-30-1-121
10.30.2.212 ip-10-30-2-212.us-west-1.compute.internal ip-10-30-2-212.bdteam.local ip-10-30-2-212
10.30.1.64 ip-10-30-1-64.us-west-1.compute.internal ip-10-30-1-64.bdteam.local ip-10-30-1-64
10.30.2.235 ip-10-30-2-235.us-west-1.compute.internal ip-10-30-2-235.bdteam.local ip-10-30-2-235
10.30.1.221 ip-10-30-1-221.us-west-1.compute.internal ip-10-30-1-221.bdteam.local ip-10-30-1-221
10.30.2.209 ip-10-30-2-209.us-west-1.compute.internal ip-10-30-2-209.bdteam.local ip-10-30-2-209
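The /etc/hosts entries above follow a regular pattern, so they can be generated instead of typed. A minimal sketch, assuming the same three-alias layout as the listing; hosts_line is a hypothetical helper and it does not handle the us-east-1 ec2.internal suffix.

```python
def hosts_line(ip: str, region: str, custom_domain: str) -> str:
    """Return one /etc/hosts line: IP, EC2 private name, custom-domain alias, short name."""
    short = "ip-" + ip.replace(".", "-")
    return f"{ip} {short}.{region}.compute.internal {short}.{custom_domain} {short}"


if __name__ == "__main__":
    for addr in ["10.30.1.121", "10.30.2.212"]:
        print(hosts_line(addr, "us-west-1", "bdteam.local"))
```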
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
ansible_ssh_user=root
openshift_master_cluster_method=native
openshift_master_cluster_hostname=openshift-master.bdteam.local
openshift_master_cluster_public_hostname=openshift-master.bdteam.local
openshift_master_default_subdomain=apps.bdteam.local
openshift_clock_enabled=true
openshift_hosted_manage_registry=false
openshift_hosted_manage_router=false
openshift_override_hostname_check=true
deployment_type=openshift-enterprise
openshift_disable_check=memory_availability,disk_availability,docker_storage
openshift_cloudprovider_kind=aws
openshift_cloudprovider_aws_access_key=XXXXXXXXX
openshift_cloudprovider_aws_secret_key=XXXXXXXX
[nodes]
10.30.1.121 openshift_hostname=ip-10-30-1-121.us-west-1.compute.internal
10.30.2.212 openshift_hostname=ip-10-30-2-212.us-west-1.compute.internal
10.30.1.64 openshift_hostname=ip-10-30-1-64.us-west-1.compute.internal openshift_node_labels="{'region': 'infra', 'zone': 'west'}"
10.30.2.235 openshift_hostname=ip-10-30-2-235.us-west-1.compute.internal openshift_node_labels="{'region': 'infra', 'zone': 'west'}"
10.30.1.221 openshift_hostname=ip-10-30-1-221.us-west-1.compute.internal openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
10.30.2.209 openshift_hostname=ip-10-30-2-209.us-west-1.compute.internal openshift_node_labels="{'region': 'primary', 'zone': 'west'}"
[masters]
10.30.1.121 openshift_hostname=ip-10-30-1-121.us-west-1.compute.internal
10.30.2.212 openshift_hostname=ip-10-30-2-212.us-west-1.compute.internal
[etcd]
10.30.1.121 openshift_hostname=ip-10-30-1-121.us-west-1.compute.internal
No error occurred:
[root@osmaster01 master]# oc get nodes
NAME                                        STATUS                     AGE       VERSION
ip-10-30-1-121.us-west-1.compute.internal   Ready,SchedulingDisabled   59m       v1.6.1+5115d708d7
ip-10-30-1-221.us-west-1.compute.internal   Ready                      59m       v1.6.1+5115d708d7
ip-10-30-1-64.us-west-1.compute.internal    Ready                      59m       v1.6.1+5115d708d7
ip-10-30-2-209.us-west-1.compute.internal   Ready                      59m       v1.6.1+5115d708d7
ip-10-30-2-212.us-west-1.compute.internal   Ready,SchedulingDisabled   59m       v1.6.1+5115d708d7
ip-10-30-2-235.us-west-1.compute.internal   Ready                      59m       v1.6.1+5115d708d7
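To check this kind of output automatically after an install, something like the following could work. It is a sketch that assumes the plain-text column layout shown above (NAME first, STATUS second) and a hypothetical helper name, not_ready_nodes; it only parses text you pass in and does not call oc itself.

```python
def not_ready_nodes(oc_output: str) -> list:
    """Return the names of nodes whose STATUS column lacks a plain "Ready" condition."""
    names = []
    for line in oc_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        # STATUS can be a comma-separated list such as "Ready,SchedulingDisabled";
        # "NotReady" is a separate token and does not count as "Ready".
        if "Ready" not in status.split(","):
            names.append(name)
    return names
```

An empty list corresponds to the healthy state shown in the listing; any returned name is a node to inspect with systemctl status origin-node.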
Hopefully this will help someone who is still trying to make it work.
@DanyC97 the kubeletPreferredAddressTypes arg goes in master config under apiserver arguments
thanks a bunch @liggitt , i'll give it a try and report back.
Initially i've done https://github.com/kubernetes/kubernetes/issues/11543#issuecomment-373978371 but not much luck.
@liggitt something is not right. I applied the "kubeletPreferredAddressTypes" fix as suggested and saw the following error:
Mar 19 19:46:05 ip-10-0-0-197 origin-master-controllers: Invalid MasterConfig /etc/origin/master/master-config.yaml
Mar 19 19:46:05 ip-10-0-0-197 origin-master-controllers: flag: Invalid value: "kubeletPreferredAddressTypes": is not a valid flag
Mar 19 19:46:05 ip-10-0-0-197 systemd: origin-master-controllers.service: main process exited, code=exited, status=255/n/a
Mar 19 19:46:05 ip-10-0-0-197 systemd: Failed to start Atomic OpenShift Master Controllers.
any ideas ?
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.