openshift-ansible-3.11.0-0.16.0
ansible-2.6.2-1.el7.noarch (epel)
successful installation
Describe what is actually happening.
2018-08-17 12:37:20,201 p=19271 u=root | failed: [master1.ceijaug.internal] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "results": {"cmd": "/bin/oc get pod master-etcd-ip-192-199-0-7.ec2.internal -o json -n kube-system", "results": [{}], "returncode": 0, "stderr": "Error from server (NotFound): pods \"master-etcd-ip-192-199-0-7.ec2.internal\" not found\n", "stdout": ""}, "state": "list"}
Provide any additional information which may help us diagnose the
issue.
[root@master1 ~]# oc get node
NAME STATUS ROLES AGE VERSION
master1.ceijaug.internal NotReady <none> 6m v1.11.0+d4cacc0
@sdodson another nodename problem.
@thoraxe please provide inventory.
Ansible playbooks are reaching out to AWS metadata to fetch the internal DNS name and replace whatever is set in ansible inventory if the host can be reached.
How were the hostnames (node1.ceijaug.internal, etc.) set?
Note that in order to override master-etcd-ip-192-199-0-7.ec2.internal you'd need Route53 config so that cloudprovider would know about this - a simple hostnamectl set-hostname won't work
[root@master1 ~]# dig master1.mahtaix.internal
; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.mahtaix.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59599
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.mahtaix.internal. IN A
;; ANSWER SECTION:
master1.mahtaix.internal. 10 IN A 192.199.0.164
;; Query time: 2 msec
;; SERVER: 192.199.0.2#53(192.199.0.2)
;; WHEN: Fri Aug 17 18:14:30 UTC 2018
;; MSG SIZE rcvd: 69
We create a DNS zone.
That's weird. Maybe @mazzystr could help with the Route53 settings and find out why the playbook would still expect hostnames to be .ec2.internal?
I ran into this as well.
You will want to specify some combination of hostname and IP in the inventory.
Overriding Detected IP Addresses and Host Names
In AWS, situations that require overriding the variables include:
https://docs.openshift.com/container-platform/3.10/install_config/configuring_aws.html#overriding-detected-ip-addresses-host-names-aws
demo-mas01-usw1.demo.local openshift_hostname=demo-mas01-usw1.demo.local hostname=demo-mas01-usw1.demo.local ip=10.10.2.190 openshift_node_group_name=node-config-master
demo-wrk01-usw1.demo.local openshift_hostname=demo-wrk01-usw1.demo.local hostname=demo-wrk01-usw1.demo.local ip=10.10.2.92 openshift_node_group_name=node-config-compute
demo-wrk04-usw1.demo.local openshift_hostname=demo-wrk04-usw1.demo.local hostname=demo-wrk04-usw1.demo.local ip=10.10.4.111 openshift_node_group_name=node-config-infra
@MatthewJSalerno did you get a nasty deprecation warning about the use of openshift_hostname / hostname ?
@thoraxe I did a quick grep through my output log and didn't see anything. Without those settings, all of my certs used the wrong hostname.
It would be great if we could disable the cloud provider check. I'm no ansible expert, but I went digging through the code and it looks like there is no bypass.
2018-08-17 20:56:29,947 p=19329 u=root | TASK [openshift_control_plane : Wait for control plane pods to appear] *********
2018-08-17 20:56:29,947 p=19329 u=root | task path: /root/openshift-ansible/roles/openshift_control_plane/tasks/main.yml:204
2018-08-17 20:56:30,319 p=19329 u=root | Using module file /root/openshift-ansible/roles/lib_openshift/library/oc_obj.py
2018-08-17 20:56:30,430 p=19329 u=root | Escalation succeeded
2018-08-17 20:56:30,761 p=19329 u=root | FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).Result was: {
"attempts": 1,
"changed": false,
"invocation": {
"module_args": {
"all_namespaces": null,
"content": null,
"debug": false,
"delete_after": false,
"field_selector": null,
"files": null,
"force": false,
"kind": "pod",
"kubeconfig": "/etc/origin/master/admin.kubeconfig",
"name": "master-etcd-ip-192-199-0-228.ec2.internal",
"namespace": "kube-system",
"selector": null,
"state": "list"
}
},
"msg": {
"cmd": "/bin/oc get pod master-etcd-ip-192-199-0-228.ec2.internal -o json -n kube-system",
"results": [
{}
],
"returncode": 1,
"stderr": "The connection to the server master.hiequid.openshiftdemos.com:443 was refused - did you specify the right host or port?\n",
"stdout": ""
},
"retries": 61
}
OK, sure, the connection was refused. But that's not the issue.
"cmd": "/bin/oc get pod master-etcd-ip-192-199-0-228.ec2.internal -o json -n kube-system"
vs
[root@master1 ~]# oc get node
NAME STATUS ROLES AGE VERSION
master1.hiequid.internal NotReady <none> 3m v1.11.0+d4cacc0
[root@master1 ~]# dig master1.hiequid.internal
; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.hiequid.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16108
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.hiequid.internal. IN A
;; ANSWER SECTION:
master1.hiequid.internal. 1 IN A 192.199.0.228
;; Query time: 0 msec
;; SERVER: 192.199.0.228#53(192.199.0.228)
;; WHEN: Fri Aug 17 21:00:51 UTC 2018
;; MSG SIZE rcvd: 69
The inventory file is the same as in the OP with the exception of the hostnames being slightly different.
This task is in roles/openshift_control_plane/tasks/main.yml
The oc command is trying to fetch a pod name that doesn't correspond to the node name.
Specifically, in this case, openshift.node.nodename appears to have some metadata name for the server and not the actual specified inventory node name. Which is extra odd because while we are actually on AWS we are not configuring the AWS cloud provider. So it's doing this all on its own.
Have you looked through the AWS metadata to see if that hostname is there? Do any of the below commands return that name?
curl http://169.254.169.254/latest/meta-data/local-hostname
curl http://169.254.169.254/latest/meta-data/public-hostname
curl http://169.254.169.254/latest/meta-data/hostname
It's possibly getting populated around here
and here. 273 looks like where the values are being set
When in doubt, just execute the following command before the prereq and deploy:
sudo ip route add blackhole 169.254.169.254/32
It won't last past a reboot and it will keep the install from seeing the AWS metadata.
@MatthewJSalerno if we are detecting hostnames using pre-determined methods on AWS based on system bios we are likely to wreck quite a few things.
I have fully functioning forward-resolving DNS, and yet the hostnames I'm using in my inventory are being ignored, but only partially. The node gets the name I specified in inventory. But for whatever reason the test for the control plane pods is using the ec2-detected information, which I don't want to use and did not ask for.
[root@master1 ~]# hostname
master1.fuohooy.internal
[root@master1 ~]# dig master1.fuohooy.internal
; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.fuohooy.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13731
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.fuohooy.internal. IN A
;; ANSWER SECTION:
master1.fuohooy.internal. 10 IN A 192.199.0.182
;; Query time: 2 msec
;; SERVER: 192.199.0.2#53(192.199.0.2)
;; WHEN: Mon Aug 20 13:37:34 UTC 2018
;; MSG SIZE rcvd: 69
[root@master1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
link/ether 0a:c0:87:54:77:a0 brd ff:ff:ff:ff:ff:ff
inet 192.199.0.182/24 brd 192.199.0.255 scope global noprefixroute dynamic eth0
valid_lft 2119sec preferred_lft 2119sec
inet6 fe80::8c0:87ff:fe54:77a0/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:52:10:c1:fe brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
[root@master1 ~]# curl http://169.254.169.254/latest/meta-data/local-hostname
ip-192-199-0-182.ec2.internal
[root@master1 ~]# curl http://169.254.169.254/latest/meta-data/public-hostname
ec2-34-195-201-149.compute-1.amazonaws.com
[root@master1 ~]# curl http://169.254.169.254/latest/meta-data/hostname
ip-192-199-0-182.ec2.internal
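The comparison made by hand above (system hostname versus the metadata names) can be scripted. The following is only a sketch: the function names are made up for illustration, and it assumes the instance answers plain unauthenticated GET requests on the metadata endpoint, as the curl commands above do.

```python
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/"


def fetch_metadata(key, timeout=2.0):
    """Return one EC2 instance-metadata value, e.g. key="local-hostname"."""
    with urllib.request.urlopen(METADATA_URL + key, timeout=timeout) as resp:
        return resp.read().decode().strip()


def hostname_mismatches(local_name, fetch=fetch_metadata,
                        keys=("local-hostname", "hostname")):
    """Return {key: value} for every metadata name that differs from local_name."""
    found = {key: fetch(key) for key in keys}
    return {k: v for k, v in found.items() if v.lower() != local_name.lower()}


# On an EC2 instance one might call:
#   import socket
#   print(hostname_mismatches(socket.getfqdn()))
```

A non-empty result means the host has been renamed relative to what the metadata service reports, which is the situation described in this thread. The `fetch` parameter exists only so the check can be exercised without network access.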
While blackholing the metadata server is a workaround for this particular issue, it's not a fix for the fact that we are not using the provided inventory name.
I'm relatively sure it's related to this change.
Clayton had previously suggested that we should figure out a way to use raw_hostname only on scale group nodes but hostname otherwise decoupling this from bootstrapping.
What play is being executed? provision_install.yml or deploy_cluster.yml?
In AWS, operators have to go through great effort to get hostname, A, and PTR records to match. If they don't match, then the following key is mandatory:
openshift_hostname_check=false
Also openshift_set_hostname is not a valid key. Please remove it.
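Whether the hostname, A record, and PTR record agree can be checked with a few lines of Python. This is a rough sketch, not something the installer ships: `forward_reverse_match` is a made-up name, and `socket.gethostbyname` / `socket.gethostbyaddr` go through the system resolver, so /etc/hosts entries can mask what DNS itself returns.

```python
import socket


def forward_reverse_match(hostname, resolve=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ok, ip, ptr_name); ok is True when the PTR points back at hostname."""
    ip = resolve(hostname)        # forward lookup (A record)
    ptr_name = reverse(ip)[0]     # reverse lookup (PTR record), primary name
    ok = ptr_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return ok, ip, ptr_name


# Example, run on the host in question:
#   print(forward_reverse_match(socket.getfqdn()))
```

If `ok` is False, the forward and reverse records disagree, which is the case where openshift_hostname_check=false is said to be required.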
@mazzystr I am running prereq and then deploy_cluster
Let me get an environment back up and validate that hostname, A, and PTR records match (or not).
@mazzystr so I have verified that the PTR record does not point at the A record/hostname:
[root@master1 ~]# dig -t ptr 7.0.199.192.in-addr.arpa
; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> -t ptr 7.0.199.192.in-addr.arpa
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50572
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;7.0.199.192.in-addr.arpa. IN PTR
;; ANSWER SECTION:
7.0.199.192.in-addr.arpa. 60 IN PTR ip-192-199-0-7.ec2.internal.
;; Query time: 1 msec
;; SERVER: 192.199.0.7#53(192.199.0.7)
;; WHEN: Mon Aug 20 21:34:48 UTC 2018
;; MSG SIZE rcvd: 94
[root@master1 ~]# hostname -f
master1.weiphar.internal
[root@master1 ~]# dig master1.weiphar.internal
; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.weiphar.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3030
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.weiphar.internal. IN A
;; ANSWER SECTION:
master1.weiphar.internal. 8 IN A 192.199.0.7
;; Query time: 0 msec
;; SERVER: 192.199.0.7#53(192.199.0.7)
;; WHEN: Mon Aug 20 21:34:59 UTC 2018
;; MSG SIZE rcvd: 69
That being said, setting openshift_hostname_check=false and removing openshift_set_hostname did not fix the issue:
2018-08-20 21:15:48,564 p=19278 u=root | failed: [master1.weiphar.internal] (item=controllers) => {
"attempts": 60,
"changed": false,
"invocation": {
"module_args": {
"all_namespaces": null,
"content": null,
"debug": false,
"delete_after": false,
"field_selector": null,
"files": null,
"force": false,
"kind": "pod",
"kubeconfig": "/etc/origin/master/admin.kubeconfig",
"name": "master-controllers-ip-192-199-0-7.ec2.internal",
"namespace": "kube-system",
"selector": null,
"state": "list"
}
},
"item": "controllers",
"results": {
"cmd": "/bin/oc get pod master-controllers-ip-192-199-0-7.ec2.internal -o json -n kube-system",
"results": [
{}
],
"returncode": 0,
"stderr": "Error from server (NotFound): pods \"master-controllers-ip-192-199-0-7.ec2.internal\" not found\n",
"stdout": ""
},
"state": "list"
}
It is still failing by checking for pod names constructed from AWS' hostnames and not the inventory or forward DNS names.
I can probably fix the PTR problem but that still seems like a workaround given the installer is doing something I'm not asking it to do.
prereq + deploy_cluster are the correct plays for byo in AWS with no cloud_provider.
The instances are fouled due to the bad keys. Please terminate and recreate them then rerun installer.
@mazzystr every time I submit a new comment I am doing a fresh deploy of the entire environment.
https://github.com/openshift/openshift-ansible/issues/9647#issuecomment-414472091
This is a completely fresh environment with the suggestion you requested.
It is still failing by checking for pod names constructed from AWS' hostnames and not the inventory or forward DNS names. The original inventory file is still essentially accurate. The systems have different hostnames; I've added openshift_hostname_check=false and removed openshift_set_hostname. I am not sure what you mean by _due to the bad keys_. What keys?
re redeployment ... very good. just making sure.
re still failing ... i see you have the following in your inventory...
openshift_master_cluster_hostname=master.ceijaug.openshiftdemos.com
openshift_master_cluster_public_hostname=master.ceijaug.openshiftdemos.com
openshift_master_default_subdomain=apps.ceijaug.openshiftdemos.com
Do openshift_master_cluster_hostname and openshift_master_cluster_public_hostname resolve? Usually these are CNAMEs that point to the internal and internet-facing master ELBs.
Do you have the wildcard record set for openshift_master_default_subdomain? Usually this is a CNAME pointing at the infra ELB. You can have internal and internet-facing infra ELBs also. If so then you need two wildcard records.
@mazzystr This all works with 3.10. It doesn't work with master/3.11/latest.
_Do openshift_master_cluster_hostname and openshift_master_cluster_public_hostname resolve?_
Yes. All the DNS works and is in Route53.
_Do you have the wildcard record set for openshift_master_default_subdomain?_
Yes. All the DNS works and is in Route53.
Neither of those problems appears to me to be related to this GH issue, though, which is that the installer is assembling the wrong pod names when it checks status.
This whole config, installation process, and everything, works just fine with 3.10. It breaks in 3.11 specifically with the control plane check.
Can you take an inventory of SecurityGroups? SecurityGroups, rules, and assignments please.
Ensure that openshift_portal_net matches the VPC CIDR range.
osm_cluster_network_cidr is not necessary. Let the installer use the default setting.
openshift_master_overwrite_named_certificates is not necessary. Let the installer use the default setting.
For now, until we get this working, disable openshift_enable_service_catalog, template_service_broker_install, and ansible_service_broker_install, and remove nfs.
I'm feeling frustrated because I want to achieve a quick resolution here. I am not sure how the things you are asking for are related to the issue at hand.
_Can you take an inventory of SecurityGroups? SecurityGroups, rules, and assignments please._
Would you be willing to explain the failure path by which a problem in any of those would cause Ansible to ignore the inventory name when assembling the control plane pod names?
Would you be willing to explain the failure path by which service_catalog, template_broker, or service_broker would cause Ansible to ignore the inventory name when assembling the control plane pod names?
Removing NFS will break my environment.
For reference, again, here is the specific task that is failing:
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.0-0.17.0/roles/openshift_control_plane/tasks/main.yml#L204-L221
Even more specifically:
name: "master-{{ item }}-{{ openshift.node.nodename | lower }}"
openshift.node.nodename is the AWS-metadata-based name, and is not the inventory-provided name.
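To make the mismatch concrete, here is a small Python sketch that mirrors the template above. The helper name is invented for illustration, and the second name assumes the static pod is suffixed with the node name reported by oc get node, which the thread implies but does not show directly.

```python
def control_plane_pod_name(component, nodename):
    """Mirror the template: master-{{ item }}-{{ openshift.node.nodename | lower }}."""
    return f"master-{component}-{nodename.lower()}"


# Values taken from the logs in this thread.
metadata_name = "ip-192-199-0-7.ec2.internal"   # name from AWS metadata
inventory_name = "master1.weiphar.internal"     # name given in the inventory

# What the task asks oc for, given the metadata-derived nodename fact.
expected_by_installer = control_plane_pod_name("controllers", metadata_name)
# What the pod would presumably be called if it follows the node name (assumption).
named_after_node = control_plane_pod_name("controllers", inventory_name)

print(expected_by_installer)
print(named_after_node)
```

The two names differ, so the lookup returns NotFound for all 60 retries even if the pod is running.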
@sdodson alluded to https://github.com/openshift/openshift-ansible/blame/644601121664cc8930efd71843a8b7e0ef5e1109/roles/openshift_facts/library/openshift_facts.py#L488-L491 as being where the nodename fact is set.
Would you be willing to focus on where the nodename is being set?
You have to understand I'm flying blind here. Ref arch hasn't started work on 3.11 yet. Your infrastructure is a complete unknown to me.
The 3 common things that cause the installer to fail are DNS, network connectivity, and not having the gquota option set on emptydir storage on instances/ami. The installer will chug along happily even though various components are damaged.
Your node is not in Ready status. Do a systemctl status atomic-openshift-node.service. It's probably in an activating state. A common reason is missing gquota option but it could be something more complicated like network connectivity.
The bootstrap file likely won't affect you since you're not doing cloud_provider=aws and autoscale groups.
@mazzystr don't worry. Scott is on it.
@sdodson I'll give that a whirl now.
I'm just now catching up on this issue. I've dug through the environment with @thoraxe
What's happening, and I'm sorry if this is already clear to everyone else, is that the hostname of the hosts has been changed after provisioning and before the installer runs. So hostname != metadata/hostname. That would not be valid if we were configuring the cloud provider, however we're not configuring cloud provider integration so we're fine. However, even when the cloud provider is not configured we override facts['common']['hostname'] with the name from the metadata API. I think we should disable this metadata inspection whenever we're not configuring the provider.
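The proposed behavior can be sketched in a few lines of Python. This is an illustration of the idea only, not the actual openshift_facts.py code; the function and its `cloud_provider` parameter are hypothetical names.

```python
def effective_hostname(system_hostname, metadata_hostname, cloud_provider=None):
    """Choose the hostname fact under the proposal described above.

    The metadata name is trusted only when AWS cloud provider integration is
    being configured; otherwise the host's own name is kept.
    """
    if cloud_provider == "aws" and metadata_hostname:
        return metadata_hostname
    return system_hostname


# No cloud provider configured: the renamed host keeps its own name.
print(effective_hostname("master1.fuohooy.internal",
                         "ip-192-199-0-182.ec2.internal"))
```

With this gating, a host renamed after provisioning would keep its own name whenever the provider is not configured, so the pod-name check would line up with the node name.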
I'm trying to work through the implications of this during an upgrade however.
In short: an install on AWS would use the AWS metadata service to override hostnames. If the hostnames don't match, the install fails, so once this fix lands in release-3.10 only hostname would be used.