Openshift-ansible: when installing on AWS, node hostname and checked pod names don't match

Created on 17 Aug 2018 · 30 Comments · Source: openshift/openshift-ansible

Description

openshift-ansible-3.11.0-0.16.0
ansible-2.6.2-1.el7.noarch (epel)

Steps To Reproduce

run installer on aws-based hosts

Expected Results

successful installation

Observed Results

2018-08-17 12:37:20,201 p=19271 u=root |  failed: [master1.ceijaug.internal] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "results": {"cmd": "/bin/oc get pod master-etcd-ip-192-199-0-7.ec2.internal -o json -n kube-system", "results": [{}], "returncode": 0, "stderr": "Error from server (NotFound): pods \"master-etcd-ip-192-199-0-7.ec2.internal\" not found\n", "stdout": ""}, "state": "list"}

Additional Information

[root@master1 ~]# oc get node
NAME                       STATUS     ROLES     AGE       VERSION
master1.ceijaug.internal   NotReady   <none>    6m        v1.11.0+d4cacc0

Source

thoraxe

Most helpful comment

I'm just now catching up on this issue. I've dug through the environment with @thoraxe

What's happening, and I'm sorry if this is already clear to everyone else, is that the hostname of the hosts has been changed after provisioning and before the installer runs. So hostname != metadata/hostname. That would not be valid if we were configuring the cloud provider, however we're not configuring cloud provider integration so we're fine. However, even when the cloud provider is not configured we override facts['common']['hostname'] with the name from the metadata API. I think we should disable this metadata inspection whenever we're not configuring the provider.

I'm trying to work through the implications of this during an upgrade however.
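
The proposal above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, its parameters, and the boolean flag are hypothetical and are not taken from openshift_facts.py; the hostnames are ones that appear later in this thread.

```python
def effective_hostname(system_hostname, metadata_hostname, cloudprovider_configured):
    """Pick the hostname the installer should treat as authoritative.

    Hypothetical helper: names and signature are illustrative only and do
    not reflect the actual openshift_facts.py implementation.
    """
    # Trust the metadata API name only when cloud provider integration is
    # being configured; otherwise keep the name the operator set on the host.
    if cloudprovider_configured and metadata_hostname:
        return metadata_hostname
    return system_hostname


# The host was renamed after provisioning and no cloud provider is configured,
# so the operator's hostname should win.
print(effective_hostname("master1.fuohooy.internal",
                         "ip-192-199-0-182.ec2.internal",
                         cloudprovider_configured=False))
# -> master1.fuohooy.internal
```

With the flag set to True the metadata name would be returned instead, which matches the case where hostname and metadata hostname are required to agree.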

sdodson on 21 Aug 2018

👍3

All 30 comments

@sdodson another nodename problem.

michaelgugino on 17 Aug 2018

@thoraxe please provide inventory.

michaelgugino on 17 Aug 2018

https://gist.github.com/thoraxe/d058f5bff3ca963f90c1b739a56d11f9

thoraxe on 17 Aug 2018

Ansible playbooks are reaching out to AWS metadata to fetch the internal DNS name and replace whatever is set in ansible inventory if the host can be reached.

How were the hostnames (node1.ceijaug.internal, etc.) set?

Note that in order to override master-etcd-ip-192-199-0-7.ec2.internal you'd need Route53 config so that cloudprovider would know about this. A simple hostnamectl set-hostname won't work.

vrutkovs on 17 Aug 2018

👍1

[root@master1 ~]# dig master1.mahtaix.internal

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.mahtaix.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59599
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.mahtaix.internal.      IN      A

;; ANSWER SECTION:
master1.mahtaix.internal. 10    IN      A       192.199.0.164

;; Query time: 2 msec
;; SERVER: 192.199.0.2#53(192.199.0.2)
;; WHEN: Fri Aug 17 18:14:30 UTC 2018
;; MSG SIZE  rcvd: 69

We create a DNS zone.

I am not using the cloudprovider. Look in my inventory.
I am already using a Route53 config. This hostname resolves via dig.

thoraxe on 17 Aug 2018

That's weird. Maybe @mazzystr could help with the Route53 settings and find out why the playbook would still expect hostnames to end in .ec2.internal?

vrutkovs on 17 Aug 2018

I ran into this as well.

You will want to specify some combination of hostname and IP in the inventory.

Overriding Detected IP Addresses and Host Names
In AWS, situations that require overriding the variables include:
https://docs.openshift.com/container-platform/3.10/install_config/configuring_aws.html#overriding-detected-ip-addresses-host-names-aws

demo-mas01-usw1.demo.local openshift_hostname=demo-mas01-usw1.demo.local hostname=demo-mas01-usw1.demo.local ip=10.10.2.190 openshift_node_group_name=node-config-master
demo-wrk01-usw1.demo.local openshift_hostname=demo-wrk01-usw1.demo.local hostname=demo-wrk01-usw1.demo.local ip=10.10.2.92 openshift_node_group_name=node-config-compute
demo-wrk04-usw1.demo.local openshift_hostname=demo-wrk04-usw1.demo.local hostname=demo-wrk04-usw1.demo.local ip=10.10.4.111 openshift_node_group_name=node-config-infra

MatthewJSalerno on 17 Aug 2018

@MatthewJSalerno did you get a nasty deprecation warning about the use of openshift_hostname / hostname?

thoraxe on 17 Aug 2018

@thoraxe I did a quick grep through my output log and didn't see anything. Without those settings, all of my certs used the wrong hostname.

MatthewJSalerno on 17 Aug 2018

It would be great if we could disable the cloud provider check. I'm no ansible expert, but I went digging through the code and it looks like there is no bypass.

MatthewJSalerno on 17 Aug 2018

2018-08-17 20:56:29,947 p=19329 u=root |  TASK [openshift_control_plane : Wait for control plane pods to appear] *********
2018-08-17 20:56:29,947 p=19329 u=root |  task path: /root/openshift-ansible/roles/openshift_control_plane/tasks/main.yml:204
2018-08-17 20:56:30,319 p=19329 u=root |  Using module file /root/openshift-ansible/roles/lib_openshift/library/oc_obj.py
2018-08-17 20:56:30,430 p=19329 u=root |  Escalation succeeded
2018-08-17 20:56:30,761 p=19329 u=root |  FAILED - RETRYING: Wait for control plane pods to appear (60 retries left).Result was: {
    "attempts": 1,
    "changed": false,
    "invocation": {
        "module_args": {
            "all_namespaces": null,
            "content": null,
            "debug": false,
            "delete_after": false,
            "field_selector": null,
            "files": null,
            "force": false,
            "kind": "pod",
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",
            "name": "master-etcd-ip-192-199-0-228.ec2.internal",
            "namespace": "kube-system",
            "selector": null,
            "state": "list"
        }
    },
    "msg": {
        "cmd": "/bin/oc get pod master-etcd-ip-192-199-0-228.ec2.internal -o json -n kube-system",
        "results": [
            {}
        ],
        "returncode": 1,
        "stderr": "The connection to the server master.hiequid.openshiftdemos.com:443 was refused - did you specify the right host or port?\n",
        "stdout": ""
    },
    "retries": 61
}

OK, sure, the connection was refused. But that's not the issue.

"cmd": "/bin/oc get pod master-etcd-ip-192-199-0-228.ec2.internal -o json -n kube-system"

[root@master1 ~]# oc get node
NAME                       STATUS     ROLES     AGE       VERSION
master1.hiequid.internal   NotReady   <none>    3m        v1.11.0+d4cacc0
[root@master1 ~]# dig master1.hiequid.internal

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.hiequid.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16108
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.hiequid.internal.      IN      A

;; ANSWER SECTION:
master1.hiequid.internal. 1     IN      A       192.199.0.228

;; Query time: 0 msec
;; SERVER: 192.199.0.228#53(192.199.0.228)
;; WHEN: Fri Aug 17 21:00:51 UTC 2018
;; MSG SIZE  rcvd: 69

The inventory file is the same as in the OP with the exception of the hostnames being slightly different.

This task is in roles/openshift_control_plane/tasks/main.yml

The oc command is trying to fetch a pod name that doesn't correspond to the node name.

https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.0-0.17.0/roles/openshift_control_plane/tasks/main.yml#L204-L221

Specifically, in this case, openshift.node.nodename appears to have some metadata name for the server and not the actual specified inventory node name. Which is extra odd because while we are actually on AWS we are not configuring the AWS cloud provider. So it's doing this all on its own.

thoraxe on 17 Aug 2018

Have you looked through the AWS metadata to see if that hostname is there? Do any of the below commands return that name?

curl http://169.254.169.254/latest/meta-data/local-hostname
curl http://169.254.169.254/latest/meta-data/public-hostname
curl http://169.254.169.254/latest/meta-data/hostname

It's possibly getting populated around here

and here. 273 looks like where the values are being set

MatthewJSalerno on 18 Aug 2018

👍1

When in doubt, just execute the following command before the prereq and deploy:

sudo ip route add blackhole 169.254.169.254/32

It won't last past a reboot and it will keep the install from seeing the AWS metadata.

MatthewJSalerno on 19 Aug 2018

@MatthewJSalerno if we are detecting hostnames using pre-determined methods on AWS based on system bios we are likely to wreck quite a few things.

I have fully functioning forward-resolving DNS, and yet the hostnames I'm using in my inventory are being ignored, but only partially. The node gets the name I specified in inventory. But for whatever reason the test for the control plane pods is using the ec2-detected information, which I don't want to and did not ask to use.

[root@master1 ~]# hostname
master1.fuohooy.internal
[root@master1 ~]# dig master1.fuohooy.internal

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.fuohooy.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13731
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.fuohooy.internal.      IN      A

;; ANSWER SECTION:
master1.fuohooy.internal. 10    IN      A       192.199.0.182

;; Query time: 2 msec
;; SERVER: 192.199.0.2#53(192.199.0.2)
;; WHEN: Mon Aug 20 13:37:34 UTC 2018
;; MSG SIZE  rcvd: 69

[root@master1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:c0:87:54:77:a0 brd ff:ff:ff:ff:ff:ff
    inet 192.199.0.182/24 brd 192.199.0.255 scope global noprefixroute dynamic eth0
       valid_lft 2119sec preferred_lft 2119sec
    inet6 fe80::8c0:87ff:fe54:77a0/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:52:10:c1:fe brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

[root@master1 ~]# curl http://169.254.169.254/latest/meta-data/local-hostname
ip-192-199-0-182.ec2.internal

[root@master1 ~]# curl http://169.254.169.254/latest/meta-data/public-hostname
ec2-34-195-201-149.compute-1.amazonaws.com

[root@master1 ~]# curl http://169.254.169.254/latest/meta-data/hostname
ip-192-199-0-182.ec2.internal[root@master1 ~]#
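
The curl output above can be compared against a candidate name mechanically. The following is a minimal sketch, not part of the installer: the helper names are hypothetical, the metadata values are copied from the output above, and nothing here contacts the metadata service.

```python
# Base URL of the EC2 instance metadata service, as used by the curl commands.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def metadata_url(key):
    """Build the URL for one metadata key (hypothetical helper)."""
    return METADATA_BASE + key


def keys_matching(name, metadata):
    """Return the metadata keys whose value equals `name`, ignoring case."""
    return sorted(k for k, v in metadata.items()
                  if v.strip().lower() == name.lower())


# Values copied from the curl output in this comment.
observed = {
    "local-hostname": "ip-192-199-0-182.ec2.internal",
    "public-hostname": "ec2-34-195-201-149.compute-1.amazonaws.com",
    "hostname": "ip-192-199-0-182.ec2.internal",
}

print(keys_matching("ip-192-199-0-182.ec2.internal", observed))
# -> ['hostname', 'local-hostname']
print(keys_matching("master1.fuohooy.internal", observed))
# -> []
```

The empty second result is the mismatch being reported: no metadata key carries the name set on the host.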

While blackholing the metadata server is a workaround for this particular issue, it's not a fix for the fact that we are not using the provided inventory name.

thoraxe on 20 Aug 2018

I'm relatively sure it's related to this change.

https://github.com/openshift/openshift-ansible/blame/644601121664cc8930efd71843a8b7e0ef5e1109/roles/openshift_facts/library/openshift_facts.py#L488-L491

Clayton had previously suggested that we should figure out a way to use raw_hostname only on scale group nodes but hostname otherwise, decoupling this from bootstrapping.

sdodson on 20 Aug 2018

What play is being executed? provision_install.yml or deploy_cluster.yml?

mazzystr on 20 Aug 2018

In AWS, operators have to go through great effort to get hostname, A, and PTR records to match. If they don't match, then the following key is mandatory:
openshift_hostname_check=false

Also openshift_set_hostname is not a valid key. Please remove it.

mazzystr on 20 Aug 2018

@mazzystr I am running prereq and then deploy_cluster

Let me get an environment back up and validate that hostname, A, and PTR records match (or not).

thoraxe on 20 Aug 2018

@mazzystr so I have verified that the PTR record does not point at the A record/hostname:

[root@master1 ~]# dig -t ptr 7.0.199.192.in-addr.arpa

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> -t ptr 7.0.199.192.in-addr.arpa
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50572
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;7.0.199.192.in-addr.arpa.      IN      PTR

;; ANSWER SECTION:
7.0.199.192.in-addr.arpa. 60    IN      PTR     ip-192-199-0-7.ec2.internal.

;; Query time: 1 msec
;; SERVER: 192.199.0.7#53(192.199.0.7)
;; WHEN: Mon Aug 20 21:34:48 UTC 2018
;; MSG SIZE  rcvd: 94

[root@master1 ~]# hostname -f
master1.weiphar.internal
[root@master1 ~]# dig master1.weiphar.internal

; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7 <<>> master1.weiphar.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3030
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master1.weiphar.internal.      IN      A

;; ANSWER SECTION:
master1.weiphar.internal. 8     IN      A       192.199.0.7

;; Query time: 0 msec
;; SERVER: 192.199.0.7#53(192.199.0.7)
;; WHEN: Mon Aug 20 21:34:59 UTC 2018
;; MSG SIZE  rcvd: 69
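
The forward/reverse comparison done by hand with dig above can be expressed as a short check. This is a sketch under stated assumptions: the function names are hypothetical, the record tables are plain dictionaries filled with the answers shown above, and no DNS queries are made.

```python
import ipaddress


def reverse_lookup_name(ip):
    """Return the in-addr.arpa name that a PTR query for `ip` would use."""
    return ipaddress.ip_address(ip).reverse_pointer


def forward_and_reverse_match(hostname, forward_records, reverse_records):
    """True if hostname -> A -> PTR leads back to the same hostname."""
    ip = forward_records.get(hostname)
    if ip is None:
        return False
    ptr = reverse_records.get(reverse_lookup_name(ip))
    if ptr is None:
        return False
    # PTR answers are fully qualified, so drop the trailing dot before comparing.
    return ptr.rstrip(".").lower() == hostname.rstrip(".").lower()


# Answers copied from the two dig runs above.
forward = {"master1.weiphar.internal": "192.199.0.7"}
reverse = {"7.0.199.192.in-addr.arpa": "ip-192-199-0-7.ec2.internal."}

print(reverse_lookup_name("192.199.0.7"))
# -> 7.0.199.192.in-addr.arpa
print(forward_and_reverse_match("master1.weiphar.internal", forward, reverse))
# -> False
```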

That being said, setting openshift_hostname_check=false and removing openshift_set_hostname did not fix the issue:

2018-08-20 21:15:48,564 p=19278 u=root |  failed: [master1.weiphar.internal] (item=controllers) => {
    "attempts": 60,
    "changed": false,
    "invocation": {
        "module_args": {
            "all_namespaces": null,
            "content": null,
            "debug": false,
            "delete_after": false,
            "field_selector": null,
            "files": null,
            "force": false,
            "kind": "pod",
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",
            "name": "master-controllers-ip-192-199-0-7.ec2.internal",
            "namespace": "kube-system",
            "selector": null,
            "state": "list"
        }
    },
    "item": "controllers",
    "results": {
        "cmd": "/bin/oc get pod master-controllers-ip-192-199-0-7.ec2.internal -o json -n kube-system",
        "results": [
            {}
        ],
        "returncode": 0,
        "stderr": "Error from server (NotFound): pods \"master-controllers-ip-192-199-0-7.ec2.internal\" not found\n",
        "stdout": ""
    },
    "state": "list"
}

It is still failing by checking for pod names constructed from AWS' hostnames and not the inventory or forward DNS names.

I can probably fix the PTR problem but that still seems like a workaround given the installer is doing something I'm not asking it to do.

thoraxe on 20 Aug 2018

prereq + deploy_cluster are the correct plays for byo in AWS with no cloud_provider.

The instances are fouled due to the bad keys. Please terminate and recreate them then rerun installer.

mazzystr on 21 Aug 2018

@mazzystr every time I submit a new comment I am doing a fresh deploy of the entire environment.

https://github.com/openshift/openshift-ansible/issues/9647#issuecomment-414472091

This is a completely fresh environment with the suggestion you requested.

It is still failing by checking for pod names constructed from AWS' hostnames and not the inventory or forward DNS names. The original inventory file is still essentially accurate. The systems have different hostnames, I've added openshift_hostname_check=false and removed openshift_set_hostname.

It is still failing by checking for pod names constructed from AWS' hostnames and not the inventory or forward DNS names. I am not sure what you mean by _due to the bad keys_. What keys?

thoraxe on 21 Aug 2018

re redeployment ... very good. just making sure.

re still failing ... i see you have the following in your inventory...

openshift_master_cluster_hostname=master.ceijaug.openshiftdemos.com
openshift_master_cluster_public_hostname=master.ceijaug.openshiftdemos.com
openshift_master_default_subdomain=apps.ceijaug.openshiftdemos.com

Do openshift_master_cluster_hostname and openshift_master_cluster_public_hostname resolve? Usually these are CNAMEs that point to the internal and internet-facing master ELBs.
Do you have the wildcard record set for openshift_master_default_subdomain? Usually this is a CNAME pointing at the infra ELB. You can have internal and internet-facing infra ELBs also. If so then you need two wildcard records.

mazzystr on 21 Aug 2018

@mazzystr This all works with 3.10. It doesn't work with master/3.11/latest.

_Do openshift_master_cluster_hostname and openshift_master_cluster_public_hostname resolve?_
Yes. All the DNS works and is in Route53.

_Do you have the wildcard record set for openshift_master_default_subdomain?_
Yes. All the DNS works and is in Route53.

Neither of those problems appears to me to be related to this GH issue, though, which is that the installer is assembling the wrong pod names when it checks status.

This whole config, installation process, and everything, works just fine with 3.10. It breaks in 3.11 specifically with the control plane check.

thoraxe on 21 Aug 2018

Can you take an inventory of SecurityGroups? SecurityGroups, rules, and assignments please.

mazzystr on 21 Aug 2018

Ensure that openshift_portal_net matches vpc cidr range.
osm_cluster_network_cidr is not necessary. Let the installer use the default setting.
openshift_master_overwrite_named_certificates is not necessary. Let the installer use the default setting.

For now, until we get this working, disable openshift_enable_service_catalog, template_service_broker_install, ansible_service_broker_install, and remove nfs.

mazzystr on 21 Aug 2018

I'm feeling frustrated because I want to achieve a quick resolution here. I am not sure how the things you are asking for are related to the issue at hand.

_Can you take an inventory of SecurityGroups? SecurityGroups, rules, and assignments please._

Would you be willing to explain the failure path by which a problem in any of those would cause Ansible to ignore the inventory name when assembling the control plane pod names?

Would you be willing to explain the failure path by which service_catalog, template_broker, or service_broker would cause Ansible to ignore the inventory name when assembling the control plane pod names?

Removing NFS will break my environment.

For reference, again, here is the specific task that is failing:
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.11.0-0.17.0/roles/openshift_control_plane/tasks/main.yml#L204-L221

Even more specifically:

name: "master-{{ item }}-{{ openshift.node.nodename | lower }}"

openshift.node.nodename is the AWS-metadata-based name, and is not the inventory-provided name.
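
To make the mismatch concrete, here is a minimal sketch of what that template evaluates to. The Python function is a hypothetical stand-in for the Jinja expression, and the assumption that the control plane pod is suffixed with the registered node name is mine; the names are taken from the original report.

```python
def control_plane_pod_name(component, nodename):
    """Mirror "master-{{ item }}-{{ openshift.node.nodename | lower }}"."""
    return "master-{}-{}".format(component, nodename.lower())


# Name the task queries for, built from the metadata-derived nodename fact.
queried = control_plane_pod_name("etcd", "ip-192-199-0-7.ec2.internal")

# Name that would follow from the node shown by `oc get node` (assumption).
from_node = control_plane_pod_name("etcd", "master1.ceijaug.internal")

print(queried)    # -> master-etcd-ip-192-199-0-7.ec2.internal
print(from_node)  # -> master-etcd-master1.ceijaug.internal
print(queried == from_node)  # -> False
```

The first name is the one in the failing oc get pod command, which is why the lookup returns NotFound until the retries run out.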

@sdodson alluded to https://github.com/openshift/openshift-ansible/blame/644601121664cc8930efd71843a8b7e0ef5e1109/roles/openshift_facts/library/openshift_facts.py#L488-L491 as being where the nodename fact is set.

Would you be willing to focus on where the nodename is being set?

thoraxe on 21 Aug 2018

You have to understand I'm flying blind here. Ref arch hasn't started work on 3.11 yet. Your infrastructure is completely unknown to me.

The 3 common things that cause the installer to fail are DNS, network connectivity, and not having the gquota option set on emptydir storage on instances/AMI. The installer will chug along happily even though various components are damaged.

Your node is not in Ready status. Do a systemctl status atomic-openshift-node.service. It's probably in an activating state. A common reason is missing gquota option but it could be something more complicated like network connectivity.

The bootstrap file likely won't affect you since you're not doing cloud_provider=aws and autoscale groups.

mazzystr on 21 Aug 2018

@mazzystr don't worry. Scott is on it.

@sdodson I'll give that a whirl now.

thoraxe on 21 Aug 2018

9956 should fix that

In short: an install on AWS would use the AWS metadata service to override hostnames. If the hostnames don't match, the install fails, so once this fix lands in release-3.10 only the hostname will be used.

vrutkovs on 7 Sep 2018

👍1
