Openshift-ansible: Installation fails on origin-master-api restarting attempt

Created on 1 Feb 2018 · 20 comments · Source: openshift/openshift-ansible

Description

Installation fails when the playbook tries to restart the origin-master-api service.

Version

Ansible

ansible 2.4.2.0
  config file = None
  configured module search path = [u'/home/aizi/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/aizi/.local/lib/python2.7/site-packages/ansible
  executable location = /home/aizi/.local/bin/ansible
  python version = 2.7.13 (default, Nov 24 2017, 17:33:09) [GCC 6.3.0 20170516]

openshift-ansible-3.9.0-0.35.0-8-g1a58f7fc7

Steps To Reproduce

ansible-playbook -i os-hosts openshift-ansible/playbooks/prerequisites.yml
ansible-playbook -i os-hosts openshift-ansible/playbooks/deploy_cluster.yml

Failure summary:

  1. Hosts:    master.dom
     Play:     Configure masters
     Task:     restart master api
     Message:  Unable to restart service origin-master-api: Job for origin-master-api.service failed because the control process exited with error code. See "systemctl status origin-master-api.service" and "journalctl -xe" for details.

Inventory file

[OSEv3:children]
masters
nodes
etcd

[masters]
master.dom

[nodes]
master.dom
node1.dom openshift_node_labels="{'region': 'infra','zone': 'default'}"
node2.dom
#="{'region': 'primary', 'zone': 'default'}"

[etcd]
master.dom

#[masters:vars]
#ansible_become=true

#[nodes:vars]
#ansible_become=true


[OSEv3:vars]
ansible_user=vagrant
ansible_become=true


openshift_deployment_type=origin

openshift_enable_service_catalog=false
openshift_service_catalog_image_prefix=openshift/origin-
openshift_service_catalog_image_version=latest

# You must enable Network Time Protocol (NTP) to prevent masters and nodes in the cluster from going out of sync.
openshift_clock_enabled=true

# Let's change checks values for now
openshift_disable_check=memory_availability,disk_availability
#docker_storage

prerequisites.log
gist

deploy_cluster.log
gist

Additional Information

My host runs Debian Stretch, but I get the same error from a fresh CentOS host.
I use VirtualBox as the VM provider, with three boxes (the official CentOS box), each with 2 GB RAM and 2 vCPUs.

I've also tried the release-3.7 branch, and the openshift_release=v3.7 variable on the master branch, but got the same error.

Labels: lifecycle/rotten, needinfo

Source

mate201

All 20 comments

Could you also attach the output of journalctl -b -el --unit=origin-master-api.service from the master?

vrutkovs on 1 Feb 2018

👍1

Here you go!

api log

mate201 on 1 Feb 2018

Hmm, interesting.

So master fails to start as it can't connect to etcd:
F0201 21:39:38.430245 1030 start_api.go:67] [could not reach etcd(v2): client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused

etcd service seems to be running, ~~but I've noticed firewalld has opened 2380 instead of 2379.~~ and iptables seems to allow 2379 and 2380 there

Could you try rerunning this with os_firewall_enabled: false? I'm not really familiar with the Vagrant setup, but something else might be blocking the connection.

vrutkovs on 1 Feb 2018

👍1
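
The "connection refused" in the log line above can be reproduced independently of the installer. The following is a minimal sketch, assuming Python 3 is available on the master; the helper name `can_connect` is made up for illustration and is not part of openshift-ansible. It only tests whether a TCP connection to the etcd ports mentioned in this thread can be opened from the master itself.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 2379 is the port from the log line; 2380 is the other port discussed above.
    for port in (2379, 2380):
        state = "reachable" if can_connect("127.0.0.1", port) else "refused or unreachable"
        print(f"127.0.0.1:{port} is {state}")
```

If 127.0.0.1:2379 is refused while the etcd service is running, a firewall is only one possible cause; etcd may also be listening on a different address than the one the API server dials.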

@vrutkovs thank you for the fast response. Where should I add this line? In deploy_cluster.yml?

mate201 on 2 Feb 2018

> Where should I inject this line? In deploy_cluster.yml?

In the inventory file, in the [OSEv3] group.

vrutkovs on 2 Feb 2018

👍2

I added it to the [OSEv3:vars] group as os_firewall_enabled=false, because my hosts file uses the INI format. I also tried creating a separate [OSEv3] group and putting the setting there, but that didn't work either.

I'll try to create a fresh base image and test the installation on it. I think there are some problems with the base Vagrant image.

mate201 on 2 Feb 2018
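
In an INI inventory, group variables belong under a `[group:vars]` section; lines in a plain `[group]` section are read as host entries. To double-check where a setting actually landed, a rough sketch like the one below can help. It is a simplified line scanner written for this thread, not Ansible's inventory parser, and `find_inventory_var` and `SAMPLE` are illustrative names.

```python
def find_inventory_var(inventory_text: str, name: str):
    """Return (section, value) pairs for every `name=value` line in an INI inventory.

    Simplified line scanner for illustration only; it is not Ansible's parser.
    """
    section = None
    found = []
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == name:
            found.append((section, value.strip()))
    return found


# Shortened copy of the inventory from the report, with the extra setting added.
SAMPLE = """\
[OSEv3:children]
masters

[masters]
master.dom

[OSEv3:vars]
ansible_user=vagrant
os_firewall_enabled=false
"""

for section, value in find_inventory_var(SAMPLE, "os_firewall_enabled"):
    where = "group variable" if section and section.endswith(":vars") else "not in a :vars section"
    print(f"[{section}] os_firewall_enabled={value} ({where})")
```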

@vrutkovs which branch is stable? Could I use, for example, origin/release-3.7?

mate201 on 2 Feb 2018

I think that one is related. Will try to test that too.

https://github.com/openshift/openshift-ansible/issues/6087

mate201 on 2 Feb 2018

> Could I use for example origin/release-3.7?

All release-* branches are considered stable; master would install 3.9, which is not yet released.

vrutkovs on 2 Feb 2018

👍1

I found the issue yesterday. The official Vagrant CentOS box contains this line in /etc/hosts. It's the first line, by the way.

127.0.0.1 node2.dom node2.dom # When you change hostname in /etc/hosts, you should normally rename hostname here as well.

It should be removed or commented out. If CentOS is installed from scratch, this line doesn't exist and the installation works fine.

I think an additional check for this should be added to the playbook.

mate201 on 5 Feb 2018

👍1
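
The kind of check suggested above could look roughly like this. It is a sketch only, not code from openshift-ansible: `loopback_aliases` and `SAMPLE_HOSTS` are hypothetical names, and the function simply scans the text of a hosts file for the given hostnames mapped to a loopback address (127.0.0.0/8 or ::1).

```python
import ipaddress


def loopback_aliases(hosts_text: str, names):
    """Return {name: address} for each of `names` mapped to a loopback address."""
    wanted = set(names)
    hits = {}
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        address, *aliases = line.split()
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # first field is not an IP address
        if ip.is_loopback:
            for alias in aliases:
                if alias in wanted:
                    hits.setdefault(alias, address)
    return hits


# First line mirrors the entry found in the Vagrant box; the rest are typical defaults.
SAMPLE_HOSTS = """\
127.0.0.1 node2.dom node2.dom
127.0.0.1 localhost localhost.localdomain
::1 localhost
"""

print(loopback_aliases(SAMPLE_HOSTS, ["node2.dom"]))
```

On a real machine the text would come from reading /etc/hosts, and the names to check would be the inventory hostnames of that machine.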

Sounds like a Vagrant-specific issue, not related to openshift-ansible.

This repo can't detect whether it's an install in Vagrant, or whether any lines in /etc/hosts should be removed.

vrutkovs on 13 Feb 2018

I believe a simple check could easily be added to the playbook. It would save a lot of time and headaches for people who use Vagrant for testing. By the way, this line also exists in Debian and, if I'm not mistaken, in SLES too. Of course you don't use these distros for now, but who knows.

mate201 on 15 Feb 2018

Hi,
Same problem here on CentOS 7.
It seems that etcd is configured to listen only on a specific interface.
Wouldn't it be easiest to listen on all interfaces, as is done for the other services?
Could this be done by setting the listen URL to 0.0.0.0?

grb19 on 20 Feb 2018

👍1

I had exactly the same issue as @vrutkovs during the installation of OpenShift Origin 3.9.

The problem was that I used the wrong IP in the /etc/hosts file.

I wrote this after the first 2 default config lines:
127.0.0.1 hostname hostname.domain

The correct way is to let DNS return the right IP, or to use the LAN IP:
192.168.x.x hostname hostname.domain

If you use 127.0.0.1 in /etc/hosts, the origin-master-api container tries to access itself on port 2379 instead of the container host (the master).

Suniastar on 5 Jul 2018

👍4

I'm facing a similar issue with release-3.11 and CentOS 7.6.1810:
The kube API tries to connect to the local etcd instance using the hostname ose-master1, but /etc/hosts maps that name to 127.0.1.1. This would be fine if etcd listened on 0.0.0.0, but it does not. I'd rather not delete the record from /etc/hosts, because it is probably there so that hostname -f returns the correct name. What would be the cleanest solution? Can we make etcd listen on 0.0.0.0?

danielkucera on 8 Aug 2019

My temporary solution is:

[etcd:vars]
etcd_listen_client_urls="https://0.0.0.0:2379"

Is there any reason for this not being default?

danielkucera on 8 Aug 2019
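
Why the wildcard address helps here can be illustrated with a small model of the bind address. This is a sketch under simplifying assumptions, not etcd code: `listener_accepts` is a made-up helper, 192.168.0.10 is a placeholder LAN address, and the dual-stack behaviour of `::` is ignored. A listener bound to one specific address accepts connections only to that address, while a listener bound to 0.0.0.0 accepts connections to any local IPv4 address, including 127.0.1.1.

```python
import ipaddress
from urllib.parse import urlsplit


def listener_accepts(listen_url: str, target_ip: str) -> bool:
    """Return True if a socket bound per listen_url would accept connections to target_ip.

    Models only the bind address: a wildcard bind accepts any local address
    of the same IP version, a specific bind accepts only that address.
    """
    bound = ipaddress.ip_address(urlsplit(listen_url).hostname)
    target = ipaddress.ip_address(target_ip)
    if bound.is_unspecified:  # 0.0.0.0 or ::
        return bound.version == target.version
    return bound == target


# 192.168.0.10 is a placeholder for the master's LAN address.
for url in ("https://192.168.0.10:2379", "https://0.0.0.0:2379"):
    print(url, "->", listener_accepts(url, "127.0.1.1"))
```

One trade-off to keep in mind: binding to 0.0.0.0 also makes the etcd client port reachable on every network the host is attached to, so access then depends on firewall rules and client certificate checks alone.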

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot on 2 Jun 2020

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot on 2 Jul 2020

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot on 1 Aug 2020

@openshift-bot: Closing this issue.