Installation fails when attempting to restart origin-master-api.
Ansible
ansible 2.4.2.0
config file = None
configured module search path = [u'/home/aizi/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /home/aizi/.local/lib/python2.7/site-packages/ansible
executable location = /home/aizi/.local/bin/ansible
python version = 2.7.13 (default, Nov 24 2017, 17:33:09) [GCC 6.3.0 20170516]
openshift-ansible-3.9.0-0.35.0-8-g1a58f7fc7
Failure summary:
1. Hosts: master.dom
Play: Configure masters
Task: restart master api
Message: Unable to restart service origin-master-api: Job for origin-master-api.service failed because the control process exited with error code. See "systemctl status origin-master-api.service" and "journalctl -xe" for details.
Inventory file
[OSEv3:children]
masters
nodes
etcd
[masters]
master.dom
[nodes]
master.dom
node1.dom openshift_node_labels="{'region': 'infra','zone': 'default'}"
node2.dom
#="{'region': 'primary', 'zone': 'default'}"
[etcd]
master.dom
#[masters:vars]
#ansible_become=true
#[nodes:vars]
#ansible_become=true
[OSEv3:vars]
ansible_user=vagrant
ansible_become=true
openshift_deployment_type=origin
openshift_enable_service_catalog=false
openshift_service_catalog_image_prefix=openshift/origin-
openshift_service_catalog_image_version=latest
# You must enable Network Time Protocol (NTP) to prevent masters and nodes in the cluster from going out of sync.
openshift_clock_enabled=true
# Let's change checks values for now
openshift_disable_check=memory_availability,disk_availability
#docker_storage
prerequisites.log: gist
deploy_cluster.log: gist
As the host I'm using Debian Stretch, but I get the same error from a fresh CentOS host.
As the VM provider I'm using VirtualBox, where I have three boxes (the official CentOS box) with 2GB RAM and 2 vCPUs each.
I've tried the release-3.7 branch, and also the openshift_release=v3.7 variable on the master branch, but got the same error.
Could you also attach the output of journalctl -b -el --unit=origin-master-api.service from the master?
Here you go !
Hmm, interesting.
So master fails to start as it can't connect to etcd:
F0201 21:39:38.430245 1030 start_api.go:67] [could not reach etcd(v2): client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
The etcd service seems to be running, but I've noticed firewalld has opened 2380 instead of 2379, while iptables seems to allow both 2379 and 2380 there.
Could you try rerunning this with os_firewall_enabled: false? I'm not really familiar with the Vagrant setup, but it might be something else blocking the connection.
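To narrow down a "connection refused" like the one in the log above, it can help to check from the master itself whether anything accepts TCP connections on the etcd client port. The following is only a rough sketch using the Python standard library; the helper name port_open is made up for illustration and is not part of openshift-ansible.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False.

    Hypothetical helper for illustration only. A False result covers both
    "connection refused" and "unreachable / timed out".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and name resolution errors.
        return False
```

Running port_open("127.0.0.1", 2379) and port_open("master.dom", 2379) on the master would show whether etcd answers on the loopback address, on the host's own name, or on neither. If only one of them succeeds, the firewall is probably not the cause.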
@vrutkovs thank you for the fast response. Where should I add this line? In deploy-cluster.yml?
In the inventory file, in [OSEv3] group
I've included it in the [OSEv3:vars] group as os_firewall_enabled=false, because I use the INI format in the hosts file. I've also tried creating a separate [OSEv3] group and putting the setting there, but that didn't work either.
I'll try to create a fresh base image and test the installation on it. I think there are some problems with the base Vagrant image.
@vrutkovs which branch is stable? Could I use, for example, origin/release-3.7?
I think that one is related. Will try to test that too.
All release-* branches are considered stable. master would install 3.9, which is not yet released.
I found the issue yesterday. The official Vagrant CentOS box contains this line in /etc/hosts. It's the first line, by the way.
127.0.0.1 node2.dom node2.dom # When you change hostname in /etc/hosts, you should normally rename hostname here as well.
It should be removed or commented out. If CentOS is installed from scratch, this line doesn't exist and the installation works fine.
I think an additional check for this should be added to the playbook.
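As a rough idea of what such a check could look like, here is a small Python sketch that scans the contents of /etc/hosts for loopback entries naming a given host. It is only an illustration under my own assumptions: the function name loopback_entries is hypothetical, and nothing like it exists in openshift-ansible as far as this thread shows.

```python
import ipaddress


def loopback_entries(hosts_text, hostname):
    """Return the loopback IPs that hosts_text maps hostname to.

    hosts_text is the content of a hosts file; hostname is the name to
    look for. Hypothetical helper, for illustration only.
    """
    hits = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into "IP name [alias ...]".
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # not a plain IP address, skip the line
        if addr.is_loopback and hostname in names:
            hits.append(ip)
    return hits
```

On an affected box, loopback_entries(open("/etc/hosts").read(), "node2.dom") would return a non-empty list such as ["127.0.0.1"], which a pre-flight step could treat as a warning.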
Sounds like a vagrant-specific issue, not related to openshift-ansible.
This repo can't detect whether it's an install in Vagrant, or whether any lines in /etc/hosts should be removed.
I believe that a simple check could easily be added to the playbook. It would save a lot of time and headaches for people who use Vagrant to test various things. By the way, this line also exists in Debian and, if I'm not mistaken, in SLES too. Of course you don't use these distros for now, but who knows.
Hi,
Same problem here on CentOS 7.
It seems like etcd is configured to listen only on a specific interface.
Wouldn't it be easiest to just listen on all interfaces, as is done for the other services?
Could this be done by setting the listen URL to 0.0.0.0?
I had exactly the same issue as @vrutkovs during the installation of OpenShift Origin 3.9.
The problem was that I used the wrong ip in the /etc/hosts file.
I wrote this after the first 2 default config lines:
127.0.0.1 hostname hostname.domain
The correct way would be to simply let DNS give you the right IP, or to use the LAN IP:
192.168.x.x hostname hostname.domain
If you use 127.0.0.1 in /etc/hosts, the origin-master-api container tries to access itself on port 2379 rather than the container host / master.
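A quick way to see which address the master's hostname actually resolves to (through /etc/hosts or DNS) is to ask the system resolver. The sketch below is just one possible approach using the Python standard library; the function names and the injectable resolve parameter are my own choices for illustration.

```python
import ipaddress
import socket


def resolved_addresses(hostname, resolve=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPs that hostname resolves to."""
    infos = resolve(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address string.
    return sorted({info[4][0] for info in infos})


def resolves_to_loopback(hostname, resolve=socket.getaddrinfo):
    """Return True if any resolved address is a loopback address."""
    return any(
        # Strip a possible IPv6 zone suffix such as "%lo" before parsing.
        ipaddress.ip_address(addr.split("%")[0]).is_loopback
        for addr in resolved_addresses(hostname, resolve)
    )
```

If resolves_to_loopback(socket.gethostname()) returns True on the master, the hostname is mapped to 127.x.x.x and the situation described above is likely.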
I'm facing a similar issue with release-3.11 and CentOS 7.6.1810:
The kube API tries to connect to the local etcd instance using the hostname ose-master1, but this is defined in /etc/hosts as 127.0.1.1. This would be OK if etcd were listening on 0.0.0.0, but it is not. I'd rather not delete the record from /etc/hosts, because it is probably there so that hostname -f reports correctly. What would be the cleanest solution? Can we make etcd listen on 0.0.0.0?
My temporary workaround is:
[etcd:vars]
etcd_listen_client_urls="https://0.0.0.0:2379"
Is there any reason for this not being default?
@openshift-bot: Closing this issue.