Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
Environment:
printf "$(uname -srm)\n$(cat /etc/os-release)\n"):CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root@node2 ~]# exit
logout
Connection to 192.168.122.128 closed.
[root@localhost kubespray]# ssh 192.168.122.5
Last login: Thu Jun 14 18:54:43 2018 from 192.168.122.245
[root@node3 ~]#
[root@node3 ~]# printf "$(uname -srm)\n$(cat /etc/os-release)\n"
Linux 3.10.0-862.el7.x86_64 x86_64
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root@node3 ~]#
Version of Ansible (ansible --version):
Kubespray version (commit) (git rev-parse --short HEAD):
[root@localhost kubespray]# git rev-parse --short HEAD
0686b84
[root@localhost kubespray]#
Network plugin used:
No.
Copy of your inventory file:
[root@localhost kubespray]# cat inventory/mycluster/hosts.ini
[all]
node1 ansible_host=192.168.122.64 ip=192.168.122.64
node2 ansible_host=192.168.122.128 ip=192.168.122.128
node3 ansible_host=192.168.122.5 ip=192.168.122.5
[kube-master]
node1
node2
[kube-node]
node1
node2
node3
[etcd]
node1
node2
node3
[k8s-cluster:children]
kube-node
kube-master
[calico-rr]
[vault]
node1
node2
node3
[root@localhost kubespray]#
Command used to invoke ansible:
ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml
Output of ansible run:
TASK [etcd : include_tasks] *****************************************************
Thursday 14 June 2018 18:55:38 +0530 (0:00:01.421) 0:06:02.708 **
included: /opt/kubespray/roles/etcd/tasks/upd_ca_trust.yml for node2, node3
TASK [etcd : Gen_certs | target ca-certificate store file] ******************************************
Thursday 14 June 2018 18:55:39 +0530 (0:00:00.505) 0:06:03.214 **
ok: [node2]
ok: [node3]
TASK [etcd : Gen_certs | add CA to trusted CA dir] *********************************************
Thursday 14 June 2018 18:55:39 +0530 (0:00:00.833) 0:06:04.048 ***
fatal: [node2]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
fatal: [node3]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
NO MORE HOSTS LEFT *********************************************************
to retry, use: --limit @/opt/kubespray/cluster.retry
PLAY RECAP ***********************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0
node1 : ok=15 changed=1 unreachable=0 failed=1
node2 : ok=162 changed=10 unreachable=0 failed=1
node3 : ok=158 changed=13 unreachable=0 failed=1
...
I'm encountering the same problem. My setup is basically the same, but on Ubuntu; it looks like the certs are not being generated.
I encountered the same issue. In my case the master node was low on resources (memory), but the play didn't stop on that task (the check for enough available memory). So basically no tasks ran on the master node, including the cert generation. Could this be your case?
I'm also using Ubuntu (16.04) on Azure.
the same case =(
@naumvd95 have you checked that all of the previous tasks on the master nodes ran before the play failed on the certificate tasks?
I kind of solved the problem by changing "bootstrap_os" to "none" in all.yml. I don't know why that fixed it, but with "ubuntu" there it does not work...
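For reference, a minimal sketch of that change (the file typically lives under inventory/mycluster/group_vars/; the exact path and the accepted values depend on your Kubespray version):
# inventory/mycluster/group_vars/all.yml (path assumed)
# "none" skips the OS bootstrap tasks entirely; distro names such as
# "ubuntu" or "centos" enable distro-specific host preparation.
bootstrap_os: none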
If this is a memory problem (that was my case too), you can see these logs before the end:
TASK [kubernetes/preinstall : Stop if memory is too small for masters] ******************************
Wednesday 08 August 2018 09:25:57 +0200 (0:00:00.592) 0:06:30.520 ******
fatal: [infra-k8s-master-01]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1500",
"changed": false,
"evaluated_to": false
}
fatal: [infra-k8s-master-02]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1500",
"changed": false,
"evaluated_to": false
}
TASK [kubernetes/preinstall : Stop if memory is too small for nodes] ********************************
Wednesday 08 August 2018 09:25:57 +0200 (0:00:00.419) 0:06:30.939 ******
fatal: [infra-k8s-worker-01]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1024",
"changed": false,
"evaluated_to": false
}
fatal: [infra-k8s-worker-02]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1024",
"changed": false,
"evaluated_to": false
}
fatal: [infra-k8s-worker-03]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1024",
"changed": false,
"evaluated_to": false
}
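A quick way to check what Ansible sees before re-running the whole play (a sketch, assuming the inventory path used earlier in this thread):
# Print the memory fact the assertions above test against; masters
# need >= 1500 MB and nodes >= 1024 MB per the log output.
ansible -i inventory/mycluster/hosts.ini all -m setup -a 'filter=ansible_memtotal_mb'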
@ltupin interesting, so your play did fail on the check memory task...
Not exactly; it continued after this fatal error, and at the end I get the certificate error:
TASK [kubernetes/secrets : Gen_certs | add CA to trusted CA dir] ************************************
Wednesday 08 August 2018 09:33:56 +0200 (0:00:00.458) 0:14:30.345 ******
fatal: [infra-k8s-etcd-01]: FAILED! => {"changed": false, "msg": "Source /etc/kubernetes/ssl/ca.pem not found"}
fatal: [infra-k8s-etcd-03]: FAILED! => {"changed": false, "msg": "Source /etc/kubernetes/ssl/ca.pem not found"}
fatal: [infra-k8s-etcd-02]: FAILED! => {"changed": false, "msg": "Source /etc/kubernetes/ssl/ca.pem not found"}
Exactly same as you.
Hi,
any results on this? Have you had the chance to fix it?
My logs are as follows:
TASK [etcd : Gen_certs | add CA to trusted CA dir] *************
fatal: [aadigital2]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
fatal: [aadigital3]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
Any ideas to look at? The nodes should have enough memory and disk space. It seems that the certs are not generated at all.
Thanks!
@mbecker what was the result of your task "Stop if memory is too small for masters"?
Hi @lgg42,
the logs for that task are as follows:
TASK [kubernetes/preinstall : Stop if memory is too small for masters] *******
task path: /home/mbecker/***/kubespray/roles/kubernetes/preinstall/tasks/verify-settings.yml:52
skipping: [aadigital2] => {
"changed": false,
"skip_reason": "Conditional result was False"
}
ok: [aadigital1] => {
"changed": false,
"msg": "All assertions passed"
}
skipping: [aadigital3] => {
"changed": false,
"skip_reason": "Conditional result was False"
}
The nodes have 32 GB RAM, so I do not think that this is the problem ;-) Thanks for your help! Is there any way to skip the generation/copying of certificates, and maybe generate our own certs and update the configuration of each component manually?
@mbecker sorry that I can't be of more help here, I'm pretty far from a PC :relieved:
What about going through all the conditions (within all tasks) that must be met to generate the certs? Chances are one or several of them failed.
I'm guessing you already tried to run the playbook several times, right? It's not the right way, but just to rule out some other issues.
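A sketch of how to surface those skipped conditions (the log file name is just an example; the skip reason is printed with enough verbosity):
# Re-run with extra verbosity and capture the output, then look for
# tasks that were skipped because their condition evaluated to false.
ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml -vv 2>&1 | tee kubespray-run.log
grep -B 3 'Conditional result was False' kubespray-run.log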
I have the same problem here:
On Debian Stretch with tag/branch v2.8.0 or master.
TASK [etcd : Gen_certs | add CA to trusted CA dir] ********************************************************************************************************************************************
Wednesday 05 December 2018 13:55:16 +0100 (0:00:00.054) 0:02:54.371 ****
fatal: [node4]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
fatal: [node5]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
node:
2 x86-64 cores
2 GB memory
master:
4 x86-64 cores
4 GB memory
Same problem on 2.8.0 on Ubuntu 18.04.
@henres 2 GB for your nodes could be the reason (that was mine). Can you try with more memory, and check the logs for the trace I mentioned earlier in this thread?
I'm seeing the same:
TASK [etcd : Gen_certs | add CA to trusted CA dir] *****************************************************************************************************************************************************************************************************************************
Thursday 13 December 2018 16:06:06 -0500 (0:00:00.157) 0:02:29.444 *****
fatal: [w2]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
fatal: [w1]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
fatal: [w3]: FAILED! => {"changed": false, "msg": "Source /etc/ssl/etcd/ssl/ca.pem not found"}
Worth noting, the nodes that are failing are not in the etcd group. Should this task be run on these nodes?
[kube-master]
m1
m2
m3
[etcd]
m1
m2
m3
[kube-node]
w1
w2
w3
[k8s-cluster:children]
kube-master
kube-node
I was able to resolve this. In my case, the deployment for the nodes in the etcd group was failing at the assertion stage (in particular, on the assertion "Stop if RBAC and anonymous-auth are not enabled when insecure port is disabled"; resolved by setting kube_api_anonymous_auth=true). Once the failing assertion was fixed, the original error went away.
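For anyone hitting the same assertion, a minimal sketch of applying that override (the group_vars path is an assumption; adjust to your layout):
# Pass it as an extra-var on the command line ...
ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml -e kube_api_anonymous_auth=true
# ... or persist it in group vars, e.g. in
# inventory/mycluster/group_vars/k8s-cluster.yml:
#   kube_api_anonymous_auth: true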
Having the same issue, though my etcd nodes are separate from masters and workers (dedicated hosts for each type). The etcd nodes fail the "Stop if RBAC and anonymous-auth are not enabled when insecure port is disabled" assertion; I just fail to see how that could possibly have any effect, since the etcd nodes are entirely out of the cluster (without kubelet or anything, I would presume).
I can confirm the 'solution' for version 2.8.1: kube_api_anonymous_auth=true
Thanks @macarpen :)
I also discovered that the bug isn't present at the current commit of the master branch (39d7503). It works fine without kube_api_anonymous_auth=true.
Same problem on 2.8.1 on Ubuntu 16.04.
Also on 2.8.2 on Ubuntu 16.04
I solved it for my case! My problem was that an unrelated error had stopped the playbook on the master node. Since that node is the owner of the CA file, the CA was never generated, and thus the other nodes could not obtain a copy of it.
Looking at the output people have posted here, it looks like this could actually be the case for more than just myself. And if so, @lgg42's intuition behind the question "have you checked that all of the previous tasks on the master nodes ran before the play failed on the certificate tasks?" was right.
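A quick way to confirm this failure mode before re-running the full play is to check whether the CA ever landed on the first etcd host (a sketch; adjust the inventory path to yours):
# The CA is generated on the first host in the [etcd] group and then
# distributed to the others; if it is missing there, every copy fails.
ansible -i inventory/mycluster/hosts.ini 'etcd[0]' -m stat -a 'path=/etc/ssl/etcd/ssl/ca.pem'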
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Same problem here.
For all my hosts in the kube-node group, /etc/ssl/etcd/ssl/ca.pem is missing at this step.
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.