1. What kops version are you running? The command kops version will display this information.
$ kops version
Version 1.8.0
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T19:59:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
3. What cloud provider are you using?
AWS in the us-east-1 region.
4. What commands did you run? What is the simplest way to reproduce this issue?
This issue occurs with both gossip-based and Route 53-based DNS; the examples below are from a Route 53 installation. I see the same issue with the previous three stable versions of CoreOS and with the default Debian Jessie images.
export AWS_ACCESS_KEY_ID="redacted"
export AWS_SECRET_ACCESS_KEY="redacted"
export AWS_SESSION_TOKEN="redacted"
export AWS_SECURITY_TOKEN="redacted"
export KOPS_STATE_STORE=s3://kops-state-redacted
export KOPS_STATE_S3_ACL=bucket-owner-full-control
export KOPS_CLUSTER_NAME='k8s.prod.redacted.com'
export KOPS_ZONES='us-east-1a,us-east-1b,us-east-1c'
kops create cluster ${KOPS_CLUSTER_NAME} \
--cloud aws \
--topology private \
--networking calico \
--dns public \
--dns-zone prod.redacted.com \
--master-zones ${KOPS_ZONES} \
--zones ${KOPS_ZONES} \
--node-size m4.large \
--image coreos.com/CoreOS-stable-1576.5.0-hvm \
--authorization RBAC \
--ssh-public-key ~/.ssh/id_rsa.pub
kops edit cluster ${KOPS_CLUSTER_NAME}
# Edit this:
#
#   networking:
#     calico: {}
#
# To this:
#
#   networking:
#     calico:
#       crossSubnet: true
kops edit ig --name ${KOPS_CLUSTER_NAME} nodes
# Set these:
#
#   maxSize: 10
#   minSize: 3
kops update cluster --yes ${KOPS_CLUSTER_NAME}
kops validate cluster # fails
kubectl get nodes --show-labels # fails
5. What happened after the commands executed?
Technically kops does everything that is expected of it. However, when attempting to validate or run any kubectl commands, we see something like:
$ kops validate cluster
Using cluster from kubectl context: k8s.prod.redacted.com
Validating cluster k8s.prod.redacted.com
cannot get nodes for "k8s.prod.redacted.com": Get https://api.k8s.prod.redacted.com/api/v1/nodes: EOF
The AWS console shows all three of the master nodes as OutOfService in the ELB health checks tab. Viewing the system console log for all six of the EC2 instances created gives no indication that the cloud-init user data has actually been applied, despite what looks to be the expected user data being attached to each node/launch configuration.
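For reference, this is roughly how I've been checking the bootstrap state on the instances themselves (a rough sketch only; it assumes SSH access via the bastion, and that nodeup's kops-configuration.service unit is what should have run):

# Confirm the expected user data is actually attached to the instance:
curl -s http://169.254.169.254/latest/user-data | head -n 40

# Check whether the kops bootstrap unit was ever installed and started
# (nodeup normally sets up kops-configuration.service):
systemctl status kops-configuration.service

# See whether any of the required container images have been pulled yet:
docker ps -a
docker images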
6. What did you expect to happen?
I expected all nodes to bootstrap successfully as they have in the past, and the masters to appear as healthy behind the API ELB.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -oyaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
$ kops get --name k8s.prod.redacted.com -oyaml
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  name: k8s.prod.redacted.com
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-state-redacted/k8s.prod.redacted.com
  dnsZone: prod.redacted.com
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.7.11
  masterInternalName: api.internal.k8s.prod.redacted.com
  masterPublicName: api.k8s.prod.redacted.com
  networkCIDR: 172.20.0.0/16
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 172.20.96.0/19
    name: us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 172.20.0.0/22
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  - cidr: 172.20.4.0/22
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  - cidr: 172.20.8.0/22
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: master-us-east-1a
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: master-us-east-1b
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: master-us-east-1c
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:22Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: nodes
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m4.large
  maxSize: 10
  minSize: 3
  role: Node
  subnets:
  - us-east-1a
  - us-east-1b
  - us-east-1c
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
I'll follow up with more verbose logs should they be needed. There is more to redact, and the issue may well already be clear to someone from the information posted above.
9. Anything else do we need to know?
Not as far as I'm aware, but I'll be happy to answer any questions that folks might have.
Thanks!
For what it's worth, this is happening to me too. I was seeing this behavior using kops installed from Homebrew at the HEAD of master.
I think the issue might be partly that you are trying to use k8s 1.9.x with kops 1.8.x; I'm not sure that is supported. I think HEAD might be in the same situation.
I'm similar to you in that I've installed kops using Homebrew, but version 1.8.0 in my case.
As for K8s, I've tried all of the 1.8.0-1.8.4 versions and also 1.7.11, all showing the same symptoms. Very peculiar.
Interesting. Well, at least you know you are not alone. This is currently a blocker for me. I took a look at a master node by SSHing into it, and it seemed like none of the expected API ports were open (checked using netstat).
So I am not sure whether the services are not starting up or what.
I'm fairly confident you're seeing no listening services because none of the nodes or masters are being bootstrapped at all. You may also find that none of the required Docker images have been pulled either.
Just to rule out the Homebrew builds of kops being "bad", I'll try tomorrow with a self-built binary or one from an official source.
I think you are right. I had also run docker ps -a, which yielded nothing; similarly, docker images came up empty. The nodes definitely are not being bootstrapped.
I'm not going to be near a computer until tomorrow, so unfortunately I can't test this suggestion myself, but I'm wondering if running the following could yield some information as to why the user-data provisioning shell script is failing. You'd need to sudo to root to run it.
bash <(curl -s http://169.254.169.254/latest/user-data)
If all went well (I'd imagine it would fail and hopefully yield a useful error message), then it ought to bootstrap the node.
So I ran that command, here is what I got:
bash <(curl -s http://169.254.169.254/latest/user-data)
== nodeup node config starting ==
Downloading nodeup (https://kubeupv2.s3.amazonaws.com/kops/1.8.0/linux/amd64/nodeup)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 35.0M 100 35.0M 0 0 16.0M 0 0:00:02 0:00:02 --:--:-- 16.1M
== Downloaded https://kubeupv2.s3.amazonaws.com/kops/1.8.0/linux/amd64/nodeup (SHA1 = 02185512f78dc9d15a8c10774c4cb11f67e4bc20) ==
Running nodeup
nodeup version 1.8.0 (git-5099bc553)
I0120 01:31:31.036607 928 install.go:149] Built service manifest "kops-configuration.service"
[Unit]
Description=Run kops bootstrap (nodeup)
Documentation=https://github.com/kubernetes/kops
[Service]
EnvironmentFile=/etc/environment
ExecStart=/var/cache/kubernetes-install/nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8
Type=oneshot
[Install]
WantedBy=multi-user.target
I0120 01:31:31.037195 928 install.go:69] No package task found; won't update packages
I0120 01:31:31.037349 928 topological_sort.go:63] Dependencies:
I0120 01:31:31.037418 928 topological_sort.go:65] Service/kops-configuration.service: []
I0120 01:31:31.037595 928 executor.go:91] Tasks: 0 done / 1 total; 1 can run
I0120 01:31:31.037661 928 executor.go:157] Executing task "Service/kops-configuration.service": Service: kops-configuration.service
I0120 01:31:31.037804 928 service.go:124] querying state of service "kops-configuration.service"
W0120 01:31:31.040772 928 service.go:203] Unknown ActiveState="activating"; will treat as not running
I0120 01:31:31.040844 928 changes.go:81] Field changed "Running" actual="false" expected="true"
I0120 01:31:31.040889 928 changes.go:81] Field changed "Enabled" actual="false" expected="true"
I0120 01:31:31.041008 928 service.go:345] Restarting service "kops-configuration.service"
It seemed to just sit there forever, seemingly doing nothing. Looking in the syslog, I see the following log messages:
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.071143 982 s3fs.go:213] Listing objects in S3 bucket "<**redacted**>-k8s-state-store" with prefix "<**redacted**>.k8s.local/pki/issued/ca/"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.172547 982 s3fs.go:239] Listed files in s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca: [s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/6512878379439447890072744430.crt s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/keyset.yaml]
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.172569 982 s3fs.go:176] Reading file "s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/6512878379439447890072744430.crt"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.184895 982 certificate.go:103] Parsing pem block: "CERTIFICATE"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.185000 982 s3fs.go:176] Reading file "s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/keyset.yaml"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: W0120 01:37:58.194794 982 main.go:141] got error running nodeup (will retry in 30s): error building loader: error fetching CA certificate from keystore: error in 'FindCert' attempting to load cert "ca": error loading certificate "s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/keyset.yaml": could not parse certificate
This then repeats every 30s. Not sure if this is the problem you are seeing, but it seems like it can't parse the certificate, and therefore the bootstrapping process does not complete.
I think this is related to #4156.
So, I found the root cause of my issue, and it wasn't related to #4156.
Basically, for the staging cluster (account 2, us-east-1) we were using a bucket via cross-account access (the bucket lives in account 1, us-east-2). This worked fine for a kops 1.7.0 / K8s 1.7.6 deployment.
Unfortunately, this didn't work in our production account (account 3, us-east-1) using the same bucket in account 1, us-east-2. During the node bootstrap process, nodeup was failing to pull the cluster.spec despite having full permission to do so.
It would first check the us-east-1 region for the bucket (which fails, since the bucket is actually in us-east-2) and then attempt to look up the bucket's region via GetBucketLocation, which also fails. The calling account has full access to the S3 bucket in account 1, so realistically this should not cause a problem.
I0122 11:28:24.922283 5459 files.go:100] Hash matched for "/var/cache/nodeup/sha1:f62360d3351bed837ae3ffcdee65e9d57511695a_https___kubeupv2_s3_amazonaws_com_kops_1_8_0_linux_amd64_utils_tar_gz": sha1:f62360d3351bed837ae3ffcdee65e9d57511695a
I0122 11:28:24.923928 5459 assetstore.go:202] added asset "utils.tar.gz" for &{"/var/cache/nodeup/sha1:f62360d3351bed837ae3ffcdee65e9d57511695a_https___kubeupv2_s3_amazonaws_com_kops_1_8_0_linux_amd64_utils_tar_gz"}
I0122 11:28:24.924144 5459 assetstore.go:303] added asset "socat" for &{"/var/cache/nodeup/extracted/sha1:f62360d3351bed837ae3ffcdee65e9d57511695a_https___kubeupv2_s3_amazonaws_com_kops_1_8_0_linux_amd64_utils_tar_gz/utils/socat"}
I0122 11:28:24.970641 5459 s3context.go:145] unable to get bucket location from region "us-east-1"; scanning all regions: AccessDenied: Access Denied
status code: 403, request id: 99082A90EB31B46F
W0122 11:28:25.051940 5459 main.go:141] got error running nodeup (will retry in 30s): error loading Cluster "s3://kops-state-redacted/k8s.prod.redacted.com/cluster.spec": Unable to list AWS regions: UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: d2480090-347f-4ca2-87be-ba6c2bd978ca
I believe that this is down to GetBucketLocation failing when the requester is not the owner of the bucket. I believe in those circumstances kops will brute-force the bucket location, but it seems that nodeup either never did that or no longer does. I'd imagine it used to, on the basis that this has worked before, but it could well be that it used to default to checking us-east-2 for some reason or another. I believe there was a reason that I originally chose to have the state bucket in that region, so that could have been it.
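For anyone wanting to confirm the same failure mode, this is roughly what reproduces it from an affected instance (a sketch only, assuming the AWS CLI is available there and substituting your real state-store bucket name):

# Run from an instance using the kops-created node/master instance profile.

# Cross-account GetBucketLocation fails when the caller isn't the bucket owner:
aws s3api get-bucket-location --bucket kops-state-redacted --region us-east-1

# nodeup then falls back to scanning every region, which needs ec2:DescribeRegions,
# a permission the kops 1.8 instance policies don't grant:
aws ec2 describe-regions --region us-east-1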
@chrisfu this seems to be an IAM issue, are we missing a permission? We reduced the IAM roles significantly in kops 1.8, did we miss something?
Hi @chrisfu. Good find, so it appears that an additional IAM permission is required for the cases where an S3 bucket is located in a region different to where the cluster has been built / instances are running.
Please could you test by adding the relevant IAM permission to the master & node policies? I believe something like:
{
  "Sid": "kopsK8sDescribeRegions",
  "Effect": "Allow",
  "Action": [
    "ec2:DescribeRegions"
  ],
  "Resource": [
    "*"
  ]
}
You can add additional IAM Policies straight into your kops ClusterSpec, like this: https://github.com/kubernetes/kops/blob/master/docs/iam_roles.md#adding-additional-policies
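Something along these lines in the ClusterSpec should do it (a rough sketch written from memory of the format in that doc, so please double-check the exact field names against it):

spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeRegions"],
          "Resource": ["*"]
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeRegions"],
          "Resource": ["*"]
        }
      ]

Then run kops update cluster ${KOPS_CLUSTER_NAME} --yes to apply the IAM changes.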
If this is the case, we would need to extend the master & node permissions defined here: https://github.com/kubernetes/kops/blob/master/pkg/model/iam/iam_builder.go
Related code performing the brute-force requests: https://github.com/kubernetes/kops/blob/cc67497/util/pkg/vfs/s3context.go#L181-L228
I'm facing the same issue though; does anyone have a solution to this?
@KashifSaadat Good find, that sorted it. Apologies for the long time to respond; I completely missed this!
@chrisfu no worries :)
@lxcid the latest version of kops should now include this fix, can you upgrade to the latest release? https://github.com/kubernetes/kops/releases/tag/1.9.0
If you can't for whatever reason, have a look at the following comment (attaching an additional IAM policy to the nodes): https://github.com/kubernetes/kops/issues/4301#issuecomment-361528463
I'll close this issue, feel free to re-open if you find it's still a problem on the latest release.
/close