1. What kops version are you running? The command kops version will display this information.
$ kops version
Version 1.8.0
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T19:59:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
3. What cloud provider are you using?
AWS in the us-east-1 region.
4. What commands did you run? What is the simplest way to reproduce this issue?
This issue occurs with both gossip-based and Route 53-based DNS; the examples below are from a Route 53 installation. I see the same issue with the previous three stable versions of CoreOS and with the default Debian Jessie images.
export AWS_ACCESS_KEY_ID="redacted"
export AWS_SECRET_ACCESS_KEY="redacted"
export AWS_SESSION_TOKEN="redacted"
export AWS_SECURITY_TOKEN="redacted"
export KOPS_STATE_STORE=s3://kops-state-redacted
export KOPS_STATE_S3_ACL=bucket-owner-full-control
export KOPS_CLUSTER_NAME='k8s.prod.redacted.com'
export KOPS_ZONES='us-east-1a,us-east-1b,us-east-1c'
kops create cluster ${KOPS_CLUSTER_NAME} \
--cloud aws \
--topology private \
--networking calico \
--dns public \
--dns-zone prod.redacted.com \
--master-zones ${KOPS_ZONES} \
--zones ${KOPS_ZONES} \
--node-size m4.large \
--image coreos.com/CoreOS-stable-1576.5.0-hvm \
--authorization RBAC \
--ssh-public-key ~/.ssh/id_rsa.pub
kops edit cluster ${KOPS_CLUSTER_NAME}
# Edit this:
#
#   networking:
#     calico: {}
#
# To this:
#
#   networking:
#     calico:
#       crossSubnet: true
kops edit ig --name ${KOPS_CLUSTER_NAME} nodes
# Set these:
#
#   maxSize: 10
#   minSize: 3
kops update cluster --yes ${KOPS_CLUSTER_NAME}
kops validate cluster # fails
kubectl get nodes --show-labels # fails
5. What happened after the commands executed?
Technically kops does everything that is expected of it. However, when attempting to validate or run any kubectl commands, we see something like:
$ kops validate cluster
Using cluster from kubectl context: k8s.prod.redacted.com
Validating cluster k8s.prod.redacted.com
cannot get nodes for "k8s.prod.redacted.com": Get https://api.k8s.prod.redacted.com/api/v1/nodes: EOF
The AWS console shows all three of the master nodes as OutOfService in the ELB health checks tab. Viewing the system console log for all six of the EC2 instances created gives no indication that the cloud-init user data has actually been applied, despite what looks to be the expected user data being attached to each node/launch configuration.
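For reference, this is roughly how I've been checking the bootstrap state on the instances themselves (a rough sketch only; it assumes SSH access via the bastion, and that nodeup's kops-configuration.service unit is what should have run):

# Confirm the expected user data is actually attached to the instance:
curl -s http://169.254.169.254/latest/user-data | head -n 40

# Check whether the kops bootstrap unit was ever installed and started
# (nodeup normally sets up kops-configuration.service):
systemctl status kops-configuration.service

# See whether any of the required container images have been pulled yet:
docker ps -a
docker images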
6. What did you expect to happen?
I expected all nodes to bootstrap successfully as they have in the past, and the masters to appear as healthy behind the API ELB.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -oyaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
$ kops get --name k8s.prod.redacted.com -oyaml
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  name: k8s.prod.redacted.com
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-state-redacted/k8s.prod.redacted.com
  dnsZone: prod.redacted.com
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.7.11
  masterInternalName: api.internal.k8s.prod.redacted.com
  masterPublicName: api.k8s.prod.redacted.com
  networkCIDR: 172.20.0.0/16
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Private
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 172.20.96.0/19
    name: us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 172.20.0.0/22
    name: utility-us-east-1a
    type: Utility
    zone: us-east-1a
  - cidr: 172.20.4.0/22
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  - cidr: 172.20.8.0/22
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: master-us-east-1a
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: master-us-east-1b
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:21Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: master-us-east-1c
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-19T08:56:22Z
  labels:
    kops.k8s.io/cluster: k8s.prod.redacted.com
  name: nodes
spec:
  image: coreos.com/CoreOS-stable-1576.5.0-hvm
  machineType: m4.large
  maxSize: 10
  minSize: 3
  role: Node
  subnets:
  - us-east-1a
  - us-east-1b
  - us-east-1c
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
I'll follow up with more verbose logs should they be needed. There is more to redact, and the issue may well already be clear to someone from the information posted above.
9. Anything else do we need to know?
Not as far as I'm aware, but I'll be happy to answer any questions that folks might have.
Thanks!
For what it's worth, this is happening to me too. I was seeing this behavior using kops installed from Homebrew at the HEAD of master.
I think the issue might be partly that you are trying to use k8s 1.9.x with kops 1.8.x; I'm not sure that is supported. I think HEAD might be in the same situation.
I'm similar to you in that I've installed kops using Homebrew, but version 1.8.0 in my case.
As for K8s, I've tried all of the 1.8.0-1.8.4 versions and also 1.7.11, all showing the same symptoms. Very peculiar.
Interesting. Well, at least you know you are not alone. This is currently a blocker for me. I took a look at a master node by SSHing into it, and it seemed like none of the expected API ports were open (checked using netstat).
So I am not sure whether the services are not starting up or what.
I'm fairly confident you're seeing no listening services because none of the nodes or masters are being bootstrapped at all. You may also find that none of the required Docker images have been pulled either.
Just to rule out the Homebrew builds of kops being "bad", I'll try tomorrow with a self-built binary or one from an official source.
I think you are right. I had also run docker ps -a, which yielded nothing; similarly, docker images came up empty. The nodes definitely are not being bootstrapped.
I'm not going to be near a computer until tomorrow, so unfortunately I can't test this suggestion myself, but I'm wondering if running the following could yield some information as to why the user-data provisioning shell script is failing. You'd need to sudo to root to run it.
bash <(curl -s http://169.254.169.254/latest/user-data)
If all went well (I'd imagine it would fail and hopefully yield a useful error message), then it ought to bootstrap the node.
So I ran that command, here is what I got:
bash <(curl -s http://169.254.169.254/latest/user-data)
== nodeup node config starting ==
Downloading nodeup (https://kubeupv2.s3.amazonaws.com/kops/1.8.0/linux/amd64/nodeup)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 35.0M 100 35.0M 0 0 16.0M 0 0:00:02 0:00:02 --:--:-- 16.1M
== Downloaded https://kubeupv2.s3.amazonaws.com/kops/1.8.0/linux/amd64/nodeup (SHA1 = 02185512f78dc9d15a8c10774c4cb11f67e4bc20) ==
Running nodeup
nodeup version 1.8.0 (git-5099bc553)
I0120 01:31:31.036607 928 install.go:149] Built service manifest "kops-configuration.service"
[Unit]
Description=Run kops bootstrap (nodeup)
Documentation=https://github.com/kubernetes/kops
[Service]
EnvironmentFile=/etc/environment
ExecStart=/var/cache/kubernetes-install/nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8
Type=oneshot
[Install]
WantedBy=multi-user.target
I0120 01:31:31.037195 928 install.go:69] No package task found; won't update packages
I0120 01:31:31.037349 928 topological_sort.go:63] Dependencies:
I0120 01:31:31.037418 928 topological_sort.go:65] Service/kops-configuration.service: []
I0120 01:31:31.037595 928 executor.go:91] Tasks: 0 done / 1 total; 1 can run
I0120 01:31:31.037661 928 executor.go:157] Executing task "Service/kops-configuration.service": Service: kops-configuration.service
I0120 01:31:31.037804 928 service.go:124] querying state of service "kops-configuration.service"
W0120 01:31:31.040772 928 service.go:203] Unknown ActiveState="activating"; will treat as not running
I0120 01:31:31.040844 928 changes.go:81] Field changed "Running" actual="false" expected="true"
I0120 01:31:31.040889 928 changes.go:81] Field changed "Enabled" actual="false" expected="true"
I0120 01:31:31.041008 928 service.go:345] Restarting service "kops-configuration.service"
It seemed to just sit there forever, seemingly doing nothing. Looking in the syslog, I see the following log messages:
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.071143 982 s3fs.go:213] Listing objects in S3 bucket "<**redacted**>-k8s-state-store" with prefix "<**redacted**>.k8s.local/pki/issued/ca/"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.172547 982 s3fs.go:239] Listed files in s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca: [s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/6512878379439447890072744430.crt s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/keyset.yaml]
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.172569 982 s3fs.go:176] Reading file "s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/6512878379439447890072744430.crt"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.184895 982 certificate.go:103] Parsing pem block: "CERTIFICATE"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: I0120 01:37:58.185000 982 s3fs.go:176] Reading file "s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/keyset.yaml"
Jan 20 01:37:58 ip-172-20-36-36 nodeup[982]: W0120 01:37:58.194794 982 main.go:141] got error running nodeup (will retry in 30s): error building loader: error fetching CA certificate from keystore: error in 'FindCert' attempting to load cert "ca": error loading certificate "s3://<**redacted**>-k8s-state-store/<**redacted**>.k8s.local/pki/issued/ca/keyset.yaml": could not parse certificate
This then repeats every 30s. Not sure if this is the problem you are seeing, but it seems like it can't parse the certificate, and therefore the bootstrapping process does not complete.
I think this is related to #4156.
So, I found the root cause of my issue, and it wasn't related to #4156.
Basically, for the staging cluster (account 2, us-east-1) we were using a bucket via cross-account access (the bucket lives in account 1, us-east-2). This worked fine for a kops 1.7.0 / K8s 1.7.6 deployment.
Unfortunately, this didn't work in our production account (account 3, us-east-1) using the same bucket in account 1, us-east-2. During the node bootstrap process, nodeup was failing to pull the cluster.spec despite having full permission to do so.
It would first check the us-east-1 region for the bucket (which fails, since the bucket is actually in us-east-2) and then attempt to look up the bucket's region via GetBucketLocation, which also fails. The calling account has full access to the S3 bucket in account 1, so realistically this should not cause a problem.
I0122 11:28:24.922283 5459 files.go:100] Hash matched for "/var/cache/nodeup/sha1:f62360d3351bed837ae3ffcdee65e9d57511695a_https___kubeupv2_s3_amazonaws_com_kops_1_8_0_linux_amd64_utils_tar_gz": sha1:f62360d3351bed837ae3ffcdee65e9d57511695a
I0122 11:28:24.923928 5459 assetstore.go:202] added asset "utils.tar.gz" for &{"/var/cache/nodeup/sha1:f62360d3351bed837ae3ffcdee65e9d57511695a_https___kubeupv2_s3_amazonaws_com_kops_1_8_0_linux_amd64_utils_tar_gz"}
I0122 11:28:24.924144 5459 assetstore.go:303] added asset "socat" for &{"/var/cache/nodeup/extracted/sha1:f62360d3351bed837ae3ffcdee65e9d57511695a_https___kubeupv2_s3_amazonaws_com_kops_1_8_0_linux_amd64_utils_tar_gz/utils/socat"}
I0122 11:28:24.970641 5459 s3context.go:145] unable to get bucket location from region "us-east-1"; scanning all regions: AccessDenied: Access Denied
status code: 403, request id: 99082A90EB31B46F
W0122 11:28:25.051940 5459 main.go:141] got error running nodeup (will retry in 30s): error loading Cluster "s3://kops-state-redacted/k8s.prod.redacted.com/cluster.spec": Unable to list AWS regions: UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: d2480090-347f-4ca2-87be-ba6c2bd978ca
I believe that this is down to GetBucketLocation failing when the requester is not the owner of the bucket. I believe in those circumstances kops will brute-force the bucket location, but it seems that nodeup either never did that or no longer does. I'd imagine it used to, on the basis that this has worked before, but it could well be that it used to default to checking us-east-2 for some reason or another. I believe there was a reason that I originally chose to have the state bucket in that region, so that could have been it.
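For anyone wanting to confirm the same failure mode, this is roughly what reproduces it from an affected instance (a sketch only, assuming the AWS CLI is available there and substituting your real state-store bucket name):

# Run from an instance using the kops-created node/master instance profile.

# Cross-account GetBucketLocation fails when the caller isn't the bucket owner:
aws s3api get-bucket-location --bucket kops-state-redacted --region us-east-1

# nodeup then falls back to scanning every region, which needs ec2:DescribeRegions,
# a permission the kops 1.8 instance policies don't grant:
aws ec2 describe-regions --region us-east-1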
@chrisfu this seems to be an IAM issue, are we missing a permission? We reduced the IAM roles significantly in kops 1.8, did we miss something?
Hi @chrisfu. Good find, so it appears that an additional IAM permission is required for the cases where an S3 bucket is located in a region different to where the cluster has been built / instances are running.
Please could you test by adding the relevant IAM permission to the master & node policies? I believe something like:
{
  "Sid": "kopsK8sDescribeRegions",
  "Effect": "Allow",
  "Action": [
    "ec2:DescribeRegions"
  ],
  "Resource": [
    "*"
  ]
}
You can add additional IAM Policies straight into your kops ClusterSpec, like this: https://github.com/kubernetes/kops/blob/master/docs/iam_roles.md#adding-additional-policies
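Something along these lines in the ClusterSpec should do it (a rough sketch written from memory of the format in that doc, so please double-check the exact field names against it):

spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeRegions"],
          "Resource": ["*"]
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeRegions"],
          "Resource": ["*"]
        }
      ]

Then run kops update cluster ${KOPS_CLUSTER_NAME} --yes to apply the IAM changes.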
If this is the case, we would need to extend the master & node permissions defined here: https://github.com/kubernetes/kops/blob/master/pkg/model/iam/iam_builder.go
Related code performing the brute-force requests: https://github.com/kubernetes/kops/blob/cc67497/util/pkg/vfs/s3context.go#L181-L228
I'm facing the same issue though; does anyone have a solution to this?
@KashifSaadat Good find, that sorted it. Apologies for the long time to respond; I completely missed this!
@chrisfu no worries :)
@lxcid the latest version of kops should now include this fix, can you upgrade to the latest release? https://github.com/kubernetes/kops/releases/tag/1.9.0
If you can't for whatever reason, have a look at the following comment (attaching an additional IAM policy to the nodes): https://github.com/kubernetes/kops/issues/4301#issuecomment-361528463
I'll close this issue, feel free to re-open if you find it's still a problem on the latest release.
/close