Kops: ELB created via a Kubernetes Service shows the nodes as OutOfService

Created on 21 Feb 2018 · 49 comments · Source: kubernetes/kops

I have created a service using the following:

apiVersion: v1
kind: Service
metadata:
  name: nginx-example
  labels:
    name: nginx-example
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "0.0.0.0/0"
spec:
  selector:
    app: nginx-example
  type: LoadBalancer
  ports:
  - name: http
    port: 80
    targetPort: http-web

The ELB launches, but the nodes are shown as OutOfService. It also looks like the security groups are not set up to allow the ELB access on the NodePort Kubernetes allocated (in this case 30892). I had not modified any SGs as part of creating the cluster.

I used kops 1.8.1 with Kubernetes 1.9.3 - has anyone got any ideas what's going on?

All 49 comments

Do you have a pod running that matches the selector?
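A quick way to check, assuming the Service is named nginx-example as in your manifest, is to look at its endpoints:

$ kubectl get endpoints nginx-example

If the ENDPOINTS column is empty (or shows <none>), no running pods match the Service's selector.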

Here is the deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-example
  labels:
    app: nginx-example
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

Couple of errors:

Error from server (BadRequest): error when creating "nginx.yaml": Deployment in version "v1" cannot be handled as a Deployment: no kind "Deployment" is registered for version "apps/v1"

You are also using a selector of app: nginx-example in your Service definition, but your pods have app: nginx.
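You can see the mismatch by putting the pod labels next to the Service selector (a quick check, assuming the default namespace):

$ kubectl get pods --show-labels | grep nginx
$ kubectl describe service nginx-example | grep -i selector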

I've tweaked your definition:

kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: nginx-example
  labels:
    app: nginx-example
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-example
  labels:
    name: nginx-example
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "0.0.0.0/0"
spec:
  selector:
    app: nginx
  type: LoadBalancer
  ports:
  - name: http
    port: 80
    targetPort: http-web

Was the error on the apiVersion then?

There were two things I changed.

First was the apiVersion as I got this error:

Error from server (BadRequest): error when creating "nginx.yaml": Deployment in version "v1" cannot be handled as a Deployment: no kind "Deployment" is registered for version "apps/v1"

The second was your selector in your service. Here's how I debugged it

Created from your config (with the apiVersion changed)

$ kubectl create -f nginx.yaml
deployment "nginx-example" created
service "nginx-example" created

Check the pods are running

$ kubectl get pods | grep nginx
nginx-example-569477d6d8-gfksb   1/1       Running   0          39s
nginx-example-569477d6d8-p92k9   1/1       Running   0          39s

Check your selector

$ kubectl describe service nginx-example | grep -i selector | awk '{print $2}'
app=nginx-example

Find pods matching your label

$ kubectl get pods -l $(kubectl describe service nginx-example | grep -i selector | awk '{print $2}')
No resources found.

No pods found, so the service won't route any traffic to the pods, even though they are running.

So let's delete and recreate

$ kubectl delete -f nginx.yaml
deployment "nginx-example" deleted
service "nginx-example" deleted
$ kubectl create -f nginx-patched.yaml
deployment "nginx-example" created
service "nginx-example" created



md5-b114dbb005b457d2ca2a7f4e79f9f571



$ kubectl get pods | grep nginx
nginx-example-569477d6d8-vxc5s   1/1       Running   0          5s
nginx-example-569477d6d8-zqhbv   1/1       Running   0          5s

Check your selector again

$ kubectl describe service nginx-example | grep -i selector | awk '{print $2}'
app=nginx

Find pods matching your label again

$ kubectl get pods -l $(kubectl describe service nginx-example | grep -i selector | awk '{print $2}')
NAME                             READY     STATUS    RESTARTS   AGE
nginx-example-569477d6d8-vxc5s   1/1       Running   0          37s
nginx-example-569477d6d8-zqhbv   1/1       Running   0          37s

So now the service should be passing requests on to the pods. You can verify this using the Kubernetes dashboard (if you have it installed on your cluster):

https://[cluster api host]/api/v1/namespaces/kube-system/services/http:kubernetes-dashboard:/proxy/#!/service/default/nginx-example?namespace=default

Your internal ELB has subnets configured in the same availability zone as your node?

Are you tagging the privates nodes with the annotation kubernetes.io/role/internal-elb as described in aws.go?

// TagNameSubnetInternalELB is the tag name used on a subnet to designate that
// it should be used for internal ELBs
const TagNameSubnetInternalELB = "kubernetes.io/role/internal-elb"
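For example, a private subnet can be tagged for internal ELB placement with the AWS CLI (the subnet ID below is a placeholder):

$ aws ec2 create-tags \
    --resources subnet-id1 \
    --tags Key=kubernetes.io/role/internal-elb,Value=1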

@amalucelli what should be annotated with kubernetes.io/role/internal-elb?
The K8s nodes?

For internal ELBs, you annotate the Service with:

service.beta.kubernetes.io/aws-load-balancer-internal: "0.0.0.0/0"

When I tested your yaml earlier, it did correctly create an internal-only ELB, so that is fine.

@huang-jy did it connect to the nodes as well?

@darrenhaken yes, it picked up the pods as per your config. I use a similar method for my corporate test cluster.

The only catch you have to remember is that the ELB will resolve to a private IP and not a public IP if you use that annotation.
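You can confirm this by resolving the ELB's hostname from a machine inside (or VPN'd into) the VPC; the hostname below is just a placeholder:

$ dig +short internal-nginx-example-1234567890.eu-west-1.elb.amazonaws.com

It should come back with private 10.x.x.x addresses rather than public IPs.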

@huang-jy did you have any issues with the Security Groups it creates?

At the moment the instance nodes seem to have a default node-xx security group attached which restricts incoming traffic to the master. I'm wondering if the issue I have is about SG's blocking connectivity.

@darrenhaken your yaml does not create any new security groups -- or do you mean the security groups kops creates at the cluster creation?

I mean the security groups Kops creates.

No, I had no issues. I did add additional groups as part of the creation, though, to allow the cluster to talk to other parts of my AWS estate.

How did you add additional groups?

Assuming the groups are already created:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-04T10:25:26Z
  labels:
    kops.k8s.io/cluster: [clustername]
  name: master-eu-west-1a
spec:
  additionalSecurityGroups:
  - sg-xxxxxxxx
  - sg-yyyyyyyy
  - sg-zzzzzzzz
....
....
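If you add this to an existing cluster, the change has to be pushed out and the instances replaced before it takes effect; roughly:

$ kops edit ig nodes --name [clustername]
$ kops update cluster [clustername] --yes
$ kops rolling-update cluster [clustername] --yes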

And can you add additional security groups to the ELB that K8s creates for a Service?

I haven't tried it myself, but additionalSecurityGroups doesn't work in a Service definition, since it's a kops InstanceGroup field rather than part of the Kubernetes Service spec:

error: error validating "test.yaml": error validating data: ValidationError(Service.spec): unknown field "additionalSecurityGroups" in io.k8s.api.core.v1.ServiceSpec; if you choose to ignore these errors, turn validation off with --validate=false

Do you have anything set on cloudConfig for your Kops manifest?

I noticed I had this set:

cloudConfig:
  disableSecurityGroupIngress: true
  elbSecurityGroup: sg-xxxxxxx

I wondered if that's causing the problem; I'm redeploying the cluster to see.

@huang-jy So I did curl {IP_OF_NODE}:{NODEPORT}, where NODEPORT is the port the ELB is trying to connect to. I get connection refused.

I have added an SG to the node to allow all inbound/outbound traffic, but I still get the same problem. Are there any kubectl commands to verify that the ports are open?

@darrenhaken you don't normally have to hit the node directly. You have created a load balancer as part of the service; you should be using that.

Since you have created an internal load balancer, which resolves to internal IPs, you can only access it if you are connected into the VPC (e.g. via Direct Connect or a VPN), or you are going through a bastion. Please confirm you are doing one of those. If not, take off the internal annotation to make the ELB public.

Assuming you are able to access it using the internal IP, I found another error.

Your Service definition describes how to map the request ports. You specified this:

  ports:
  - name: http
    port: 80
    targetPort: http-web

But you have not declared a container port named http-web anywhere (see the note after the full config below).

This leads to this message when trying to curl the ELB:

curl: (52) Empty reply from server

Change that to:

  ports:
  - name: http
    port: 80
    targetPort: 80

And it should work

$ curl -IL [elb-host-name]
HTTP/1.1 200 OK
Server: nginx/1.7.9
Date: Thu, 22 Feb 2018 17:43:06 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 23 Dec 2014 16:25:09 GMT
Connection: keep-alive
ETag: "54999765-264"
Accept-Ranges: bytes

For summary, here's the full config I used

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-example
  labels:
    app: nginx-example
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-example
  labels:
    name: nginx-example
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "0.0.0.0/0"
spec:
  selector:
    app: nginx
  type: LoadBalancer
  ports:
  - name: http
    port: 80
    targetPort: 80
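(Alternatively, targetPort: http-web would also have worked if the container port had been given that name; a sketch of just the container ports section:)

    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - name: http-web
          containerPort: 80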

I tried connecting directly to the node, rather than through the ELB, to verify that I could at least reach it.

I do have a VPN to the VPC and can access other resources in the account.

Below are all the files I have used to create the cluster

kops manifest:

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-02-21T11:51:01Z
  name: k8.dev.k8s.local
spec:
  api:
    loadBalancer:
      type: Internal
  authorization:
    alwaysAllow: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://bucket/k8.dev.k8s.local
  dnsZone: dns.fake.zone.com
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.9.3
  masterPublicName: api.k8.dev.k8s.local
  networkCIDR: 10.90.0.0/16
  networkID: vpc-4c4dca2b
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.xx.x.x/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.xxx.0.0/22
    id: subnet-id1
    name: subnet-x1
    type: Private
    zone: eu-west-1a
  - cidr: 10.xxx.16.0/22
    id: subnet-id2
    name: private-b
    type: Private
    zone: eu-west-1b
  - cidr: 10.xxx.32.0/22
    id: subnet-id3
    name: private-c
    type: Private
    zone: eu-west-1c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-21T11:51:01Z
  labels:
    kops.k8s.io/cluster: k8.dev.k8s.local
  name: master-eu-west-1a
spec:
  associatePublicIp: false
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14
  machineType: m3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  subnets:
  - private-a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-21T11:51:01Z
  labels:
    kops.k8s.io/cluster: k8.dev.k8s.local
  name: nodes
spec:
  associatePublicIp: false
  additionalSecurityGroups:
  - sg-xxxxx
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14
  machineType: t2.medium
  maxSize: 6
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - sg-xx1
  - sg-xx2
  - sg-xx3

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-example
spec:
  selector:
    matchLabels:
      app: nginx-example
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx-example
    spec:
      containers:
      - name: nginx-example
        image: nginx:1.7.9
        ports:
        - name: http-port
          containerPort: 80

Service:

apiVersion: v1
kind: Service
metadata:
  name: nginx-example
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "0.0.0.0/0"
spec:
  selector:
    app: nginx-example
  type: LoadBalancer
  ports:
  - name: nginx-example
    port: 80
    targetPort: 80

Can you take off the internal ELB annotation then, and confirm whether it works?

@huang-jy slight update here: if I change the service type to NodePort, I now get a response from a node's IP on the allocated port. Will try taking internal off now...

@huang-jy without the annotation no Load Balancer is created

@darrenhaken the internal annotation only affects the ELB, and only when you use LoadBalancer as the service type. It makes no difference with the NodePort type.
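To see the NodePort mapping that the ELB (or a direct curl to a node) should be hitting, something like:

$ kubectl get service nginx-example

The PORT(S) column shows the mapping, e.g. 80:30892/TCP, where 30892 is the NodePort open on every node.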

@huang-jy I put it back to LoadBalancer and I do still seem to be able to curl the node directly, but the ELB still reports the nodes as OutOfService. Any info you need about the ELB?

If you curl the ELB, do you get a reply like I did?

@huang-jy I just made some progress. If I change the health check manually from TCP to HTTP it works. Eh!?

Bear in mind I didn't change the healthcheck on the ELB.

@huang-jy if I curl the ELB it doesn't work. If I change the health check to HTTP, it immediately detects the nodes as healthy, and then curl works.
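For reference, the same change can be made from the AWS CLI with something like this (the ELB name is a placeholder, and 30892 is the NodePort from earlier):

$ aws elb configure-health-check \
    --load-balancer-name <elb-name> \
    --health-check Target=HTTP:30892/,Interval=10,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2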

That is strange.

What security groups do you have attached on the nodes?

@huang-jy I was wrong: it was related to the security groups. I added a security group, basically an allow-all SG, applied it to the instance group, and then healthy nodes appeared. It was added to the instance group via:

additionalSecurityGroups:
  - sg-xxxxx

Does that make sense?

I don't know how to automatically add the same SG to the ELB when it's created, do you?
Alternatively, can I simply remove the additionalSecurityGroups and it will 'just work'?

Here is what the SG allows:

Inbound:

| Type | Protocol | Port range | Source |
| -- | -- | -- | -- |
| All traffic | All | All | 0.0.0.0/0 |

Outbound:

| Type | Protocol | Port range | Destination |
| -- | -- | -- | -- |
| All traffic | All | All | 0.0.0.0/0 |

Okay can I propose this:

Remove all "additionalSecurityGroups" from your worker ig definition so that the ONLY security group you have on the nodes is the nodes.kubernetes one.

kubectl delete your nginx example deployment and service, then kubectl create them again with the internal annotation.

See if that works. If it does, kubectl delete again, re-add the security groups you took off to additionalSecurityGroups and then kubectl create again.

If this time it doesn't work, then one or more of the security groups you just re-added is conflicting.

@huang-jy Isn't additionalSecurityGroups part of the cluster and not the service manifest?

Still remove it. We essentially want to get as close to a vanilla sg configuration as possible.

And, no, they're ig specific.

https://github.com/kubernetes/kops/issues/4486#issuecomment-367722850

Alright @huang-jy, I think I know what fixed it, but I will also rebuild the cluster with all SGs removed to get back to vanilla. I'll reply to this issue tomorrow with the result and keep you posted (it's getting late here and a rebuild takes a long time).

The last thing I removed before testing this again was:

cloudConfig:
  disableSecurityGroupIngress: true
  elbSecurityGroup: sg-33cb4449

This meant that when an ELB was created, it used a security group generated by kops. I think that has allowed it access to the nodes; it was just a little slow to connect at first.

My goal in adding SGs to both the ELB and the nodes, btw, was to restrict access on the NodePort range to just the ELB. Are you aware of a way to achieve something like this? It's quite a common pattern.
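Something like the following is what I have in mind (a sketch with the AWS CLI; the group IDs are placeholders, and 30000-32767 is the default NodePort range):

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-nodes \
    --protocol tcp \
    --port 30000-32767 \
    --source-group sg-elb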

Yes, I have a corporate base sg attached to all the nodes, and masters via the instance group definitions, so it's not uncommon at all.

How do you apply that to the ELB as well, so that it is permitted to communicate with the nodes?

Are you comfortable sharing it as TF or CloudFormation, so I can see it as code showing the ports (scrubbing IPs etc.)?

We don't use TF or CF for the Kubernetes cluster, we use kops. And it's not production ready yet :p

It's essentially the comment I put earlier: on all the master IGs and the node IG, I have a spec.additionalSecurityGroups section with the additional corporate SGs that need to go onto the boxes.

I haven't added anything additional to the ELBs yet.

Ah ok so you add it to the master IG too.

If you haven't added anything to the ELB, how do you ensure the port restrictions are allowed by the ELB?

I've found out you can add an annotation to the Service to attach an SG to the ELB, btw.
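I believe it's service.beta.kubernetes.io/aws-load-balancer-extra-security-groups (it attaches your SG on top of the one the cloud provider creates), assuming your Kubernetes version supports it:

apiVersion: v1
kind: Service
metadata:
  name: nginx-example
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "0.0.0.0/0"
    service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: "sg-xxxxx"
....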

The service defines what ports you allow in.

The ELB is created as part of the Service declaration, and in there you define which ports to allow in and where to map them on the pod (in your example, you mapped 80 -> 80).

Thanks for all the help on this, you've been awesome!

Can we close?

I think we can. @darrenhaken ?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

I had a similar issue with a service:

Service in version "v1" cannot be handled as a Service: v1.Service: Spec: v1.ServiceSpec: Selector: ReadString: expects " or n, but found 2, error found in #10 byte of

We use YAML files for manifests, and from what I've read, Kubernetes converts YAML to JSON before applying it. In my case, I have a build process that populates a version label with the last 8 characters of the commit SHA. This was the first time I'd hit an all-numeric last 8 characters, and we didn't quote the value because YAML doesn't require quotes.

The Kubernetes JSON schema it validates against requires label values to be strings, but somewhere along the way the parser treated my label as a number instead of a string and threw that error.

WRAP YOUR YAML STRINGS IN QUOTES! Especially if they are dynamically populated!
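For example, with a hypothetical label populated from a commit SHA:

metadata:
  labels:
    version: "31415926"

Quoted, an all-numeric SHA fragment stays a string; unquoted, the YAML-to-JSON conversion turns it into a number and validation rejects it.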
