1. What kops version are you running? The command kops version will display this information.
Version 1.12.1 (git-e1c317f9c)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:23:09Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Honestly, I have lost track a bit. We created the cluster using kops 1.12.0-beta.2 to install Kubernetes 1.12.7 with Calico networking. We then upgraded the cluster a few times to get to kops 1.12.1 and Kubernetes 1.12.8, roughly as sketched below.
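For reference, everything was driven through kops; a rough sketch of the command sequence (from memory, with placeholder names and the exact flags not recorded) looks like this:

```sh
# Initial creation with kops 1.12.0-beta.2 (Kubernetes 1.12.7, Calico networking)
kops create cluster --name example.io --state s3://<redacted> --networking calico ...
kops update cluster --name example.io --state s3://<redacted> --yes

# Repeated upgrade cycles, eventually reaching kops 1.12.1 / Kubernetes 1.12.8
kops upgrade cluster --name example.io --state s3://<redacted> --yes
kops update cluster --name example.io --state s3://<redacted> --yes
kops rolling-update cluster --name example.io --state s3://<redacted> --yes
```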
5. What happened after the commands executed?
Somewhere along the line, calico-kube-controllers stopped working, with the error "No etcd endpoints specified in etcdv3 API config".
Note that we started the cluster at Kubernetes 1.12.7 with etcd3, so there was no major upgrade like etcd v2 -> v3.
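The error is visible in the controller's own logs; something like this should show it (assuming the default kops deployment name in kube-system):

```sh
kubectl -n kube-system logs deployment/calico-kube-controllers --tail=50
# ... "No etcd endpoints specified in etcdv3 API config"
```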
6. What did you expect to happen?
I expected calico-kube-controllers to be properly configured to use etcd3. Alternatively, if calico-kube-controllers is no longer needed, as is suggested by this comment, I expected that to be documented clearly enough that I can show anyone concerned that the missing configuration is not an issue that needs to be fixed.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: example.io
spec:
  additionalPolicies:
    bastion: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
    master: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
  api:
    loadBalancer:
      idleTimeoutSeconds: 600
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Cluster: <redacted>
  cloudProvider: aws
  configBase: s3://<redacted>
  dnsZone: <redacted>
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: events
  hooks:
  - manifest: |
      Type=oneshot
      ExecStart=/bin/sh -c '/sbin/iptables -t nat -A PREROUTING -d 169.254.169.254/32 \
        -i cali+ -p tcp -m tcp --dport 80 -j DNAT \
        --to-destination $(curl -s http://169.254.169.254/latest/meta-data/local-ipv4):8181'
    name: kiam-iptables.service
    roles:
    - Node
  iam:
    legacy: true
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
    - Initializers
    - DenyEscalatingExec
    anonymousAuth: false
    authorizationMode: RBAC
    oidcClientID: <redacted>
    oidcGroupsClaim: <redacted>
    oidcGroupsPrefix: 'oidc:'
    oidcIssuerURL: https://<redacted>
    oidcUsernameClaim: <redacted>
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.8
  masterPublicName: <redacted>.io
  networkCIDR: 10.105.0.0/17
  networkID: vpc-<redacted>
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.105.0.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 10.105.16.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 10.105.32.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 10.105.48.0/20
    id: subnet-<redacted>
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 10.105.64.0/20
    id: subnet-<redacted>
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 10.105.80.0/20
    id: subnet-<redacted>
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    bastion:
      bastionPublicName: <redacted>
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.small
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
  - utility-us-west-2a
  - utility-us-west-2b
  - utility-us-west-2c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2a
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2b
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2c
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: nodes
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 3
  minSize: 3
  role: Node
  subnets:
  - us-west-2a
  - us-west-2b
  - us-west-2c
9. Anything else do we need to know?
What does it mean to run calico "in CRD mode"? I cannot find that anywhere in the Calico documentation.
I haven't used Calico, but I believe "CRD mode" means that Calico stores its state through the Kubernetes API server using custom resource definitions rather than storing state directly in etcd. I think that might be controlled by the DATASTORE_TYPE variable here.
Can you confirm that setting DATASTORE_TYPE=kubernetes fixes the issue? I'll see if we can add some clarity to the Kops documentation.
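For reference, the datastore the manifests are currently configured with can be checked with something like the following (assuming the default kops names: a calico-node DaemonSet and a calico-kube-controllers Deployment in kube-system):

```sh
kubectl -n kube-system get daemonset calico-node -o yaml | grep -A1 DATASTORE_TYPE
kubectl -n kube-system get deployment calico-kube-controllers -o yaml | grep -A1 DATASTORE_TYPE
```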
@rifelpet Thanks for your contribution to the discussion.
I don't know how or where to set DATASTORE_TYPE=kubernetes using kops, and since kops is managing everything else to do with Calico, I do not want to configure it via some other mechanism.
Also, even if DATASTORE_TYPE=kubernetes makes the error message go away, it does not answer the question of whether or not calico-kube-controllers needs to run at all (and why or why not).
After upgrading to kops 1.12.3 I had the following situation:
DATASTORE_TYPE: kubernetes
Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
It seems that downscaling of the Deployment _calico-kube-controllers_ in https://github.com/kubernetes/kops/blob/6ea097da1fbc4db14667e4b1384ef956ad3620b7/upup/models/cloudup/resources/addons/networking.projectcalico.org/k8s-1.12.yaml.template#L442 doesn't work properly?
I have downscaled this deployment manually and the error messages stopped, without losing any Calico functionality. If I understood correctly, this functionality, previously implemented as a separate set of controllers, is now built into _calico-node_.
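For anyone wanting to do the same, the manual downscale is just (assuming the default deployment name in kube-system):

```sh
kubectl -n kube-system scale deployment calico-kube-controllers --replicas=0
```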
@opusmagnum Have you encountered any new errors? After downscaling the deployment to zero I have started to see new errors in protokube.service. Just wanted to see if you have encountered anything like that as well.
UPDATES:
"CRD mode" means that Calico stores its state information via the Kubernetes API in Kubernetes Custom Resources, rather than, as previously, storing the information directly in etcd via its API.
According to this PR, calico-kube-controllers does still need to be run even in CRD mode. Apparently its job is to remove resources when they are no longer needed.
I suppose this bug was fixed along the way somewhere. Any version that has that PR (See list here) should be OK.
Thank you for the explanation, @Nuru! If a cluster node is removed, the Calico daemon on that node is removed as well, so no additional cleanup is necessary, is it? @Nuru Which kinds of resources should be removed, and under which circumstances?
@grv231 Since downscaling calico-kube-controllers I haven't noticed any negative effects or masses of errors, but I will check again in light of the explanation (last comment) from @Nuru, just to be sure that the removal of resources after cluster/node downscaling or similar still works.
@opusmagnum I'm not sure exactly what resources should be removed and when, but it seems that at least ipamblocks and blockaffinities need to be removed when a node is removed.
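For example, the IPAM bookkeeping that Calico keeps in CRD mode can be inspected with something like this (assuming the CRDs are installed under the crd.projectcalico.org group, as in the kops Calico manifests):

```sh
kubectl get crds | grep projectcalico.org
kubectl get ipamblocks.crd.projectcalico.org
kubectl get blockaffinities.crd.projectcalico.org
```

Stale entries referring to nodes that no longer exist are the kind of thing calico-kube-controllers is supposed to clean up.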
@opusmagnum For us the issue started happening seemingly at random 3-4 weeks after migrating the cluster to version 1.12.8. I can concur that the issue didn't come up soon after migrating to this version (I guess because I was upgrading from 1.12.0 --> 1.12.8). Somewhere along the line the changes were not picked up, and I had to scale the calico-kube-controllers deployment down because the cluster was not getting validated (which then raised other errors in the protokube service after 3-4 weeks).
This weekend I migrated the cluster to 1.13 and that has resolved the issues. However, @Nuru, I see a significant change in the amount of logging from the etcd-manager pods. The previous etcd-server-events pod logged much less (as far as I can see in Kibana). Is this expected behavior? The logs all seem to be non-error messages.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.