1. What kops version are you running? The command kops version will display this information.
Version 1.12.1 (git-e1c317f9c)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:23:09Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Honestly, I have lost track a bit. We created the cluster using kops 1.12.0-beta.2 to install Kubernetes 1.12.7 with Calico networking. We then upgraded the cluster a few times to get to kops 1.12.1 and Kubernetes 1.12.8, roughly as sketched below.
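For reference, everything was driven through kops; a rough sketch of the command sequence (from memory, with placeholder names and the exact flags not recorded) looks like this:

```sh
# Initial creation with kops 1.12.0-beta.2 (Kubernetes 1.12.7, Calico networking)
kops create cluster --name example.io --state s3://<redacted> --networking calico ...
kops update cluster --name example.io --state s3://<redacted> --yes

# Repeated upgrade cycles, eventually reaching kops 1.12.1 / Kubernetes 1.12.8
kops upgrade cluster --name example.io --state s3://<redacted> --yes
kops update cluster --name example.io --state s3://<redacted> --yes
kops rolling-update cluster --name example.io --state s3://<redacted> --yes
```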
5. What happened after the commands executed?
Somewhere along the line, calico-kube-controllers stopped working, with the error "No etcd endpoints specified in etcdv3 API config".
Note that we started the cluster at Kubernetes 1.12.7 with etcd3, so there was no major upgrade like etcd v2 -> v3.
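The error is visible in the controller's own logs; something like this should show it (assuming the default kops deployment name in kube-system):

```sh
kubectl -n kube-system logs deployment/calico-kube-controllers --tail=50
# ... "No etcd endpoints specified in etcdv3 API config"
```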
6. What did you expect to happen?
I expected calico-kube-controllers to be properly configured to use etcd3. Alternatively, if calico-kube-controllers is no longer needed, as is suggested by this comment, I expected that to be documented clearly enough that I can show anyone concerned that the missing configuration is not an issue that needs to be fixed.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: example.io
spec:
  additionalPolicies:
    bastion: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
    master: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTags"
          ],
          "Resource": "*"
        }
      ]
  api:
    loadBalancer:
      idleTimeoutSeconds: 600
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Cluster: <redacted>
  cloudProvider: aws
  configBase: s3://<redacted>
  dnsZone: <redacted>
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: events
  hooks:
  - manifest: |
      Type=oneshot
      ExecStart=/bin/sh -c '/sbin/iptables -t nat -A PREROUTING -d 169.254.169.254/32 \
        -i cali+ -p tcp -m tcp --dport 80 -j DNAT \
        --to-destination $(curl -s http://169.254.169.254/latest/meta-data/local-ipv4):8181'
    name: kiam-iptables.service
    roles:
    - Node
  iam:
    legacy: true
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
    - Initializers
    - DenyEscalatingExec
    anonymousAuth: false
    authorizationMode: RBAC
    oidcClientID: <redacted>
    oidcGroupsClaim: <redacted>
    oidcGroupsPrefix: 'oidc:'
    oidcIssuerURL: https://<redacted>
    oidcUsernameClaim: <redacted>
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.8
  masterPublicName: <redacted>.io
  networkCIDR: 10.105.0.0/17
  networkID: vpc-<redacted>
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.105.0.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 10.105.16.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 10.105.32.0/20
    egress: nat-<redacted>
    id: subnet-<redacted>
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 10.105.48.0/20
    id: subnet-<redacted>
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 10.105.64.0/20
    id: subnet-<redacted>
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 10.105.80.0/20
    id: subnet-<redacted>
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    bastion:
      bastionPublicName: <redacted>
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.small
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
  - utility-us-west-2a
  - utility-us-west-2b
  - utility-us-west-2c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2a
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2b
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: master-us-west-2c
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: <redacted>
  name: nodes
spec:
  associatePublicIp: false
  detailedInstanceMonitoring: false
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 3
  minSize: 3
  role: Node
  subnets:
  - us-west-2a
  - us-west-2b
  - us-west-2c
9. Anything else do we need to know?
What does it mean to run calico "in CRD mode"? I cannot find that anywhere in the Calico documentation.
I haven't used Calico, but I believe "CRD mode" means that Calico stores its state through the Kubernetes API server using custom resource definitions rather than storing state directly in etcd. I think that might be controlled by the DATASTORE_TYPE variable here.
Can you confirm that setting DATASTORE_TYPE=kubernetes fixes the issue? I'll see if we can add some clarity to the Kops documentation.
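For reference, the datastore the manifests are currently configured with can be checked with something like the following (assuming the default kops names: a calico-node DaemonSet and a calico-kube-controllers Deployment in kube-system):

```sh
kubectl -n kube-system get daemonset calico-node -o yaml | grep -A1 DATASTORE_TYPE
kubectl -n kube-system get deployment calico-kube-controllers -o yaml | grep -A1 DATASTORE_TYPE
```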
@rifelpet Thanks for your contribution to the discussion.
I don't know how or where to set DATASTORE_TYPE=kubernetes using kops, and since kops is managing everything else to do with Calico, I do not want to configure it via some other mechanism.
Also, even if DATASTORE_TYPE=kubernetes makes the error message go away, it does not answer the question of whether or not calico-kube-controllers needs to run at all (and why or why not).
After upgrading to kops 1.12.3 I had the following situation:
DATASTORE_TYPE: kubernetes
Unhandled error: client: etcd cluster is unavailable or misconfigured; error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
It seems that downscaling of the Deployment _calico-kube-controllers_ in https://github.com/kubernetes/kops/blob/6ea097da1fbc4db14667e4b1384ef956ad3620b7/upup/models/cloudup/resources/addons/networking.projectcalico.org/k8s-1.12.yaml.template#L442 doesn't work properly?
I have downscaled this deployment manually and the error messages stopped, without losing any Calico functionality. If I understood correctly, this functionality, previously implemented as a separate set of controllers, is now built into _calico-node_.
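For anyone wanting to do the same, the manual downscale is just (assuming the default deployment name in kube-system):

```sh
kubectl -n kube-system scale deployment calico-kube-controllers --replicas=0
```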
@opusmagnum Have you encountered any new errors? After downscaling the deployment to zero I have started to see new errors in protokube.service. Just wanted to see if you have encountered anything like that as well.
UPDATES:
"CRD mode" means that Calico stores its state information via the Kubernetes API in Kubernetes Custom Resources, rather than, as previously, storing the information directly in etcd via its API.
According to this PR, calico-kube-controllers does still need to be run even in CRD mode. Apparently its job is to remove resources when they are no longer needed.
I suppose this bug was fixed along the way somewhere. Any version that has that PR (See list here) should be OK.
Thank you for the explanation, @Nuru! If a cluster node is removed, the Calico daemon on that node is removed as well, so no additional cleanup is necessary, is it? @Nuru Which kinds of resources should be removed, and under which circumstances?
@grv231 Since downscaling calico-kube-controllers I haven't noticed any negative effects or masses of errors, but I will check again in light of the explanation (last comment) from @Nuru, just to be sure that the removal of resources after cluster/node downscaling or similar still works.
@opusmagnum I'm not sure exactly what resources should be removed and when, but it seems that at least ipamblocks and blockaffinities need to be removed when a node is removed.
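For example, the IPAM bookkeeping that Calico keeps in CRD mode can be inspected with something like this (assuming the CRDs are installed under the crd.projectcalico.org group, as in the kops Calico manifests):

```sh
kubectl get crds | grep projectcalico.org
kubectl get ipamblocks.crd.projectcalico.org
kubectl get blockaffinities.crd.projectcalico.org
```

Stale entries referring to nodes that no longer exist are the kind of thing calico-kube-controllers is supposed to clean up.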
@opusmagnum For us the issue started happening seemingly at random 3-4 weeks after migrating the cluster to version 1.12.8. I can concur that the issue didn't come up soon after migrating to this version (I guess because I was upgrading from 1.12.0 --> 1.12.8). Somewhere along the line the changes were not picked up, and I had to scale the calico-kube-controllers deployment down because the cluster was not getting validated (which then raised other errors in the protokube service after 3-4 weeks).
This weekend I migrated the cluster to 1.13 and that has resolved the issues. However, @Nuru, I see a significant change in the amount of logging from the etcd-manager pods. The previous etcd-server-events pod logged much less (as far as I can see in Kibana). Is this expected behavior? The logs all seem to be non-error messages.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.