1. What kops version are you running? The command kops version will display
this information.
Version 1.11.0 (git-2c2042465)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.11.6
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --yes --instance-group=master-eu-central-1b
5. What happened after the commands executed?
The rolling update of the master node completed successfully. However, during the 2-3 minutes it took to provision the new master, there was a complete cluster outage.
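For what it's worth, a minimal sketch of how an outage window like this can be observed from outside the cluster (the hostname and the /healthz path are placeholders based on the example service further down, not anything kops provides):

# Hit the service every few seconds and log the HTTP status code;
# a run of failures marks the outage window while the master is replaced
while true; do
  printf '%s ' "$(date -u +%H:%M:%S)"
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 https://<service-hostname>/healthz || echo FAILED
  sleep 5
done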
6. What did you expect to happen?
I expected everything to continue operating as normal.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-19T15:09:49Z
  name: [REDACTED]
spec:
  additionalPolicies:
    node: |
      [{"Effect": "Allow","Action": ["route53:ChangeResourceRecordSets"],"Resource": ["arn:aws:route53:::hostedzone/*"]},{"Effect": "Allow","Action": ["route53:ListHostedZones","route53:ListResourceRecordSets"],"Resource": ["*"]}]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: [REDACTED]
  docker:
    version: 18.06.1
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1b
      name: b
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-central-1b
      name: b
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    oidcClientID: [REDACTED]
    oidcIssuerURL: https://accounts.google.com
    oidcUsernameClaim: email
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - [REDACTED]
  kubernetesVersion: 1.11.6
  masterInternalName: [REDACTED]
  masterPublicName: [REDACTED]
  networkCIDR: [REDACTED]
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: [REDACTED]
  sshAccess:
  - [REDACTED]
  subnets:
  - cidr: [REDACTED]
    name: eu-central-1b
    type: Private
    zone: eu-central-1b
  - cidr: [REDACTED]
    name: utility-eu-central-1b
    type: Utility
    zone: eu-central-1b
  - cidr: [REDACTED]
    name: public-eu-central-1b
    type: Public
    zone: eu-central-1b
  topology:
    bastion:
      bastionPublicName: [REDACTED]
    dns:
      type: Public
    masters: private
    nodes: private
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else we need to know?
It is my understanding that restarting the master node should have zero downtime.
Here is an example deployment of one of our apps that was impacted during the upgrade:
apiVersion: v1
kind: Service
metadata:
  name: eventd
  labels:
    app: eventd
  annotations:
    external-dns.alpha.kubernetes.io/hostname: [REDACTED]
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: [REDACTED]
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
spec:
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 6000
  - name: https
    protocol: TCP
    port: 443
    targetPort: 6000
  type: LoadBalancer
  selector:
    app: eventd
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: eventd
  labels:
    app: eventd
    build: stable
spec:
  replicas: 3
  revisionHistoryLimit: 2
  minReadySeconds: 30
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: eventd
  template:
    metadata:
      labels:
        app: eventd
        build: stable
    spec:
      terminationGracePeriodSeconds: 300
      imagePullSecrets:
      - name: registrykey
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - eventd
            topologyKey: kubernetes.io/hostname
      nodeSelector:
        subnetType: private
      containers:
      - name: eventd
        image: avocet/services
        resources:
          requests:
            cpu: 1
            memory: 4G
        imagePullPolicy: Always
        command:
        - eventd
        ports:
        - containerPort: 6000
        livenessProbe:
          httpGet:
            path: /healthz
            port: 6000
          initialDelaySeconds: 90
          periodSeconds: 3
        volumeMounts:
        - name: ssl-certs
          mountPath: /etc/ssl/certs/ca-certificates.crt
          readOnly: true
      volumes:
      - name: ssl-certs
        hostPath:
          path: /etc/ssl/certs/ca-certificates.crt
After doing some further investigation, it looks like all of the instances attached to an ELB in the cluster are temporarily removed when restarting the master node.
I can see several metrics and health checks in AWS stop reporting during this brief outage that usually lasts around 2-3 minutes.
Could this be something to do with the dns-controller?
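A rough way to confirm the deregistration from the AWS side while the master is rolling (the ELB name is a placeholder; assuming a classic ELB created by the Kubernetes service controller, as in the Service manifest above):

# Find the ELB hostname Kubernetes assigned to the Service
kubectl get service eventd -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Poll instance health on that ELB during the rolling update; instances
# disappearing from the output (or flipping to OutOfService) matches the observed outage
aws elb describe-instance-health --load-balancer-name <elb-name>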
I have seen the same thing when using CoreDNS.
Looks like CoreDNS fails to connect to the K8s API, which in turn makes DNS lookups fail.
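A quick way to check whether the cluster DNS pods are losing their connection to the API server during the roll (assuming the k8s-app=kube-dns label, which kops applies to both kube-dns and CoreDNS):

# Watch the DNS pods and tail their logs while the master is being replaced;
# connection-refused / i/o timeout errors against the API server would explain the failing lookups
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50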
@olemarkus Unfortunately I've also experienced this with clusters just running kube-dns.
I've seen this with kube-dns too.
I didn't mean to suggest it was CoreDNS that was the culprit. We observed this with kube-dns too (which took even longer to recover from).
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
We are also experiencing a similar issue; we use kops 1.12.2 and k8s 1.12.9.
From my investigation, what happens is that when the new master comes up, all nodes become NotReady for a short time, about a minute (until they reconnect to the new master). During this time the controller-manager thinks there are no Ready nodes in the cluster and removes all current nodes from the cloud provider's LoadBalancer.
When the nodes start reconnecting to the new master, they get re-added to the provider's LoadBalancer by the controller-manager.
Here are the events from one of the services; as you can see there was a period with no available nodes, and after about a minute everything was back to normal.
Type     Reason                   Age  From                Message
----     ------                   ---  ----                -------
Normal   EnsuringLoadBalancer     17m  service-controller  Ensuring load balancer
Normal   EnsuredLoadBalancer      17m  service-controller  Ensured load balancer
Warning  UnAvailableLoadBalancer  15m  service-controller  There are no available nodes for LoadBalancer service default/hello-world
Normal   UpdatedLoadBalancer      14m  service-controller  Updated load balancer with new hosts
Basically this means restarting the master node is not safe anymore. I'm not sure when this started for me, but I never experienced the problem prior to upgrading kops/k8s to 1.12.x.
I couldn't find related issues in the k8s GitHub repo or here, but it's definitely not desired behavior.
@justinsb @chrislovecnm need your attention and advice here, thanks
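A small sketch of how to watch the sequence described above in real time (hello-world is the service name from the events; nothing here is specific to kops):

# In one terminal: watch nodes flip to NotReady and back while the new master comes up
kubectl get nodes -w

# In another: follow the service events showing the load balancer losing and regaining hosts
kubectl describe service hello-world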
I was able to solve this by using a loadBalancer for the Master API (internal domain).
In the kops cluster spec, change:
api:
  dns: {}
to:
api:
  loadBalancer:
    type: Internal
    useForInternalApi: true
Then remove both the old internal and external DNS A records from Route 53 and run terraform apply (if you use Terraform).
I guess the DNS approach is not safe...
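For anyone applying this, a rough outline of the kops side of the change (cluster name and state store are placeholders; the Route 53 record cleanup and the terraform apply mentioned above still have to be done separately):

# Switch api.dns to api.loadBalancer in the cluster spec, as shown above
kops edit cluster --name <cluster-name> --state <state-store>

# Preview the change, then apply it (add --target=terraform if the cluster is managed via Terraform)
kops update cluster --name <cluster-name> --state <state-store>
kops update cluster --name <cluster-name> --state <state-store> --yes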
@shamil RE: https://github.com/kubernetes/kops/issues/6349#issuecomment-504414022
This is the exact same behaviour we are experiencing. The only way I've been able to mitigate it so far is to pass the --cloudonly flag when doing a kops rolling-update on the master node.
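For reference, a sketch of that workaround using the instance group from the original report (--cloudonly makes kops roll the instance through the cloud API only, skipping the usual drain and validate steps against the Kubernetes API):

kops rolling-update cluster --instance-group=master-eu-central-1b --cloudonly --yes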
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Didn't get anywhere with this, and the only solution I found was to do this: https://github.com/kubernetes/kops/issues/6349#issuecomment-504657476