1. What kops version are you running? The command kops version will display
this information.
Version 1.12.0-alpha.1 (git-511a44c67)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running; otherwise, provide the Kubernetes version
specified as a kops flag.
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T06:59:37Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes && kops rolling-update --yes
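For reference, the fully spelled-out form of these commands (assuming the cluster name from the manifest below; adjust for your own cluster) is:
kops update cluster --name shane.dev.example.com --yes
kops rolling-update cluster --name shane.dev.example.com --yes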
5. What happened after the commands executed?
Calico fails after the rolling update.
6. What did you expect to happen?
Calico to be migrated to version 3
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  name: shane.dev.example.com
spec:
  additionalPolicies:
    master: |
      [{
        "Effect": "Allow",
        "Action": [
          "autoscaling:DescribeAutoScalingInstances"
        ],
        "Resource": [
          "*"
        ],
        "Condition": {
          "StringEquals": {
            "autoscaling:ResourceTag/KubernetesCluster": "shane.dev.example.com"
          }
        }
      }]
    node: |
      [{
        "Effect": "Allow",
        "Action": ["sts:AssumeRole"],
        "Resource": ["*"]
      },{
        "Effect": "Deny",
        "Action": ["sts:AssumeRole"],
        "Resource": ["arn:aws:iam:::role/Admin"]
      }]
  api:
    loadBalancer:
      idleTimeoutSeconds: 4000
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    disableSecurityGroupIngress: true
  cloudProvider: aws
  configBase: s3://example-terraform/dev/kops/shane.dev.example.com
  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: main
    provider: Legacy
    version: 3.2.24
  - enableEtcdTLS: true
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-us-west-2a
      name: a
    - encryptedVolume: true
      instanceGroup: master-us-west-2b
      name: b
    - encryptedVolume: true
      instanceGroup: master-us-west-2c
      name: c
    name: events
    provider: Legacy
    version: 3.2.24
  fileAssets:
  - content: |
      # https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/audit/audit-policy.yaml
      # https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L735
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
        # The following requests were manually identified as high-volume and low-risk,
        # so drop them.
        - level: None
          users: ["system:kube-proxy"]
          verbs: ["watch"]
          resources:
            - group: "" # core
              resources: ["endpoints", "services", "services/status"]
        - level: None
          # Ingress controller reads 'configmaps/ingress-uid' through the unsecured port.
          # TODO(#46983): Change this to the ingress controller service account.
          users: ["system:unsecured"]
          namespaces: ["kube-system"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["configmaps"]
        - level: None
          users: ["kubelet"] # legacy kubelet identity
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["nodes", "nodes/status"]
        - level: None
          userGroups: ["system:nodes"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["nodes", "nodes/status"]
        - level: None
          users:
            - system:kube-controller-manager
            - system:kube-scheduler
            - system:serviceaccount:kube-system:endpoint-controller
          verbs: ["get", "update"]
          namespaces: ["kube-system"]
          resources:
            - group: "" # core
              resources: ["endpoints"]
        - level: None
          users: ["system:apiserver"]
          verbs: ["get"]
          resources:
            - group: "" # core
              resources: ["namespaces", "namespaces/status", "namespaces/finalize"]
        - level: None
          users: ["cluster-autoscaler"]
          verbs: ["get", "update"]
          namespaces: ["kube-system"]
          resources:
            - group: "" # core
              resources: ["configmaps", "endpoints"]
        # Don't log HPA fetching metrics.
        - level: None
          users:
            - system:kube-controller-manager
          verbs: ["get", "list"]
          resources:
            - group: "metrics.k8s.io"
        # Don't log these read-only URLs.
        - level: None
          nonResourceURLs:
            - /healthz*
            - /version
            - /swagger*
        # Don't log events requests.
        - level: None
          resources:
            - group: "" # core
              resources: ["events"]
        # node and pod status calls from nodes are high-volume and can be large, don't log responses for expected updates from nodes
        - level: Request
          users: ["kubelet", "system:node-problem-detector", "system:serviceaccount:kube-system:node-problem-detector"]
          verbs: ["update","patch"]
          resources:
            - group: "" # core
              resources: ["nodes/status", "pods/status"]
          omitStages:
            - "RequestReceived"
        - level: Request
          userGroups: ["system:nodes"]
          verbs: ["update","patch"]
          resources:
            - group: "" # core
              resources: ["nodes/status", "pods/status"]
          omitStages:
            - "RequestReceived"
        # deletecollection calls can be large, don't log responses for expected namespace deletions
        - level: Request
          users: ["system:serviceaccount:kube-system:namespace-controller"]
          verbs: ["deletecollection"]
          omitStages:
            - "RequestReceived"
        # Secrets, ConfigMaps, and TokenReviews can contain sensitive & binary data,
        # so only log at the Metadata level.
        - level: Metadata
          resources:
            - group: "" # core
              resources: ["secrets", "configmaps"]
            - group: authentication.k8s.io
              resources: ["tokenreviews"]
          omitStages:
            - "RequestReceived"
        # A catch-all rule to log all other requests at the Metadata level.
        - level: Metadata
          # Long-running requests like watches that fall under this rule will not
          # generate an audit event in RequestReceived.
          omitStages:
            - "RequestReceived"
    name: audit.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    auditLogPath: '-'
    auditPolicyFile: /srv/kubernetes/assets/audit.yaml
    oidcClientID: kubernetes
    oidcGroupsClaim: groups
    oidcIssuerURL: https://dex.dev.example.com
    oidcUsernameClaim: email
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.12.4
  masterInternalName: api.internal.shane.dev.example.com
  masterPublicName: api.shane.dev.example.com
  networkCIDR: 10.40.0.0/16
  networkID: vpc-4dd43e34
  networking:
    calico:
      majorVersion: v3
  nodePortAccess:
  - 10.40.0.0/16
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - cidr: 10.40.0.0/22
    id: subnet-0490c74c
    name: us-west-2a
    type: Private
    zone: us-west-2a
  - cidr: 10.40.4.0/22
    id: subnet-f910279f
    name: us-west-2b
    type: Private
    zone: us-west-2b
  - cidr: 10.40.8.0/22
    id: subnet-f608e4ac
    name: us-west-2c
    type: Private
    zone: us-west-2c
  - cidr: 10.40.128.0/22
    id: subnet-c29ec98a
    name: utility-us-west-2a
    type: Utility
    zone: us-west-2a
  - cidr: 10.40.132.0/22
    id: subnet-911027f7
    name: utility-us-west-2b
    type: Utility
    zone: us-west-2b
  - cidr: 10.40.136.0/22
    id: subnet-f708e4ad
    name: utility-us-west-2c
    type: Utility
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
  updatePolicy: external
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2a
  role: Master
  subnets:
  - us-west-2a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2b
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2b
  role: Master
  subnets:
  - us-west-2b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: master-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-west-2c
  role: Master
  subnets:
  - us-west-2c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2a
  role: Node
  subnets:
  - us-west-2a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:55Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2b
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2b
  role: Node
  subnets:
  - us-west-2b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-us-west-2c
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: m4.xlarge
  maxSize: 6
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-us-west-2c
  role: Node
  subnets:
  - us-west-2c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-west-cpu-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/cpu: ""
    k8s.io/cluster-autoscaler/node-template/taint/cpu: ""
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: c5.4xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-west-cpu-2a
  role: Node
  subnets:
  - us-west-2a
  taints:
  - role=cpu:NoSchedule
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-03-18T17:45:56Z
  labels:
    kops.k8s.io/cluster: shane.dev.example.com
  name: nodes-west-mem-2a
spec:
  additionalSecurityGroups:
  - sg-661c951a
  cloudLabels:
    customer: internal
    environment: dev
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/node-template/label/mem: ""
    k8s.io/cluster-autoscaler/node-template/taint/mem: ""
    kubernetes.io/cluster/shane.dev.example.com: ""
    service: kubernetes
    team: is-prod-down
  image: ami-010b9028f89a81d66
  machineType: r4.4xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-west-mem-2a
  role: Node
  subnets:
  - us-west-2a
  taints:
  - role=mem:NoSchedule
9. Anything else we need to know?
@justinsb, can we block the next release on this? Unless I did something wrong.
This would block the release, but I don't think we should block the alpha - not everyone will build from source.
The calico upgrade with etcd3 is disruptive though: https://github.com/kubernetes/kops/blob/63943277bc48f1faa7cc773c1c7b2d8127c4f9b3/docs/etcd3-migration.md
Did a kops rolling-update cluster work? It's (sadly) not expected that a standard rolling-update will work.
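For anyone following along, the standard roll being referred to is something like the following (a sketch only, using the cluster name from the manifest above; the linked migration document describes the actual, more disruptive procedure):
kops rolling-update cluster --name shane.dev.example.com          # dry run: show which instance groups need updating
kops rolling-update cluster --name shane.dev.example.com --yes    # perform the rolling update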
Agreed, it should not block the alpha. Let me circle back around and test this again.
So we chatted on Slack, and the problematic configuration seems to be k8s 1.12 + kops 1.11 + etcd3 + legacy + calico. After an upgrade to kops 1.12, it looks like there are two problems:
1) The calico update is applied "on top", which means we keep some references that don't exist in the new version, for example CALICO_ETCD_ENDPOINTS pointing to a ConfigMap (it looks like).
2) The etcd-manager import of the existing cluster doesn't like the https scheme:
error initializing etcd server: scheme not yet implemented: "https://etcd-events-1.internal.calico.awsdata.com:2381"
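A quick way to check for problem 1 (assuming the standard kops Calico DaemonSet name, calico-node, in kube-system) is to look for the stale reference:
kubectl -n kube-system get daemonset calico-node -o yaml | grep -n CALICO_ETCD_ENDPOINTS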
We are running into number 2 as well, but we are not using Calico; we are using Weave and are just upgrading from etcd 3 with TLS (legacy) to etcd v3 (etcd-manager).
We are as well; I'm going to do some additional testing today and tomorrow.
Fixed in 1.12 alpha3 by https://github.com/kubernetes/kops/pull/6682
It does require a rolling update to the masters all at once.
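As a rough sketch of rolling just the masters together (the instance group names are taken from the manifest above; --cloudonly and --force roll the instances without waiting for validation or detected changes, and the exact flags needed may vary):
kops rolling-update cluster --name shane.dev.example.com --instance-group master-us-west-2a,master-us-west-2b,master-us-west-2c --cloudonly --force --yes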
Sorry for not updating this - #6682 should have fixed part 1 (the calico manifest was broken), and #6695 should have fixed part 2 (moving tls-etcd to etcd-manager).