https://kubernetes.slack.com/archives/C3QUFP0QM/p1521739991000074
Summary: Updating the AMI to kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08 on the Kubernetes 1.8 masters and nodes seems to cause them to fail the AWS EC2 instance reachability check and never become healthy. AWS restarts them repeatedly.
1. What kops version are you running? The command kops version will display it.
Version 1.8.1 (git-94ef202)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or the Kubernetes version specified as a kops flag.
--- kubernetes/kops ‹master› » kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:51:28Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:44:09Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --yes
6. What did you expect to happen?
`master to come back up`
7. Please provide your cluster manifest. Execute
`kops get --name my.example.com -oyaml` to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-02-09T23:47:31Z
  name:
spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:AttachVolume"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:DetachVolume"],
          "Resource": ["*"]
        }
      ]
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase:
  etcdClusters:
I ran into this problem as well. I have the problem when using m3.large, but not when using m3.medium.
I see the following crash when I look at the instance system log in AWS: https://gist.github.com/wendorf/91f5a2c77c3cdc277e48c2c22fc0b46b
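For reference, the same panic can be pulled without the AWS console by fetching the instance's console output with the AWS CLI. A minimal sketch, assuming the CLI is configured and using a placeholder instance ID:

# Print the tail of the EC2 system log, where the kernel panic shows up
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text | tail -n 50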
Thanks for moving us a little further @wendorf
@dmcnaught Which instance type were you using? Do you get similar output in the AWS system log?
I am seeing the problem with the m3.large. Same error in the instance system logs.
I hadn't narrowed it down to a particular instance type, but I now remember that I didn't see the problem with c4.xlarge.
I see this with an r3.large in us-west-2 -- same kernel panics at start and end of the logs.
I am seeing this same kernel panic in us-east-1 with the c3.large on master nodes only.
I'm seeing this on us-east-1 with r3.larges as well. Noticed it on non-master nodes. r4.large and i3.large are fine.
I'm seeing this same kernel panic in us-west-2 with c3.large on non-master nodes as well.
We were facing the same issue with r3.large instance types. Upgrading the kernel in the image fixed it.
Image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08
Error in the EC2 system log:
[ 1.132048] Kernel panic - not syncing: Fatal exception
[ 1.133919] Kernel Offset: disabled
The steps below fix the issue, and the instance status checks pass afterward (a scripted sketch of the same steps follows the list).
1. Launched an r3.large instance in ap-south-1 region with ami-640d5f0b.
The instance did not boot up and failed instance status checks as expected.
2. Stopped the instance from the console and, once it was stopped, changed the instance type to r3.xlarge.
3. Started the instance; it came up and passed health checks.
4. Connected to the instance via SSH and switched to root user (sudo su)
5. Checked the kernel version (uname -r) and confirmed that it is 4.4.115
6. Searched for and installed a newer kernel (4.4.121) with the steps below:
a. apt-cache search linux-image
b. apt-get install linux-image-4.4.121
c. apt-get update
7. Rebooted the operating system so the kernel upgrade would take effect.
8. Confirmed that the kernel was upgraded (uname -r showed 4.4.121).
9. Stopped the instance and switched it back to the r3.large instance type.
10. Started the instance; it came up and passed the instance status checks this time.
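A rough shell sketch of the same workaround using the AWS CLI. It is not from the original report: the instance ID and kernel package name are placeholders, and apt-get update is moved ahead of the install, which is the usual ordering.

# Steps 1-3: move the instance to a type that boots so we can get a shell
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"r3.xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# Steps 4-8: on the instance (over SSH, as root), upgrade the kernel and reboot
uname -r                                  # old kernel, e.g. 4.4.115
apt-cache search linux-image              # find a newer kernel package
apt-get update
apt-get install -y linux-image-4.4.121    # placeholder package name
reboot

# Steps 9-10: switch back to the original instance type
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"r3.large\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0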
@dmcnaught
@chrislovecnm @justinsb
The issue seems critical and should be fixed soon, since this is the latest publicly recommended AMI and it is picked up by default in every Kubernetes installation done with kops.
Would love to fix this.
I've tried manually changing the image to kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11 on m3.large. This seems to have worked.
EDIT: I also tried kope.io/k8s-1.9-debian-stretch-amd64-hvm-ebs-2018-03-11, but the node won't connect; I have to log in and manually run service kubelet restart for it to become visible on the dashboard.
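For anyone else who wants to try the newer image, this is a minimal sketch of the kops workflow I would expect to work; the instance group name "nodes" and the cluster name are placeholders, not taken from this issue:

# Point the instance group at the newer kope.io image
kops edit ig nodes --name my.example.com
#   ...and in the editor set:
#   spec:
#     image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
kops update cluster --name my.example.com --yes
kops rolling-update cluster --name my.example.com --yes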
@chrislovecnm Hey Chris, please suggest how I can help fix this.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.