kops ami might have a problem with k8s 1.8 (kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08)

Created on 22 Mar 2018 · 18 Comments · Source: kubernetes/kops

https://kubernetes.slack.com/archives/C3QUFP0QM/p1521739991000074

Summary: Updating the AMI to kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08 on the Kubernetes 1.8 masters and nodes seems to cause them to fail the AWS EC2 instance reachability check and never become healthy. AWS restarts them repeatedly.

1. What kops version are you running? The command `kops version` will display
   this information.
   Version 1.8.1 (git-94ef202)
2. What Kubernetes version are you running? `kubectl version` will print the
   version if a cluster is running or provide the Kubernetes version specified as
   a kops flag.

   ```
   --- kubernetes/kops ‹master› » kubectl version
   Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:51:28Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
   Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:44:09Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
   ```
3. What cloud provider are you using?
   aws
4. What commands did you run? What is the simplest way to reproduce this issue?
   `kops rolling update cluster --yes`
5. What happened after the commands executed?
   The first master doesn't come up when moving to kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08. The instance reachability check fails and the instance is restarted many times. This happened when changing only the AMI (kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14 -> kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08), and also when changing both the AMI and the Kubernetes version from 1.8.8 to 1.8.10.
6. What did you expect to happen?
   The master to come back up.
7. Please provide your cluster manifest. Execute
   `kops get --name my.example.com -oyaml` to display your cluster manifest.
   You may want to remove your cluster name and other sensitive information.

   ```yaml
   apiVersion: kops/v1alpha2
   kind: Cluster
   metadata:
     creationTimestamp: 2017-02-09T23:47:31Z
     name:
   spec:
     additionalPolicies:
       node: |
         [
           {
             "Effect": "Allow",
             "Action": ["ec2:AttachVolume"],
             "Resource": ["*"]
           },
           {
             "Effect": "Allow",
             "Action": ["ec2:DetachVolume"],
             "Resource": ["*"]
           }
         ]
     api:
       dns: {}
     authorization:
       rbac: {}
     channel: stable
     cloudProvider: aws
     configBase:
     etcdClusters:
     - etcdMembers:
       - instanceGroup: master-us-east-1a
         name: a
       - instanceGroup: master-us-east-1c
         name: c
       - instanceGroup: master-us-east-1d
         name: d
       name: main
     - etcdMembers:
       - instanceGroup: master-us-east-1a
         name: a
       - instanceGroup: master-us-east-1c
         name: c
       - instanceGroup: master-us-east-1d
         name: d
       name: events
     iam:
       legacy: false
     kubernetesApiAccess:
     kubernetesVersion: 1.8.10
     masterInternalName:
     masterPublicName:
     networkCIDR: 10.101.0.0/16
     networking:
       kubenet: {}
     nonMasqueradeCIDR: 100.64.0.0/10
     sshAccess:
     subnets:
     - cidr: 10.101.32.0/19
       name: us-east-1a
       type: Public
       zone: us-east-1a
     - cidr: 10.101.64.0/19
       name: us-east-1c
       type: Public
       zone: us-east-1c
     - cidr: 10.101.96.0/19
       name: us-east-1d
       type: Public
       zone: us-east-1d
     - cidr: 10.101.128.0/19
       name: us-east-1e
       type: Public
       zone: us-east-1e
     topology:
       dns:
         type: Public
       masters: public
       nodes: public
   ```

8. Please run the commands with the most verbose logging by adding the `-v 10` flag.
   Paste the logs into this report, or into a gist and provide the gist link here.
9. Anything else we need to know?



All 18 comments

I ran into this problem as well. I have the problem when using m3.large, but not when using m3.medium.

I see the following crash when I look at the instance system log in AWS: https://gist.github.com/wendorf/91f5a2c77c3cdc277e48c2c22fc0b46b

Thanks for moving us a little further @wendorf

@dmcnaught Which instance type were you using? Do you get similar output in the AWS system log?

I am seeing the problem with the m3.large. Same error in the instance system logs.

I hadn't narrowed it down to a particular instance type, but I now remember I didn't see the problem with c4.xlarge.

I see this with an r3.large in us-west-2 -- same kernel panics at start and end of the logs.

I am seeing this same kernel panic in us-east-1 with the c3.large on master nodes only.

I'm seeing this on us-east-1 with r3.large as well. Noticed it on non-master nodes. r4.large and i3.large are fine.

I'm seeing this same kernel panic in us-west-2 with c3.large on non-master nodes as well.

We were facing the same issue with r3.large instance types. Upgrading the kernel in the image fixes it.

Image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08

Error in the EC2 system log:

```
[    1.132048] Kernel panic - not syncing: Fatal exception
[    1.133919] Kernel Offset: disabled
```
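Spotting this panic line when pulling console output from affected instances can be scripted. A minimal, hypothetical sketch in Python (not part of kops; the sample text is the two log lines from this report, whereas in practice it would come from the EC2 console output):

```python
# Sample console output taken from the kernel panic reported in this thread.
SAMPLE_CONSOLE_OUTPUT = """\
[    1.132048] Kernel panic - not syncing: Fatal exception
[    1.133919] Kernel Offset: disabled
"""

def has_kernel_panic(console_text: str) -> bool:
    """Return True if any line of the console output records a kernel panic."""
    return any("Kernel panic" in line for line in console_text.splitlines())

print(has_kernel_panic(SAMPLE_CONSOLE_OUTPUT))   # True: affected instance
print(has_kernel_panic("[ OK ] Started kubelet"))  # False: clean boot log
```

Scanning for the panic string this way makes it quick to check a batch of instances instead of reading each system log by hand.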

The steps below fix the issue, and the instance status check passes afterwards.

1. Launched an r3.large instance in the ap-south-1 region with ami-640d5f0b.
The instance did not boot up and failed instance status checks, as expected.
2. Stopped the instance from the console and, after it was stopped, changed the instance type to r3.xlarge.
3. Started the instance; it came up and passed health checks.
4. Connected to the instance via SSH and switched to the root user (sudo su).
5. Checked the kernel version (uname -r) and confirmed it was 4.4.115.
6. Searched for and installed the newer kernel (4.4.121) with the steps below:
        a. apt-cache search linux-image
        b. apt-get install linux-image-4.4.121
        c. apt-get update
7. Rebooted the operating system so the kernel upgrade would take effect.
8. Confirmed that the kernel was upgraded (uname -r), which showed 4.4.121.
9. Stopped the instance and switched it back to the r3.large instance type.
10. Started the instance; it came up and passed instance status checks this time.
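As a rough illustration of the version check in steps 5 and 8, a small hypothetical helper (not part of kops) can flag kernels older than the 4.4.121 release that booted cleanly, assuming dotted-numeric version strings like the ones printed by `uname -r` above:

```python
def kernel_needs_upgrade(current: str, fixed: str = "4.4.121") -> bool:
    """Return True if `current` is numerically older than the fixed kernel."""
    def parts(version: str) -> tuple:
        # Keep only the leading dotted-numeric component,
        # e.g. "4.4.115-something" -> (4, 4, 115).
        head = version.split("-")[0]
        return tuple(int(p) for p in head.split("."))
    return parts(current) < parts(fixed)

print(kernel_needs_upgrade("4.4.115"))  # True: the kernel in the affected image
print(kernel_needs_upgrade("4.4.121"))  # False: already at the fixed release
```

Comparing the parsed tuples avoids the string-comparison trap where "4.4.9" would sort after "4.4.115".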

@dmcnaught
@chrislovecnm @justinsb
The issue seems critical and should be fixed soon, since this is the latest publicly recommended AMI and is used by default in every Kubernetes installation done with kops.
Would love to fix this.

I've tried manually changing the image to kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11 running on m3.large. This seems to have worked.

EDIT: I also tried kope.io/k8s-1.9-debian-stretch-amd64-hvm-ebs-2018-03-11, but the node won't connect; I have to log in and manually run `service kubelet restart` for it to become visible on the dashboard.

@chrislovecnm Hey chris, please suggest how can I help to fix this.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
