kops ami might have a problem with k8s 1.8 (kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08)

Created on 22 Mar 2018 · 18 Comments · Source: kubernetes/kops

https://kubernetes.slack.com/archives/C3QUFP0QM/p1521739991000074

Summary: Updating the AMI to kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08 on the Kubernetes 1.8 masters and nodes seems to cause them to fail the AWS EC2 instance reachability check and never become healthy. AWS restarts them repeatedly.

1. What kops version are you running? The command `kops version` will display
   this information.
   Version 1.8.1 (git-94ef202)
2. What Kubernetes version are you running? `kubectl version` will print the
   version if a cluster is running or provide the Kubernetes version specified as
   a kops flag.

   ```
   --- kubernetes/kops ‹master› » kubectl version
   Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:51:28Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
   Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.10", GitCommit:"044cd262c40234014f01b40ed7b9d09adbafe9b1", GitTreeState:"clean", BuildDate:"2018-03-19T17:44:09Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
   ```
3. What cloud provider are you using?
   aws
4. What commands did you run? What is the simplest way to reproduce this issue?
   `kops rolling update cluster --yes`
5. What happened after the commands executed?
   The first master doesn't come up when moving to kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08. The instance reachability check fails and the instance is restarted many times. This happened when changing only the AMI (kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14 -> kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08), and also when changing both the AMI and the Kubernetes version from 1.8.8 to 1.8.10.
6. What did you expect to happen?
   The master to come back up.
7. Please provide your cluster manifest. Execute
   `kops get --name my.example.com -oyaml` to display your cluster manifest.
   You may want to remove your cluster name and other sensitive information.

   ```yaml
   apiVersion: kops/v1alpha2
   kind: Cluster
   metadata:
     creationTimestamp: 2017-02-09T23:47:31Z
     name:
   spec:
     additionalPolicies:
       node: |
         [
           {
             "Effect": "Allow",
             "Action": ["ec2:AttachVolume"],
             "Resource": ["*"]
           },
           {
             "Effect": "Allow",
             "Action": ["ec2:DetachVolume"],
             "Resource": ["*"]
           }
         ]
     api:
       dns: {}
     authorization:
       rbac: {}
     channel: stable
     cloudProvider: aws
     configBase:
     etcdClusters:
     - etcdMembers:
       - instanceGroup: master-us-east-1a
         name: a
       - instanceGroup: master-us-east-1c
         name: c
       - instanceGroup: master-us-east-1d
         name: d
       name: main
     - etcdMembers:
       - instanceGroup: master-us-east-1a
         name: a
       - instanceGroup: master-us-east-1c
         name: c
       - instanceGroup: master-us-east-1d
         name: d
       name: events
     iam:
       legacy: false
     kubernetesApiAccess:
     kubernetesVersion: 1.8.10
     masterInternalName:
     masterPublicName:
     networkCIDR: 10.101.0.0/16
     networking:
       kubenet: {}
     nonMasqueradeCIDR: 100.64.0.0/10
     sshAccess:
     subnets:
     - cidr: 10.101.32.0/19
       name: us-east-1a
       type: Public
       zone: us-east-1a
     - cidr: 10.101.64.0/19
       name: us-east-1c
       type: Public
       zone: us-east-1c
     - cidr: 10.101.96.0/19
       name: us-east-1d
       type: Public
       zone: us-east-1d
     - cidr: 10.101.128.0/19
       name: us-east-1e
       type: Public
       zone: us-east-1e
     topology:
       dns:
         type: Public
       masters: public
       nodes: public
   ```

8. Please run the commands with the most verbose logging by adding the `-v 10` flag.
   Paste the logs into this report, or into a gist and provide the gist link here.
9. Anything else we need to know?



All 18 comments

I ran into this problem as well. I have the problem when using m3.large, but not when using m3.medium.

I see the following crash when I look at the instance system log in AWS: https://gist.github.com/wendorf/91f5a2c77c3cdc277e48c2c22fc0b46b

Thanks for moving us a little further @wendorf

@dmcnaught Which instance type were you using? Do you get similar output in the AWS system log?

I am seeing the problem with the m3.large. Same error in the instance system logs.

I hadn't narrowed it down to a particular instance type, but I now remember I didn't see the problem with c4.xlarge.

I see this with an r3.large in us-west-2 -- same kernel panics at start and end of the logs.

I am seeing this same kernel panic in us-east-1 with the c3.large on master nodes only.

I'm seeing this on us-east-1 with r3.large as well. Noticed it on non-master nodes. r4.large and i3.large are fine.

I'm seeing this same kernel panic in us-west-2 with c3.large on non-master nodes as well.

We were facing the same issue with r3.large instance types. Upgrading the kernel in the image fixes it.

Image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08

Error in the EC2 system log:

```
[    1.132048] Kernel panic - not syncing: Fatal exception
[    1.133919] Kernel Offset: disabled
```
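Spotting this panic line when pulling console output from affected instances can be scripted. A minimal, hypothetical sketch in Python (not part of kops; the sample text is the two log lines from this report, whereas in practice it would come from the EC2 console output):

```python
# Sample console output taken from the kernel panic reported in this thread.
SAMPLE_CONSOLE_OUTPUT = """\
[    1.132048] Kernel panic - not syncing: Fatal exception
[    1.133919] Kernel Offset: disabled
"""

def has_kernel_panic(console_text: str) -> bool:
    """Return True if any line of the console output records a kernel panic."""
    return any("Kernel panic" in line for line in console_text.splitlines())

print(has_kernel_panic(SAMPLE_CONSOLE_OUTPUT))   # True: affected instance
print(has_kernel_panic("[ OK ] Started kubelet"))  # False: clean boot log
```

Scanning for the panic string this way makes it quick to check a batch of instances instead of reading each system log by hand.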

The steps below fix the issue, and the instance status check passes afterwards.

1. Launched an r3.large instance in the ap-south-1 region with ami-640d5f0b.
The instance did not boot up and failed instance status checks, as expected.
2. Stopped the instance from the console and, after it was stopped, changed the instance type to r3.xlarge.
3. Started the instance; it came up and passed health checks.
4. Connected to the instance via SSH and switched to the root user (sudo su).
5. Checked the kernel version (uname -r) and confirmed it was 4.4.115.
6. Searched for and installed the newer kernel (4.4.121) with the steps below:
        a. apt-cache search linux-image
        b. apt-get install linux-image-4.4.121
        c. apt-get update
7. Rebooted the operating system so the kernel upgrade would take effect.
8. Confirmed that the kernel was upgraded (uname -r), which showed 4.4.121.
9. Stopped the instance and switched it back to the r3.large instance type.
10. Started the instance; it came up and passed instance status checks this time.
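As a rough illustration of the version check in steps 5 and 8, a small hypothetical helper (not part of kops) can flag kernels older than the 4.4.121 release that booted cleanly, assuming dotted-numeric version strings like the ones printed by `uname -r` above:

```python
def kernel_needs_upgrade(current: str, fixed: str = "4.4.121") -> bool:
    """Return True if `current` is numerically older than the fixed kernel."""
    def parts(version: str) -> tuple:
        # Keep only the leading dotted-numeric component,
        # e.g. "4.4.115-something" -> (4, 4, 115).
        head = version.split("-")[0]
        return tuple(int(p) for p in head.split("."))
    return parts(current) < parts(fixed)

print(kernel_needs_upgrade("4.4.115"))  # True: the kernel in the affected image
print(kernel_needs_upgrade("4.4.121"))  # False: already at the fixed release
```

Comparing the parsed tuples avoids the string-comparison trap where "4.4.9" would sort after "4.4.115".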

@dmcnaught
@chrislovecnm @justinsb
The issue seems critical and should be fixed soon, since this is the latest publicly recommended AMI and is used by default in every Kubernetes installation done with kops.
Would love to fix this.

I've tried manually changing the image to kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11 running on m3.large. This seems to have worked.

EDIT: I also tried kope.io/k8s-1.9-debian-stretch-amd64-hvm-ebs-2018-03-11, but the node won't connect; I have to log in and manually run `service kubelet restart` for it to become visible on the dashboard.

@chrislovecnm Hey chris, please suggest how can I help to fix this.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
