------------- BUG REPORT TEMPLATE --------------------
What kops version are you running? The command kops version will display this information.
1.8.1
What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
1.8.6
AWS
I provisioned a cluster using kops-generated Terraform that I modified somewhat to work with the rest of my infrastructure. The cluster had been running for weeks without issue.
Yesterday, I tore down the cluster and rebuilt it (terraform destroy/apply). The cluster will not come back up. Instead, protokube hangs, waiting for the etcd volumes to attach. Here is the log output. The "waiting for volume to be attached" message repeats endlessly. I have confirmed via AWS CLI and console that the EBS volume is attached to the EC2 instance.
Mar 30 22:30:37 ip-172-31-1-13 systemd[1]: Starting Kubernetes Protokube Service...
Mar 30 22:30:37 ip-172-31-1-13 systemd[1]: Started Kubernetes Protokube Service.
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: protokube version 0.1
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.674785 1576 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.676202 1576 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.676962 1576 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.678841 1576 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.781307 1576 aws_volume.go:72] AWS API Request: ec2/DescribeVolumes
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.781410 1576 dnscontroller.go:101] starting DNS controller
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.781437 1576 dnscache.go:75] querying all DNS zones (no cached results)
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.782471 1576 route53.go:50] AWS request: route53 ListHostedZones
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.911045 1576 volume_mounter.go:254] Trying to mount master volume: "vol-0f1a04a36c6baaaae"
Mar 30 22:30:37 ip-172-31-1-13 docker[1546]: I0330 22:30:37.911510 1576 aws_volume.go:72] AWS API Request: ec2/AttachVolume
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.184418 1576 aws_volume.go:396] AttachVolume request returned {
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: AttachTime: 2018-03-30 22:30:38.164 +0000 UTC,
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: Device: "/dev/xvdu",
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: InstanceId: "i-0d46fbb1317501ac0",
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: State: "attaching",
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: VolumeId: "vol-0f1a04a36c6baaaae"
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: }
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.184693 1576 aws_volume.go:72] AWS API Request: ec2/DescribeVolumes
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.268938 1576 volume_mounter.go:254] Trying to mount master volume: "vol-020d90a464b55678f"
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.269163 1576 aws_volume.go:72] AWS API Request: ec2/AttachVolume
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.560245 1576 aws_volume.go:396] AttachVolume request returned {
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: AttachTime: 2018-03-30 22:30:38.543 +0000 UTC,
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: Device: "/dev/xvdv",
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: InstanceId: "i-0d46fbb1317501ac0",
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: State: "attaching",
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: VolumeId: "vol-020d90a464b55678f"
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: }
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.560654 1576 aws_volume.go:72] AWS API Request: ec2/DescribeVolumes
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.658480 1576 volume_mounter.go:273] Currently attached volumes: [{"ID":"vol-0f1a04a36c6baaaae","LocalDevice":"/dev/xvdu","AttachedTo":"","Mountpoint":"","Status":"available","Info":{"Description":"vol-0f1a04a36c6baaaae","EtcdClusters":[{"clusterKey":"main","nodeName":"b","nodeNames":["a","b","c"]}]}} {"ID":"vol-020d90a464b55678f","LocalDevice":"/dev/xvdv","AttachedTo":"","Mountpoint":"","Status":"available","Info":{"Description":"vol-020d90a464b55678f","EtcdClusters":[{"clusterKey":"events","nodeName":"b","nodeNames":["a","b","c"]}]}}]
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.658815 1576 volume_mounter.go:58] Master volume "vol-0f1a04a36c6baaaae" is attached at "/dev/xvdu"
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.659340 1576 volume_mounter.go:72] Doing safe-format-and-mount of /dev/xvdu to /mnt/master-vol-0f1a04a36c6baaaae
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.659365 1576 aws_volume.go:318] nvme path not found "/rootfs/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0f1a04a36c6baaaae"
Mar 30 22:30:38 ip-172-31-1-13 docker[1546]: I0330 22:30:38.659373 1576 volume_mounter.go:107] Waiting for volume "vol-0f1a04a36c6baaaae" to be attached
Mar 30 22:30:39 ip-172-31-1-13 docker[1546]: I0330 22:30:39.659499 1576 aws_volume.go:318] nvme path not found "/rootfs/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0f1a04a36c6baaaae"
Mar 30 22:30:39 ip-172-31-1-13 docker[1546]: I0330 22:30:39.659519 1576 volume_mounter.go:107] Waiting for volume "vol-0f1a04a36c6baaaae" to be attached
Mar 30 22:30:40 ip-172-31-1-13 docker[1546]: I0330 22:30:40.659641 1576 aws_volume.go:318] nvme path not found "/rootfs/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0f1a04a36c6baaaae"
Mar 30 22:30:40 ip-172-31-1-13 docker[1546]: I0330 22:30:40.659660 1576 volume_mounter.go:107] Waiting for volume "vol-0f1a04a36c6baaaae" to be attached
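The attachment state AWS reports can be cross-checked from the CLI, and the device naming can be inspected on the instance itself. A rough sketch, reusing the volume ID from the log above (in hindsight, given the resolution further down, the relevant detail is that M5/Nitro instances expose EBS volumes as NVMe devices, so the /dev/xvdu device and the by-id symlink protokube is polling for never appear on the Jessie image):

# Sketch only; the volume ID is taken from the log above.
# What does AWS think the attachment state and device name are?
aws ec2 describe-volumes \
  --volume-ids vol-0f1a04a36c6baaaae \
  --query 'Volumes[0].Attachments[0].[InstanceId,Device,State]' \
  --output table

# On the instance: is the volume visible as /dev/xvdu, or only as an NVMe device?
ls -l /dev/disk/by-id/
ls /dev/xvd* /dev/nvme* 2>/dev/null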
The cluster should start normally.
Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest.
metadata:
creationTimestamp: 2018-02-09T00:29:28Z
name: redacted
spec:
api:
loadBalancer:
type: Public
authorization:
rbac: {}
channel: stable
cloudProvider: aws
clusterDNSDomain: cluster.local
configBase: s3://redacted
configStore: s3://redacted
dnsZone: redacted
docker:
bridge: ""
ipMasq: false
ipTables: false
logDriver: json-file
logLevel: warn
logOpt:
- max-size=10m
- max-file=5
storage: overlay,aufs
version: 1.13.1
etcdClusters:
- etcdMembers:
- encryptedVolume: true
instanceGroup: master-us-west-2a
name: a
- encryptedVolume: true
instanceGroup: master-us-west-2b
name: b
- encryptedVolume: true
instanceGroup: master-us-west-2c
name: c
name: main
version: 2.2.1
- etcdMembers:
- encryptedVolume: true
instanceGroup: master-us-west-2a
name: a
- encryptedVolume: true
instanceGroup: master-us-west-2b
name: b
- encryptedVolume: true
instanceGroup: master-us-west-2c
name: c
name: events
version: 2.2.1
iam:
allowContainerRegistry: true
legacy: false
keyStore: s3://redacted/pki
kubeAPIServer:
address: 127.0.0.1
admissionControl:
- Initializers
- NamespaceLifecycle
- LimitRanger
- ServiceAccount
- PersistentVolumeLabel
- DefaultStorageClass
- DefaultTolerationSeconds
- NodeRestriction
- Priority
- ResourceQuota
allowPrivileged: true
anonymousAuth: false
apiServerCount: 3
authorizationMode: RBAC
cloudProvider: aws
etcdServers:
- http://127.0.0.1:4001
etcdServersOverrides:
- /events#http://127.0.0.1:4002
image: gcr.io/google_containers/kube-apiserver:v1.8.6
insecurePort: 8080
kubeletPreferredAddressTypes:
- InternalIP
- Hostname
- ExternalIP
logLevel: 2
requestheaderAllowedNames:
- aggregator
requestheaderExtraHeaderPrefixes:
- X-Remote-Extra-
requestheaderGroupHeaders:
- X-Remote-Group
requestheaderUsernameHeaders:
- X-Remote-User
securePort: 443
serviceClusterIPRange: 100.64.0.0/13
storageBackend: etcd2
kubeControllerManager:
allocateNodeCIDRs: true
attachDetachReconcileSyncPeriod: 1m0s
cloudProvider: aws
clusterCIDR: 100.96.0.0/11
clusterName: redacted
configureCloudRoutes: false
image: gcr.io/google_containers/kube-controller-manager:v1.8.6
leaderElection:
leaderElect: true
logLevel: 2
useServiceAccountCredentials: true
kubeDNS:
domain: cluster.local
replicas: 2
serverIP: 100.64.0.10
kubeProxy:
clusterCIDR: 100.96.0.0/11
cpuRequest: 100m
featureGates: null
hostnameOverride: '@aws'
image: gcr.io/google_containers/kube-proxy:v1.8.6
logLevel: 2
kubeScheduler:
image: gcr.io/google_containers/kube-scheduler:v1.8.6
leaderElection:
leaderElect: true
logLevel: 2
kubelet:
allowPrivileged: true
cgroupRoot: /
cloudProvider: aws
clusterDNS: 100.64.0.10
clusterDomain: cluster.local
enableDebuggingHandlers: true
evictionHard: memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
featureGates:
ExperimentalCriticalPodAnnotation: "true"
hostnameOverride: '@aws'
kubeconfigPath: /var/lib/kubelet/kubeconfig
logLevel: 2
networkPluginMTU: 9001
networkPluginName: kubenet
nonMasqueradeCIDR: 100.64.0.0/10
podInfraContainerImage: gcr.io/google_containers/pause-amd64:3.0
podManifestPath: /etc/kubernetes/manifests
requireKubeconfig: true
kubernetesApiAccess:
- 0.0.0.0/0
kubernetesVersion: 1.8.6
masterInternalName: api.internal.redacted
masterKubelet:
allowPrivileged: true
cgroupRoot: /
cloudProvider: aws
clusterDNS: 100.64.0.10
clusterDomain: cluster.local
enableDebuggingHandlers: true
evictionHard: memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
featureGates:
ExperimentalCriticalPodAnnotation: "true"
hostnameOverride: '@aws'
kubeconfigPath: /var/lib/kubelet/kubeconfig
logLevel: 2
networkPluginMTU: 9001
networkPluginName: kubenet
nonMasqueradeCIDR: 100.64.0.0/10
podInfraContainerImage: gcr.io/google_containers/pause-amd64:3.0
podManifestPath: /etc/kubernetes/manifests
registerSchedulable: false
requireKubeconfig: true
masterPublicName: api.redacted
networkCIDR: 172.31.0.0/22
networking:
kopeio: {}
nonMasqueradeCIDR: 100.64.0.0/10
secretStore: s3://redacted/secrets
serviceClusterIPRange: 100.64.0.0/13
sshAccess:
- 0.0.0.0/0
subnets:
- id: subnet-e89740a3
name: us-west-2a
type: Private
zone: us-west-2a
- id: subnet-5967d220
name: us-west-2b
type: Private
zone: us-west-2b
- id: subnet-4c23b616
name: us-west-2c
type: Private
zone: us-west-2c
- id: subnet-e99740a2
name: utility-us-west-2a
type: Utility
zone: us-west-2a
- id: subnet-4460d53d
name: utility-us-west-2b
type: Utility
zone: us-west-2b
- id: subnet-c822b792
name: utility-us-west-2c
type: Utility
zone: us-west-2c
topology:
bastion:
bastionPublicName: bastion.redacted
dns:
type: Public
masters: private
nodes: private
Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
Anything else do we need to know?
I solved the issue. Apparently, the new M5 AWS instance types are not supported in kops 1.8.1, as it does not yet support NVMe for EBS volumes. Changing the instance type to M4 resolved the issue. There should be a warning when attempting to use unsupported instance types when provisioning a kops cluster.
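For anyone hitting the same thing, a minimal sketch of that workaround with a kops-plus-Terraform workflow like the one described above; the instance group name and machineType values are illustrative rather than taken from this cluster, and $CLUSTER_NAME is a placeholder:

# Sketch: move an affected instance group off the M5 family.
# Instance group name and machine types below are examples only.
kops edit ig master-us-west-2a --name "$CLUSTER_NAME"
# In the editor, change:
#   spec:
#     machineType: m5.large
# to:
#     machineType: m4.large

# Regenerate and apply the Terraform, matching the workflow in this report:
kops update cluster --name "$CLUSTER_NAME" --target=terraform --out=.
terraform plan
terraform apply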
Still the same issue with M5 on kops 1.9.0.
Are you using Debian as your base image? If so, I assume Jessie? If you upgrade to Stretch, that resolves the issue and allows you to use M5 with 1.9.0.
@thereverendtom I have kops 1.9.0 installed, and kops upgrade has nothing to change because the cluster is already on 1.9.3 from the stable channel. The image is still Jessie, and from what I can tell from the channels (stable, alpha) there is no Stretch option. So as kops 1.9.0 stands, deploying 1.9.3 does not support M5 without some manual changes, as far as I can tell: the upgrade does not switch the instance group image to Stretch, so you have to update the instance groups yourself.
Yeah, I think you may have to make a manual edit to the cluster config to specify Stretch instead of Jessie. I have kops generate Terraform and then apply that, so I just made the manual changes before applying the Terraform.
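The Stretch route is the same kind of manual edit, applied to each instance group's image before regenerating the Terraform. A sketch, assuming a kops-published Debian Stretch image exists for your kops/Kubernetes version; the image string below is illustrative, so check what kops currently publishes for your region:

# Sketch: point an instance group at a Debian Stretch image.
# The image string is an example only - verify the current Stretch image
# for your kops release before using it. $CLUSTER_NAME is a placeholder.
kops edit ig nodes --name "$CLUSTER_NAME"
# In the editor, set something like:
#   spec:
#     image: kope.io/k8s-1.9-debian-stretch-amd64-hvm-ebs-2018-03-11
kops update cluster --name "$CLUSTER_NAME" --target=terraform --out=.
terraform apply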