Kops: Pod sandbox issues and stuck at ContainerCreating

Created on 23 Jan 2018 · 15 comments · Source: kubernetes/kops

  1. What kops version are you running? The command kops version will display
     this information.

Version 1.8.0 (git-5099bc5)

  2. What Kubernetes version are you running? kubectl version will print the
     version if a cluster is running or provide the Kubernetes version specified as
     a kops flag.
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:16:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:05:18Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using?

AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?

Applying any configuration that results in the creation of a new container ends up with a pod stuck in ContainerCreating status, followed by sandbox failures (see logs).
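For reference, a minimal sequence that reproduces the symptom (a sketch; the manifest file name and the pod name placeholder are examples, assuming the nginx deployment shown at the end of this report):

kubectl apply -f nginx-deployment.yaml   # any change that creates a new container triggers it
kubectl get pods -l app=nginx            # pods stay in ContainerCreating
kubectl describe pod <nginx-pod>         # the Events section shows the sandbox failures
journalctl -u kubelet -f                 # on the node: the kubelet errors pasted below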

  5. What happened after the commands executed?

First the pod stays in 'ContainerCreating' for a long time; afterwards there are several sandbox errors (see logs).

  6. What did you expect to happen?

A normal deployment, without the hanging.

  7. Please provide your cluster manifest. Execute
     kops get --name my.example.com -oyaml to display your cluster manifest.
     You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  name: kubernetes.xxx.xxx
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    alwaysAllow: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://xxx-kops/kubernetes.xxx.xxx
  dnsZone: Z16C10VSQ4D9E
  docker:
    logDriver: ""
    storage: overlay2
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: events
  iam:
    legacy: true
  kubeAPIServer:
    runtimeConfig:
      batch/v2alpha1: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.5
  masterInternalName: api.internal.kubernetes.xxx.xxx
  masterPublicName: api.kubernetes.xxx.xxx
  networkCIDR: 172.20.0.0/16
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-central-1a
    type: Private
    zone: eu-central-1a
  - cidr: 172.20.64.0/19
    name: eu-central-1b
    type: Private
    zone: eu-central-1b
  - cidr: 172.20.96.0/19
    name: eu-central-1c
    type: Private
    zone: eu-central-1c
  - cidr: 172.20.0.0/22
    name: utility-eu-central-1a
    type: Utility
    zone: eu-central-1a
  - cidr: 172.20.4.0/22
    name: utility-eu-central-1b
    type: Utility
    zone: eu-central-1b
  - cidr: 172.20.8.0/22
    name: utility-eu-central-1c
    type: Utility
    zone: eu-central-1c
  topology:
    bastion:
      bastionPublicName: bastion.kubernetes.xxx.xxx
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: bastions
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
  - utility-eu-central-1a
  - utility-eu-central-1b
  - utility-eu-central-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: master-eu-central-1a
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: m4.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: master-eu-central-1b
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: m4.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-09-07T12:12:21Z
  labels:
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: master-eu-central-1c
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28
  machineType: m4.large
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - eu-central-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-12-13T20:46:55Z
  labels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: nodes-base
spec:
  image: kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-12-02
  machineType: m4.2xlarge
  maxSize: 4
  minSize: 4
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-base
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-01-22T10:26:14Z
  labels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/cluster: kubernetes.xxx.xxx
  name: nodes-general-purpose
spec:
  image: kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-05
  machineType: m4.2xlarge
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-general-purpose
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c

  8. Please run the commands with most verbose logging by adding the -v 10 flag.
     Paste the logs into this report, or in a gist and provide the gist link here.

Jan 23 14:20:07 ip-172-20-96-21 kubelet[9704]: I0123 14:20:07.640211    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:12 ip-172-20-96-21 kubelet[9704]: E0123 14:20:12.259174    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:12 ip-172-20-96-21 kubelet[9704]: E0123 14:20:12.259200    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:17 ip-172-20-96-21 kubelet[9704]: I0123 14:20:17.667982    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:20 ip-172-20-96-21 kubelet[9704]: I0123 14:20:20.553308    9704 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Jan 23 14:20:22 ip-172-20-96-21 kubelet[9704]: E0123 14:20:22.352318    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:22 ip-172-20-96-21 kubelet[9704]: E0123 14:20:22.352346    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:22 ip-172-20-96-21 kubelet[9704]: I0123 14:20:22.512209    9704 server.go:779] GET /metrics: (32.208317ms) 200 [[Prometheus/1.8.1] 172.20.91.21:34458]
Jan 23 14:20:27 ip-172-20-96-21 kubelet[9704]: I0123 14:20:27.687652    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:30 ip-172-20-96-21 kubelet[9704]: E0123 14:20:30.431621    9704 remote_runtime.go:115] StopPodSandbox "4799a6a8c867bc324480b64df7221f13d6b83e8171a14c527b9c0559cf4b6426" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 23 14:20:30 ip-172-20-96-21 kubelet[9704]: E0123 14:20:30.431669    9704 kuberuntime_manager.go:781] Failed to stop sandbox {"docker" "4799a6a8c867bc324480b64df7221f13d6b83e8171a14c527b9c0559cf4b6426"}
Jan 23 14:20:30 ip-172-20-96-21 kubelet[9704]: E0123 14:20:30.431708    9704 kubelet_pods.go:1063] Failed killing the pod "nginx-deployment-569477d6d8-jcbjz": failed to "KillPodSandbox" for "406a218c-0048-11e8-b572-026c39b367e0" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 23 14:20:32 ip-172-20-96-21 kubelet[9704]: E0123 14:20:32.431681    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:32 ip-172-20-96-21 kubelet[9704]: E0123 14:20:32.431728    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:37 ip-172-20-96-21 kubelet[9704]: I0123 14:20:37.712114    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:42 ip-172-20-96-21 kubelet[9704]: E0123 14:20:42.513956    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:42 ip-172-20-96-21 kubelet[9704]: E0123 14:20:42.513986    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:47 ip-172-20-96-21 kubelet[9704]: I0123 14:20:47.734079    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:20:52 ip-172-20-96-21 kubelet[9704]: E0123 14:20:52.743703    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:20:52 ip-172-20-96-21 kubelet[9704]: E0123 14:20:52.743728    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:20:57 ip-172-20-96-21 kubelet[9704]: I0123 14:20:57.761509    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:02 ip-172-20-96-21 kubelet[9704]: E0123 14:21:02.826386    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:02 ip-172-20-96-21 kubelet[9704]: E0123 14:21:02.826413    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:05 ip-172-20-96-21 kubelet[9704]: E0123 14:21:05.083079    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:05 ip-172-20-96-21 kubelet[9704]: E0123 14:21:05.083105    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:05 ip-172-20-96-21 kubelet[9704]: I0123 14:21:05.089787    9704 server.go:779] GET /stats/summary/: (61.282407ms) 200 [[Go-http-client/1.1] 172.20.66.110:33458]
Jan 23 14:21:07 ip-172-20-96-21 kubelet[9704]: I0123 14:21:07.780547    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:12 ip-172-20-96-21 kubelet[9704]: E0123 14:21:12.904377    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:12 ip-172-20-96-21 kubelet[9704]: E0123 14:21:12.904407    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:17 ip-172-20-96-21 kubelet[9704]: I0123 14:21:17.800646    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:20 ip-172-20-96-21 kubelet[9704]: I0123 14:21:20.554402    9704 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Jan 23 14:21:22 ip-172-20-96-21 kubelet[9704]: I0123 14:21:22.502214    9704 server.go:779] GET /metrics: (9.704924ms) 200 [[Prometheus/1.8.1] 172.20.91.21:34458]
Jan 23 14:21:22 ip-172-20-96-21 kubelet[9704]: E0123 14:21:22.995951    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:22 ip-172-20-96-21 kubelet[9704]: E0123 14:21:22.995978    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:27 ip-172-20-96-21 kubelet[9704]: I0123 14:21:27.823773    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:33 ip-172-20-96-21 kubelet[9704]: E0123 14:21:33.062525    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:33 ip-172-20-96-21 kubelet[9704]: E0123 14:21:33.062556    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:43 ip-172-20-96-21 kubelet[9704]: E0123 14:21:43.159664    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:43 ip-172-20-96-21 kubelet[9704]: E0123 14:21:43.159715    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Jan 23 14:21:47 ip-172-20-96-21 kubelet[9704]: I0123 14:21:47.881168    9704 aws.go:1051] Could not determine public DNS from AWS metadata.
Jan 23 14:21:49 ip-172-20-96-21 kubelet[9704]: E0123 14:21:49.208647    9704 remote_runtime.go:115] StopPodSandbox "9dfd449d99efe66115045c5557efba54d57cab1b3617fb67fb412fc11487d266" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 23 14:21:49 ip-172-20-96-21 kubelet[9704]: E0123 14:21:49.208684    9704 kuberuntime_gc.go:152] Failed to stop sandbox "9dfd449d99efe66115045c5557efba54d57cab1b3617fb67fb412fc11487d266" before removing: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 23 14:21:53 ip-172-20-96-21 kubelet[9704]: E0123 14:21:53.238500    9704 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Jan 23 14:21:53 ip-172-20-96-21 kubelet[9704]: E0123 14:21:53.238527    9704 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"

  9. Anything else we need to know?

This started happening suddenly a week ago. At first it manifested as slow deployments with intermittent sandbox failures; three days later, deployments would no longer finish, always ending in sandbox errors. From what I've found it is probably related to the CNI, but the related issues are all marked as fixed in 1.8.5, yet I still hit this problem.

I'm also using Weave Net 2.0.1.
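For reference, the Weave images actually deployed can be checked with the following (a sketch, assuming the default kops daemonset name weave-net):

kubectl -n kube-system get daemonset weave-net \
  -o jsonpath='{.spec.template.spec.containers[*].image}'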

The deployment I used for this test is:

apiVersion: apps/v1beta2 # apps/v1beta2 because the cluster runs 1.8.5; apps/v1 is only available from 1.9.0
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3 # tells the deployment to run 3 pods matching the template
  template: # create pods using pod definition in this template
    metadata:
      # unlike pod-nginx.yaml, the name is not included in the metadata as a unique name is
      # generated from the deployment name
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80


All 15 comments

See this link:
https://github.com/weaveworks/weave/issues/2797

At the company I work for, we just ran into this yesterday :)

Basically, Weave does not reclaim unused IP addresses after nodes have been removed from the cluster. This was fixed in Weave 2.1.1, but kops release 1.8.0 ships Weave 2.0.5. The kops master branch has Weave 2.3.1, so we're waiting for a new release.
In the meantime we're removing unused peers manually:

kubectl exec -n kube-system {MASTER_NODE_WEAVE_POD_ID} -c weave -- /home/weave/weave --local status ipam
kubectl exec -n kube-system {MASTER_NODE_WEAVE_POD_ID} -c weave -- /home/weave/weave --local rmpeer {MAC_OF_UNREACHABLE_NODE}

The node on which you run rmpeer will claim the unused addresses, so we're running the command across the master nodes.
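If there are many unreachable peers, the loop below is a rough sketch of scripting the same workaround (the name=weave-net pod label is an assumption based on the default Weave manifest; verify that the peers reported as unreachable are really gone before removing them):

WEAVE_POD=$(kubectl -n kube-system get pods -l name=weave-net \
  -o jsonpath='{.items[0].metadata.name}')           # pick one weave pod to run the commands in (assumes label name=weave-net)

kubectl -n kube-system exec "$WEAVE_POD" -c weave -- /home/weave/weave --local status ipam \
  | awk '/unreachable!/ { sub(/\(.*/, "", $1); print $1 }' \
  | while read PEER; do
      echo "Reclaiming IP ranges owned by unreachable peer $PEER"
      kubectl -n kube-system exec "$WEAVE_POD" -c weave -- /home/weave/weave --local rmpeer "$PEER"
    done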

You can also upgrade kops to 1.8.1: https://github.com/kubernetes/kops/releases/tag/1.8.1
or you can upgrade the weave image: https://github.com/kubernetes/kops/issues/3575 (look for AlexRRR's comment).
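A sketch of the image bump on a running cluster (the weave-npc container name and the 2.3.0 tags are assumptions taken from later comments in this thread; note that kops may roll the addon back to its bundled version on the next update):

kubectl -n kube-system set image daemonset/weave-net \
  weave=weaveworks/weave-kube:2.3.0 \
  weave-npc=weaveworks/weave-npc:2.3.0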

I have seen this same issue with 1.9.0-beta-2

@sstarcher any idea on root cause? Are you using weave?

@yoz2326 Your steps fixed the issue for me. I had the issue after testing what would happen when I rebooted my nodes in a cluster. Note I'm actually using kubicorn with DigitalOcean, but I thought I'd post here to thank you and maybe help someone else who has the same issue :)

You mentioned, though, that this is fixed in Weave 2.1.1. But from what I can see this is still an issue when using:

weaveworks/weave-kube:2.3.0
weaveworks/weave-npc:2.3.0

For me, the output of the command was as follows:

kubectl exec -n kube-system weave-net-m6btt -c weave -- /home/weave/weave --local status ipam
e2:c2:ee:4f:09:4f(myfirstk8s-master-0)        2 IPs (06.2% of total) (2 active)
7e:12:35:ae:6a:2d(myfirstk8s-node-1)        12 IPs (37.5% of total) - unreachable!
72:91:37:2c:c8:e9(myfirstk8s-node-0)         8 IPs (25.0% of total) - unreachable!
22:00:53:3b:f3:4b(myfirstk8s-node-0)         4 IPs (12.5% of total) 
51:82:5b:e0:91:16(myfirstk8s-node-1)         6 IPs (18.8% of total) 

I followed your second command and removed the two unreachable nodes (they are actually the same node, which appears to have got a different MAC address after the reboot).

As soon as this happened the cluster sprang back into life.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

The same issue also occurs with Weave 2.4.

Same issue with version 2.4.1

I see it on weave 2.4.0 as well, with the cluster at k8s 1.10. This should probably be reopened.

We have autoscaling turned on, so nodes get added and removed pretty frequently, which I think aggravates this issue. In the meantime, some kind of automation around the rmpeer workaround would also help.

@JaveriaK please upgrade to Weave 2.5; this release has fixes that automate rmpeer and forget nodes as they get deleted from the ASG.

Still seeing unreachable IPs in 2.5.0, although less frequently than before. It's also possible that the underlying cause has changed. Just wanted to comment here first; please let me know if you'd like me to open a new issue.

I also see some weave pods restarting continuously. A log snippet from one of these is below:

INFO: 2019/02/18 23:20:05.099273 overlay_switch ->[ae:2a:da:c3:28:36(ip-172-31-42-187.us-west-1.compute.internal)] fastdp send InitSARemote: write tcp4 172.31.43.46:6783->172.31.42.187:19492: write: connection reset by peer
INFO: 2019/02/18 23:20:05.099305 overlay_switch ->[ae:2a:da:c3:28:36(ip-172-31-42-187.us-west-1.compute.internal)] using sleeve
INFO: 2019/02/18 23:20:05.102149 ->[172.31.42.187:19492|ae:2a:da:c3:28:36(ip-172-31-42-187.us-west-1.compute.internal)]: connection shutting down due to error: Inconsistent entries for 10.100.64.0: owned by f2:c1:39:11:01:d4 but incoming message says 1e:76:31:2a:3a:d1
INFO: 2019/02/18 23:35:02.335169 ->[172.31.42.208:19561|02:d0:fa:07:f7:ca(ip-172-31-42-208.us-west-1.compute.internal)]: connection shutting down due to error: cannot connect to ourself
INFO: 2019/02/18 23:35:02.335278 ->[172.31.42.208:6783|02:d0:fa:07:f7:ca(ip-172-31-42-208.us-west-1.compute.internal)]: connection shutting down due to error: cannot connect to ourself

@JaveriaK Please open a new issue in weave repo with relevant logs

It might be as simple as opening the CNI plugin ports. I am using Weave Net, so the relevant ports are 6783/tcp, 6783/udp, and 6784/udp on the master node(s) in your firewall.
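If the ports really are blocked at the AWS security-group level, a sketch of opening them between groups (both security-group IDs are placeholders):

for PORT_PROTO in "6783 tcp" "6783 udp" "6784 udp"; do
  set -- $PORT_PROTO                                # split into port ($1) and protocol ($2)
  aws ec2 authorize-security-group-ingress \
    --group-id sg-0masterplaceholder \
    --protocol "$2" --port "$1" \
    --source-group sg-0nodesplaceholder             # allow Weave traffic from the node security group
done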

