Calico: Pod calico-node leaking connections to the apiserver when using KDD

Created on 27 Nov 2017 · 6 Comments · Source: projectcalico/calico

Expected Behavior



When the cluster is unchanged, the established connection count from each calico-node pod to the apiserver should remain (more or less) constant.

Current Behavior



The established connection count from calico-node to the apiserver increases over time (about 60 connections added over 2 days). There is no leak when using the etcd datastore driver.

Possible Solution



Not sure

Steps to Reproduce (for bugs)


  1. Deploy a Kube cluster
  2. Deploy Calico with KDD following https://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/
  3. Check the current connection count from calico-node to the apiserver (for me it was 24)
  4. Wait 2 days
  5. Check the connection count from calico-node to the apiserver again (for me it had grown to 86)
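
For anyone trying to reproduce this, steps 3 and 5 can be scripted. Below is a minimal sketch, not the exact method used for the numbers above: it assumes you capture the output of `ss -tnp` on the node (calico-node runs with `hostNetwork: true`, so its sockets show up in the host network namespace) and count the `ESTAB` rows whose peer is the apiserver. The function name, the peer address `10.0.0.1:6443` and the process name `calico-felix` are placeholders for illustration; substitute the values from your own cluster.

```python
def count_established(ss_output, peer, process=None):
    """Count ESTAB rows in `ss -tn[p]` output whose peer address equals `peer`.

    ss_output: text captured from `ss -tn` or `ss -tnp`.
    peer:      apiserver endpoint as "ip:port" (placeholder value in the example).
    process:   optional substring to match in the process column printed by -p.
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port [users:(...)]
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue  # skips the header row and non-established sockets
        if fields[4] != peer:
            continue  # connection goes somewhere other than the apiserver
        if process is not None and process not in line:
            continue  # connection belongs to a different process
        count += 1
    return count
```

Run `ss -tnp` as root on the node, pass the captured text to `count_established(text, "10.0.0.1:6443", process="calico-felix")`, and record the result at the start and again after the waiting period.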

Context



This accumulates into a large number of unclosed TCP connections on the apiservers, and we have to regularly kill all calico-node pods to reset them.

Your Environment

  • Calico version: 2.6.2
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.8.3
  • Operating System and version: CentOS 7.4 1708
  • Link to your project (optional):

All 6 comments

Posting the YAML file used to deploy our Calico (slightly modified from https://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/). The main changes:

  1. Pull Calico images from DockerHub instead of quay.io, as our hosts run behind a firewall and only have a mirror set up for DockerHub.
  2. Replace /etc/service/enabled/confd/run in the calico-node image to reduce the log verbosity of confd.
  3. Merge in the RBAC resources (https://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml).

# Calico Version v2.6.2
# https://docs.projectcalico.org/v2.6/releases#v2.6.2
# This manifest includes the following component versions:
#   calico/node:v2.6.2
#   calico/cni:v1.11.0

# This ConfigMap is used to configure a self-hosted Calico installation.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # The CNI network configuration to install on each node.
  cni_network_config: |-
    {
        "name": "k8s-pod-network",
        "cniVersion": "0.1.0",
        "type": "calico",
        "log_level": "info",
        "datastore_type": "kubernetes",
        "nodename": "__KUBERNETES_NODE_NAME__",
        "mtu": 1500,
        "ipam": {
            "type": "host-local",
            "subnet": "usePodCidr"
        },
        "policy": {
            "type": "k8s",
            "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
        },
        "kubernetes": {
            "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
            "kubeconfig": "__KUBECONFIG_FILEPATH__"
        }
    }

---

# HACK: This is the temporary patch for fixing the log spam from confd
# Seems fixed in a later beta version, we will keep it here until a new stable version is released.
# https://github.com/projectcalico/calico/issues/985
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-confd-patch
  namespace: kube-system
data:
  run: |-
    #!/bin/sh
    exec 2>&1

    if [ "$DATASTORE_TYPE" = "kubernetes" ]
    then
        exec confd -confdir=/etc/calico/confd -interval=5 -backend=k8s
    else
        ETCD_NODE=${ETCD_ENDPOINTS:=${ETCD_SCHEME:=http}://${ETCD_AUTHORITY}}
        ETCD_ENDPOINTS_CONFD=`echo "-node=$ETCD_NODE" | sed -e 's/,/ -node=/g'`

        exec confd -confdir=/etc/calico/confd -interval=5 -watch \
               $ETCD_ENDPOINTS_CONFD -client-key=${ETCD_KEY_FILE} \
               -client-cert=${ETCD_CERT_FILE} -client-ca-keys=${ETCD_CA_CERT_FILE}
    fi

---

# This manifest installs the calico/node container, as well
# as the Calico CNI plugins and network config on
# each master and worker node in a Kubernetes cluster.
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: calico-node
  namespace: kube-system
  labels:
    k8s-app: calico-node
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      labels:
        k8s-app: calico-node
      annotations:
        # This, along with the CriticalAddonsOnly toleration below,
        # marks the pod as a critical add-on, ensuring it gets
        # priority scheduling and that its resources are reserved
        # if it ever gets evicted.
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      hostNetwork: true
      serviceAccountName: calico-node
      tolerations:
        # Allow the pod to run on the master.  This is required for
        # the master to communicate with pods.
        - key: dedicated
          operator: "Exists"
          effect: NoSchedule
        # Mark the pod as a critical add-on for rescheduling.
        - key: "CriticalAddonsOnly"
          operator: "Exists"
      # Minimize downtime during a rolling upgrade or deletion; tell Kubernetes to do a "force
      # deletion": https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods.
      terminationGracePeriodSeconds: 0
      containers:
        # Runs calico/node container on each Kubernetes node.  This
        # container programs network policy and routes on each
        # host.
        - name: calico-node
          image: calico/node:v2.6.2
          env:
            # Use Kubernetes API as the backing datastore.
            - name: DATASTORE_TYPE
              value: "kubernetes"
            # Enable felix info logging.
            - name: FELIX_LOGSEVERITYSCREEN
              value: "info"
            # Cluster type to identify the deployment type
            - name: CLUSTER_TYPE
              value: "k8s,bgp"
            # Disable file logging so `kubectl logs` works.
            - name: CALICO_DISABLE_FILE_LOGGING
              value: "true"
            # Set Felix endpoint to host default action to ACCEPT.
            - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
              value: "ACCEPT"
            # Disable IPV6 on Kubernetes.
            - name: FELIX_IPV6SUPPORT
              value: "false"
            # Set MTU for tunnel device used if ipip is enabled
            - name: FELIX_IPINIPMTU
              value: "1440"
            # Wait for the datastore.
            - name: WAIT_FOR_DATASTORE
              value: "true"
            # The Calico IPv4 pool to use.  This should match `--cluster-cidr`
            - name: CALICO_IPV4POOL_CIDR
              value: "100.200.0.0/16"
            # IPIP mode for the default pool (disabled here).
            - name: CALICO_IPV4POOL_IPIP
              value: "off"
            # Enable IP-in-IP within Felix.
            - name: FELIX_IPINIPENABLED
              value: "true"
            # Set based on the k8s node name.
            - name: NODENAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # Make sure IP is always autodetected even if already exists in the node resource configuration
            # https://docs.projectcalico.org/v2.4/reference/node/configuration#ip-autodetection-methods
            - name: IP
              value: "autodetect"
            - name: FELIX_HEALTHENABLED
              value: "true"
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: 250m
          livenessProbe:
            httpGet:
              path: /liveness
              port: 9099
            periodSeconds: 10
            initialDelaySeconds: 10
            failureThreshold: 6
          readinessProbe:
            httpGet:
              path: /readiness
              port: 9099
            periodSeconds: 10
          volumeMounts:
            - mountPath: /lib/modules
              name: lib-modules
              readOnly: true
            - mountPath: /var/run/calico
              name: var-run-calico
              readOnly: false
            # HACK: temporary fix for confd log spam
            # https://github.com/projectcalico/calico/issues/985
            - mountPath: /etc/service/enabled/confd/run
              name: confd-patch
              subPath: run
              readOnly: true
        # This container installs the Calico CNI binaries
        # and CNI network config file on each node.
        - name: install-cni
          image: calico/cni:v1.11.0
          command: ["/install-cni.sh"]
          env:
            # The CNI network config to install on each node.
            - name: CNI_NETWORK_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: calico-config
                  key: cni_network_config
            # Set the hostname based on the k8s node name.
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
      volumes:
        # Used by calico/node.
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: var-run-calico
          hostPath:
            path: /var/run/calico
        # Used to install CNI.
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
        # HACK: temporary fix for confd log spam
        # https://github.com/projectcalico/calico/issues/985
        - name: confd-patch
          configMap:
            name: calico-confd-patch
            defaultMode: 0755

# Create all the CustomResourceDefinitions needed for
# Calico policy and networking mode.
---

apiVersion: apiextensions.k8s.io/v1beta1
description: Calico Global Felix Configuration
kind: CustomResourceDefinition
metadata:
  name: globalfelixconfigs.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalFelixConfig
    plural: globalfelixconfigs
    singular: globalfelixconfig

---

apiVersion: apiextensions.k8s.io/v1beta1
description: Calico BGP Peers
kind: CustomResourceDefinition
metadata:
  name: bgppeers.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: BGPPeer
    plural: bgppeers
    singular: bgppeer

---

apiVersion: apiextensions.k8s.io/v1beta1
description: Calico Global BGP Configuration
kind: CustomResourceDefinition
metadata:
  name: globalbgpconfigs.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalBGPConfig
    plural: globalbgpconfigs
    singular: globalbgpconfig

---

apiVersion: apiextensions.k8s.io/v1beta1
description: Calico IP Pools
kind: CustomResourceDefinition
metadata:
  name: ippools.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPPool
    plural: ippools
    singular: ippool

---

apiVersion: apiextensions.k8s.io/v1beta1
description: Calico Global Network Policies
kind: CustomResourceDefinition
metadata:
  name: globalnetworkpolicies.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalNetworkPolicy
    plural: globalnetworkpolicies
    singular: globalnetworkpolicy

---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: calico-node
  namespace: kube-system

---

# Calico Version v2.6.2
# https://docs.projectcalico.org/v2.6/releases#v2.6.2
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: calico-node
rules:
  - apiGroups: [""]
    resources:
      - namespaces
    verbs:
      - get
      - list
      - watch
  - apiGroups: [""]
    resources:
      - pods/status
    verbs:
      - update
  - apiGroups: [""]
    resources:
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - update
      - watch
  - apiGroups: ["extensions"]
    resources:
      - networkpolicies
    verbs:
      - get
      - list
      - watch
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - globalfelixconfigs
      - bgppeers
      - globalbgpconfigs
      - ippools
      - globalnetworkpolicies
    verbs:
      - create
      - get
      - list
      - update
      - watch

---

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: calico-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-node
subjects:
- kind: ServiceAccount
  name: calico-node
  namespace: kube-system

@javefang Please can you try with Calico v2.6.3? We made some fixes in that area that may have resolved this.

Awesome! I will give it a go tomorrow and report back in a few days (to see if the connection still leaks).

OK, running 2.6.3 (and CNI 1.11.1) now. 2 hours in and looking good so far! Will report back in 24h.
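
For reference, the upgrade amounts to bumping the two image tags in the manifest posted earlier and re-applying it. A minimal sketch of that edit follows. It assumes the new tags follow the same `v`-prefixed naming as the existing ones (`calico/node:v2.6.3` and `calico/cni:v1.11.1`); the thread only names the versions, so check the tags against the release notes before applying. `UPGRADES` and `bump_images` are illustrative names, not part of any Calico tooling.

```python
# Old image reference -> new image reference.
# The new tags are an assumption based on the existing tag naming.
UPGRADES = {
    "calico/node:v2.6.2": "calico/node:v2.6.3",
    "calico/cni:v1.11.0": "calico/cni:v1.11.1",
}


def bump_images(manifest_text, upgrades=UPGRADES):
    """Return the manifest text with every old image reference replaced."""
    for old, new in upgrades.items():
        manifest_text = manifest_text.replace(old, new)
    return manifest_text
```

Note that this is a plain text substitution, so the version comments in the manifest header are rewritten along with the `image:` fields.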

With this result I'm convinced that the connection leak issue has been fixed by 2.6.3. Thanks for the great work guys!

(2.6.3 deployed at around 9am on 7 Dec, and no connection count increase since then)
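
To put a number on "no increase", one option is to compare the first and last samples and express the difference as connections per day. This is only a sketch of that arithmetic: the first series uses the figures from the reproduction steps (24, then 86 two days later), while the flat series uses made-up counts purely for illustration. `growth_per_day` is a hypothetical helper name.

```python
def growth_per_day(samples):
    """Net change in connection count per day between first and last sample.

    samples: time-ordered list of (hours_since_start, connection_count).
    """
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed_days = (t1 - t0) / 24.0
    if elapsed_days <= 0:
        raise ValueError("need at least two samples spread over time")
    return (c1 - c0) / elapsed_days


# Figures from the original report: 24 connections, then 86 after 2 days.
leaking = [(0, 24), (48, 86)]
print(growth_per_day(leaking))  # 31.0

# Hypothetical flat series, as one would expect after the fix.
flat = [(0, 30), (24, 30), (96, 30)]
print(growth_per_day(flat))  # 0.0
```

A value close to zero over several days matches the expected behaviour; a steady positive value like the 31 per day above indicates a leak.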

[Image: screenshot, 2017-12-11 at 10:36:28]

Awesome, thank you for reporting your results back. :+1:
