Calico: Calico Network Policy only works when both applications are on the same K8s node

Created on 2 Oct 2019 · 5 Comments · Source: projectcalico/calico

I am having a very strange issue and have not been able to find its cause. I am using a Calico Network Policy to allow the DB to accept connections from one specific namespace only.

Calico Network Policy

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: network-policy-171-946
  namespace: ns-restriction-demo-2
spec:
  selector: app == 'db-demo-2'
  ingress:  
  - action: Allow
    protocol: TCP
    source:
      selector: app == 'node-demo-1'
      namespaceSelector: name == 'ns-restriction-demo-1'

  - action: Allow
    protocol: TCP
    source:
      namespaceSelector: name == 'ns-restriction-demo-2'
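
To make the intent of the two ingress rules concrete, here is a minimal Python sketch of which sources they admit. It is a toy model and not Calico's evaluator: the `Source` type and the `allowed` function are hypothetical names, and the sketch assumes the namespaces actually carry a `name` label with the values the selectors expect. Node placement does not appear anywhere in the policy, so it should not affect the outcome.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical description of a connecting pod (not a Calico type)."""
    pod_labels: dict = field(default_factory=dict)
    namespace_labels: dict = field(default_factory=dict)
    protocol: str = "TCP"


def allowed(src: Source) -> bool:
    """Toy model of the two ingress Allow rules in the policy above."""
    if src.protocol != "TCP":
        return False
    ns_name = src.namespace_labels.get("name")
    # Rule 1: pods labelled app=node-demo-1 in namespaces
    # labelled name=ns-restriction-demo-1.
    rule1 = (src.pod_labels.get("app") == "node-demo-1"
             and ns_name == "ns-restriction-demo-1")
    # Rule 2: any pod in namespaces labelled name=ns-restriction-demo-2.
    rule2 = ns_name == "ns-restriction-demo-2"
    return rule1 or rule2
```

Under this model, a pod in a namespace that lacks the `name` label is never admitted, so checking the namespace labels is a reasonable first step before suspecting the data path.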

Expected Behavior

When I apply the network policy, it should work regardless of which Kubernetes worker node the pods run on.

Current Behavior

When I apply the network policy, it works only if the DB and the application connecting to it are both on the same Kubernetes worker node.

Steps to Reproduce (for bugs)

Deploy a DB app on one k8s node
Deploy another app, which connects to the DB, on a different k8s node

Your Environment

Client Version: v3.5.8
Git commit: 107e128
Cluster Version: v3.9.1
Cluster Type: k8s,bgp,kdd,typha
Kubernetes: 1.13.6
Istio: 1.1.10

Please help me understand or debug the issue.
Thanks

venomwaqar

👍2

All 5 comments

I seem to have the same problem with a different network policy (#2896).
My pods can communicate only if they are on the same node.

Woap on 2 Oct 2019

👍2

@Woap Yeah something is wrong.

venomwaqar on 3 Oct 2019

Are you running in the cloud?
Are you using IPIP? Is IPIP traffic allowed between your nodes?
Are the calico-node pods Running (not erroring/crashing)?

tmjd on 7 Oct 2019

Hi, I've also run into the same issue (although with canal).

Edit: Updated following more investigation. I believe my issue was down to the flannel version change rather than Calico.

In my env I had updated Calico from v3.3.0 to v3.6.1 and flannel from v0.9.0 to v0.11.0.

Prior to the upgrade, the POSTROUTING chain (nat table) was as follows:

Chain POSTROUTING (policy ACCEPT)
target     prot opt in     out     source               destination
KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
RETURN     all  --  *      *       10.244.0.0/16        10.244.0.0/16
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4
RETURN     all  --  *      *      !10.244.0.0/16        10.244.1.0/24
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16
cali-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:O3lYWMrLQYEMJtB5 */

Following the update of the DaemonSet and all the pods recycling, the POSTROUTING chain on all nodes had gotten into the following state:

Chain POSTROUTING (policy ACCEPT)
target     prot opt in     out     source               destination
KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16
RETURN     all  --  *      *       10.244.0.0/16        10.244.0.0/16
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4          random-fully
RETURN     all  --  *      *      !10.244.0.0/16        10.244.1.0/24
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16        random-fully
cali-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:O3lYWMrLQYEMJtB5 */

Snippet of the kube-flannel logs:

iptables.go:167] Deleting iptables rule: -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
iptables.go:167] Deleting iptables rule: -s 0.0.0.0/0 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
iptables.go:167] Deleting iptables rule: ! -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
iptables.go:167] Deleting iptables rule: ! -s 0.0.0.0/0 -d 0.0.0.0/0 -j MASQUERADE --random-fully
main.go:317] Wrote subnet file to /run/flannel/subnet.env
main.go:321] Running backend.
main.go:339] Waiting for all goroutines to exit
vxlan_network.go:60] watching for new subnet leases
iptables.go:145] Some iptables rules are missing; deleting and recreating rules
iptables.go:167] Deleting iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
iptables.go:167] Deleting iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
iptables.go:167] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.1.0/24 -j RETURN
iptables.go:167] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.1.0/24 -j RETURN
iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully

When flannel starts up, it attempts to detect and remove four rules (RETURN, MASQUERADE, RETURN, MASQUERADE) before re-adding them, but in this scenario it detected and removed only the two RETURN rules. The two old MASQUERADE rules were left in place (the new version's rules differ slightly in that they reference --random-fully), and they sit ahead of the re-added rules. All traffic ends up hitting them, which causes this issue.
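
The leftover rules can be spotted mechanically. The following is a rough Python sketch, assuming the chain listing has the column layout shown above (target, prot, opt, in, out, source, destination, extras). The helper names `parse_rules` and `leftover_masquerade` are hypothetical, and "leftover" is taken here to mean a MASQUERADE rule without random-fully that has a random-fully twin with the same source and destination.

```python
LISTING = """\
Chain POSTROUTING (policy ACCEPT)
target     prot opt in     out     source               destination
KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16
RETURN     all  --  *      *       10.244.0.0/16        10.244.0.0/16
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4          random-fully
RETURN     all  --  *      *      !10.244.0.0/16        10.244.1.0/24
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16        random-fully
cali-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:O3lYWMrLQYEMJtB5 */
"""


def parse_rules(listing: str) -> list:
    """Parse rule lines of the listing into dicts, skipping the two header lines."""
    rules = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[0] in ("Chain", "target"):
            continue
        rules.append({
            "target": fields[0],
            "source": fields[5],
            "destination": fields[6],
            "extra": " ".join(fields[7:]),
        })
    return rules


def leftover_masquerade(rules: list) -> list:
    """Return MASQUERADE rules lacking random-fully that have a random-fully twin."""
    fresh = {
        (r["source"], r["destination"])
        for r in rules
        if r["target"] == "MASQUERADE" and "random-fully" in r["extra"]
    }
    return [
        r for r in rules
        if r["target"] == "MASQUERADE"
        and "random-fully" not in r["extra"]
        and (r["source"], r["destination"]) in fresh
    ]


for rule in leftover_masquerade(parse_rules(LISTING)):
    print(rule["source"], rule["destination"])
```

Run against the post-upgrade listing, this flags the two MASQUERADE rules that appear before the first RETURN rule.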

To solve it without cycling the nodes, I flushed the POSTROUTING chain (iptables -t nat -F POSTROUTING); alternatively, I could have deleted just those two MASQUERADE rules individually. The chain was reconfigured correctly shortly afterwards.
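
For the alternative of deleting only the two leftover MASQUERADE rules, the commands might look like what the Python sketch below prints. This is an illustration and not what was actually run here (the chain was flushed): the helper names are hypothetical, the addresses are taken from the listing in this comment and will differ in other clusters, and the sketch only prints the commands without executing them.

```python
def addr_args(flag: str, value: str) -> list:
    """Turn a listing-style address ('!10.244.0.0/16') into iptables arguments."""
    if value.startswith("!"):
        return ["!", flag, value[1:]]
    return [flag, value]


def delete_command(source: str, destination: str) -> list:
    """Build an argv that deletes one MASQUERADE rule from nat/POSTROUTING."""
    return [
        "iptables", "-t", "nat", "-D", "POSTROUTING",
        *addr_args("-s", source),
        *addr_args("-d", destination),
        "-j", "MASQUERADE",
    ]


# The two rules without random-fully from the post-upgrade listing (assumed values).
LEFTOVER = [
    ("10.244.0.0/16", "!224.0.0.0/4"),
    ("!10.244.0.0/16", "10.244.0.0/16"),
]

for source, destination in LEFTOVER:
    # Print only; review before running anything as root on a node.
    print(" ".join(delete_command(source, destination)))
```

Because the rule specifications omit --random-fully, they should match only the old rules and leave the ones flannel re-added untouched.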

This issue is related: #2169

KashifSaadat on 24 Oct 2019

👍1

@venomwaqar please provide additional information, and then we can re-open this issue.
Does the traffic you are attempting work when you do not have policies in place?

tmjd on 3 Dec 2019
