Calico: Calico Network Policy only works when both applications are on the same K8s node

Created on 2 Oct 2019 · 5 Comments · Source: projectcalico/calico


I am seeing a strange issue and have not been able to find what causes it. I am using a Calico Network Policy so that the DB accepts connections from one specific namespace only.

Calico Network Policy

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: network-policy-171-946
  namespace: ns-restriction-demo-2
spec:
  selector: app == 'db-demo-2'
  ingress:  
  - action: Allow
    protocol: TCP
    source:
      selector: app == 'node-demo-1'
      namespaceSelector: name == 'ns-restriction-demo-1'

  - action: Allow
    protocol: TCP
    source:
      namespaceSelector: name == 'ns-restriction-demo-2'
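
To make the intent of the policy above concrete, here is a minimal Python sketch of how its two ingress rules could be evaluated. It is a simplified model, not Calico's implementation: it supports only `key == 'value'` selectors, ignores the TCP protocol match, and the names `matches`, `INGRESS_RULES` and `ingress_allowed` are hypothetical. It also assumes the namespaces carry a `name` label, since `namespaceSelector` matches namespace labels rather than the namespace name itself. Nothing in the model depends on which node a pod runs on, which matches the expected behavior described below.

```python
from typing import Dict, Optional

Labels = Dict[str, str]


def matches(selector: Optional[Labels], labels: Labels) -> bool:
    """Equality-only selector: every key must be present with the given value.

    A selector of None means "no selector given" and matches everything.
    """
    return selector is None or all(labels.get(k) == v for k, v in selector.items())


# The two ingress rules of the policy, reduced to equality selectors.
INGRESS_RULES = [
    {"selector": {"app": "node-demo-1"},
     "namespace_selector": {"name": "ns-restriction-demo-1"}},
    {"selector": None,
     "namespace_selector": {"name": "ns-restriction-demo-2"}},
]


def ingress_allowed(pod_labels: Labels, namespace_labels: Labels) -> bool:
    """True if a source pod with these labels, in a namespace with these labels,
    is allowed by at least one ingress rule."""
    return any(
        matches(rule["selector"], pod_labels)
        and matches(rule["namespace_selector"], namespace_labels)
        for rule in INGRESS_RULES
    )


if __name__ == "__main__":
    print(ingress_allowed({"app": "node-demo-1"}, {"name": "ns-restriction-demo-1"}))
    print(ingress_allowed({"app": "other"}, {"name": "ns-restriction-demo-1"}))
```

In this model, a source namespace without a `name` label is never matched, so checking the namespace labels is a cheap first step when a rule like this does not behave as expected.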

Expected Behavior

When I apply the network policy, it should work regardless of which Kubernetes worker node the pods run on.

Current Behavior

When I apply the network policy, it works only if the DB and the application connecting to it are both on the same Kubernetes worker node.

Steps to Reproduce (for bugs)

  1. Deploy a DB app on one k8s node
  2. Deploy another app that connects to the DB on a different k8s node

Your Environment

Client Version: v3.5.8
Git commit: 107e128
Cluster Version: v3.9.1
Cluster Type: k8s,bgp,kdd,typha
Kubernetes: 1.13.6
Istio: 1.1.10

Please help me understand or debug the issue.
Thanks


All 5 comments

I seem to have the same problem with a different network policy (#2896).
My pods can communicate only if they are on the same node.

@Woap Yeah something is wrong.

Are you running in the cloud?
Are you using IPIP? Is IPIP traffic allowed between your nodes?
Are the calico-node pods Running (not erroring/crashing)?
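
The last question can be answered with a small script. The sketch below is only illustrative: it takes the JSON printed by `kubectl get pods -o json` and lists pods that are not Running or that have containers that are not ready. The function name `unhealthy_pods` is hypothetical, and the namespace and label used in the usage note are assumptions about a typical manifest-based install.

```python
import json
from typing import List, Tuple


def unhealthy_pods(pod_list_json: str) -> List[Tuple[str, str, str]]:
    """Return (pod, node, reason) for pods that are not Running with all containers ready."""
    problems = []
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod["metadata"]["name"]
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase != "Running":
            problems.append((name, node, f"phase={phase}"))
            continue
        # A pod can be Running while a container keeps crashing and restarting.
        for cs in status.get("containerStatuses", []):
            if not cs.get("ready", False):
                restarts = cs.get("restartCount", 0)
                problems.append(
                    (name, node, f"container {cs['name']} not ready, restarts={restarts}")
                )
    return problems


if __name__ == "__main__":
    import sys

    for pod, node, reason in unhealthy_pods(sys.stdin.read()):
        print(f"{pod} on {node}: {reason}")
```

Saved as, say, check_calico.py (a hypothetical file name), it could be fed with `kubectl get pods -n kube-system -l k8s-app=calico-node -o json | python check_calico.py`. Adjust the namespace and label to your install; they are assumed here.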

Hi, I've also run into the same issue (although with canal).

Edit: Updated following more investigation. I believe my issue was down to the flannel version change rather than Calico.

In my env I had updated Calico from v3.3.0 to v3.6.1 and flannel from v0.9.0 to v0.11.0.

Prior to the upgrade, the POSTROUTING chain of the nat table was as follows:

Chain POSTROUTING (policy ACCEPT)
target     prot opt in     out     source               destination
KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
RETURN     all  --  *      *       10.244.0.0/16        10.244.0.0/16
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4
RETURN     all  --  *      *      !10.244.0.0/16        10.244.1.0/24
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16
cali-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:O3lYWMrLQYEMJtB5 */

Following the update of the DaemonSet and all the pods recycling, the POSTROUTING chain on all nodes had gotten into the following state:

Chain POSTROUTING (policy ACCEPT)
target     prot opt in     out     source               destination
KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16
RETURN     all  --  *      *       10.244.0.0/16        10.244.0.0/16
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4          random-fully
RETURN     all  --  *      *      !10.244.0.0/16        10.244.1.0/24
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16        random-fully
cali-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:O3lYWMrLQYEMJtB5 */

Snippet of the kube-flannel logs:

iptables.go:167] Deleting iptables rule: -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
iptables.go:167] Deleting iptables rule: -s 0.0.0.0/0 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
iptables.go:167] Deleting iptables rule: ! -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
iptables.go:167] Deleting iptables rule: ! -s 0.0.0.0/0 -d 0.0.0.0/0 -j MASQUERADE --random-fully
main.go:317] Wrote subnet file to /run/flannel/subnet.env
main.go:321] Running backend.
main.go:339] Waiting for all goroutines to exit
vxlan_network.go:60] watching for new subnet leases
iptables.go:145] Some iptables rules are missing; deleting and recreating rules
iptables.go:167] Deleting iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
iptables.go:167] Deleting iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
iptables.go:167] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.1.0/24 -j RETURN
iptables.go:167] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.1.0/24 -j RETURN
iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully

When flannel starts up it attempts to detect and remove 4 rules (RETURN, MASQUERADE, RETURN, MASQUERADE) before re-adding them, but in this scenario it only detected and removed the two RETURN rules. The 2 old MASQUERADE rules were left behind, because the new version looks for rules with --random-fully and the old ones do not have it. As the listing above shows, the leftover rules sit ahead of the re-added RETURN rule, so all traffic hits them first and gets masqueraded, which causes this issue.
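
As a rough illustration of that diagnosis, the following Python sketch parses a POSTROUTING listing in the format quoted above and reports MASQUERADE rules that lack random-fully but have a random-fully twin with the same source and destination. The names `NatRule`, `parse_postrouting` and `stale_masquerade` are hypothetical, and the parser assumes the exact column layout shown in this thread (no packet or byte counters).

```python
from typing import List, NamedTuple


class NatRule(NamedTuple):
    target: str
    source: str
    destination: str
    random_fully: bool


# Post-upgrade listing copied from this thread.
POST_UPGRADE = """\
Chain POSTROUTING (policy ACCEPT)
target     prot opt in     out     source               destination
KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16
RETURN     all  --  *      *       10.244.0.0/16        10.244.0.0/16
MASQUERADE  all  --  *      *       10.244.0.0/16       !224.0.0.0/4          random-fully
RETURN     all  --  *      *      !10.244.0.0/16        10.244.1.0/24
MASQUERADE  all  --  *      *      !10.244.0.0/16        10.244.0.0/16        random-fully
cali-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:O3lYWMrLQYEMJtB5 */
"""


def parse_postrouting(listing: str) -> List[NatRule]:
    """Parse rule lines: target prot opt in out source destination [extras...]."""
    rules = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the "Chain ..." banner, the column header, and blank lines.
        if len(fields) < 7 or fields[0] in ("Chain", "target"):
            continue
        rules.append(
            NatRule(fields[0], fields[5], fields[6], "random-fully" in fields[7:])
        )
    return rules


def stale_masquerade(rules: List[NatRule]) -> List[NatRule]:
    """MASQUERADE rules without random-fully that duplicate a random-fully rule."""
    fresh = {
        (r.source, r.destination)
        for r in rules
        if r.target == "MASQUERADE" and r.random_fully
    }
    return [
        r
        for r in rules
        if r.target == "MASQUERADE"
        and not r.random_fully
        and (r.source, r.destination) in fresh
    ]


if __name__ == "__main__":
    for rule in stale_masquerade(parse_postrouting(POST_UPGRADE)):
        print(f"stale: MASQUERADE {rule.source} -> {rule.destination}")
```

On the post-upgrade listing this flags the two leftover rules and leaves the pre-upgrade listing alone, since none of its rules carry random-fully.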

To fix it without cycling the nodes, I flushed the POSTROUTING chain (e.g. iptables -t nat -F POSTROUTING), and flannel reconfigured it correctly shortly afterwards. Alternatively, you could just delete those 2 MASQUERADE rules individually.
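
For the alternative of removing only the two leftover rules, a sketch like the one below can build the delete commands from the source and destination columns of the listing. It only prints the commands and does not run them. The helper names `addr_args` and `delete_command` are hypothetical. The assumption is that `iptables -D` with a rule specification removes the rule matching that specification exactly, so the newer rules carrying --random-fully are left in place; check the chain afterwards to confirm.

```python
from typing import List


def addr_args(flag: str, addr: str) -> List[str]:
    """Turn a listing-style address such as '!10.244.0.0/16' into iptables arguments."""
    if addr.startswith("!"):
        return ["!", flag, addr[1:]]
    return [flag, addr]


def delete_command(source: str, destination: str) -> List[str]:
    """Build an 'iptables -t nat -D POSTROUTING ...' command for a plain MASQUERADE rule."""
    return [
        "iptables", "-t", "nat", "-D", "POSTROUTING",
        *addr_args("-s", source),
        *addr_args("-d", destination),
        "-j", "MASQUERADE",
    ]


# Source/destination pairs of the two leftover rules from the listing in this thread.
STALE = [
    ("10.244.0.0/16", "!224.0.0.0/4"),
    ("!10.244.0.0/16", "10.244.0.0/16"),
]

if __name__ == "__main__":
    for src, dst in STALE:
        print(" ".join(delete_command(src, dst)))
```

Printing rather than executing keeps the step reviewable, and the generated rule specifications use the same form that flannel logs when it deletes rules.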

This issue is related: #2169

@venomwaqar please provide additional information, and then we can re-open this issue.
Does the traffic you are attempting work when you do not have policies in place?
