When I upgrade a cluster from kube 1.5.6 (with Calico 2.1.5) to kube 1.7.4 (with Calico 2.6.3), and then try to apply a "default-deny" kube NetworkPolicy (so that all traffic to all pods is blocked), traffic is still allowed. It appears this may be due to the k8s-policy-no-match policy still being there after the upgrade, and allowing traffic through before it would be dropped.
A default-deny NetworkPolicy (one that applies to a namespace but doesn't allow any traffic) should block all traffic.
In this specific upgrade scenario (described above), the traffic is allowed.
Possibly deleting the k8s-policy-no-match policy on the upgrade might solve the problem?
kubectl create -n test-ns-1 -f - <<EOF
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: default-deny
  namespace: test-ns-1
spec:
  podSelector: {}
EOF
Here are the iptables chains protecting the pod on two clusters: one that hasn't been upgraded (a clean install of Calico 2.6.1), which works as I would expect, and one that has been upgraded to 2.6.3 and shows the behavior I think is a bug:
2.6.1 cluster (not upgraded, just clean install, working properly)
Chain cali-tw-caliab932bb1a38 (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:wr_7R_2Ll4gXcpXS */ ctstate RELATED,ESTABLISHED
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:L-sEfYWoGlVgh-Lz */ ctstate INVALID
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:W-fLKYSnYMhZd5Ja */ MARK and 0xfeffffff
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:Uu1VC7XNaL2g8jPG */ /* Start of policies */ MARK and 0xfdffffff
0 0 cali-pi-_fZoFCYDhDNhKISEPthv all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:RCEQSmEz2bmx2wqi */ mark match 0x0/0x2000000
0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:i5LjpeGXCWw1Xxn5 */ /* Return if policy accepted */ mark match 0x1000000/0x1000000
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:HXgOyssdpP6KztWv */ /* Drop if no policies passed packet */ mark match 0x0/0x2000000
0 0 cali-pri-k8s_ns.brad all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:cYtJveSK-8-vntTn */
0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:Nr0JYr6_cj4zPWJK */ /* Return if profile accepted */ mark match 0x1000000/0x1000000
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:JT9V5j5nPunUBdLW */ /* Drop if no profiles matched */
2.6.3 cluster: upgraded from kube 1.5.6 (Calico 2.1.5) to kube 1.7.4 (Calico 2.6.3)
Chain cali-tw-calib1b1eec50b8 (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:A6Z6ayobcObMw_kv */ ctstate RELATED,ESTABLISHED
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:sIXICXfOSGJQxypA */ ctstate INVALID
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:P0o_KjVvl5ruXji5 */ MARK and 0xfeffffff
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:uH4oV6L5fNbFkqLs */ /* Start of policies */ MARK and 0xfdffffff
0 0 cali-pi-_fZoFCYDhDNhKISEPthv all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:5ijObVeZuiCoI3wu */ mark match 0x0/0x2000000
0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:uElfmU_jr-vQ04ec */ /* Return if policy accepted */ mark match 0x1000000/0x1000000
0 0 cali-pi-k8s-policy-no-match all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:vkExxziUNt87hCpZ */ mark match 0x0/0x2000000
0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:3HZWWz28QthvFdvJ */ /* Return if policy accepted */ mark match 0x1000000/0x1000000
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:_XJqN6yxOkc4q2r6 */ /* Drop if no policies passed packet */ mark match 0x0/0x2000000
0 0 cali-pri-k8s_ns.brad all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:2r9Dk9gbua7Y_lln */
0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:TFpA50GV-66vFEr4 */ /* Return if profile accepted */ mark match 0x1000000/0x1000000
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:lneUbySYs_UDD9ij */ /* Drop if no profiles matched */
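To illustrate why the extra jump could matter, here is a minimal Python sketch of how the mark bits in the two chains above appear to be evaluated. It is a simplified model, not Calico's actual code: the bit meanings are read off the rule comments, the function and variable names are hypothetical, and the behavior of each policy and profile (in particular, that k8s-policy-no-match sets the "passed" bit and that the namespace profile accepts) is an assumption made for illustration.

```python
ACCEPT_BIT = 0x1000000  # matched by "Return if policy/profile accepted"
PASS_BIT = 0x2000000    # matched by "Drop if no policies passed packet"


def evaluate(policies, profiles):
    """Walk a cali-tw-* style chain for a new connection and return a verdict."""
    mark = 0
    mark &= ~ACCEPT_BIT  # MARK and 0xfeffffff
    mark &= ~PASS_BIT    # "Start of policies": MARK and 0xfdffffff
    for policy in policies:
        if (mark & PASS_BIT) == 0:  # jump only while no policy has passed the packet
            mark |= policy()
        if mark & ACCEPT_BIT:       # "Return if policy accepted"
            return "accepted by policy"
    if (mark & PASS_BIT) == 0:      # "Drop if no policies passed packet"
        return "dropped: no policy passed packet"
    for profile in profiles:
        mark |= profile()
        if mark & ACCEPT_BIT:       # "Return if profile accepted"
            return "accepted by profile"
    return "dropped: no profiles matched"


def default_deny():
    """The default-deny policy has no allow rules, so it sets no bits."""
    return 0


def k8s_policy_no_match():
    """ASSUMPTION: the leftover policy passes the packet on to the profiles."""
    return PASS_BIT


def namespace_profile():
    """ASSUMPTION: the namespace profile accepts all traffic."""
    return ACCEPT_BIT


clean = evaluate([default_deny], [namespace_profile])
upgraded = evaluate([default_deny, k8s_policy_no_match], [namespace_profile])
print("clean install:", clean)
print("upgraded:     ", upgraded)
```

Under these assumptions, the clean-install chain drops the packet at the "no policies passed" rule, while the upgraded chain skips that rule and falls through to the profile, which would match the behavior I am seeing.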
It does look like someone else hit this problem (see issue https://github.com/projectcalico/kube-controllers/issues/198), where the k8s-policy-no-match is not removed when upgrading to calico 2.6.3 from 2.1.5. Can someone let me know if that k8s-policy-no-match policy really should be getting deleted when upgrading a cluster?
And in our case, when we upgrade from kube 1.5.6 (calico 2.1.5) to kube 1.7.4 (calico 2.6.3), we upgrade one node at a time (existing nodes are running calico-node 2.1.5 managed by systemd, NOT kube-hosted). So when we upgrade a node, we remove the systemd-managed calico-node 2.1.5, and then start the calico-node daemonset on that node running 2.6.3. So we do have a period of time when we are running 2.6.3 on some nodes, and 2.1.5 on other nodes. At what point should the k8s-policy-no-match policy be removed? Once all the nodes have been upgraded? Once kube-controller is updated?
This is preventing us from moving to 2.6.3, so any help would be appreciated. Thanks.
@bradbehle yes, I think it's probably the k8s-policy-no-match chain that is preventing the default-deny from working correctly.
The relevant code seems to exist in v0.7.0 of the policy controller here.
However, corresponding code doesn't seem to exist in kube-controllers v1.0.0. Given the large version skew, it might be safest to do an incremental upgrade to Calico v2.5 first (which contains the above code), and then to v2.6.
However, I think that as soon as the new kube-controllers pod has been upgraded to v1.0 it is safe to remove the k8s-policy-no-match.
@caseydavenport I did some more digging on this issue, and the code to remove this policy does exist in v1.0.0 of calico-kube-controllers (the container log shows it being deleted), but that same code does not exist (or does not get run) in the v1.0.2 version. We can work around this in our deploy by deleting the policy ourselves as soon as calico-kube-controllers is running, but I would like to know if this was taken out intentionally, and if so, why? (So that we don't delete it and then find out the hard way there is a reason it should be there.) Also, this change in behavior could be seen as a serious issue, since it results in pods not being protected by deny policies like they should be once an upgrade happens.
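As a rough sketch of that workaround, something like the following could be called from a deploy script once calico-kube-controllers is running. It assumes calicoctl is on the PATH and configured for the cluster's datastore, and that the installed calicoctl version accepts `calicoctl delete policy <name>`; the function name and the injectable `run` parameter are hypothetical and exist only so the call can be exercised without a cluster.

```python
import subprocess

STALE_POLICY = "k8s-policy-no-match"


def delete_stale_policy(run=subprocess.run):
    """Ask calicoctl to delete the leftover policy; return True on exit code 0.

    ASSUMPTION: "calicoctl delete policy <name>" is valid for the installed
    calicoctl version. Check this against its help output before relying on it.
    """
    cmd = ["calicoctl", "delete", "policy", STALE_POLICY]
    result = run(cmd, capture_output=True, text=True)
    return result.returncode == 0
```

Calling `delete_stale_policy()` with no arguments runs the real command. A False result only means calicoctl exited non-zero (for example, because the policy was already gone), so the script should decide for itself whether to treat that as an error.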
@bradbehle thanks for the investigation.
Yes, I think we should fix this by re-implementing the missing code. I believe it was removed intentionally, the thinking being that users would not perform direct upgrades from an earlier version of the controllers code to v1.0, instead going through v0.7.0 which has the relevant code.
While that's still the recommended way, it should be low-cost to add this piece of code back into v1.0.x, hopefully saving some pain.
The same behaviour happened to us (admittedly, we were upgrading from node 0.22 and a very old policy controller to 2.6.5), and @mrrandrade discovered that removing k8s-policy-no-match works fine.
The thing is that it used to work in policy-controller 1.0.0 (the one we used for our migration test) but not in 1.0.2 (released after our tests; as there were no documented breaking changes, we just used that one).
One question: since this 'migration' step is not desired, wouldn't it be better for the upgrade docs to point to removing this policy after the migration, so that DefaultDeny starts working again?
Thank you very much!!
@rikatz I think it was a bug to remove this from v1.0.2, and I'd like to add it back into a v1.0.3 release and supersede v1.0.2. That way no manual step will be required.
We can clean up the code again once the code is sufficiently old.
Ok, I think I found the root cause of this and I've got a fix here: https://github.com/projectcalico/kube-controllers/pull/208
I've released https://github.com/projectcalico/kube-controllers/releases/tag/v1.0.3 with that fix in it.
@rikatz hopefully using v1.0.3 will remove the need for a migration step in your scripts. LMK!