We would like the ability to detect when nodes have valid cni plugin configurations before launching pods on that node.
Definitions:
Definitions:
- `.conf` is a file that is used when you only have one CNI plugin to apply.
- `.conflist` is a file that is used when you are chaining multiple CNI plugins.
- The container runtime loads a single `.conf`/`.conflist` file and ignores the rest.

Conditions: Two CNI plugins (one of them is the linkerd-cni plugin) and a launching Pod that wants to be meshed.
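To make the two shapes concrete, here is a minimal sketch (the field values are illustrative, not taken from a real install; only `cniVersion`, `name`, `type`, and `plugins` are standard CNI config fields):

```python
# Minimal shapes of the two CNI config flavors (illustrative values).
conf = {  # .conf: exactly one plugin, its "type" at the top level
    "cniVersion": "0.3.1",
    "name": "linkerd-cni",
    "type": "linkerd-cni",
}

conflist = {  # .conflist: a "plugins" array, executed in order
    "cniVersion": "0.3.1",
    "name": "k8s-pod-network",
    "plugins": [
        {"type": "calico"},       # assigns the pod its ipAddress
        {"type": "linkerd-cni"},  # writes the pod's iptables rules
    ],
}
```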
Requirement: We would like Linkerd-enabled Pods to have TLS communication.
Situation: We have told our service teams to communicate in plain text, since Linkerd will take care of in-transit TLS to and from meshed pods.
Working scenario:
1) A Calico daemonset is deployed and has successfully written its .conflist to the host machine.
note: On the target host, no pods that require the Linkerd CNI plugin have been launched
2) A Linkerd-CNI daemonset is deployed and has successfully added its configuration to the existing .conflist on the host machine.
3) A Pod that we want the Linkerd proxy on is launched.
4) The CNI plugin chain loads the combined .conflist on the host machine.
5) The CNI chain completes successfully
6) The Pod is launched with an ipAddress (from Calico) and iptables (from Linkerd-CNI)
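Step 2 above (the in-place update) can be sketched as: load the existing `.conflist`, append a linkerd-cni entry if one is not already present, and write it back. This is a hedged illustration of the idea, not the actual linkerd-cni install script:

```python
import json

def append_linkerd(conflist_path: str) -> None:
    """Idempotently chain a linkerd-cni entry onto an existing .conflist."""
    with open(conflist_path) as f:
        config = json.load(f)
    plugins = config.setdefault("plugins", [])
    if not any(p.get("type") == "linkerd-cni" for p in plugins):
        plugins.append({"type": "linkerd-cni"})  # illustrative entry only
    with open(conflist_path, "w") as f:
        json.dump(config, f, indent=2)
```

Because the update is idempotent, re-running the daemonset's install step cannot duplicate the entry.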
Failure scenario (lexicographical order):
1) A Linkerd-CNI daemonset is deployed and has successfully written its .conf file to the host machine.
2) A Calico daemonset is deployed and has successfully written its .conflist to the host machine.
note: On the target host, no pods that require the Linkerd CNI plugin have been launched
3) There are now two cni files on the host machine:
- the Linkerd-CNI .conf file
- 10-calico.conflist
4) A Pod that we want the Linkerd proxy on is launched.
5) The CNI system loads only the Linkerd .conf file (because it is first in lexicographical order) and ignores the Calico .conflist.
6) The Pod stalls with no ipAddress, since the Calico plugin never runs.

Failure scenario (race between pod launch and the Linkerd-CNI daemonset writing its .conf/.conflist files to the host):
1) A Calico daemonset is deployed and has successfully written its .conflist to the host machine.
note: On the target host, no pods that require the Linkerd CNI plugin have been launched
2) A Linkerd-CNI daemonset is deployed BUT it has NOT written its .conf file to the host machine yet.
3) There is now only one cni file on the host machine:
- 10-calico.conflist
4) A Pod that we want the Linkerd proxy on is launched.
5) The CNI plugin chain loads the Calico .conflist file (because it is the only .conflist on the host)
6) The CNI chain completes successfully
7) The Pod is launched with an ipAddress (from Calico) BUT there are NO iptables (from Linkerd-CNI)
* Any pods that launch before the Linkerd-CNI daemonset updates the .conflist file are going to be missing their iptables
8) The Linkerd-CNI daemonset finally updates the existing .conflist and creates a valid cni chain
9) Launch another Pod
10) This pod will have the correct ipAddress and iptables applied since we have a valid .conflist now.
* Any pods now launched will have the ipAddress and iptables applied
11) Now we are in a scenario where we have some ACTIVE pods that do not have iptables written.
* The intent is for ALL of the pods to have the Linkerd proxy on them to apply TLS
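One way to audit for the stranded pods in step 11, assuming pod start times have already been fetched from the Kubernetes API (`status.startTime`): any pod that started before the `.conflist` was last modified cannot have gone through the completed chain. Using the file's mtime as the cutover point is an assumption; a plugin rewriting the file for unrelated reasons would also bump it:

```python
import datetime
import os

def pods_missing_iptables(pod_start_times: dict, conflist_path: str) -> list:
    """Pods that started before the conflist last changed are suspect:
    they were wired up by whatever chain existed at the time, which may
    not have included linkerd-cni."""
    updated = datetime.datetime.fromtimestamp(os.path.getmtime(conflist_path))
    return sorted(name for name, started in pod_start_times.items()
                  if started < updated)
```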
This problem is specific to using CNI plugins to apply iptables.
It seems that these cni pods go ready before they actually write their config files to the host machines. Different cni plugins will write to the host in a non-deterministic order. This would normally not be a problem.
It becomes a problem when different cni plugins do not correctly incorporate previously existing .conf and .conflist files (from other CNI plugins).
For example, if we launch the Linkerd CNI pod and it writes to the host first, we have noticed that the Calico CNI plugin seems to ignore the existing .conf file. Both files will exist on disk but the CNI system will choose the Linkerd CNI config only. This results in a stalled Pod, since no ipAddresses are applied (the Calico CNI plugin gets skipped).
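The selection behavior described here can be mimicked in a few lines: the runtime effectively sorts the conf directory and takes the first `.conf`/`.conflist`, ignoring everything else (the file names below are hypothetical):

```python
def pick_cni_config(filenames):
    """Mimic the runtime's choice: first .conf/.conflist in sorted order."""
    candidates = sorted(f for f in filenames
                        if f.endswith((".conf", ".conflist")))
    return candidates[0] if candidates else None

# The stray Linkerd .conf sorts before Calico's .conflist, so Calico is
# skipped entirely and the pod never gets an ipAddress:
print(pick_cni_config(["10-calico.conflist", "01-linkerd-cni.conf"]))
# → 01-linkerd-cni.conf
```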
There can be a race condition between when the Linkerd-CNI plugin writes its .conf file and when pods get deployed on the node. The result is that the CNI plugin chain will complete (depending on the order) without error and the pod launches. The downside is that even though we have a valid Linkerd CNI plugin installed, we did not actually write the iptables for the newly launched pod. This is problematic because if you have two meshed services that are expecting to communicate over TLS, you could end up speaking plaintext over the channel (since the iptables were never written, all inbound requests end up routing to the container's listening ports and not the Linkerd proxy). I don't think you would currently be able to detect this scenario without explicitly checking which services are meshed.
It would be great to have the ability to detect situations like this and deny or kill existing deployments of those pods if they are not configured correctly.
It would be great if all CNI plugins actually took other CNI plugins into account. Even if this were true, we believe the race condition that occurs when new Nodes spin up will still be a thing.
It may make sense to have a daemonset that can monitor the pods on each node for correct iptables (assuming they are configured to be meshed). These pods would need CAP_NET_ADMIN in order to directly detect if the iptables are written correctly.
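A sketch of what that monitor could check, assuming the daemonset pod has CAP_NET_ADMIN and can run `iptables-save` in the target pod's network namespace. The chain name `PROXY_INIT_REDIRECT` is what linkerd's proxy-init uses, but treat both it and the sample dumps as assumptions:

```python
def has_proxy_redirect(iptables_save_output: str,
                       chain: str = "PROXY_INIT_REDIRECT") -> bool:
    """True if the NAT dump declares or jumps to the proxy redirect chain."""
    return any(line.startswith(f":{chain}") or f"-j {chain}" in line
               for line in iptables_save_output.splitlines())

# Abridged, hypothetical iptables-save fragments:
meshed = ":PROXY_INIT_REDIRECT - [0:0]\n-A PREROUTING -j PROXY_INIT_REDIRECT"
unmeshed = ":PREROUTING ACCEPT [0:0]"
print(has_proxy_redirect(meshed), has_proxy_redirect(unmeshed))  # → True False
```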
To rule out the multi-CNI-plugin race condition, Cluster Administrators could manually ensure that there is only 1 custom .conflist file deployed. This is not a general solution though. ~The issue is that there may still be a race condition between when pods spin up on a new Node and when the CNI plugin config gets written to disk.~
It would be great in general if we could reject any pods that come up in this scary configuration.
I'm out of ideas for alternative solutions. I did consider Admission controllers, however this is something that manifests during pod bootup at runtime.
Is there another way to solve this problem that isn't as good a solution?
We could also log when the Linkerd CNI plugin actually gets invoked and make sure they correlate with meshed pod launches. This seems problematic for many reasons.
It would be nice if pods failed to launch, or were terminated, after they have been determined to be in a bad configuration.
If you can, explain how users will be able to use this. Maybe some sample CLI output?
It would be great if it was transparent.
Some thoughts, without any deep understanding of CNI ordering that I'll need to go research some more.
In this instance, the kubelet has been configured with --network-plugin=cni. The daemonset starts up, looks at the config and does an in-place update. That sounds right to me. We should have a check that verifies the config has been written successfully on all the instances. Feels like this should be a readiness check on the daemonset pod itself (exec not http).
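The exec readiness check could be as small as: succeed only once some `.conflist` in the (host-mounted) CNI conf directory actually chains linkerd-cni. Paths and wiring are illustrative; the real probe would exit nonzero whenever this returns False:

```python
import glob
import json

def config_ready(conf_dir: str) -> bool:
    """True once a .conflist in the CNI conf dir has a linkerd-cni entry."""
    for path in sorted(glob.glob(f"{conf_dir}/*.conflist")):
        with open(path) as f:
            config = json.load(f)
        if any(p.get("type") == "linkerd-cni"
               for p in config.get("plugins", [])):
            return True
    return False
```

Wired as a `readinessProbe.exec` command against the host path mounted into the daemonset pod, this keeps the pod NotReady until the in-place update has really landed on disk.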
In this instance, the kubelet is originally configured with --network-plugin=kubenet. The daemonset starts up, looks for a config, doesn't find one and writes its own. This feels like a weird use case as the daemonset won't actually do anything in this situation. My proposal would be to do nothing and have the readiness check fail. The check would error out and you'd get some warnings that you're trying to do something your cluster doesn't support.
This is the scenario that scares me the most. Ignoring any timing issues re. calico, it is entirely possible for someone to inject and skip the sidecar. They'll then be totally oblivious to the lack of things like metrics and TLS. We actually ran into this previously when the initContainer would fail silently (since fixed).
Ideally, in this situation, inject would error out and explain that your config is invalid. @klingerf you're tackling the config/inject updates. What do you think?
In the non-ideal situation, that should be planned for anyways, there's either a check for this situation or an audit that explains whether the iptables rules are setup correctly.
Scenario 2, as described by @khappucino , is also assuming --network-plugin=cni, just that the "real" cni plugin that assigns IP addresses to pods hasn't been installed yet.
@codeman9 pods start up even though there's no config?
@grampelberg the cluster in scenario 2 is started up: 1) without a cni plugin configuration or, 2) the configuration has been deleted somehow or, 3) the kubelet has been restarted to point to a different cni configuration directory that is empty.
One possible option may include cluster admins constructing a Node image that contains a single composite config file with all of the cni plugins in the list. This image would also have the relevant binaries to support the cni plugins that are called out in the composite config file. This way there is no uncertainty with respect to what cni configuration is on any given node. Each node will boot up with the complete set of config and binary files.
A more flexible alternative may be to have a single CNI daemonset that encapsulates all of the separate cni dependencies at once. ie. One daemonset pod that writes the combined config file and the associated binary files.
Another possibility is marking newly spawned nodes as tainted until we can determine if the node has the correct configuration deployed (also setting pod deployment priority even though it is best effort).
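For the taint idea, one hedged shape: register new nodes with a startup taint (e.g. via the kubelet's `--register-with-taints` flag) and have a checker lift it once the config looks right. The taint key `cni-pending` is made up for illustration; only the trailing-dash removal syntax is standard `kubectl taint` behavior:

```python
from typing import Optional

def untaint_command(node: str, config_ok: bool,
                    taint: str = "cni-pending") -> Optional[str]:
    """Command that lifts the hypothetical startup taint, or None if the
    node's CNI config is still incomplete and pods should stay off it."""
    if not config_ok:
        return None
    # A trailing '-' on the taint removes it (standard kubectl syntax).
    return f"kubectl taint nodes {node} {taint}:NoSchedule-"
```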
We got around this issue by baking the cni binaries (linkerd2-cni and flannel) into the node VM base images. This removed any uncertainty of component launch order and availability of conf/binary files.
Actually, this is a real problem out in the wild, especially when using managed Kubernetes cluster nodes (e.g. AWS EKS managed node groups), or if you don't want to maintain your own VM base images. On EKS there is a daemonset called aws-node, which is the AWS VPC CNI plugin, and that plugin might override the changes that the linkerd-cni plugin wrote to the .conflist file.
Note, IMHO this is clearly a serious bug and definitely not a feature request. It is not acceptable for production clusters that such "configuration race conditions" exist, as it is really hard and time consuming to figure out why some pods in the mesh are not working just because some daemonsets' pods have been started in different orders.