We are facing DNS lookup latencies, as described in many Kubernetes issues, such as #45363.
One of several solutions is to run a node-local dnsmasq instance, accessible via the node IP address. In practice, this works extremely well to solve the DNS problems.
The solution is relatively simple. Using our own base image, we can run dnsmasq. All we need then is to use this option in kubelet:
--cluster-dns ${NODE_IP}
But this simple thing turns out to be very hard, unless we're missing something. Since the node IP is dynamic, we cannot provide a hard-coded value for it. How can we inject the node IP?
There are several strategies that do not work:
We can provide a placeholder, but since we do not control the setup of the kubelet service, we cannot inject an environment variable.
It doesn't appear that a snippet of code, like $(/usr/bin/curl --silent --fail http://169.254.169.254/latest/meta-data/local-ipv4), is valid.
We have our own base image, and we can find out the node IP. But our script finishes before nodeup, so we cannot modify the resulting /etc/sysconfig/kubelet file, since it's not present when our scripts run on instance start.
I've been reading about this issue this morning, and it's very serious. Kops should support a solution by default, IMO.
EDIT: maybe we can add this to the agenda of the next kops office hours, happening next week on Friday.
Isn't the new flag implemented as discussed in https://github.com/kubernetes/kops/issues/5283#issue-329881911 going to solve this issue as well?
@Raffo TBH I really don't know what the right thing to do is, because the right answer varies over time.
In the long term, the right fix is to do nothing, because the fixes needed are in the netfilter code. The issue can be avoided by using a networking layer that doesn't use NAT, for example Calico or the AWS router. In this case, I would have appreciated a warning on the networking page to this effect.
If the desire is for kops clusters to 'just work' on a shorter timescale, then my recommendation is for kops to deploy dnsmasq on each node, as we did in https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-41110920. It seems that many people eventually go this route. The manner in which dnsmasq is implemented on the node is unimportant. We did it as a service on the node, to avoid as much NAT as possible, but a daemonset with hostNetwork: true would work.
The traffic-shaping workaround and the fully-random workaround would of course be specific to the networking plugins, so they should implement them. From a kops view, it might be nice to track which ones have the fix, for the benefit of those choosing a network plugin.
This DNS issue has been, by far, the most frustrating part of my Kubernetes experience. It's the best-kept dirty little k8s secret. You typically don't see it until you get ready to go live and start doing load testing.
I hear your frustration, and I understand it's a pity that this is not well documented. It was indeed a surprise for me as well, and it doesn't look like there is enough material on the right way to approach this problem.
Are you sure that Calico would do the trick here? I tried to run the test that was shared in the thread on kubernetes/kubernetes, and I have problems with Calico as well.
@Raffo I'm no expert, but I'm reasonably sure. What I'm sure of is that the DNS issues are ultimately due to SNAT and DNAT race conditions in the kernel's netfilter code. Given that, I can also be pretty sure that any networking layer that doesn't use SNAT or DNAT won't experience the problem. I think most of the other choices end up allocating a public IP on the host network adapter for the pod, and then doing BGP or other routing solutions to get the traffic around. In that case, there should be no NAT, and thus no DNS issues.
The problem is that 'my DNS times out' is kind of like 'my car won't run'. There are tons of different causes, because packet loss is inevitable no matter what you do.
That's why I think dnsmasq on the node is the 'right' solution. No matter what you do, you'll always have packet loss/latency with a UDP protocol going over the wire. In a k8s cluster, 99.9% (literally) of all lookups are repeat lookups, so simply using a node-local cache is the right answer.
Sounds reasonable and it should be possible to do in kops.
@Raffo sounds good! In case it is helpful, here's the most important source code for what we did (we use our own CentOS-based images):
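# Create a loopback alias with a fixed address so every node exposes the same DNS IP;
# 198.18.0.1 is taken from 198.18.0.0/15, an IANA-reserved benchmarking range.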
cat <<EOF > /etc/sysconfig/network-scripts/ifcfg-lo:0
DEVICE=lo:0
BOOTPROTO=static
IPADDR=198.18.0.1
NETMASK=255.255.255.255
ONBOOT=yes
EOF
ifup lo:0
#then setup dnsmasq to listen on that new subinterface
yum install -y dnsmasq
systemctl enable dnsmasq
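# dnsmasq config: cache locally, forward cluster.local and the reverse zones to the
# kube-dns service IP (100.64.0.10), and let everything else fall through to the
# upstream resolvers dnsmasq picks up from /etc/resolv.conf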
cat <<EOF > /etc/dnsmasq.d/colinx
cache-size=1000
log-queries
dns-forward-max=1500
all-servers
neg-ttl=30
listen-address=198.18.0.1
server=/cluster.local/100.64.0.10#53
server=/in-addr.arpa/100.64.0.10#53
server=/ip6.arpa/100.64.0.10#53
EOF
systemctl start dnsmasq
Coupled with kubelet's --cluster-dns=198.18.0.1,100.64.0.10, we get the behavior we want.
Thanks for adding this. The ideal way for kops would be not having to touch the AMI, as users should IMO be able to run their own AMI (or one of the standard ones) without experiencing this issue.
@Raffo yes, that makes perfect sense. It should be no problem for kops to add a daemonset with hostNetwork=true, and then somehow inject the host IP or something into --cluster-dns.
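For the daemonset half (not the --cluster-dns half, which is the hard part discussed above), a rough sketch of how a pod could discover the node IP is the Kubernetes downward API; the container and image names below are illustrative only, not an actual kops manifest:

spec:
  hostNetwork: true
  containers:
  - name: node-local-dns          # hypothetical container name
    image: example.org/dnsmasq    # placeholder image
    env:
    - name: NODE_IP               # illustrative variable name
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP   # exposes the node's IP to the container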
As a side note, I feel like I need to again plug #387. We find it very difficult and annoying that there is no way to hook scripts into nodeup -- either to run our own steps, or to even know when nodeup has finished. On AWS, the timing for nodeup to run is wildly variable, and any automation around local images is nearly impossible because we cannot guess when most of the kops scripts and config files will be ready.
The creation of a local IP address is somewhat odd, and is a direct side effect of the lack of a way to inject code to [for example] compute the Weave IP / host IP for inclusion in the --cluster-dns manifest, or to modify the manifest after the fact.
Thank you for investigating this!
Hey, @Raffo
OK, so we tried this, and I'm disappointed to say kops is really making this hard for us.
First, we tried setting clusterDNS in the manifest to 198.18.0.1,100.64.0.10. This comma-delimited format is accepted by kubelet, but kops returns a validation error rejecting it.
Unfortunate -- we have to rely on 198.18.0.1 alone working. OK, we think -- but wait, there's more.
If we attempt to set clusterDns to 198.18.0.1, we get this error from kops:
error populating cluster spec: Completed cluster failed validation: Spec.kubeDNS.serverIP: Invalid value: "100.64.0.10": Kubelet ClusterDNS did not match cluster kubeDNS.serverIP
This error means our strategy will not work at all. Why have a field whose only valid value is the same as another field, I wonder?
To be clear, when we manually update /etc/sysconfig/kubelet, and put the value we want in --cluster-dns, things work as expected.
We are assuming that if we change Spec.kubeDNS.serverIP to 198.18.0.1, it will re-wire the CoreDNS manifests to attempt to listen on that interface, which is not what we want.
So at this point this is purely a fight to get kops to configure the manifests the way we'd like...
Well, kops is a community project; if this is creating major issues, we have to address it. I'd propose to discuss the issue at the next kops office hours and see where we can go from there.
@Raffo that's all I can ask! Would you like me to attend the office hours? If so, I'll try to do that. If not, I'm very comfortable that you can present the situation.
At the moment I'm about to clone the repo and see if I can just disable this check -- if that works for us, and if it's acceptable, I'll make a PR.
@Raffo for now, we just manually edited cluster.spec in the S3 bucket. This works -- only the kops update command cares about the format, but this is valid and works in cluster.spec:
kubelet:
  allowPrivileged: true
  cgroupRoot: /
  cloudProvider: aws
  clusterDNS: 198.18.0.1,100.64.0.10
with 'works' defined as 'the value gets into /etc/sysconfig/kubelet and my DNS timeouts stop'.
Thanks again for entertaining our plight!
Very interesting! We're facing the same DNS timeout issue.
We tried Weave-tc by @Quentin-M and it helps, but like @dcowden I think that having dnsmasq provisioned by kops would be amazing while we wait for the netfilter fix.
I've tried Weave, Calico, and amazon-vpc-cni-k8s 1.0.0, and the issue appears with all of them...
@dcowden the office hours info is here: https://github.com/kubernetes/kops#office-hours (the next meeting is next Friday). From my experience, the meeting is the right place to discuss these issues. I will try to attend and mention it; if you can join as well, that'd be great :)
@Raffo thanks, I've added this to my calendar and I'll try to attend. Right now, we're in a difficult situation: to get what we need done, we've had to abandon using the kops frontend and edit the spec in the S3 bucket directly.
Sorry I missed this discussion, and the bigger issue. I'm looking into this - I didn't fully grok that we had such a clean repro, so I'm going to try to repro and figure out what we can do.
@justinsb thanks for jumping in!
In this case there are two potential solution families:
(1) Relax validation to tolerate DNS configurations that are different from those currently contemplated (i.e., become less opinionated). In this case, clusterDNS should be opened up substantially from a validation viewpoint.
(2) Deploy dnsmasq on the nodes, as we have done, as the 'right way' (i.e., stay opinionated, with the idea that it's more important to make it 'just work' than to be flexible). In this case, clusterDNS should be removed.
Right now, our view is that the only 'right' way to deploy k8s when you are using CNI is to run dnsmasq on the node. It's also our view that there should not be a configuration whose only valid value must match another existing configuration.
With the tool I was able to see the issue - thank you! But my cause wasn't insert_failed; it seemed to be "random" UDP drops. I also used iperf3 to validate that UDP drops on AWS are not uncommon (as high as 1 in 1000 packets, though drops seemed very bursty).
So I think that the best option is going to be a local proxy, but it's not clear (to me) exactly what we should deliver. In the meantime, I created #5610, which adds a feature flag that turns off validation, so if you have a static value kops won't replace it or complain that it doesn't match the kube-dns service.
Hopefully in 1.11 we can set up a real agent - my guess is that a daemonset is going to be the easiest option, but hopefully the feature flag will let us experiment!
@justinsb for what it's worth, it appears everyone who runs Kubernetes runs into this DNS thing. I guess that, relative to production use, we're earlier adopters than I thought!
kubespray also supports adding a dnsmasq daemonset.
There are definitely configurations that don't hit it as much. GCE seems much more reliable for UDP packets than AWS, which likely explains some of this. It does seem to happen more with some CNI providers as well, but we don't yet have a root cause.
I definitely agree that going with a local mode feels like the most practical workaround - there will always be _some_ UDP packet drops. I just don't think we can do much more than #5610 for kops 1.10, but hopefully that is sufficient for experimentation and comparison of the various approaches - and unblocks anyone that wants to run their own configuration.
PS thanks for the kubespray link. I don't see the daemonset (https://github.com/kubernetes-incubator/kubespray/blob/master/roles/dnsmasq/templates/dnsmasq-deploy.yml#L3), but it's definitely documented as doing that, so there must be some magic I'm missing! But it's a good place to start anyway :-)
Hi @justinsb,
Thanks for #5610; that's all we need in 1.10. We have a base image for other reasons, and dnsmasq is super easy to set up.
Thanks for the patch @justinsb. Zalando, as far as I know, is running the DaemonSet https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/kube-dns/node-local-daemonset.yaml /cc @szuecs for more details.
Hi @dcowden, @justinsb,
Is there a way to set up the loopback network interface for dnsmasq via the kops spec?
@rekcah78 I'm not sure I understand your question. The main point of this thread is that it was impossible to create a valid kops spec to achieve the goal. However, combined with the new ExperimentalClusterDNS flag, coming soon, the kops spec will look something like this:
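# A sketch based on the values used earlier in this thread: 198.18.0.1 is the
# node-local dnsmasq address, 100.64.0.10 the kube-dns service IP.
kubelet:
  clusterDNS: 198.18.0.1,100.64.0.10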
This does not run dnsmasq -- it just configures the cluster to use it. We set up dnsmasq manually on the underlying node; the setup is above.
You're right @dcowden: with my own image, KOPS_FEATURE_FLAGS="+ExperimentalClusterDNS", and spec.kubelet.clusterDNS: 198.18.0.1,100.64.0.10, it all works fine.
Excellent! When will +ExperimentalClusterDNS be available in a kops release?
I believe it should be a goal to find a solution that does not require baking a custom image, as a lot of users rely on the default kops images or Ubuntu/Debian.
@dcowden it's available in kops 1.10.
@Raffo @justinsb Score! Thanks for that, I hadn't seen the release! We'll give it a try very soon. We have a planning meeting for a cluster upgrade next week -- this is welcome news.
Side note, I agree that the ideal solution for kops is to avoid a custom image-- for most users this is better. I'd request that our approach still work, however.
Just FYI, the folks at Weaveworks just released a blog post with a good write-up of the problem we are seeing here: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts /cc @justinsb
@rekcah78 could you be more specific about how you build your own AMI?
Someone pointed out that
NetRange: 198.18.0.0 - 198.19.255.255
CIDR: 198.18.0.0/15
NetName: SPECIAL-IPV4-BENCHMARK-TESTING-IANA-RESERVED
NetHandle: NET-198-18-0-0-1
Parent: NET198 (NET-198-0-0-0-0)
NetType: IANA Special Use
Maybe you'll never need to connect to anything in SPECIAL-IPV4-BENCHMARK-TESTING-IANA-RESERVED, but it's worth being aware of nevertheless.
This issue is affecting us too. Rather than mess with the KOPS image, as a short-term hack we're putting a DNS cache in-process for a lot of our apps. We don't want to make custom KOPS images if possible.
What's a little frustrating about this is that any high-traffic service platform deployed on AWS will suffer fairly significantly from this issue, because AWS uses CNAMEs for almost everything.
This will let you use the node IP as cluster-dns in kops on AWS.
You should be able to substitute the curl command with whatever command prints the IP of, for example, an interface.
export KOPS_FEATURE_FLAGS=+ExperimentalClusterDNS
hooks:
- before:
  - kubelet.service
  manifest: |
    [Unit]
    Description=Set PRIVATE_EC2_IPV4 cluster-dns
    [Service]
    ExecStart=/bin/bash -c "sed -i 's/PRIVATE_EC2_IPV4/'$(/usr/bin/curl --silent --fail http://169.254.169.254/latest/meta-data/local-ipv4)'/g' /etc/sysconfig/kubelet; /bin/systemctl daemon-reload; /bin/systemctl restart kubelet"
    RemainAfterExit=yes
  name: private-ipv4
kubelet:
  clusterDNS: PRIVATE_EC2_IPV4
You can find an example of a dnsmasq with Prometheus metrics here:
https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/kube-dns/node-local-daemonset.yaml
Remember to edit the daemonset and change it to point to your internal server.
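The parts to change are the dnsmasq forwarder args in the manifest below; the addresses are just this cluster's values (10.0.0.2 is presumably the VPC resolver, 100.64.0.10 the kube-dns service IP):

  - --server=10.0.0.2#53                     # upstream resolver for this cluster
  - --server=/cluster.local/100.64.0.10#53   # kube-dns service IP for this cluster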
For reference, here are my node-local-dns manifests:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dnsmasq-node
  namespace: kube-system
  labels:
    application: dnsmasq-node
spec:
  selector:
    matchLabels:
      application: dnsmasq-node
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        application: dnsmasq-node
    spec:
      priorityClassName: system-node-critical
      serviceAccountName: system
      tolerations:
      - operator: Exists
        effect: NoSchedule
      - operator: Exists
        effect: NoExecute
      volumes:
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      containers:
      - name: dnsmasq
        image: registry.opensource.zalan.do/teapot/k8s-dns-dnsmasq-nanny-amd64:1.14.10
        securityContext:
          privileged: true
        livenessProbe:
          httpGet:
            path: /healthcheck/dnsmasq
            port: 9054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=10000
        # Set low dnsmasq --neg-ttl instead of --no-negcache https://github.com/kubernetes/dns/issues/239
        - --neg-ttl=10
        - --server=10.0.0.2#53
        - --server=8.8.8.8#53
        - --server=8.8.4.4#53
        - --server=/cluster.local/100.64.0.10#53
        - --server=/in-addr.arpa/100.64.0.10#53
        - --server=/ip6.arpa/100.64.0.10#53
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        # see: https://github.com/kubernetes/kubernetes/issues/29055 for details
        resources:
          requests:
            cpu: 100m
            memory: 25Mi
        volumeMounts:
        - name: kube-dns-config
          mountPath: /etc/k8s/dns/dnsmasq-nanny
      - name: sidecar
        image: registry.opensource.zalan.do/teapot/k8s-dns-sidecar-amd64:1.14.10
        securityContext:
          privileged: true
        livenessProbe:
          httpGet:
            path: /metrics
            port: 9054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --v=2
        - --logtostderr
        - --probe=dnsmasq,127.0.0.1:53,ec2.amazonaws.com,5,A
        - --prometheus-port=9054
        ports:
        - containerPort: 9054
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: 15Mi
      hostNetwork: true
      dnsPolicy: Default
      automountServiceAccountToken: false
---
kind: Service
apiVersion: v1
metadata:
  name: kube-dns-local
  namespace: kube-system
  labels:
    application: dnsmasq-node
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "KubeDNS"
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "9054"
    prometheus.io/scrape: "true"
spec:
  selector:
    application: dnsmasq-node
  type: ClusterIP
  ports:
  - name: monitor
    port: 9054
    targetPort: 9054
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: dnsmasq-node
  name: kube-dns-local
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: monitor
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      application: dnsmasq-node
Remember to think about https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html and the musl libc caveats.
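Since ndots is the other common amplifier of DNS traffic, here is a hedged sketch (a standard pod-spec field, values illustrative) of lowering it per pod via dnsConfig:

spec:
  dnsConfig:
    options:
    - name: ndots
      value: "1"   # default is 5; lower values avoid trying every search domain first

With the default ndots:5, an external name like example.com is tried against every search domain before the absolute query, which multiplies UDP round trips; lowering ndots or using fully-qualified names (with a trailing dot) avoids that.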
Happy haxxing! :)
Could kops potentially leverage this? https://github.com/kubernetes/kubernetes/pull/70555
Is it possible to use the nodelocaldns addon somehow?
I am going to test the nodelocaldns addon in a few weeks, just to see if there are any significant performance changes with it, though not on AWS, just on an on-premises cluster built by kubespray.
@jcperezamin in our case we're probably hitting conntrack limits, and after 65k fast DNS queries we start losing requests, as explained here: https://www.youtube.com/watch?v=Il-yzqBrUdo&feature=youtu.be&t=553
Update: how do you plan to test it? I'm trying to see how I can install it.
I am going to use kubespray; I tested this fix and it is working out:
https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-469294516. nodelocaldns is part of the kubespray deployment (https://github.com/kubernetes-sigs/kubespray/blob/master/inventory/sample/group_vars/k8s-cluster/k8s-cluster.yml); just enable it:
enable_nodelocaldns: true
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@cdobbyn: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@igoratencompass: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.