We are facing DNS lookup latencies, as described in many Kubernetes issues, such as #45363.
One of several solutions is to run a node-local dnsmasq instance, accessible via the node IP address. In practice, this works extremely well to solve the DNS problems.
The solution is relatively simple. Using our own base image, we can run dnsmasq. All we need then is to use this option in kubelet:
--cluster-dns ${NODE_IP}
But this simple thing turns out to be very hard, unless we're missing something. Since the node IP is dynamic, we cannot provide a hard-coded value for it. How can we inject the node IP?
There are several strategies that do not work:
We can provide a placeholder, but since we do not control the setup of the kubelet service, we cannot inject an environment variable.
It doesn't appear that a snippet of code, like $(/usr/bin/curl --silent --fail http://169.254.169.254/latest/meta-data/local-ipv4), is valid.
We have our own base image, and we can find out the node IP. But our script finishes before nodeup, so we cannot modify the resulting /etc/sysconfig/kubelet file, since it's not present when our scripts run on instance start.
I've been reading about this issue this morning, and it's very serious. Kops should support a solution by default, IMO.
EDIT: maybe we can add this to the agenda of the next kops office hours, happening next week on Friday.
Isn't the new flag implemented as discussed in https://github.com/kubernetes/kops/issues/5283#issue-329881911 going to solve this issue as well?
@Raffo TBH I really don't know what the right thing to do is, because the right answer varies over time.
In the long term, the right fix is to do nothing, because the fixes needed are in the netfilter code. The issue can be avoided by using a networking layer that doesn't use NAT, for example Calico or the AWS router. In this case, I would have appreciated a warning on the networking page to this effect.
If the desire is for kops clusters to 'just work' on a shorter timescale, then my recommendation is for kops to deploy dnsmasq on each node, as we did in https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-41110920. It seems that many people eventually go this route. The manner in which dnsmasq is implemented on the node is unimportant. We did it as a service on the node, to avoid as much NAT as possible, but a daemonset with hostNetwork: true would work.
The traffic-shaping workaround and the fully-random workaround would of course be specific to the networking plugins, so they should implement them. From a kops view, it might be nice to track which ones have the fix, for the benefit of those choosing a network plugin.
This DNS issue has been, by far, the most frustrating part of my Kubernetes experience. It's the best-kept dirty little k8s secret. You typically don't see it until you get ready to go live and start doing load testing.
I hear your frustration, and I understand it's a pity that this is not well documented. It was indeed a surprise for me as well, and it doesn't look like there is enough material on the right way to approach this problem.
Are you sure that Calico would do the trick here? I tried to run the test that was shared in the thread on kubernetes/kubernetes, and I have problems with Calico as well.
@Raffo I'm no expert, but I'm reasonably sure. What I'm sure of is that the DNS issues are ultimately due to SNAT and DNAT race conditions in the kernel's netfilter code. Given that, I can also be pretty sure that any networking layer that doesn't use SNAT or DNAT won't experience the problem. I think most of the other choices end up allocating a public IP on the host network adapter for the pod, and then doing BGP or other routing solutions to get the traffic around. In that case, there should be no NAT, and thus no DNS issues.
The problem is that 'my DNS times out' is kind of like 'my car won't run'. There are tons of different causes, because packet loss is inevitable no matter what you do.
That's why I think dnsmasq on the node is the 'right' solution. No matter what you do, you'll always have packet loss/latency with a UDP protocol going over the wire. In a k8s cluster, 99.9% (literally) of all lookups are repeat lookups, so simply using a node-local cache is the right answer.
Sounds reasonable and it should be possible to do in kops.
@Raffo sounds good! In case it is helpful, here's the most important source code for what we did (we use our own CentOS-based images):
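# Create a loopback alias with a fixed address so every node exposes the same DNS IP;
# 198.18.0.1 is taken from 198.18.0.0/15, an IANA-reserved benchmarking range.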
cat <<EOF > /etc/sysconfig/network-scripts/ifcfg-lo:0
DEVICE=lo:0
BOOTPROTO=static
IPADDR=198.18.0.1
NETMASK=255.255.255.255
ONBOOT=yes
EOF
ifup lo:0
#then setup dnsmasq to listen on that new subinterface
yum install -y dnsmasq
systemctl enable dnsmasq
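# dnsmasq config: cache locally, forward cluster.local and the reverse zones to the
# kube-dns service IP (100.64.0.10), and let everything else fall through to the
# upstream resolvers dnsmasq picks up from /etc/resolv.conf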
cat <<EOF > /etc/dnsmasq.d/colinx
cache-size=1000
log-queries
dns-forward-max=1500
all-servers
neg-ttl=30
listen-address=198.18.0.1
server=/cluster.local/100.64.0.10#53
server=/in-addr.arpa/100.64.0.10#53
server=/ip6.arpa/100.64.0.10#53
EOF
systemctl start dnsmasq
Coupled with kubelet's --cluster-dns=198.18.0.1,100.64.0.10, we get the behavior we want.
Thanks for adding this. The ideal way for kops would be not having to touch the AMI, as users should IMO be able to run their own AMI (or one of the standard ones) without experiencing this issue.
@Raffo yes, that makes perfect sense. It should be no problem for kops to add a daemonset with hostNetwork=true, and then somehow inject the host IP or something into --cluster-dns.
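For the daemonset half (not the --cluster-dns half, which is the hard part discussed above), a rough sketch of how a pod could discover the node IP is the Kubernetes downward API; the container and image names below are illustrative only, not an actual kops manifest:

spec:
  hostNetwork: true
  containers:
  - name: node-local-dns          # hypothetical container name
    image: example.org/dnsmasq    # placeholder image
    env:
    - name: NODE_IP               # illustrative variable name
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP   # exposes the node's IP to the container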
As a side note, I feel like I need to again plug #387. We find it very difficult and annoying that there is no way to hook scripts into nodeup -- either to run our own steps, or to even know when nodeup has finished. On AWS, the timing for nodeup to run is wildly variable, and any automation around local images is nearly impossible because we cannot guess when most of the kops scripts and config files will be ready.
The creation of a local IP address is somewhat odd, and is a direct side effect of the lack of a way to inject code to [for example] compute the Weave IP / host IP for inclusion in the --cluster-dns manifest, or to modify the manifest after the fact.
Thank you for investigating this!
Hey, @Raffo
OK, so we tried this, and I'm disappointed to say kops is really making this hard for us.
First, we tried setting clusterDNS in the manifest to 198.18.0.1,100.64.0.10. This comma-delimited format is accepted by kubelet, but kops returns a validation error rejecting it.
Unfortunate -- we have to rely on 198.18.0.1 alone working. OK, we think -- but wait, there's more.
If we attempt to set clusterDns to 198.18.0.1, we get this error from kops:
error populating cluster spec: Completed cluster failed validation: Spec.kubeDNS.serverIP: Invalid value: "100.64.0.10": Kubelet ClusterDNS did not match cluster kubeDNS.serverIP
This error means our strategy will not work at all. Why have a field whose only valid value is the same as another field, I wonder?
To be clear, when we manually update /etc/sysconfig/kubelet, and put the value we want in --cluster-dns, things work as expected.
We are assuming that if we change Spec.kubeDNS.serverIP to 198.18.0.1, it will re-wire the CoreDNS manifests to attempt to listen on that interface, which is not what we want.
So at this point this is purely a fight to get kops to configure the manifests the way we'd like...
Well, kops is a community project; if this is creating major issues, we have to address it. I'd propose to discuss the issue at the next kops office hours and see where we can go from there.
@Raffo that's all I can ask! Would you like me to attend the office hours? If so, I'll try to do that. If not, I'm very comfortable that you can present the situation.
At the moment I'm about to clone the repo and see if I can just disable this check -- if that works for us, and if it's acceptable, I'll make a PR.
@Raffo for now, we just manually edited cluster.spec in the S3 bucket. This works -- only the kops update command cares about the format, but this is valid and works in cluster.spec:
kubelet:
  allowPrivileged: true
  cgroupRoot: /
  cloudProvider: aws
  clusterDNS: 198.18.0.1,100.64.0.10
with 'works' defined as 'the value gets into /etc/sysconfig/kubelet and my DNS timeouts stop'.
Thanks again for entertaining our plight!
Very interesting! We're facing the same DNS timeout issue.
We tried Weave-tc by @Quentin-M and it helps, but like @dcowden I think that having dnsmasq provisioned by kops would be amazing while we wait for the netfilter fix.
I've tried Weave, Calico, and amazon-vpc-cni-k8s 1.0.0, and the issue appears with all of them...
@dcowden the office hours info is here: https://github.com/kubernetes/kops#office-hours (the next meeting is next Friday). From my experience, the meeting is the right place to discuss these issues. I will try to attend and mention it; if you can join as well, that'd be great :)
@Raffo thanks, I've added this to my calendar and I'll try to attend. Right now, we're in a difficult situation: to get what we need done, we've had to abandon using the kops frontend and edit the spec in the S3 bucket directly.
Sorry I missed this discussion, and the bigger issue. I'm looking into this - I didn't fully grok that we had such a clean repro, so I'm going to try to repro and figure out what we can do.
@justinsb thanks for jumping in!
In this case there are two potential solution families:
(1) Relax validation to tolerate DNS configurations that are different from those currently contemplated (i.e., become less opinionated). In this case, clusterDNS should be opened up substantially from a validation viewpoint.
(2) Deploy dnsmasq on the nodes, as we have done, as the 'right way' (i.e., stay opinionated, with the idea that it's more important to make it 'just work' than to be flexible). In this case, clusterDNS should be removed.
Right now, our view is that the only 'right' way to deploy k8s when you are using CNI is to run dnsmasq on the node. It's also our view that there should not be a configuration whose only valid value must match another existing configuration.
With the tool I was able to see the issue - thank you! But my cause wasn't insert_failed; it seemed to be "random" UDP drops. I also used iperf3 to validate that UDP drops on AWS are not uncommon (as high as 1 in 1000 packets, though drops seemed very bursty).
So I think that the best option is going to be a local proxy, but it's not clear (to me) exactly what we should deliver. In the meantime, I created #5610, which adds a feature flag that turns off validation, so if you have a static value kops won't replace it or complain that it doesn't match the kube-dns service.
Hopefully in 1.11 we can set up a real agent - my guess is that a daemonset is going to be the easiest option, but hopefully the feature flag will let us experiment!
@justinsb for what it's worth, it appears everyone who runs Kubernetes runs into this DNS thing. I guess that, relative to production use, we're earlier adopters than I thought!
kubespray also supports adding a dnsmasq daemonset.
There are definitely configurations that don't hit it as much. GCE seems much more reliable for UDP packets than AWS, which likely explains some of this. It does seem to happen more with some CNI providers as well, but we don't yet have a root cause.
I definitely agree that going with a local mode feels like the most practical workaround - there will always be _some_ UDP packet drops. I just don't think we can do much more than #5610 for kops 1.10, but hopefully that is sufficient for experimentation and comparison of the various approaches - and unblocks anyone that wants to run their own configuration.
PS thanks for the kubespray link. I don't see the daemonset (https://github.com/kubernetes-incubator/kubespray/blob/master/roles/dnsmasq/templates/dnsmasq-deploy.yml#L3), but it's definitely documented as doing that, so there must be some magic I'm missing! But it's a good place to start anyway :-)
Hi @justinsb,
Thanks for #5610; that's all we need in 1.10. We have a base image for other reasons, and dnsmasq is super easy to set up.
Thanks for the patch @justinsb. Zalando, as far as I know, is running the DaemonSet https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/kube-dns/node-local-daemonset.yaml /cc @szuecs for more details.
Hi @dcowden, @justinsb,
Is there a way to set up the loopback network interface for dnsmasq via the kops spec?
@rekcah78 I'm not sure I understand your question. The main point of this thread is that it was impossible to create a valid kops spec to achieve the goal. However, combined with the new ExperimentalClusterDNS flag, coming soon, the kops spec will look something like this:
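# A sketch based on the values used earlier in this thread: 198.18.0.1 is the
# node-local dnsmasq address, 100.64.0.10 the kube-dns service IP.
kubelet:
  clusterDNS: 198.18.0.1,100.64.0.10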
This does not run dnsmasq -- it just configures the cluster to use it. We set up dnsmasq manually on the underlying node; the setup is above.
You're right @dcowden: with my own image, KOPS_FEATURE_FLAGS="+ExperimentalClusterDNS", and spec.kubelet.clusterDNS: 198.18.0.1,100.64.0.10, it all works fine.
Excellent! When will +ExperimentalClusterDNS be available in a kops release?
I believe it should be a goal to find a solution that does not require baking a custom image, as a lot of users rely on the default kops images or Ubuntu/Debian.
@dcowden it's available in kops 1.10.
@Raffo @justinsb Score! Thanks for that, I hadn't seen the release! We'll give it a try very soon. We have a planning meeting for a cluster upgrade next week -- this is welcome news.
Side note, I agree that the ideal solution for kops is to avoid a custom image-- for most users this is better. I'd request that our approach still work, however.
Just FYI, the folks at Weaveworks just released a blog post with a good write-up of the problem we are seeing here: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts /cc @justinsb
@rekcah78 could you be more specific about how you build your own AMI?
Someone pointed out that
NetRange: 198.18.0.0 - 198.19.255.255
CIDR: 198.18.0.0/15
NetName: SPECIAL-IPV4-BENCHMARK-TESTING-IANA-RESERVED
NetHandle: NET-198-18-0-0-1
Parent: NET198 (NET-198-0-0-0-0)
NetType: IANA Special Use
Maybe you'll never need to connect to anything in SPECIAL-IPV4-BENCHMARK-TESTING-IANA-RESERVED, but it's worth being aware of nevertheless.
This issue is affecting us too. Rather than mess with the KOPS image, as a short-term hack we're putting a DNS cache in-process for a lot of our apps. We don't want to make custom KOPS images if possible.
What's a little frustrating about this is that any high-traffic service platform deployed on AWS will suffer fairly significantly from this issue, because AWS uses CNAMEs for almost everything.
This will let you use the node IP as cluster-dns in kops on AWS.
You should be able to substitute the curl command with whatever command prints the IP of, for example, an interface.
export KOPS_FEATURE_FLAGS=+ExperimentalClusterDNS
hooks:
- before:
  - kubelet.service
  manifest: |
    [Unit]
    Description=Set PRIVATE_EC2_IPV4 cluster-dns
    [Service]
    ExecStart=/bin/bash -c "sed -i 's/PRIVATE_EC2_IPV4/'$(/usr/bin/curl --silent --fail http://169.254.169.254/latest/meta-data/local-ipv4)'/g' /etc/sysconfig/kubelet; /bin/systemctl daemon-reload; /bin/systemctl restart kubelet"
    RemainAfterExit=yes
  name: private-ipv4
kubelet:
  clusterDNS: PRIVATE_EC2_IPV4
You can find an example of a dnsmasq with Prometheus metrics here:
https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/kube-dns/node-local-daemonset.yaml
Remember to edit the daemonset and change it to point to your internal server.
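The parts to change are the dnsmasq forwarder args in the manifest below; the addresses are just this cluster's values (10.0.0.2 is presumably the VPC resolver, 100.64.0.10 the kube-dns service IP):

  - --server=10.0.0.2#53                     # upstream resolver for this cluster
  - --server=/cluster.local/100.64.0.10#53   # kube-dns service IP for this cluster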
For reference, here are my node-local-dns manifests:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dnsmasq-node
  namespace: kube-system
  labels:
    application: dnsmasq-node
spec:
  selector:
    matchLabels:
      application: dnsmasq-node
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        application: dnsmasq-node
    spec:
      priorityClassName: system-node-critical
      serviceAccountName: system
      tolerations:
      - operator: Exists
        effect: NoSchedule
      - operator: Exists
        effect: NoExecute
      volumes:
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      containers:
      - name: dnsmasq
        image: registry.opensource.zalan.do/teapot/k8s-dns-dnsmasq-nanny-amd64:1.14.10
        securityContext:
          privileged: true
        livenessProbe:
          httpGet:
            path: /healthcheck/dnsmasq
            port: 9054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=10000
        # Set low dnsmasq --neg-ttl instead of --no-negcache https://github.com/kubernetes/dns/issues/239
        - --neg-ttl=10
        - --server=10.0.0.2#53
        - --server=8.8.8.8#53
        - --server=8.8.4.4#53
        - --server=/cluster.local/100.64.0.10#53
        - --server=/in-addr.arpa/100.64.0.10#53
        - --server=/ip6.arpa/100.64.0.10#53
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        # see: https://github.com/kubernetes/kubernetes/issues/29055 for details
        resources:
          requests:
            cpu: 100m
            memory: 25Mi
        volumeMounts:
        - name: kube-dns-config
          mountPath: /etc/k8s/dns/dnsmasq-nanny
      - name: sidecar
        image: registry.opensource.zalan.do/teapot/k8s-dns-sidecar-amd64:1.14.10
        securityContext:
          privileged: true
        livenessProbe:
          httpGet:
            path: /metrics
            port: 9054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --v=2
        - --logtostderr
        - --probe=dnsmasq,127.0.0.1:53,ec2.amazonaws.com,5,A
        - --prometheus-port=9054
        ports:
        - containerPort: 9054
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: 15Mi
      hostNetwork: true
      dnsPolicy: Default
      automountServiceAccountToken: false
---
kind: Service
apiVersion: v1
metadata:
  name: kube-dns-local
  namespace: kube-system
  labels:
    application: dnsmasq-node
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "KubeDNS"
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "9054"
    prometheus.io/scrape: "true"
spec:
  selector:
    application: dnsmasq-node
  type: ClusterIP
  ports:
  - name: monitor
    port: 9054
    targetPort: 9054
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: dnsmasq-node
  name: kube-dns-local
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: monitor
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      application: dnsmasq-node
Remember to think about https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html and the musl libc caveats.
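Since ndots is the other common amplifier of DNS traffic, here is a hedged sketch (a standard pod-spec field, values illustrative) of lowering it per pod via dnsConfig:

spec:
  dnsConfig:
    options:
    - name: ndots
      value: "1"   # default is 5; lower values avoid trying every search domain first

With the default ndots:5, an external name like example.com is tried against every search domain before the absolute query, which multiplies UDP round trips; lowering ndots or using fully-qualified names (with a trailing dot) avoids that.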
Happy haxxing! :)
Could kops potentially leverage this? https://github.com/kubernetes/kubernetes/pull/70555
Is it possible to use the nodelocaldns addon somehow?
I am going to test the nodelocaldns addon in a few weeks, just to see if there are any significant performance changes with it, though not on AWS, just on an on-premises cluster built by kubespray.
@jcperezamin in our case we're probably hitting conntrack limits, and after 65k fast DNS queries we start losing requests, as explained here: https://www.youtube.com/watch?v=Il-yzqBrUdo&feature=youtu.be&t=553
Update: how do you plan to test it? I'm trying to see how I can install it.
I am going to use kubespray; I tested this fix and it is working out:
https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-469294516. nodelocaldns is part of the kubespray deployment (https://github.com/kubernetes-sigs/kubespray/blob/master/inventory/sample/group_vars/k8s-cluster/k8s-cluster.yml); just enable it:
enable_nodelocaldns: true
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@cdobbyn: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@igoratencompass: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.