Kops version: 1.8.0 (git-5099bc5)
Kubernetes version: v1.8.6 (6260bb08c46c31eea6cb538b34a9ceb3e406689c)
Cloud: AWS
When I create a brand new cluster, everything appears to be working fine: all the masters and workers are ready and I can deploy application pods. But most of the time the pods cannot resolve public DNS names like www.google.com, or even internal names like myservice.default. When I run ping www.google.com, either the command takes a long time (over 10 seconds) and eventually reports that the name could not be resolved, or it takes a long time and eventually starts pinging Google. It's as if kube-dns is failing most of the time, but not always.
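A quick way to reproduce the symptom from inside the cluster (a sketch; the busybox image and the test pod name are just examples, not part of this cluster):
# Run a throwaway pod and try an external and an internal lookup
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup www.google.com
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup myservice.default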
How I created the cluster:
kops create cluster \
--api-loadbalancer-type internal \
--associate-public-ip=false \
--cloud=aws \
--dns private \
--image "595879546273/CoreOS-stable-1632.2.1-hvm" \
--master-count 3 \
--master-size t2.small \
--master-zones "us-east-1b,us-east-1c,us-east-1d" \
--name=stg-us-east-1.k8s.local \
--network-cidr 10.0.64.0/22 \
--networking flannel \
--node-count 5 \
--node-size t2.small \
--out . \
--output json \
--ssh-public-key ~/.ssh/mykey.pub \
--state s3://mybucket \
--target=terraform \
--topology private \
--vpc vpc-3153eb2e \
--zones "us-east-1b,us-east-1c,us-east-1d"
Modified subnets (kops edit cluster) as per https://github.com/kubernetes/kops/blob/master/docs/run_in_existing_vpc.md
Updated cluster config (kops update cluster) and deployed everything (terraform apply).
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  name: stg-us-east-1.k8s.local
spec:
  api:
    loadBalancer:
      type: Internal
  authorization:
    alwaysAllow: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://mybucket/stg-us-east-1.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1d
      name: d
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.6
  masterInternalName: api.internal.stg-us-east-1.k8s.local
  masterPublicName: api.stg-us-east-1.k8s.local
  networkCIDR: 10.0.64.0/22
  networkID: vpc-73cfbb0a
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.0.65.0/24
    egress: nat-012ee02a09a7830d2
    id: subnet-de86e6f2
    name: us-east-1b
    type: Private
    zone: us-east-1b
  - cidr: 10.0.66.0/24
    egress: nat-012ee02a09a7830d2
    id: subnet-5fb5ef17
    name: us-east-1c
    type: Private
    zone: us-east-1c
  - cidr: 10.0.67.0/24
    egress: nat-012ee02a09a7830d2
    id: subnet-b13da5eb
    name: us-east-1d
    type: Private
    zone: us-east-1d
  - cidr: 10.0.64.32/27
    id: subnet-d68bebfa
    name: utility-us-east-1b
    type: Utility
    zone: us-east-1b
  - cidr: 10.0.64.96/27
    id: subnet-cbb0ea83
    name: utility-us-east-1c
    type: Utility
    zone: us-east-1c
  - cidr: 10.0.64.160/27
    id: subnet-f23ea6a8
    name: utility-us-east-1d
    type: Utility
    zone: us-east-1d
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: master-us-east-1b
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1b
  role: Master
  subnets:
  - us-east-1b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: master-us-east-1c
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1c
  role: Master
  subnets:
  - us-east-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: master-us-east-1d
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1d
  role: Master
  subnets:
  - us-east-1d
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-02-05T16:07:43Z
  labels:
    kops.k8s.io/cluster: stg-us-east-1.k8s.local
  name: nodes
spec:
  associatePublicIp: false
  image: 595879546273/CoreOS-stable-1632.2.1-hvm
  machineType: t2.small
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-1b
  - us-east-1c
  - us-east-1d
Contents of resolv.conf on a node:
# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known DNS servers.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.
nameserver 10.0.64.2
search ec2.internal
Contents of resolv.conf on a system pod:
nameserver 10.0.64.2
search ec2.internal
Contents of resolv.conf on an application pod:
nameserver 100.64.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
Things I have tried:
Adding options single-request-reopen to resolv.conf on the application pods, as discussed in https://github.com/kubernetes/kubernetes/issues/56903, but this made no difference.
Removing options ndots:5 from the application pod resolv.conf, as described in other places, but this made no difference.
After further investigation, I have five workers and only two of those worker nodes have the DNS problem in any application pod they run. The two nodes that have this problem in their pods are the two nodes that are running the kube-dns pods.
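One way to confirm that correlation is to compare where kube-dns is scheduled with where the affected application pods run (a sketch; it assumes the standard k8s-app=kube-dns label on the kube-dns pods):
# Which nodes run kube-dns?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
# Which nodes run the affected application pods?
kubectl get pods -o wide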
Some logs:
kubedns log (same on both kube-dns pods):
I0205 17:16:00.273097 1 dns.go:48] version: 1.14.4-2-g5584e04
I0205 17:16:00.280277 1 server.go:70] Using configuration read from directory: /kube-dns-config with period 10s
I0205 17:16:00.280336 1 server.go:113] FLAG: --alsologtostderr="false"
I0205 17:16:00.280346 1 server.go:113] FLAG: --config-dir="/kube-dns-config"
I0205 17:16:00.280353 1 server.go:113] FLAG: --config-map=""
I0205 17:16:00.280358 1 server.go:113] FLAG: --config-map-namespace="kube-system"
I0205 17:16:00.280363 1 server.go:113] FLAG: --config-period="10s"
I0205 17:16:00.280369 1 server.go:113] FLAG: --dns-bind-address="0.0.0.0"
I0205 17:16:00.280374 1 server.go:113] FLAG: --dns-port="10053"
I0205 17:16:00.280380 1 server.go:113] FLAG: --domain="cluster.local."
I0205 17:16:00.280387 1 server.go:113] FLAG: --federations=""
I0205 17:16:00.280394 1 server.go:113] FLAG: --healthz-port="8081"
I0205 17:16:00.280398 1 server.go:113] FLAG: --initial-sync-timeout="1m0s"
I0205 17:16:00.280403 1 server.go:113] FLAG: --kube-master-url=""
I0205 17:16:00.280409 1 server.go:113] FLAG: --kubecfg-file=""
I0205 17:16:00.280413 1 server.go:113] FLAG: --log-backtrace-at=":0"
I0205 17:16:00.280421 1 server.go:113] FLAG: --log-dir=""
I0205 17:16:00.280426 1 server.go:113] FLAG: --log-flush-frequency="5s"
I0205 17:16:00.280431 1 server.go:113] FLAG: --logtostderr="true"
I0205 17:16:00.280436 1 server.go:113] FLAG: --nameservers=""
I0205 17:16:00.280440 1 server.go:113] FLAG: --stderrthreshold="2"
I0205 17:16:00.280444 1 server.go:113] FLAG: --v="2"
I0205 17:16:00.280449 1 server.go:113] FLAG: --version="false"
I0205 17:16:00.280457 1 server.go:113] FLAG: --vmodule=""
I0205 17:16:00.291904 1 server.go:176] Starting SkyDNS server (0.0.0.0:10053)
I0205 17:16:00.299519 1 server.go:198] Skydns metrics enabled (/metrics:10055)
I0205 17:16:00.299529 1 dns.go:147] Starting endpointsController
I0205 17:16:00.299532 1 dns.go:150] Starting serviceController
I0205 17:16:00.300109 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0205 17:16:00.300120 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0205 17:16:00.799732 1 dns.go:171] Initialized services and endpoints from apiserver
I0205 17:16:00.799852 1 server.go:129] Setting up Healthz Handler (/readiness)
I0205 17:16:00.799874 1 server.go:134] Setting up cache handler (/cache)
I0205 17:16:00.799894 1 server.go:120] Status HTTP port 8081
dnsmasq logs (same on both kube-dns pods):
I0205 17:16:00.182836 1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0205 17:16:00.186789 1 nanny.go:86] Starting dnsmasq [-k --cache-size=1000 --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053]
I0205 17:16:01.053760 1 nanny.go:111]
W0205 17:16:01.053864 1 nanny.go:112] Got EOF from stdout
I0205 17:16:01.054055 1 nanny.go:108] dnsmasq[8]: started, version 2.78-security-prerelease cachesize 1000
I0205 17:16:01.054137 1 nanny.go:108] dnsmasq[8]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0205 17:16:01.054171 1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0205 17:16:01.054206 1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0205 17:16:01.054233 1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0205 17:16:01.054318 1 nanny.go:108] dnsmasq[8]: reading /etc/resolv.conf
I0205 17:16:01.054351 1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0205 17:16:01.054378 1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0205 17:16:01.054445 1 nanny.go:108] dnsmasq[8]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0205 17:16:01.054476 1 nanny.go:108] dnsmasq[8]: using nameserver 10.0.64.2#53
I0205 17:16:01.054538 1 nanny.go:108] dnsmasq[8]: read /etc/hosts - 7 addresses
sidecar log (same on both kube-dns pods):
ERROR: logging before flag.Parse: I0205 17:16:01.163076 1 main.go:48] Version v1.14.4-2-g5584e04
ERROR: logging before flag.Parse: I0205 17:16:01.163230 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
ERROR: logging before flag.Parse: I0205 17:16:01.163269 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
ERROR: logging before flag.Parse: I0205 17:16:01.163315 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
I rebuilt this cluster without specifying the image to use, so the cluster was built with Debian Jessie instead of CoreOS (k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-01-14 (ami-8ec0e1f4) instead of 595879546273/CoreOS-stable-1632.2.1-hvm (ami-a53335df)) and this problem is solved :thinking:
So it seems like it's an issue in CoreOS or the provisioning code for CoreOS.
@joelittlejohn :
I think this has to do with kubernetes/kubernetes#21613
Fix that appears to work for us is to run sudo modprobe br_netfilter on all cluster nodes.
This affected our clusters using CoreOS AMI also!
Symptoms are that DNS responses come from an unexpected source IP. When a DNS lookup is made, the client sends a packet to the kube-dns _Service IP_. It then receives the response from the _Pod IP_, which is dropped because the sender doesn't know it's talking to the pod... just the _Service IP_.
Symptoms are (when doing lookup with dig against _Service IP_):
# Symptoms: dig svc-name.svc.cluster.local returns "reply from unexpected source" error such as:
root@example-pod-64598c547d-z9vb4:/# dig @100.64.0.12 svc-name.svc.cluster.local
;; reply from unexpected source: 100.96.3.3#53, expected 100.64.0.12#53
;; reply from unexpected source: 100.96.3.3#53, expected 100.64.0.12#53
;; reply from unexpected source: 100.96.3.3#53, expected 100.64.0.12#53
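A quick check on an affected node to confirm this is the bridge-netfilter problem (a sketch; the sysctl file only exists once the module is loaded):
lsmod | grep br_netfilter                          # empty output means the module is not loaded
cat /proc/sys/net/bridge/bridge-nf-call-iptables   # should print 1; the file is absent if br_netfilter isn't loaded
sudo modprobe br_netfilter                         # the immediate workaround described above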
Not sure where in the kops provisioning process this should go... but it can probably be solved by dropping a file into /etc/modules-load.d/ to load this kernel module:
echo br_netfilter > /etc/modules-load.d/br_netfilter.conf
If using cloud-init or Ignition, there are equivalent options.
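For example, on Container Linux a cloud-config fragment along these lines should persist the module across reboots (a sketch, not something kops generates today):
#cloud-config
write_files:
  - path: /etc/modules-load.d/br_netfilter.conf
    permissions: "0644"
    content: br_netfilter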
@trinitronx Thanks, loading this module completely fixed the problem! I had found a bunch of potential solutions in the Kubernetes issues list, but not this one :joy:
So it looks like kops should add echo br_netfilter > /etc/modules-load.d/br_netfilter.conf (or something else to load this module) as part of provisioning a CoreOS cluster, because right now the CoreOS clusters that kops creates are broken :thinking:
I fixed this in my own kops cluster by editing the cluster:
kops edit cluster stg-us-east-1.k8s.local --state s3://mybucket
and adding a hook:
- manifest: |
    Type=oneshot
    ExecStart=/usr/sbin/modprobe br_netfilter
  name: fix-dns.service
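After editing, the change still has to be applied and the nodes rolled; with the same state store as above that is roughly the following (adjust for the terraform workflow used earlier):
kops update cluster stg-us-east-1.k8s.local --state s3://mybucket --yes
kops rolling-update cluster stg-us-east-1.k8s.local --state s3://mybucket --yes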
/assign @KashifSaadat @gambol99
calling the CoreOS gurus :)
Yeah .. we've hit this one before, it's an old bug (albeit not in kops but in a previous installer we used, fixed with the same hack as above) ... The br_netfilter module needs to be enabled so that iptables forces all packets, even those traversing the bridge, to go through the pre-routing tables .. I'm surprised kube-proxy doesn't try to modprobe this itself, much like here.
We seem to have this enabled already, without explicitly doing a modprobe hack .. but we are using canal, so perhaps either the version of flannel or Calico is doing it for us. Let me do a quick test with CoreOS-stable-1632.2.1-hvm to rule out the OS version.
core@ip-10-250-29-239 /etc/cni $ sudo lsmod | grep br_netfilter
br_netfilter 24576 0
bridge 151552 1 br_netfilter
I noticed in the logs of CoreOS-stable-1632.2.1-hvm
Feb 08 11:26:15 ip-10-250-101-49.eu-west-2.compute.internal kernel: bridge:
filtering via arp/ip/ip6tables is no longer available by default. Update your scripts
to load br_netfilter if you need this
It might be worth raising on the flannel repo to get an official response ..
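For reference, once br_netfilter is loaded it is these sysctls that send bridged traffic through iptables; they default to 1 when the module loads, but they can also be pinned explicitly (a sketch):
sudo sysctl net.bridge.bridge-nf-call-iptables=1
sudo sysctl net.bridge.bridge-nf-call-ip6tables=1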
Like @joelittlejohn, we were able to fix this on cluster create by adding the following hook via kops edit cluster:
Under spec:
hooks:
- name: fix-dns.service
  roles:
  - Node
  - Master
  before:
  - network-pre.target
  - kubelet.service
  manifest: |
    Type=oneshot
    ExecStart=/usr/sbin/modprobe br_netfilter
    [Unit]
    Wants=network-pre.target
    [Install]
    WantedBy=multi-user.target
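The unit that ends up on each node should look roughly like this (an assumption about how kops assembles the hook; the exact rendering may differ):
[Unit]
Wants=network-pre.target
Before=network-pre.target kubelet.service
[Service]
Type=oneshot
ExecStart=/usr/sbin/modprobe br_netfilter
[Install]
WantedBy=multi-user.target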
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Does anyone who has commented here know whether this problem would still affect newly built clusters using all the latest versions (kops 1.9, CoreOS, Kubernetes 1.9, flannel)?
I'm loath to just close this because it seems like such a massive bug: "DNS broken on brand new cluster". It's not yet clear to me whether this should be fixed in flannel, in Kubernetes, or in kops.
Still seeing the issue, but with Calico instead of flannel:
kops 1.9
kubernetes 1.9.7
CoreOS -> https://coreos.com/dist/aws/aws-stable.json for my region
Calico
Not able to resolve internal DNS entries. I'll try the workaround described above.
I can confirm that on a cluster built with Kops 1.9.0, running CoreOS 1745.4.0 (Stable), br_netfilter is not loaded on boot.
Encountering this issue as well. Tried the 'hook' solution [ https://github.com/kubernetes/kops/issues/4391#issuecomment-364321275 ] though it didn't work.
Verified by deploying a radial/busyboxplus pod onto the node running kube-dns. Pinged a pod on the same node: no response. Pinged a pod on a different node: a response was received. Both pods were pinged by their respective service names, i.e. service-name.namespace.
Also tested the same resources on minikube where everything is hosted on a single node. I had no issues there.
Versioning
Cloud-provider: gce
Networking: kubenet
Kernel Version: 4.4.64+
OS Image: cos-cloud/cos-stable-60-9592-90-0
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.8.7
Kube-Proxy Version: v1.8.7
Operating system: linux
Architecture: amd64
re br_netfilter
modprobe -r br_netfilter
So I did add force-loading of the module in https://github.com/kubernetes/kops/pull/5490. Hopefully that will help with the CoreOS issue.
@RobertDiebels I'm not sure ping works very well anyway between pods. I did try with COS on GCE and wasn't able to reproduce a problem doing curl -v -k https://kubernetes, but I only realized now that you were pinging so I'll have to try that case. But I wouldn't recommend using ping as a health-check anyway.
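A curl-based check in that spirit might look like this (a sketch; the image is the one mentioned earlier in the thread and the service name is just an example):
kubectl run -it --rm curl-test --image=radial/busyboxplus:curl --restart=Never -- curl -v -k https://kubernetes.default.svc.cluster.local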
@justinsb Thanks for the tip. I'll try the same using curl. Will report back here today.
EDIT: I just ran my code again and everything seems to be working. This time I waited 10 minutes before doing anything; before, I waited approx. 5 minutes. So I believe my issue was due to the time it takes to initialize. Disregard my earlier comment.
EDIT2: It appears it wasn't due to the initialization time. It was due to me running 2 clusters in the same gce project. Opening a new ticket for that issue.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
We were experiencing a different symptom, but probably due to the same underlying problem reported in this issue (another related one), and we could fix it by following exactly what @trinitronx shared. All the details can be found here. It would be nice if kops could take care of this automatically, since it took us a large amount of time and effort to figure out what the problem was (and meanwhile, our users were affected by plenty of timeouts).
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Same issue on AWS, Kops v1.10.0. I have tried creating K8s clusters with both Debian Stretch (not the standard Jessie) and Amazon Linux. Both have multiple pods failing due to DNS timeouts. In fact, even one of the DNS pods is failing, though the other is fine:
kube-system kube-dns-5fbcb4d67b-kfccn 0/3 CrashLoopBackOff 53 1h
Errors:
I1214 05:23:24.685877 1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
F1214 05:23:25.185865 1 dns.go:209] Timeout waiting for initialization
I've tried the 'hook' solution without success.
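When kube-dns crash-loops like this, it can help to check which node the failing pod is on compared to the healthy one, and to pull the previous container's logs (a sketch using the pod name from the output above):
kubectl -n kube-system get pods -o wide | grep kube-dns
kubectl -n kube-system logs kube-dns-5fbcb4d67b-kfccn -c kubedns --previous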
Just tried this again after upgrading to Kops beta v1.11 and rebuilding a clean cluster. My cluster is now working without any DNS issues.
curl -Lo kops https://github.com/kubernetes/kops/releases/download/1.11.0-beta.1/kops-linux-amd64
chmod +x ./kops
sudo mv ./kops /usr/local/bin/
I have been unable to get DNS working in a new cluster using Kops 1.11 beta, stable channel. I've tried kube-dns and core-dns with weave as the overlay. No luck. I've tried the hook fix. No luck.
@MCLDG What AMI, DNS provider, and overlay are you using?
Hook fix did not work because...um, the module was already loaded, so that makes sense. So maybe I have a different issue. Been unable to get DNS working reliably. Sometimes it will work after a long delay. Then fail on the same lookup. Sometimes it times out. Sometimes if I kill one of the DNS servers (take replicas down to 1) - things work perfect! Then when I try to reproduce in a new cluster, taking down to one pod does NOT work. Super frustrating not being able to reproduce the issue reliably (or reproduce a fix reliably).
@michaelajr, my create cluster command looks as follows. Default AMI, built-in K8s DNS and AWS VPC networking.
kops create cluster \
--node-count 2 \
--zones ap-southeast-1a,ap-southeast-1b,ap-southeast-1c \
--master-zones ap-southeast-1a,ap-southeast-1b,ap-southeast-1c \
--node-size m5.large \
--master-size t2.medium \
--topology private \
--networking amazon-vpc-routed-eni \
${NAME}
Some of my DNS calls would succeed. The pattern that seemed to work was where pod A called pod B and both were on the same worker node. If the pods were on different nodes the call would fail, though this wasn't consistent.
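One way to pin that pattern down is to note which node each pod landed on and repeat the lookup from a pod on the same node and a pod on a different node (a sketch; the pod and service names are placeholders):
kubectl get pods -o wide                                      # note the NODE column
kubectl exec -it <pod-on-same-node> -- nslookup <service-name>.<namespace>
kubectl exec -it <pod-on-other-node> -- nslookup <service-name>.<namespace>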
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.