K3s: coredns errors on 1.19.5 (coredns 1.7.1 + cache + DNSSEC)

Created on 13 Dec 2020 · 15 comments · Source: k3s-io/k3s

Environmental Info:
K3s Version: k3s version v1.19.5+k3s1 (b11612e2)

Node(s) CPU architecture, OS, and Version: Linux saturn 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux

Cluster Configuration: 1 master, 8 workers

Describe the bug:
Since upgrading to 1.19.5, DNS has become reliably flaky. I run a GitHub runner inside my cluster, and it very consistently fails to look up deb.debian.org. Not every single query fails, but failures are frequent enough that I can't get a build to finish. When I reverted coredns back to 1.6.9, things worked again. I upgraded to 1.19.5 from 1.19.4-k3s2.

Steps To Reproduce:

  • Installed K3s: k3s was installed like this:

curl -sfL https://get.k3s.io | sh -s - \
--write-kubeconfig-mode=640 \
--advertise-address=10.x.x.x \
--cluster-cidr=10.y.0.0/16 \
--service-cidr=10.z.0.0/16 \
--cluster-dns=10.z.0.2 \
--flannel-backend=host-gw \
--disable=servicelb \
--disable=traefik

And then upgraded by pulling the binaries, copying them into place, and restarting k3s on the master first and then on all of the workers.
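The manual upgrade flow described above can be sketched roughly as follows. The release asset URL pattern and the systemd unit names (`k3s` on the server, `k3s-agent` on workers) are assumptions based on how k3s releases are normally packaged, not details from this issue:

```shell
#!/bin/sh
# Sketch of a manual k3s binary upgrade (assumptions: GitHub release asset
# layout, and systemd units named "k3s" / "k3s-agent").
set -eu

# Build the release download URL; the "+" in the version tag must be
# URL-encoded as "%2B" for GitHub release assets.
k3s_release_url() {
  printf 'https://github.com/k3s-io/k3s/releases/download/%s/k3s\n' \
    "$(printf '%s' "$1" | sed 's/+/%2B/')"
}

# On each node (server first, then workers), something like:
#   curl -fLo /tmp/k3s "$(k3s_release_url v1.19.5+k3s1)"
#   sudo install -m 0755 /tmp/k3s /usr/local/bin/k3s
#   sudo systemctl restart k3s      # or: sudo systemctl restart k3s-agent
```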

Expected behavior:
I expected DNS to return reliably and consistently.

Actual behavior:
DNS queries fail some non-trivial percent of the time.

Additional context / logs:
There are no logs from coredns or from k3s about this issue. I'm simply seeing software that uses the built-in DNS to look up external addresses fail with coredns 1.7.1 but not with coredns 1.6.9.

To Test  area/dns  kind/bug  kind/upstream-issue  priority/critical-urgent

Most helpful comment

There's an upstream issue tracking this in CoreDNS:

https://github.com/coredns/coredns/issues/4189

Looks like it's fixed in 1.8.0.

All 15 comments

This is now doubly annoying because k3s keeps reverting my reversion so I have to go back in and make coredns run 1.6.9 on a regular basis.
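For anyone fighting the same reversion: k3s's auto-deploy mechanism can reportedly be told to leave a packaged manifest alone by creating a matching `.skip` file. A hedged sketch; the manifest directory below is the k3s default, and the `.skip` marker convention is an assumption about k3s's manifest auto-deploy controller:

```shell
#!/bin/sh
# Hedged workaround sketch: drop a ".skip" marker next to the packaged
# coredns manifest so k3s stops re-deploying (and thus reverting) it.
# Assumptions: the default manifest directory, and k3s's ".skip" convention.
skip_packaged_manifest() {
  name="$1"
  dir="${2:-/var/lib/rancher/k3s/server/manifests}"
  touch "${dir}/${name}.skip"
}

# Usage on the server node:
#   skip_packaged_manifest coredns.yaml
```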

Can confirm. Downgraded back to k3s 1.19.4-k3s1 for the time being.

I can confirm, it started happening after I upgraded to 1.19.5. The DNS lookup is failing seemingly randomly.

Can you provide any more info on what sort of DNS lookups are failing? In-cluster, out-of-cluster, etc? I'm not aware of any similar issues reported with the coredns project.

For me, the out-of-cluster DNS lookups are failing randomly, but if I keep retrying, they eventually succeed. My setup runs two independent services in the same cluster that communicate through external DNS (configured through Ingress).

I tried running dig to figure out exactly what is happening, and I found that every time the TTL expires I get a weird response.

/ # dig git.argc.in

; <<>> DiG 9.16.6 <<>> git.argc.in
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54532
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 2048
;; QUESTION SECTION:
;git.argc.in.                   IN      A

;; ANSWER SECTION:
git.argc.in.            30      IN      CNAME   lb-0.argd.in.
git.argc.in.            30      IN      RRSIG   CNAME 13 3 300 20201218185311 20201216165311 34505 argc.in. ARV1ABkpLZiInqcMcF5y6oK4HE+uMryTwnajPJrxMJIeo2giZ2sOWT6q jjELOyJS/h97XaNstHI2C9RgcJUsNQ==

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Thu Dec 17 17:55:29 UTC 2020
;; MSG SIZE  rcvd: 191

vs. a cached response:


; <<>> DiG 9.16.6 <<>> git.argc.in
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27309
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: c779baabef9ec284 (echoed)
;; QUESTION SECTION:
;git.argc.in.                   IN      A

;; ANSWER SECTION:
git.argc.in.            30      IN      CNAME   lb-0.argd.in.
lb-0.argd.in.           30      IN      A       167.233.11.253

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Thu Dec 17 18:00:39 UTC 2020
;; MSG SIZE  rcvd: 117

This matches the curl failure, Could not resolve host:

/ # curl -I git.argc.in
curl: (6) Could not resolve host: git.argc.in

Edit: Updated with cached response

What's weird about that? Can you compare it to an example of a cached response?

Updated the original comment with the cached response. I'm not sure exactly what an RRSIG record is, but it seems to be related to DNSSEC. I will try disabling DNSSEC and see if that is really the cause.
Better yet, I can try with a domain with DNSSEC disabled.

There's an upstream issue tracking this in CoreDNS:

https://github.com/coredns/coredns/issues/4189

Looks like it's fixed in 1.8.0.

Thanks for digging that up! It appears to be specifically related to cache and dnssec. We'll evaluate going to v1.8.0 or back to 1.6.9 with our next release.

This was broken by https://github.com/coredns/coredns/commit/acf9a0fa19928e605ac8ac3314890c9fef73e16b which is in 1.7.1
This was fixed by https://github.com/coredns/coredns/commit/268781d3553cca94f5e5ecbd248a537fbdf6dae8 which is in 1.8.0
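For context, the Corefile that k3s ships enables both of the plugins involved. It looks approximately like the stock Kubernetes default below (reproduced from memory, so treat it as a sketch rather than the exact shipped file):

```
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
```

Since the regression is in the `cache` plugin's handling of DNSSEC material, any Corefile with `cache` enabled and a DNSSEC-signed upstream path could hit it.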

For testing purposes, users have reported failures resolving the following names:

  • git.argc.in
  • deb.debian.org
  • test.my.salesforce.com

You should be able to test these by running:

kubectl run --rm --tty --stdin --restart=Never --image=tutum/dnsutils dnsutils -- dig git.argc.in

A successful response will have one or more A records in the ANSWER SECTION output:

;; ANSWER SECTION:
git.argc.in.        25  IN  CNAME   lb-0.argd.in.
lb-0.argd.in.       25  IN  A   167.233.11.253

A failed response will have RRSIG records instead:

;; ANSWER SECTION:
git.argc.in.        30  IN  CNAME   lb-0.argd.in.
git.argc.in.        30  IN  RRSIG   CNAME 13 3 300 20201218233519 20201216213519 34505 argc.in. 86UMKuSTmm3y4ex/4orjGDpCdqsSi4rTfK9LJ7hhqhL368lpK1EIBXbt GQuS5yDyVNKp8rpq+6/2rW/wOerDGw==

This fails for me on a cluster running CoreDNS 1.7.1 that is forwarding to an upstream dnsmasq server with dnssec enabled. I am unsure whether it matters if the upstream supports dnssec.
If you're testing on Ubuntu or RHEL, you might try disabling systemd-resolved and pointing your /etc/resolv.conf at 8.8.8.8; systemd-resolved or other non-dnssec-enabled upstream resolvers might mask the issue.
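The pass/fail distinction above is mechanical enough to script. A small hypothetical helper, written for this thread, that reads dig output and checks whether any address record made it into the answer:

```shell
#!/bin/sh
# Hypothetical helper for the check described above: read `dig` output on
# stdin and succeed only if the answer contains an A or AAAA record.
# An answer holding only CNAME/RRSIG records reproduces the bug.
has_address_record() {
  awk '$3 == "IN" && ($4 == "A" || $4 == "AAAA") { found = 1 }
       END { exit !found }'
}

# Usage, e.g. to measure how often lookups fail inside the cluster:
#   for i in $(seq 1 20); do
#     kubectl run --rm -i --restart=Never --image=tutum/dnsutils "dns-$i" \
#       -- dig git.argc.in | has_address_record || echo "lookup $i failed"
#   done
```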

FYI - In case you couldn't tell from the PRs linked above, we are going to put out new releases this afternoon. The release-1.19 branch is going to go back to v1.6.9, and 1.20 (master) is going to go to v1.8.0 (unless we run into some other major issue with that version).

Thanks! Looking forward to the new release.

I can confirm upgrading to v1.19.5+k3s2 has resolved these issues for me. Thanks for the release!

Confirming from my end as well. The upgrade does fix the issue.
