Linkerd2: DNS canonicalize failed to refine

Created on 11 Jan 2019 · 7 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

Services with the Linkerd sidecar are working fine, but the sidecar logs show constant DNS errors.

How can it be reproduced?

N/A

Logs, error output, etc

ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine portfolio.kyro.svc.cluster.local: deadline has elapsed 

linkerd check output

$ linkerd check
kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-existence: control plane namespace exists..........................[ok]
linkerd-existence: controller pod is running...............................[ok]
linkerd-existence: can initialize the client...............................[ok]
linkerd-existence: can query the control plane API.........................[ok]
linkerd-api: control plane pods are ready..................................[ok]
linkerd-api: can query the control plane API...............................[ok]
linkerd-api[kubernetes]: control plane can talk to Kubernetes..............[ok]
linkerd-api[prometheus]: control plane can talk to Prometheus..............[ok]
linkerd-api: no invalid service profiles...................................[ok]
linkerd-version: can determine the latest version..........................[ok]
linkerd-version: cli is up-to-date.........................................[ok]
linkerd-version: control plane is up-to-date...............................[ok]

Status check results are [ok]

Environment

  • Kubernetes Version: v1.11.5
  • Cluster Environment: AKS
  • Host OS: Amazon Linux
  • Linkerd version:
Client version: edge-19.1.1
Server version: edge-19.1.1

Possible solution

N/A

Additional context

I switched over to edge due to issue #2012. My cluster is configured to use CoreDNS, proxying requests for the .consul domain to the local Consul DNS service. I'm not sure whether this has anything to do with the logged errors.

area/proxy

All 7 comments

Thanks for the report, @matkam. Can you share details about how you've configured DNS?

From a pod, are you able to dig catalyst.elasticsearch.service.azure-west-us.consul?

Are there any functional problems observed, other than the log spam?
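For reference, one way to run that check from inside the cluster (a sketch; the pod name, namespace, and debug image here are placeholders):

# run dig from an already-injected pod in the affected namespace
kubectl exec -it <pod-name> -n <namespace> -- \
  dig catalyst.elasticsearch.service.azure-west-us.consul

# or from a throwaway debugging pod
kubectl run -it --rm dns-test --image=tutum/dnsutils --restart=Never -- \
  dig catalyst.elasticsearch.service.azure-west-us.consul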

Here's my coredns Corefile:

.:53 {
    errors
    health
    kubernetes cluster.local {
      pods insecure
      upstream
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    # forward queries for the consul domain to the local Consul DNS server
    proxy consul 10.100.97.83:53
    # everything else goes to the node's upstream resolvers
    proxy . /etc/resolv.conf
    cache 30
}

dig:

$ dig catalyst.elasticsearch.service.azure-west-us.consul ANY

; <<>> DiG 9.10.6 <<>> catalyst.elasticsearch.service.azure-west-us.consul ANY
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56681
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;catalyst.elasticsearch.service.azure-west-us.consul. IN    ANY

;; ANSWER SECTION:
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN A 172.16.1.6
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN TXT "consul-network-segment="
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN A 172.16.1.14
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN TXT "consul-network-segment="

;; Query time: 97 msec
;; SERVER: 10.100.0.10#53(10.100.0.10)
;; WHEN: Fri Jan 11 11:36:41 PST 2019
;; MSG SIZE  rcvd: 388

Hi @matkam, would you mind re-injecting the affected pods with the following flag:

--proxy-log-level=warn,linkerd2_proxy=info,linkerd2_proxy::proxy::canonicalize=debug,linkerd2_proxy::dns=debug,trust_dns_resolver=debug

This will give us more detailed debug logging from the proxy's DNS modules. Then, if the issue occurs again, it would be great to see the logs.
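For anyone following along, that re-injection could look something like this (a sketch; the deployment name is a placeholder):

kubectl get deploy/<deployment-name> -o yaml | \
  linkerd inject \
    --proxy-log-level=warn,linkerd2_proxy=info,linkerd2_proxy::proxy::canonicalize=debug,linkerd2_proxy::dns=debug,trust_dns_resolver=debug \
    - | \
  kubectl apply -f -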

Hi @hawkw, I reinjected two different deployments and ran them for a few minutes. Here are the logs:
https://gist.github.com/matkam/adc1f05cb3293998c7e1a64401329c4f

Thanks for the logs @matkam. It looks like the original refinement is succeeding, but then subsequent DNS queries hit the timeout.

Based on the dig output you posted above, I notice that the query time for catalyst.elasticsearch.service.azure-west-us.consul was 97 milliseconds. By default, Linkerd sets a timeout of 100 milliseconds for queries to canonicalize a DNS name. Given that your DNS server is taking close to 100 milliseconds to respond, I suspect queries are timing out frequently.
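A quick way to see how often the lookup strays near that 100-millisecond budget (purely illustrative, run from any pod that has dig):

# print only the query-time statistic for 20 consecutive lookups
for i in $(seq 1 20); do
  dig +noall +stats catalyst.elasticsearch.service.azure-west-us.consul | grep 'Query time'
done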

If you're not seeing any issues as a result of this, these log messages are likely benign. However, if you do see failures to pick up service discovery changes from DNS in a reasonable amount of time, we could consider making the timeout for DNS queries configurable.

We might also want to consider changing that log message from an error to a warning, at least for timeouts.

Hi @matkam, I've opened #2093 and #2094 to track making the timeout configurable and trying to reduce the log spam, respectively. Since this isn't really a bug and the actionable changes are tracked in those issues, I'm going to close this ticket.

Thanks for reporting, and if you have any more questions or problems, please don't hesitate to open new issues!

@matkam, just as a follow-up, I wanted to let you know that we just merged linkerd/linkerd2-proxy#175, so the proxy on linkerd/linkerd2-proxy@master (and presumably, the next edge release) will now honor a LINKERD2_PROXY_DNS_CANONICALIZE_TIMEOUT environment variable to override the default 100ms timeout.
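Once that release lands, a workload could override the timeout with something like the following (a sketch; the 500ms value is illustrative, and this assumes the variable accepts a duration string):

kubectl set env deployment/<deployment-name> -c linkerd-proxy \
  LINKERD2_PROXY_DNS_CANONICALIZE_TIMEOUT=500ms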
