Linkerd2: DNS canonicalize failed to refine

Created on 11 Jan 2019 · 7 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

Services with the Linkerd sidecar are working fine, but the sidecar logs show constant DNS errors.

How can it be reproduced?

N/A

Logs, error output, etc

ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine catalyst.elasticsearch.service.azure-west-us.consul: deadline has elapsed
ERR! linkerd2_proxy::proxy::canonicalize failed to refine portfolio.kyro.svc.cluster.local: deadline has elapsed 

linkerd check output

$ linkerd check
kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-existence: control plane namespace exists..........................[ok]
linkerd-existence: controller pod is running...............................[ok]
linkerd-existence: can initialize the client...............................[ok]
linkerd-existence: can query the control plane API.........................[ok]
linkerd-api: control plane pods are ready..................................[ok]
linkerd-api: can query the control plane API...............................[ok]
linkerd-api[kubernetes]: control plane can talk to Kubernetes..............[ok]
linkerd-api[prometheus]: control plane can talk to Prometheus..............[ok]
linkerd-api: no invalid service profiles...................................[ok]
linkerd-version: can determine the latest version..........................[ok]
linkerd-version: cli is up-to-date.........................................[ok]
linkerd-version: control plane is up-to-date...............................[ok]

Status check results are [ok]

Environment

  • Kubernetes Version: v1.11.5
  • Cluster Environment: AKS
  • Host OS: Amazon Linux
  • Linkerd version:
Client version: edge-19.1.1
Server version: edge-19.1.1

Possible solution

N/A

Additional context

I switched over to edge due to issue #2012. My cluster is configured to use CoreDNS, proxying requests for the .consul domain to the local Consul DNS service. I'm not sure whether this has anything to do with the logged errors.

area/proxy

All 7 comments

Thanks for the report, @matkam. Can you share details about how you've configured DNS?

From a pod, are you able to dig catalyst.elasticsearch.service.azure-west-us.consul?

Are there any functional problems observed, other than the log spam?
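For reference, one way to run that check from inside the cluster (a sketch; the pod name, namespace, and debug image here are placeholders):

# run dig from an already-injected pod in the affected namespace
kubectl exec -it <pod-name> -n <namespace> -- \
  dig catalyst.elasticsearch.service.azure-west-us.consul

# or from a throwaway debugging pod
kubectl run -it --rm dns-test --image=tutum/dnsutils --restart=Never -- \
  dig catalyst.elasticsearch.service.azure-west-us.consul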

Here's my coredns Corefile:

.:53 {
    errors
    health
    kubernetes cluster.local {
      pods insecure
      upstream
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    # forward queries for the consul domain to the local Consul DNS server
    proxy consul 10.100.97.83:53
    # everything else goes to the node's upstream resolvers
    proxy . /etc/resolv.conf
    cache 30
}

dig:

$ dig catalyst.elasticsearch.service.azure-west-us.consul ANY

; <<>> DiG 9.10.6 <<>> catalyst.elasticsearch.service.azure-west-us.consul ANY
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56681
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;catalyst.elasticsearch.service.azure-west-us.consul. IN    ANY

;; ANSWER SECTION:
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN A 172.16.1.6
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN TXT "consul-network-segment="
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN A 172.16.1.14
catalyst.elasticsearch.service.azure-west-us.consul. 0 IN TXT "consul-network-segment="

;; Query time: 97 msec
;; SERVER: 10.100.0.10#53(10.100.0.10)
;; WHEN: Fri Jan 11 11:36:41 PST 2019
;; MSG SIZE  rcvd: 388

Hi @matkam, would you mind re-injecting the affected pods with the following flag:

--proxy-log-level=warn,linkerd2_proxy=info,linkerd2_proxy::proxy::canonicalize=debug,linkerd2_proxy::dns=debug,trust_dns_resolver=debug

This will give us more detailed debug logging from the proxy's DNS modules. Then, if the issue occurs again, it would be great to see the logs.
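For anyone following along, that re-injection could look something like this (a sketch; the deployment name is a placeholder):

kubectl get deploy/<deployment-name> -o yaml | \
  linkerd inject \
    --proxy-log-level=warn,linkerd2_proxy=info,linkerd2_proxy::proxy::canonicalize=debug,linkerd2_proxy::dns=debug,trust_dns_resolver=debug \
    - | \
  kubectl apply -f -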

Hi @hawkw, I reinjected two different deployments and ran them for a few minutes. Here are the logs:
https://gist.github.com/matkam/adc1f05cb3293998c7e1a64401329c4f

Thanks for the logs @matkam. It looks like the original refinement is succeeding, but then subsequent DNS queries hit the timeout.

Based on the dig output you posted above, I notice that the query time for catalyst.elasticsearch.service.azure-west-us.consul was 97 milliseconds. By default, Linkerd sets a timeout of 100 milliseconds for queries to canonicalize a DNS name. Given that your DNS server is taking close to 100 milliseconds to respond, I suspect queries are timing out frequently.
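A quick way to see how often the lookup strays near that 100-millisecond budget (purely illustrative, run from any pod that has dig):

# print only the query-time statistic for 20 consecutive lookups
for i in $(seq 1 20); do
  dig +noall +stats catalyst.elasticsearch.service.azure-west-us.consul | grep 'Query time'
done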

If you're not seeing any issues as a result of this, these log messages are likely benign. However, if you do see failures to pick up service discovery changes from DNS in a reasonable amount of time, we could consider making the timeout for DNS queries configurable.

We might also want to consider changing that log message from an error to a warning, at least for timeouts.

Hi @matkam, I've opened #2093 and #2094 to track making the timeout configurable and trying to reduce the log spam, respectively. Since this isn't really a bug and the actionable changes are tracked in those issues, I'm going to close this ticket.

Thanks for reporting, and if you have any more questions or problems, please don't hesitate to open new issues!

@matkam, just as a follow-up, I wanted to let you know that we just merged linkerd/linkerd2-proxy#175, so the proxy on linkerd/linkerd2-proxy@master (and presumably, the next edge release) will now honor a LINKERD2_PROXY_DNS_CANONICALIZE_TIMEOUT environment variable to override the default 100ms timeout.
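Once that release lands, a workload could override the timeout with something like the following (a sketch; the 500ms value is illustrative, and this assumes the variable accepts a duration string):

kubectl set env deployment/<deployment-name> -c linkerd-proxy \
  LINKERD2_PROXY_DNS_CANONICALIZE_TIMEOUT=500ms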
