Kong: Intermittent DNS failures

Created on 10 Feb 2020 · 9Comments · Source: Kong/kong

Summary

We are running into some strange issues with Kong being unable to resolve DNS, despite being able resolve other requests (from the same Consul domain) in other requests before and after.

I am not sure of the order of the error within Kong (as the errors appear in the same second), but it appears that my upstream times out (assume this is first as it appears first in each error log), and then Kong has an issue with trying to do a DNS lookup. It should be trying SRV records first (I believe that is type 33), but seemingly it can't resolve. However, often in the second before or after (or sometimes the same second), a request to another service on Consul resolves perfectly. This is very intermittent. I am wondering if perhaps a DNS lookup is affected by the fact that the upstream has failed, but can't figure out why that would be the case. Logs below:

2020/02/09 22:45:57 [error] 31#0: *25854631 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.31.44.180, server: kong, request: "GET /v1/health/ HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/health/", host: "<redacted>.com" 
2020/02/09 22:45:57 [error] 31#0: *25854631 [lua] balancer.lua:850: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed. Tried: [
    "(short)public-api.service.consul:(na) - cache-miss",
    "public-api.service.consul:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:5 - cache only lookup failed/dns server error: 100 cache only lookup failed"
] while connecting to upstream, client: 172.31.44.180, server: kong, request: "GET /v1/health/ HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/health/", host: "<redacted>.com" 
2020/02/09 22:45:57 [error] 31#0: *25854631 [lua] init.lua:610: balancer(): failed to retry the dns/balancer resolver for public-api.service.consul' with: dns server error: 100 cache only lookup failed while connecting to upstream, client: 172.31.44.180, server: kong, request: "GET /v1/health/ HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/health/", host: "<redacted>.com"

2020/02/09 22:46:22 [error] 31#0: *25854763 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.31.25.53, server: kong, request: "GET /v1/users?id=<redacted> HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users?id=<redacted>", host: "<redacted>.com" 
2020/02/09 22:46:22 [error] 31#0: *25854763 [lua] balancer.lua:850: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed. Tried: [
    "(short)public-api.service.consul:(na) - cache-miss",
    "public-api.service.consul:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:5 - cache only lookup failed/dns server error: 100 cache only lookup failed"
] while connecting to upstream, client: 172.31.25.53, server: kong, request: "GET /v1/users?id>=<redacted> HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users?id=<redacted>", host: "<redacted>.com" 
2020/02/09 22:46:22 [error] 31#0: *25854763 [lua] init.lua:610: balancer(): failed to retry the dns/balancer resolver for public-api.service.consul' with: dns server error: 100 cache only lookup failed while connecting to upstream, client: 172.31.25.53, server: kong, request: "GET /v1/users?id=<redacted> HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users?id=<redacted>", host: "<redacted>.com" 



md5-3fa47e5dab80e8c0e02d500eefdeaa53



2020/02/09 22:46:31 [error] 31#0: *25857533 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.31.25.53, server: kong, request: "GET /v1/users?id=<redacted> HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users?id=<redacted>", host: "<redacted>.com" 
2020/02/09 22:46:31 [error] 31#0: *25857533 [lua] balancer.lua:850: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed. Tried: [
    "(short)public-api.service.consul:(na) - cache-miss",
    "public-api.service.consul:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:5 - cache only lookup failed/dns server error: 100 cache only lookup failed"
] while connecting to upstream, client: 172.31.25.53, server: kong, request: "GET /v1/users?id=<redacted> HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users?id=<redacted>", host: "<redacted>.com" 
2020/02/09 22:46:31 [error] 31#0: *25857533 [lua] init.lua:610: balancer(): failed to retry the dns/balancer resolver for public-api.service.consul' with: dns server error: 100 cache only lookup failed while connecting to upstream, client: 172.31.25.53, server: kong, request: "GET /v1/users?id=<redacted> HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users?id=<redacted>", host: "<redacted>.com" 



md5-3fa47e5dab80e8c0e02d500eefdeaa53



2020/02/09 22:48:18 [error] 31#0: *25850242 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.31.44.180, server: kong, request: "GET /v1/users HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users", host: "<redacted>.com" 
2020/02/09 22:48:18 [error] 31#0: *25850242 [lua] balancer.lua:850: balancer_execute(): DNS resolution failed: dns server error: 100 cache only lookup failed. Tried: [
    "(short)public-api.service.consul:(na) - cache-miss",
    "public-api.service.consul:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:33 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:1 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul:5 - cache only lookup failed/dns server error: 100 cache only lookup failed",
    "public-api.service.consul.eu-west-1.compute.internal:5 - cache only lookup failed/dns server error: 100 cache only lookup failed"
] while connecting to upstream, client: 172.31.44.180, server: kong, request: "GET /v1/users HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users", host: "<redacted>.com" 
2020/02/09 22:48:18 [error] 31#0: *25850242 [lua] init.lua:610: balancer(): failed to retry the dns/balancer resolver for public-api.service.consul' with: dns server error: 100 cache only lookup failed while connecting to upstream, client: 172.31.44.180, server: kong, request: "GET /v1/users HTTP/1.1", upstream: "http://172.31.57.159:42956/v1/users", host: "<redacted>.com"

Steps To Reproduce

Unable to reproduce on demand, the failure is intermittent.

Additional Details & Logs

Kong version: 1.3.0
Kong error logs (shown in summary)
Operating system: AWS Linux (ECS optimised AMI)
Consul: 1.6.2 (for all agents)

Source

neilmcgibbon

Most helpful comment

My guess is you have a fairly short dns TTL for the name and fairly low traffic volume at the same time.

The underlying retry mechanism is limited in a sense that it cannot do socket operations. This means that during a retry, no DNS lookups can be done. Hence the error being a cache only lookup failure.

Say you have a TTL of 30 seconds, in second 25 a request is proxied, but times out after 10 seconds (second 35). Now Kong does a retry, Kong cannot do DNS resolution in this phase so will just search its cache, and fail because the TTL has expired and hence the record removed.

This would not happen if in between second 30 and 35 another request would come in for the same server, since that would refresh the DNS and the other requests retry would find that record. (hence; only happens with fairly low traffic volumes)

The DNS resolver has a property dns_stale_ttl (default 4 seconds) which determines how long Kong is allowed to use stale DNS records (eg. they expired, but Kong will not yet purge them from the cache).
By increasing this setting you will probably be able to remove (or at least reduce) the errors.

See https://docs.konghq.com/2.0.x/configuration/#dns_stale_ttl

Tieske on 10 Feb 2020

🚀3

All 9 comments

@Tieske do the log messages above read any bells?

hishamhm on 10 Feb 2020

My guess is you have a fairly short dns TTL for the name and fairly low traffic volume at the same time.

See https://docs.konghq.com/2.0.x/configuration/#dns_stale_ttl

Tieske on 10 Feb 2020

🚀3

( @Tieske wow, that was a quick answer! :D )

hishamhm on 10 Feb 2020

🚀1

@Tieske thanks so much for your detailed response.

You are 100% correct, the TTL is low, and traffic volume is low on this service. However, I wonder if you have any advice regarding mitigating this based on an exceptionally low TTL. We use Consul for DNS discovery, and by default the TTL is 0; from their docs:

By default, all DNS results served by Consul set a 0 TTL value. This disables caching of DNS results.

Assuming we want to keep the default Consul TTL of 0, what do you suggest I can do with the dns_stale_ttl property you mentioned?

neilmcgibbon on 12 Feb 2020

well, essentially the trick with ttl=0 is about making it dynamic, quick responding to healthprobes (by Consul), and allow Consul to dictate loadbalancing (it will serve different IP's)

So you could set the ttl to a higher value, have Kong do healthchecks and loadbalancing.

Alternatively set the stale-ttl value to a value at least higher than the timeout you have configured on the backend. That should keep the errors away.

Tieske on 12 Feb 2020

@Tieske thanks again for the prompt feedback, but I think I am misunderstanding something here...

As something to try, you said (above):

Alternatively set the stale-ttl value to a value at least higher than the timeout you have configured on the backend. That should keep the errors away.

But if the dns_stale_ttl is set to its default of 4 seconds, and the Consul's default TTL is 0 seconds, then we have already "set the stale-ttl value to a value at least higher than the timeout you have configured on the backend". Is there another step I should be taking, or configuring TTL in a third location?

neilmcgibbon on 14 Feb 2020

Ah, I meant the timeout for the connection to the backend. Since that is timing out, and then the retry fails. Those would be the ones on the Kong service entity; https://docs.konghq.com/2.0.x/admin-api/#service-object

Tieske on 14 Feb 2020

We're closing this one since it's not a bug on the Kong side, but feel free to keep the conversation going!

hishamhm on 17 Feb 2020

I have a query @hishamhm , I am witnessing a case, where, the stale ttl stays for days. As dns resolving is failed, everytime.
All the configurations are default. Is this a possible case?