Kong: DNS resolution failed in dbless mode.

Created on 16 Jan 2020 · 39 Comments · Source: Kong/kong

Recently, I have often been getting the error below in db-less mode. Kong only returns to normal after a restart.

2020/01/09 12:05:19 [error] 42406#0: *151691 [lua] balancer.lua:917: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)upstream_8:(na) - cache-miss","upstream_8:33 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:1 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:5 - cache-miss/scheduled/querying/dns server error: 3 name error"], client: 192.168.63.1, server: kong, request: "POST /hr/1 HTTP/1.1", host: "gateway.corp.com:8000"

Labels: core/balancer, bug, needs-investigation


All 39 comments

After reviewing the code, I guess it may be caused by the function execute in /kong/runloop/balancer.lua.


@Tieske, does this ring any bells?

It happens sometimes; I can't find any way to reproduce it. According to the code, it happens when target.balancer is nil. I changed the code to if dns_cache_only and target.balancer ~= nil then balancer = target.balancer, and the error has not happened since.
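
For reference, a minimal self-contained sketch of that guard (pick_balancer and get_balancer are hypothetical names used only for illustration; the actual change is the single condition inside execute() in kong/runloop/balancer.lua):

-- sketch only: hypothetical helper showing the proposed guard
local function pick_balancer(target, dns_cache_only, get_balancer)
  if dns_cache_only and target.balancer ~= nil then
    -- on a retry, reuse the balancer stored on the target, but only
    -- when one was actually set on the first attempt
    return target.balancer
  end
  -- otherwise fall through to the normal upstream/balancer lookup instead of
  -- letting a nil balancer force plain DNS resolution of the upstream name
  return get_balancer(target)
end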

@SunshineYang, yes, I think your fix could be the one that needs to be made. I checked where we set balancer, and it seems we do not always set it on the first try.

@SunshineYang, I believe your fix should not be done as-is, because it could make the balancer phase yield, and yielding is not allowed in the balancer phase. So more investigation is needed on this.

@SunshineYang, are you sure your Kong can resolve upstream_8?

What is happening here is that Kong has an Upstream defined with the name upstream_8, but at the time of the failure the balancer/Upstream by that name is, for some reason, unavailable. Kong then falls back on DNS resolution, which fails, since the name exists only as an Upstream, not as a DNS name.
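
A minimal sketch of the two-step lookup being described (illustration only, not Kong source; the function and parameter names are made up):

-- sketch only: Upstream/balancer lookup first, DNS as the fallback
local function resolve_target(host, get_balancer, dns_resolve)
  -- step 1: look for an in-memory balancer built from the Upstream entity
  local balancer = get_balancer(host)
  if balancer then
    return balancer  -- normal case: pick a target from the balancer
  end
  -- step 2: no balancer found, so the host is treated as a plain DNS name;
  -- a name such as "upstream_8" has no DNS record, hence "3 name error"
  local ip, err = dns_resolve(host)
  if not ip then
    return nil, "DNS resolution failed: " .. tostring(err)
  end
  return ip
end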

@locao @hishamhm there have been some updates in the balancer lifecycle, creating and replacing them on changes. Does this ring a bell?

upstream_8

"upstream_8" is the name of the upstream, and the target of upstream_8 can be resolved.


@SunshineYang could you confirm which Kong version you are testing this with? Thanks!

1.4.0 and 1.4.1

@SunshineYang Does this happen at the beginning, when your configuration is (re)loaded, or after it has been running for a while?

Also, does reloading the configuration (via the /config endpoint) fix the problem? (without having to restart Kong)

Are you using Kong for Kubernetes?

I'm not using Kubernetes. It happens after running for a while, and it is not fixed unless Kong is restarted; reloading the configuration via the /config endpoint does not help.

Ran into this after upgrading from 1.4.2 to 2.0.2 (db-less mode),
with the exact same environment and the exact same amount of traffic; nothing changed except the Kong version.
After a while the log starts flooding with DNS resolution failed: dns server error: 3 name error.

And the problem keeps getting worse; eventually no request can be finished.
Reproducible in our prod env.

[ingress-kong-7679db7945-nd99v proxy] 2020/02/27 20:13:56 [error] 30#0: *1496 [lua] balancer.lua:921: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)vm-victoria-metrics-cluster-vminsert.zeus.http.svc:(na) - cache-miss","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.kong.svc.cluster.local:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.svc.cluster.local:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.cluster.local:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.maas:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.kong.svc.cluster.local:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.svc.cluster.local:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.cluster.local:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.maas:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.kong.svc.cluster.local:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.svc.cluster.local:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.cluster.local:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.maas:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error"], client: **.**.**.**, server: kong, request: "POST /insert/999/prometheus HTTP/1.1", host: "*****"

Wait, I think my case was completely different.
The failing domains contain a weird .http; they should be vm-victoria-metrics-cluster-vminsert.zeus.svc.
/etc/resolv.conf looks good. No idea where this comes from.

@hishamhm Any insight? I reviewed the recent code changes but found nothing related...
I can confirm that nothing changed but the Kong image.

No idea where this comes from.

@Ehekatl Kong doesn't concatenate segments to the DNS-resolved values like that, so this must be coming from somewhere else... I spot .maas in your domains; perhaps it is MAAS's domain resolution configuration that is doing that? (I have no firsthand experience with it, but it could be a place to look.)

@hishamhm I've posted the details in https://github.com/Kong/kubernetes-ingress-controller/issues/560
I believe there are several issues involved:

  1. for some reason, get_balancer can fail and return nil even though the backends are healthy.
    e.g. in a db-less Kong cluster, when I run into this problem, only one pod has the issue and the rest are fine, which means the other pods think the upstream has healthy targets.

  2. when get_balancer fails and returns nil, Kong falls back to DNS resolution, but it seems it never switches back; when DNS resolution is not working, it is unable to recover from this loop.

  3. DNS resolution fails because, in db-less mode, the Kong ingress controller names services in a domain.port format, which results in an unresolvable domain (see the sketch below).
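
As a toy illustration of point 3 (not Kong code): once the balancer lookup fails, the unresolvable base name is handed to the DNS client, which also tries every search suffix from resolv.conf, producing the long list of failed attempts seen in the error log above.

-- toy illustration only: a non-existent base name plus resolv.conf search
-- suffixes yields the list of names in the "Tried:" part of the log
local function expand_with_search(name, search_suffixes)
  local tried = { name }
  for _, suffix in ipairs(search_suffixes) do
    tried[#tried + 1] = name .. "." .. suffix
  end
  return tried
end

-- e.g. expand_with_search("vm-victoria-metrics-cluster-vminsert.zeus.http.svc",
--        { "kong.svc.cluster.local", "svc.cluster.local", "cluster.local", "maas" })
-- every one of these names fails with "dns server error: 3 name error"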

@Ehekatl,

Hi, I just discussed this with @javierguerragiraldez as we are investigating the issue. Thank you for the further info in that Kubernetes ticket.

I believe the problem lies in this commit:
https://github.com/Kong/kong/commit/5d8c87959cad18ae82518003199df010bc2c6a5c

If you change that line back to:

if last_equal_index == new_size and new_size > 0 then

Does it fix your issue?

I believe what that commit tried to fix is valid, but the fix needs to be reworked. It might be that:

new_size == old_size

may hold true, but that still doesn't mean we should not apply the history. Or perhaps:

and new_size > 0 

Needs to be added back. @javierguerragiraldez is investigating it further, but if you can, first try changing the line back to the pre-fix version:

if last_equal_index == new_size and new_size > 0 then

and then try the following (which should still fix the issue that commit addressed, without breaking what you are seeing):

if last_equal_index == new_size and new_size == old_size and new_size > 0 then

And report back if either of these helps.
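
For context, a self-contained sketch of the two conditions being compared (the function name is hypothetical; the real check lives in kong/runloop/balancer.lua where an upstream's target history is applied to its balancer):

-- sketch only: deciding whether the target history needs no further processing
local function history_already_applied(last_equal_index, old_history, new_history)
  local old_size, new_size = #old_history, #new_history
  -- pre-fix condition:       last_equal_index == new_size and new_size > 0
  -- condition proposed here: also require that the sizes match
  return last_equal_index == new_size
     and new_size == old_size
     and new_size > 0
end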

@bungle after changing the line to if last_equal_index == new_size and new_size == old_size and new_size > 0 then and running for several hours,

we haven't found any issue so far,
but it may need more testing (i.e. endpoint updates, ingress updates, etc.)

@bungle @javierguerragiraldez
after running for 14 hours, it happened again on 2 of 5 Kong pods.
Really weird: the target IPs haven't changed,
and I see no endpoint updates in the ingress controller log.

Still investigating; I will try changing it to if last_equal_index == new_size and new_size > 0 then and see.

@bungle @javierguerragiraldez one more thing
when I use /upstreams/id/targets/all/ to check the upstream targets,
I get Not Found once for every three requests, while the ingress controller has no log saying the endpoints were updated.

Got the same problem after changing it to if last_equal_index == new_size and new_size > 0;
it seems unrelated...
I'm rolling back to 1.4.2.

@Ehekatl great feedback! Let us know if it happens with 1.4.2 too.

@bungle it doesn't; it only happens after upgrading to a higher version.

@Ehekatl
Hi~
Have you made progress since the last comment?
I have also run into the same problem.

I use Kong 2.0.2, ingress-controller 0.7.1, and db-less mode.
I also use CoreDNS in another namespace. There are no error logs about this.

2020/03/24 01:25:54 [error] 23#0: *1497383 [lua] balancer.lua:921: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)service1.namespace.8080.svc:(na) - cache-miss","service1.namespace.8080.svc.namespace.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.openstacklocal:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.namespace.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.openstacklocal:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.namespace.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.openstacklocal:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc:5 - cache-hit/dns server error: 3 name error"], client: 10.11.0.26, server: kong, request: "GET /favicon.ico HTTP/1.1", host: "10.10.10.10", referrer: "http://10.10.10.10/api"

@novajung without luck, I'm sticking with 1.4.2 for now.

I confirm that @SunshineYang's approach does eliminate the issue,
by changing it to if dns_cache_only and target.balancer ~= nil then balancer = target.balancer.

I think I figured out what the issue is, but I'm not familiar with the code base, so I'll just share my findings without creating a pull request yet.

In /kong/runloop/balancer.lua, when the cache key "balancer:upstreams" is invalidated, the cache callback load_upstreams_dict_into_memory inside get_all_upstreams returns an empty dictionary, and this value gets saved into the cache.

If I force the cache key to be invalidated again when the dictionary is empty, it seems to work on the next execute call, though the first call will still fail with "name resolution failed".

Maybe you can judge better whether it should wait for singletons.db.upstreams to be re-populated, or whether it is something to do with the cache TTL.

get_all_upstreams = function()
    local upstreams_dict, err = singletons.core_cache:get("balancer:upstreams", opts,
                                                load_upstreams_dict_into_memory)
    if err then
      return nil, err
    end

    if #upstreams_dict == 0 then
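      -- workaround described above: the loader returned an empty table, so
      -- drop the cached entry and let a later request rebuild it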
      singletons.core_cache:invalidate_local("balancer:upstreams")
    end

    return upstreams_dict or {}
  end

@uflorin did you eliminate the issue with the above changes?

@bungle @Tieske @locao @hishamhm do you have any thoughts?

This is also an issue for us running Kong 2.0 in DBless mode inside of Kubernetes with 0.8.0 version of the ingress controller.

Same problem here, using Kong 2.0.3 db-less.
Kong tried to resolve my upstream as a DNS name :(

errorlog.log

cc @guanlan

Hi all,

We've verified the issue _is_ reproducible on 2.0.4 (and below), but _not reproducible_ on 2.1.0-alpha1. Here are the steps we took:

  1. Run a Kong 2.0.4 container:
docker run -ti -d --name kong -e "KONG_DATABASE=off" \
  -e "KONG_PROXY_LISTEN=0.0.0.0:8000" -e "KONG_ADMIN_LISTEN=0.0.0.0:8001" \
  -e KONG_PROXY_ERROR_LOG=/logs/error.log -e KONG_PROXY_ACCESS_LOG=/dev/null \
  -p 8000:8000 \
  -p 8001:8001 \
  -v "$PWD/logs:/logs" \
  kong:2.0.4
  2. Start a shell loop sending POST requests to /config:
for i in $(seq 1 500); do http :8001/config config=@../misc/kong.yml; done

(Find kong.yaml here).

  3. In a separate terminal window, send requests to the proxy port:
for i in $(seq 1 5000); do curl -s -o /dev/null -w "%{http_code}\n" localhost:8000; done

This will print the status code of all requests; lots of 503s will be seen. The logs in logs/error.log will contain errors like the following:

2020/06/01 15:52:18 [error] 22#0: *19380 [lua] balancer.lua:929: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)upstream1:(na) - cache-miss","upstream1:33 - cache-hit/dns server error: 3 name error","upstream1:1 - cache-hit/dns server error: 3 name error","upstream1:5 - cache-hit/dns server error: 3 name error"], client: 172.17.0.1, server: kong, request: "GET / HTTP/1.1", host: "localhost:8000"

If the same is attempted against 2.1.0-alpha1 (or with Kong master branch), no 503s will occur.

Note that the issue is also not reproducible in Kong master, since it includes https://github.com/Kong/kong/pull/5831.

Once again, thanks for all the feedback on this!

This issue was fixed by https://github.com/Kong/kong/commit/29a731a7d7f442e104720a3ef6fbdc8ca44c9b13, which is present in master and 2.1-beta. Closing this as resolved.

Recently, I have often been getting the error below in db-less mode. Kong only returns to normal after a restart.

2020/01/09 12:05:19 [error] 42406#0: *151691 [lua] balancer.lua:917: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)upstream_8:(na) - cache-miss","upstream_8:33 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:1 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:5 - cache-miss/scheduled/querying/dns server error: 3 name error"], client: 192.168.63.1, server: kong, request: "POST /hr/1 HTTP/1.1", host: "gateway.corp.com:8000"

Has the problem been solved?

Still reproduced in my production environment with hybrid deployment mode (Kong version 2.1.3, deployed into a Kubernetes cluster). I have five DP pods, and one of them ran into this error suddenly. Recreating the pods fixed the issue temporarily, and I'm trying @Ehekatl's answer as a workaround.

Ran into the same problem with DB-less mode (Kong version 2.1.0).
After I added a service via the admin API, one of my three DP nodes got this error, which continued for more than two minutes; then I deleted the service via the admin API and everything was fine.

Maybe syncing the config between the CP and the DP leads to the cache being invalidated, and the cache is not rebuilt or the rebuild fails? I can't find more details in the error log.

I think I figured out what the issue is, but I'm not familiar with the code base, so I'll just share my findings without creating a pull request yet.

In /kong/runloop/balancer.lua, when the cache key "balancer:upstreams" is invalidated, the cache callback load_upstreams_dict_into_memory inside get_all_upstreams returns an empty dictionary, and this value gets saved into the cache.

If I force the cache key to be invalidated again when the dictionary is empty, it seems to work on the next execute call, though the first call will still fail with "name resolution failed".

Maybe you can judge better whether it should wait for singletons.db.upstreams to be re-populated, or whether it is something to do with the cache TTL.

get_all_upstreams = function()
    local upstreams_dict, err = singletons.core_cache:get("balancer:upstreams", opts,
                                                load_upstreams_dict_into_memory)
    if err then
      return nil, err
    end

    if #upstreams_dict == 0 then
      singletons.core_cache:invalidate_local("balancer:upstreams")
    end

    return upstreams_dict or {}
  end

This will make traffic proxying pretty slow, tested :(
