Kong: DNS resolution failed in dbless mode.

Created on 16 Jan 2020 · 39 Comments · Source: Kong/kong

Recently, I have often been getting the error below in db-less mode. Kong only returns to normal after a restart.

2020/01/09 12:05:19 [error] 42406#0: *151691 [lua] balancer.lua:917: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)upstream_8:(na) - cache-miss","upstream_8:33 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:1 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:5 - cache-miss/scheduled/querying/dns server error: 3 name error"], client: 192.168.63.1, server: kong, request: "POST /hr/1 HTTP/1.1", host: "gateway.corp.com:8000"

Labels: core/balancer, bug, needs-investigation


All 39 comments

After reviewing the code, I guess it may be caused by the function execute in /kong/runloop/balancer.lua.


@Tieske, does this ring any bells?

It happens sometimes; I can't find any way to reproduce it. According to the code, it happens when target.balancer is nil. I changed the code to if dns_cache_only and target.balancer ~= nil then balancer = target.balancer, and the error has not happened since.
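
For reference, a minimal self-contained sketch of that guard (pick_balancer and get_balancer are hypothetical names used only for illustration; the actual change is the single condition inside execute() in kong/runloop/balancer.lua):

-- sketch only: hypothetical helper showing the proposed guard
local function pick_balancer(target, dns_cache_only, get_balancer)
  if dns_cache_only and target.balancer ~= nil then
    -- on a retry, reuse the balancer stored on the target, but only
    -- when one was actually set on the first attempt
    return target.balancer
  end
  -- otherwise fall through to the normal upstream/balancer lookup instead of
  -- letting a nil balancer force plain DNS resolution of the upstream name
  return get_balancer(target)
end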

@SunshineYang, yes, I think your fix could be the one that needs to be made. I checked where we set balancer, and it seems we do not always set it on the first try.

@SunshineYang, I believe your fix should not be done as-is, because it could make the balancer phase yield, and yielding is not allowed in the balancer phase. So more investigation is needed on this.

@SunshineYang, are you sure your Kong can resolve upstream_8?

What is happening here is that Kong has an Upstream defined with the name upstream_8, but at the time of the failure the balancer/Upstream by that name is, for some reason, unavailable. Kong then falls back on DNS resolution, which fails, since the name exists only as an Upstream, not as a DNS name.
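
A minimal sketch of the two-step lookup being described (illustration only, not Kong source; the function and parameter names are made up):

-- sketch only: Upstream/balancer lookup first, DNS as the fallback
local function resolve_target(host, get_balancer, dns_resolve)
  -- step 1: look for an in-memory balancer built from the Upstream entity
  local balancer = get_balancer(host)
  if balancer then
    return balancer  -- normal case: pick a target from the balancer
  end
  -- step 2: no balancer found, so the host is treated as a plain DNS name;
  -- a name such as "upstream_8" has no DNS record, hence "3 name error"
  local ip, err = dns_resolve(host)
  if not ip then
    return nil, "DNS resolution failed: " .. tostring(err)
  end
  return ip
end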

@locao @hishamhm there have been some updates in the balancer lifecycle, creating and replacing them on changes. Does this ring a bell?

upstream_8

"upstream_8" is the name of the upstream, and the target of upstream_8 can be resolved.


@SunshineYang could you confirm which Kong version you are testing this with? Thanks!

1.4.0 and 1.4.1

@SunshineYang Does this happen at the beginning, when your configuration is (re)loaded, or after it has been running for a while?

Also, does reloading the configuration (via the /config endpoint) fix the problem? (without having to restart Kong)

Are you using Kong for Kubernetes?

I'm not using Kubernetes. It happens after running for a while, and it is not fixed unless Kong is restarted; reloading the configuration via the /config endpoint does not help.

Ran into this after upgrading from 1.4.2 to 2.0.2 (db-less mode),
with the exact same environment and the exact same amount of traffic; nothing changed except the Kong version.
After a while the log starts flooding with DNS resolution failed: dns server error: 3 name error.

And the problem keeps getting worse; eventually no request can be finished.
Reproducible in our prod env.

[ingress-kong-7679db7945-nd99v proxy] 2020/02/27 20:13:56 [error] 30#0: *1496 [lua] balancer.lua:921: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)vm-victoria-metrics-cluster-vminsert.zeus.http.svc:(na) - cache-miss","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.kong.svc.cluster.local:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.svc.cluster.local:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.cluster.local:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.maas:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc:33 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.kong.svc.cluster.local:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.svc.cluster.local:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.cluster.local:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.maas:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc:1 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.kong.svc.cluster.local:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.svc.cluster.local:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.cluster.local:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc.maas:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error","vm-victoria-metrics-cluster-vminsert.zeus.http.svc:5 - cache-hit/stale/in progress (async)/dns server error: 3 name error"], client: **.**.**.**, server: kong, request: "POST /insert/999/prometheus HTTP/1.1", host: "*****"

Wait, I think my case was completely different.
The failing domains contain a weird .http; they should be vm-victoria-metrics-cluster-vminsert.zeus.svc.
/etc/resolv.conf looks good. No idea where this comes from.

@hishamhm Any insight? I reviewed the recent code changes but found nothing related...
I can confirm that nothing changed but the Kong image.

No idea where this comes from.

@Ehekatl Kong doesn't concatenate segments to the DNS-resolved values like that, so this must be coming from somewhere else... I spot .maas in your domains; perhaps it is MAAS's domain resolution configuration that is doing that? (I have no firsthand experience with it, but it could be a place to look.)

@hishamhm I've posted the details in https://github.com/Kong/kubernetes-ingress-controller/issues/560
I believe there are several issues involved:

  1. for some reason, get_balancer can fail and return nil even though the backends are healthy.
    e.g. in a db-less Kong cluster, when I run into this problem, only one pod has the issue and the rest are fine, which means the other pods think the upstream has healthy targets.

  2. when get_balancer fails and returns nil, Kong falls back to DNS resolution, but it seems it never switches back; when DNS resolution is not working, it is unable to recover from this loop.

  3. DNS resolution fails because, in db-less mode, the Kong ingress controller names services in a domain.port format, which results in an unresolvable domain (see the sketch below).
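
As a toy illustration of point 3 (not Kong code): once the balancer lookup fails, the unresolvable base name is handed to the DNS client, which also tries every search suffix from resolv.conf, producing the long list of failed attempts seen in the error log above.

-- toy illustration only: a non-existent base name plus resolv.conf search
-- suffixes yields the list of names in the "Tried:" part of the log
local function expand_with_search(name, search_suffixes)
  local tried = { name }
  for _, suffix in ipairs(search_suffixes) do
    tried[#tried + 1] = name .. "." .. suffix
  end
  return tried
end

-- e.g. expand_with_search("vm-victoria-metrics-cluster-vminsert.zeus.http.svc",
--        { "kong.svc.cluster.local", "svc.cluster.local", "cluster.local", "maas" })
-- every one of these names fails with "dns server error: 3 name error"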

@Ehekatl,

Hi, I just discussed this with @javierguerragiraldez as we are investigating the issue. Thank you for the further info in that Kubernetes ticket.

I believe the problem lies in this commit:
https://github.com/Kong/kong/commit/5d8c87959cad18ae82518003199df010bc2c6a5c

If you change that line back to:

if last_equal_index == new_size and new_size > 0 then

Does it fix your issue?

I believe what that commit tried to fix is valid, but the fix needs to be reworked. It might be that:

new_size == old_size

may hold true, but that still doesn't mean we should not apply the history. Or perhaps:

and new_size > 0 

Needs to be added back. @javierguerragiraldez is investigating it further, but if you can, first try changing the line back to the pre-fix version:

if last_equal_index == new_size and new_size > 0 then

and then try the following (which should still fix the issue that commit addressed, without breaking what you are seeing):

if last_equal_index == new_size and new_size == old_size and new_size > 0 then

And report back if either of these helps.
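
For context, a self-contained sketch of the two conditions being compared (the function name is hypothetical; the real check lives in kong/runloop/balancer.lua where an upstream's target history is applied to its balancer):

-- sketch only: deciding whether the target history needs no further processing
local function history_already_applied(last_equal_index, old_history, new_history)
  local old_size, new_size = #old_history, #new_history
  -- pre-fix condition:       last_equal_index == new_size and new_size > 0
  -- condition proposed here: also require that the sizes match
  return last_equal_index == new_size
     and new_size == old_size
     and new_size > 0
end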

@bungle after changing the line to if last_equal_index == new_size and new_size == old_size and new_size > 0 then and running for several hours,

we haven't found any issue so far,
but it may need more testing (i.e. endpoint updates, ingress updates, etc.)

@bungle @javierguerragiraldez
after running for 14 hours, it happened again on 2 of 5 Kong pods.
Really weird: the target IPs haven't changed,
and I see no endpoint updates in the ingress controller log.

Still investigating; I will try changing it to if last_equal_index == new_size and new_size > 0 then and see.

@bungle @javierguerragiraldez one more thing
when I use /upstreams/id/targets/all/ to check the upstream targets,
I get Not Found once for every three requests, while the ingress controller has no log saying the endpoints were updated.

Got the same problem after changing it to if last_equal_index == new_size and new_size > 0;
it seems unrelated...
I'm rolling back to 1.4.2.

@Ehekatl great feedback! Let us know if it happens with 1.4.2 too.

@bungle it doesn't; it only happens after upgrading to a higher version.

@Ehekatl
Hi~
Have you made progress since the last comment?
I have also run into the same problem.

I use Kong 2.0.2, ingress-controller 0.7.1, and db-less mode.
I also use CoreDNS in another namespace. There are no error logs about this.

2020/03/24 01:25:54 [error] 23#0: *1497383 [lua] balancer.lua:921: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)service1.namespace.8080.svc:(na) - cache-miss","service1.namespace.8080.svc.namespace.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.cluster.local:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.openstacklocal:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc:33 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.namespace.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.openstacklocal:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc:1 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.namespace.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.cluster.local:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc.openstacklocal:5 - cache-hit/dns server error: 3 name error","service1.namespace.8080.svc:5 - cache-hit/dns server error: 3 name error"], client: 10.11.0.26, server: kong, request: "GET /favicon.ico HTTP/1.1", host: "10.10.10.10", referrer: "http://10.10.10.10/api"

@novajung without luck, I'm sticking with 1.4.2 for now.

I confirm that @SunshineYang's approach does eliminate the issue,
by changing it to if dns_cache_only and target.balancer ~= nil then balancer = target.balancer.

I think I figured out what the issue is, but I'm not familiar with the code base, so I'll just share my findings without creating a pull request yet.

In /kong/runloop/balancer.lua, when the cache key "balancer:upstreams" is invalidated, the cache callback load_upstreams_dict_into_memory inside get_all_upstreams returns an empty dictionary, and this value gets saved into the cache.

If I force the cache key to be invalidated again when the dictionary is empty, it seems to work on the next execute call, though the first call will still fail with "name resolution failed".

Maybe you can judge better whether it should wait for singletons.db.upstreams to be re-populated, or whether it is something to do with the cache TTL.

get_all_upstreams = function()
    local upstreams_dict, err = singletons.core_cache:get("balancer:upstreams", opts,
                                                load_upstreams_dict_into_memory)
    if err then
      return nil, err
    end

    if #upstreams_dict == 0 then
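      -- workaround described above: the loader returned an empty table, so
      -- drop the cached entry and let a later request rebuild it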
      singletons.core_cache:invalidate_local("balancer:upstreams")
    end

    return upstreams_dict or {}
  end

@uflorin did you eliminate the issue with the above changes?

@bungle @Tieske @locao @hishamhm do you have any thoughts?

This is also an issue for us running Kong 2.0 in DBless mode inside of Kubernetes with 0.8.0 version of the ingress controller.

Same problem here, using Kong 2.0.3 db-less.
Kong tried to resolve my upstream as a DNS name :(

errorlog.log

cc @guanlan

Hi all,

We've verified the issue _is_ reproducible on 2.0.4 (and below), but _not reproducible_ on 2.1.0-alpha1. Here are the steps we took:

  1. Run a Kong 2.0.4 container:
docker run -ti -d --name kong -e "KONG_DATABASE=off" \
  -e "KONG_PROXY_LISTEN=0.0.0.0:8000" -e "KONG_ADMIN_LISTEN=0.0.0.0:8001" \
  -e KONG_PROXY_ERROR_LOG=/logs/error.log -e KONG_PROXY_ACCESS_LOG=/dev/null \
  -p 8000:8000 \
  -p 8001:8001 \
  -v "$PWD/logs:/logs" \
  kong:2.0.4
  2. Start a shell loop sending POST requests to /config:
for i in $(seq 1 500); do http :8001/config config=@../misc/kong.yml; done

(Find kong.yaml here).

  3. In a separate terminal window, send requests to the proxy port:
for i in $(seq 1 5000); do curl -s -o /dev/null -w "%{http_code}\n" localhost:8000; done

This will print the status code of all requests; lots of 503s will be seen. The logs in logs/error.log will contain errors like the following:

2020/06/01 15:52:18 [error] 22#0: *19380 [lua] balancer.lua:929: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)upstream1:(na) - cache-miss","upstream1:33 - cache-hit/dns server error: 3 name error","upstream1:1 - cache-hit/dns server error: 3 name error","upstream1:5 - cache-hit/dns server error: 3 name error"], client: 172.17.0.1, server: kong, request: "GET / HTTP/1.1", host: "localhost:8000"

If the same is attempted against 2.1.0-alpha1 (or with Kong master branch), no 503s will occur.

Note that the issue is also not reproducible in Kong master, since it includes https://github.com/Kong/kong/pull/5831.

Once again, thanks for all the feedback on this!

This issue was fixed by https://github.com/Kong/kong/commit/29a731a7d7f442e104720a3ef6fbdc8ca44c9b13, which is present in master and 2.1-beta. Closing this as resolved.

Recently, I have often been getting the error below in db-less mode. Kong only returns to normal after a restart.

2020/01/09 12:05:19 [error] 42406#0: *151691 [lua] balancer.lua:917: execute(): DNS resolution failed: dns server error: 3 name error. Tried: ["(short)upstream_8:(na) - cache-miss","upstream_8:33 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:1 - cache-miss/scheduled/querying/dns server error: 3 name error","upstream_8:5 - cache-miss/scheduled/querying/dns server error: 3 name error"], client: 192.168.63.1, server: kong, request: "POST /hr/1 HTTP/1.1", host: "gateway.corp.com:8000"

Has the problem been solved?

Still reproduced in my production environment with hybrid deployment mode (Kong version 2.1.3, deployed into a Kubernetes cluster). I have five DP pods, and one of them ran into this error suddenly. Recreating the pods fixed the issue temporarily, and I'm trying @Ehekatl's answer as a workaround.

Ran into the same problem with DB-less mode (Kong version 2.1.0).
After I added a service via the admin API, one of my three DP nodes got this error, which continued for more than two minutes; then I deleted the service via the admin API and everything was fine.

Maybe syncing the config between the CP and the DP leads to the cache being invalidated, and the cache is not rebuilt or the rebuild fails? I can't find more details in the error log.

I think I figured out what the issue is, but I'm not familiar with the code base, so I'll just share my findings without creating a pull request yet.

In /kong/runloop/balancer.lua, when the cache key "balancer:upstreams" is invalidated, the cache callback load_upstreams_dict_into_memory inside get_all_upstreams returns an empty dictionary, and this value gets saved into the cache.

If I force the cache key to be invalidated again when the dictionary is empty, it seems to work on the next execute call, though the first call will still fail with "name resolution failed".

Maybe you can judge better whether it should wait for singletons.db.upstreams to be re-populated, or whether it is something to do with the cache TTL.

get_all_upstreams = function()
    local upstreams_dict, err = singletons.core_cache:get("balancer:upstreams", opts,
                                                load_upstreams_dict_into_memory)
    if err then
      return nil, err
    end

    if #upstreams_dict == 0 then
      singletons.core_cache:invalidate_local("balancer:upstreams")
    end

    return upstreams_dict or {}
  end

This will make traffic proxying pretty slow, tested :(
