Kong: Kong throws '404 - no Route matched with those values' because of corrupted Cache

Created on 24 Dec 2019 · 27 comments · Source: Kong/kong

Summary

When the system is under massive traffic and routes are created/updated/deleted via Kong's admin API frequently, Kong at times returns 404 - no Route matched with those values for a route path that exists.
Deleting a route occasionally returns 500: {"message":"An unexpected error occurred"}; when we retry the delete, it succeeds.
It happens randomly (about once a week) and persists for a while until the cache is rebuilt - therefore we assume it is related to a corrupted cache (as described here: https://github.com/Kong/kong/issues/5207)

Steps To Reproduce

  1. Start Kong
  2. Load traffic on the system (~ 500 TPS)
  3. Create and delete routes via Kong's admin API frequently (a sketch of such a churn loop follows this list)
  4. After a while, some of the NGINX workers in Kong start to return '404 - no Route matched with those values' for paths that exist
  5. Things recover after a while (~40 minutes) or after a Kong restart
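
For reference, a hedged Lua sketch of the kind of create/delete churn described in step 3, using lua-resty-http against the Admin API. The address, the service name "my-service" and the path are placeholder values, not anything from the original report.

    -- Illustrative only, not the reporter's actual tooling: churn one route on an
    -- existing service while separate load-test traffic hits the proxy port.
    -- Must run inside an OpenResty/ngx context, since lua-resty-http uses cosockets.
    local http  = require "resty.http"
    local cjson = require "cjson.safe"

    local ADMIN = "http://127.0.0.1:8001"   -- placeholder Admin API address

    local function churn_route()
      local httpc = http.new()

      -- create a route on a pre-existing service (placeholder name "my-service")
      local res, err = httpc:request_uri(ADMIN .. "/services/my-service/routes", {
        method  = "POST",
        body    = "paths[]=/churn-test",
        headers = { ["Content-Type"] = "application/x-www-form-urlencoded" },
      })
      if not res then
        return nil, err
      end
      if res.status ~= 201 then
        return nil, "create failed with status " .. res.status
      end
      local route = cjson.decode(res.body)

      -- ...and delete it again shortly after
      local del, derr = httpc:request_uri(ADMIN .. "/routes/" .. route.id, {
        method = "DELETE",
      })
      -- 204 is the normal outcome; the sporadic 500 on DELETE is the failure
      -- mode described in the details below
      return del and del.status, derr
    end

    return churn_route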

Additional Details & Logs

  • Kong version: 1.4
  • Operating system: CentOS Linux release 7.7.1908
  • Database: Cassandra
  • Occurs after getting 500: {"message": "An unexpected error occurred"} from Kong's admin API while trying to delete a route. (We have a retry mechanism: whenever we get this error we retry up to 5 times, with a few seconds of waiting between attempts; see the Lua sketch after this list.)
  • We install Kong on a K8S cluster (Helm deployments)
  • We run 3 deployments of Kong connected to the same Cassandra:

    1. ControlPlan - for all REST API calls

    2. DataPlan - for incoming data ingestion (a specific REST API that is used more heavily)

    3. UserExperience - for all admin operations (like creation and deletion of routes)

  • ControlPlan and DataPlan Kong deployments have 6 pods each (replica=6) over K8S; UserExperience has 2
  • Kong related config params:

    1. ControlPlan:

      cassandra_schema_consensus_timeout = 100000
      cassandra_repl_factor = 3
      cassandra_data_centers = westus:3,eastus:3
      cassandra_local_datacenter = westus
      db_update_propagation = 2
      db_update_frequency = 15
      mem_cache_size = 1024m
      cassandra_ssl = off
      cassandra_ssl_verify = off
      cassandra_timeout = 35000
      anonymous_reports = off
      cassandra_consistency = LOCAL_QUORUM
      cassandra_lb_policy = RequestDCAwareRoundRobin
      cassandra_repl_strategy = NetworkTopologyStrategy
      router_consistency = strict
      router_update_frequency = 1

    2. DataPlan:

      cassandra_schema_consensus_timeout = 100000
      cassandra_repl_factor = 3
      cassandra_data_centers = westus:3,eastus:3
      cassandra_local_datacenter = westus
      db_update_propagation = 2
      db_update_frequency = 28
      mem_cache_size = 1024m
      cassandra_ssl = off
      cassandra_ssl_verify = off
      cassandra_timeout = 35000
      anonymous_reports = off
      cassandra_consistency = LOCAL_QUORUM
      cassandra_lb_policy = RequestDCAwareRoundRobin
      cassandra_repl_strategy = NetworkTopologyStrategy
      router_consistency = strict
      router_update_frequency = 1

    3. UserExperience:

      cassandra_schema_consensus_timeout = 100000
      cassandra_repl_factor = 3
      cassandra_data_centers = eastus:3,westus:3
      cassandra_local_datacenter = westus
      db_update_propagation = 2
      db_update_frequency = 7
      mem_cache_size = 1024m
      cassandra_ssl = off
      cassandra_ssl_verify = off
      cassandra_timeout = 15000
      anonymous_reports = off
      cassandra_consistency = LOCAL_QUORUM
      cassandra_lb_policy = DCAwareRoundRobin
      cassandra_repl_strategy = NetworkTopologyStrategy
      router_consistency = eventual
      router_update_frequency = 1

  • NGINX related config params (for all 3 deployments):
    lua_shared_dict kong 5m;
    lua_shared_dict kong_db_cache 512m;
    lua_shared_dict kong_db_cache_miss 128m;
    lua_shared_dict kong_locks 8m;
    lua_shared_dict kong_process_events 5m;
    lua_shared_dict kong_cluster_events 5m;
    lua_shared_dict kong_healthchecks 5m;
    lua_shared_dict kong_rate_limiting_counters 12m;
    lua_shared_dict kong_cassandra 5m;
    lua_socket_log_errors off;
  • All pods are configured with worker_processes 1; for their NGINX workers
    We did this because we assumed there was a collision in shared memory between the NGINX workers (mutexes? semaphores?) causing this problem. After reducing the NGINX workers to one, the problem still exists but occurs far less often (once a week instead of once a day)
  • We went through a whole evolution of configuration changes trying to fix this issue:
  • First, we set router_consistency = eventual for all deployments with a low update frequency
  • Then, we changed the most frequently used Kong deployments (ControlPlan and DataPlan) to router_consistency = strict
  • After that, we reduced the number of NGINX workers from 8 to 4 (ControlPlan and DataPlan deployments: 3 pods each, UserExperience: 2 pods)
  • At last, we reduced the number of NGINX workers to 1 and scaled ControlPlan and DataPlan to 6 pods each (the UserExperience portal still has 2 pods)
  • As a workaround, we are considering not deleting any routes from now on; instead, we would PATCH them to a path that does not exist (/dev/null) and once in a while delete all routes with this path (see the Lua sketch after this list). We hope this will keep the Kong deployments stable without any occurrence of this issue.
  • related issues:

    1. https://github.com/Kong/kong/issues/4055

    2. https://github.com/Kong/kong/issues/5207
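
For illustration, a hedged Lua sketch (lua-resty-http; the Admin API address, the 5 attempts, the 2-second wait and the /dev/null path mirror what is described above but are placeholder values) of the retry-then-park logic referred to in the bullets above: retry the DELETE a few times and, as the workaround, PATCH the route onto a dead path instead of deleting it.

    -- Sketch only; assumes an OpenResty context and lua-resty-http.
    local http = require "resty.http"

    local ADMIN     = "http://127.0.0.1:8001"
    local DEAD_PATH = "/dev/null"            -- parked path used instead of deleting

    local function delete_route_with_retry(route_id)
      local httpc = http.new()

      for attempt = 1, 5 do
        local res, err = httpc:request_uri(ADMIN .. "/routes/" .. route_id, {
          method = "DELETE",
        })
        if res and (res.status == 204 or res.status == 404) then
          return true                        -- deleted (or already gone)
        end
        ngx.log(ngx.WARN, "delete attempt ", attempt, " failed: ",
                tostring(err or (res and res.status)))
        ngx.sleep(2)                         -- a few seconds between attempts
      end

      -- workaround: park the route on a path nobody calls instead of deleting it
      local res, err = httpc:request_uri(ADMIN .. "/routes/" .. route_id, {
        method  = "PATCH",
        body    = "paths[]=" .. DEAD_PATH,
        headers = { ["Content-Type"] = "application/x-www-form-urlencoded" },
      })
      return res ~= nil and res.status == 200, err
    end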

Are you guys familiar with this kind of issue?

Labels: pending author feedback, needs-investigation

All 27 comments

I think the cache is not the issue. It is just that too many workers try to update and it takes time. So why don't you always use eventual?

@bungle It also happened with eventual; also, the same was tested with one worker.

Hi @bungle, thanks for your reply.

  1. We believe the cache is the issue because when we validate the data in Cassandra, everything is there and valid (so we came to the conclusion that the cache memory is corrupted and that is why the routes are not found).
  2. We started with 8 workers and then reduced to 4 - it still happened; now it has been reduced to one worker and it is still happening. Reducing the workers makes it occur less frequently, but it still occurs.
  3. As I wrote, we started with eventual and came across the problem about once an hour; after changing the DataPlan and ControlPlan Kongs to strict, the problem still occurs, but less frequently.

I guess that needs further investigation. What I mean is that it is not necessarily a corruption issue; you are just pushing the workers, e.g. you put load on them while triggering constant router/plugin rebuilds. You seem to be creating and deleting a route, thus some rebuilds will get a router without that route (giving that 404) and some with that route (or is it about some other route that you are not creating or deleting?). It could easily be that Cassandra is the bottleneck there, as you say things settle after some time. But sure, there can be issues. I remember we removed one check that I originally proposed for rebuilds, which included a check whether the router was changed during the rebuild (a paging related issue).

Thank you for your response.

I want to understand:
When working in eventual mode, my understanding is that the new cache is built in the background and then Kong "switches" the caches to the newly built one. I would not expect to see any 404 on old/existing (not deleted) routes at any given time.

or is it about some other route that you are not creating or deleting?

The 404s are on existing routes that are not related to the CRUD ones.

It could easily be that Cassandra is the bottleneck there, as you say things settle after some time

When we observe these 404s it does not feel like something "in transition"; the reason I am saying that is that it lasted almost an hour.
Cassandra-wise, the data was updated and there is no evidence of resource concerns (memory, CPU, IO, etc.).

I remember we removed one check that I originally proposed for rebuilds, which included a check whether the router was changed during the rebuild (a paging related issue)

Our guess is that one of the subsequent "cache rebuilds" managed to build the complete in-memory routing cache and resolve the 404s by itself. The problem, of course, is critical, as Kong rejects valid requests during these sporadic, long periods.

How can we help to promote the investigation?

A new cache is built based on certain events, e.g. deleting a route causes it. Now if you did a delete and a create, that means two events, both of which trigger a rebuild. So it can be that you got a rebuild with only the delete in it and not the create. It is even possible that you got both the create and the delete in, but we skipped some other route because of paging. That is the thing I was referring to about detecting changes. We will then start a new rebuild right after, but if you keep constantly creating and deleting routes, this flux stays. If we added detection of changes during rebuild, that would push the new router further out, as the build can never finish while you keep changing things during it.
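
To make that reasoning concrete, here is a tiny standalone Lua simulation (plain tables, not Kong's actual code) of how back-to-back delete and create events each trigger a full rebuild, so the router built between them answers 404 for the route even though it exists again shortly after.

    -- Standalone simulation; run with any Lua 5.1+ interpreter.
    local db = { ["/orders"] = true, ["/users"] = true }  -- pretend routes table

    local function rebuild_router()
      -- snapshot whatever is in the "database" at this moment
      local router = {}
      for path in pairs(db) do router[path] = true end
      return router
    end

    local routers = {}

    -- event 1: route /users is deleted, which triggers a rebuild
    db["/users"] = nil
    routers[#routers + 1] = rebuild_router()

    -- event 2: route /users is re-created, which triggers another rebuild
    db["/users"] = true
    routers[#routers + 1] = rebuild_router()

    -- a worker still serving the first intermediate router answers 404 for
    -- /users even though the route exists again in the database
    print(routers[1]["/users"])  -- nil  -> 404
    print(routers[2]["/users"])  -- true -> matched once the newer router is used

A worker that keeps serving the first intermediate router until the next swap behaves exactly like the 404s described in the report.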

How many Kong nodes do you run during the test?

We have 4 K8S worker nodes.
On these nodes, there are 6 K8S pods of ControlPlan, 6 K8S pods of DataPlan and 2 K8S pods of UserExperience (I remind you that each pod runs 1 NGINX worker).

@bungle , Can you please elaborate on the paging issue which may cause the problem of the missing routes?

Many thanks.

@al942u

The pagination issue is this:
https://github.com/Kong/kong/blob/next/kong/runloop/handler.lua#L638

for route, err in db.routes:each() do

That actually may make multiple database calls, and inside that loop there are other possible db calls.

So think of it like this: it got 1000 rows in the first call, and while it processes those rows you create a new row. If that row got inserted into the first page, we will miss that route. There are other cases too, where we might even skip a route that nobody touched (a more severe one).
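
A minimal standalone Lua simulation of that pagination race (plain tables and naive offset paging standing in for the real Cassandra/DAO iteration): a row created on an already-consumed page never makes it into the router being built.

    -- Simulation only; real paging uses DB-specific tokens, not offsets.
    local rows = { "route-a", "route-b", "route-c", "route-d" }
    local PAGE_SIZE = 2

    local function fetch_page(offset)
      local page = {}
      for i = offset + 1, math.min(offset + PAGE_SIZE, #rows) do
        page[#page + 1] = rows[i]
      end
      return page
    end

    local seen, offset = {}, 0
    while true do
      local page = fetch_page(offset)
      if #page == 0 then break end
      for _, route in ipairs(page) do seen[route] = true end

      if offset == 0 then
        -- a concurrent Admin API call inserts a route that lands on the first
        -- page, which this iteration has already consumed
        table.insert(rows, 1, "route-new")
      end
      offset = offset + PAGE_SIZE
    end

    print(seen["route-new"])  -- nil: the new route is missing from this rebuild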

And finally here we swap the router:
https://github.com/Kong/kong/blob/next/kong/runloop/handler.lua#L664

And as you can see, between the :each line and that line we don't check whether the router changed while we were building it. That could be seen as an invalid-state router in some cases. We swap the router in and then start a new build round, but that one router could be invalid for a little while; thus for you it seems like 40 minutes.

On my original proposal you can see things like this:
https://github.com/Kong/kong/blob/1229d6bae7a584688ed634625622472d00f10376/kong/runloop/handler.lua#L775-L778

e.g. a recursive build. It cancels the build and starts a new one if change is detected. But we didn't end up having that.

Now, I am not sure if your issue is because of this, but let's see: you have some 14 pods running, all of which may potentially be polling the DB and doing router rebuilds. On each machine you could have 8 workers, which means that you can have over a hundred rebuilds running all the time, and if they are slow, they can start piling up. Especially when you run strict, as that can also start new builds on requests, while eventual only does it in the background and doesn't pile up.

I am not sure if your use case is a real-world example, or if you are just testing something. I mean, it is quite uncommon to create and delete routes all the time (e.g. multiple times per second). What is your use case for doing so?

Hi @bungle ,
Thank you for your quick reply.

What is your use case for doing so?

Our use case is a production system where CRUD may not happen for a few hours, or may happen multiple times per second for a short period of time.

e.g. a recursive build. It cancels the build and starts a new one if change is detected. But we didn't end up having that.

This change is less suitable for us because we might get a burst of rebuilds (it could happen because we have multiple route creations per second)

Does the paging issue happen in Postgres as well?

Do the NGINX workers in the Kong container share the same cache memory, or does each one have its own memory (and rebuild its own cache)? We ask because we notice that with multiple NGINX workers, some of the workers have a corrupted cache while the others work as expected.

This change is less suitable for us because we might get a burst of rebuilds (it could happen because we have multiple route creations per second)

How about this change:
https://github.com/Kong/kong/commit/1de9c229eef47f1c168ab6fa1d2b8001db5aacc1

It does check for changes while rebuilding, but instead of recursively rebuilding, it just cancels the build, because the stale build could result in an invalid-state router in some edge case. The version is not updated, so with eventual a new rebuild attempt is started after router_update_frequency.
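
In pseudocode terms, the approach in that commit looks roughly like the hedged Lua sketch below (not Kong's actual code; fetch_all_routes and the version counter are stand-ins for the real DAO iteration and update-event bookkeeping).

    -- Sketch of "detect change during rebuild, cancel instead of swapping".
    local router_version = 0   -- incremented elsewhere on every route CRUD event
    local current_router       -- the router currently used for proxying

    local function rebuild_router(fetch_all_routes)
      local version_at_start = router_version

      local routes = fetch_all_routes()    -- may span many paginated DB calls

      if router_version ~= version_at_start then
        -- routes changed while we were iterating: this build may be inconsistent,
        -- so drop it; the next router_update_frequency tick will try again
        return nil, "router was changed while rebuilding it"
      end

      current_router = routes              -- safe to swap in
      return true
    end

The error string matches the log line discussed later in this thread, which is how you can tell from the logs that this path was taken.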

We also need to ensure that you are testing correctly. It feels more like a "worker event didn't fire on a single worker" issue than this rare paging issue. Can you try with Kong 2.0.0rc1 and with this: https://github.com/Kong/kong/commit/1de9c229eef47f1c168ab6fa1d2b8001db5aacc1

Just to make sure it is not that.

Can you describe your test situation a bit more?

Is 404 happening on some workers with newly added routes? Or randomly even on a route that is not updated? Do you also see Kong proxying to deleted routes? Do you see any errors in your Kong logs?

How fast are you trying a newly added route?

Remember there is db_update_frequency, which at worst makes other nodes out of sync for that period plus some rebuilding time, even with strict, as strictness only applies to a single node, not to two or more nodes (but you said that you have some workers out of sync even on the same node). Then there are also worker events; here are the defaults for those:

    local ok, err = worker_events.configure {
      shm = "kong_process_events", -- defined by "lua_shared_dict"
      timeout = 5,            -- life time of event data in shm
      interval = 1,           -- poll interval (seconds)

      wait_interval = 0.010,  -- wait before retry fetching event data
      wait_max = 0.5,         -- max wait time before discarding event
    }

See description here:
https://github.com/Kong/lua-resty-worker-events#configure

So depending on load I think some events might get lost. @Tieske can you give more internals on this?

Does the paging issue happen in Postgres as well?

Yes.

Do the NGINX workers in the Kong container share the same cache memory, or does each one have its own memory (and rebuild its own cache)? We ask because we notice that with multiple NGINX workers, some of the workers have a corrupted cache while the others work as expected.

Every worker rebuilds its own router using worker memory. They do not share it with other workers via shared memory.

Hi @bungle ,
Sorry for the delayed response.

In previous comments I answered some of your questions, but let me elaborate here:

  • First, our case is related to "old" routes that did not change at all (not the ones currently being created or deleted). Therefore we tend to agree with the paging issue rather than the event-not-firing-on-a-specific-worker issue. BTW, the corrupted worker(s) worked fine and then suddenly stopped proxying old existing routes

We also need to ensure that you are testing correctly...
Can you describe your test situation a bit more?...

We are unable to reproduce this issue on our look-alike systems. It happens only in production, unpredictably, usually about once a week. That is why it is a rare edge case, but it is critical for us because we cannot have downtime on our APIs.

Since we switched from DELETE-ing routes to PATCH-ing them to a non-existent path, we have not come across this issue. This is not a solution for us, but we tend to believe it has to do with the dynamic deletions that lead to the paging issue.

Is 404 happening on some workers with newly added routes? Or randomly even on a route that is not updated?

The 404 happens on random routes that are not related to the CRUD ones.

Do you also see Kong proxying to deleted routes?

No.

Do you see any errors in your Kong logs?

No, only the unexpected error that occurs while trying to delete a route. Actually, about 10% of the route deletions fail on the first attempt (i.e. out of 100 deletions, 10 will fail), and when retrying we succeed (we have a retry mechanism).

How fast are you trying a newly added route?

It depends on the user, but usually 30 seconds (or more) later (and it works).

@al942u, any chance you could try #5429 OR #5431 (compatible with 1.4)?

@al942u also, if possible could you let us know how many Routes you have configured in your database?

This will help us know if the router rebuild iteration needs to do pagination on the DB, and will give us a better idea whether the root cause is the one addressed in #5429.

Hi @bungle @hishamhm - the issue occurs sporadically. The holiday season was quiet - I assume our customers were not making changes that involve route changes in the system. Currently we have around 13K routes and counting. We started to observe it again; we are linking it to DELETE ops. We stopped deleting routes (we use PATCH to direct the route to /dev/null); however, we suspect the issue now occurs because we DELETE consumers/plugins. Generally speaking, we expect the number of cache entries to continue to rise. From what we see, we really suspect DELETE operations are causing it.
We were unable to reproduce it on our look-alike systems. @bungle, I was reading your #5431 summary description - it seems the cache can still get out of sync; nevertheless, we'll give it a try and let you know. It may take us a while as it's an edge case. Can I expect "router was changed while rebuilding it" in the logs? Is it an info-level thing?

@nesia-amit yes, you should get an [error]-level log containing "router was changed while rebuilding it" if this case is hit. #5431 was included in Kong 2.0.0. Thanks for the feedback; it would be great if you could keep an eye on that and let us know if that line appears and whether the problem persists or not.

Hi @hishamhm ,
We upgraded Kong to the version you suggested. We will track the logs and see if the issue repeats.

We will update with any progress.

Thanks.

@al942u Thanks! Did you get any more info from the logs?

Hi @hishamhm ,
We haven't seen any occurrence of the issue since 💯 (it takes time to reproduce); however, we did see the following other issue:
When we try to delete an OAuth2 application that does not exist, it returns HTTP status 500 instead of 404/204 like the other APIs do:

curl -i -X DELETE 10.2.112.9:8001/consumers/27ebaaad-9000-4ef7-a1c5-4080f63e55d3/oauth2/not-exists-oauth2-app

^^^ HTTP status returned: 500 (The consumer exists BTW)

curl -i -X GET 10.2.112.9:8001/consumers/27ebaaad-9000-4ef7-a1c5-4080f63e55d3/oauth2/not-exists-oauth2-app

^^^ HTTP status returned: 404

Kong logs related to the 500 returned:

2020/02/11 15:08:52 [error] 32#0: *5885517 lua coroutine: runtime error: /usr/local/share/lua/5.1/kong/db/schema/init.lua:1995: attempt to index local 'entity' (a nil value)
2020/02/11 15:08:52 [error] 32#0: *5885517 [lua] api_helpers.lua:422: handle_error(): /usr/local/share/lua/5.1/lapis/application.lua:401: /usr/local/share/lua/5.1/kong/db/schema/init.lua:1995: attempt to index local 'entity' (a nil value)

We suspect that the first time we deleted the OAuth2 application we also got a 500, but Kong actually deleted it. Then, on the retries, we got 500 because it no longer existed.

Hi @al942u,

@vasuharish was able to finally find and fix the issue that we have tried to chase for so long. Today I merged #5564 that should also fix this issue, thus I am closing this. Please reopen if you see the issue persisting even after applying that.

@bungle @vasuharish Thank you for the info and the fix; it seems to make sense for the 404 and the invalid cache entry issue we saw. We will give the fix a try. Do you have any idea when the next release is scheduled?

@UnixBoy1, you shouldn't have to wait for too long for 2.0.2 or 2.1.0.

Amazing fix! Will it also be included in a 1.x release?

@bungle testing the new release, will update if any issues are found.

@UnixBoy1 @bungle, Kong's team has not updated the Docker image on Docker Hub with the related release. The latest tag is for release 2.0.1.

@bungle, can you tell us when the Docker image for release 2.0.2 (with the fix) will be on Docker Hub?

Many thanks.

CC @nesia-amit

This has been delayed because of lack of communication.
The work can be tracked here: https://github.com/docker-library/official-images/pull/7534

The new version 2.0.2 has been installed and running in prod with router_consistency = eventual for a few days now; we will monitor and let you know if we see any issues. For now it seems to be fixed.
