Kong: Admin API GET /status "database.reachable" field inaccurately reports Cassandra DB connection health as "true" when the database is offline

Created on 22 Apr 2020 · 12 comments · Source: Kong/kong

Summary

After Kong has started, if the Cassandra database goes down, the Admin API /status endpoint still reports database.reachable=true.

Steps To Reproduce

  1. Start Kong with a Cassandra Database
  2. Turn off your Cassandra Database
  3. Call the Admin API /status endpoint
  4. Observe
{"database":{"reachable":true} ....

Even though nodetool status shows all nodes as down:

Datacenter: CTC
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  10.204.90.237   28.94 MiB  256          100.0%            de44f0ea-9ac4-4bef-ac05-c92926e6fc89  RACK3
DN  10.86.168.39    29.66 MiB  256          100.0%            42f16255-69c6-4aee-b39e-6f2c97513e56  RACK1
DN  10.86.177.62    28.2 MiB   256          100.0%            b62d5d30-0b88-4dd6-b1c1-4f800606a562  RACK2

Additional Details & Logs

  • Kong version: 1.4.3
  • Operating system: Alpine (Kong Alpine image)

Most helpful comment

This is where I would start looking:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L17-L79

If you turn off your Cassandra and this still succeeds:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L68

then it really is a bug, so I would look at this next:
https://github.com/Kong/kong/blob/master/kong/db/init.lua#L164

And then:
https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L334

Could it be that this:
https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L356

is not always checked for a live connection? Should that be fixed, or should this be made more robust:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L68-L73

E.g. actually use that connection and see if it works.
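To make that suggestion concrete, here is a minimal sketch of what a stricter check could look like. It is not the actual Kong implementation: it assumes kong.db:connect()/kong.db:close() and a connector query() method roughly like the ones in the files linked above, and uses the schema_meta table purely as an example of a cheap read.

```lua
-- Minimal sketch (not Kong's actual code): report reachable=true only if a
-- trivial read succeeds, not merely if a socket can be opened. Assumes
-- kong.db:connect(), kong.db:close() and kong.db.connector:query() behave
-- roughly as in the linked files; schema_meta is just an example table.
local function database_reachable(db)
  local connection, err = db:connect()
  if not connection then
    return false, err
  end

  -- Actually use the connection: consistency/availability problems surface
  -- here even though the socket itself opened fine.
  local rows, qerr = db.connector:query("SELECT key FROM schema_meta LIMIT 1")

  db:close()

  if not rows then
    return false, qerr
  end

  return true
end
```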

All 12 comments

Mr @bungle - I've been digging my way through https://github.com/Kong/kong/tree/master/kong/db trying to see just how Kong determines that reachable status after the init() - and am failing.

Any pointers? I'm more than willing to try a PR on this

I mean... Kong obviously notices that nodes become unreachable. It prints to STDOUT that it marks each node as down - there should just be some way to say "if all nodes are down, the database is now unreachable".

Or, fancier: "if unable to meet the consistency setting, the database is now unreachable".

This is where I would start looking:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L17-L79

If you turn off your Cassandra and this still succeeds:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L68

then it really is a bug, so I would look at this next:
https://github.com/Kong/kong/blob/master/kong/db/init.lua#L164

And then:
https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L334

Could it be that this:
https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L356

is not always checked for a live connection? Should that be fixed, or should this be made more robust:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L68-L73

E.g. actually use that connection and see if it works.

This is where I would start looking:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L17-L79

If you turn off your Cassandra and this still succeeds:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L68

[Screenshots attached: Screen Shot 2020-04-23 at 1:54:33 PM, Screen Shot 2020-04-23 at 1:55:26 PM]

2020/04/23 20:07:07 [error] 77#0: *2707118 [kong] kong.lua:42 [cassandra] could not execute page query: [Unavailable exception] Cannot achieve consistency level LOCAL_QUORUM, client: 10.129.112.1, server: kong_admin, request: "GET / HTTP/1.1", host: "kong:8001"

Could it be that this:
https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L356

is not always checked for a live connection? Should that be fixed, or should this be made more robust:
https://github.com/Kong/kong/blob/master/kong/api/routes/health.lua#L68-L73

E.g. actually use that connection and see if it works.

I'm thinking that it's never checked for a "live connection" right now.
But I almost want to make this smarter. Really, I don't care whether I can connect to some random C* DB node in my pool; I specifically care whether or not I have enough connections to satisfy my consistency settings. That's what actually makes the database functional.

The best way to test that may well just be a functional test, but I wouldn't want to make something too expensive on the DB...

It seems an actually reliable test of DB health is to hit the root path / of the Admin API. This returns HTTP 500 - Unexpected error occurred, with the following in STDOUT.

2020/04/23 20:07:07 [error] 77#0: *2707118 [kong] kong.lua:42 [cassandra] could not execute page query: [Unavailable exception] Cannot achieve consistency level LOCAL_QUORUM, client: 10.129.112.1, server: kong_admin, request: "GET / HTTP/1.1", host: "kong:8001"

This will fail not only when all DB nodes are unreachable, but also when consistency cannot be met - which is the kind of functional test I really like.
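As a rough illustration of that kind of check (and not the Admin API's actual code), a read issued at Kong's configured consistency level should fail with the same "Unavailable exception" whenever the cluster cannot satisfy it. A sketch using the lua-cassandra cluster module, with the contact point, keyspace and shared-dict name as placeholder values:

```lua
-- Hypothetical functional probe: run a cheap read at the consistency level
-- Kong is configured with, so "Unavailable" errors surface in the health
-- check instead of only on real traffic. Contact point, keyspace and shm
-- name below are placeholders.
local cassandra = require "cassandra"
local Cluster   = require "resty.cassandra.cluster"

local function db_functional()
  local cluster, err = Cluster.new {
    shm            = "cassandra",         -- a lua_shared_dict declared in nginx.conf
    contact_points = { "10.204.90.237" }, -- placeholder node
    keyspace       = "kong",
  }
  if not cluster then
    return false, err
  end

  -- The read itself is trivial; what matters is the consistency option.
  local rows, qerr = cluster:execute(
    "SELECT key FROM schema_meta LIMIT 1", nil,
    { consistency = cassandra.consistencies.local_quorum }
  )
  if not rows then
    -- e.g. "[Unavailable exception] Cannot achieve consistency level LOCAL_QUORUM"
    return false, qerr
  end

  return true
end
```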

@rsbrisci,

So it seems to fail here:
https://github.com/Kong/kong/blob/master/kong/api/routes/kong.lua#L40

Perhaps you could test on /status too. It might be a bit expensive if there is a huge number of plugins, but as we already have it at /, I guess it might make sense to have it in /status as well.

@bungle does Kong have a simple one-entry table in the keyspace, like a Kong metadata table (maybe containing the Kong/OpenResty/nginx versions or something similar that isn't intense to read), that could be queried by a light background polling task (maybe once every 30 seconds or so, or configured to run as often as the DB polling interval set in Kong) to inform that JSON status element? That would fix the bug, rather than hacking something that just works for us but leaves this field behaving incorrectly.

I suppose the band-aid way would be to do that lookup in the /status hot path, on some kind of meta info if it exists, or, as you mentioned, on that plugins call - but plugins is way too intense a DB call, imo, right? Querying all plugins could return thousands of records in our case, where we have 1000+ proxies and each proxy gets 3+ plugins or so. For now @rsbrisci has pointed us at the /upstreams resource as a simple query that fails if we have lost DB availability or accessibility from our Kong node.
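For what it's worth, a background probe along those lines can be sketched with OpenResty's timer API. This is only an illustration of the idea, not existing Kong code: the 30-second interval, the schema_meta query, the "kong" shared dict and the "db_available" key are all placeholder choices.

```lua
-- Hypothetical background availability probe. ngx.timer.every and shared
-- dicts are standard OpenResty facilities; the interval, the schema_meta
-- query, the "kong" shared dict and the "db_available" key are assumptions.
local POLL_INTERVAL = 30 -- seconds; could mirror Kong's db polling interval

local function probe(premature)
  if premature then
    return
  end

  local rows, err = kong.db.connector:query("SELECT key FROM schema_meta LIMIT 1")

  -- Cache the result so /status can report it without touching the DB.
  ngx.shared.kong:set("db_available", rows ~= nil)

  if not rows then
    kong.log.warn("database availability probe failed: ", err)
  end
end

local ok, err = ngx.timer.every(POLL_INTERVAL, probe)
if not ok then
  kong.log.err("failed to create availability probe timer: ", err)
end
```

The /status handler could then return the cached flag (e.g. as an availability field) alongside reachable, without adding a DB query to the hot path.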

This is not a bug; it is entirely expected behaviour so far for the reachable property.

As per its name, this property reports on whether or not Kong can _reach_ the database (i.e. does not encounter any connectivity issues), and does _not_ report on said database's "health". Monitoring the database's health is not one of Kong's responsibilities, and should be delegated to other monitoring tools, especially for a distributed database such as Cassandra where the definition of "healthy" is far from simply being a binary yes/no. Besides, no single query will determine the health of the Cassandra cluster with any certainty: there are too many variables for each query executed by a C* application (replication factor, consistency settings, built-in retry policies in the driver, etc...) so that if one succeeds, it absolutely does not guarantee that another query with a different partition layout and different CQL settings executed, say, on the proxy path, will actually succeed (e.g. even if some nodes are down, the "health" query could still succeed, while a proxy query may still fail).

Imho, the only thing that _could_ be reported in Kong's monitoring endpoint would be the number of C* nodes that are considered up and/or down by the driver. But again, such nodes could actually be _healthy_ yet _unreachable by Kong_ due to network connectivity issues. In my view, the closest we could get to monitoring C* nodes from Kong would be to track the reachability + the status of each node the driver is keeping track of, i.e. for each node Kong is connected to.

Also, note that currently, DB reachability does not bypass the cosocket connection pool (as noted in the code), which, as far as I can tell off the top of my head, isn't too much of an issue (if the connection is still open, the DB is still reachable). But the DB reachability test is also susceptible to Kong's DAO's stored connection mechanism, which could be an issue if kong.db:connect() is called without kong.db:close() before reaching the /status endpoint handler. This doesn't seem to be the case so far, though.

All the same, @thibaultcha, shouldn't it be counted as a connectivity issue when a Connection Refused is encountered while attempting to establish a connection on a DB node's specified port? That's the behaviour Kong will see when the Cassandra process has stopped on the remote DB hosts.

Or are we talking lower-level than that? "Reachable" in the sense that an ICMP ping to the host would succeed?

Either way... we currently have a generic GET to /upstreams included in our healthcheck monitor, in addition to calling /status. /upstreams is a completely arbitrary resource we picked because we don't often use it and the response payload is small; but it does seem to be a reliable proxy for C* database connectivity + functionality (including consistency), and it returns the 500 unexpected error when either condition is not met. It would be ideal to implement something in either /status or kong health to indicate whether the database is both reachable and functional.
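A probe of that kind could look roughly like the following. This is illustrative only: it assumes lua-resty-http is available and that the Admin API listens on 127.0.0.1:8001.

```lua
-- Illustrative external probe (not Kong code): a GET to /upstreams exercises
-- a real DB read, so a 500 here typically means the query behind it failed,
-- e.g. because the consistency level could not be achieved.
local http = require "resty.http"

local function admin_db_functional()
  local httpc = http.new()
  local res, err = httpc:request_uri("http://127.0.0.1:8001/upstreams?size=1", {
    method = "GET",
  })
  if not res then
    return false, err
  end

  return res.status == 200, "HTTP " .. res.status
end
```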

shouldn't it be counted as a connectivity issue when a Connection Refused is encountered when attempting to establish a connection on the DB node(s) specified port?

It should indeed, but since the driver has built-in support for connecting to healthy nodes only, what is needed is work in the C* driver itself to expose a low-level API allowing Kong to report individual C* nodes' health metrics. The reachable property still follows the same definition: whether or not a connection can be opened between Kong and at least one node of the database (current and expected behavior).

It would be ideal to implement something in either /status or kong health to indicate whether or not the database was both reachable _and_ functional.

Again, we can expose low-level health metrics from the driver, but expecting a general, binary "healthy"/"not healthy" result isn't realistic imho.
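If the driver did grow such an API, a per-node report for /status could be assembled from it along these lines. Note that get_peers() is a hypothetical driver call used purely for illustration, and the field names are placeholders.

```lua
-- Purely hypothetical sketch of a per-node health report.
-- get_peers() is an assumed driver API, not an established lua-cassandra
-- public function; peer.host and peer.up are placeholder field names.
local function cassandra_nodes_report(cluster)
  local peers, err = cluster:get_peers() -- hypothetical call
  if not peers then
    return nil, err
  end

  local report = { up = 0, down = 0, nodes = {} }

  for _, peer in ipairs(peers) do
    if peer.up then
      report.up = report.up + 1
    else
      report.down = report.down + 1
    end
    report.nodes[#report.nodes + 1] = { host = peer.host, up = peer.up }
  end

  return report
end
```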

Imho, the only thing that could be reported in Kong's monitoring endpoint would be the number of C* nodes that are considered up and/or down by the driver.

Is there a strong aversion to having a passive monitor within Kong, as a client to C*/Postgres, that keeps tabs on whether writes or reads are working based on Kong's current configuration? I agree it's not Kong's role to know the state of the C* cluster as a whole. Other keyspaces, other applications, etc. could all be using this Cassandra cluster, and that's not really in scope. But where I disagree is the idea that there is no value in knowing whether Kong itself is facing problems against a Cassandra cluster (or Postgres node) based on how it is currently configured (consistency settings, etc.). It's super helpful to know when Kong is failing consistently due to timeouts or being unable to achieve its consistency settings (down C* nodes), maybe due to some network change between Kong and the C* nodes such as firewalls or misconfigured low-level network routers (we monitor intra-node communication between C* nodes as well, so the monitoring we get from Kong gives us insight into network issues between Kong and C*). Right now this is sort of achieved by forcing a call to the Admin API /upstreams resource to force a read query (it does not confirm that writes work, but it's a start and enough for us).

This concept does seem out of scope from what the Kong team had in mind for the reachable field in the Admin API, though. Maybe a new field like availability, or something else that could be updated true/false, would work. My idea was a simple meta-info table Kong keeps with just some default info about the Kong cluster instance, and maybe a field it could even use for write attempts, so reads as well as writes could be verified working in the background (the ideal solution would be to not need a meta table at all, and instead determine availability passively from the regular read and write operations Kong performs every few seconds at runtime). It's certainly possible to make an out-of-process script that also connects to C* with consistency settings similar to Kong's and writes to and reads from some arbitrary keyspace, but that work seems redundant, and I think Kong users could heavily benefit from such a feature out of the box, exposed as just a simple Admin API field that toggles true/false depending on whether the reads/writes are successful.

I still believe that if, at runtime, we stop all C* node processes so the servers are no longer accepting connections on the DB port, I would define that as reachable=false in general terms, since a connection to the hosts cannot be successfully established at that point in time. Is there anything during runtime that would actually change reachable to anything other than true right now? It kind of seems like it's a field set once at startup, which only proves being able to reach the database at the beginning of a Kong node's lifespan (hopefully a long and healthy lifespan).

Hi, we are retagging this as a feature request because in this case Cassandra is indeed "reachable" (however the Cassandra driver defines that). It is conceded that reachable is not as useful as usable. So adding a usable flag (maybe as a separate field?), implementing what @jeremyjpj0916 suggests or using some other approach, is a feature request.
