Gunicorn: Bypass connection queue for load balancer health checks

Created on 27 Dec 2016 · 25 Comments · Source: benoitc/gunicorn

It would be nice if gunicorn allowed application health check requests to bypass the connection queue.

By health check requests, I mean normal HTTP requests to a /health path or similar. These requests would hit the application like any other request; they would just not get queued in the listen queue (or maybe they would have their own queue).
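For illustration, here is a minimal sketch of what such an endpoint might look like in a plain WSGI application. Only the /health path comes from the description above; the name `app` and the response bodies are assumptions, not anything gunicorn prescribes.

```python
def app(environ, start_response):
    """Tiny WSGI app: /health goes through the same code path as any other request."""
    if environ.get("PATH_INFO") == "/health":
        body = b"ok\n"      # lightweight check, no IO or heavy CPU work
    else:
        body = b"hello\n"   # placeholder for the real application
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Because the health check is answered by the application itself, a broken application would still fail the check.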

The reasoning behind this is that you could set the health check interval much shorter than your timeout. For example, with a timeout of 60 seconds, you could use a 5-10 second interval for health checks (a health endpoint is often very lightweight, so there shouldn't be any IO/CPU delay on that).

Without this, you need to match it like: timeout = healthcheck interval * times until marked faulty.
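To make the relation concrete, here is a small arithmetic sketch. The function name and the load balancer numbers (3 failed checks before an instance is marked faulty) are hypothetical; only the 60-second timeout and the 5-second interval come from the text above.

```python
def time_to_mark_faulty(interval_s, failures_needed):
    """Seconds until a load balancer gives up: interval * consecutive failures."""
    return interval_s * failures_needed

# Health checks share the queue: the product has to cover the 60 s request timeout.
queued = time_to_mark_faulty(interval_s=20, failures_needed=3)    # 60

# Health checks bypass the queue: a short interval becomes safe to use.
bypassed = time_to_mark_faulty(interval_s=5, failures_needed=3)   # 15
```

Under these assumptions, a truly broken instance would be detected in 15 seconds instead of 60.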

uWSGI actually supports this by letting you listen on multiple sockets and then map specific sockets to specific workers.

Would this be too complicated for gunicorn? I think this should be a common need but maybe I'm missing something or over-thinking it?

All 25 comments

I'm not sure that I see the benefit and it would be a bit complicated, requiring new options, path-based routing in all the worker code, or maybe other things. And if it's Gunicorn and not the application code that responds to this route how does that help to know that the application is healthy? I may misunderstand, too, so feel free to clarify or link to uwsgi docs.

if it's Gunicorn and not the application code that responds to this route how does that help to know that the application is healthy

Those requests would still get routed to the application process as usual. So, if it was down or misbehaving, healthcheck requests would fail.

In uwsgi, you can do something like:

uwsgi-socket: /tmp/foobar.sock
http-socket: 0.0.0.0:56000
processes: 5
map-socket: 0:1,2,3,4
map-socket: 1:5

In my case, I route requests through nginx to uwsgi-socket but (Amazon ELB) healthchecks directly to http-socket. If nginx/uwsgi-socket get bloated and connections queued, that doesn't affect healthcheck requests.

Yeah, the process serving healthcheck is separate from the processes serving actual application requests. However, both live under the same uwsgi master process and get launched/managed similarly, so it's quite unlikely that healthcheck process would work while application processes would not, or the contrary. Of course, it's possible.

I'm not sure that I see the benefit and it would be a bit complicated, requiring new options, path-based routing in all the worker code, or maybe other things.

Yeah, path-based routing and QoS would be one option, but I think it would be simpler to "just" have two listening queues (listening to different sockets), that would route to the same wsgi application on a first-come-first-served basis?

(Why would I want something like this in gunicorn? Uwsgi is a bit too complicated for my needs and I'd rather use something simpler...)

You can have gunicorn bind to multiple addresses already. Is that sufficient for your use case?

For example: gunicorn -b unix:///tmp/gunicorn.sock -b 0.0.0.0:56000 myapp:app gives you something like what you show in your example.

Oh, I didn't realize that. Do they share the same listen queue though?

They do not share the same listen queue.

To the best of my knowledge, none of the worker types maintain an explicit queue beyond the listen backlog of the sockets, which are separate per socket at the OS level.
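A rough way to see the per-socket backlog at the OS level, using only the Python standard library. This is a sketch, not gunicorn code: the helper `make_listener` is made up for the example, and the ephemeral localhost ports stand in for the real ones.

```python
import socket

def make_listener(backlog):
    """Open a TCP listener on an ephemeral localhost port with its own backlog."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen(backlog)
    return sock

app_listener = make_listener(backlog=8)      # stands in for the "normal" socket
health_listener = make_listener(backlog=8)   # stands in for the health check socket

# Three clients connect to the "app" socket. Nobody calls accept() on it,
# so the kernel parks them in that socket's own queue.
queued_clients = [
    socket.create_connection(app_listener.getsockname()) for _ in range(3)
]

# A connection to the second socket can still be accepted right away.
health_client = socket.create_connection(health_listener.getsockname())
health_listener.settimeout(2.0)
health_conn, _peer = health_listener.accept()
health_accepted = health_conn.getpeername() == health_client.getsockname()

for s in [health_conn, health_client, *queued_clients, app_listener, health_listener]:
    s.close()
```

Note that this only shows the kernel-side queues being independent. In gunicorn, a worker still has to be free to call accept() on the second socket.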

Alright, sounds like it's all already implemented then.

Behavior will be slightly different from uwsgi: with gunicorn, (IO-)blocked workers will also block healthcheck requests (as there's no separate worker for them), but on the other hand, it will be the exact same worker serving both kinds of requests. So there are pros and cons, as usual. However, both allow healthcheck requests to bypass the listen queue, which is the major selling point.

Thanks for clearing things up @tilgovi!

You're welcome! If you use the threaded worker or asynchronous workers then even IO-blocked requests should not prevent the health check from succeeding.

By "threaded worker" do you mean gthread worker?

Yes.

About that, does it have any relation to this empty package index entry https://pypi.python.org/pypi/gthread?

It does not.

Right.

For anyone reading this: note that async workers are not the holy grail. If you allow the application to spin up, say, 100 connections, you may run out of DB/cache/whatever connections. Not to mention what happens if you allow 1000 connections (the default for the gevent/eventlet worker types in gunicorn). At least SQL databases are pretty conservative when it comes to how many concurrent connections they allow by default (and I have concerns about running 1000 concurrent transactions with locking etc.). Redis 2.6+ allows 10k, and other products may allow more or fewer... In any case, if you go async, remember to consider these.
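As a sanity check along these lines, something like the following could be worked out before going async. `worker_class`, `workers` and `worker_connections` are gunicorn setting names; `DB_MAX_CONNECTIONS` and all the numbers are hypothetical, and the sketch assumes the worst case where every in-flight request holds one database connection.

```python
# Values as they might appear in a gunicorn config file (sketch only).
worker_class = "gevent"
workers = 4
worker_connections = 25    # lowered from gunicorn's default of 1000

# Hypothetical limit on the database side; check your own server's setting.
DB_MAX_CONNECTIONS = 100

# Worst case: every concurrent request on every worker holds a DB connection.
max_in_flight = workers * worker_connections
fits_database = max_in_flight <= DB_MAX_CONNECTIONS
```

With the default of 1000 connections per worker, the same four workers could have up to 4000 requests in flight, far beyond the assumed limit.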

Hi @tuukkamustonen

I am having the same trouble. I realise that even with the solution provided by @tilgovi the queue is the same, meaning the healthcheck waits until the request sent before it has been processed.

Thanks in advance.

Arvind

If you set gunicorn to listen on two different sockets, then the backlogs (queues) for the sockets are separate and not shared. I don't know how it is determined which queue gets its requests in first (if they queue up). If it's round-robin across the socket queues, then the healthcheck request gets in faster, but it still needs to wait for an available worker first.

These days I'm running with _gevent_, so there's never really any backlog, as IO-delayed requests just idle in the background while new requests are taken in.

Hi @arvindrajan92

I too faced the same issue. Were you able to find any workaround for this?

Thanks
Akshita

@winnie-as why would you want that? The purpose of a health check is... to check the health of your server. So if it is blocked, then there is an issue. If your sync worker is blocking, you may want to revisit your app so that it does not block your worker too much. Basically, any blocking work should be done asynchronously. Or use an async worker. Or a combination of both.

The problem in this ticket was that let's say you're running:

  • A threaded worker with, say, 10 threads
  • 100 requests come in so first 10 of those get served
  • But they get "stuck" (because they're waiting for IO or CPU, for example)
  • So your app is working correctly, things are just going slow
  • Your /health check request comes in and gets queued (as 101st request)
  • After a while, the health check caller (probably load balancer) times out
  • ...and thinks your application is broken, marking the instance broken, terminating it and spinning a new one
  • All your instances get rotated like this
  • Requests queue up
  • Rotation continues
  • World ends
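To put rough numbers on the scenario above, here is a back-of-the-envelope sketch. It assumes strict first-come-first-served handling and that every request takes the same time; the 3 seconds per request and the function name are hypothetical.

```python
def wait_before_service(position, threads, service_time_s):
    """Seconds the request at 1-based `position` waits for a free thread,
    assuming FIFO order and an identical service time for every request."""
    return ((position - 1) // threads) * service_time_s

# 10 threads, 100 slow requests ahead, health check arrives as request 101.
health_check_wait = wait_before_service(position=101, threads=10, service_time_s=3)  # 30
```

With a health check timeout of, say, 5 seconds, a 30-second wait means the check fails even though the application is working.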

The solution to fix this already exists in gunicorn:

  • Have it listen to 2 sockets
  • One you use for normal requests (port 8000, for example)
  • One you use for health check requests (port 56000, for example)

Like:

gunicorn -b 0.0.0.0:8000 -b 0.0.0.0:56000 ...

Both sockets have their own backlog, so health check requests at 56000 get directly served even if there's queue at port 8000 (where "normal" requests go).
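The same setup could also be written as a gunicorn config file. This is a sketch: the ports and the thread count are the example values from this thread, and my understanding that `backlog` is applied to each listening socket separately is an assumption worth verifying against the gunicorn docs.

```python
# gunicorn.conf.py (sketch)

# Two listening sockets: one for normal traffic, one for health checks.
bind = ["0.0.0.0:8000", "0.0.0.0:56000"]

# Threaded worker, so a slow request does not block the whole worker.
worker_class = "gthread"
threads = 10

# Maximum number of pending connections (2048 is gunicorn's default).
backlog = 2048
```

It would then be started with something like `gunicorn -c gunicorn.conf.py myapp:app`, with the load balancer health check pointed at port 56000.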

Well, after re-reading what others have written :) I realize that @arvindrajan92 and @winnie-as are reporting that even with the dual-socket setup, health check requests sent to the secondary socket still get queued up?

tilgovi originally wrote:

To the best of my knowledge, none of the worker types maintain an explicit queue beyond the listen backlog of the sockets, which are separate per socket at the OS level.

That would mean there's no relation between the sockets, and this being broken would indicate an OS level issue (but I cannot say).

Most health checks can be configured to have a higher or lower timeout. If your application cannot serve a new request in the health check timeout period, that's the definition of unhealthy. I agree with @benoitc, the motivation for this seems lacking or suspect.

Appreciate your input here, though my experience differs.

I wouldn't want my app to get marked unhealthy, simply because queues are full and the application is handling the responses as fast as it can.

Consider a sudden spike of requests, for whatever reason.

There needs to be a shortcut / QoS route for health checks to get in. It would be a disaster if, under heavy load, the load balancer began interpreting the application as unhealthy when it's simply under load and waiting for scaling to kick in (adding new instances is not instant).

Sure, I could set the load balancer health check timeout to 60 seconds. But then it would be slower to detect an application that's actually broken. Having an overtaking lane is simply more convenient and lets you react faster.

I don't think there's perfect solution, though. Pros and cons, as usual.

If you can use connection backlog metrics from the OS, that could help you trigger scaling sooner. Or scale based on CPU usage, and trigger scaling as soon as your application is fully busy. It's probably best if your health checks reflect the client experience.

One approach in some systems, like Kubernetes, is to use a shorter timeout for readiness checks than for health checks, so that the container is not restarted but the load balancer gives it a break.

You should try to experiment with different metrics or probes to get the right scaling behavior, but I don't think gunicorn should try to apply some QoS priority to requests.

If you want to run a sidecar nginx in front, you could use that to serve health checks, too, if what you want is simply to know if the pod is reachable.

If you can use connection backlog metrics from the OS, that could help you trigger scaling sooner. Or scale based on CPU usage, and trigger scaling as soon as your application is fully busy. It's probably best if your health checks reflect the client experience.

Normally, we use CPU scaling, because that's the element that makes requests wait and that the server actually has impact on. Scaling by backlog metrics is simply more difficult, and we haven't needed that yet. Current model works fine.

I don't really disagree with what you say, there's just a compromise that needs to be made (perfect way vs cost effective way). The scaling is not normally instant (there's at least 1-2min+ delay with traditional EC2 like server setups, containers might work much faster) and so the instance simply needs to tolerate some slowness, without getting marked unhealthy. Maybe in an environment that can react faster, things could be different.

Yeah, autoscaling is always a bit tricky. Scaling at lower CPU levels is good if you can afford it, to always keep some capacity. I'm not sure what else to advise here. It's always imperfect. I'm not sure that a QoS for health checks would even help, and might cause other issues.
