Ingress-nginx: Service latency degradation

Created on 16 Jan 2018 · 13 comments · Source: kubernetes/ingress-nginx

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

NGINX Ingress controller version: 0.9.0 (quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0)

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:23:29Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: bare-metal
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel:
Linux srv0551 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3+deb9u1 (2017-12-23) x86_64 GNU/Linux
  • Others: calico v2.6.3

What happened:
We use the ingress controller for almost all our services, and one of them has a heavy load of about 30k rps. This traffic is handled by 3 ingress controllers (three separate physical servers, one controller each). We have a problem with the service deploy process: when any service that uses the same ingress controller updates its pods (endpoints), the ingress controller reloads its configuration, and at the same time latency increases for all network communication on every service that uses these ingress controllers (even from a service to its database, and even to localhost). In our monitoring we see a growing number of established connections and increased latency.

What you expected to happen:
The deploy process shouldn't affect other services.

How to reproduce it (as minimally and precisely as possible):
The easiest way to reproduce it is to deploy a service, start any load-testing tool against it, and then run nginx -t on the ingress controller. Because the controller tests the configuration before each reload, nginx -t is invoked every time any service is deployed. Run nginx -t a few hundred times and you will see the degradation (a rough reproduction sketch follows the metrics below). These are our metrics:
[Monitoring screenshots from 2018-01-16 12:17:00 and 12:17:36 showing the latency and established-connection spikes.]
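
For anyone who wants to try this, here is a rough sketch of the reproduction in shell. The namespace, pod label, and the wrk load tool are my assumptions and will need to be adapted to your cluster.

```bash
# Sketch of the reproduction described above (namespace, label and load tool are assumptions).
# 1. Generate sustained load against a service that is routed through the ingress.
wrk -t4 -c200 -d10m https://my-service.example.com/ &

# 2. Repeatedly run the nginx config test inside one controller pod; this is what the
#    controller itself does before every reload triggered by endpoint changes.
POD=$(kubectl -n ingress-nginx get pods -l app=ingress-nginx \
  -o jsonpath='{.items[0].metadata.name}')
for i in $(seq 1 300); do
  kubectl -n ingress-nginx exec "$POD" -- nginx -t >/dev/null 2>&1
done

# 3. Watch p99 latency and the number of established connections in your monitoring
#    while the loop runs; the degradation should show up under load.
```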

Anything else we need to know:
This problem can only be reproduced under heavy load, starting from about 5k rps to a single ingress controller.
I also saw the following issue, and we have this patch applied, but we still have problems: https://github.com/kubernetes/kubernetes/issues/48358

help wanted · priority/important-soon


All 13 comments

Also, removing the reuseport option from the ingress template slightly improves the situation: latency now rises from 2-3ms to about 15ms, but we still observe degradation.
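
For context, here is a quick way to confirm whether the rendered configuration still carries the directive. The namespace, pod label, and the path of the generated config are assumptions about a typical deployment.

```bash
# Check whether the generated config still contains the reuseport option
# (namespace, label and config path are assumptions).
POD=$(kubectl -n ingress-nginx get pods -l app=ingress-nginx \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n ingress-nginx exec "$POD" -- grep -n 'reuseport' /etc/nginx/nginx.conf
# With the option removed from the template, the listen directives should no longer show it.
```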

Also, removing the reuseport option from the ingress template slightly improves the situation

I will add an option to make this configurable via the ConfigMap.

This problem can only be reproduced under heavy load, starting from about 5k rps to a single ingress controller.
I also saw the following issue, and we have this patch applied, but we still have problems: kubernetes/kubernetes#48358

The only way we can fix this is to remove the need to reload nginx on endpoint changes. Right now the only way to do that is to use Lua, more specifically balancer_by_lua_block.
This is the reason why I added Lua support back in the custom nginx image: https://github.com/kubernetes/ingress-nginx/pull/1852
After the next release I will start a POC to test/prove whether Lua lets us solve this issue once and for all.

Since PR #2174 will implement the "no reload" feature, I'm asking here what the next steps will be.
I mean, this issue is still present if you add virtual hosts or SSL certs while you already have many virtual hosts running. In the worst case, not only does latency increase, but nginx processes can end up completely stuck, serving the old configuration for a long time (the only way to fix it is to kill the ingress controller container).

We already discussed this issue while I was proposing to implement Lua for virtual hosts and SSL too.

@aledbf @ElvinEfendi were involved in the discussion. What are the next steps?

@valeriano-manassero thanks for starting the discussion. IMHO the next immediate step after https://github.com/kubernetes/ingress-nginx/pull/2174 gets merged should be coming up with a way to test Lua code. One option here would be to use the current e2e test framework.

After the test process is in place I suggest we focus on the remaining tasks in the following priority:

  1. Match all the load-balancing (LB) functionality Nginx offers
  2. Add support for dynamic SSL (using certificate_by_lua)
  3. Implement dynamic virtual host changes

@Lookyan can you test quay.io/aledbf/nginx-ingress-controller:0.343, adding the flag
--enable-dynamic-configuration to the deployment?
This image contains the next release, 0.12.0.
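
For reference, one way to roll that image and flag onto an existing install. The deployment name, container name, namespace, and the assumption that the flag can simply be appended to the existing args are mine, not from the comment above.

```bash
# Point the existing deployment at the test image (names are assumptions).
kubectl -n ingress-nginx set image deployment/nginx-ingress-controller \
  nginx-ingress-controller=quay.io/aledbf/nginx-ingress-controller:0.343

# Append the flag to the controller container's args (container index 0 assumed).
kubectl -n ingress-nginx patch deployment nginx-ingress-controller --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enable-dynamic-configuration"}]'
```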

Yes, thank you, I'll test it.

@ElvinEfendi @aledbf sorry, I was very busy over the last few days. Regarding dynamic reconfiguration, I was in a hurry to deploy something usable for my company, so I was forced to create my own fork:
https://github.com/NCCloud/fluid

There are, obviously, huge differences from the point where I forked, but I hope we can bring this back together in the future. Feel free to comment or suggest a way to merge those functionalities here.

@valeriano-manassero some good stuff in that fork! It would be great to get consistent hashing and dynamic certificate features ported to this repository from your fork.

This week I'm going to refactor balancer.lua a bit and document how to add a new load balancer algorithm.

Closing. This is fixed with the new dynamic configuration feature. We will enable this feature by default in a couple of releases.

@Lookyan have you tried --enable-dynamic-configuration? It would be great to hear your experience with it.

Sorry, forgot to write here.
Yes, dynamic reconfiguration helps. We only have problems when new services are deployed, since that still triggers an nginx config reload.
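
For anyone measuring this, here is a rough way to see how often the controller still reloads after enabling dynamic configuration. The exact log wording varies between versions, and the namespace and pod label are assumptions.

```bash
# Count controller log lines that mention a reload (exact message text varies by version).
POD=$(kubectl -n ingress-nginx get pods -l app=ingress-nginx \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n ingress-nginx logs "$POD" | grep -ci 'reload'
```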

@Lookyan did you try the Fluid fork (https://github.com/NCCloud/fluid)? We had the same issues and tried to avoid reloads entirely. The fork is not perfect, but maybe it can help you.
