Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):
No
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):
Some relevant links:
https://github.com/kubernetes/ingress-nginx/issues/4174
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
NGINX Ingress controller version:
0.25.1/0.21.0
Kubernetes version (use kubectl version):
1.15.3/1.10.11
Environment:
bare-metal/VM
What happened:
proxy_next_upstream is not working in either ingress-nginx 0.21.0 or 0.25.1.
Without it, we have no timely failover mechanism in ingress-nginx.
What you expected to happen:
No 504s or other errors.
In practice, errors did occur.
How to reproduce it (as minimally and precisely as possible):
(1) Set up two pods for the same deployment, running on two different nodes.
(2) Confirm that the following settings have been applied to ingress-nginx's nginx.conf via the ConfigMap (a sketch of how to apply them follows after this list):
proxy_next_upstream error timeout http_504 http_503 http_500 http_502 http_404 invalid_header;
proxy_next_upstream_timeout 2;
(3) Loop curl -H "Host:xxx" ingressIp:port (yes, it is a GET, so this is not the same issue as #4174).
(4) Reboot one node.
(5) For 40+ seconds, roughly 50% of requests fail with 504 Gateway Timeout.
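For reference, a minimal sketch of how the settings in step (2) can be applied and verified globally. It assumes the default deployment layout (a ConfigMap named nginx-configuration in the ingress-nginx namespace) and uses a placeholder for the controller pod name; adjust both to your environment.

# Hypothetical sketch: set the retry behaviour globally via the controller ConfigMap.
kubectl -n ingress-nginx patch configmap nginx-configuration --type merge -p \
  '{"data":{"proxy-next-upstream":"error timeout http_504 http_503 http_500 http_502 http_404 invalid_header","proxy-next-upstream-timeout":"2"}}'
# Confirm the rendered nginx.conf picked up the directives
# (replace <controller-pod> with the actual ingress controller pod name).
kubectl -n ingress-nginx exec <controller-pod> -- grep proxy_next_upstream /etc/nginx/nginx.conf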
Anything else we need to know:
See more details in my comments at:
https://github.com/kubernetes/ingress-nginx/pull/3207#issuecomment-532024743
https://github.com/kubernetes/ingress-nginx/pull/3207#issuecomment-532029693
My test script:
while true; do curl -H "Host:a.com" 10.96.0.14 -v &>> /tmp/xx; done
The error portion of the log:
* About to connect() to 10.96.0.14 port 80 (#0)
*   Trying 10.96.0.14...
* Connected to 10.96.0.14 (10.96.0.14) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Accept: */*
> Host:a.com
>
< HTTP/1.1 504 Gateway Time-out
< Server: openresty/1.15.8.1
< Date: Tue, 17 Sep 2019 02:32:00 GMT
< Content-Type: text/html
< Content-Length: 173
< Connection: keep-alive
<
{ [data not shown]
* Connection #0 to host 10.96.0.14 left intact
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>openresty/1.15.8.1</center>
</body>
</html>
(curl's progress meter, omitted above, shows roughly 5 seconds elapsed before the 504 arrived)
The nginx.conf:
# Configuration checksum: 12565735316375508966
# setup custom paths that do not require root access
pid /tmp/nginx.pid;
daemon off;
worker_processes 1;
worker_rlimit_nofile 64512;
worker_shutdown_timeout 10s ;
events {
multi_accept on;
worker_connections 16384;
use epoll;
}
http {
lua_package_path "/usr/local/openresty/site/lualib/?.ljbc;/usr/local/openresty/site/lualib/?/init.ljbc;/usr/local/openresty/lualib/?.ljbc;/usr/local/openresty/lualib/?/init.ljbc;/usr/local/openresty/site/lualib/?.lua;/usr/local/openresty/site/lualib/?/init.lua;/usr/local/openresty/lualib/?.lua;/usr/local/openresty/lualib/?/init.lua;./?.lua;/usr/local/openresty/luajit/share/luajit-2.1.0-beta3/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua;/usr/local/openresty/luajit/share/lua/5.1/?.lua;/usr/local/openresty/luajit/share/lua/5.1/?/init.lua;/usr/local/lib/lua/?.lua;;";
lua_package_cpath "/usr/local/openresty/site/lualib/?.so;/usr/local/openresty/lualib/?.so;./?.so;/usr/local/lib/lua/5.1/?.so;/usr/local/openresty/luajit/lib/lua/5.1/?.so;/usr/local/lib/lua/5.1/loadall.so;/usr/local/openresty/luajit/lib/lua/5.1/?.so;;";
lua_shared_dict configuration_data 15M;
lua_shared_dict certificate_data 16M;
init_by_lua_block {
collectgarbage("collect")
local lua_resty_waf = require("resty.waf")
lua_resty_waf.init()
-- init modules
local ok, res
ok, res = pcall(require, "lua_ingress")
if not ok then
error("require failed: " .. tostring(res))
else
lua_ingress = res
lua_ingress.set_config({
use_forwarded_headers = false,
is_ssl_passthrough_enabled = false,
http_redirect_code = 308,
listen_ports = { ssl_proxy = "442", https = "443" },
})
end
ok, res = pcall(require, "configuration")
if not ok then
error("require failed: " .. tostring(res))
else
configuration = res
end
ok, res = pcall(require, "balancer")
if not ok then
error("require failed: " .. tostring(res))
else
balancer = res
end
ok, res = pcall(require, "monitor")
if not ok then
error("require failed: " .. tostring(res))
else
monitor = res
end
ok, res = pcall(require, "certificate")
if not ok then
error("require failed: " .. tostring(res))
else
certificate = res
end
ok, res = pcall(require, "plugins")
if not ok then
error("require failed: " .. tostring(res))
else
plugins = res
end
-- load all plugins that'll be used here
plugins.init({})
}
init_worker_by_lua_block {
lua_ingress.init_worker()
balancer.init_worker()
monitor.init_worker()
plugins.run()
}
geoip_country /etc/nginx/geoip/GeoIP.dat;
geoip_city /etc/nginx/geoip/GeoLiteCity.dat;
geoip_org /etc/nginx/geoip/GeoIPASNum.dat;
geoip_proxy_recursive on;
aio threads;
aio_write on;
tcp_nopush on;
tcp_nodelay on;
log_subrequest on;
reset_timedout_connection on;
keepalive_timeout 75s;
keepalive_requests 100;
client_body_temp_path /tmp/client-body;
fastcgi_temp_path /tmp/fastcgi-temp;
proxy_temp_path /tmp/proxy-temp;
ajp_temp_path /tmp/ajp-temp;
client_header_buffer_size 1k;
client_header_timeout 60s;
large_client_header_buffers 4 8k;
client_body_buffer_size 8k;
client_body_timeout 60s;
http2_max_field_size 4k;
http2_max_header_size 16k;
http2_max_requests 1000;
types_hash_max_size 2048;
server_names_hash_max_size 1024;
server_names_hash_bucket_size 32;
map_hash_bucket_size 64;
proxy_headers_hash_max_size 512;
proxy_headers_hash_bucket_size 64;
variables_hash_bucket_size 128;
variables_hash_max_size 2048;
underscores_in_headers off;
ignore_invalid_headers on;
limit_req_status 503;
limit_conn_status 503;
include /etc/nginx/mime.types;
default_type text/html;
gzip on;
gzip_comp_level 5;
gzip_http_version 1.1;
gzip_min_length 256;
gzip_types application/atom+xml application/javascript application/x-javascript application/json application/rss+xml application/vnd.ms-fontobject application/x-font-ttf application/x-web-app-manifest+json application/xhtml+xml application/xml font/opentype image/svg+xml image/x-icon text/css text/javascript text/plain text/x-component;
gzip_proxied any;
gzip_vary on;
# Custom headers for response
server_tokens on;
# disable warnings
uninitialized_variable_warn off;
# Additional available variables:
# $namespace
# $ingress_name
# $service_name
# $service_port
log_format upstreaminfo '$the_real_ip - [$the_real_ip] - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id';
map $request_uri $loggable {
default 1;
}
access_log /var/log/nginx/access.log upstreaminfo if=$loggable;
error_log /var/log/nginx/error.log notice;
resolver 10.96.0.2 valid=30s;
# See https://www.nginx.com/blog/websocket-nginx
map $http_upgrade $connection_upgrade {
default upgrade;
# See http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive
'' '';
}
# The following is a sneaky way to do "set $the_real_ip $remote_addr"
# Needed because using set is not allowed outside server blocks.
map '' $the_real_ip {
default $remote_addr;
}
# Reverse proxies can detect if a client provides a X-Request-ID header, and pass it on to the backend server.
# If no such header is provided, it can provide a random value.
map $http_x_request_id $req_id {
default $http_x_request_id;
"" $request_id;
}
# Create a variable that contains the literal $ character.
# This works because the geo module will not resolve variables.
geo $literal_dollar {
default "$";
}
server_name_in_redirect off;
port_in_redirect off;
ssl_protocols TLSv1.2;
# turn on session caching to drastically improve performance
ssl_session_cache builtin:1000 shared:SSL:10m;
ssl_session_timeout 10m;
# allow configuring ssl session tickets
ssl_session_tickets on;
# slightly reduce the time-to-first-byte
ssl_buffer_size 4k;
# allow configuring custom ssl ciphers
ssl_ciphers 'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256';
ssl_prefer_server_ciphers on;
ssl_ecdh_curve auto;
proxy_ssl_session_reuse on;
upstream upstream_balancer {
server 0.0.0.1; # placeholder
balancer_by_lua_block {
balancer.balance()
}
keepalive 32;
keepalive_timeout 60s;
keepalive_requests 100;
}
# Cache for internal auth checks
proxy_cache_path /tmp/nginx-cache-auth levels=1:2 keys_zone=auth_cache:10m max_size=128m inactive=30m use_temp_path=off;
# Global filters
## start server _
server {
server_name _ ;
listen 80 default_server reuseport backlog=511 ;
listen [::]:80 default_server reuseport backlog=511 ;
listen 443 default_server reuseport backlog=511 ssl http2 ;
listen [::]:443 default_server reuseport backlog=511 ssl http2 ;
set $proxy_upstream_name "-";
# PEM sha: 91d97d82eb70ad22a884a54e950f0c0b91223297
ssl_certificate /etc/ingress-controller/ssl/default-fake-certificate.pem;
ssl_certificate_key /etc/ingress-controller/ssl/default-fake-certificate.pem;
ssl_certificate_by_lua_block {
certificate.call()
}
location / {
set $namespace "";
set $ingress_name "";
set $service_name "";
set $service_port "{0 0 }";
set $location_path "/";
rewrite_by_lua_block {
lua_ingress.rewrite({
force_ssl_redirect = false,
use_port_in_redirects = false,
})
balancer.rewrite()
plugins.run()
}
header_filter_by_lua_block {
plugins.run()
}
body_filter_by_lua_block {
}
log_by_lua_block {
balancer.log()
monitor.call()
plugins.run()
}
if ($scheme = https) {
more_set_headers "Strict-Transport-Security: max-age=15724800; includeSubDomains";
}
access_log off;
port_in_redirect off;
set $balancer_ewma_score -1;
set $proxy_upstream_name "upstream-default-backend";
set $proxy_host $proxy_upstream_name;
set $pass_access_scheme $scheme;
set $pass_server_port $server_port;
set $best_http_host $http_host;
set $pass_port $pass_server_port;
set $proxy_alternative_upstream_name "";
client_max_body_size 1m;
proxy_set_header Host $best_http_host;
# Pass the extracted client certificate to the backend
# Allow websocket connections
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header X-Request-ID $req_id;
proxy_set_header X-Real-IP $the_real_ip;
proxy_set_header X-Forwarded-For $the_real_ip;
proxy_set_header X-Forwarded-Host $best_http_host;
proxy_set_header X-Forwarded-Port $pass_port;
proxy_set_header X-Forwarded-Proto $pass_access_scheme;
proxy_set_header X-Original-URI $request_uri;
proxy_set_header X-Scheme $pass_access_scheme;
# Pass the original X-Forwarded-For
proxy_set_header X-Original-Forwarded-For $http_x_forwarded_for;
# mitigate HTTPoxy Vulnerability
# https://www.nginx.com/blog/mitigating-the-httpoxy-vulnerability-with-nginx/
proxy_set_header Proxy "";
# Custom headers to proxied server
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
proxy_buffering off;
proxy_buffer_size 4k;
proxy_buffers 4 4k;
proxy_request_buffering on;
proxy_http_version 1.1;
proxy_cookie_domain off;
proxy_cookie_path off;
# In case of errors try the next upstream server before returning an error
proxy_next_upstream error timeout http_504 http_503 http_500 http_502 http_404 invalid_header non_idempotent;
proxy_next_upstream_timeout 2;
proxy_next_upstream_tries 3;
proxy_pass http://upstream_balancer;
proxy_redirect off;
}
# health checks in cloud providers require the use of port 80
location /healthz {
access_log off;
return 200;
}
# this is required to avoid error if nginx is being monitored
# with an external software (like sysdig)
location /nginx_status {
allow 127.0.0.1;
allow ::1;
deny all;
access_log off;
stub_status on;
}
}
## end server _
## start server a.com
server {
server_name a.com ;
listen 80 ;
listen [::]:80 ;
listen 443 ssl http2 ;
listen [::]:443 ssl http2 ;
set $proxy_upstream_name "-";
location / {
set $namespace "default";
set $ingress_name "nginx-l7";
set $service_name "nginx-nginx";
set $service_port "{0 0 }";
set $location_path "/";
rewrite_by_lua_block {
lua_ingress.rewrite({
force_ssl_redirect = false,
use_port_in_redirects = false,
})
balancer.rewrite()
plugins.run()
}
header_filter_by_lua_block {
plugins.run()
}
body_filter_by_lua_block {
}
log_by_lua_block {
balancer.log()
monitor.call()
plugins.run()
}
port_in_redirect off;
set $balancer_ewma_score -1;
set $proxy_upstream_name "default-nginx-nginx-80";
set $proxy_host $proxy_upstream_name;
set $pass_access_scheme $scheme;
set $pass_server_port $server_port;
set $best_http_host $http_host;
set $pass_port $pass_server_port;
set $proxy_alternative_upstream_name "";
client_max_body_size 1m;
proxy_set_header Host $best_http_host;
# Pass the extracted client certificate to the backend
# Allow websocket connections
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header X-Request-ID $req_id;
proxy_set_header X-Real-IP $the_real_ip;
proxy_set_header X-Forwarded-For $the_real_ip;
proxy_set_header X-Forwarded-Host $best_http_host;
proxy_set_header X-Forwarded-Port $pass_port;
proxy_set_header X-Forwarded-Proto $pass_access_scheme;
proxy_set_header X-Original-URI $request_uri;
proxy_set_header X-Scheme $pass_access_scheme;
# Pass the original X-Forwarded-For
proxy_set_header X-Original-Forwarded-For $http_x_forwarded_for;
# mitigate HTTPoxy Vulnerability
# https://www.nginx.com/blog/mitigating-the-httpoxy-vulnerability-with-nginx/
proxy_set_header Proxy "";
# Custom headers to proxied server
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
proxy_buffering off;
proxy_buffer_size 4k;
proxy_buffers 4 4k;
proxy_request_buffering on;
proxy_http_version 1.1;
proxy_cookie_domain off;
proxy_cookie_path off;
# In case of errors try the next upstream server before returning an error
proxy_next_upstream error timeout http_504 http_503 http_500 http_502 http_404 invalid_header non_idempotent;
proxy_next_upstream_timeout 2;
proxy_next_upstream_tries 3;
proxy_pass http://upstream_balancer;
proxy_redirect off;
}
}
## end server a.com
# backend for when default-backend-service is not configured or it does not have endpoints
server {
listen 8181 default_server reuseport backlog=511;
listen [::]:8181 default_server reuseport backlog=511;
set $proxy_upstream_name "internal";
access_log off;
location / {
return 404;
}
}
# default server, used for NGINX healthcheck and access to nginx stats
server {
listen unix:/tmp/nginx-status-server.sock;
set $proxy_upstream_name "internal";
keepalive_timeout 0;
gzip off;
access_log off;
location /healthz {
return 200;
}
location /is-dynamic-lb-initialized {
content_by_lua_block {
local configuration = require("configuration")
local backend_data = configuration.get_backends_data()
if not backend_data then
ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
return
end
ngx.say("OK")
ngx.exit(ngx.HTTP_OK)
}
}
location /nginx_status {
stub_status on;
}
location /configuration {
# this should be equals to configuration_data dict
client_max_body_size 10m;
client_body_buffer_size 10m;
proxy_buffering off;
content_by_lua_block {
configuration.call()
}
}
location / {
content_by_lua_block {
ngx.exit(ngx.HTTP_NOT_FOUND)
}
}
}
}
stream {
lua_package_cpath "/usr/local/lib/lua/?.so;/usr/lib/lua-platform-path/lua/5.1/?.so;;";
lua_package_path "/etc/nginx/lua/?.lua;/etc/nginx/lua/vendor/?.lua;/usr/local/lib/lua/?.lua;;";
lua_shared_dict tcp_udp_configuration_data 5M;
init_by_lua_block {
collectgarbage("collect")
-- init modules
local ok, res
ok, res = pcall(require, "configuration")
if not ok then
error("require failed: " .. tostring(res))
else
configuration = res
end
ok, res = pcall(require, "tcp_udp_configuration")
if not ok then
error("require failed: " .. tostring(res))
else
tcp_udp_configuration = res
end
ok, res = pcall(require, "tcp_udp_balancer")
if not ok then
error("require failed: " .. tostring(res))
else
tcp_udp_balancer = res
end
}
init_worker_by_lua_block {
tcp_udp_balancer.init_worker()
}
lua_add_variable $proxy_upstream_name;
log_format log_stream [$time_local] $protocol $status $bytes_sent $bytes_received $session_time;
access_log /var/log/nginx/access.log log_stream ;
error_log /var/log/nginx/error.log;
upstream upstream_balancer {
server 0.0.0.1:1234; # placeholder
balancer_by_lua_block {
tcp_udp_balancer.balance()
}
}
server {
listen unix:/tmp/ingress-stream.sock;
access_log off;
content_by_lua_block {
tcp_udp_configuration.call()
}
}
# TCP services
# UDP services
}
@panpan0000 please post the ingress controller pod log
(1) The two upstream pods:
[root@macvlan-master ~]# k get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-nginx-d4b6994bd-b769s 1/1 Running 3 3h11m 10.7.111.71 macvlan-slave2 <none> <none>
nginx-nginx-d4b6994bd-tbf99 1/1 Running 3 23h 10.7.111.70 macvlan-slave1 <none> <none>
(2) The ingress-nginx pod:
[root@macvlan-master ~]# kn get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-ingress-controller-65cdc4bf9b-pzm8b 1/1 Running 0 90s 10.7.111.77 macvlan-slave1 <none> <none>
(3) The ingress-nginx log:
NGINX Ingress controller
Release: 0.25.1
Build: git-5179893a9
Repository: https://github.com/kubernetes/ingress-nginx/
nginx version: openresty/1.15.8.1
-------------------------------------------------------------------------------
W0917 09:34:39.180770 9 flags.go:221] SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
nginx version: openresty/1.15.8.1
W0917 09:34:39.192198 9 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0917 09:34:39.192639 9 main.go:183] Creating API client for https://10.96.0.1:443
I0917 09:34:39.213781 9 main.go:227] Running in Kubernetes cluster version v1.15 (v1.15.3) - git (clean) commit f774be9 - platform linux/amd64
I0917 09:34:39.705490 9 main.go:102] Created fake certificate with PemFileName: /etc/ingress-controller/ssl/default-fake-certificate.pem
I0917 09:34:39.775005 9 nginx.go:274] Starting NGINX Ingress controller
I0917 09:34:39.818900 9 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"nginx-configuration", UID:"a969b27d-d8e8-11e9-902c-005056b46946", APIVersion:"v1", ResourceVersion:"334529", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/nginx-configuration
I0917 09:34:39.823639 9 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"tcp-services", UID:"a96bb4e1-d8e8-11e9-902c-005056b46946", APIVersion:"v1", ResourceVersion:"326721", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/tcp-services
I0917 09:34:39.823703 9 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"udp-services", UID:"a96dfe25-d8e8-11e9-902c-005056b46946", APIVersion:"v1", ResourceVersion:"326722", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/udp-services
I0917 09:34:40.902133 9 event.go:258] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"nginx-l7", UID:"41a1232c-d827-11e9-902c-005056b46946", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"327757", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress default/nginx-l7
I0917 09:34:40.976291 9 nginx.go:318] Starting NGINX process
I0917 09:34:40.976500 9 leaderelection.go:235] attempting to acquire leader lease ingress-nginx/ingress-controller-leader-nginx...
I0917 09:34:40.979321 9 controller.go:133] Configuration changes detected, backend reload required.
I0917 09:34:40.981511 9 status.go:86] new leader elected: nginx-ingress-controller-65cdc4bf9b-7fxzl
E0917 09:34:41.128364 9 checker.go:41] healthcheck error: Get http+unix://nginx-status/healthz: dial unix /tmp/nginx-status-server.sock: connect: no such file or directory
I0917 09:34:41.765212 9 controller.go:149] Backend successfully reloaded.
I0917 09:34:41.765267 9 controller.go:158] Initial sync, sleeping for 1 second.
I0917 09:35:23.818812 9 leaderelection.go:245] successfully acquired lease ingress-nginx/ingress-controller-leader-nginx
I0917 09:35:23.818823 9 status.go:86] new leader elected: nginx-ingress-controller-65cdc4bf9b-pzm8b
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:31 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.001 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.001 200 xxxx
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:31 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.016 [default-nginx-nginx-80] [] 10.7.111.71:80 612 0.017 200 xxxx
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:31 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.001 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.001 200 xxxx
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:32 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.001 [default-nginx-nginx-80] [] 10.7.111.71:80 612 0.001 200 xxxx
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:32 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.000 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.000 200 xxxx
..... (200 responses keep alternating between the two pods) .....
..... ## After rebooting the node where the upstream pod with IP 10.7.111.71 lives ##
...... ## 504s start appearing ##
2019/09/17 09:36:51 [error] 48#48: *523 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.7.1.31, server: a.com, request: "GET / HTTP/1.1", upstream: "http://10.7.111.71:80/", host: "a.com"
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:51 +0000] "GET / HTTP/1.1" 504 173 "-" "curl/7.29.0" 68 5.001 [default-nginx-nginx-80] [] 10.7.111.71:80 0 5.001 504 xxxx
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:52 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 69 0.001 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.001 200 xxxxx
2019/09/17 09:36:57 [error] 48#48: *546 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.7.1.31, server: a.com, request: "GET / HTTP/1.1", upstream: "http://10.7.111.71:80/", host: "a.com"
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:57 +0000] "GET / HTTP/1.1" 504 173 "-" "curl/7.29.0" 68 5.000 [default-nginx-nginx-80] [] 10.7.111.71:80 0 5.001 504 914f5c508fc20d620c9131b8adcd1beb
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:57 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.000 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.000 200 xxxxx
2019/09/17 09:36:58 [error] 48#48: *553 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.7.1.31, server: a.com, request: "GET / HTTP/1.1", upstream: "http://10.7.111.71:80/", host: "a.com"
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:36:58 +0000] "GET / HTTP/1.1" 504 173 "-" "curl/7.29.0" 69 5.000 [default-nginx-nginx-80] [] 10.7.111.71:80 0 5.000 504 xxxxx
2019/09/17 09:37:02 [error] 48#48: *568 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.7.1.31, server: a.com, request: "GET / HTTP/1.1", upstream: "http://10.7.111.71:80/", host: "a.com"
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:37:02 +0000] "GET / HTTP/1.1" 504 173 "-" "curl/7.29.0" 68 5.001 [default-nginx-nginx-80] [] 10.7.111.71:80 0 5.001 504 c08c377ebc5450564fb5a68a9eab3e5f
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:37:02 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.001 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.000 200 xxxxx
2019/09/17 09:37:08 [error] 48#48: *596 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.7.1.31, server: a.com, request: "GET / HTTP/1.1", upstream: "http://10.7.111.71:80/", host: "a.com"
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:37:08 +0000] "GET / HTTP/1.1" 504 173 "-" "curl/7.29.0" 68 5.000 [default-nginx-nginx-80] [] 10.7.111.71:80 0 5.000 504 xxxx
..... (the ~50% failure rate keeps repeating) .....
### after about 30 seconds, only the healthy upstream pod (pod IP 10.7.111.70) is used ###
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:37:26 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.000 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.000 200 xxxx
...... (ingress only routes to 10.7.111.70) ......
### about 60 seconds later, both pods are back in rotation ###
......
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:38:26 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.002 [default-nginx-nginx-80] [] 10.7.111.71:80 612 0.002 200 ....
10.7.1.31 - [10.7.1.31] - - [17/Sep/2019:09:38:27 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.29.0" 68 0.000 [default-nginx-nginx-80] [] 10.7.111.70:80 612 0.000 200 ...
By the way, I have done some investigation, written up in the last comments of https://github.com/kubernetes/ingress-nginx/pull/3207; I hope it helps.
@panpan0000 you seem to have proxy_next_upstream_timeout set to 2 seconds. Are you sure the failing pod fails within 2 seconds? If it does not fail faster than 2 seconds, Nginx won't retry. You can try increasing it.
Edit: looking at your logs above, the request to the failing pod fails after 5 seconds. Given that proxy_next_upstream_timeout is set to 2 seconds, Nginx will never retry those requests, by design. Depending on your app, you might want to reduce the following values
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
or increase proxy_next_upstream_timeout.
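A quick way to confirm how long a single request to the failing backend actually takes before the 504 comes back (the host header and IP below are the ones from this report; adjust to your setup):

# Print curl's own timing for one request; in the failing case above it reports about 5s.
curl -s -o /dev/null -H "Host: a.com" \
  -w 'connect: %{time_connect}s  total: %{time_total}s  status: %{http_code}\n' \
  10.96.0.14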
Thank you so much for your attention and reply, @ElvinEfendi.
Unfortunately, with the parameters below in nginx.conf, curl still fails with 504 Gateway Timeout.
The upstream pods are plain nginx pods.
Can you shed more light on this?
proxy_connect_timeout 5s;
proxy_send_timeout 5s;
proxy_read_timeout 5s;
proxy_buffering off;
proxy_buffer_size 4k;
proxy_next_upstream error timeout http_504 http_503 http_500 http_502 http_404 invalid_header non_idempotent;
proxy_next_upstream_timeout 5;
proxy_next_upstream_tries 3;
Your configuration is still not correct. In the case above I suggest setting proxy_next_upstream_timeout to 16 seconds; otherwise this configuration does not make sense.
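A rough sketch of why 16 seconds works here: with proxy_connect_timeout 5s and proxy_next_upstream_tries 3, the retry attempts alone can take up to about 3 x 5s = 15s of connect time, so proxy_next_upstream_timeout has to be larger than that. Applied through the controller ConfigMap (key names assume the standard ingress-nginx ConfigMap options and the same ConfigMap name as above), that would look something like:

# Hypothetical ConfigMap patch: three 5-second connect attempts need a retry
# window larger than 15 seconds, hence 16.
kubectl -n ingress-nginx patch configmap nginx-configuration --type merge -p \
  '{"data":{"proxy-connect-timeout":"5","proxy-next-upstream-timeout":"16","proxy-next-upstream-tries":"3"}}'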
Thank you @ElvinEfendi, it works now.
proxy-next-upstream will decrease the throughput, though;
I hope max_fails can come back to ingress someday: https://github.com/kubernetes/ingress-nginx/pull/3207
If there is nothing left to do, this issue can be closed.
Thank you guys again!
Glad I could help!
hope max_fails can come back to ingress someday ... #3207
@panpan0000 As suggested in the PR you linked, the idea is that we can achieve the same behaviour using the k8s readiness probe. We rely on the readiness probe for health checking too. Can you explain why the readiness probe is not enough? What do you need max_fails for?
Hi @ElvinEfendi,
Why the liveness/readiness probes are not good enough:
Quoting my comments in https://github.com/kubernetes/ingress-nginx/pull/3207:
We cannot always rely on the liveness probe.
The problematic situation is a node crash (which can be simulated by rebooting a node). The liveness probe is handled by the kubelet, which is not running when the node is down.
Kubernetes will not mark that node NotReady until a ~40 second timeout expires.
Until then, the pods on that node move from Running to Unknown state.
That means there is a window of 40+ seconds during which ingress-nginx thinks the pods are alive when they actually are not, and the requests sent to those dead pods suffer 504 Gateway Timeouts.
A 40+ second outage window like this is unacceptable for many critical businesses.
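To observe this window during the reboot test, one option is to watch the node and pod status transitions from separate terminals (plain kubectl, nothing ingress-specific):

# In one terminal: watch the node transition to NotReady (roughly 40s by default).
kubectl get nodes -w
# In another terminal: watch the pods on that node flip from Running to Unknown.
kubectl get pods -o wide -w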
@panpan0000 thanks for the explanation. Have you tried EWMA load balancing? While it is not a replacement for the max-fails behaviour, it _can_ help reduce errors in the case of node failures, because it learns that the pods running on the failing node are taking longer to respond and therefore reduces the chance of them being picked during load balancing.
A proper fix: I'm thinking of introducing active/passive health checking to ingress-nginx. Maybe then we can restore the max-fails feature too.
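For anyone who wants to try this, EWMA can be enabled globally through the controller ConfigMap's load-balance option (a sketch assuming the same ConfigMap name as above):

# Switch the Lua balancer from the default round robin to EWMA.
kubectl -n ingress-nginx patch configmap nginx-configuration --type merge -p \
  '{"data":{"load-balance":"ewma"}}'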