Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): no
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): grafana, prometheus
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
NGINX Ingress controller version: 0.19
Kubernetes version (use kubectl version): 1.11.0
Environment:
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.19.0
What happened:
I was on 0.17 and had a nice-looking Grafana dashboard. When I upgraded to 0.19, half of the panels had no data.
What you expected to happen:
Metrics shouldn't disappear on upgrade.
How to reproduce it (as minimally and precisely as possible):
Scrape the prometheus metrics endpoint:
$ http get http://<POD_IP>:10254/metrics | grep nginx_ingress_controller_bytes
The grep returns nothing. Many metrics are still reported, but many others are missing. Here are the metrics it is returning now:
Anything else we need to know:
Container arguments:
- args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-http-backend
- --configmap=$(POD_NAMESPACE)/nginx-configuration
- --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
- --udp-services-configmap=$(POD_NAMESPACE)/udp-services
- --publish-service=$(POD_NAMESPACE)/ingress-nginx
- --annotations-prefix=nginx.ingress.kubernetes.io
- --enable-dynamic-configuration=false
And configmap:
compute-full-forwarded-for: "true"
disable-ipv6: "true"
disable-ipv6-dns: "true"
load-balance: ip_hash
proxy-read-timeout: "3600"
proxy-send-timeout: "3600"
use-proxy-protocol: "true"
worker-processes: "4"
worker-shutdown-timeout: "43200"
This works as expected. The Prometheus metrics are not persistent, so when you upgrade the version there are no stats. You need traffic for data to show up in Prometheus.
But I do have traffic. This nginx ingress controller is used in production and is working fine. The upgrade was over 24 hours ago and there are still no metrics.
Are you sure there is no regression in the code?... Note that I have --enable-dynamic-configuration=false; not sure if that matters or not.
The metrics work with or without dynamic mode
I was able to reproduce in the dev environment. Downgrade to 0.18 -> it works. Go back to 0.19 -> metrics gone.
I hit this as well.
I was running 0.17.1 and upgraded to 0.19.0, and several metrics apparently stopped being reported by the metrics endpoint. As mentioned by @gjcarneiro, downgrading to 0.18.0 also restored the lost metrics on my end (I used the latest nginx-ingress chart version with both).
Metrics dump for both versions (some labels were redacted):
https://gist.github.com/danielfm/d429b8fa055671d6fccb1ee1c1863ab9
I checked the changelog between 0.18.0 and 0.19.0, but could not find any change that would explain this, so any help is greatly appreciated.
I have the same problem. After upgrading to 0.19.0 I lost many metrics.
Same here. This is a new installation on 0.19.0, and I've been working through all the metrics examples online without seeing them in my nginx-ingress controller; then I found this ticket.
I used a custom nginx.tmpl template with 0.18.0; when upgrading from 0.18.0 to 0.19.0, the template was not updated to the new version. After updating nginx.tmpl to the 0.19.0 version, metrics work OK.
I'm not using custom templates in my configuration. In my case, as I said before, I tried the exact same configuration with the exact same nginx-ingress chart version both with v0.18.0 and v0.19.0, and several metrics stopped being reported in v0.19.0 (see the GitHub gist I posted earlier for more information).
I've just hit this too....
Nginx Ingress: 0.20.0
Kubernetes: 1.9
Any workaround?
edit: using standard templates, but with custom snippets
further edit: I am getting the ingress metrics, but NO Nginx metrics, especially the 2xx/3xx/4xx/5xx status codes.
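In case it helps others triage: this is roughly how I'm checking for the per-status-code counters (a rough sketch only; the pod IP is a placeholder, 10254 is the default metrics port, and I'm assuming nginx_ingress_controller_requests is the family that carries the status label in these versions):
$ curl -sS http://<POD_IP>:10254/metrics | grep -c 'nginx_ingress_controller_requests{'
On an affected pod this should print 0, while the controller-process metrics are still present.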
I have the same problem in our production environment. Prometheus can't collect many metrics.
nginx-ingress-controller:0.20.0
kubernetes v1.10.0
--enable-dynamic-configuration=false
Also, enable-dynamic-configuration shouldn't have any relation to metrics, but it does seem to.
@aledbf @ElvinEfendi @nicksardo
I'm using the default setting for enable-dynamic-configuration, which I believe is true. I also tried setting it to false. I couldn't get all the metrics either way on 0.19.0 and above.
Same issue here on 0.20.0 and dev
it works fine on 0.18.0
I have the same issue with 0.19.0 and 0.20.0; no problem with 0.18.0.
Is everyone missing only Nginx metrics, with no issue with the controller metrics? If not, is there a pattern to which metrics are missing?
Are you using a custom template?
Do you see one or more of the following messages in the logs?
error when setting up timer.every
omitting metrics for the request, current batch is full
error while encoding metrics
Do you see any other Nginx error in the logs?
Can you strace an Nginx worker and see whether it's writing to unix:/tmp/prometheus-nginx.socket and what it's writing (it should be Nginx metrics such as HTTP status; the full list is at https://github.com/kubernetes/ingress-nginx/blob/bc6f2e7016b9209203fd6a36aca2c2e04a10cae8/rootfs/etc/nginx/lua/monitor.lua#L19)?
Can you also strace the controller process and see whether it's reading from the same socket?
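Something along these lines should work for the strace checks (a rough sketch only; run it from a shell inside the controller pod, the PIDs are placeholders, and strace may need to be installed first):
$ ps aux | grep -E 'nginx: worker|nginx-ingress-controller'   # note one worker PID and the controller PID
$ strace -f -s 512 -p <nginx-worker-pid> 2>&1 | grep prometheus-nginx
(the worker should be connecting to and writing metric payloads onto unix:/tmp/prometheus-nginx.socket)
$ strace -f -s 512 -e trace=accept,read -p <controller-pid>
(the controller should be accepting connections and reading those payloads from the same socket)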
I'm hitting the same issue on Version 0.20.0:
With --enable-dynamic-configuration=false the metrics are missing, while with --enable-dynamic-configuration=true metrics are exported as expected.
@ElvinEfendi I think I can provide some more context.
We noticed this after upgrading from chart version 0.17.2 to 0.29.1. We do indeed have --enable-dynamic-configuration=false set after upgrading to 0.29.1. We did not have that set previously, but this option was already false by default in 0.17.2.
(I don't think v0.17.2 or 0.29.1 is particularly important, it's just what we had deployed. As others have noted, this problem seems to have first appeared in chart version 0.19 and only with dynamic configuration disabled.)
So I took a snapshot of /metrics on both versions:
I then did a diff on the metric names:
https://gist.github.com/sczizzo/c9778cf758b7ee37d254db591ad0a57b#file-metric-names-diff-txt
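(For reference, the snapshot and diff were roughly along these lines; the pod addresses are placeholders: scrape each version, strip comments and labels, sort, then diff.)
$ curl -sS http://<old-version-pod>:10254/metrics | grep -v '^#' | sed 's/{.*//;s/ .*//' | sort -u > metric-names-old.txt
$ curl -sS http://<new-version-pod>:10254/metrics | grep -v '^#' | sed 's/{.*//;s/ .*//' | sort -u > metric-names-new.txt
$ diff metric-names-old.txt metric-names-new.txt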
So to answer your first set of questions, we're definitely getting _some_ metrics. It also appears the naming scheme changed a bit between these releases.
We used to be able to, say, get the total number of responses by server_zone like so:
sum(rate(nginx_responses_total{job="nginx-ingress"}[5m])) by (server_zone)
But I don't see any way to accomplish this given the new set of metrics.
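(The closest thing I can see, assuming the newer per-request counter nginx_ingress_controller_requests were actually populated and that its labels include ingress and status, would be something like:
sum(rate(nginx_ingress_controller_requests{job="nginx-ingress"}[5m])) by (ingress, status)
but that family appears to be among the metrics missing here.)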
As to your second set of questions, we did not see any instance of "error when setting up timer.every", but we did see lots of "omitting metrics for the request, current batch is full" and "error while encoding metrics". As far as I can tell, with dynamic configuration disabled, monitor.init_worker() is never called, so we wouldn't expect to see that timer.every error (perhaps init_worker should be running in this case?). Otherwise, no errors really jump out at me.
I haven't spent too much time digging into the running containers yet, but with chart 0.17.2 there actually is no /tmp/prometheus-nginx.socket; apparently that was added later. With v0.29.1 I was able to run strace on both the nginx process and the controller process, in addition to nc -U; I didn't see any sign that the socket was being written to.
We're experiencing the same issue here with version 0.19.0. With a custom nginx.tmpl we need enable-dynamic-configuration explicitly set to false, but that leads to missing metrics. Is there any work going on for this?
From the changelog, it seems that --enable-dynamic-configuration=false will disappear as an option, and dynamic configuration will always be enabled.
To be honest, I've had a poor experience with dynamic configuration enabled, so I have a lot of misgivings about this development path. I guess if --enable-dynamic-configuration=false is indeed causing the problem, they are going to remove that option anyway, so problem gone... :(
@gjcarneiro I know it can be frustrating when things don't work as expected. We are doing our best to make ingress-nginx better. There were valid reasons to switch to dynamic mode, and many users benefit from it. Going forward, supporting both modes is not feasible for us few maintainers.
I see that you've referenced a previous version's changelog, but have you tried the latest version, https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.21.0 (dynamic mode only)? We have fixed several bugs in that release.
I've had a poor experience with dynamic configuration enabled
Instead of taking a step backwards and going to non-dynamic mode, can you try the latest version (0.21.0) and let us know what the poor experience you were referring to was? There's nothing fundamentally broken with dynamic mode as far as I know, and it provides important benefits. We can all work together to fix any small issues that arise.
I am facing the same issue on nginx-ingress-controller:0.20.0. All the major upstream-related and nginx metrics are gone from the scrape service endpoints. Has this issue been addressed in the latest version (0.21.0)? Any permanent solutions?
We have different k8s clusters with ingress-nginx 0.21.0. (kubernetes 1.9.10)
The problem is that the nginx_ingress_controller_ssl_expire_time_seconds keys don't exist in one cluster but are available in another.
I am just getting /metrics from ingress-nginx via the Kubernetes API (with kubectl proxy).
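(Roughly like this; the namespace and pod name are placeholders and 10254 is the default metrics port:)
$ kubectl proxy --port=8001 &
$ curl -sS http://127.0.0.1:8001/api/v1/namespaces/<namespace>/pods/<ingress-nginx-pod>:10254/proxy/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds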
We run nginx-ingress-controller with parameters like:
--configmap=kube-extra/nginx-ingress-controller
--default-ssl-certificate=kube-extra/wildcard-internal-ingress
--tcp-services-configmap=kube-extra/nginx-ingress-controller-tcp-ports
--sort-backends=true
--annotations-prefix=ingress.kubernetes.io
--enable-ssl-chain-completion=false
The configuration for the different clusters is almost the same (except domains/certificates, etc.).
The main difference I can see between the clusters is errors like:
Error getting SSL certificate "somename/somedomain": local SSL certificate somename/somedomain was not found. Using default certificate
in the cluster where the nginx_ingress_controller_ssl_expire_time_seconds keys aren't available.
If somebody has any suggestions, I'll be happy to hear them.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Can somebody share the resolution?
Well, meanwhile I upgraded to the latest version, and all metrics are there. I'm pretty sure this is fixed now. No idea which exact version fixed it.
@gjcarneiro which version do you currently use?
I am on 0.26.1 and getting somewhat the same issue.
[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
4437
[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
1848
One deployment with 2 replicas; one of the pods is missing the SSL expiry metrics, which I am interested in. This happens regardless of how many times I recreate the pod.
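In case it's useful, this is roughly how I compare the replicas (a sketch only; the namespace and label selector are assumptions about my setup, and it assumes curl is available in the controller image):
$ for p in $(kubectl -n <namespace> get pods -l app=nginx-ingress -o name); do echo -n "$p: "; kubectl -n <namespace> exec "${p#pod/}" -- curl -sS localhost:10254/metrics | grep -c nginx_ingress_controller_ssl_expire_time_seconds; done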
Also using 0.26.1. I don't know if _all_ the metrics have been preserved, but they're essentially there.
For the case of ssl expiry, we have metrics:
$ http get 10.134.4.40:10254/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds | wc -l
49