Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.): no
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.): grafana, prometheus
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
NGINX Ingress controller version: 0.19
Kubernetes version (use kubectl version): 1.11.0
Environment:
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.19.0
What happened:
I was on 0.17 and had a nice-looking Grafana dashboard. When I upgraded to 0.19, half of the panels had no data.
What you expected to happen:
Metrics shouldn't disappear on upgrade.
How to reproduce it (as minimally and precisely as possible):
Scrape the prometheus metrics endpoint:
$ http get http://<POD_IP>:10254/metrics | grep nginx_ingress_controller_bytes
The grep returns nothing. Many metrics are still reported, but many others are missing. Here are the metrics it is returning now:
Anything else we need to know:
Container arguments:
- args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-http-backend
- --configmap=$(POD_NAMESPACE)/nginx-configuration
- --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
- --udp-services-configmap=$(POD_NAMESPACE)/udp-services
- --publish-service=$(POD_NAMESPACE)/ingress-nginx
- --annotations-prefix=nginx.ingress.kubernetes.io
- --enable-dynamic-configuration=false
And configmap:
compute-full-forwarded-for: "true"
disable-ipv6: "true"
disable-ipv6-dns: "true"
load-balance: ip_hash
proxy-read-timeout: "3600"
proxy-send-timeout: "3600"
use-proxy-protocol: "true"
worker-processes: "4"
worker-shutdown-timeout: "43200"
This works as expected. The Prometheus metrics are not persistent, so when you upgrade the version there are no stats. You need traffic for data to show up in Prometheus.
But I do have traffic. This nginx ingress controller is used in production and is working fine. The upgrade was over 24 hours ago and there are still no metrics.
Are you sure there is no regression in the code?... Note that I have --enable-dynamic-configuration=false; not sure if that matters or not.
The metrics work with or without dynamic mode
I was able to reproduce in the dev environment. Downgrade to 0.18 -> it works. Go back to 0.19 -> metrics gone.
I hit this as well.
I was running 0.17.1 and upgraded to 0.19.0, and several metrics apparently stopped being reported by the metrics endpoint. As mentioned by @gjcarneiro, downgrading to 0.18.0 also restored the lost metrics on my end (I used the latest nginx-ingress chart version with both).
Metrics dump for both versions (some labels were redacted):
https://gist.github.com/danielfm/d429b8fa055671d6fccb1ee1c1863ab9
I checked the changelog between 0.18.0 and 0.19.0, but could not find any change that would explain this, so any help is greatly appreciated.
I have the same problem. After upgrading to 0.19.0 I lost many metrics.
Same here. This is a new installation on 0.19.0, and I've been working through all the metrics examples online without seeing them in my nginx-ingress controller; then I found this ticket.
I used a custom nginx.tmpl template with 0.18.0; when upgrading from 0.18.0 to 0.19.0, the template was not updated to the new version. After updating nginx.tmpl to the 0.19.0 version, metrics work OK.
I'm not using custom templates in my configuration. In my case, as I said before, I tried the exact same configuration with the exact same nginx-ingress chart version both with v0.18.0 and v0.19.0, and several metrics stopped being reported in v0.19.0 (see the GitHub gist I posted earlier for more information).
I've just hit this too....
Nginx Ingress: 0.20.0
Kubernetes: 1.9
Any workaround?
edit: using standard templates, but with custom snippets
further edit: I am getting the ingress metrics, but NO Nginx metrics, especially the 2xx/3xx/4xx/5xx status codes.
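In case it helps others triage: this is roughly how I'm checking for the per-status-code counters (a rough sketch only; the pod IP is a placeholder, 10254 is the default metrics port, and I'm assuming nginx_ingress_controller_requests is the family that carries the status label in these versions):
$ curl -sS http://<POD_IP>:10254/metrics | grep -c 'nginx_ingress_controller_requests{'
On an affected pod this should print 0, while the controller-process metrics are still present.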
I have the same problem in our production environment. Prometheus can't collect many metrics.
nginx-ingress-controller:0.20.0
kubernetes v1.10.0
--enable-dynamic-configuration=false
Also, enable-dynamic-configuration shouldn't have any relation to metrics, but it does seem to.
@aledbf @ElvinEfendi @nicksardo
I'm using the default setting for enable-dynamic-configuration, which I believe is true. I also tried setting it to false. I couldn't get all the metrics either way on 0.19.0 and above.
Same issue here on 0.20.0 and dev
it works fine on 0.18.0
I have the same issue with 0.19.0 and 0.20.0; no problem with 0.18.0.
Is everyone missing only Nginx metrics, with no issue with the controller metrics? If not, is there a pattern to which metrics are missing?
Are you using a custom template?
Do you see one or more of the following messages in the logs?
error when setting up timer.every
omitting metrics for the request, current batch is full
error while encoding metrics
Do you see any other Nginx error in the logs?
Can you strace an Nginx worker and see whether it's writing to unix:/tmp/prometheus-nginx.socket and what it's writing (it should be Nginx metrics such as HTTP status; the full list is at https://github.com/kubernetes/ingress-nginx/blob/bc6f2e7016b9209203fd6a36aca2c2e04a10cae8/rootfs/etc/nginx/lua/monitor.lua#L19)?
Can you also strace the controller process and see whether it's reading from the same socket?
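Something along these lines should work for the strace checks (a rough sketch only; run it from a shell inside the controller pod, the PIDs are placeholders, and strace may need to be installed first):
$ ps aux | grep -E 'nginx: worker|nginx-ingress-controller'   # note one worker PID and the controller PID
$ strace -f -s 512 -p <nginx-worker-pid> 2>&1 | grep prometheus-nginx
(the worker should be connecting to and writing metric payloads onto unix:/tmp/prometheus-nginx.socket)
$ strace -f -s 512 -e trace=accept,read -p <controller-pid>
(the controller should be accepting connections and reading those payloads from the same socket)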
I'm hitting the same issue on Version 0.20.0:
With --enable-dynamic-configuration=false the metrics are missing, while with --enable-dynamic-configuration=true metrics are exported as expected.
@ElvinEfendi I think I can provide some more context.
We noticed this after upgrading from chart version 0.17.2 to 0.29.1. We do indeed have --enable-dynamic-configuration=false set after upgrading to 0.29.1. We did not have that set previously, but this option was already false by default in 0.17.2.
(I don't think v0.17.2 or 0.29.1 is particularly important, it's just what we had deployed. As others have noted, this problem seems to have first appeared in chart version 0.19 and only with dynamic configuration disabled.)
So I took a snapshot of /metrics on both versions:
I then did a diff on the metric names:
https://gist.github.com/sczizzo/c9778cf758b7ee37d254db591ad0a57b#file-metric-names-diff-txt
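(For reference, the snapshot and diff were roughly along these lines; the pod addresses are placeholders: scrape each version, strip comments and labels, sort, then diff.)
$ curl -sS http://<old-version-pod>:10254/metrics | grep -v '^#' | sed 's/{.*//;s/ .*//' | sort -u > metric-names-old.txt
$ curl -sS http://<new-version-pod>:10254/metrics | grep -v '^#' | sed 's/{.*//;s/ .*//' | sort -u > metric-names-new.txt
$ diff metric-names-old.txt metric-names-new.txt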
So to answer your first set of questions, we're definitely getting _some_ metrics. It also appears the naming scheme changed a bit between these releases.
We used to be able to, say, get the total number of responses by server_zone like so:
sum(rate(nginx_responses_total{job="nginx-ingress"}[5m])) by (server_zone)
But I don't see any way to accomplish this given the new set of metrics.
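(The closest thing I can see, assuming the newer per-request counter nginx_ingress_controller_requests were actually populated and that its labels include ingress and status, would be something like:
sum(rate(nginx_ingress_controller_requests{job="nginx-ingress"}[5m])) by (ingress, status)
but that family appears to be among the metrics missing here.)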
As to your second set of questions, we did not see any instance of "error when setting up timer.every", but we did see lots of "omitting metrics for the request, current batch is full" and "error while encoding metrics". As far as I can tell, with dynamic configuration disabled, monitor.init_worker() is never called, so we wouldn't expect to see that timer.every error (perhaps init_worker should be running in this case?). Otherwise, no errors really jump out at me.
I haven't spent too much time digging into the running containers yet, but with chart 0.17.2 there actually is no /tmp/prometheus-nginx.socket; apparently that was added later. With v0.29.1 I was able to run strace on both the nginx process and the controller process, in addition to nc -U; I didn't see any sign that the socket was being written to.
We're experiencing the same issue here with version 0.19.0. With a custom nginx.tmpl we need enable-dynamic-configuration explicitly set to false, but that leads to missing metrics. Is there any work going on for this?
From the changelog, it seems that --enable-dynamic-configuration=false will disappear as an option, and dynamic configuration will always be enabled.
To be honest, I've had a poor experience with dynamic configuration enabled, so I have a lot of misgivings about this development path. I guess if --enable-dynamic-configuration=false is indeed causing the problem, they are going to remove that option anyway, so problem gone... :(
@gjcarneiro I know it can be frustrating when things don't work as expected. We are doing our best to make ingress-nginx better. There were valid reasons to switch to dynamic mode, and many users benefit from it. Going forward, supporting both modes is not feasible for us few maintainers.
I see that you've referenced a previous version's changelog, but have you tried the latest version, https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.21.0 (dynamic mode only)? We have fixed several bugs in that release.
I've had a poor experience with dynamic configuration enabled
Instead of taking a step backwards and going to non-dynamic mode, can you try the latest version (0.21.0) and let us know what the poor experience you were referring to was? There's nothing fundamentally broken with dynamic mode as far as I know, and it provides important benefits. We can all work together to fix any small issues that arise.
I am facing the same issue on nginx-ingress-controller:0.20.0. All the major upstream-related and nginx metrics are gone from the scrape service endpoints. Has this issue been addressed in the latest version (0.21.0)? Any permanent solutions?
We have different k8s clusters with ingress-nginx 0.21.0. (kubernetes 1.9.10)
The problem is that the nginx_ingress_controller_ssl_expire_time_seconds keys don't exist in one cluster but are available in another.
I am just getting /metrics from ingress-nginx via the Kubernetes API (with kubectl proxy).
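(Roughly like this; the namespace and pod name are placeholders and 10254 is the default metrics port:)
$ kubectl proxy --port=8001 &
$ curl -sS http://127.0.0.1:8001/api/v1/namespaces/<namespace>/pods/<ingress-nginx-pod>:10254/proxy/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds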
We run nginx-ingress-controller with parameters like:
--configmap=kube-extra/nginx-ingress-controller
--default-ssl-certificate=kube-extra/wildcard-internal-ingress
--tcp-services-configmap=kube-extra/nginx-ingress-controller-tcp-ports
--sort-backends=true
--annotations-prefix=ingress.kubernetes.io
--enable-ssl-chain-completion=false
The configuration for the different clusters is almost the same (except domains/certificates, etc.).
The main difference I can see between the clusters is errors like:
Error getting SSL certificate "somename/somedomain": local SSL certificate somename/somedomain was not found. Using default certificate
in the cluster where the nginx_ingress_controller_ssl_expire_time_seconds keys aren't available.
If somebody has any suggestions, I'll be happy to hear them.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Can somebody share the resolution?
Well, meanwhile I upgraded to the latest version, and all metrics are there. I'm pretty sure this is fixed now. No idea which exact version fixed it.
@gjcarneiro which version do you currently use?
I am on 0.26.1 and getting somewhat the same issue.
[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
4437
[redacted] redacted@ip-redacted:$ curl -sS 192.169.192.75:10254/metrics | wc -l
1848
One deployment with 2 replicas; one of the pods is missing the SSL expiry metrics, which I am interested in. This happens regardless of how many times I recreate the pod.
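In case it's useful, this is roughly how I compare the replicas (a sketch only; the namespace and label selector are assumptions about my setup, and it assumes curl is available in the controller image):
$ for p in $(kubectl -n <namespace> get pods -l app=nginx-ingress -o name); do echo -n "$p: "; kubectl -n <namespace> exec "${p#pod/}" -- curl -sS localhost:10254/metrics | grep -c nginx_ingress_controller_ssl_expire_time_seconds; done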
Also using 0.26.1. I don't know if _all_ the metrics have been preserved, but they're essentially there.
For the case of ssl expiry, we have metrics:
$ http get 10.134.4.40:10254/metrics | grep nginx_ingress_controller_ssl_expire_time_seconds | wc -l
49