Describe the bug
High startup time for vmagent pod to start scraping in k8s setup, while readiness probe responds with OK before scraping has started. This results in loss of metrics in case of pod rotations.
To Reproduce
Expected behavior
The new pod's readiness probe endpoint should not mark the pod as ready until it has started scraping the targets. A log line
kubernetes_sd_configs: added targets: 1, removed targets: 0; total targets: 1 marks the start of scraping.
Version
The line returned when passing --version command line flag to binary. For example:
$ /vmagent-prod -version
vmagent-20200911-222705-tags-v1.41.0-cluster-0-gca08161b5
Additional context
In this case, I had 3.4k scrape targets, and the pod startup time was ~3 minutes.
2020-11-02T07:01:18.395Z info VictoriaMetrics/app/vmagent/main.go:102 started vmagent in 0.006 seconds
2020-11-02T07:01:18.396Z info VictoriaMetrics/lib/httpserver/httpserver.go:78 starting http server at http://:8429/
2020-11-02T07:01:18.396Z info VictoriaMetrics/lib/httpserver/httpserver.go:79 pprof handlers are exposed at http://:8429/debug/pprof/
2020-11-02T07:01:18.396Z info VictoriaMetrics/lib/promscrape/scraper.go:78 reading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2020-11-02T07:01:22.057Z info VictoriaMetrics/lib/promscrape/scraper.go:106 SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2020-11-02T07:01:22.057Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:105 SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2020-11-02T07:01:22.057Z info VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:112 Successfully reloaded relabel configs
2020-11-02T07:01:22.204Z info VictoriaMetrics/lib/promscrape/scraper.go:113 nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2020-11-02T07:04:08.208Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.100.194:4001/metrics, labels={endpoint="metrics", instance="10.100.100.194:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-8249p", pod_name="service1-5b8cc8d8bd-8249p", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.102.157:4001/metrics, labels={endpoint="metrics", instance="10.100.102.157:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-qmj45", pod_name="service1-5b8cc8d8bd-qmj45", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.202.221:4001/metrics, labels={endpoint="metrics", instance="10.100.202.221:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-nh5g6", pod_name="service1-5b8cc8d8bd-nh5g6", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.37.102:4001/metrics, labels={endpoint="metrics", instance="10.100.37.102:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-wrxh7", pod_name="service1-5b8cc8d8bd-wrxh7", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.62.98:4001/metrics, labels={endpoint="metrics", instance="10.100.62.98:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-p7lmt", pod_name="service1-5b8cc8d8bd-p7lmt", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.85.213:4001/metrics, labels={endpoint="metrics", instance="10.100.85.213:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-wvc4h", pod_name="service1-5b8cc8d8bd-wvc4h", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.111.22:9404/metrics, labels={endpoint="service2xp", instance="10.100.111.22:9404", job="service2", namespace="namespace2", pod="service2-2", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.56.246:9404/metrics, labels={endpoint="service2xp", instance="10.100.56.246:9404", job="service2", namespace="namespace2", pod="service2-0", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.71.101:9404/metrics, labels={endpoint="service2xp", instance="10.100.71.101:9404", job="service2", namespace="namespace2", pod="service2-1", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z error VictoriaMetrics/lib/promscrape/scraper.go:269 skipping duplicate scrape target with identical labels; endpoint=http://10.100.111.22:9141/metrics, labels={endpoint="service2xp", instance="10.100.111.22:9141", job="service2", namespace="namespace2", pod="service2-2", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.259Z info VictoriaMetrics/lib/promscrape/scraper.go:303 kubernetes_sd_configs: added targets: 3390, removed targets: 0; total targets: 3390
2020-11-02T07:04:08.546Z error VictoriaMetrics/lib/promscrape/scrapework.go:199 error when scraping "http://10.100.65.73:4001/actuator/prometheus" from job "service3" with labels {endpoint="metrics", instance="10.100.65.73:4001", job="service3", namespace="namespace3", pod="service3-6566dddcd4-6c94b", prometheus="victoria-metrics/example-vmagent", service="service3"}: error when scraping "http://10.100.65.73:4001/actuator/prometheus": dial tcp4 10.100.65.73:4001: connect: connection refused
2020-11-02T07:04:08.555Z error VictoriaMetrics/lib/promscrape/scrapework.go:199 error when scraping "http://10.100.106.90:4001/metrics" from job "service4" with labels {endpoint="metrics", instance="10.100.106.90:4001", job="service4", namespace="namespace5", pod="service4-685b74d859-m858z", pod_name="service4-685b74d859-m858z", prometheus="victoria-metrics/example-vmagent", service="service4"}: unexpected status code returned when scraping "http://10.100.106.90:4001/metrics": 404; expecting 200; response body: "{\"timestamp\":1604300648554,\"status\":404,\"error\":\"Not Found\",\"message\":\"No message available\",\"path\":\"/metrics\"}"
2020-11-02T07:04:08.757Z error VictoriaMetrics/lib/promscrape/scrapework.go:199 error when scraping "http://10.100.101.105:4001/actuator/prometheus" from job "service5" with labels {endpoint="metrics", instance="10.100.101.105:4001", job="service5", namespace="namespace6", pod="service5-575b5655d5-xhflq", prometheus="victoria-metrics/example-vmagent", service="service5"}: unexpected status code returned when scraping "http://10.100.101.105:4001/actuator/prometheus": 404; expecting 200; response body: "{\"timestamp\":\"2020-11-02T07:04:08.757+0000\",\"status\":404,\"error\":\"Not Found\",\"message\":\"No message available\",\"path\":\"/actuator/prometheus\"}"
FYI, vmagent should expose /ready http handler starting from commit 4c808d58bfb737a528a8e9ee0ed022d2ea9a412b . This handler returns 200 OK status code when all the service discovery configs are initialized.
For kubernetes it can be used with following probe config:
startupProbe:
httpGet:
path: /ready
port: http
failureThreshold: 30
periodSeconds: 15
FYI, /ready endpoint is available starting from v1.46.0.
Most helpful comment
FYI, vmagent should expose
/readyhttp handler starting from commit 4c808d58bfb737a528a8e9ee0ed022d2ea9a412b . This handler returns200 OKstatus code when all the service discovery configs are initialized.