Victoriametrics: High startup time for vmagent

Created on 2 Nov 2020  路  3Comments  路  Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
High startup time for vmagent pod to start scraping in k8s setup, while readiness probe responds with OK before scraping has started. This results in loss of metrics in case of pod rotations.

To Reproduce

  1. Create a vmagent pod, which scrapes for a considerable number of targets (3.3k in this case, can be lower).
  2. Wait for the pod to start scraping (can be checked from logs).
  3. Once started, rotate the pod with another vmagent pod instance.
    You would notice that the older pod has been removed before the new pod starts scraping, and the new pod's readiness endpoint responds with OK.

Expected behavior
The new pod's readiness probe endpoint should not mark the pod as ready until it has started scraping the targets. A log line
kubernetes_sd_configs: added targets: 1, removed targets: 0; total targets: 1 marks the start of scraping.

Version
The line returned when passing --version command line flag to binary. For example:

$ /vmagent-prod -version
vmagent-20200911-222705-tags-v1.41.0-cluster-0-gca08161b5

Additional context
In this case, I had 3.4k scrape targets, and the pod startup time was ~3 minutes.

2020-11-02T07:01:18.395Z    info    VictoriaMetrics/app/vmagent/main.go:102 started vmagent in 0.006 seconds
2020-11-02T07:01:18.396Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:78 starting http server at http://:8429/
2020-11-02T07:01:18.396Z    info    VictoriaMetrics/lib/httpserver/httpserver.go:79 pprof handlers are exposed at http://:8429/debug/pprof/
2020-11-02T07:01:18.396Z    info    VictoriaMetrics/lib/promscrape/scraper.go:78    reading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2020-11-02T07:01:22.057Z    info    VictoriaMetrics/lib/promscrape/scraper.go:106   SIGHUP received; reloading Prometheus configs from "/etc/vmagent/config_out/vmagent.env.yaml"
2020-11-02T07:01:22.057Z    info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:105  SIGHUP received; reloading relabel configs pointed by -remoteWrite.relabelConfig and -remoteWrite.urlRelabelConfig
2020-11-02T07:01:22.057Z    info    VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:112  Successfully reloaded relabel configs
2020-11-02T07:01:22.204Z    info    VictoriaMetrics/lib/promscrape/scraper.go:113   nothing changed in "/etc/vmagent/config_out/vmagent.env.yaml"
2020-11-02T07:04:08.208Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.100.194:4001/metrics, labels={endpoint="metrics", instance="10.100.100.194:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-8249p", pod_name="service1-5b8cc8d8bd-8249p", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.102.157:4001/metrics, labels={endpoint="metrics", instance="10.100.102.157:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-qmj45", pod_name="service1-5b8cc8d8bd-qmj45", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.202.221:4001/metrics, labels={endpoint="metrics", instance="10.100.202.221:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-nh5g6", pod_name="service1-5b8cc8d8bd-nh5g6", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.37.102:4001/metrics, labels={endpoint="metrics", instance="10.100.37.102:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-wrxh7", pod_name="service1-5b8cc8d8bd-wrxh7", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.62.98:4001/metrics, labels={endpoint="metrics", instance="10.100.62.98:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-p7lmt", pod_name="service1-5b8cc8d8bd-p7lmt", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.208Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.85.213:4001/metrics, labels={endpoint="metrics", instance="10.100.85.213:4001", job="service1", namespace="namespace1", pod="service1-5b8cc8d8bd-wvc4h", pod_name="service1-5b8cc8d8bd-wvc4h", prometheus="victoria-metrics/example-vmagent", service="service1"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.111.22:9404/metrics, labels={endpoint="service2xp", instance="10.100.111.22:9404", job="service2", namespace="namespace2", pod="service2-2", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.56.246:9404/metrics, labels={endpoint="service2xp", instance="10.100.56.246:9404", job="service2", namespace="namespace2", pod="service2-0", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.71.101:9404/metrics, labels={endpoint="service2xp", instance="10.100.71.101:9404", job="service2", namespace="namespace2", pod="service2-1", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.213Z    error   VictoriaMetrics/lib/promscrape/scraper.go:269   skipping duplicate scrape target with identical labels; endpoint=http://10.100.111.22:9141/metrics, labels={endpoint="service2xp", instance="10.100.111.22:9141", job="service2", namespace="namespace2", pod="service2-2", prometheus="victoria-metrics/example-vmagent", service="service2"}; make sure service discovery and relabeling is set up properly
2020-11-02T07:04:08.259Z    info    VictoriaMetrics/lib/promscrape/scraper.go:303   kubernetes_sd_configs: added targets: 3390, removed targets: 0; total targets: 3390
2020-11-02T07:04:08.546Z    error   VictoriaMetrics/lib/promscrape/scrapework.go:199    error when scraping "http://10.100.65.73:4001/actuator/prometheus" from job "service3" with labels {endpoint="metrics", instance="10.100.65.73:4001", job="service3", namespace="namespace3", pod="service3-6566dddcd4-6c94b", prometheus="victoria-metrics/example-vmagent", service="service3"}: error when scraping "http://10.100.65.73:4001/actuator/prometheus": dial tcp4 10.100.65.73:4001: connect: connection refused
2020-11-02T07:04:08.555Z    error   VictoriaMetrics/lib/promscrape/scrapework.go:199    error when scraping "http://10.100.106.90:4001/metrics" from job "service4" with labels {endpoint="metrics", instance="10.100.106.90:4001", job="service4", namespace="namespace5", pod="service4-685b74d859-m858z", pod_name="service4-685b74d859-m858z", prometheus="victoria-metrics/example-vmagent", service="service4"}: unexpected status code returned when scraping "http://10.100.106.90:4001/metrics": 404; expecting 200; response body: "{\"timestamp\":1604300648554,\"status\":404,\"error\":\"Not Found\",\"message\":\"No message available\",\"path\":\"/metrics\"}"
2020-11-02T07:04:08.757Z    error   VictoriaMetrics/lib/promscrape/scrapework.go:199    error when scraping "http://10.100.101.105:4001/actuator/prometheus" from job "service5" with labels {endpoint="metrics", instance="10.100.101.105:4001", job="service5", namespace="namespace6", pod="service5-575b5655d5-xhflq", prometheus="victoria-metrics/example-vmagent", service="service5"}: unexpected status code returned when scraping "http://10.100.101.105:4001/actuator/prometheus": 404; expecting 200; response body: "{\"timestamp\":\"2020-11-02T07:04:08.757+0000\",\"status\":404,\"error\":\"Not Found\",\"message\":\"No message available\",\"path\":\"/actuator/prometheus\"}"
enhancement vmagent

Most helpful comment

FYI, vmagent should expose /ready http handler starting from commit 4c808d58bfb737a528a8e9ee0ed022d2ea9a412b . This handler returns 200 OK status code when all the service discovery configs are initialized.

All 3 comments

FYI, vmagent should expose /ready http handler starting from commit 4c808d58bfb737a528a8e9ee0ed022d2ea9a412b . This handler returns 200 OK status code when all the service discovery configs are initialized.

For kubernetes it can be used with following probe config:

startupProbe:
  httpGet:
    path: /ready
    port: http
  failureThreshold: 30
  periodSeconds: 15

FYI, /ready endpoint is available starting from v1.46.0.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

prdatur picture prdatur  路  3Comments

sh0rez picture sh0rez  路  3Comments

oOHenry picture oOHenry  路  4Comments

WilliamDahlen picture WilliamDahlen  路  3Comments

valyala picture valyala  路  4Comments