We tried to replace prometheus by vmagent, but vmagent flooded our consul cluster almost to death, so we had to switch back to prometheus for now.
I tcpdump'ed a little, and found two probable causes.
prometheus uses the /v1/catalog/service endpoint to fetch each service's data.
vmagent uses the /v1/health/service endpoint.
The health/service endpoint probably puts more load on Consul than catalog/service. Maybe it makes sense to switch vmagent to catalog/service?
vmagent doesn't use blocking queries: https://www.consul.io/api-docs/features/blocking
Using them could also reduce the load on Consul. I suggest implementing this.
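For reference, a blocking query passes the last seen X-Consul-Index value back as the index parameter together with a wait duration, so Consul holds the request open until the data changes or the wait expires. A minimal Python sketch of that loop follows. The fetch callable, the base URL and the function names are hypothetical stand-ins for illustration, not vmagent or Prometheus code:

```python
from typing import Callable, Iterator, Tuple
from urllib.parse import urlencode

# fetch(url) -> (index, body): hypothetical stand-in for an HTTP GET that
# returns the X-Consul-Index response header value and the decoded body.
Fetch = Callable[[str], Tuple[int, object]]


def blocking_url(base: str, path: str, index: int, wait: str = "50s") -> str:
    """Build a blocking-query URL for the given last-seen index."""
    query = urlencode({"index": index, "wait": wait})
    return f"{base}{path}?{query}"


def watch(fetch: Fetch, base: str, path: str, rounds: int) -> Iterator[object]:
    """Yield the body whenever the index changes, for at most `rounds` requests."""
    index = 0  # an index of 0 is served as a regular, non-blocking request
    for _ in range(rounds):
        new_index, body = fetch(blocking_url(base, path, index))
        if new_index < index:
            # The Consul docs advise resetting when the index goes backwards.
            index = 0
            continue
        if new_index != index:
            index = new_index
            yield body
```

With this pattern an unchanged service costs one request per wait period instead of one request per polling interval.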
Confirmed the issue. Try the following workarounds until the issue is fixed by using blocking queries:
Set allow_stale: true in consul_sd_config.
Increase the -promscrape.consulSDCheckInterval command-line flag value. By default it is set to 30 seconds, which means that vmagent queries Consul every 30 seconds. Note that vmagent ignores refresh_interval set in consul_sd_config.
Another option for reducing load on Consul is to reduce the -promscrape.discovery.concurrency command-line flag value. By default it is set to 500. Try reducing it to 100.
@wf1nder , could you build vmagent from the commit e149019c00dcb053e5324f874ce0b69667d9e5bf according to these instructions and verify whether it reduces load on Consul for your case?
This commit enabled background refresh caching in Consul when querying information for each target. See https://www.consul.io/api/features/caching for details.
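As a rough illustration of what such a request looks like, the Consul caching docs describe opting in with the cached query parameter and an optional Cache-Control: max-age header. The Python sketch below only builds the request target and headers. The helper name and the 30-second max-age are assumptions for illustration and are not taken from the commit:

```python
from urllib.parse import quote


def cached_health_request(service: str, max_age_seconds: int):
    """Return (request target, headers) for an agent-cached health lookup."""
    # `cached` asks the local Consul agent to answer from its cache.
    target = f"/v1/health/service/{quote(service, safe='')}?cached"
    # max-age bounds how old a cached response the client will accept.
    headers = {"Cache-Control": f"max-age={max_age_seconds}"}
    return target, headers


target, headers = cached_health_request("servicename", 30)
print(target)
print(headers)
```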
FYI, both commits mentioned above are included in v1.37.2.
Thank you!
We are already using the allow_stale: true option in all configurations, and -promscrape.consulSDCheckInterval had already been increased to 60s before :)
I'm starting to test the new release of vmagent, which includes the decreased discovery concurrency and the background refresh feature. Testing may take a while, because we will first test in the testing environment and then in production. I expect to post results here in 1-3 days.
Thanks for the update! Waiting for results then...
Ok, I already have some results.
On version 1.37.2 the load on Consul is lower than on version 1.37.0, but it is still higher than with Prometheus.
Unfortunately I don't have any relevant metric to quantify the load. But there is an indirect indicator.
The legacy setup has 4 Prometheus instances (2 identical installations with 2 sharded Prometheus hosts each), all performing service discovery against the Consul cluster. This works fine, without any issues on the Consul side.
There is no need to shard vmagent in our testing env, so we can run fewer vmagent instances.
On version 1.37.0, problems start with just two vmagent instances in total: random clients occasionally can't perform requests to the Consul cluster. Adding one more vmagent instance (2 to 3) increases those errors significantly.
On version 1.37.2 with 2 vmagent instances there are no errors on Consul client requests and everything works fine. But with one more vmagent instance (2 to 3) the errors appear again.
I also tried to decrease the -promscrape.discovery.concurrency option from 100 to 30, but it didn't have a noticeable effect.
In the prod environment we still need sharding even with vmagent, so there will be more than 2 vmagent instances, and this will cause problems for Consul.
I assume that switching from the /v1/health/service endpoint to /v1/catalog/service may decrease the load on Consul. For the health endpoint, the Consul agent merges the service information with its health status, so it performs more work per response.
I assume it is not a problem if vmagent doesn't know the health status of services and tries to scrape all of them, even the unhealthy ones that can't respond, like Prometheus does. But I'm not sure about this.
Thanks for the valuable info! I'll try switching to blocking queries, since it looks like background refresh caching still has performance issues. As for switching from /v1/health/service to /v1/catalog/service, it looks like the latest Prometheus versions use /v1/health/service - see the corresponding code.
@wf1nder , could you try building vmagent from the commit 5009b25a03837303050bde957ab0b52d523ea3a5 and verify whether it leads to lower load on Consul and whether it properly discovers Consul targets? This commit uses long polling when requesting Consul like Prometheus does. See https://www.consul.io/api-docs/features/blocking for details on long polling aka blocking requests.
vmagent and single-node VictoriaMetrics have been switched to long polling (aka blocking requests) for Consul service discovery starting from v1.49.0. This should reduce load on Consul when discovering a large number of scrape targets. @wf1nder , could you upgrade vmagent to v1.49.0 and verify this?
It seems there is now a bug in the interaction with Consul.
vmagent version 1.48.0 queries Consul with requests like the following:
GET /v1/catalog/services?dc=dc1&stale HTTP/1.1
In vmagent version 1.49.0 the GET parameter dc has been replaced with sdc:
GET /v1/catalog/services?sdc=dc1&stale&index=796431813&wait=50s HTTP/1.1
Per-service requests also contain sdc instead of dc:
GET /v1/health/service/servicename?sdc=dc1&stale&index=796059909&wait=50s HTTP/1.1
Consul ignores the unknown sdc parameter, and as a result it returns information about services only from the current datacenter.
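The difference can be checked mechanically by parsing the two request targets quoted above. The Python sketch below uses only the standard library. The helper name datacenter_param is made up for illustration:

```python
from urllib.parse import parse_qs, urlsplit


def datacenter_param(request_target: str):
    """Return the value of the `dc` query parameter, or None if it is absent."""
    query = urlsplit(request_target).query
    # keep_blank_values keeps flag-style parameters such as `stale`.
    params = parse_qs(query, keep_blank_values=True)
    values = params.get("dc")
    return values[0] if values else None


old = "/v1/catalog/services?dc=dc1&stale"
new = "/v1/catalog/services?sdc=dc1&stale&index=796431813&wait=50s"
print(datacenter_param(old))  # dc1
print(datacenter_param(new))  # None
```

Since the 1.49.0 request carries no dc parameter at all, the request is answered for the local datacenter only, which matches the behaviour described above.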
The bug with incorrect filtering on datacenter should be fixed in the commit fd9fd191b91ef4c9965ecb0a73ee7dd0161c8d3e .
I confirm that the bug with the dc/sdc parameters is gone in vmagent built from master, and scraping now works fine.
I also see that vmagent is now using long polling, and at first glance this reduces the load on Consul. Let me check it for a couple of days, and I will report the results.
@wf1nder , the commit 5f9d88a3cbff581bd331957e274ca9c565f4f1ad should increase wait time for blocking API responses from 50 seconds to 525 seconds. This should reduce load on Consul by up to 10x. Could you build vmagent from this commit and verify it on your workload?
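The "up to 10x" estimate corresponds to the idle case: when nothing changes, a blocking query returns only when its wait time expires, so each watched endpoint costs roughly one request per wait period. A back-of-the-envelope sketch in Python, assuming a hypothetical 1000 watched services with no changes (the service count is illustrative, not a measurement from this issue):

```python
def idle_requests_per_hour(watched: int, wait_seconds: float) -> float:
    """Requests per hour when every blocking query times out without changes."""
    return watched * 3600 / wait_seconds


before = idle_requests_per_hour(1000, 50)   # 50s wait
after = idle_requests_per_hour(1000, 525)   # 525s wait
print(before, round(after), round(before / after, 1))
```

The ratio is simply 525 / 50 = 10.5. Services that change frequently return before the wait expires, so the real reduction depends on the workload.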