Victoriametrics: vmagent: some targets not scraped or metrics not written anymore in v1.40.1

Created on 8 Sep 2020  路  8Comments  路  Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
After upgrading, some metrics aren't fed to storage anymore, not sure if not scraped or not written
The number of targets reported by the agent doesn't change much after upgrade (and the gap is probably due to kubernetes scaling pods)

To Reproduce
Upgrade from v1.40.0 to v1.40.1

Number of targets for this agent ~3000
All targets are kubernetes pods
Sample rate ~300k

Expected behavior
Get the same series as previous version

Version

/ # ./vmagent-prod --version
vmagent-20200903-232636-tags-v1.40.1-cluster-0-gd2f6f96e4

Used command-line flags

flag{name="csvTrimTimestamp", value="1ms"} 1
flag{name="dryRun", value="false"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="envflag.enable", value="false"} 1
flag{name="envflag.prefix", value=""} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="graphiteListenAddr", value=""} 1
flag{name="graphiteTrimTimestamp", value="1s"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="http.maxGracefulShutdownDuration", value="7s"} 1
flag{name="http.pathPrefix", value=""} 1
flag{name="http.shutdownDelay", value="0s"} 1
flag{name="httpListenAddr", value=":8429"} 1
flag{name="import.maxLineLen", value="104857600"} 1
flag{name="influxListenAddr", value=""} 1
flag{name="influxMeasurementFieldSeparator", value="_"} 1
flag{name="influxSkipMeasurement", value="false"} 1
flag{name="influxSkipSingleField", value="false"} 1
flag{name="influxTrimTimestamp", value="1ms"} 1
flag{name="insert.maxQueueDuration", value="1m0s"} 1
flag{name="loggerErrorsPerSecondLimit", value="10"} 1
flag{name="loggerFormat", value="default"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stderr"} 1
flag{name="maxConcurrentInserts", value="24"} 1
flag{name="maxInsertRequestSize", value="33554432"} 1
flag{name="memory.allowedBytes", value="0"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="opentsdbHTTPListenAddr", value=""} 1
flag{name="opentsdbListenAddr", value=""} 1
flag{name="opentsdbTrimTimestamp", value="1s"} 1
flag{name="opentsdbhttp.maxInsertRequestSize", value="33554432"} 1
flag{name="opentsdbhttpTrimTimestamp", value="1ms"} 1
flag{name="promscrape.config", value="/etc/prometheus/prometheus.yml"} 1
flag{name="promscrape.config.dryRun", value="false"} 1
flag{name="promscrape.config.strictParse", value="false"} 1
flag{name="promscrape.configCheckInterval", value="0s"} 1
flag{name="promscrape.consulSDCheckInterval", value="30s"} 1
flag{name="promscrape.disableCompression", value="false"} 1
flag{name="promscrape.disableKeepAlive", value="false"} 1
flag{name="promscrape.discovery.concurrency", value="100"} 1
flag{name="promscrape.discovery.concurrentWaitTime", value="1m0s"} 1
flag{name="promscrape.dnsSDCheckInterval", value="30s"} 1
flag{name="promscrape.ec2SDCheckInterval", value="1m0s"} 1
flag{name="promscrape.fileSDCheckInterval", value="30s"} 1
flag{name="promscrape.gceSDCheckInterval", value="1m0s"} 1
flag{name="promscrape.kubernetesSDCheckInterval", value="30s"} 1
flag{name="promscrape.maxScrapeSize", value="167772160"} 1
flag{name="promscrape.suppressScrapeErrors", value="true"} 1
flag{name="remoteWrite.basicAuth.password", value="secret"} 1
flag{name="remoteWrite.basicAuth.username", value=""} 1
flag{name="remoteWrite.bearerToken", value="secret"} 1
flag{name="remoteWrite.decimalPlaces", value="0"} 1
flag{name="remoteWrite.flushInterval", value="1s"} 1
flag{name="remoteWrite.label", value=""} 1
flag{name="remoteWrite.maxBlockSize", value="33554432"} 1
flag{name="remoteWrite.maxDiskUsagePerURL", value="0"} 1
flag{name="remoteWrite.proxyURL", value=""} 1
flag{name="remoteWrite.queues", value="8"} 1
flag{name="remoteWrite.relabelConfig", value=""} 1
flag{name="remoteWrite.sendTimeout", value="1m0s"} 1
flag{name="remoteWrite.showURL", value="false"} 1
flag{name="remoteWrite.tlsCAFile", value=""} 1
flag{name="remoteWrite.tlsCertFile", value=""} 1
flag{name="remoteWrite.tlsInsecureSkipVerify", value="false"} 1
flag{name="remoteWrite.tlsKeyFile", value="secret"} 1
flag{name="remoteWrite.tlsServerName", value=""} 1
flag{name="remoteWrite.tmpDataPath", value="vmagent-remotewrite-data"} 1
flag{name="remoteWrite.url", value="secret"} 1
flag{name="remoteWrite.urlRelabelConfig", value=""} 1
flag{name="tls", value="false"} 1
flag{name="tlsCertFile", value=""} 1
flag{name="tlsKeyFile", value="secret"} 1
flag{name="version", value="false"} 1
bug vmagent

All 8 comments

@clmssz , could you look into error log for vmagent? Probably it contains some useful info?

Could you also pass -promscrape.suppressScrapeErrors=false command-line flag in order to see scrape errors in vmagent logs? The last error for each scrape target is also available at http://vmagent:8429/targets page. It would be great if you could compare the output of this page for vmagent v1.40.0 and v1.40.1 . Probably vmagent v1.40.1 fails to scrape certain targets due to some errors?

Could you also compare graphs from the official Grafana dashboard for vmagent between v1.40.0 and v1.40.1 ? Probably there is some visible difference between v1.40.0 and v1.40.1?

We are experiencing the same issue.

We can confirm that it's not a scrapping problem - vmagent does scrape the targets. Also, there is no information in the logs - logs are mute.
After much debugging, our best guess was that vmagent was simply truncating the amount of data being sent to the remoteWrite

We saw something like:

vmagent 1.40.1

scrape_samples_post_metric_relabeling{account="my_cluster", job="static_configs"} = 143K
count({account="my_cluster", job="static_configs"}) = 6K

vmagent 1.40.0

scrape_samples_post_metric_relabeling{account="my_cluster", job="static_configs"} = 143K
count({account="my_cluster", job="static_configs"}) = 143K

Note that, from 1.40.1 to 1.40.0 - the only thing we changed was the version of vmagent. We did not changed any parameters or config.
The logs of vmagent are virtually the same for both versions.


vmagent logs

vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:11 build version: vmagent-20200815-133831-tags-v1.40.0-cluster-0-gd9f7ea1c6
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:12 command line flags
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "csvTrimTimestamp" = "1ms"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "dryRun" = "false"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "enableTCP6" = "false"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "envflag.enable" = "false"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "envflag.prefix" = ""
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "fs.disableMmap" = "false"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "graphiteListenAddr" = ""
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "graphiteTrimTimestamp" = "1s"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "http.disableResponseCompression" = "false"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "http.maxGracefulShutdownDuration" = "7s"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "http.pathPrefix" = ""
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "http.shutdownDelay" = "0s"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "httpListenAddr" = ":8429"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "import.maxLineLen" = "104857600"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "influxListenAddr" = ""
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "influxMeasurementFieldSeparator" = "_"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "influxSkipMeasurement" = "false"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "influxSkipSingleField" = "false"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "influxTrimTimestamp" = "1ms"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "insert.maxQueueDuration" = "1m0s"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "loggerErrorsPerSecondLimit" = "10"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "loggerFormat" = "default"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "loggerLevel" = "INFO"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "loggerOutput" = "stderr"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "maxConcurrentInserts" = "64"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "maxInsertRequestSize" = "8388608"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "memory.allowedBytes" = "0"
vmagent 2020-09-11T13:28:31.650Z info VictoriaMetrics/lib/logger/flag.go:20 flag "memory.allowedPercent" = "60"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "opentsdbHTTPListenAddr" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "opentsdbListenAddr" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "opentsdbTrimTimestamp" = "1s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "opentsdbhttp.maxInsertRequestSize" = "33554432"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "opentsdbhttpTrimTimestamp" = "1ms"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.config" = "/etc/prometheus/prometheus.yml"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.config.dryRun" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.config.strictParse" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.configCheckInterval" = "0s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.consulSDCheckInterval" = "30s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.disableCompression" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.disableKeepAlive" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.discovery.concurrency" = "100"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.discovery.concurrentWaitTime" = "1m0s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.dnsSDCheckInterval" = "30s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.ec2SDCheckInterval" = "1m0s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.fileSDCheckInterval" = "30s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.gceSDCheckInterval" = "1m0s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.kubernetesSDCheckInterval" = "30s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.maxScrapeSize" = "67108864"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "promscrape.suppressScrapeErrors" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.basicAuth.password" = "secret"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.basicAuth.username" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.bearerToken" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.decimalPlaces" = "0"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.flushInterval" = "30s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.label" = "account=my_cluster"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.maxBlockSize" = "33554432"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.maxDiskUsagePerURL" = "0"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.proxyURL" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.queues" = "2"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.relabelConfig" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.sendTimeout" = "1m0s"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.showURL" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.tlsCAFile" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.tlsCertFile" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.tlsInsecureSkipVerify" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.tlsKeyFile" = "secret"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.tlsServerName" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.tmpDataPath" = "vmagent-remotewrite-data"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.url" = "https://insert.myhostserver.com/insert/1/prometheus"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "remoteWrite.urlRelabelConfig" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "tls" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "tlsCertFile" = ""
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "tlsKeyFile" = "secret"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/logger/flag.go:20 flag "version" = "false"
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/cgroup/cpu.go:26 updating GOMAXPROCS to 2 according to cgroup CPU quota
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/app/vmagent/main.go:80 starting vmagent at ":8429"...
vmagent 2020-09-11T13:28:31.651Z info VictoriaMetrics/lib/memory/memory.go:45 limiting caches to 5153960755 bytes, leaving 3435973837 bytes to the OS according to -memory.allowedPercent=60.000000 and -memory.allowedBytes=0
vmagent 2020-09-11T13:28:31.655Z info VictoriaMetrics/lib/persistentqueue/fastqueue.go:46 opened fast persistent queue at "vmagent-remotewrite-data/persistent-queue/D34B67C161FE6048" with maxInmemoryBlocks=200
vmagent 2020-09-11T13:28:31.655Z info VictoriaMetrics/app/vmagent/remotewrite/client.go:125 initialized client for -remoteWrite.url="https://insert.myhostserver.com/insert/1/prometheus"
vmagent 2020-09-11T13:28:31.655Z info VictoriaMetrics/app/vmagent/main.go:102 started vmagent in 0.004 seconds
vmagent 2020-09-11T13:28:31.655Z info VictoriaMetrics/lib/httpserver/httpserver.go:73 starting http server at http://:8429/
vmagent 2020-09-11T13:28:31.655Z info VictoriaMetrics/lib/promscrape/scraper.go:78 reading Prometheus configs from "/etc/prometheus/prometheus.yml"
vmagent 2020-09-11T13:28:31.655Z info VictoriaMetrics/lib/httpserver/httpserver.go:74 pprof handlers are exposed at http://:8429/debug/pprof/
vmagent 2020-09-11T13:28:31.657Z info VictoriaMetrics/lib/promscrape/scraper.go:303 static_configs: added targets: 1, removed targets: 0; total targets: 1

I suspect that the issue may be related to 7785869ccc4d4ff4dc80ad9f33442f6c5b369c72 .

@PerGon , could you build vmagent from the commit 5af777469acff2f216f869cd6412cbba861f95c1 (the previous commit before 7785869ccc4d4ff4dc80ad9f33442f6c5b369c72) and verify, whether it fixes the issue? If the issue isn't reproduced there, then could you build vmagent from 7785869ccc4d4ff4dc80ad9f33442f6c5b369c72 and verify whether the issue is reproducible in it? See build instructions for vmagent.

@valyala your suspicion seems to be correct!

I've compiled both commits you referred and had them running in parallel using the exact same config.

This was the results:
image

@clmssz , @PerGon , thanks for your help! It looks like the root cause of the bug has been detected and fixed in the commit 7d893a234ccda2f34d1c06f0caec4ab53b58c027 . Could you verify whether vmagent built from this commit is free from the bug? Then the commit will be included in the next release.

@valyala sorry for the delay, was planning to continue debugging tonight but just saw the updates. I deployed the 7d893a2 commit, and so far it seems to have fix the issue, I will keep you updated if something goes south. Thanks ! To you and @PerGon :)

I just ran vmagents compiled with all the commits referred above, and here's the results:

image

So the issue seems to be fixed indeed! Thanks 馃檱

The bugfix has been included in vmagent v1.41.0. Closing the issue as fixed.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

v98765 picture v98765  路  3Comments

jelmd picture jelmd  路  3Comments

valyala picture valyala  路  4Comments

faceair picture faceair  路  3Comments

sh0rez picture sh0rez  路  3Comments