Victoriametrics: Victoria-metrics data usage started growing

Created on 14 Aug 2020 · 18 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
Hello,
I'm trying to identify a problem where VM storage usage suddenly began to grow.
VM had been running for about half a year without any problems; then we noticed that at some point (July 3) storage usage began to grow. Yet there is no visible increase in write requests to the VM write endpoint, and there doesn't seem to be any significant increase in new series either.

Expected behavior
A linear increase in used storage.

Screenshots
Used storage suddenly begins increasing:
Screenshot 2020-08-14 at 08 15 34
No visible increase of requests to write endpoint:
Screenshot 2020-08-14 at 08 16 42
No significant increase in new time series:
Screenshot 2020-08-14 at 08 17 33

Version
The line returned when passing --version command line flag to binary. For example:

$ ./victoria-metrics-prod --version
victoria-metrics-20200521-144539-tags-v1.35.6-0-g8905bc2a4

Used command-line flags

flag{name="bigMergeConcurrency", value="0"} 1
flag{name="csvTrimTimestamp", value="1ms"} 1
flag{name="dedup.minScrapeInterval", value="0s"} 1
flag{name="deleteAuthKey", value="secret"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="envflag.enable", value="false"} 1
flag{name="envflag.prefix", value=""} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="graphiteListenAddr", value="0.0.0.0:2003"} 1
flag{name="graphiteTrimTimestamp", value="1s"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="http.maxGracefulShutdownDuration", value="7s"} 1
flag{name="http.pathPrefix", value=""} 1
flag{name="http.shutdownDelay", value="0s"} 1
flag{name="httpAuth.password", value="secret"} 1
flag{name="httpAuth.username", value=""} 1
flag{name="httpListenAddr", value="0.0.0.0:8428"} 1
flag{name="import.maxLineLen", value="104857600"} 1
flag{name="influxListenAddr", value=""} 1
flag{name="influxMeasurementFieldSeparator", value="_"} 1
flag{name="influxSkipSingleField", value="false"} 1
flag{name="influxTrimTimestamp", value="1ms"} 1
flag{name="insert.maxQueueDuration", value="1m0s"} 1
flag{name="loggerFormat", value="default"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stderr"} 1
flag{name="maxConcurrentInserts", value="224"} 1
flag{name="maxInsertRequestSize", value="335544320"} 1
flag{name="maxLabelsPerTimeseries", value="30"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="metricsAuthKey", value="secret"} 1
flag{name="opentsdbHTTPListenAddr", value=""} 1
flag{name="opentsdbListenAddr", value=""} 1
flag{name="opentsdbTrimTimestamp", value="1s"} 1
flag{name="opentsdbhttp.maxInsertRequestSize", value="33554432"} 1
flag{name="opentsdbhttpTrimTimestamp", value="1ms"} 1
flag{name="pprofAuthKey", value="secret"} 1
flag{name="precisionBits", value="64"} 1
flag{name="promscrape.config", value=""} 1
flag{name="promscrape.config.dryRun", value="false"} 1
flag{name="promscrape.config.strictParse", value="false"} 1
flag{name="promscrape.configCheckInterval", value="0s"} 1
flag{name="promscrape.consulSDCheckInterval", value="30s"} 1
flag{name="promscrape.disableCompression", value="false"} 1
flag{name="promscrape.discovery.concurrency", value="500"} 1
flag{name="promscrape.discovery.concurrentWaitTime", value="1m0s"} 1
flag{name="promscrape.dnsSDCheckInterval", value="30s"} 1
flag{name="promscrape.ec2SDCheckInterval", value="1m0s"} 1
flag{name="promscrape.fileSDCheckInterval", value="30s"} 1
flag{name="promscrape.gceSDCheckInterval", value="1m0s"} 1
flag{name="promscrape.kubernetesSDCheckInterval", value="30s"} 1
flag{name="promscrape.maxScrapeSize", value="16777216"} 1
flag{name="promscrape.suppressScrapeErrors", value="false"} 1
flag{name="retentionPeriod", value="12"} 1
flag{name="search.cacheTimestampOffset", value="5m0s"} 1
flag{name="search.disableCache", value="false"} 1
flag{name="search.latencyOffset", value="30s"} 1
flag{name="search.logSlowQueryDuration", value="5s"} 1
flag{name="search.maxConcurrentRequests", value="16"} 1
flag{name="search.maxExportDuration", value="720h0m0s"} 1
flag{name="search.maxLookback", value="0s"} 1
flag{name="search.maxPointsPerTimeseries", value="30000"} 1
flag{name="search.maxQueryDuration", value="30s"} 1
flag{name="search.maxQueryLen", value="16384"} 1
flag{name="search.maxQueueDuration", value="10s"} 1
flag{name="search.maxStalenessInterval", value="0s"} 1
flag{name="search.maxTagKeys", value="secret"} 1
flag{name="search.maxTagValues", value="100000"} 1
flag{name="search.maxUniqueTimeseries", value="300000"} 1
flag{name="search.minStalenessInterval", value="0s"} 1
flag{name="search.resetCacheAuthKey", value="secret"} 1
flag{name="selfScrapeInstance", value="self"} 1
flag{name="selfScrapeInterval", value="0s"} 1
flag{name="selfScrapeJob", value="victoria-metrics"} 1
flag{name="smallMergeConcurrency", value="0"} 1
flag{name="snapshotAuthKey", value="secret"} 1
flag{name="storageDataPath", value="/var/lib/victoria_metrics"} 1
flag{name="tls", value="false"} 1
flag{name="tlsCertFile", value=""} 1
flag{name="tlsKeyFile", value="secret"} 1
flag{name="version", value="false"} 1

Additional context
The only thing that happened around that time is that we upgraded Prometheus from 2.16.0 to 2.19.2.

Is there any way to go deeper into debugging this? Thank you.

question

All 18 comments

It is recommended to set up the official Grafana dashboard for VictoriaMetrics according to these docs. The dashboard gives insights into the state of VictoriaMetrics, which may explain the data usage growth. For instance, the dashboard contains graphs of the data ingestion rate and of the number of data points in the database.

We have this dashboard. It does not show anything unusual though:
Screenshot 2020-08-17 at 08 45 43
Screenshot 2020-08-17 at 08 47 02

Hello, we are still unable to identify the problem. Today we decided to downgrade Prometheus to the last version with which space usage was good (2.16.0), and it seems that it helped: the datapoint count is still increasing, yet the increase in disk usage has slowed down:

Screenshot 2020-08-25 at 12 41 14

Hi @Seitanas! Could you pls update the dashboard to the latest version? It is recommended to update VM as well, but that is up to you. Could you pls also provide screenshots for the following panels:

  • Active time series
  • Requests rate (specifically api/v1/write endpoint)
  • Datapoints ingestion rate
  • Logging rate
  • Churn rate
  • Slow inserts

From the Prometheus side, it would be nice to get graphs for the following metrics:

  • rate(prometheus_remote_storage_samples_in_total)
  • rate(prometheus_remote_storage_succeeded_samples_total)
  • rate(prometheus_remote_storage_dropped_samples_total)

It would be great if the screenshots covered a time range that includes both Prometheus version switches.
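
For anyone who prefers raw data over screenshots, the same series can be fetched from the Prometheus HTTP API (`/api/v1/query_range`). The following is only a minimal sketch: the base URL, the helper name `build_query_range_url`, the step, and the `[5m]` window are assumptions for illustration. Note that PromQL in Prometheus requires a range selector inside `rate()`, so the expressions listed above need a window added before they can be evaluated there.

```python
from urllib.parse import urlencode

# Assumption: Prometheus is reachable at this address; adjust as needed.
PROMETHEUS_URL = "http://localhost:9090"

# Metric names are taken from the list above.
METRICS = [
    "prometheus_remote_storage_samples_in_total",
    "prometheus_remote_storage_succeeded_samples_total",
    "prometheus_remote_storage_dropped_samples_total",
]


def build_query_range_url(metric, start, end, step="1h", window="5m",
                          base=PROMETHEUS_URL):
    """Return a query_range URL for rate(<metric>[<window>]).

    The [window] range selector is an assumed value; pick one that is
    at least a few scrape intervals long.
    """
    query = f"rate({metric}[{window}])"
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Assumed time range covering the upgrade (3 July) and downgrade (25 August).
    for metric in METRICS:
        print(build_query_range_url(metric,
                                    "2020-06-15T00:00:00Z",
                                    "2020-09-15T00:00:00Z"))
```

The printed URLs can be opened with any HTTP client; the response is JSON with one series per label set.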

Hello @hagen1778,

Graphs:

Active time series:

Screenshot 2020-09-15 at 11 19 40

Requests rate:

Screenshot 2020-09-15 at 11 23 20

Datapoints ingestion rate:

Screenshot 2020-09-15 at 11 24 52

Logging rate:

Screenshot 2020-09-15 at 11 25 42

Churn rate:

Screenshot 2020-09-15 at 11 26 56

Slow inserts:

Screenshot 2020-09-15 at 11 27 59

rate(prometheus_remote_storage_samples_in_total):

Screenshot 2020-09-15 at 11 30 48

rate(prometheus_remote_storage_succeeded_samples_total):

Screenshot 2020-09-15 at 11 33 23

rate(prometheus_remote_storage_dropped_samples_total):

Screenshot 2020-09-15 at 11 32 17

_The Prometheus upgrade happened on 07.03, the downgrade on 08.25._

Thanks for the screenshots! I can't see anything suspicious except for the spike on the Requests rate panel, but the screenshot is missing labels, so I can't say what exactly the spike is.

Another guess is that the new Prometheus version started to write smaller batches or batches of a different size, resulting in small data parts on disk. This can be checked by looking into the storageDataPath folder and comparing the number of small and big parts for months 06 and 07.
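
One way to do this comparison is sketched below. It assumes the `data/<small|big>/<YYYY_MM>/` layout that appears later in this thread and assumes that every part is a subdirectory of its month directory; any helper directories stored alongside the parts would be counted as well. The function name `partition_stats` is hypothetical.

```python
from pathlib import Path


def partition_stats(data_path, kind, month):
    """Return (entry_count, total_bytes) for <data_path>/data/<kind>/<month>.

    entry_count is the number of immediate subdirectories (assumed to be
    parts); total_bytes is the summed size of all regular files below.
    """
    partition = Path(data_path) / "data" / kind / month
    entries = [p for p in partition.iterdir() if p.is_dir()]
    total = sum(f.stat().st_size for f in partition.rglob("*") if f.is_file())
    return len(entries), total
```

For example, `partition_stats("/var/lib/victoria_metrics", "big", "2020_07")` would return the count and size for the July big partition; running it for both months and both kinds gives the numbers to compare.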

Could you pls also check Prometheus shard metrics:

  • prometheus_remote_storage_shards
  • prometheus_remote_storage_pending_samples

Hello,

big/small usage is as follows:

6.1G    /var/lib/victoria_metrics/data/small/2020_06/
927G    /var/lib/victoria_metrics/data/big/2020_06/
20G     /var/lib/victoria_metrics/data/small/2020_07/
3.0T    /var/lib/victoria_metrics/data/big/2020_07/

prometheus_remote_storage_shards:

Screenshot 2020-09-16 at 09 24 50

prometheus_remote_storage_pending_samples:

Screenshot 2020-09-16 at 09 25 01

We also compared multiple Prometheus versions, and it seems that with Prometheus 2.18.0 and newer, prometheus_tsdb_storage_blocks_bytes also begins to grow:

https://github.com/prometheus/prometheus/issues/7846

Thanks! Could you also compare the number of big/small parts per month? Pls also plot the rate(vm_merges_total) and rate(vm_rows_merged_total) metrics for VM.

By parts, I assume you mean the file count in the big/small directories? If so, then:

/var/lib/victoria_metrics/data/small/2020_06/
9
/var/lib/victoria_metrics/data/small/2020_07/
12

/var/lib/victoria_metrics/data/big/2020_06/
20
/var/lib/victoria_metrics/data/big/2020_07/
34
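
As a rough cross-check, the sizes and counts reported in this thread can be combined. The sketch below assumes the sizes use `du -h` style units (powers of 1024) and that the counts are directory entries, which may include more than just parts. With those assumptions, July holds roughly 3.3 times as much data as June, while the entry count in `big` grew from 20 to 34.

```python
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def parse_size(text):
    """Convert a du -h style size such as '927G' or '3.0T' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


# Sizes and entry counts copied from the listings in this thread.
sizes = {
    ("small", "2020_06"): "6.1G",
    ("big", "2020_06"): "927G",
    ("small", "2020_07"): "20G",
    ("big", "2020_07"): "3.0T",
}
counts = {("big", "2020_06"): 20, ("big", "2020_07"): 34}

june = parse_size(sizes[("small", "2020_06")]) + parse_size(sizes[("big", "2020_06")])
july = parse_size(sizes[("small", "2020_07")]) + parse_size(sizes[("big", "2020_07")])
growth = july / june

# Average size per entry in the big partitions, in GiB.
avg_big_june = parse_size(sizes[("big", "2020_06")]) / counts[("big", "2020_06")] / 2**30
avg_big_july = parse_size(sizes[("big", "2020_07")]) / counts[("big", "2020_07")] / 2**30

print(f"July/June size ratio: {growth:.1f}x")
print(f"Average big entry: June {avg_big_june:.1f} GiB, July {avg_big_july:.1f} GiB")
```

Under these assumptions, the average entry in `big` is larger in July than in June, so the numbers alone do not point to many small parts.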

Graphs

rate(vm_merges_total):

Screenshot 2020-09-16 at 13 42 50

rate(vm_rows_merged_total):

Screenshot 2020-09-16 at 13 43 29

(At the end of August we added new VM servers because the old ones were full; this is visible in the graphs.)

@Seitanas, could you provide a graph for sum(sum_over_time(scrape_series_added[5m])) on a time range that covers the old and new Prometheus versions?

It would also be great to see the graph for sum(sum_over_time(scrape_samples_scraped[5m])) on the same time range.

@Seitanas, try substituting Prometheus with vmagent and see whether this helps reduce the storage usage growth.

sum(sum_over_time(scrape_series_added[5m])):

Screenshot 2020-09-21 at 16 59 59

sum(sum_over_time(scrape_samples_scraped[5m])):

Screenshot 2020-09-21 at 17 00 59

It looks like the time series churn rate (the first graph) and the number of unique time series (the second graph) increased in the middle of August. But there is nothing special at the beginning of July, when the new Prometheus version was installed according to the initial bug report. Perhaps newer versions of Prometheus introduced instability in the real interval between scrapes? This may result in a worse compression ratio for timestamp data.
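
To illustrate why unstable scrape intervals can hurt timestamp compression, here is a simplified sketch. It is not VictoriaMetrics's or Prometheus's actual codec: it takes second-order differences (delta-of-delta) of the timestamps and compresses them with zlib as a general-purpose stand-in. The 15 s interval and the ±20 ms jitter are assumed values chosen only for the demonstration.

```python
import random
import struct
import zlib


def delta_of_delta(timestamps):
    """Return second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def compressed_size(timestamps):
    """Pack the delta-of-delta stream as int64 and return its zlib size."""
    dod = delta_of_delta(timestamps)
    raw = struct.pack(f"<{len(dod)}q", *dod)
    return len(zlib.compress(raw))


random.seed(0)          # deterministic jitter for the illustration
interval_ms = 15_000    # assumed scrape interval
n = 10_000              # number of scrapes

# Perfectly aligned scrapes: every delta-of-delta is zero.
aligned = [i * interval_ms for i in range(n)]
# Assumed jitter of up to +/-20 ms per scrape.
jittered = [t + random.randint(-20, 20) for t in aligned]

print("aligned :", compressed_size(aligned), "bytes")
print("jittered:", compressed_size(jittered), "bytes")
```

With aligned scrapes the difference stream is all zeros and compresses to almost nothing; with jitter each value carries new information, so the compressed size grows considerably even though the number of samples is unchanged.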

We might have tracked it down to https://github.com/golang/go/issues/38860, which effectively causes scrapes not to be aligned in some cases.

A workaround will be provided in Prometheus 2.22.

@roidelapluie, thanks for the update! Closing the issue then, since the problem should disappear automatically after upgrading to Prometheus 2.22.

FYI, Prometheus v2.22 has been released. It contains the bugfix for this issue, which could result in increased disk space usage in both Prometheus and VictoriaMetrics.
