VictoriaMetrics: Constant growth of goroutines number at vmstorage

Created on 12 Feb 2020 · 9 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug

Running VictoriaMetrics cluster v1.33.0 with 3 vmstorage instances. Observing a constant increase in the number of goroutines in the vmstorage components. Had the same behavior with v1.32.6.

To Reproduce

The issue started right after the upgrade from v1.31.5 to v1.32.6.

Screenshots

Drops to zero are caused by vmstorage restarts.

[Screenshot: vmstorage-goroutines]

Version

vmstorage-20200204-220434-tags-v1.33.0-cluster-0-g8e77b548

Used command-line flags

vmstorage flags:

flag{name="bigMergeConcurrency", value="0"} 1
flag{name="dedup.minScrapeInterval", value="0s"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="httpListenAddr", value=":8482"} 1
flag{name="loggerFormat", value="default"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stderr"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="precisionBits", value="64"} 1
flag{name="retentionPeriod", value="6"} 1
flag{name="rpc.disableCompression", value="false"} 1
flag{name="search.maxTagKeys", value="secret"} 1
flag{name="search.maxTagValues", value="100000"} 1
flag{name="search.maxUniqueTimeseries", value="300000"} 1
flag{name="smallMergeConcurrency", value="0"} 1
flag{name="snapshotAuthKey", value="secret"} 1
flag{name="storageDataPath", value="/vmstorage-data"} 1
flag{name="version", value="false"} 1

Additional context

Also attaching /debug/pprof/goroutine from every vmstorage instance:

vm-pprof.zip

bug

All 9 comments

Got the same "issue". It is not in the vmstorage core but in the Grafana dashboard: go_goroutines is a gauge, not a counter.

Confirmed the issue. It looks like it was introduced in v1.32.3. It is related to the erroneous use of time.Timer instead of time.Ticker in the cleaner code for index and block caches. Each new block creates a new cleaner goroutine. This goroutine never finishes because of that error, so the number of goroutines constantly grows.
The bug should be fixed in the following commits:

  • single-node version - eceaf13e5e1eb5463770fff3ee19b0f3b51d9713
  • cluster version - 347aaba79d44df3b21bef1f8f107352a133a635c
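To illustrate the difference between the two timer types mentioned above, here is a minimal sketch. It is not the VictoriaMetrics cleaner code and does not reproduce the exact leak; `runCleaner` and `countCleanups` are hypothetical names invented for this example. It only shows that a time.Timer delivers a single event, while a time.Ticker keeps delivering events until it is stopped.

```go
package main

import (
	"sync/atomic"
	"time"
)

// runCleaner is a hypothetical stand-in for a cache cleaner loop.
// It performs one "cleanup pass" per event on tick and exits when stop is closed.
func runCleaner(tick <-chan time.Time, stop <-chan struct{}, runs *int64) {
	for {
		select {
		case <-tick:
			atomic.AddInt64(runs, 1) // placeholder for real cleanup work
		case <-stop:
			return
		}
	}
}

// countCleanups runs the cleaner for duration d and reports how many cleanup
// passes happened. With useTicker=false the loop is driven by a time.Timer,
// which fires only once; with useTicker=true it is driven by a time.Ticker,
// which fires every interval.
func countCleanups(useTicker bool, interval, d time.Duration) int64 {
	var runs int64
	stop := make(chan struct{})
	done := make(chan struct{})

	var tick <-chan time.Time
	if useTicker {
		t := time.NewTicker(interval)
		defer t.Stop()
		tick = t.C
	} else {
		t := time.NewTimer(interval)
		defer t.Stop()
		tick = t.C
	}

	go func() {
		defer close(done)
		runCleaner(tick, stop, &runs)
	}()

	time.Sleep(d)
	close(stop)
	<-done // wait for the goroutine to exit so the sketch itself does not leak
	return atomic.LoadInt64(&runs)
}
```

Calling `countCleanups(false, 10*time.Millisecond, 200*time.Millisecond)` from a `main` function yields a single cleanup pass, while the same call with `true` yields many. A periodic cleaner written with time.Timer therefore does its work once and then just sits in the loop.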

@pavdmyt, @freeseacher, could you build VictoriaMetrics from these commits and verify whether the issue with the growing number of goroutines is gone for your workloads? See the instructions on how to build VictoriaMetrics from sources.

Oops, found yet another bug, which could result in stuck goroutines after v1.32.3. The bug should be fixed in the following commits:

  • single-node version - 7836ad89078847083ae8f9da92cafabe60a0b274
  • cluster version - e3b18ca1ab69082ec46e82acd5909f1311e39cc8

@pavdmyt, @freeseacher, could you build VictoriaMetrics from these commits and verify that they fix the goroutine leak?

@valyala We don't see the described issue on our staging cluster. Could you suggest how to reproduce it? Then we could build VM with the recent fixes and try it on our staging.

The issue should be reproducible by creating new time series at a constant pace. These time series add new entries to the inverted index. The entries are put into new inmemory parts, which should trigger the constant growth in the number of goroutines.

The easiest way to create new time series is to regularly change a time series label to something new (for instance, a random number or the current timestamp). Please do not do it in production, since this may result in high churn rate and high cardinality issues. See more information about these issues in this article.
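As a rough sketch of the label-changing approach described above, the helper below generates samples in Prometheus text exposition format where every line has a different label value, so every line is a new time series. The function name `churnLines`, the metric name and the `run_id` label are assumptions made up for this example; how the lines are delivered to a staging cluster is left out on purpose.

```go
package main

import (
	"fmt"
	"strings"
)

// churnLines returns n samples in Prometheus text exposition format.
// Each sample gets a distinct value of the hypothetical run_id label
// (start, start+1, ...), so each line represents a brand-new time series.
func churnLines(metric string, start, n int) string {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		fmt.Fprintf(&sb, "%s{run_id=\"%d\"} 1\n", metric, start+i)
	}
	return sb.String()
}
```

Calling `churnLines` periodically with an ever-increasing `start` and sending the result to a staging ingestion path would create new series at a constant pace. As noted above, this should stay out of production.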

@valyala I'm working on reproducing the issue on our staging cluster. Will post the results here (probably tomorrow).

@pavdmyt, the bugfix is available in v1.33.1, so you can try it and verify whether it fixes the issue.

@valyala After the upgrade to v1.33.1 the goroutine leak has disappeared. Three hours have passed since the upgrade and the number of vmstorage goroutines stays stable, near 450.

Thanks!

@pavdmyt, thanks for the confirmation! Closing the issue as fixed.
