Victoriametrics: vminsert: performance degradation while ingesting data in the Prometheus exposition format

Created on 14 Aug 2020  路  9Comments  路  Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
We use the cluster version of VictoriaMetrics, run it in Kubernetes, and send data in the Prometheus exposition format to http://<vminsert>/insert/0/prometheus/api/v1/import/prometheus endpoint, we also gzip data before sending. This works fine.

At some point, we observe the following vminsert errors:

2020-08-13T21:53:27.130Z warn VictoriaMetrics/app/vminsert/main.go:184 remoteAddr: "10.32.82.19:39628"; error in "/insert/0/prometheus/api/v1/import/prometheus": cannot read graphite plaintext protocol data: unexpected EOF

After that CPU usage gets to 100%, memory usage also increased:

vm2
vm1

Interestingly, vminsert reports also it reached max concurrent inserts:
vm3

While ingestion rate didn't increase dramatically:
vm4

However, there was a spike in active time series:
vm5
Churn rate and slow inserts had an increase also - 10k and 8% appropriately.

A restart of vminsert pods resolved the issue.

To Reproduce
I'm not sure how to reproduce, still in search of some correlations

Expected behavior
No errors, stable resource usage. Or errors related to the format we use instead of Graphite plaintext format.

Version
v1.39.4-cluster

Used command-line flags
--maxLabelsPerTimeseries=100 in addition to --storageNode

Additional context
vmselect/vmstorage had no errors in logs at that time. Let me know if I should include other graphs from the official VM dashboard as well.

bug

All 9 comments

@vpedosyuk , could you provide cpu profile and goroutine profile next time you encounter this bug?
These profiles may be obtained with the following commands (please substitute vminsert with the corresponding vminsert node address):

  • cpu profile:
curl http://vminsert:8480/debug/pprof/profile > cpu.pprof
  • goroutine profile:
curl http://vminsert:8480/debug/pprof/goroutine > goroutine.pprof

@valyala I've already sent a cpu profile to [email protected] and I'll try to collect goroutine profile next time

@vpedosyuk , thank you!

Could you build vminsert from the commit c6b054784739159a64fb778a4a077151bee6cee0 and verify whether the issue is fixed there? See build instructions for vminsert at https://victoriametrics.github.io/Cluster-VictoriaMetrics.html#building-from-sources .

Thanks for the quick fix! I've built it and deployed but we'll need to monitor it for a few days to make sure the issue is gone.

FYI, the bugfix has been included in v1.40.0 release, so it may be tested there.

@valyala we've experienced another outage:

The ingestion rate was stable until it started decreasing at 12:41, very similar to what I reported in this issue
1

I see that churn rate started growing a few minutes before - at 12:38
2a
2b

The reason for that could've been heavy PromQL queries.

However, the strange thing is that the vmstorage pods were up and running (until I started rebooting everything), moreover, vminserts are the only who were reporting warning in logs:
3

like these:

2020-08-17T09:41:42.272Z warn VictoriaMetrics/app/vminsert/main.go:187 remoteAddr: "10.32.148.8:35832"; error in "/insert/0/prometheus/api/v1/import/prometheus": cannot read Prometheus exposition data: unexpected EOF
...
2020-08-17T09:41:52.915Z warn VictoriaMetrics/app/vminsert/main.go:160 remoteAddr: "10.34.52.118:54032"; error in "/insert/0/prometheus": cannot handle more than 16 concurrent inserts during 1m0s; possible solutions: increase `-insert.maxQueueDuration`, increase `-maxConcurrentInserts`, increase server capacity

At the same time I see vminserts' goroutines number increased:
4
concurrent inserts are again at max:
5

while CPU usage decreased to zero
6
and memory is almost at maximum
7

Then I tried to restart vminsert but with no luck this time. I restarted vmstorage and then vminserts - the ingestion rate started growing and eventually stabilized but I still observed maximum concurrent inserts from vminsert for some time.

Is it OK to have such rapid growth in concurrent vminserts? Does it all mean that we simply need to increase vmstorage resources?

Attaching goroutines and CPU profiles of vminserts, maybe it helps somehow.
cpu.zip
goroutine.zip
Unfortunately, I didn't notice high memory usage then, so I didn't collect memory profiles.

Happened again, this time I see no more than 6% slow inserts and 2.5k churn rate; again maximum concurrent inserts, lowered CPU usage, and a 7-8 times increased number of goroutines.

Can it be related to the data we send or the way we send the data?

Attaching vminserts profiles for goroutines, CPU, and memory. This time I didn't restart anything so the profiles should be "cleaner" I guess.
goroutine.zip
memory.zip
cpu.zip

Does it all mean that we simply need to increase vmstorage resources?

It looks like so. There are chances that the increased time series churn rate could lead to the issue. Could you share graphs for vmstorage nodes from the official Grafana dashboard on the same time range?

Sorry for the late response on this. We've got only 7 days of retention for monitoring of monitoring so no graphs for vmstorage on that particular time range, unfortunately.

Regarding the issue, we've first replaced HDD with SSD, the situation got better, then we increased RAM for vmstorage pods (in total +28 GB). As a result, it happens less often and restores itself faster. Thanks.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

EricAntoni picture EricAntoni  路  3Comments

sh0rez picture sh0rez  路  3Comments

oOHenry picture oOHenry  路  4Comments

valyala picture valyala  路  4Comments

ozn0417 picture ozn0417  路  3Comments