VictoriaMetrics: vminsert indefinitely ignores one of the vmstorage pods after a sudden restart

Created on 16 Jun 2020 · 3 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
Hello. We are testing VictoriaMetrics for our production load and experimenting with resource limits, so our vmstorage pods sometimes get killed by the OOM killer. While a vmstorage pod is down, vminsert and vmselect stop talking to it (sending metrics / requesting data), which is expected. But after the vmstorage pod comes back up, the vminsert nodes keep ignoring the restarted vmstorage node indefinitely and, in our case, reroute its metrics to the one remaining vmstorage node.
The vmselect nodes, by contrast, read from the restarted node again once it is back up.

To Reproduce
1- Shut down one of the vmstorage nodes unexpectedly
2- Wait until the vmstorage node is back up
3- Check the vminsert logs and the vmstorage metrics for active time series

Expected behavior
vminsert should periodically check whether the vmstorage node is reachable and route traffic to it again once it is.
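
The expected behavior can be sketched roughly as follows. This is a minimal Python illustration, not the actual vminsert code (which is written in Go); `NodePool`, `tcp_dial`, `mark_broken`, `recheck` and `route` are hypothetical names, and the modulo-based routing is an assumption made only to keep the example short.

```python
import socket


def tcp_dial(addr, timeout=1.0):
    """Return True if a TCP connection to "host:port" can be opened."""
    host, _, port = addr.rpartition(":")
    try:
        # The name is resolved on every attempt, so a pod that comes back
        # with a new IP address is still found.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:  # covers connection refused, DNS failures and timeouts
        return False


class NodePool:
    """Tracks which storage nodes are healthy and routes around broken ones."""

    def __init__(self, addrs, dial=tcp_dial):
        self.dial = dial
        self.healthy = {addr: True for addr in addrs}

    def mark_broken(self, addr):
        """Called by the sender when a write to addr fails."""
        self.healthy[addr] = False

    def recheck(self):
        """Re-dial every unhealthy node and return the ones that recovered."""
        recovered = []
        for addr, ok in list(self.healthy.items()):
            if not ok and self.dial(addr):
                self.healthy[addr] = True
                recovered.append(addr)
        return recovered

    def route(self, key):
        """Pick the preferred node for key, falling back to the next healthy one."""
        addrs = list(self.healthy)
        for i in range(len(addrs)):
            addr = addrs[(key + i) % len(addrs)]
            if self.healthy[addr]:
                return addr
        raise RuntimeError("all the storage nodes are unhealthy")
```

In this sketch a background loop would call `recheck()` "every now and then" (for example once per second), so a node marked broken becomes routable again shortly after it starts accepting connections.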

Screenshots

Here you can see that vmstorage-1 has received no active time series since its restart:
image

And here you can see that, at the same time, there are no vminsert connections to vmstorage-1 anymore:
image

Here you can see that vmselect is reading metrics from vmstorage-1 again (the graph uses a large resolution so that the rate change is visible):
image

I can also see these logs in the vminsert pods:

[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:55.522Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:55.526Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:55.526Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153       cannot push 30534997 bytes with 59548 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.229Z       error   VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328       cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.231Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.234Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.234Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153       cannot push 30534997 bytes with 59548 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:57.229Z       error   VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328       cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:57.233Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:57.254Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:55.521Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:55.525Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:55.525Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153       cannot push 9451901 bytes with 28415 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.224Z       error   VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328       cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.224Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.229Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.229Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153       cannot push 9451901 bytes with 28415 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:57.224Z       error   VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328       cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:57.228Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:57.285Z       warn    VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195       cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
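
The log contains two different dial failures: `connection refused` (the address resolves but nothing is listening) and `no such host` (the DNS name does not resolve). A small helper can tally them per storage node. This is only a sketch for reading such logs: `summarize_dial_errors` is a hypothetical name, and the regular expression assumes each log record is on a single line in the format shown in this report.

```python
import re
from collections import Counter

# Matches the vminsert warning and captures the node address and the error text.
DIAL_RE = re.compile(r'cannot dial storageNode "([^"]+)": (.*)$')


def summarize_dial_errors(lines):
    """Count dial failures per (short node name, failure kind)."""
    counts = Counter()
    for line in lines:
        m = DIAL_RE.search(line)
        if not m:
            continue  # not a dial warning, e.g. the "cannot push" messages
        node, reason = m.group(1), m.group(2)
        if "no such host" in reason:
            kind = "dns"
        elif "connection refused" in reason:
            kind = "refused"
        else:
            kind = "other"
        # "vmstorage-1.vmstorage.victoriametrics..." -> "vmstorage-1"
        counts[(node.split(".")[0], kind)] += 1
    return counts
```

Feeding it the vminsert log lines (for example, read from a saved log file) gives one counter entry per node and failure kind, which makes it easier to see when a node switches from one failure mode to the other.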

Version

vminsert-20200605-193120-tags-v1.37.0-cluster-0-g9f55dea16
vmstorage-20200605-193746-tags-v1.37.0-cluster-0-g9f55dea16
vmselect-20200605-193438-tags-v1.37.0-cluster-0-g9f55dea16

Used command-line flags
vminsert:
```bash
VM_maxConcurrentInserts: "1000"
VM_memory_allowedPercent: "90"
VM_rpc_disableCompression: "true"
VM_storageNode: vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400,vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400
```

vmstorage:
```bash
VM_memory_allowedPercent: "90"
VM_retentionPeriod: "6"
VM_search_maxUniqueTimeseries: "10000000"
VM_storageDataPath: /storage
```

vmselect:
```bash
VM_cacheDataPath: /cache
VM_memory_allowedPercent: "95"
VM_search_maxPointsPerTimeseries: "5000000"
VM_search_maxQueryDuration: 10m
VM_storageNode: vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8401,vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8401
```

We pass the flags as environment variables (env flags), with VM_ as the prefix.
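
For readers mapping these variables back to command-line flags: judging by the names above, each `.` in a flag name becomes `_` and the prefix is prepended. The helper below is a small sketch of that mapping; `env_name` is a hypothetical function, not part of VictoriaMetrics.

```python
def env_name(flag, prefix="VM_"):
    """Translate a flag such as -memory.allowedPercent into its env var name."""
    # Drop leading dashes, replace dots with underscores, add the prefix.
    return prefix + flag.lstrip("-").replace(".", "_")


if __name__ == "__main__":
    for flag in ("-storageNode", "-memory.allowedPercent", "-rpc.disableCompression"):
        print(flag, "->", env_name(flag))
```

The reverse direction is not attempted here because an underscore in an env var name cannot be told apart from an underscore that was already in the flag name.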

bug

All 3 comments

I can confirm this bug. It has also been happening in our setup since the upgrade to 1.36.3 (from 1.36.0) and still happens in 1.37.0. The workaround is to restart vminsert after vmstorage has been restarted.
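
On Kubernetes, the restart workaround could be scripted along these lines. This is a sketch under assumptions: the deployment name `vminsert` and the namespace `victoriametrics` are guessed from the pod and service names in the report, `restart_cmd` and `restart_vminsert` are hypothetical helpers, and `kubectl` must be installed and pointed at the right cluster.

```python
import subprocess


def restart_cmd(deployment="vminsert", namespace="victoriametrics"):
    """Build the kubectl command that restarts all pods of a deployment."""
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def restart_vminsert(**kwargs):
    """Run the restart; raises CalledProcessError if kubectl fails."""
    subprocess.run(restart_cmd(**kwargs), check=True)
```

Calling `restart_vminsert()` after the vmstorage pod reports ready replaces the vminsert pods, so they open fresh connections to every storage node.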

@sw0x2A, thanks for the workaround. I can confirm that it works. The bug should be fixed by commit 464682f380501b87555f7f280ba5006d4d60a373, which is already mentioned in https://github.com/VictoriaMetrics/VictoriaMetrics/issues/546.

The commit will be included in the next release of VictoriaMetrics. In the meantime, the cluster components of VictoriaMetrics can be built from the cluster branch according to these docs.

The bug should be fixed in v1.37.1. Closing the issue as fixed.
