Describe the bug
Hello. We are testing VictoriaMetrics for our production load and experimenting with resource limits. When a vmstorage pod is killed by the OOM killer, vminsert and vmselect stop talking to that node (sending metrics / requesting data), which is expected. However, once the vmstorage pod is back up, the vminsert nodes keep ignoring the restarted vmstorage node indefinitely and continue rerouting metrics to the other vmstorage node (in our case, the only one left).
vmselect nodes, on the other hand, resume reading from the restarted node once it is up again.
To Reproduce
1- Shut down one of the vmstorage nodes unexpectedly
2- Wait until the vmstorage node is back up again
3- Check the vminsert logs / the vmstorage metrics on active time series
Expected behavior
vminsert should periodically check whether the vmstorage node is reachable and route traffic to it again once it is.
Screenshots
Here you can see that since the restart of vmstorage-1, it has received no active time series:

And here you can see that, at the same time, there are no vminsert connections to vmstorage-1 anymore:

Here you can see that vmselect is reading metrics from vmstorage-1 again (the graph resolution is large so that the rate change is visible):

I can also see these logs in the vminsert pods:
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:55.522Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:55.526Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:55.526Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153 cannot push 30534997 bytes with 59548 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.229Z error VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328 cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.231Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.234Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:56.234Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153 cannot push 30534997 bytes with 59548 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:57.229Z error VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328 cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:57.233Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-lvmpf/vminsert] 2020-06-15T13:29:57.254Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:55.521Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:55.525Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:55.525Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153 cannot push 9451901 bytes with 28415 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.224Z error VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328 cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.224Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4 10.58.130.197:8400: connect: connection refused
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.229Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-0.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:56.229Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:153 cannot push 9451901 bytes with 28415 rows to storage nodes, since all the nodes are temporarily unavailable; re-trying to send the data soon
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:57.224Z error VictoriaMetrics/app/vminsert/netstorage/netstorage.go:328 cannot send rerouted rows because all the storage nodes are unhealthy
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:57.228Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
[pod/vminsert-66df7975b7-nq6xv/vminsert] 2020-06-15T13:29:57.285Z warn VictoriaMetrics/app/vminsert/netstorage/netstorage.go:195 cannot dial storageNode "vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400": dial tcp4: lookup vmstorage-1.vmstorage.victoriametrics.svc.cluster.local on 172.20.0.10:53: no such host
Version
vminsert-20200605-193120-tags-v1.37.0-cluster-0-g9f55dea16
vmstorage-20200605-193746-tags-v1.37.0-cluster-0-g9f55dea16
vmselect-20200605-193438-tags-v1.37.0-cluster-0-g9f55dea16
Used command-line flags
vminsert:
```bash
VM_maxConcurrentInserts: "1000"
VM_memory_allowedPercent: "90"
VM_rpc_disableCompression: "true"
VM_storageNode: vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8400,vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8400
```
vmstorage:
```bash
VM_memory_allowedPercent: "90"
VM_retentionPeriod: "6"
VM_search_maxUniqueTimeseries: "10000000"
VM_storageDataPath: /storage
```
vmselect:
```bash
VM_cacheDataPath: /cache
VM_memory_allowedPercent: "95"
VM_search_maxPointsPerTimeseries: "5000000"
VM_search_maxQueryDuration: 10m
VM_storageNode: vmstorage-0.vmstorage.victoriametrics.svc.cluster.local:8401,vmstorage-1.vmstorage.victoriametrics.svc.cluster.local:8401
```
We are passing the flags as environment variables (env flags), with `VM_` as the prefix.
I can confirm this bug. It has been happening in our setup since the upgrade to 1.36.3 (from 1.36.0) and still happens in 1.37.0. The workaround is to restart vminsert after vmstorage has been restarted.
@sw0x2A, thanks for the workaround. I can confirm that it works. The bug should be fixed by commit 464682f380501b87555f7f280ba5006d4d60a373, which is already mentioned in https://github.com/VictoriaMetrics/VictoriaMetrics/issues/546.
The commit will be included in the next release of VictoriaMetrics. In the meantime, the cluster components of VictoriaMetrics can be built from the cluster branch according to these docs.
The bug should be fixed in v1.37.1. Closing the issue as fixed.