VictoriaMetrics: some thoughts after running VM for over a month

Created on 14 Nov 2019 · 9 Comments · Source: VictoriaMetrics/VictoriaMetrics

Hi,
We have been running VM in our production environment for about one month. We now have some thoughts and problems, so we are opening this issue in the hope of getting some advice.

We use the cluster version, compiled from commit 99786c2 (app/vmselect/prometheus: add -search.maxLookback command-line flag for overriding dynamic calculations for max lookback interval).

1. We found that VM doesn't have replication or any other way to avoid data loss when vmstorage needs a restart. The VM introduction says it can avoid data loss when used as Prometheus remote storage, but we write data to VM directly, so there may currently be no way to prevent it.

2. We now write more than 1.5 billion points every day and therefore use more than 1 GB of disk space per day. We really think downsampling is necessary for a TSDB used in a production environment; otherwise the hardware cost of maintaining it would be huge.

3. We have written a tool based on influx_inspect that migrates data from InfluxDB to VM. When we tested VM on a 2C4G server and migrated data to it, memory usage kept increasing until vmstorage finally hit OOM. We set -memory.allowedPercent=30 on vmstorage, but it didn't help.
At first we thought a 2C4G server might simply run into performance problems, so we use a 2C8G server in production. Now, one month later, memory usage has kept increasing every day, from 20% to 55%. We have 450K active time series and a very low query frequency (fewer than 5 developers query through Grafana each day). We now suspect a memory leak.

4. It seems that “tenancy” is currently only part of the hash that decides which vmstorage node stores the data. It doesn't really let different tenants insert data into different vmstorage sub-clusters, so one vmstorage node may hold data from many tenants. Will this change in the future?

5. We have put a load balancer in front of vminsert and vmselect. The LB needs a URL for health checks. I checked the source code, and /metrics may be the appropriate path for now. Will something like "/ping" be added?

Finally, we urgently hope for "vmstorage restart without data loss" and "data downsampling". We would like to know whether the project team has any plans or ideas for these. If possible, our developers may be able to contribute something here.

thx...

question

flashmouse

Most helpful comment

@flashmouse , thanks for the questions. See answers below.

1. We found that VM doesn't have replication or any other way to avoid data loss when vmstorage needs a restart. The VM introduction says it can avoid data loss when used as Prometheus remote storage, but we write data to VM directly, so there may currently be no way to prevent it.

The cluster version of VM shouldn't lose data if the following conditions are met:

The number of vmstorage nodes is greater than 1.
vmstorage nodes are restarted one-by-one instead of all-at-once.

In this case vminsert re-routes incoming data to the available vmstorage nodes if certain nodes are temporarily unavailable. This prevents data loss during cluster node upgrades, config changes, or restarts.
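The thread doesn't describe the routing algorithm inside vminsert, so the following is only a toy model of the re-routing idea, not VictoriaMetrics code. The function `pick_node`, the hashing scheme, and the node names are all hypothetical.

```python
import hashlib
from typing import List, Set


def pick_node(series_key: str, nodes: List[str], unavailable: Set[str]) -> str:
    """Toy model: hash the series key to a preferred node, then fall back
    to the next available node in list order if the preferred one is down.
    This is an illustration only, not the actual vminsert algorithm."""
    if not nodes:
        raise ValueError("no storage nodes configured")
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    for offset in range(len(nodes)):
        candidate = nodes[(start + offset) % len(nodes)]
        if candidate not in unavailable:
            return candidate
    raise RuntimeError("all storage nodes are unavailable")
```

With more than one node and only one of them restarting at a time, `pick_node` always finds a node that can accept the write, which mirrors the two conditions listed above.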

2. We now write more than 1.5 billion points every day and therefore use more than 1 GB of disk space per day. We really think downsampling is necessary for a TSDB used in a production environment; otherwise the hardware cost of maintaining it would be huge.

Downsampling isn't implemented yet because of the following issues:

@matejzero already mentioned that downsampling doesn't always reduce the required storage space, since a larger number of correlated values in a time series compresses better than a smaller number of downsampled values with lower correlation.
Downsampling doesn't reduce the number of time series, so it doesn't help with high-cardinality issues.
There is no silver-bullet downsampling: different people want different downsampling for different time series. Some just want a random sample, some want the min value, others want the average value, some want a histogram, and others want a set of quantiles over the downsampled interval.
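To make the last point concrete, here is a minimal sketch showing how the same raw samples give different downsampled series depending on the aggregate chosen. The `downsample` helper and the sample values are made up for illustration; this is not how any particular TSDB implements it.

```python
import statistics
from typing import Callable, List, Sequence


def downsample(values: Sequence[float], bucket_size: int,
               aggregate: Callable[[Sequence[float]], float]) -> List[float]:
    """Split values into consecutive buckets of bucket_size samples and
    reduce each bucket to a single value with the given aggregate."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [aggregate(values[i:i + bucket_size])
            for i in range(0, len(values), bucket_size)]


raw = [1, 9, 2, 8, 3, 7]  # hypothetical raw samples

by_min = downsample(raw, 3, min)               # keeps the lows
by_max = downsample(raw, 3, max)               # keeps the spikes
by_avg = downsample(raw, 3, statistics.mean)   # smooths both away
```

Each result answers a different question about the original data, so a single built-in policy would not suit everyone.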

VictoriaMetrics reduces the need for downsampling by providing a good compression ratio (your case shows 1.5GB/1B samples=1.5 bytes per sample) and high scan speed - up to 50 million samples per CPU core - which scales linearly with the available CPU cores on vmselect nodes.

As for the 1.5GB/day growth rate for time series data, this is a pretty low number. It means 1.5GB*365=547GB per year. A 547GB HDD-based persistent replicated durable GCP disk costs $20/month. IMHO, this is quite a low price for storing 365 billion samples. There are less reliable options with lower costs.
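The arithmetic can be reproduced with a few lines. Note that the reply works with 1.5GB and 1 billion samples per day, while the original post states more than 1 GB for 1.5 billion points per day; the sketch below computes both, assuming 1 GB = 10^9 bytes. The second figure is only a lower bound because the post says "more than".

```python
GB = 10**9  # assumption: decimal gigabytes


def bytes_per_sample(bytes_per_day: float, samples_per_day: float) -> float:
    """Average on-disk bytes used per stored sample."""
    return bytes_per_day / samples_per_day


def yearly_growth_gb(gb_per_day: float, days: int = 365) -> float:
    """Disk growth over a year at a constant daily rate."""
    return gb_per_day * days


# Figures as used in the reply: 1.5 GB/day for 1 billion samples/day.
reply_ratio = bytes_per_sample(1.5 * GB, 1e9)

# Figures as stated in the original post: 1 GB/day (lower bound)
# for 1.5 billion points/day.
post_ratio = bytes_per_sample(1.0 * GB, 1.5e9)

yearly = yearly_growth_gb(1.5)
```

Under either reading the ratio stays at or below roughly 1.5 bytes per sample, and 1.5GB/day comes to 547.5GB over a year.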

We use a 2C8G server in production. Now, one month later, memory usage has kept increasing every day, from 20% to 55%. We have 450K active time series and a very low query frequency (fewer than 5 developers query through Grafana each day). We now suspect a memory leak.

Could you post memory usage graphs for VictoriaMetrics components for the last month? This is the process_resident_memory_bytes metric.
Could you post a memory profile for the node which you think has a memory leak?
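If no Prometheus is scraping the components yet, one low-effort way to start collecting this number is to read it from the /metrics page mentioned earlier in the thread. The sketch below assumes the page uses the Prometheus text exposition format and that the metric appears without labels; `resident_memory_bytes` and the sample text are hypothetical.

```python
from typing import Optional


def resident_memory_bytes(metrics_text: str) -> Optional[float]:
    """Return the value of process_resident_memory_bytes from a
    Prometheus-style text page, or None if the metric is absent.
    Assumes the metric line has no labels."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "process_resident_memory_bytes":
            return float(fields[1])
    return None


# Hypothetical excerpt of a /metrics page.
sample_page = """\
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.23456e+08
process_cpu_cores_available 2
"""

rss = resident_memory_bytes(sample_page)
```

Logging this value periodically would give the month-long graph requested above.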

It seems that “tenancy” is currently only part of the hash that decides which vmstorage node stores the data. It doesn't really let different tenants insert data into different vmstorage sub-clusters, so one vmstorage node may hold data from many tenants. Will this change in the future?

There are no plans to change this in the future. If you want to store each tenant in its own cluster, then just use a dedicated cluster per tenant. A simple path-based routing proxy can be put in front of such clusters in order to route requests for each tenantID to the corresponding cluster.
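The routing decision for such a proxy can be sketched as a pure function. This assumes request paths of the form /insert/<tenantID>/... and /select/<tenantID>/..., which is an assumption about the cluster URL layout rather than something stated in this thread; the `route` function and the cluster addresses are hypothetical.

```python
from typing import Dict


def route(path: str, clusters: Dict[str, str]) -> str:
    """Map a request path to the full URL on the tenant's dedicated cluster.
    Assumes the tenant ID is the second path segment, after 'insert' or
    'select'."""
    trimmed = path.lstrip("/")
    parts = trimmed.split("/", 2)
    if len(parts) < 2 or parts[0] not in ("insert", "select"):
        raise ValueError("unsupported path: " + path)
    tenant_id = parts[1]
    if tenant_id not in clusters:
        raise LookupError("no cluster configured for tenant " + tenant_id)
    return clusters[tenant_id].rstrip("/") + "/" + trimmed


# Hypothetical tenant-to-cluster mapping.
tenant_clusters = {
    "1": "http://cluster-a.example",
    "2": "http://cluster-b.example/",
}

target = route("/insert/1/prometheus/api/v1/write", tenant_clusters)
```

A real proxy would forward the request body and headers to `target`; only the lookup is shown here.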

We have put a load balancer in front of vminsert and vmselect. The LB needs a URL for health checks. I checked the source code, and /metrics may be the appropriate path for now. Will something like "/ping" be added?

There is a /health URL for health checks - see https://github.com/VictoriaMetrics/VictoriaMetrics/blob/c197641978f5c78c3d5fe165e20c92827014abdb/lib/httpserver/httpserver.go#L132 .
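A load balancer would normally probe this URL itself, but the same check is easy to script. The sketch below treats an HTTP 200 response from /health as healthy, which is an assumption about the endpoint's behavior; `is_healthy`, the `opener` parameter (there to allow testing without a network), and the host name in the usage note are hypothetical.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.
    Connection failures, timeouts and HTTP error statuses count as
    unhealthy."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

For example, `is_healthy("http://vminsert.example:8480")` would probe a vminsert node at that placeholder address.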

valyala on 15 Nov 2019

👍3

All 9 comments

Not part of VM team, just some ideas from my POV.

One way to solve this problem is to run 2 clusters and have Prometheus send data to both clusters. Then add the promxy deduplication proxy in front. Not an ideal solution, but this will remove blank spots from the graphs. Another way would be to introduce some kind of replication for cluster VM. Thanos uses a version of the first option: you run multiple Prometheus instances with a special unique label and then upload all that data to object storage. Thanos Querier then deduplicates data on fetch. You have a single storage endpoint, but you still need 2x the space to save all data.
We are also testing VM, and our usage after a few days has been around 0.77GB/day with around 2 billion metrics / day. But the size will vary based on the number of unique metrics, I think. There might not be any storage savings when using downsampling. Read this document by Thanos, where they say there won't be any savings:

In fact, downsampling doesn’t save you any space but instead it adds 2 more blocks for each raw block which are only slightly smaller or relatively similar size to raw block. This is required by internal downsampling implementation which to be mathematically correct holds various aggregations. This means that downsampling can increase the size of your storage a bit (~3x), but it gives massive advantage on querying long ranges.
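For the first idea in this comment (two clusters behind a deduplicating proxy), a toy sketch of why the merged result has no blank spots: each cluster misses different samples, and the union by timestamp covers both gaps. This is not promxy's or Thanos's actual algorithm; `merge_series` and the sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp, value)


def merge_series(primary: List[Sample], secondary: List[Sample]) -> List[Sample]:
    """Union two copies of the same series by timestamp. When both copies
    have a sample for a timestamp, the one from primary wins."""
    merged = dict(secondary)
    merged.update(dict(primary))
    return sorted(merged.items())


# Cluster A missed timestamp 3 (e.g. during a restart), cluster B missed 1.
cluster_a = [(1, 10.0), (2, 11.0), (4, 13.0)]
cluster_b = [(2, 11.0), (3, 12.0), (4, 13.0)]

combined = merge_series(cluster_a, cluster_b)
```

The price is the one already noted above: both clusters store a full copy, so disk usage doubles.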

matejzero on 15 Nov 2019

Any chance you can share or post your code for influx to vm data migration? We are getting ready to do some influx migrations and this would save us time. I saw this repo and figured might be a good place to start for posting a PR https://github.com/VictoriaMetrics/vmctl

nickmy9729 on 25 Nov 2019

@valyala thx for the reply!
Replication is important, so we may look for another way to resolve the problem.

I agree that different metrics have different downsampling requirements. In InfluxDB, the user needs to define a continuous query (CQ), so they can downsample however they want. Some other TSDBs let the client add a value type (such as ratio, counter, gauge and so on) to the add-datapoint request, so the DB can downsample by value type. I think the second way may be a good solution.
@matejzero mentioned that downsampling may sometimes increase data size, which I had never thought of before. Thx for the reminder!

/health path is exactly what we need!

As for the "memory leak" I mentioned: we haven't set up Prometheus monitoring for VM, so we cannot see how memory changed over these days. Because it increases very slowly, we will keep watching it, and if something happens, I'll provide detailed information.

@nickmy9729 If you need it, I may commit the code later. Because it is based on influx_inspect, it takes a lot of network bandwidth: in my situation, it may generate about 20~40x more data than the original data kept by InfluxDB...

flashmouse on 26 Nov 2019

@flashmouse Yeah, that's what I figured based on the output of influx_inspect. I am still interested, and thanks!

nickmy9729 on 30 Nov 2019

@flashmouse This is a good share.
@valyala How does the vmstorage cluster handle the case where one of the vmstorage nodes has crashed and we need to rebuild it from scratch? In my understanding, the cluster architecture doesn't perform synchronisation from node to node.
Should we manually run vmbackup on a healthy node and vmrestore the data to the new node?

patrixgdd on 10 Jan 2020

FYI, https://github.com/VictoriaMetrics/VictoriaMetrics/issues/93 contains useful tips about data migration from InfluxDB to VictoriaMetrics (and from Prometheus to VictoriaMetrics).

How does the vmstorage cluster handle the case where one of the vmstorage nodes has crashed and we need to rebuild it from scratch? In my understanding, the cluster architecture doesn't perform synchronisation from node to node.

Data isn't shared between vmstorage nodes. Data is lost if the underlying storage where a vmstorage node stores its data (aka the -storageDataPath directory) stops working. The cluster will continue working with partial data.

It is recommended to make regular backups of each vmstorage node with the vmbackup tool. It is also recommended to use durable replicated underlying storage such as GCP persistent disks, which are protected from hardware failures.

Should we manually run vmbackup on a healthy node and vmrestore the data to the new node?

No. Each vmstorage node contains a distinct share of the data, so there is no sense in copying data from one vmstorage node to another - it will just duplicate this data.

valyala on 11 Jan 2020

FYI, the cluster version of VictoriaMetrics supports application-level replication starting from v1.36.0. See these docs for more details.

valyala on 27 May 2020

thx, we will try version 1.36 soon and check whether problem #490 is resolved.

flashmouse on 3 Jun 2020
