Hi,
We have been running VM in a production environment for about one month. We now have some thoughts and problems, so we are opening this issue in the hope of getting some advice.
We use the cluster version, compiled from commit 99786c2 (app/vmselect/prometheus: add -search.maxLookback command-line flag for overriding dynamic calculations for max lookback interval).
1. We found that VM has no replication or any other way to avoid data loss when vmstorage needs a restart. The VM introduction says data loss can be avoided when VM is used as Prometheus remote storage, but we write data to VM directly, so there may currently be no way to prevent it.
2. We now write more than 1.5 billion points every day and therefore use more than 1 GB of disk space per day. We really think downsampling is necessary for a TSDB used in a production environment; otherwise the hardware cost of maintaining it would be huge.
3. We have written a tool, based on influx_inspect, that migrates data from InfluxDB to VM. When we tested VM on a 2C4G server and migrated data to it, memory usage kept increasing until vmstorage ran out of memory (OOM). Setting -memory.allowedPercent=30 on vmstorage did not help.
At first we thought 2C4G might simply be too small, so we use a 2C8G server in production. One month has now passed, and memory usage has kept increasing every day, from 20% to 55% now. We have 450K active time series and a very low query rate (fewer than 5 developers query through Grafana each day). We now suspect a memory leak.
4. It seems that "tenancy" is currently only part of the hash input that decides which vmstorage node stores the data. It does not really allow different tenants to insert data into different vmstorage sub-clusters, so one vmstorage node may hold data from many tenants. Will this change in the future?
5. We have put a load balancer in front of vminsert and vmselect. The load balancer needs a URL for health checks. I checked the source code, and /metrics may be the appropriate path for now. Will something like "/ping" be added?
Finally, we urgently hope for "vmstorage restart without data loss" and "data downsampling". Does the project team have any plans or ideas for these? If possible, our developers may contribute something here.
Thanks.
Not part of the VM team, just some ideas from my POV.
One way to solve this problem is to run 2 clusters and have Prometheus send data to both of them, then add the promxy deduplication proxy in front. It is not an ideal solution, but it will remove blank spots from the graphs. Another way would be to introduce some kind of replication for cluster VM. Thanos uses a version of the first option: you run multiple Prometheus instances with a special unique label and upload all that data to object storage. Thanos Querier then deduplicates the data on fetch. You have a single storage endpoint, but you still need 2x the space to store all the data.
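To illustrate the "two clusters plus a deduplicating proxy" idea, here is a minimal sketch of merging one series fetched from two replicas. It is not how promxy or Thanos Querier are implemented; `merge_replicas` and the `Sample` type are hypothetical names, and the sketch assumes both replicas store samples at identical timestamps (real proxies also have to handle small timestamp skew).

```python
from typing import Dict, List, Tuple

# One sample: (unix timestamp in seconds, value). Hypothetical type for this sketch.
Sample = Tuple[int, float]


def merge_replicas(a: List[Sample], b: List[Sample]) -> List[Sample]:
    """Merge samples of one series from two replicas, one value per timestamp."""
    merged: Dict[int, float] = {}
    # setdefault keeps the first value seen, so replica `a` wins on conflicts.
    for ts, value in a + b:
        merged.setdefault(ts, value)
    return sorted(merged.items())


if __name__ == "__main__":
    cluster_a = [(0, 1.0), (15, 2.0), (45, 4.0)]  # gap at t=30, e.g. during a restart
    cluster_b = [(0, 1.0), (30, 3.0), (45, 4.0)]  # gap at t=15
    print(merge_replicas(cluster_a, cluster_b))
    # [(0, 1.0), (15, 2.0), (30, 3.0), (45, 4.0)]
```

Each replica has a gap, but the merged result does not, which is why this setup removes blank spots at the price of storing everything twice.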
We are also testing VM, and after a few days our usage has been around 0.77 GB/day with around 2 billion metrics/day. The size will vary with the number of unique metrics, I think. There might not be any storage savings from downsampling. Read this document by Thanos, where they say there won't be any savings:
> In fact, downsampling doesn’t save you any space but instead it adds 2 more blocks for each raw block which are only slightly smaller or relatively similar size to raw block. This is required by internal downsampling implementation which to be mathematically correct holds various aggregations. This means that downsampling can increase the size of your storage a bit (~3x), but it gives massive advantage on querying long ranges.
@flashmouse, thanks for the questions. See answers below.
> 1. We found that VM has no replication or any other way to avoid data loss when vmstorage needs a restart. The VM introduction says data loss can be avoided when VM is used as Prometheus remote storage, but we write data to VM directly, so there may currently be no way to prevent it.
Cluster version of VM shouldn't lose data if the following conditions are met:
- The number of vmstorage nodes is greater than 1.
- vmstorage nodes are restarted one-by-one instead of all-at-once.
In this case vminsert re-routes incoming data to the available vmstorage nodes if certain nodes are temporarily unavailable. This prevents data loss on cluster node upgrade, config change or restart.
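The re-routing behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not VictoriaMetrics's actual routing code: `pick_node`, the CRC32 hash and the node names are all assumptions made for the example.

```python
import zlib
from typing import List, Optional, Set


def pick_node(series_key: str, nodes: List[str], unavailable: Set[str]) -> Optional[str]:
    """Pick a storage node for a series, skipping nodes that are currently down."""
    if not nodes:
        return None
    # The hash of the series key selects the preferred node.
    start = zlib.crc32(series_key.encode("utf-8")) % len(nodes)
    # Walk the node list from the preferred position until a healthy node is found.
    for offset in range(len(nodes)):
        candidate = nodes[(start + offset) % len(nodes)]
        if candidate not in unavailable:
            return candidate
    return None  # every node is down, so the sample cannot be stored


if __name__ == "__main__":
    nodes = ["vmstorage-1", "vmstorage-2", "vmstorage-3"]  # hypothetical node names
    preferred = pick_node("cpu_usage{host=\"a\"}", nodes, set())
    # While the preferred node restarts, the series goes to another node.
    print(preferred, pick_node("cpu_usage{host=\"a\"}", nodes, {preferred}))
```

With a single node there is nowhere to re-route to, and restarting all nodes at once leaves the loop with no healthy candidate, which matches the two conditions listed above.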
> 2. We now write more than 1.5 billion points every day and therefore use more than 1 GB of disk space per day. We really think downsampling is necessary for a TSDB used in a production environment; otherwise the hardware cost of maintaining it would be huge.
The downsampling isn't implemented yet because of the following issues:
- There is no silver-bullet downsampling: different people want different downsampling for different time series. Some just want a random sample, some want the min value, others want the average value, a histogram, or a set of quantiles over the downsampled interval, etc.
- VictoriaMetrics reduces the need for downsampling by providing a good compression ratio (your case shows 1.5GB/1B samples=1.5 bytes per sample) and high scan speed - up to 50 million samples per CPU core - which scales linearly with the CPU cores available on vmselect nodes.
As for the 1.5GB/day growth rate for time series data, this is a pretty low number. It means 1.5GB*365=547GB per year. A 547GB HDD-based persistent replicated durable GCP disk costs $20/month. IMHO, this is quite a low price for storing 365 billion samples. There are less reliable options with lower costs.
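The arithmetic above can be reproduced with a few lines. This is only a worked calculation, assuming decimal gigabytes (1 GB = 10^9 bytes); the helper names are hypothetical. It also runs the same calculation on the figures as stated in the question (more than 1.5 billion points and more than 1 GB per day), which give an even lower per-sample cost.

```python
GB = 1e9  # assumption: decimal gigabytes


def bytes_per_sample(bytes_per_day: float, samples_per_day: float) -> float:
    """Average on-disk size of one sample."""
    return bytes_per_day / samples_per_day


def yearly_gb(gb_per_day: float, days: int = 365) -> float:
    """Disk growth per year at a constant daily rate."""
    return gb_per_day * days


if __name__ == "__main__":
    # Figures used in the reply: 1.5 GB/day for 1 billion samples/day.
    print(bytes_per_sample(1.5 * GB, 1e9))  # 1.5
    print(yearly_gb(1.5))                   # 547.5
    # Figures as stated in the question: about 1 GB/day for 1.5 billion samples/day.
    print(round(bytes_per_sample(1.0 * GB, 1.5e9), 2))  # 0.67
    print(yearly_gb(1.0))                                # 365.0
```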
> We use a 2C8G server in production. One month has now passed, and memory usage has kept increasing every day, from 20% to 55% now. We have 450K active time series and a very low query rate (fewer than 5 developers query through Grafana each day). We now suspect a memory leak.
process_resident_memory_bytes metric.
> It seems that "tenancy" is currently only part of the hash input that decides which vmstorage node stores the data. It does not really allow different tenants to insert data into different vmstorage sub-clusters, so one vmstorage node may hold data from many tenants. Will this change in the future?
There are no plans to change this in the future. If you want to store each tenant in a single cluster, then just use a dedicated cluster per tenant. A simple path-based routing proxy can be put in front of such clusters in order to properly route tenantID requests to the particular cluster.
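The routing decision of such a path-based proxy could look like the sketch below. It assumes the cluster URL layout in which the tenant (accountID, optionally accountID:projectID) is the second path segment after /insert or /select; the cluster addresses, the `TENANT_CLUSTERS` table and the `route` function are hypothetical. It only computes the upstream URL and does not forward any traffic.

```python
from typing import Dict, Optional

# Hypothetical mapping from tenant (accountID) to the address of its dedicated cluster.
TENANT_CLUSTERS: Dict[str, str] = {
    "1": "http://cluster-a.example",
    "2": "http://cluster-b.example",
}


def route(path: str, clusters: Dict[str, str] = TENANT_CLUSTERS) -> Optional[str]:
    """Return the upstream URL for a path such as /insert/1/prometheus/api/v1/write."""
    trimmed = path.lstrip("/")
    parts = trimmed.split("/", 2)
    if len(parts) < 2 or parts[0] not in ("insert", "select"):
        return None  # not a tenant-scoped request
    tenant = parts[1].split(":", 1)[0]  # accountID, ignoring an optional :projectID
    base = clusters.get(tenant)
    if base is None:
        return None  # unknown tenant
    return base + "/" + trimmed


if __name__ == "__main__":
    print(route("/insert/1/prometheus/api/v1/write"))
    # http://cluster-a.example/insert/1/prometheus/api/v1/write
    print(route("/select/9/prometheus/api/v1/query"))
    # None
```

In practice the same mapping would usually live in the configuration of an existing reverse proxy rather than in custom code.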
> 5. We have put a load balancer in front of vminsert and vmselect. The load balancer needs a URL for health checks. I checked the source code, and /metrics may be the appropriate path for now. Will something like "/ping" be added?
There is a /health URL for health checks - see https://github.com/VictoriaMetrics/VictoriaMetrics/blob/c197641978f5c78c3d5fe165e20c92827014abdb/lib/httpserver/httpserver.go#L132 .
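A load balancer normally performs this check itself, but for a quick manual probe the same request can be sketched as below. `is_healthy` is a hypothetical helper; it assumes a healthy node answers GET /health with HTTP 200 and treats any connection error, timeout or non-200 status as unhealthy.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False
```

For example, `is_healthy("http://vminsert.example:8480")` would probe a vminsert node; the host name is made up, and the port is an assumption that should be replaced with whatever -httpListenAddr the node actually uses.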
Any chance you can share or post your code for the InfluxDB-to-VM data migration? We are getting ready to do some InfluxDB migrations and this would save us time. I saw this repo and figured it might be a good place to start for posting a PR: https://github.com/VictoriaMetrics/vmctl
@valyala thanks for the reply!
Replication is important, so we may look for another way to resolve the problem.
I agree that different metrics have different downsampling requirements. In InfluxDB, the user may need to define a CQ, so they can downsample however they want. Some other TSDBs attach a value type (such as ratio, counter, gauge and so on) to the add-datapoint request, so the database can downsample by value type. I think the second way may be a good solution.
@matejzero mentioned that downsampling may sometimes increase data size. I had never thought of that before, thanks for the reminder!
The /health path is exactly what we need!
As for the "memory leak" I mentioned: we have not set up Prometheus to monitor VM, so we cannot show how memory usage has changed recently. Since it increases very slowly, we will keep watching it, and if something happens I will provide detailed information.
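The value-type idea could be sketched like this: the database keeps a value type per series and picks the aggregation from it. This is an illustration only, not a feature of VictoriaMetrics or InfluxDB; the `AGGREGATORS` table, the choice of "average for gauges, last value for counters" and the `downsample` function are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)

# Hypothetical mapping from value type to the aggregation applied per bucket.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "gauge": mean,               # average value over the bucket
    "counter": lambda v: v[-1],  # last value keeps the counter monotonic
}


def downsample(samples: List[Sample], step: int, value_type: str) -> List[Sample]:
    """Aggregate samples into buckets of `step` seconds according to the value type."""
    aggregate = AGGREGATORS[value_type]
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in sorted(samples):
        buckets[ts - ts % step].append(value)  # key is the bucket start time
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 9.0)]
    print(downsample(raw, 60, "gauge"))    # [(0, 2.0), (60, 7.0)]
    print(downsample(raw, 60, "counter"))  # [(0, 3.0), (60, 9.0)]
```

The same raw samples produce different downsampled series depending on the value type, which is the point made earlier in the thread: no single aggregation fits every series.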
@nickmy9729 If you need it, I may commit the code later. Because it is based on influx_inspect, it takes a lot of network bandwidth: in my situation, it may generate about 20~40x more data than the original data kept by InfluxDB...
@flashmouse Yeah, that's what I figured based on the output of influx_inspect. I am still interested, and thanks!
@flashmouse This is a good share.
@valyala How does the vmstorage cluster handle the case where one of the vmstorage nodes has crashed and we need to rebuild it from scratch? In my understanding, the cluster architecture doesn't perform synchronisation from node to node.
Should we manually run vmbackup on a healthy node and vmrestore the data to the new node?
FYI, https://github.com/VictoriaMetrics/VictoriaMetrics/issues/93 contains useful tips about data migration from InfluxDB to VictoriaMetrics (and from Prometheus to VictoriaMetrics).
> How does the vmstorage cluster handle the case where one of the vmstorage nodes has crashed and we need to rebuild it from scratch? In my understanding, the cluster architecture doesn't perform synchronisation from node to node.
Data isn't shared between vmstorage nodes. Data is lost if the underlying storage where a vmstorage node stores its data (aka the -storageDataPath directory) stops working. The cluster will continue working with partial data.
It is recommended to make regular backups of each vmstorage node with the vmbackup tool. It is also recommended to use durable replicated underlying storage such as GCP persistent disks, which are protected from hardware failures.
> Should we manually run vmbackup on a healthy node and vmrestore the data to the new node?
No. Each vmstorage node contains a distinct share of the data, so there is no sense in copying data from one vmstorage node to another - it would just duplicate this data.
FYI, cluster version of VictoriaMetrics supports application-level replication starting from v1.36.0. See these docs for more details.
Thanks, we will try version 1.36 soon and check whether problem #490 is resolved.