I want to try using VM to track business data.
My company operates an e-commerce platform with tens of millions of products.
Approximate data volume: 15M to 20M items
Each product has about 15 metrics
Metrics are crawled once a day
This means that every day we get:
15M * 10 to 20M * 10 data points
We want to retain 365 days of data
Is VM suitable for this scenario?
Write speed does not need to be high.
Are there any suggestions for the machine configuration and VM configuration to use in this scenario?
Do we need to use the cluster version?
I imported test data of 12M items * 8 metrics * 1 day, which generated 96.3M data points. Disk space occupied by data points: 1.4GB. Disk space occupied by the inverted index: 2.8GB.
Also, I want to thank @valyala, your work is always amazing. I have learned a lot from gozstd and fasthttp.
Hi @zplzpl! Sorry for the late response.
As you can see from your experiment, this is not the best case for VM. Time series databases are optimized for high-density, continuous series of data points. They contain a lot of optimizations for writing and querying such data, like LSM trees or an inverted index.
Your case is a bit different - very high cardinality (uniqueness) and infrequent writes. For such a use case I'd recommend looking at ClickHouse or something similar.
However, let's see how VM can help. Your experiment shows that the inverted index is larger than the actual data. If the cardinality of the data (the set of all label combinations) is stable, then the index won't grow any further, since it stores only time series IDs and label pairs. Moreover, the index stores active time series (inserted or requested in the last hour) and is also sharded by time. So it is likely that the disk space occupied by the index will be reduced if the time series aren't all queried frequently.
The total required disk space looks OK. Your test shows 1.4GB for part of the data for 1 day. Let's assume it is 3GB per day for the full data; then it would be about 1TB for 1 year, which is totally OK. The difficulty with a high volume of data is scanning large amounts of it, which can be limited by disk speed.
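For reference, the estimate above can be reproduced with a rough calculation. This is only a sketch under stated assumptions: it takes the test result from the question (1.4GB for 96.3M data points), assumes 1GB = 10^9 bytes, assumes the bytes-per-point ratio stays the same at full volume, and uses the 150M to 200M data points per day from the question. It ignores the inverted index entirely. The function names are made up for this example.

```python
# Rough disk estimate for data points only (index excluded).
# Assumption: 1 GB = 10**9 bytes, and compression stays the same at full scale.
GB = 10**9


def bytes_per_point(data_bytes: float, points: float) -> float:
    """Average on-disk bytes per data point observed in a test import."""
    return data_bytes / points


def yearly_disk_gb(points_per_day: float, bpp: float, days: int = 365) -> float:
    """Disk space in GB needed to keep `days` days of data points."""
    return points_per_day * bpp * days / GB


# Test import from the question: 96.3M data points took 1.4GB.
bpp = bytes_per_point(1.4 * GB, 96.3e6)

# Daily volume from the question: 150M to 200M data points per day.
low = yearly_disk_gb(150e6, bpp)
high = yearly_disk_gb(200e6, bpp)

print(f"bytes per data point: {bpp:.1f}")
print(f"disk for 365 days: {low:.0f}GB to {high:.0f}GB")
```

This gives roughly 0.8TB to 1.1TB for a year of data points, in line with the 3GB per day and 1TB per year figures above.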
Your case is very similar to the billy benchmark we recently tested. Please see https://medium.com/@valyala/billy-how-victoriametrics-deals-with-more-than-500-billion-rows-e82ff8f725da for more details. There you can find ways to load and query 1 year of readings, as well as the theoretical performance limits of VM.
Please feel free to ask if you have any questions.