Victoriametrics: query_range is more and more slowly

Created on 1 Jul 2020  ·  25Comments  ·  Source: VictoriaMetrics/VictoriaMetrics

Describe the bug

  1. query_range is more and more slowly, but will be recovery at 08:00 every day(utc 00:00)
  2. vminsert always oom

Expected behavior
query_range will be more faster

Screenshots
111

Version
The line returned when passing --version command line flag to binary. For example:

./vmstorage-prod -version
vmstorage-20200623-210915-heads-cluster-0-g46c5c077

Used command-line flags

./vminsert-prod -influxSkipSingleField -insert.maxQueueDuration 10s -storageNode=10.15.38.29:8400 -storageNode=10.15.38.30:8400 -storageNode=10.15.59.47:8400 -storageNode=10.15.59.48:8400 -storageNode=10.15.59.49:8400 -storageNode=10.15.34.61:8400 -storageNode=10.15.38.19:8400 -storageNode=10.15.38.20:8400 -storageNode=10.15.53.33:8400 -storageNode=10.15.59.20:8400 -storageNode=10.15.36.41:8400
./vmselect-prod -search.maxQueryLen 32768 -storageNode=10.15.78.25:8401 -storageNode=10.15.77.18:8401 -storageNode=10.15.77.19:8401 -storageNode=10.15.77.20:8401 -storageNode=10.15.68.27:8401 -storageNode=10.15.78.44:8401 -storageNode=10.15.81.42:8401 -storageNode=10.15.81.46:8401 -storageNode=10.15.87.23:8401 -storageNode=10.15.49.19:8401 -storageNode=10.15.58.61:8401 -storageNode=10.15.38.29:8401 -storageNode=10.15.38.30:8401 -storageNode=10.15.59.47:8401 -storageNode=10.15.59.48:8401 -storageNode=10.15.59.49:8401 -storageNode=10.15.34.61:8401 -storageNode=10.15.38.19:8401 -storageNode=10.15.38.20:8401 -storageNode=10.15.53.33:8401 -storageNode=10.15.59.20:8401 -storageNode=10.15.36.41:8401
./vmstorage-prod -storageDataPath /data1/vmdata -retentionPeriod 6 -search.maxUniqueTimeseries 5000000

Additional context

  1. we have 11 vmstorage instances. total ingest rate ~ 2.6M points/s
  2. vmstorage log by grep '^2020-07.*error' app.log
    vmstorage.error.log
  3. vminsert log by grep -E '2020-07-01.*(error|warn)' vminsert.log
    vminsert.error.log.tar.gz
  4. vmselect log by grep -E '2020-07-01.*(warn|error)' vmselect.log | grep -v 'slow query according' | grep -Ev 'VictoriaMetrics/app/vmselect/main.go:31[78]'
    vmselect.error.log.tar.gz
  5. vmstorage pprof
    pprof.vmstorage-prod.samples.cpu.001.pb.gz@utc 04:47
    pprof.vmstorage-prod.samples.cpu.002.pb.gz@utc 05:40
need more info question

All 25 comments

update

i upgrade vmstorage/vminsert/vmselect to 0f754bea490b425be531704522fe4ceec7125948
but the problem still exist

update

by reduce the query_range request(almost to 0)
8 of 11 vmstorage nodes recovery, but other 3 nodes still have high cpu usage.
i choose 2 node, run pprof and put the result here.

pprof:
high_cpu.gz
recovery.gz

@n4mine , thanks for the detailed info!

This may be related to the issue a586b8b6d4f0ead2b60eb38711f59d0a120d82d7 , which has been fixed in v1.37.3 . The issue may be triggered by heavy queries. Try updating to the latest release and waiting for a day.

The profile shows that the most of time is spent on populating sets of timeseries ids that match the given label filters from incoming queries. It looks like these label filters match big number of time series (millions). Could you share the contents of /select/<accountID>/prometheus/api/v1/status/tsdb page for <accountID> with the biggest amounts of data? Feel free redacting sensitive labels before sharing the page contents.

Could you also explain why the number of -storageNode args passed to vmselect is 22, while the number of -storageNode args passed to vminsert is 11?

Note that the order of -storageNode args passed to vminsert instances must be the same in order to achieve the optimal data distribition and optimal performance for both data ingestion and data querying.

@valyala

This may be related to the issue a586b8b , which has been fixed in v1.37.3 . The issue may be triggered by heavy queries. Try updating to the latest release and waiting for a day.

i upgrade vmstorage/vminsert/vmselect to 0f754be 40 hours ago
but the problem still exist

the contents of /select/\ with the biggest amounts of data?

i use single accountID =0

{
  "status": "success",
  "data": {
    "seriesCountByMetricName": [
      {
        "name": "by_tags.rpcdisf.latency",
        "value": 10412944
      },
      {
        "name": "by_tags.rpc_dirpc_call.latency",
        "value": 7764388
      },
      {
        "name": "cpu.irq.core",
        "value": 7125778
      },
      {
        "name": "cpu.iowait.core",
        "value": 7124495
      },
      {
        "name": "cpu.idle.core",
        "value": 7121998
      },
      {
        "name": "cpu.steal.core",
        "value": 7120264
      },
      {
        "name": "cpu.softirq.core",
        "value": 7120208
      },
      {
        "name": "cpu.sys.core",
        "value": 7119071
      },
      {
        "name": "cpu.user.core",
        "value": 7118806
      },
      {
        "name": "rpc.latency",
        "value": 6644584
      }
    ],
    "labelValueCountByLabelName": [
      {
        "name": "local",
        "value": 50332
      },
      {
        "name": "key",
        "value": 43511
      },
      {
        "name": "callee",
        "value": 32500
      },
      {
        "name": "__name__",
        "value": 31361
      },
      {
        "name": "�",
        "value": 31065
      },
      {
        "name": "host",
        "value": 16881
      },
      {
        "name": "container",
        "value": 13405
      },
      {
        "name": "pod",
        "value": 12647
      },
      {
        "name": "topic",
        "value": 8032
      },
      {
        "name": "url",
        "value": 7921
      }
    ],
    "seriesCountByLabelValuePair": [
      {
        "name": "__ns__=__sensitive__ns1",
        "value": 41615557
      },
      {
        "name": "protocolType=ALL",
        "value": 41607369
      },
      {
        "name": "host=appNetMonitor",
        "value": 41607368
      },
      {
        "name": "os=ALL",
        "value": 41607368
      },
      {
        "name": "tripCountry=BR",
        "value": 38124039
      },
      {
        "name": "sid=gulfstream",
        "value": 30609327
      },
      {
        "name": "__ns__=__sensitive__ns2",
        "value": 27117483
      },
      {
        "name": "__ns__=__sensitive__ns3",
        "value": 19210597
      },
      {
        "name": "caller-cluster=us01-v",
        "value": 17705344
      },
      {
        "name": "appType=ALL",
        "value": 16604810
      }
    ]
  }
}

Could you also explain why the number of -storageNode args passed to vmselect is 22, while the number of -storageNode args passed to vminsert is 11?

cause i switch storage nodes for trying to solve https://github.com/VictoriaMetrics/VictoriaMetrics/issues/581
i have to provide all new and old vmstorage nodes to vmselect, so old data can be provide for obtain.

i think it's suitable for this situation? (switch vmstorage nodes temporary)

step1: 
vmselect --> node1 ~ node11
vminsert --> node1 ~ node11

step2: (current state)
vmselect --> node1 ~ node11, node12 ~ node22
vminsert --> node12 ~ node22

step3: (after a few days)
vmselect --> node12 ~ node22
vminsert --> node12 ~ node22

spent on populating sets of timeseries ids that match the given label filters from incoming queries. It looks like these label filters match big number of time series (millions)

are the incoming queries include /insert request?

spent on populating sets of timeseries ids that match the given label filters from incoming queries. It looks like these label filters match big number of time series (millions)

are the incoming queries include /insert request?

answer my self, seem it's positive, now the /select traffic almost to 0.
but the cpu usage is still high.
and the pprof result prove my guess

111

 "seriesCountByLabelValuePair": [
      {
        "name": "__ns__=__sensitive__ns1",
        "value": 41615557
      },
      {
        "name": "protocolType=ALL",
        "value": 41607369
      },
      {
        "name": "host=appNetMonitor",
        "value": 41607368
      },
      {
        "name": "os=ALL",
        "value": 41607368
      },
      {
        "name": "tripCountry=BR",
        "value": 38124039
      },

this part is same series but was inactive for a long time(30 days+).
i thinks it's shouldn't exist here.

so if /insert path use same label name/value --> TSID mapping as /tsdb/status result above, it will be performance issue?

a few questions

  1. any solution to reduce the high label -> tsid cache effective immediately?
  2. i delete the highest labelvaluepair series: __ns__=__sensitive__ns1, but it's not work
  3. do you suggest to split to multi tenant?

this part is same series but was inactive for a long time(30 days+). i thinks it's shouldn't exist here.

The /api/v1/status/tsdb returns time series seen during the current day. If doesn't filter out time series deleted on the current day. The /api/v1/status/tsdb accepts optional arg date=YYYY-MM-DD in order to see time series stats for the particular day - see https://victoriametrics.github.io/#troubleshooting for details.

  1. any solution to reduce the high label -> tsid cache effective immediately?

How much RAM is available per each vmstorage node? It looks like the MetricName -> TSID cache size isn't enough for storing the data for active time series. Could you try increasing RAM size for vmstorage nodes for a few hours and see whether this helps reducing CPU time?

  1. i delete the highest labelvaluepair series: __ns__=__sensitive__ns1, but it's not work

This will increase the size of dmis set, which contains ids for deleted time series. This may increase CPU usage at dmis.Has call mentioned in the comment above. Note that is itsn't recommended to use delete API on a frequent basis, since it has non-zero overhead - all the deleted time series ids trap into dmis set and stay there for at least -retentionPeriod months.

  1. do you suggest to split to multi tenant?

This should improve query performance on a per-tenant basis. But this will remove the ability to use data from multiple tenants in a single query. So let's try optimizing the current case.

How much RAM is available per each vmstorage node?

the nodes which low cpu usage, has 16 - 26GB memory available.
and nodes(3 nodes) which high cpu usage, has 38, 48 and 69GB memory available.

and total memory is 128GB.

Could you try increasing RAM size for vmstorage nodes for a few hours and see whether this helps reducing CPU time?

emmm, we use the physical machine, increasing RAM is not easy to operate

all the deleted time series ids trap into dmis set and stay there for at least -retentionPeriod months.

is there a way to clear that dmis sets?

This should improve query performance on a per-tenant basis. But this will remove the ability to use data from multiple tenants in a single query.

i think change tenant will lost all data. it's not a good choose.

So let's try optimizing the current case.

Great :)

the nodes which low cpu usage, has 16 - 26GB memory available.
and nodes(3 nodes) which high cpu usage, has 38, 48 and 69GB memory available.

It is recommend to have nodes with identical hardware configuration per particular service type in the cluster. Otherwise hardware resources may be left underutilized or overutilized.

and total memory is 128GB.

This should be enough for up to 100M active time series in ideal case when time series are evenly distributed across vmstorage nodes. The distribution may be uneven when certain vmstorage nodes are overloaded - in this case vminsert re-routes time series to less loaded vmstorage nodes. This reduces the total capacity of the cluster for active time series. If the metainfo for active time series doesn't fit available RAM, then VictoriaMetrics could start using more CPU and disk IO resources as outlined in troubleshooting docs. Could you check for the number of active time series in the cluster on the official grafana dashboard?

is there a way to clear that dmis sets?

unfortunately VictoriaMetrics doesn't provide mechanisms to delete dmis (ids for deleted time series) :( The dmis contents is stored in the inverted index (indexdb folder under -storageDataPath), so the easiest way to delete it is to delete inverted index. But this is equivalent to deleting all the stored data, since it is impossible to search for the required data without inverted index.

It is recommend to have nodes with identical hardware configuration per particular service type in the cluster. Otherwise hardware resources may be left underutilized or overutilized.

sorry, may be my describe is not clear
every single vmstorage node, has 128G memory
i have 11 vmstorage nodes

Could you check for the number of active time series in the cluster on the official grafana dashboard?

it's ~209M(total)
and these high cpu usage nodes, has the lowest active series

other infomation(btw, all vmstorage node start same time, so the stats of *_total should be consistent)

the vmstorage node which has name *001, *003, *007 is abnormal, and everything(cpu usage going up) begin at 07-01 12:00(+0800)

active series

000

tsid cache entries

111

tsid cache size

222

vm_allowed_memory_bytes

333

memory available

444

vm_deleted_metrics_total

555
111

vm_cache_misses_total

666

ingest rate

777

cpu usage last 10 days

888

@valyala

according the suggestion of delete api scenarios

The delete API is intended mainly for the following cases:

  • One-off deleting of accidentally written invalid (or undesired) time series.
  • One-off deleting of user data due to GDPR.

it's reasonable to check dmis while add() in insert path?

reason 1

for accidentally written invalid (or undesired) time series or user data due to GDPR
after delete operate finish, these data will not be queryable any more.

reason 2

all the deleted time series ids trap into dmis set and stay there for at least -retentionPeriod months.

these deleted data exist until retentionPeriod reached

reason 3

in this case, /delete has called.
so in insert path, to check those series ids which in dmis will be affect performance obviously when /delete impact huge number series(8M vm_deleted_metrics_total{type="indexdb"} in this case)

so, if there is a way to avoid check dmis in insert path? it will be improve performance.

i try to delete && !dmis.Has(r.TSID.MetricID) in #604 , but failed when test in TestStorageDeleteMetrics.

i build it locally and deploy to my environment, it seem worked, the cpu usage was gone down, but i don't known is a good solution or not.

While the idea of removing the check for deleted time series in fast path during data ingestion is OK, it is impossible to just remove the check, since this will lead to inability to insert data for the deleted time series in the future. For example, after the time series foo{bar="baz"} is deleted, Storage.add will continue adding incoming data points to the deleted time series, so they will remain invisible until the corresponding entry in MetricName->TSID cache for the deleted time series is evicted. This is unexpected behavior for users, since they won't see newly added data after deleting time series.

In theory there is another approach - to synchronously delete MetricName->TSID entries for the deleted time series on every delete API call. I doubt it is possible.

Note that the Storage.add function contains two calls to dmis.Has:

The attached CPU profile shows that the CPU time is spent on the slow path. This means that the size of MetricName->TSID cache isn't enough for holding information for active time time series. See slow inserts percentage graph on the official dashboard for VictoriaMetrics - if it shows values bigger than a few percents, then it is likely there is no enough memory for holding information for active time series in RAM and this will likely result in increased usage for CPU and disk IO. If your dashboard has no this graph, then try upgrading the dashboard from Grafana site.

There are the following solutions for this case:

  • To add more vmstorage nodes, so the summary RAM size for them would be enough for holding information for active time series.
  • To increase RAM size per each vmstorage node. As I understood, this case doesn't fit you, since vmstorage runs on bare metal machines.
  • To reduce the number of active time series during data ingestion, so they will fit the available RAM on vmstorage nodes. The most effective approach is to investigate /api/v1/status/tsdb output and to remove the source of time series with high cardinality labels.

See slow inserts percentage graph on the official dashboard for VictoriaMetrics - if it shows values bigger than a few percents, then it is likely there is no enough memory for holding information for active time series in RAM and this will likely result in increased usage for CPU and disk IO.

the slow inserts percentage seems ok
image

but the follow image confuse me, these abnormal nodes have enough available memory, but the vm_cache_size_bytes{type="storage/tsid"} on these nodes is strange(bottom 3 lines), is there a tools to debug why the TSID cache is flapping?
222

@valyala

a new clues,

from 2020-07-01 04:00(utc), something changed, and then tsid cache size start to flapping, and the cpu usage of few vmstorage nodes rising up.

At the time, the vmstorage's log like

2020-07-01T04:00:00.002Z        info    VictoriaMetrics/lib/mergeset/table.go:167       opening table "/data1/vmdata/indexdb/161B62429BB1E5B2"...
2020-07-01T04:00:00.084Z        info    VictoriaMetrics/lib/mergeset/table.go:201       table "/data1/vmdata/indexdb/161B62429BB1E5B2" has been opened in 0.08
2 seconds; partsCount: 0; blocksCount: 0, itemsCount: 0; sizeBytes: 0
2020-07-01T04:00:00.090Z        info    VictoriaMetrics/lib/mergeset/table.go:217       waiting for raw items flusher to stop on "/data1/vmdata/indexdb/161B62
429BB1E5B0"...
2020-07-01T04:00:00.090Z        info    VictoriaMetrics/lib/mergeset/table.go:220       raw items flusher stopped in 0.000 seconds on "/data1/vmdata/indexdb/1
61B62429BB1E5B0"
2020-07-01T04:00:00.090Z        info    VictoriaMetrics/lib/mergeset/table.go:222       waiting for converters to stop on "/data1/vmdata/indexdb/161B62429BB1E
5B0"...
2020-07-01T04:00:00.090Z        info    VictoriaMetrics/lib/mergeset/table.go:225       converters stopped in 0.000 seconds on "/data1/vmdata/indexdb/161B6242
9BB1E5B0"
2020-07-01T04:00:00.090Z        info    VictoriaMetrics/lib/mergeset/table.go:227       waiting for part mergers to stop on "/data1/vmdata/indexdb/161B62429BB
1E5B0"...
2020-07-01T04:00:00.091Z        info    VictoriaMetrics/lib/mergeset/table.go:230       part mergers stopped in 0.000 seconds on "/data1/vmdata/indexdb/161B62
429BB1E5B0"
2020-07-01T04:00:00.091Z        info    VictoriaMetrics/lib/mergeset/table.go:232       flushing inmemory parts to files on "/data1/vmdata/indexdb/161B62429BB
1E5B0"...
2020-07-01T04:00:00.091Z        info    VictoriaMetrics/lib/mergeset/table.go:256       0 inmemory parts have been flushed to files in 0.000 seconds on "/data
1/vmdata/indexdb/161B62429BB1E5B0"
2020-07-01T04:00:00.466Z        info    VictoriaMetrics/lib/storage/index_db.go:370     dropping indexDB "/data1/vmdata/indexdb/161B62429BB1E5B0"
2020-07-01T04:00:00.574Z        info    VictoriaMetrics/lib/storage/index_db.go:372     indexDB "/data1/vmdata/indexdb/161B62429BB1E5B0" has been dropped
2020-07-01T04:00:00.581Z        error   VictoriaMetrics/lib/storage/index_db.go:2607    missing metricName by metricID 1592974931433885494; this could be the case after unclean shutdown; deleting the metricID, so it could be re-created next time
2020-07-01T04:00:00.582Z        error   VictoriaMetrics/lib/storage/index_db.go:2607    missing metricName by metricID 1592974931435360975; this could be the case after unclean shutdown; deleting the metricID, so it could be re-created next time
2020-07-01T04:00:00.582Z        error   VictoriaMetrics/lib/storage/index_db.go:2607    missing metricName by metricID 1592974931435362390; this could be the case after unclean shutdown; deleting the metricID, so it could be re-created next time
2020-07-01T04:00:00.582Z        error   VictoriaMetrics/lib/storage/index_db.go:2607    missing metricName by metricID 1592974931435381426; this could be the case after unclean shutdown; deleting the metricID, so it could be re-created next time

VictoriaMetrics keeps indexdb instance (inverted index for matching labels to time series) that covers -retentionPeriod. It looks like the log captured indexdb rotation, which occurs every -retentionPeriod - the /data1/vmdata/indexdb/161B62429BB1E5B0 has been substituted by /data1/vmdata/indexdb/161B62429BB1E5B2. It is likely the missing metricName by metricID error is related to a bug with handling of non-empty dmis aka a set of deleted metric ids for each indexdb instance. How many missing metricName by metricID... log lines the log contains after the indexdb rotation?

How many missing metricName by metricID... log lines the log contains after the indexdb rotation?

min logs lines is 43, and max is 306 in different nodes

and the time distribution is from 07-01 04:00(utc) to 07-01 05:57(utc)

@n4mine , thanks for the idea about removing dmis.Has check inside Storage.add! It has been appeared it is quite easy to implement it by just resetting MetricName->TSID cache after the deletion of time series. This guarantees that the cache won't contain entries for the deleted time series. See the commit fe58462bef9f6c211a036fa1e4f9cf3ced4b9ad4 .

@valyala

thanks a lot.

BTW, i have split cluster horizontally from 1 to 4 yesterday by namespace in our business scenario.

i will continue to observe this case until next retentionPeriod reach.

thanks again :)

FYI, the commit that removes dmis.Has calls from data ingestion path has been included in v1.38.0 release.

Was this page helpful?
0 / 5 - 0 ratings