VictoriaMetrics: When relabelling is enabled in vminsert it's causing inconsistency in all metrics

Created on 17 Jul 2020 · 13 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
When relabelling is enabled in vminsert via the -relabelConfig CLI flag, metrics become inconsistent when ingested into vmstorage: a single metric shows two values, one of which is wrong, as shown in the screenshot.

To Reproduce
Install vminsert and start the process with the -relabelConfig CLI flag pointing to the relabel rules file.

Expected behavior
When relabelling is enabled, the relabelling rules should be applied to the metrics without introducing any inconsistency in metric names or labels.

Screenshots
Screenshot 2020-07-14 at 9 31 39 PM

Version

$ vminsert-prod -version
vminsert-20200708-175209-tags-v1.38.0-cluster-0-g418f0e46c

Used command-line flags

flag{name="csvTrimTimestamp", value="1ms"} 1
flag{name="enableTCP6", value="false"} 1
flag{name="envflag.enable", value="false"} 1
flag{name="envflag.prefix", value=""} 1
flag{name="fs.disableMmap", value="false"} 1
flag{name="graphiteListenAddr", value=""} 1
flag{name="graphiteTrimTimestamp", value="1s"} 1
flag{name="http.disableResponseCompression", value="false"} 1
flag{name="http.maxGracefulShutdownDuration", value="7s"} 1
flag{name="http.pathPrefix", value=""} 1
flag{name="http.shutdownDelay", value="0s"} 1
flag{name="httpListenAddr", value="127.0.0.1:8480"} 1
flag{name="import.maxLineLen", value="104857600"} 1
flag{name="influxListenAddr", value=""} 1
flag{name="influxMeasurementFieldSeparator", value="_"} 1
flag{name="influxSkipSingleField", value="false"} 1
flag{name="influxTrimTimestamp", value="1ms"} 1
flag{name="insert.maxQueueDuration", value="2m0s"} 1
flag{name="loggerErrorsPerSecondLimit", value="10"} 1
flag{name="loggerFormat", value="json"} 1
flag{name="loggerLevel", value="INFO"} 1
flag{name="loggerOutput", value="stdout"} 1
flag{name="maxConcurrentInserts", value="8"} 1
flag{name="maxInsertRequestSize", value="33554432"} 1
flag{name="maxLabelsPerTimeseries", value="30"} 1
flag{name="memory.allowedPercent", value="60"} 1
flag{name="opentsdbHTTPListenAddr", value=""} 1
flag{name="opentsdbListenAddr", value=""} 1
flag{name="opentsdbTrimTimestamp", value="1s"} 1
flag{name="opentsdbhttp.maxInsertRequestSize", value="33554432"} 1
flag{name="opentsdbhttpTrimTimestamp", value="1ms"} 1
flag{name="relabelConfig", value=""} 1
flag{name="replicationFactor", value="2"} 1
flag{name="rpc.disableCompression", value="false"} 1
flag{name="storageNode", value="vmstorage-node-1.com:8400,vmstorage-node-2.com:8400"} 1
flag{name="version", value="false"} 1
bug

All 13 comments

@jsanant, look here: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/599 and then upgrade to 1.38.1.

@Allineer - Thanks I'll take a look

@valyala - Can I please get an update on this? If you need more information please let me know.

@jsanant , could you provide the following additional information?

  • Which line on the graph is incorrect?
  • Could you provide a screenshot for the same query that covers a time range with both cases - when -relabelConfig is enabled and when it is disabled?
  • Could you provide a screenshot for the inner_query from sum(inner_query_here) by (path), i.e. for disk_used_percent{host=~"...", path!~"/snap.*"}? The screenshot should cover the time range for both cases - when -relabelConfig is enabled and when it is disabled. It should also include the time series labels at the bottom of the graph. If the query returns too many time series, try filtering them out with additional filters in the query.

The green line is the incorrect one.

Screenshot of when -relabelConfig is enabled:
Screenshot 2020-07-14 at 9 31 39 PM

Screenshot of when -relabelConfig is not enabled:
Screenshot 2020-07-21 at 10 08 52 PM

Screenshot for the inner query when -relabelConfig is enabled:
Screenshot 2020-07-21 at 9 58 12 PM

Screenshot for the inner query when -relabelConfig is not enabled:
Screenshot 2020-07-21 at 10 07 09 PM

Hope this helps!

@jsanant , thanks for the provided graphs! The last two graphs are quite interesting:

  • The graph with -relabelConfig enabled contains 12 time series matching the given query, while the graph without -relabelConfig contains only a single matching time series. It is unclear why other metrics with labels such as device="loopN" or fstype="squashfs" aren't shown on the last graph. These metrics probably have a path label that starts with the /snap prefix. Could you try removing the path!~"/snap.*" filter from the query to verify this assumption?
  • All the labels for time series on the graph with -relabelConfig look correct. This rules out the case with a bug, which could lead to incorrect relabeling.
  • Only a single time series on the graph with -relabelConfig contains the path label, while the others lack it. There are also 5 time series without the mode label. This looks suspicious. There is probably a bug in VictoriaMetrics that sometimes trims part of the labels when -relabelConfig is set. I need to look in this direction.

@valyala , you are right, I didn't pass the path!~"/snap.*" in screenshot 3

Screenshot of the inner query without path!~"/snap.*" and with -relabelConfig not enabled:
Screenshot 2020-07-22 at 12 20 56 PM

  • Only a single time series on the graph with -relabelConfig contains the path label, while the others lack it. There are also 5 time series without the mode label. This looks suspicious. There is probably a bug in VictoriaMetrics that sometimes trims part of the labels when -relabelConfig is set. I need to look in this direction.

So vminsert is performing the relabeling as expected, but VictoriaMetrics is storing it in a different format?

No, it looks like vminsert has a relabeling bug that sometimes removes some of a metric's labels, such as path or mode. I'm still investigating this case.

The workaround is to perform relabeling on vmagent side.

@jsanant , could you share the contents of -relabelConfig file?
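
To see why dropped labels would distort a query like sum(disk_used_percent{..., path!~"/snap.*"}) by (path), here is a minimal Python sketch of the relevant PromQL semantics: a missing label is treated as an empty string, so a series that has lost its path label passes the path!~"/snap.*" filter and is then grouped under an empty path. This is an illustration under simplified assumptions, not a diagnosis confirmed in this thread; the helper names and the sample series and values are invented for the example.

```python
import re
from collections import defaultdict


def matches_not_regex(labels, name, regex):
    """Emulate a `name!~"regex"` matcher.

    A missing label is treated as an empty string, and the regex
    must match the whole value (matchers are fully anchored).
    """
    return re.fullmatch(regex, labels.get(name, "")) is None


def sum_by(series, label):
    """Emulate `sum(...) by (label)` for a single point in time."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Invented sample data: (labels, value) pairs.
series = [
    ({"device": "sda1", "path": "/"}, 40.0),
    ({"device": "loop0", "path": "/snap/core/1"}, 100.0),
    # A snap mount whose path label was lost during ingestion.
    ({"device": "loop1"}, 100.0),
]

kept = [(labels, value) for labels, value in series
        if matches_not_regex(labels, "path", "/snap.*")]
result = sum_by(kept, "path")
print(result)
```

In this sketch the loop0 series is filtered out as intended, while the loop1 series without a path label slips through and produces an extra result with an empty path.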

Here is the content for the -relabelConfig file:

- action: replace_all
  source_labels: [__name__]
  target_label: '__name__'
  regex: "-"
  replacement: "_"
- action: labelmap_all
  regex: "-"
  replacement: "_"

I am using the above one with vmagent, and it is working without any issues.
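
For readers unfamiliar with these two actions, the following minimal Python sketch shows what the rules above are meant to do: replace every "-" with "_" in the metric name (replace_all on __name__) and in every label name (labelmap_all). It only illustrates the intent and is not VictoriaMetrics code; the function name apply_rules and the sample series are invented for this example.

```python
import re


def apply_rules(labels: dict) -> dict:
    """Mimic the intent of the two rules: '-' -> '_' in the
    metric name and in all label names (hypothetical helper)."""
    dash = re.compile("-")
    out = {}
    for name, value in labels.items():
        if name == "__name__":
            # replace_all on __name__: rewrite the metric name itself.
            value = dash.sub("_", value)
        # labelmap_all: rewrite label names; other label values stay as-is.
        out[dash.sub("_", name)] = value
    return out


# Invented sample series for illustration.
sample = {"__name__": "disk-used-percent", "host-name": "node-1", "path": "/"}
relabeled = apply_rules(sample)
print(relabeled)
```

Note that the relabeled series keeps the same number of labels as the input. The bug discussed in this thread is that some labels, such as path or mode, went missing after relabeling in vminsert.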

It looks like I figured out the origin of the bug and fixed it in the following commits:

  • Cluster version - c91ccce50cc3f6d23e638b605c6047d4418e24d3
  • Single-node version - 2f612e0c67d45e0ba539ed5ed0b63b2dbeea9797

@jsanant , try building vminsert from the commit c91ccce50cc3f6d23e638b605c6047d4418e24d3 according to these docs and verifying whether it fixes the issue.

@valyala - Verified, it's working without any issues! Thank you.

FYI, the bugfix has been included in v1.39.0. Closing the bug as fixed.
