VictoriaMetrics: vmagent lost some metrics

Created on 8 May 2020 · 7 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
vmagent args (the order matters)

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \

So we have two VictoriaMetrics (VM) nodes:

  • vm1 on port 8426
  • vm2 on port 8427
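
The reason the flag order matters can be sketched as follows. This assumes that the i-th -remoteWrite.urlRelabelConfig is applied to the i-th -remoteWrite.url, which is how the flags are used in this report. The helper name pair_relabel_configs is hypothetical, and the elided paths are kept exactly as they appear above.

```python
# Hypothetical helper: pairs repeated flags by their position on the command line.
def pair_relabel_configs(urls, relabel_configs):
    """Return (url, relabel_config) pairs matched by position."""
    if len(relabel_configs) > len(urls):
        # A choice made for this sketch, not a statement about vmagent's behavior.
        raise ValueError("more relabel configs than remote write URLs")
    # URLs without a matching config get None, meaning "no per-URL relabeling".
    padded = list(relabel_configs) + [None] * (len(urls) - len(relabel_configs))
    return list(zip(urls, padded))


pairs = pair_relabel_configs(
    ["http://127.0.0.1:8426/api/v1/write", "http://127.0.0.1:8427/api/v1/write"],
    ["/.../vm1.yaml", "/.../vm2.yaml"],
)
for url, config in pairs:
    print(f"{url} <- {config}")
```

Swapping both lists together, as in test case 4 below, keeps each URL paired with the same file and only changes which pair comes first.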

Configs:
targets.yaml

global:
  scrape_interval: 60s
  external_labels:
    agent: shkoder
    instance: shkoder

scrape_configs:
  - job_name: extra
    static_configs:
      - targets: ["127.0.0.1:4084"]

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: test_metric
        replacement: true
        target_label: __keep

      - source_labels: [__name__]
        regex: (up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
        replacement: true
        target_label: __keep

      - source_labels: [__keep]
        regex: true
        action: keep

vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

With these configs I expect the "system" metrics

up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling

to go to the vm1 instance.
All other "business" metrics (named test_metric in the attached example) should be stored in vm2.
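
The expected routing can be checked with a small simulation. This is a minimal sketch of Prometheus-style relabeling, not vmagent's implementation: it handles only the replace and keep actions with literal replacements, and the names survives, VM1_RULES and VM2_RULES are made up for illustration. The rules themselves are copied from vm1.yaml and vm2.yaml above.

```python
import re

SYSTEM = "up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling"

# Rules copied from vm1.yaml and vm2.yaml in the report.
VM1_RULES = [
    {"source_labels": ["job", "agent", "__name__"],
     "regex": f"extra;shkoder;({SYSTEM})",
     "replacement": "true", "target_label": "__keep"},
    {"source_labels": ["__keep"], "regex": "true", "action": "keep"},
]
VM2_RULES = [
    {"source_labels": ["job", "agent", "__name__"],
     "regex": "extra;shkoder;test_metric",
     "replacement": "true", "target_label": "__keep"},
    {"source_labels": ["__keep"], "regex": "true", "action": "keep"},
]


def survives(labels, rules):
    """Return True if a label set passes every rule (simplified relabeling)."""
    labels = dict(labels)  # copy, so each destination starts from the original labels
    for rule in rules:
        # Source label values are joined with ";" and the regex must match the whole string.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        matched = re.fullmatch(rule["regex"], value) is not None
        action = rule.get("action", "replace")
        if action == "replace" and matched:
            # Only literal replacements are handled; "$1"-style references are not expanded.
            labels[rule["target_label"]] = rule["replacement"]
        elif action == "keep" and not matched:
            return False
    return True


# 100 business series plus the 4 system series, all with job=extra and agent=shkoder.
base = {"job": "extra", "agent": "shkoder"}
series = [dict(base, __name__="test_metric", idx=str(i)) for i in range(100)]
series += [dict(base, __name__=name) for name in SYSTEM.split("|")]

routed = {
    "vm1": [s for s in series if survives(s, VM1_RULES)],
    "vm2": [s for s in series if survives(s, VM2_RULES)],
}
print(len(routed["vm1"]), len(routed["vm2"]))  # prints: 4 100
```

Under these assumptions vm1 should receive the 4 system series and vm2 all 100 test_metric series, regardless of the order in which the two destinations are processed.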

Here are the 100 test metrics:

# HELP test_metric Some description
# TYPE test_metric gauge
test_metric{idx="0"} 1.0
test_metric{idx="1"} 1.0
test_metric{idx="2"} 1.0
test_metric{idx="3"} 1.0
test_metric{idx="4"} 1.0
test_metric{idx="5"} 1.0
test_metric{idx="6"} 1.0
test_metric{idx="7"} 1.0
test_metric{idx="8"} 1.0
test_metric{idx="9"} 1.0
test_metric{idx="10"} 1.0
test_metric{idx="11"} 1.0
test_metric{idx="12"} 1.0
test_metric{idx="13"} 1.0
test_metric{idx="14"} 1.0
test_metric{idx="15"} 1.0
test_metric{idx="16"} 1.0
test_metric{idx="17"} 1.0
test_metric{idx="18"} 1.0
test_metric{idx="19"} 1.0
test_metric{idx="20"} 1.0
test_metric{idx="21"} 1.0
test_metric{idx="22"} 1.0
test_metric{idx="23"} 1.0
test_metric{idx="24"} 1.0
test_metric{idx="25"} 1.0
test_metric{idx="26"} 1.0
test_metric{idx="27"} 1.0
test_metric{idx="28"} 1.0
test_metric{idx="29"} 1.0
test_metric{idx="30"} 1.0
test_metric{idx="31"} 1.0
test_metric{idx="32"} 1.0
test_metric{idx="33"} 1.0
test_metric{idx="34"} 1.0
test_metric{idx="35"} 1.0
test_metric{idx="36"} 1.0
test_metric{idx="37"} 1.0
test_metric{idx="38"} 1.0
test_metric{idx="39"} 1.0
test_metric{idx="40"} 1.0
test_metric{idx="41"} 1.0
test_metric{idx="42"} 1.0
test_metric{idx="43"} 1.0
test_metric{idx="44"} 1.0
test_metric{idx="45"} 1.0
test_metric{idx="46"} 1.0
test_metric{idx="47"} 1.0
test_metric{idx="48"} 1.0
test_metric{idx="49"} 1.0
test_metric{idx="50"} 1.0
test_metric{idx="51"} 1.0
test_metric{idx="52"} 1.0
test_metric{idx="53"} 1.0
test_metric{idx="54"} 1.0
test_metric{idx="55"} 1.0
test_metric{idx="56"} 1.0
test_metric{idx="57"} 1.0
test_metric{idx="58"} 1.0
test_metric{idx="59"} 1.0
test_metric{idx="60"} 1.0
test_metric{idx="61"} 1.0
test_metric{idx="62"} 1.0
test_metric{idx="63"} 1.0
test_metric{idx="64"} 1.0
test_metric{idx="65"} 1.0
test_metric{idx="66"} 1.0
test_metric{idx="67"} 1.0
test_metric{idx="68"} 1.0
test_metric{idx="69"} 1.0
test_metric{idx="70"} 1.0
test_metric{idx="71"} 1.0
test_metric{idx="72"} 1.0
test_metric{idx="73"} 1.0
test_metric{idx="74"} 1.0
test_metric{idx="75"} 1.0
test_metric{idx="76"} 1.0
test_metric{idx="77"} 1.0
test_metric{idx="78"} 1.0
test_metric{idx="79"} 1.0
test_metric{idx="80"} 1.0
test_metric{idx="81"} 1.0
test_metric{idx="82"} 1.0
test_metric{idx="83"} 1.0
test_metric{idx="84"} 1.0
test_metric{idx="85"} 1.0
test_metric{idx="86"} 1.0
test_metric{idx="87"} 1.0
test_metric{idx="88"} 1.0
test_metric{idx="89"} 1.0
test_metric{idx="90"} 1.0
test_metric{idx="91"} 1.0
test_metric{idx="92"} 1.0
test_metric{idx="93"} 1.0
test_metric{idx="94"} 1.0
test_metric{idx="95"} 1.0
test_metric{idx="96"} 1.0
test_metric{idx="97"} 1.0
test_metric{idx="98"} 1.0
test_metric{idx="99"} 1.0
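
To reproduce the setup, the exposition text above can be generated instead of typed by hand. This is a minimal sketch: render_test_metrics and serve are hypothetical names, the address 127.0.0.1:4084 comes from targets.yaml, and the /metrics path assumes the default metrics_path of a scrape config.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_test_metrics(count=100):
    """Build the exposition text for `count` test_metric series with idx labels."""
    lines = ["# HELP test_metric Some description", "# TYPE test_metric gauge"]
    lines += [f'test_metric{{idx="{i}"}} 1.0' for i in range(count)]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the generated text only on /metrics; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_test_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=4084):
    """Start a blocking HTTP server that exposes the test metrics."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling serve() starts the exporter on the address that the extra job scrapes; it blocks until interrupted.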

Test cases:

  1. a) Run vmagent.
     b) Query the vm1 database:

scrape_samples_scraped == 100
scrape_samples_post_metric_relabeling == 100

     c) Query the vm2 database:

sum without(idx) (test_metric) == 96

Four metrics are lost.


  2. a) Comment out everything in vm1.yaml.
     b) Restart vmagent using service vmagent stop && service vmagent start.
     c) Query the vm2 database:

sum without(idx) (test_metric) == 100

All metrics are there.


  3. a) Uncomment everything in vm1.yaml.
     b) Restart vmagent using service vmagent stop && service vmagent start.
     c) Query the vm2 database:

sum without(idx) (test_metric) == 96

Four metrics are lost.


  4. a) Change the order of the vmagent args:

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \

     b) Restart vmagent using service vmagent stop && service vmagent start.
     c) Query the vm2 database:

sum without(idx) (test_metric) == 100

All metrics are there.

  5. a) Edit the configuration in vm1.yaml:

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;up
  replacement: true
  target_label: __keep

     b) Restart vmagent using service vmagent stop && service vmagent start.
     c) Query the vm2 database:

sum without(idx) (test_metric) == 99

One metric is lost.
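
The checks in the test cases can be scripted rather than run by hand. This is a minimal sketch that assumes vm1 and vm2 are single-node VictoriaMetrics instances answering the Prometheus-compatible /api/v1/query endpoint on the ports listed above. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, query):
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


def extract_values(payload):
    """Return the sample values from an instant-query response as floats."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    # Each result item carries "value": [timestamp, "value-as-string"].
    return [float(item["value"][1]) for item in payload["data"]["result"]]


def instant_query(base_url, query, timeout=10):
    """Run an instant query over HTTP and return the resulting values."""
    with urlopen(build_query_url(base_url, query), timeout=timeout) as resp:
        return extract_values(json.load(resp))
```

For example, instant_query("http://127.0.0.1:8427", "sum without(idx) (test_metric)") would return the value compared against 100 in the cases above, provided vm2 is reachable.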

Note: this information comes from the Russian-language Telegram chat.

bug

All 7 comments

Thanks for the detailed bug report!

It would be great to have annotated screenshots with time range covering all the steps performed in the bug report. I'd also suggest using count instead of sum function in the query, i.e. count(test_metric) without(idx) instead of sum(test_metric) without(idx). This should give more accurate results, which depend only on the number of time series and don't depend on metric values.
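
A small illustration of why count is the safer check. This is a sketch with made-up sample values, assuming all series fall into a single group once idx is dropped; the two function names are hypothetical stand-ins for the aggregations.

```python
def sum_without_idx(samples):
    """Mimic sum without(idx): add up the values of all series in the group."""
    return sum(samples.values())


def count_without_idx(samples):
    """Mimic count without(idx): count the series in the group."""
    return len(samples)


# All 100 series present, each with value 1.0: both checks agree.
complete = {str(i): 1.0 for i in range(100)}

# Hypothetical scenario: series 0..3 are missing and series 4 happens to report 5.0.
partial = {str(i): 1.0 for i in range(4, 100)}
partial["4"] = 5.0

print(sum_without_idx(complete), count_without_idx(complete))  # prints: 100.0 100
print(sum_without_idx(partial), count_without_idx(partial))    # prints: 100.0 96
```

In the report every value is 1.0, so sum and count coincide there; the second scenario shows how sum could hide missing series when values differ.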

@valyala

vmagent config:

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \

vm1 database (system) - 1st
vm2 database (business) - 2nd

Case 1.

a) vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

b) vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

image

e) request to the vm2 database:

image

Case 2.

a) vm1.yaml

- source_labels: [__keep]
  regex: true
  action: keep

b) vm2.yaml:

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

c) re-run vmagent using service vmagent stop && service vmagent start
d) no new records in the vm1 database as expected:

image

e) perform requests to vm2 database:

image

image

Case 3.

a) vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

b) vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

image

e) request to the vm2 database:

image

Case 4.

a) vm1.yaml and vm2.yaml not changed since case 3 example.
b) reorder the databases and their configs in the vmagent config:

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \

vm2 database (business) - 1st
vm1 database (system) - 2nd

c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

image

e) request to the vm2 database:

image

Case 5.

a) change vmagent config back:

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \

b) vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;up
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

leave only one "system" metric

c) vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

d) re-run vmagent using service vmagent stop && service vmagent start
e) requests to the vm1 database:

image

f) request to the vm2 database:

image

@Allineer, thanks for all these details. They helped identify the root cause of the bug and fix it in commit 96e001d254a267e51dc72bc2a75c72682c4e7411. This commit will be included in the next release.

Could you build vmagent from this commit according to these docs and verify whether the issue is fixed there?

@valyala, thanks! I'll test it soon.

FYI, the bugfix has been included in v1.35.4.

Fixed! Thanks!

Thanks for the confirmation! Closing the issue as fixed.
