VictoriaMetrics: vmagent lost some metrics

Created on 8 May 2020 · 7 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
vmagent args (the order matters)

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \

So we have two VictoriaMetrics (VM) nodes:

vm1 on port 8426
vm2 on port 8427
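
The report relies on each -remoteWrite.urlRelabelConfig being applied to the -remoteWrite.url at the same position in the argument list. A minimal Python sketch of that pairing follows; the helper name pair_by_position is hypothetical, and the elided paths are kept as written in the report:

```python
# Illustrative only: pair each -remoteWrite.url with the
# -remoteWrite.urlRelabelConfig given at the same position.
# pair_by_position is a hypothetical helper, not part of vmagent.

urls = [
    "http://127.0.0.1:8426/api/v1/write",  # vm1
    "http://127.0.0.1:8427/api/v1/write",  # vm2
]
relabel_configs = [
    "/.../vm1.yaml",  # path elided in the report, kept as-is
    "/.../vm2.yaml",
]


def pair_by_position(urls, configs):
    """Return (url, relabel config) pairs matched by argument order."""
    return list(zip(urls, configs))


for url, config in pair_by_position(urls, relabel_configs):
    print(f"{url} <- {config}")
```

Swapping the order of one list without the other would attach vm1.yaml to the vm2 URL, which is why the report stresses that the order matters.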

configs:
targets.yaml

global:
  scrape_interval: 60s
  external_labels:
    agent: shkoder
    instance: shkoder

scrape_configs:
  - job_name: extra
    static_configs:
      - targets: ["127.0.0.1:4084"]

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: test_metric
        replacement: true
        target_label: __keep

      - source_labels: [__name__]
        regex: (up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
        replacement: true
        target_label: __keep

      - source_labels: [__keep]
        regex: true
        action: keep

vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

Using these configs, I expect the "system" metrics

up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling

to go to the vm1 instance.
All other "business" metrics (named test_metric in the attached example) should be stored in vm2.
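
The expected routing can be checked offline with a small simulation of the two per-URL relabel configs. This is a sketch of standard Prometheus-style relabeling semantics (source label values joined with ";", fully anchored regex, replace and keep actions), not vmagent's actual code; it only handles literal replacement values, and the names apply_rules and route are hypothetical:

```python
import re

SYSTEM = ("up|scrape_duration_seconds|scrape_samples_scraped"
          "|scrape_samples_post_metric_relabeling")

# The rules from vm1.yaml and vm2.yaml, written as Python dicts.
VM1_RULES = [
    {"source_labels": ["job", "agent", "__name__"],
     "regex": f"extra;shkoder;({SYSTEM})",
     "replacement": "true", "target_label": "__keep"},
    {"source_labels": ["__keep"], "regex": "true", "action": "keep"},
]
VM2_RULES = [
    {"source_labels": ["job", "agent", "__name__"],
     "regex": "extra;shkoder;test_metric",
     "replacement": "true", "target_label": "__keep"},
    {"source_labels": ["__keep"], "regex": "true", "action": "keep"},
]


def apply_rules(labels, rules):
    """Return the relabeled labels, or None if the series is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Missing labels count as empty strings; values are joined with ';'.
        value = ";".join(labels.get(n, "") for n in rule["source_labels"])
        matched = re.fullmatch(rule["regex"], value) is not None
        action = rule.get("action", "replace")
        if action == "replace" and matched:
            # Literal replacement only; $1-style references are not handled.
            labels[rule["target_label"]] = rule["replacement"]
        elif action == "keep" and not matched:
            return None
    return labels


def route(labels):
    """List the destinations that would keep this series."""
    destinations = []
    for name, rules in (("vm1", VM1_RULES), ("vm2", VM2_RULES)):
        if apply_rules(labels, rules) is not None:
            destinations.append(name)
    return destinations


base = {"job": "extra", "agent": "shkoder"}
print(route({**base, "__name__": "up"}))
print(route({**base, "__name__": "test_metric", "idx": "7"}))
```

Under these semantics every test_metric series should reach vm2 regardless of its idx value, so the configs themselves do not explain a loss of four series.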

Here are 100 test metrics:

# HELP test_metric Some description
# TYPE test_metric gauge
test_metric{idx="0"} 1.0
test_metric{idx="1"} 1.0
test_metric{idx="2"} 1.0
test_metric{idx="3"} 1.0
test_metric{idx="4"} 1.0
test_metric{idx="5"} 1.0
test_metric{idx="6"} 1.0
test_metric{idx="7"} 1.0
test_metric{idx="8"} 1.0
test_metric{idx="9"} 1.0
test_metric{idx="10"} 1.0
test_metric{idx="11"} 1.0
test_metric{idx="12"} 1.0
test_metric{idx="13"} 1.0
test_metric{idx="14"} 1.0
test_metric{idx="15"} 1.0
test_metric{idx="16"} 1.0
test_metric{idx="17"} 1.0
test_metric{idx="18"} 1.0
test_metric{idx="19"} 1.0
test_metric{idx="20"} 1.0
test_metric{idx="21"} 1.0
test_metric{idx="22"} 1.0
test_metric{idx="23"} 1.0
test_metric{idx="24"} 1.0
test_metric{idx="25"} 1.0
test_metric{idx="26"} 1.0
test_metric{idx="27"} 1.0
test_metric{idx="28"} 1.0
test_metric{idx="29"} 1.0
test_metric{idx="30"} 1.0
test_metric{idx="31"} 1.0
test_metric{idx="32"} 1.0
test_metric{idx="33"} 1.0
test_metric{idx="34"} 1.0
test_metric{idx="35"} 1.0
test_metric{idx="36"} 1.0
test_metric{idx="37"} 1.0
test_metric{idx="38"} 1.0
test_metric{idx="39"} 1.0
test_metric{idx="40"} 1.0
test_metric{idx="41"} 1.0
test_metric{idx="42"} 1.0
test_metric{idx="43"} 1.0
test_metric{idx="44"} 1.0
test_metric{idx="45"} 1.0
test_metric{idx="46"} 1.0
test_metric{idx="47"} 1.0
test_metric{idx="48"} 1.0
test_metric{idx="49"} 1.0
test_metric{idx="50"} 1.0
test_metric{idx="51"} 1.0
test_metric{idx="52"} 1.0
test_metric{idx="53"} 1.0
test_metric{idx="54"} 1.0
test_metric{idx="55"} 1.0
test_metric{idx="56"} 1.0
test_metric{idx="57"} 1.0
test_metric{idx="58"} 1.0
test_metric{idx="59"} 1.0
test_metric{idx="60"} 1.0
test_metric{idx="61"} 1.0
test_metric{idx="62"} 1.0
test_metric{idx="63"} 1.0
test_metric{idx="64"} 1.0
test_metric{idx="65"} 1.0
test_metric{idx="66"} 1.0
test_metric{idx="67"} 1.0
test_metric{idx="68"} 1.0
test_metric{idx="69"} 1.0
test_metric{idx="70"} 1.0
test_metric{idx="71"} 1.0
test_metric{idx="72"} 1.0
test_metric{idx="73"} 1.0
test_metric{idx="74"} 1.0
test_metric{idx="75"} 1.0
test_metric{idx="76"} 1.0
test_metric{idx="77"} 1.0
test_metric{idx="78"} 1.0
test_metric{idx="79"} 1.0
test_metric{idx="80"} 1.0
test_metric{idx="81"} 1.0
test_metric{idx="82"} 1.0
test_metric{idx="83"} 1.0
test_metric{idx="84"} 1.0
test_metric{idx="85"} 1.0
test_metric{idx="86"} 1.0
test_metric{idx="87"} 1.0
test_metric{idx="88"} 1.0
test_metric{idx="89"} 1.0
test_metric{idx="90"} 1.0
test_metric{idx="91"} 1.0
test_metric{idx="92"} 1.0
test_metric{idx="93"} 1.0
test_metric{idx="94"} 1.0
test_metric{idx="95"} 1.0
test_metric{idx="96"} 1.0
test_metric{idx="97"} 1.0
test_metric{idx="98"} 1.0
test_metric{idx="99"} 1.0
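
To reproduce the scrape target, the payload above can be generated rather than typed by hand. The following Python sketch renders the same 100 lines and can optionally serve them on 127.0.0.1:4084, the address used in targets.yaml. The names render_test_metrics, MetricsHandler and serve are hypothetical, and the report does not say how the original target was implemented:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_test_metrics(count=100):
    """Render the exposition text shown above for idx 0..count-1."""
    lines = [
        "# HELP test_metric Some description",
        "# TYPE test_metric gauge",
    ]
    lines += [f'test_metric{{idx="{i}"}} 1.0' for i in range(count)]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer every GET with the generated metrics page."""

    def do_GET(self):
        body = render_test_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=4084):
    """Blocking call; host and port mirror the target in targets.yaml."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling serve() starts the target; render_test_metrics() alone is enough to inspect the payload.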

Test Cases:

1.
a) run vmagent
b) perform a request to the vm1 database

scrape_samples_scraped == 100
scrape_samples_post_metric_relabeling == 100

c) perform a request to the vm2 database

sum without(idx) (test_metric) == 96

four metrics are lost

2.
a) comment out everything in vm1.yaml
b) re-run vmagent using service vmagent stop && service vmagent start
c) perform a request to the vm2 database

sum without(idx) (test_metric) == 100

all metrics are there

3.
a) uncomment everything in vm1.yaml
b) re-run vmagent using service vmagent stop && service vmagent start
c) perform a request to the vm2 database

sum without(idx) (test_metric) == 96

four metrics are lost

4.
a) change the order of the args in vmagent

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \

b) re-run vmagent using service vmagent stop && service vmagent start
c) perform a request to the vm2 database

sum without(idx) (test_metric) == 100

all metrics are there

5.
a) edit configuration of vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;up
  replacement: true
  target_label: __keep

b) re-run vmagent using service vmagent stop && service vmagent start
c) perform a request to the vm2 database

sum without(idx) (test_metric) == 99
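
When a total such as 96 or 99 comes back, it helps to know which idx values are missing. The sketch below assumes the vm2 response to an instant query for test_metric follows the Prometheus-style JSON layout (data.result[].metric); the helper name missing_idx is hypothetical and the function only parses text, it performs no network request:

```python
import json


def missing_idx(response_text, expected=range(100)):
    """Return the sorted idx values absent from an instant-query response."""
    payload = json.loads(response_text)
    seen = {
        int(item["metric"]["idx"])
        for item in payload["data"]["result"]
        if "idx" in item["metric"]
    }
    return sorted(set(expected) - seen)


# Hand-made response for illustration; idx 3 and 42 are left out on purpose.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "test_metric", "idx": str(i)},
             "value": [0, "1"]}
            for i in range(100) if i not in (3, 42)
        ],
    },
})
print(missing_idx(sample))
```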

Note: info from the Telegram RU chat

bug

tenmozes

All 7 comments

Thanks for the detailed bug report!

It would be great to have annotated screenshots with time range covering all the steps performed in the bug report. I'd also suggest using count instead of sum function in the query, i.e. count(test_metric) without(idx) instead of sum(test_metric) without(idx). This should give more accurate results, which depend only on the number of time series and don't depend on metric values.
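
A short Python illustration of why count is the safer check: sum can report the expected total even when series are missing, as soon as one value differs from 1.0. The values below are invented for the example. The query_url helper is hypothetical and assumes vm2 answers Prometheus-style instant queries at /api/v1/query on the same port used for remote write:

```python
from urllib.parse import urlencode

# 96 surviving series, one of which happens to report 5.0 instead of 1.0.
values = [1.0] * 95 + [5.0]

print(sum(values))  # matches the expected total of 100 by coincidence
print(len(values))  # the number of series, which exposes the loss


def query_url(base, expr):
    """Build an instant-query URL for the given expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


print(query_url("http://127.0.0.1:8427", "count(test_metric) without(idx)"))
```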

valyala on 12 May 2020

@valyala

vmagent config:

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \

vm1 database (system) - 1st
vm2 database (business) - 2nd

Case 1.

a) vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

b) vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

e) request to the vm2 database:

Case 2.

a) vm1.yaml

- source_labels: [__keep]
  regex: true
  action: keep

b) vm2.yaml:

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

c) re-run vmagent using service vmagent stop && service vmagent start
d) no new records in the vm1 database as expected:

e) perform requests to vm2 database:

Case 3.

a) vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

b) vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

e) request to the vm2 database:

Case 4.

a) vm1.yaml and vm2.yaml are unchanged from case 3.
b) reorder the databases and their configs in the vmagent config:

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \

vm2 database (business) - 1st
vm1 database (system) - 2nd

c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

e) request to the vm2 database:

Case 5.

a) change vmagent config back:

-promscrape.config=/.../targets.yaml                \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \

b) vm1.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;up
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

leave only one "system" metric

c) vm2.yaml

- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep

- source_labels: [__keep]
  regex: true
  action: keep

d) re-run vmagent using service vmagent stop && service vmagent start
e) requests to the vm1 database:

f) request to the vm2 database:

Allineer on 12 May 2020

@Allineer, thanks for all these details - they helped identify the root cause of the bug, which is fixed in commit 96e001d254a267e51dc72bc2a75c72682c4e7411. This commit will be included in the next release.

Could you build vmagent from this commit according to these docs and verify whether the issue is fixed there?

valyala on 12 May 2020

@valyala, thanks! I'll test it soon.

Allineer on 12 May 2020

FYI, the bugfix has been included in v1.35.4.

valyala on 12 May 2020

Fixed! Thanks!

Allineer on 13 May 2020

Thanks for the confirmation! Closing the issue as fixed.

valyala on 13 May 2020
