Describe the bug
vmagent args (the order matters)
-promscrape.config=/.../targets.yaml \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
So there are two VM nodes.
configs:
targets.yaml
global:
  scrape_interval: 60s
  external_labels:
    agent: shkoder
    instance: shkoder
scrape_configs:
  - job_name: extra
    static_configs:
      - targets: ["127.0.0.1:4084"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: test_metric
        replacement: true
        target_label: __keep
      - source_labels: [__name__]
        regex: (up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
        replacement: true
        target_label: __keep
      - source_labels: [__keep]
        regex: true
        action: keep
vm1.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
vm2.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
With these configs I expect that the "system" metrics
up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling
go to the vm1 instance.
All other "business" metrics (named test_metric in the attached example) should be stored in vm2.
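The intended routing can be sketched with a small simulation. This is only an illustration of the expectation, not vmagent's actual implementation: it models a simplified subset of relabeling (anchored regexes, `;` as the separator, literal replacement values), and it assumes every series already carries `job="extra"` and `agent="shkoder"` when the per-URL rules run, as the regexes above imply. The names `apply_relabel`, `keep_rules` and `route` are hypothetical helpers.

```python
import re

SYSTEM = "up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling"


def keep_rules(name_regex):
    """Mirror the two-step rule set from the report: mark, then keep only marked series."""
    return [
        {"source_labels": ["job", "agent", "__name__"],
         "regex": f"extra;shkoder;{name_regex}",
         "replacement": "true", "target_label": "__keep"},
        {"source_labels": ["__keep"], "regex": "true", "action": "keep"},
    ]


VM1_RULES = keep_rules(f"({SYSTEM})")   # vm1.yaml
VM2_RULES = keep_rules("test_metric")   # vm2.yaml


def apply_relabel(labels, rules):
    """Apply the simplified rules to one series; return None if it is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ';'; missing labels count as empty.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # The regex must match the whole joined value.
        matched = re.fullmatch(rule["regex"], value) is not None
        action = rule.get("action", "replace")
        if action == "replace":
            if matched:
                labels[rule["target_label"]] = rule["replacement"]
        elif action == "keep":
            if not matched:
                return None
        else:
            raise ValueError(f"unsupported action in this sketch: {action}")
    return labels


def route(labels):
    """Return the names of the destinations that would receive this series."""
    targets = (("vm1", VM1_RULES), ("vm2", VM2_RULES))
    return [name for name, rules in targets if apply_relabel(labels, rules) is not None]


base = {"job": "extra", "agent": "shkoder", "instance": "shkoder"}
print(route({**base, "__name__": "up"}))
print(route({**base, "__name__": "test_metric", "idx": "0"}))
```

Under these assumptions every one of the 100 `test_metric` series should reach vm2 regardless of the order of the `-remoteWrite.url` flags, which is why a result of 96 points to a bug.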
Here are the 100 test metrics:
# HELP test_metric Some description
# TYPE test_metric gauge
test_metric{idx="0"} 1.0
test_metric{idx="1"} 1.0
test_metric{idx="2"} 1.0
test_metric{idx="3"} 1.0
test_metric{idx="4"} 1.0
test_metric{idx="5"} 1.0
test_metric{idx="6"} 1.0
test_metric{idx="7"} 1.0
test_metric{idx="8"} 1.0
test_metric{idx="9"} 1.0
test_metric{idx="10"} 1.0
test_metric{idx="11"} 1.0
test_metric{idx="12"} 1.0
test_metric{idx="13"} 1.0
test_metric{idx="14"} 1.0
test_metric{idx="15"} 1.0
test_metric{idx="16"} 1.0
test_metric{idx="17"} 1.0
test_metric{idx="18"} 1.0
test_metric{idx="19"} 1.0
test_metric{idx="20"} 1.0
test_metric{idx="21"} 1.0
test_metric{idx="22"} 1.0
test_metric{idx="23"} 1.0
test_metric{idx="24"} 1.0
test_metric{idx="25"} 1.0
test_metric{idx="26"} 1.0
test_metric{idx="27"} 1.0
test_metric{idx="28"} 1.0
test_metric{idx="29"} 1.0
test_metric{idx="30"} 1.0
test_metric{idx="31"} 1.0
test_metric{idx="32"} 1.0
test_metric{idx="33"} 1.0
test_metric{idx="34"} 1.0
test_metric{idx="35"} 1.0
test_metric{idx="36"} 1.0
test_metric{idx="37"} 1.0
test_metric{idx="38"} 1.0
test_metric{idx="39"} 1.0
test_metric{idx="40"} 1.0
test_metric{idx="41"} 1.0
test_metric{idx="42"} 1.0
test_metric{idx="43"} 1.0
test_metric{idx="44"} 1.0
test_metric{idx="45"} 1.0
test_metric{idx="46"} 1.0
test_metric{idx="47"} 1.0
test_metric{idx="48"} 1.0
test_metric{idx="49"} 1.0
test_metric{idx="50"} 1.0
test_metric{idx="51"} 1.0
test_metric{idx="52"} 1.0
test_metric{idx="53"} 1.0
test_metric{idx="54"} 1.0
test_metric{idx="55"} 1.0
test_metric{idx="56"} 1.0
test_metric{idx="57"} 1.0
test_metric{idx="58"} 1.0
test_metric{idx="59"} 1.0
test_metric{idx="60"} 1.0
test_metric{idx="61"} 1.0
test_metric{idx="62"} 1.0
test_metric{idx="63"} 1.0
test_metric{idx="64"} 1.0
test_metric{idx="65"} 1.0
test_metric{idx="66"} 1.0
test_metric{idx="67"} 1.0
test_metric{idx="68"} 1.0
test_metric{idx="69"} 1.0
test_metric{idx="70"} 1.0
test_metric{idx="71"} 1.0
test_metric{idx="72"} 1.0
test_metric{idx="73"} 1.0
test_metric{idx="74"} 1.0
test_metric{idx="75"} 1.0
test_metric{idx="76"} 1.0
test_metric{idx="77"} 1.0
test_metric{idx="78"} 1.0
test_metric{idx="79"} 1.0
test_metric{idx="80"} 1.0
test_metric{idx="81"} 1.0
test_metric{idx="82"} 1.0
test_metric{idx="83"} 1.0
test_metric{idx="84"} 1.0
test_metric{idx="85"} 1.0
test_metric{idx="86"} 1.0
test_metric{idx="87"} 1.0
test_metric{idx="88"} 1.0
test_metric{idx="89"} 1.0
test_metric{idx="90"} 1.0
test_metric{idx="91"} 1.0
test_metric{idx="92"} 1.0
test_metric{idx="93"} 1.0
test_metric{idx="94"} 1.0
test_metric{idx="95"} 1.0
test_metric{idx="96"} 1.0
test_metric{idx="97"} 1.0
test_metric{idx="98"} 1.0
test_metric{idx="99"} 1.0
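For anyone reproducing this, the exposition text above can be generated rather than pasted. The following is a minimal sketch; `render_test_metrics` is a hypothetical helper, and how the text is served on 127.0.0.1:4084 is left out.

```python
def render_test_metrics(n=100):
    """Render n test_metric gauge samples in the Prometheus text exposition format."""
    lines = [
        "# HELP test_metric Some description",
        "# TYPE test_metric gauge",
    ]
    # One series per idx value, each with the constant value 1.0.
    lines += [f'test_metric{{idx="{i}"}} 1.0' for i in range(n)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_test_metrics(), end="")
```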
Test Cases:
scrape_samples_scraped == 100
scrape_samples_post_metric_relabeling == 100
c) perform request to vm2 database
sum without(idx) (test_metric) == 96
four metrics are lost
service vmagent stop && service vmagent start
sum without(idx) (test_metric) == 100
all metrics are there
service vmagent stop && service vmagent start
sum without(idx) (test_metric) == 96
four metrics are lost
-promscrape.config=/.../targets.yaml \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
b) re-run vmagent using service vmagent stop && service vmagent start
c) perform request to vm2 database
sum without(idx) (test_metric) == 100
all metrics are there
5.
a) edit configuration of vm1.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;up
  replacement: true
  target_label: __keep
b) re-run vmagent using service vmagent stop && service vmagent start
c) perform request to vm2 database
sum without(idx) (test_metric) == 99
Note: info from the Telegram RU chat
Thanks for the detailed bug report!
It would be great to have annotated screenshots with a time range covering all the steps performed in the bug report. I'd also suggest using the count function instead of sum in the query, i.e. count(test_metric) without(idx) instead of sum(test_metric) without(idx). This should give more accurate results, since they depend only on the number of time series and not on the metric values.
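A small sketch of why count is the safer check. In this report every sample is 1.0, so sum and count coincide, but that stops holding as soon as any value differs. The helper names are hypothetical, and the URL builder assumes the vm2 node answers Prometheus-compatible instant queries at `/api/v1/query` on the same address used for remote write above.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    """Build an instant-query URL for the given expression (assumed endpoint path)."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": expr})


def sum_and_count(values):
    """Return (sum, count) for a list of sample values, one value per time series."""
    return sum(values), len(values)


# 96 series with value 1.0: both aggregates report 96.
print(sum_and_count([1.0] * 96))
# Still 96 series, but one has value 5.0: sum says 100 while four series are missing.
print(sum_and_count([1.0] * 95 + [5.0]))
print(instant_query_url("http://127.0.0.1:8427", "count(test_metric) without(idx)"))
```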
@valyala
vmagent config:
-promscrape.config=/.../targets.yaml \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
vm1 database (system) - 1st
vm2 database (business) - 2nd
a) vm1.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
b) vm2.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

e) request to the vm2 database:

a) vm1.yaml
- source_labels: [__keep]
  regex: true
  action: keep
b) vm2.yaml:
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
c) re-run vmagent using service vmagent stop && service vmagent start
d) no new records in the vm1 database as expected:

e) perform requests to vm2 database:


a) vm1.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;(up|scrape_duration_seconds|scrape_samples_scraped|scrape_samples_post_metric_relabeling)
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
b) vm2.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

e) request to the vm2 database:

a) vm1.yaml and vm2.yaml are unchanged from the case 3 example.
b) reorder the databases and their configs in the vmagent config:
-promscrape.config=/.../targets.yaml \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
vm2 database (business) - 1st
vm1 database (system) - 2nd
c) re-run vmagent using service vmagent stop && service vmagent start
d) requests to the vm1 database:

e) request to the vm2 database:

a) change vmagent config back:
-promscrape.config=/.../targets.yaml \
-remoteWrite.url=http://127.0.0.1:8426/api/v1/write \
-remoteWrite.url=http://127.0.0.1:8427/api/v1/write \
-remoteWrite.urlRelabelConfig=/.../vm1.yaml \
-remoteWrite.urlRelabelConfig=/.../vm2.yaml \
b) vm1.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;up
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
leave only one "system" metric
c) vm2.yaml
- source_labels: [job, agent, __name__]
  regex: extra;shkoder;test_metric
  replacement: true
  target_label: __keep
- source_labels: [__keep]
  regex: true
  action: keep
d) re-run vmagent using service vmagent stop && service vmagent start
e) requests to the vm1 database:

f) request to the vm2 database:

@Allineer, thanks for all these details. They helped identify the root cause of the bug, which is fixed in commit 96e001d254a267e51dc72bc2a75c72682c4e7411. This commit will be included in the next release.
Could you build vmagent from this commit according to these docs and verify whether the issue is fixed there?
@valyala, thanks! I'll test it soon.
FYI, the bugfix has been included in v1.35.4.
Fixed! Thanks!
Thanks for the confirmation! Closing the issue as fixed.