Victoriametrics: Drop-in replacement prometheus not work with promxy (Prometheus HA)

Created on 5 Jun 2019 · 19Comments · Source: VictoriaMetrics/VictoriaMetrics

Hi! I try to deploy prometheus cluster with 3 standalone prometheus + victoriametrics backed and promxy as frontend (https://github.com/jacksontj/promxy), it's great work with prometheus, but if I define backend victoriametrics (:8428) in promxy i get result for query "count(count(node_cpu_seconds_total) by (cpu))" and empty result for "node_cpu_guest_seconds_total" with answer "{"status":"success","data":{"resultType":"matrix","result":[]}}".
At the same time direct requests to the VictoriaMetrics give data. Manual requests and those that promxy generates are different judging by tcpdump.
And I strongly lack extended debag logs.

bug

Source

rumanzo

All 19 comments

@rumanzo , thanks for the bug report.

In the mean time it would be great if you could provide tcpdump logs for requests generated by Promxy to VictoriaMetrics for the node_cpu_guest_seconds_total query and the corresponding response from VictoriaMetrics.

valyala on 5 Jun 2019

I tried to collect all possible useful info and attach archive
victoriametrics_debug.zip

promxy.yml.txt

rumanzo on 6 Jun 2019

Thanks for the additional info. BTW, which VictoriaMetrics version do you use? The similar issue with Promxy has been fixed in v1.13.1. The version may be determined by running ./victoria-metrics-prod --version. cc'ing @ThomasADavis , the original reporter of the issue #20.

valyala on 6 Jun 2019

VictoriaMetrics version victoria-metrics-20190531-150038-tags-v1.18.9-0-gf2cf5d8e
Promxy version v0.0.38
And I saw issue #20 and it is about another case

rumanzo on 6 Jun 2019

From the pcap:

POST /api/v1/query HTTP/1.1
Host: akostin-prometheus-1.openstacklocal:8428
User-Agent: Go-http-client/1.1
Content-Length: 169
Content-Type: application/x-www-form-urlencoded

query=%7Binstance%3D%22akostin-prometheus-3.openstacklocal%3A9100%22%2C__name__%3D%22node_cpu_guest_seconds_total%22%7D%5B1234s%5D&time=2019-06-06T09%3A01%3A14%2B03%3A00HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Type: application/json
Date: Thu, 06 Jun 2019 09:24:05 GMT
Content-Length: 63

{"status":"success","data":{"resultType":"matrix","result":[]}}

which means that this host responded with no data to the following query:

{instance="akostin-prometheus-3.openstacklocal:9100",__name__="node_cpu_guest_seconds_total"}[1234s]&time=2019-06-06T09:01:14+03:00

Although the timestamp is a bit odd (usually its epoch looking) the query looks fine. So for whatever reason the datastore (I'm assuming VictoriaMetrics from this) is not returning data to that query. @rumanzo can you run that same query directly against VictoriaMetrics and see if you get the same result? (I assume you will, but once there is a reproducible query promxy is out of the debug loop-- which should speed things up a bit ;) ).

jacksontj on 6 Jun 2019

I made two requests - to prometheus and to VictoriaMetrics, and VictoriaMetrics returned values, while prometheus is not oO

$ curl -vvv -s 'http://akostin-prometheus-3.openstacklocal:9090/api/v1/query_range?' \
--data-urlencode 'time=2019-06-06T09:01:14+03:00' \                                                                                                      
--data-urlencode 'query=node_cpu_guest_seconds_total{instance="akostin-prometheus-3.openstacklocal:9100",__name__="node_cpu_guest_seconds_total"}[1234s]'
*   Trying 172.27.42.170:9090...
* TCP_NODELAY set
* Connected to akostin-prometheus-3.openstacklocal (172.27.42.170) port 9090 (#0)
> POST /api/v1/query_range? HTTP/1.1
> Host: akostin-prometheus-3.openstacklocal:9090
> User-Agent: curl/7.65.0
> Accept: */*
> Content-Length: 197
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 197 out of 197 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Content-Type: application/json
< Date: Thu, 06 Jun 2019 16:28:56 GMT
< Content-Length: 117
<
* Connection #0 to host akostin-prometheus-3.openstacklocal left intact
{"status":"error","errorType":"bad_data","error":"invalid parameter 'start': cannot parse \"\" to a valid timestamp"}%

$ curl -vvv -s 'http://akostin-prometheus-3.openstacklocal:8428/api/v1/query_range?' \
--data-urlencode 'time=2019-06-06T09:01:14+03:00' \
--data-urlencode 'query=node_cpu_guest_seconds_total{instance="akostin-prometheus-3.openstacklocal:9100",__name__="node_cpu_guest_seconds_total"}[1234s]'
*   Trying 172.27.42.170:8428...
* TCP_NODELAY set
* Connected to akostin-prometheus-3.openstacklocal (172.27.42.170) port 8428 (#0)
> POST /api/v1/query_range? HTTP/1.1
> Host: akostin-prometheus-3.openstacklocal:8428
> User-Agent: curl/7.65.0
> Accept: */*
> Content-Length: 197
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 197 out of 197 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Access-Control-Allow-Origin: *
< Content-Type: application/json
< Date: Thu, 06 Jun 2019 16:29:10 GMT
< Content-Length: 914
<
* Connection #0 to host akostin-prometheus-3.openstacklocal left intact
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"node_cpu_guest_seconds_total","cpu":"0","instance":"akostin-prometheus-3.openstacklocal:9100","job":"akostin-prometheus-servers","mode":"nice"},"values":[[1559838250,"0"],[1559838550,"0"]]},{"metric":{"__name__":"node_cpu_guest_seconds_total","cpu":"0","instance":"akostin-prometheus-3.openstacklocal:9100","job":"akostin-prometheus-servers","mode":"user"},"values":[[1559838250,"0"],[1559838550,"0"]]},{"metric":{"__name__":"node_cpu_guest_seconds_total","cpu":"1","instance":"akostin-prometheus-3.openstacklocal:9100","job":"akostin-prometheus-servers","mode":"nice"},"values":[[1559838250,"0"],[1559838550,"0"]]},{"metric":{"__name__":"node_cpu_guest_seconds_total","cpu":"1","instance":"akostin-prometheus-3.openstacklocal:9100","job":"akostin-prometheus-servers","mode":"user"},"values":[[1559838250,"0"],[1559838550,"0"]]}]}}

rumanzo on 6 Jun 2019

I have not seen anything wrong lately - but we use promxy v0.0.38, with VM 1.18.8, and Prometheus 2.10.0, with Grafana v6.2.0

ThomasADavis on 6 Jun 2019

From the prometheus API docs:

Input timestamps may be provided either in RFC3339 format or as a Unix timestamp in seconds, with optional decimal places for sub-second precision. Output timestamps are always represented as Unix timestamps in seconds.

So it seems that VictoriaMetrics support more timestamp types than prom does :) Whats interesting is that when I send the timestamp as you have on my prom stack with promxy I do get data back. So interesting, but not the issue we are seeing here.

In your query here you are showing data back when you hit http://akostin-prometheus-3.openstacklocal but the TCPdump shows no data from http://akostin-prometheus-1.openstacklocal. Do your 3 hosts all have different data? Right now you have all 3 hosts configured in the same servergroup -- which means promxy assumes they all have the same data (more detail at https://github.com/jacksontj/promxy#what-is-a-servergroup). If these 3 nodes are actually separate you can either have separate servergroups for each or use the relabel_configs to generate different labels per host.

jacksontj on 6 Jun 2019

I get VM works, but Prometheus/Promxy does not, they both complain about the date/timestamp

ThomasADavis on 6 Jun 2019

I have identical configs (prometheus, promxy, victoriametrics) on 3 identical servers. Promxy.yml config attached earlier.
The important thing is that when you change the port in promxy configuration to 9090 (prometheus) , everything starts working.

rumanzo on 6 Jun 2019

Ah, then definitely sounds like an issue with VictoriaMetrics or the routing of metrics from prom to VictoriaMetrics. Promxy in this configuration will hit all 3 VM servers and take the first one that responds (assuming no "holes" in the data), so if (for example) the prometheus servers were only configured to write to 1 VM server promxy could return no data if one of the other 2 responded first. If the writes are being properly routed then it does sound like some issue VM side.

jacksontj on 6 Jun 2019

So it seems that VictoriaMetrics support _more_ timestamp types than prom does :)

VictoriaMetrics just silently replaces invalid time with the current time. This needs to be fixed, since it hides the real error - invalid time.

valyala on 6 Jun 2019

@rumanzo , note that Promxy converts the initial request to /api/v1/query_range to range request to /api/v1/query. Could you issue the following request to all the VM nodes and make sure they return identical non-empty results:

http://akostin-prometheus-1.openstacklocal:8428/api/v1/query?query=%7Binstance%3D%22akostin-prometheus-3.openstacklocal%3A9100%22%2C__name__%3D%22node_cpu_guest_seconds_total%22%7D%5B1234s%5D&time=2019-06-06T09:01:14Z

Adjust the time arg in the url if necessary.

Make the same with Prometheus nodes and compare results.

valyala on 6 Jun 2019

That's a bit scary, I fixed a similar bug on trickster. Imo if the time is
wrong it should just error :)

On Thu, Jun 6, 2019, 12:10 PM Aliaksandr Valialkin notifications@github.com
wrote:

So it seems that VictoriaMetrics support more timestamp types than prom
does :)

VictoriaMetrics just silently replaces invalid time with the current time
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/5cbd4cfca949a3c85354e2408e7ce1a7f9a683bc/app/vmselect/prometheus/prometheus.go#L479.
This needs to be fixed, since it hides the real error - invalid time.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/52?email_source=notifications&email_token=AAMLPHOSH2TOM7F3KSD4W33PZFOJZA5CNFSM4HTU3LM2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXD3S4Y#issuecomment-499628403,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAMLPHPLRNCFFITCQZGGY33PZFOJZANCNFSM4HTU3LMQ
.

jacksontj on 6 Jun 2019

👍1

That's a bit scary, I fixed a similar bug on trickster. Imo if the time is
wrong it should just error :)

Agreed and fixed it in the https://github.com/VictoriaMetrics/VictoriaMetrics/commit/667115a5c789cc2e62cbde4382490acb8dfa9fee :) Proper error handling for invalid time parsing will be available in the next release of VictoriaMetrics.

valyala on 6 Jun 2019

🎉1

First I build VictoriaMeters from master and update all nodes to newer version (-20190606-203122-heads-master-0-gf4e63cd)

http://akostin-prometheus-1.openstacklocal:8428/api/v1/query?query=%7Binstance%3D%22akostin-prometheus-3.openstacklocal%3A9100%22%2C__name__%3D%22node_cpu_guest_seconds_total%22%7D%5B1234s%5D&time=2019-06-06T09:01:14Z

I checked on my nodes with this query - now all replies empty {"status":"success","data":{"resultType":"matrix","result":[]}}

And I try another query with another metrics. I get query from grafana query inspector, curl all endpoints, prometheus and victoriametrics gave me normal result, promxy gave me nothing. I found query to victoriametrics from promxy with tcpdump, and try curl with that modified query mannualy:

curl -v 'http://akostin-prometheus-3.openstacklocal:8428/api/v1/query?query=%7Binstance%3D~%22akostin-prometheus-1.openstacklocal%3A9100%22%2Cjob%3D~%22akostin-prometheus-servers%22%2C__name__%3D%22node_memory_Cached_bytes%22%7D%5B621s%5D&time=2019-06-06T23%3A56%3A20%2B03%3A00'

And result with prometheus (:9090) is normal, victoriametrics gave me empty result.

And notice that not all queries broken (standard node_exporter)

rumanzo on 6 Jun 2019

It looks like I found and fixed the root cause of the bug in v1.18.11. VictoriaMetrics improperly handled {__name__ op "string"} label filters. The bug was introduced in v1.18.9 while working on this feature request.
@rumanzo , could you confirm the issue is fixed in v1.18.11?

valyala on 7 Jun 2019

It looks like I found and fixed the root cause of the bug in v1.18.11. VictoriaMetrics improperly handled {__name__ op "string"} label filters. The bug was introduced in v1.18.9 while working on this feature request.
@rumanzo , could you confirm the issue is fixed in v1.18.11?

Hi. Yes, it works right! Thank you!

rumanzo on 7 Jun 2019

Great! Closing the issue then.

valyala on 7 Jun 2019

Was this page helpful?

0 / 5 - 0 ratings