VictoriaMetrics: query results may incorrectly overlap time series

Created on 7 Sep 2020 · 15 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug

If the query step is much greater than the data interval (e.g. > 4x), and two series are adjacent in time but not overlapping, the query output or aggregation may incorrectly show the series as overlapping.

A typical example is build_info{version="..."}. A new version is deployed to instances, which stop updating build_info{version="1.0"} and start updating build_info{version="1.1"}. Due to this bug, at low zoom resolution there will be a point in time where count(build_info{instance="..."}) returns 2, even though there is no overlap in the raw data points.

The equivalent query on Prometheus (Thanos) does not exhibit the problem.

To Reproduce

Raw datapoints:

build_info{instance="foo"}[20m]

time                 version
2020-07-22 00:45:10  20.05.2
2020-07-22 00:46:10  20.05.2
2020-07-22 00:47:10  20.05.2
2020-07-22 00:48:10  20.05.2
2020-07-22 00:51:56  20.05.3
2020-07-22 00:52:56  20.05.3
2020-07-22 00:58:56  20.05.3
2020-07-22 01:02:11  20.05.3

query, step 60 (no overlap)

build_info{instance="foo"}

2020-07-22 00:47:00  20.05.2
2020-07-22 00:48:00  20.05.2
2020-07-22 00:49:00  20.05.2
2020-07-22 00:52:00  20.05.3
2020-07-22 00:53:00  20.05.3
2020-07-22 00:54:00  20.05.3

query, step 240

build_info{instance="foo"}

2020-07-22 00:48:00  20.05.2
2020-07-22 00:52:00  20.05.2    ** overlap **
2020-07-22 00:52:00  20.05.3    ** overlap **
2020-07-22 00:56:00  20.05.3
2020-07-22 01:00:00  20.05.3
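
The overlap above can be reproduced with a deliberately simplified model of a range query. This is only a sketch under assumptions, not VictoriaMetrics' actual algorithm: it assumes a series is reported at an output timestamp t whenever it has at least one raw sample in the window (t - lookback, t], and that the lookback window equals the step. The names naive_range_query and lookback are hypothetical.

```python
from datetime import datetime, timedelta

# Raw samples from the report, keyed by the value of the `version` label.
RAW = {
    "20.05.2": ["00:45:10", "00:46:10", "00:47:10", "00:48:10"],
    "20.05.3": ["00:51:56", "00:52:56", "00:58:56", "01:02:11"],
}


def ts(hms):
    """Parse HH:MM:SS on the day used in the report."""
    return datetime.strptime("2020-07-22 " + hms, "%Y-%m-%d %H:%M:%S")


def naive_range_query(raw, start, end, step, lookback):
    """Return {timestamp: [versions present]}.

    Assumption: a series is present at t if any raw sample
    falls inside the half-open window (t - lookback, t].
    """
    out = {}
    t = start
    while t <= end:
        out[t] = sorted(
            version
            for version, samples in raw.items()
            if any(t - lookback < ts(s) <= t for s in samples)
        )
        t += step
    return out


step = timedelta(seconds=240)
result = naive_range_query(RAW, ts("00:48:00"), ts("01:00:00"), step, lookback=step)
for t, versions in result.items():
    print(t.strftime("%H:%M:%S"), versions, "count =", len(versions))
```

Under these assumptions the 00:52:00 step lists both versions (count = 2), because the last 20.05.2 sample (00:48:10) and the first 20.05.3 sample (00:51:56) both fall inside the same 240-second window, while every other step lists a single version.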

Expected behavior

If the raw data points of two series do not overlap in time, the query should not treat the series as overlapping when evaluating a single interval of the output.

An example implementation would be to treat "start" and "end" points of a series differently when quantizing raw data points into time buckets: include the series in the bucket if 1) raw points are continuous in the bucket range, or 2) the series starts in the bucket range. Therefore, if a series ends in a bucket range, it is not included. It's similar to the concept of an open-ended range.
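
One possible reading of the rule above, sketched in Python. This is an illustration under assumptions, not the fix that was eventually committed: buckets are taken as (t - step, t], "continuous" is simplified to "the series has raw samples after the bucket", and the names series_in_bucket and bucketed_query are hypothetical.

```python
from datetime import datetime, timedelta

# Raw samples from the report, keyed by the value of the `version` label.
RAW = {
    "20.05.2": ["00:45:10", "00:46:10", "00:47:10", "00:48:10"],
    "20.05.3": ["00:51:56", "00:52:56", "00:58:56", "01:02:11"],
}


def ts(hms):
    """Parse HH:MM:SS on the day used in the report."""
    return datetime.strptime("2020-07-22 " + hms, "%Y-%m-%d %H:%M:%S")


def series_in_bucket(samples, bucket_start, bucket_end):
    """Decide whether a series belongs to the bucket (bucket_start, bucket_end]."""
    if not any(bucket_start < s <= bucket_end for s in samples):
        return False  # no raw data in this bucket at all
    starts_here = min(samples) > bucket_start  # first raw sample lies in the bucket
    continues = max(samples) > bucket_end      # raw samples exist after the bucket
    # A series that only *ends* in the bucket satisfies neither condition.
    return starts_here or continues


def bucketed_query(raw, start, end, step):
    """Return {bucket end timestamp: [versions included]}."""
    out = {}
    t = start
    while t <= end:
        out[t] = sorted(
            version
            for version, samples in raw.items()
            if series_in_bucket([ts(s) for s in samples], t - step, t)
        )
        t += step
    return out


result = bucketed_query(RAW, ts("00:48:00"), ts("01:00:00"), timedelta(seconds=240))
for t, versions in result.items():
    print(t.strftime("%H:%M:%S"), versions)
```

With the raw data from this report, the 00:52:00 bucket contains only 20.05.3, since 20.05.2 merely ends there. Note that this naive rule cannot tell a series that has ended from one whose next sample has not arrived yet, so the newest bucket of a query would need separate handling.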

Screenshots

Example graph showing artificial spikes in count(build_info) when there is a deployment causing the version label to change:
[Screenshot: Screen Shot 2020-09-05 at 12 45 38 AM]

Version

victoria-metrics-20200815-125320-tags-v1.40.0-0-ged00eb3f3

bug

Source

belm0

All 15 comments

@belm0 , thanks for the detailed bug report and the proposed solution! The solution looks good. We'll try implementing it and see how it works.

valyala on 7 Sep 2020

❤2

Thank you. For discrepancies like this, it would be nice for VM to have unit tests against the output of the Prometheus query library.

belm0 on 8 Sep 2020

👍2

+1, I've run into the same issue

propertone on 19 Sep 2020

The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

propertone on 19 Sep 2020

> The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

Could you file a separate issue regarding this bug?

valyala on 21 Sep 2020

@valyala would you consider prioritizing this? It's the main regression we have remaining vs. Prometheus.

belm0 on 13 Oct 2020

@belm0 , could you verify whether the issue is fixed in the following commits:

single-node VictoriaMetrics - 9aa3b6576648ca23dba1c11ef213e91123203399 (see build instructions)
cluster version of VictoriaMetrics - 217c192c88ae3569a8a634655cb927829d0fb1d0 (see build instructions)

Note that the response cache must be reset before testing the bugfix in order to remove previously cached incorrect results. See https://victoriametrics.github.io/#backfilling for more info on how to reset response cache.

valyala on 13 Oct 2020
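
For the cache reset mentioned in the comment above, a minimal sketch, assuming a single-node instance reachable at http://localhost:8428 and the /internal/resetRollupResultCache HTTP endpoint that the VictoriaMetrics documentation describes for resetting the response cache. The base URL and the helper names are assumptions; adjust them for your deployment, and check the linked documentation for the authoritative procedure.

```python
import urllib.request


def reset_cache_url(base_url):
    """Build the cache-reset URL for a given base URL (hypothetical helper)."""
    return base_url.rstrip("/") + "/internal/resetRollupResultCache"


def reset_response_cache(base_url="http://localhost:8428", timeout=10):
    """Request a response-cache reset and return the HTTP status code.

    Assumption: the instance is reachable at base_url without extra auth.
    """
    with urllib.request.urlopen(reset_cache_url(base_url), timeout=timeout) as resp:
        return resp.status


# Example (requires a running instance):
# status = reset_response_cache("http://localhost:8428")
```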

The bugfix has been included in VictoriaMetrics v1.44.0. @belm0 , @propertone , could you verify whether it works as expected on your workloads?

valyala on 13 Oct 2020

It seems to be resolved in v1.44, thank you!

I added a review comment to the commit.

belm0 on 14 Oct 2020

Now I see the opposite problem, where series unexpectedly disappear before they end (for example at head of the series).

I think it's related to my comment on the commit about correctness of the 90% heuristic.

[Screenshot: Screen Shot 2020-10-14 at 2 21 48 PM]

belm0 on 14 Oct 2020

👀1

@belm0 , could you share the original query used for building the graph above? It would be great if you could simplify and narrow down the query with more specific label filters, so it is based on a single source time series and still would expose the issue.

valyala on 17 Oct 2020

Adjusted heuristic a bit in the following commits:

single-node VictoriaMetrics - 28353e48ca28e719ef7b9eb1757e85210b9d3037
cluster VictoriaMetrics - ee2902ddafaec1e7802369e010ea7c73099c84d7

Now VictoriaMetrics should drop trailing points for time series containing only a single raw sample on the selected time range.

valyala on 17 Oct 2020

FYI, this introduces a regression in queries to /api/v1/query for timestamps close to the current time - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/845 .

valyala on 1 Nov 2020

FYI, VictoriaMetrics v1.45.0 contains additional enhancements for this issue.

valyala on 2 Nov 2020

The problem near the end of the query range seems to be resolved in 1.45.0.

belm0 on 3 Nov 2020
