VictoriaMetrics: query results may incorrectly overlap time series

Created on 7 Sep 2020 · 15 Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug

If the query step is much greater than the interval between raw data points (e.g. more than 4x), and two series are adjacent in time but do not overlap, the query output or an aggregation over it may incorrectly show the series as overlapping.

A typical example is build_info{version="..."}. A new version is deployed to instances, which stop updating build_info{version="1.0"} and start updating build_info{version="1.1"}. Due to this bug, at low zoom resolution there will be a point in time where count(build_info{instance="..."}) returns 2, even though there is no overlap in the raw data points.

The equivalent query on Prometheus (Thanos) does not exhibit the problem.

To Reproduce

Raw datapoints:

build_info{instance="foo"}[20m]

time                 version
2020-07-22 00:45:10  20.05.2
2020-07-22 00:46:10  20.05.2
2020-07-22 00:47:10  20.05.2
2020-07-22 00:48:10  20.05.2
2020-07-22 00:51:56  20.05.3
2020-07-22 00:52:56  20.05.3
2020-07-22 00:58:56  20.05.3
2020-07-22 01:02:11  20.05.3

query, step 60 (no overlap)

build_info{instance="foo"}

2020-07-22 00:47:00  20.05.2
2020-07-22 00:48:00  20.05.2
2020-07-22 00:49:00  20.05.2
2020-07-22 00:52:00  20.05.3
2020-07-22 00:53:00  20.05.3
2020-07-22 00:54:00  20.05.3

query, step 240

build_info{instance="foo"}

2020-07-22 00:48:00  20.05.2
2020-07-22 00:52:00  20.05.2    ** overlap **
2020-07-22 00:52:00  20.05.3    ** overlap **
2020-07-22 00:56:00  20.05.3
2020-07-22 01:00:00  20.05.3
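
One way to see how the overlap at 00:52:00 can arise is a simplified model in which each output point reports every series that has at least one raw sample in the preceding step-sized window. This is an assumption made for illustration, not VictoriaMetrics' actual rollup logic, and the helper names `ts` and `versions_at` are hypothetical. The raw timestamps are the ones listed above.

```python
def ts(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


# Raw datapoints from the report, keyed by the version label.
RAW = {
    "20.05.2": [ts(t) for t in ("00:45:10", "00:46:10", "00:47:10", "00:48:10")],
    "20.05.3": [ts(t) for t in ("00:51:56", "00:52:56", "00:58:56", "01:02:11")],
}


def versions_at(eval_time, lookback):
    """Versions with a raw sample in (eval_time - lookback, eval_time].

    Assumed model: the lookback window equals the query step.
    """
    return sorted(
        version
        for version, samples in RAW.items()
        if any(eval_time - lookback < t <= eval_time for t in samples)
    )


print(versions_at(ts("00:52:00"), 240))  # both versions fall in the window
print(versions_at(ts("00:52:00"), 60))   # only 20.05.3
```

Under this model the 240-second window ending at 00:52:00 contains the last sample of 20.05.2 (00:48:10) and the first sample of 20.05.3 (00:51:56), so both series are reported at the same output timestamp.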

Expected behavior

If two series are not overlapping in time by raw data, the query should not treat them as overlapping when evaluating one interval in the output.

An example implementation would be to treat "start" and "end" points of a series differently when quantizing raw data points into time buckets: include the series in the bucket if 1) raw points are continuous in the bucket range, or 2) the series starts in the bucket range. Therefore, if a series ends in a bucket range, it is not included. It's similar to the concept of an open-ended range.
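
A minimal sketch of the rule proposed above, under stated assumptions: a bucket is the half-open range (start, end], "continuous" is read as "the series has samples before, inside, and after the bucket", and the function name `include_in_bucket` is hypothetical. It is not the logic that was eventually committed to VictoriaMetrics.

```python
def ts(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def include_in_bucket(samples, start, end):
    """Decide whether a series belongs in the bucket (start, end]."""
    if not any(start < t <= end for t in samples):
        return False  # no raw samples in the bucket at all
    has_earlier = any(t <= start for t in samples)
    has_later = any(t > end for t in samples)
    starts_here = not has_earlier              # rule 2: series starts in the bucket
    continues = has_earlier and has_later      # rule 1: series spans the bucket
    # A series with earlier samples but no later ones ends here and is dropped.
    return starts_here or continues


OLD = [ts(t) for t in ("00:45:10", "00:46:10", "00:47:10", "00:48:10")]  # 20.05.2
NEW = [ts(t) for t in ("00:51:56", "00:52:56", "00:58:56", "01:02:11")]  # 20.05.3

start, end = ts("00:48:00"), ts("00:52:00")
count = sum(include_in_bucket(s, start, end) for s in (OLD, NEW))
print(count)  # prints 1
```

For the bucket ending at 00:52:00, 20.05.2 ends inside the bucket and is excluded, while 20.05.3 starts inside it and is included. Note that this sketch cannot tell a series that has really ended from one whose later samples simply have not arrived yet, which matters for buckets close to the current time.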

Screenshots

Example graph showing artificial spikes in count(build_info) when there is a deployment causing the version label to change:
[Screenshot: 2020-09-05, 12:45:38 AM]

Version

victoria-metrics-20200815-125320-tags-v1.40.0-0-ged00eb3f3

bug

All 15 comments

@belm0 , thanks for the detailed bug report and the proposed solution! The solution looks good. We'll try implementing it and see how it works.

Thank you. For discrepancies like this, it would be nice for VM to have unit tests against the output of the Prometheus query library.

+1, I've run into the same issue

The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

> The other (related) issue is that range queries do not appear to respect the -search.maxStalenessInterval option, while the instant query does.

Could you file a separate issue regarding this bug?

@valyala would you consider prioritizing this? It's the main regression we have remaining vs. Prometheus.

@belm0 , could you verify whether the issue is fixed in the following commits:

  • single-node VictoriaMetrics - 9aa3b6576648ca23dba1c11ef213e91123203399 (see build instructions)
  • cluster version of VictoriaMetrics - 217c192c88ae3569a8a634655cb927829d0fb1d0 (see build instructions)

Note that the response cache must be reset before testing the bugfix in order to remove previously cached incorrect results. See https://victoriametrics.github.io/#backfilling for more info on how to reset response cache.
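
For anyone testing the fix, a small sketch of resetting the response cache over HTTP. It assumes a single-node instance listening on the default address http://localhost:8428 and uses the /internal/resetRollupResultCache handler; the function names are hypothetical, and the address and any authentication need adjusting for a real deployment.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def reset_cache_url(base_url):
    """Build the cache-reset URL for a given VictoriaMetrics base URL."""
    return urljoin(base_url.rstrip("/") + "/", "internal/resetRollupResultCache")


def reset_response_cache(base_url, timeout=10):
    """Send the reset request and return the HTTP status code."""
    with urlopen(reset_cache_url(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling `reset_response_cache("http://localhost:8428")` before re-running the query should ensure that previously cached results are not served.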

The bugfix has been included in VictoriaMetrics v1.44.0. @belm0 , @propertone , could you verify whether it works as expected on your workloads?

It seems to be resolved in v1.44, thank you!

I added a review comment to the commit.

Now I see the opposite problem, where series unexpectedly disappear before they end (for example at head of the series).

I think it's related to my comment on the commit about correctness of the 90% heuristic.

[Screenshot: 2020-10-14, 2:21:48 PM]

@belm0 , could you share the original query used for building the graph above? It would be great if you could simplify and narrow down the query with more specific label filters, so it is based on a single source time series and still would expose the issue.

Adjusted heuristic a bit in the following commits:

  • single-node VictoriaMetrics - 28353e48ca28e719ef7b9eb1757e85210b9d3037
  • cluster VictoriaMetrics - ee2902ddafaec1e7802369e010ea7c73099c84d7

Now VictoriaMetrics should drop trailing points for time series containing only a single raw sample on the selected time range.

FYI, this introduces a regression in queries to /api/v1/query for timestamps close to the current time; see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/845.

FYI, VictoriaMetrics v1.45.0 contains additional enhancements for this issue.

The problem near the end of the query range seems to be resolved in 1.45.0.
