Victoriametrics: query range returned irrelevant results

Created on 6 Mar 2020  路  16Comments  路  Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
My query request is like this:
/api/v1/query_range?end=1583393533&query={__name__=order,__order_g1__=commit,__order_g4__=response,__order_g7__=error,__order_g9__=""}&start=1583220733&step=270s

The returned result contains a result that obviously cannot be matched

{
    "metric": {
        "__name__": "node_network_flags",
        "beta_kubernetes_io_arch": "true",
        "beta_kubernetes_io_os": "true",
        "cni": "true",
        "device": "veth70d4fc6d",
        "instance": "kube1033",
        "job": "node-exporter",
        "kubernetes_io_arch": "true",
        "kubernetes_io_hostname": "true",
        "kubernetes_io_os": "true",
        "node": "kube1033",
        "node_role_kubernetes_io_node": "true",
        "service_offline": "true",
        "service_online": "true"
    },
    "values": [
        [
            1583353710,
            "4867"
        ]
    ]
}

Screenshots
image

Version
vmstorage-20200213-163134-tags-v1.33.1-cluster-0-gafecb344
vminsert-20200213-163127-tags-v1.33.1-cluster-0-gafecb344
vmselect-20200213-163131-tags-v1.33.1-cluster-0-gafecb344

bug

Most helpful comment

This issue disappeared after upgrading v1.34.4. Thank @valyala 馃憤

All 16 comments

Hi @valyala , Could you give me some debug suggestions? Now it can be reproduced.

Hi @faceair! We aware of the issue with irrelevant matching. It must be somehow connected with index search optimisation in v1.32.0. It does not corrupt the data and affects matching only. So I'd recommend to try to rollback to version lower than v1.32.0 and higher than v1.29.0.

We would appreciate if you can provide a data-set where issue can be reproduced. Meanwhile we investigating the issue and will get back to you as soon as it resolved.

@faceair , sorry for late response. As @hagen1778 already said, it is likely related to a data race in inverted index search introduced in v1.32.* series of releases. We are investigating the issue.

Workarounds are:

  • To stop VictoriaMetrics, remove <-storageDataPath>/cache directory and then start VictoriaMetrics. This removes the cached invalid result, so it is likely it won't be visible again.
  • To downgrade to v1.31.5. It is safe to downgrade to this version. This version shouldn't contain the issue. Note that the <-storageDataPath>/cache directory must be removed before starting the downgraded version.

You can help nailing down the issue by building and running VictoriaMetrics with the enabled race detector. The binary can be built from master branch with make victoria-metrics-race command - it will be stored at bin/victoria-metrics-race. Note that the race detector significantly increases CPU and RAM usage.

See also the related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/324 .

I used the make vminsert-prod vmselect-prod vmstorage-prod-race command on the latest cluster branch to complete the build, deleted the cache directory and completed the upgrade of all nodes.

The version of vmstorage is vmstorage-20200313-034303-heads-cluster-0-gbf1869d, I will pay attention to the error output of vmstorage.

In the past hour, the race detector did not find any conflicts.
Yes, as you said race detector consumes too much system resources. Sorry, I can't continue testing.
We are now rolling back the storage node to the vmstorage-20200311-022448-tags-v1.34.3-cluster-0-g187fd89c version, a few errors don't matter.

@faceair , thanks for your efforts! It looks like the issue isn't related to data races, but it is related to inverted index lookup optimizations made in v1.32.* series. I performed cleanup in this code, which could fix the issue - see commit 31a533656ef140cd49326b38110ac91893223a65 . Could you build cluster components from this commit and verify whether the issue is fixed there? The cluster components must be built with make vminsert-prod vmselect-prod vmstorage-prod command as described here.

It is hard to accurately detect whether this issue is fixed. Until now, I only found this one.
I plan to first check that all queries now return irrelevant results and how often they occur. After that I will check the effect of your commit.
I will update the progress next week.

I find that the result of this query is still incorrect.

I upgraded vmstorage to vmstorage-20200315-030405-heads-cluster-0-g8fa9066 in this way:

  1. stop vmstorage
  2. delete vmstorage cache directory
  3. start vmstorafe
  4. delete vmselect cache directory
  5. restart vmselect

The result of this query is still incorrect after upgrading. Looks like it could be another reason ?

I think I've hit the same problem, but when downgrading from 1.34.3 to 1.31.5 (single node version), there is no cache folder ... and it just went into a restart-loop without me being able to attach to the running container.

FYI, it looks like the issue has been finally fixed in the latest commits in master and cluster branches thanks to the data, which provided reliable reproduction of the bug. The bug has been initially introduced in v1.32.6 inside the code for fast merging of big sets.

The corresponding commits are:

  • f3ec424e7d2ad90bc0458acb6b5ce553443b41d6
  • 047849e855b1e13bf194bc803160b637fccaea1c

These commits will be included in the upcoming release.

@faceair , @smlgbl , @cristaloleg , could you verify whether the bug is fixed in v1.34.4?

Note that the folder <-storageDataPath>/cache must be removed manually after stopping the previous VictoriaMetrics process and before starting new process for v1.34.4 , since invalid results can be cached in this folder.

Thanks for the quick release!
When I log into the host where the container is running, there is no cache folder ... am I doing something wrong?

Anywho, a quick check makes me think it's fixed.

This issue disappeared after upgrading v1.34.4. Thank @valyala 馃憤

When I log into the host where the container is running, there is no cache folder ... am I doing something wrong?

VictoriaMetrics creates the cache folder inside -storageDataPath only on graceful shutdown, so caches can be quickly loaded with warm data on the next start. If the cache folder is missing after VictoriaMetrics is stopped, then this usually means failed graceful shutdown - see for error logs for details. Note that graceful shutdown is initiated by sending SIGINT or SIGTERM signals to VictoriaMetrics process.

@smlgbl , @faceair , thanks for confirmation that the bug is fixed in v1.34.4 . Closing the issue!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

n4mine picture n4mine  路  3Comments

valyala picture valyala  路  4Comments

genericgithubuser picture genericgithubuser  路  4Comments

prdatur picture prdatur  路  3Comments

Serrvosky picture Serrvosky  路  3Comments