Victoriametrics: query range returned irrelevant results

Created on 6 Mar 2020 · 16Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
My query request is like this:
/api/v1/query_range?end=1583393533&query={__name__=order,__order_g1__=commit,__order_g4__=response,__order_g7__=error,__order_g9__=""}&start=1583220733&step=270s

The returned result contains a result that obviously cannot be matched

{
    "metric": {
        "__name__": "node_network_flags",
        "beta_kubernetes_io_arch": "true",
        "beta_kubernetes_io_os": "true",
        "cni": "true",
        "device": "veth70d4fc6d",
        "instance": "kube1033",
        "job": "node-exporter",
        "kubernetes_io_arch": "true",
        "kubernetes_io_hostname": "true",
        "kubernetes_io_os": "true",
        "node": "kube1033",
        "node_role_kubernetes_io_node": "true",
        "service_offline": "true",
        "service_online": "true"
    },
    "values": [
        [
            1583353710,
            "4867"
        ]
    ]
}

Screenshots

Version
vmstorage-20200213-163134-tags-v1.33.1-cluster-0-gafecb344
vminsert-20200213-163127-tags-v1.33.1-cluster-0-gafecb344
vmselect-20200213-163131-tags-v1.33.1-cluster-0-gafecb344

bug

Source

faceair

Most helpful comment

This issue disappeared after upgrading v1.34.4. Thank @valyala 👍

faceair on 27 Mar 2020

👍2

All 16 comments

Hi @valyala , Could you give me some debug suggestions? Now it can be reproduced.

faceair on 10 Mar 2020

Hi @faceair! We aware of the issue with irrelevant matching. It must be somehow connected with index search optimisation in v1.32.0. It does not corrupt the data and affects matching only. So I'd recommend to try to rollback to version lower than v1.32.0 and higher than v1.29.0.

We would appreciate if you can provide a data-set where issue can be reproduced. Meanwhile we investigating the issue and will get back to you as soon as it resolved.

hagen1778 on 10 Mar 2020

@faceair , sorry for late response. As @hagen1778 already said, it is likely related to a data race in inverted index search introduced in v1.32.* series of releases. We are investigating the issue.

Workarounds are:

To stop VictoriaMetrics, remove <-storageDataPath>/cache directory and then start VictoriaMetrics. This removes the cached invalid result, so it is likely it won't be visible again.
To downgrade to v1.31.5. It is safe to downgrade to this version. This version shouldn't contain the issue. Note that the <-storageDataPath>/cache directory must be removed before starting the downgraded version.

You can help nailing down the issue by building and running VictoriaMetrics with the enabled race detector. The binary can be built from master branch with make victoria-metrics-race command - it will be stored at bin/victoria-metrics-race. Note that the race detector significantly increases CPU and RAM usage.

See also the related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/324 .

valyala on 11 Mar 2020

I used the make vminsert-prod vmselect-prod vmstorage-prod-race command on the latest cluster branch to complete the build, deleted the cache directory and completed the upgrade of all nodes.

The version of vmstorage is vmstorage-20200313-034303-heads-cluster-0-gbf1869d, I will pay attention to the error output of vmstorage.

faceair on 13 Mar 2020

In the past hour, the race detector did not find any conflicts.
Yes, as you said race detector consumes too much system resources. Sorry, I can't continue testing.
We are now rolling back the storage node to the vmstorage-20200311-022448-tags-v1.34.3-cluster-0-g187fd89c version, a few errors don't matter.

faceair on 13 Mar 2020

@faceair , thanks for your efforts! It looks like the issue isn't related to data races, but it is related to inverted index lookup optimizations made in v1.32.* series. I performed cleanup in this code, which could fix the issue - see commit 31a533656ef140cd49326b38110ac91893223a65 . Could you build cluster components from this commit and verify whether the issue is fixed there? The cluster components must be built with make vminsert-prod vmselect-prod vmstorage-prod command as described here.

valyala on 13 Mar 2020

It is hard to accurately detect whether this issue is fixed. Until now, I only found this one.
I plan to first check that all queries now return irrelevant results and how often they occur. After that I will check the effect of your commit.
I will update the progress next week.

faceair on 14 Mar 2020

I find that the result of this query is still incorrect.

I upgraded vmstorage to vmstorage-20200315-030405-heads-cluster-0-g8fa9066 in this way:

stop vmstorage
delete vmstorage cache directory
start vmstorafe
delete vmselect cache directory
restart vmselect

The result of this query is still incorrect after upgrading. Looks like it could be another reason ?

faceair on 15 Mar 2020

I think I've hit the same problem, but when downgrading from 1.34.3 to 1.31.5 (single node version), there is no cache folder ... and it just went into a restart-loop without me being able to attach to the running container.

smlgbl on 20 Mar 2020

FYI, it looks like the issue has been finally fixed in the latest commits in master and cluster branches thanks to the data, which provided reliable reproduction of the bug. The bug has been initially introduced in v1.32.6 inside the code for fast merging of big sets.

The corresponding commits are:

f3ec424e7d2ad90bc0458acb6b5ce553443b41d6
047849e855b1e13bf194bc803160b637fccaea1c

These commits will be included in the upcoming release.

valyala on 27 Mar 2020

👍1

@faceair , @smlgbl , @cristaloleg , could you verify whether the bug is fixed in v1.34.4?

valyala on 27 Mar 2020

Note that the folder <-storageDataPath>/cache must be removed manually after stopping the previous VictoriaMetrics process and before starting new process for v1.34.4 , since invalid results can be cached in this folder.

valyala on 27 Mar 2020

Thanks for the quick release!
When I log into the host where the container is running, there is no cache folder ... am I doing something wrong?

Anywho, a quick check makes me think it's fixed.

smlgbl on 27 Mar 2020

This issue disappeared after upgrading v1.34.4. Thank @valyala 👍

faceair on 27 Mar 2020

👍2

When I log into the host where the container is running, there is no cache folder ... am I doing something wrong?

VictoriaMetrics creates the cache folder inside -storageDataPath only on graceful shutdown, so caches can be quickly loaded with warm data on the next start. If the cache folder is missing after VictoriaMetrics is stopped, then this usually means failed graceful shutdown - see for error logs for details. Note that graceful shutdown is initiated by sending SIGINT or SIGTERM signals to VictoriaMetrics process.

valyala on 27 Mar 2020

@smlgbl , @faceair , thanks for confirmation that the bug is fixed in v1.34.4 . Closing the issue!

valyala on 27 Mar 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings