Victoriametrics: sum and sum_over_time returning incorrect value when increasing period

Created on 15 Jun 2020 · 14Comments · Source: VictoriaMetrics/VictoriaMetrics

Describe the bug
When querying data until current timestamp with sum() functions and increasing the period of aggregation (7 days, 30 days, 90 days ...), the count becomes incorrect

To Reproduce
Simply provision data in VM until now (7 days for instance). Then print out the total value of the data and see the same data in bigger window (30 days, 90 days, 1 year , ...)
You will see that the total value - that should be invariant - will decrease because the last interval is no more taken into account
Please note that this do not happen if the data are provisioned in the past ( 2 weeks ago for instance) - meaning this is completely related with last time interval.

For example:
This query on 7 days :
api/v1/query_range?query=sum%20by(profile_type)%20(sum_over_time(euicc_total%7Boperator%3D%22STC%22%20%2C%20profile_type%3D~%22profile3%22%2C%20mno%3D~%22TMNO3%22%7D))&start=1591567200&end=1592172000&step=86400
return
```0:Array[1591653600,112]
1:Array[1591740000,194]
2:Array[1591826400,104]
3:Array[1591912800,217]
4:Array[1591999200,67]
5:Array[1592085600,165]
6:Array[1592172000,85]

The same query on 90 days
`api/v1/query_range?query=sum%20by(profile_type)%20(sum_over_time(euicc_total%7Boperator%3D%22STC%22%20%2C%20profile_type%3D~%22profile3%22%2C%20mno%3D~%22TMNO3%22%7D))&start=1584396000&end=1592172000&step=86400`
return this:

0:Array[1591660800,112]
1:Array[1591747200,194]
2:Array[1591833600,104]
3:Array[1591920000,217]
4:Array[1592006400,67]
5:Array[1592092800,165]

as you can see the last interval timestamp has changed - and the last interval until now is missing
This has been reproduced on many screens ... always with the same side-effect.

**Expected behavior**
**The count must not change**

**Screenshots**
I could put screenshot but i think the investigation i did on the query are more clear

**Version**
The line returned when passing `--version` command line flag to binary. For example:

/opt/victoriametrics/victoria-metrics-prod --version
victoria-metrics-20200505-082139-tags-v1.35.2-0-g0c7cddfc
```

Used command-line flags
/opt/victoriametrics/victoria-metrics-prod -storageDataPath /u01/reporting/victoriametrics/data -retentionPeriod 36 -influxSkipSingleField -search.maxStalenessInterval=5m
Additional context

bug question

Source

omerlin

All 14 comments

VictoriaMetrics may adjust the returned timestamps if the number of returned data points exceeds 50 - see the corresponding comment in the code for details.

This behaviour can be disabled by passing -search.disableCache command-line flag to VictoriaMetrics. Another option is to pass nocache=1 query arg to /api/v1/query_range.

valyala on 18 Jun 2020

The number of points are not increasing... We are just seeing the same
points from a longer perspective... And so we expect the same results. But
curiously when increasing the query period, the last interval is truncated
and so the count is false. You can see it in the query example I posted at
the beginning of this post.

Le jeu. 18 juin 2020 à 23:49, Aliaksandr Valialkin notifications@github.com
a écrit :

VictoriaMetrics may adjust the returned timestamps if the number of
returned data points exceeds 50 - see the corresponding comment in the
code
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/a6f16dcc11891c7ab1dbe3e3129cee7210da17bc/app/vmselect/promql/eval.go#L23
for details.

This behaviour can be disabled by passing -search.disableCache
command-line flag to VictoriaMetrics. Another option is to pass nocache=1
query arg to /api/v1/query_range.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/563#issuecomment-646323183,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AEUPYQCLOB6V5VEUS43Z3CTRXKDXHANCNFSM4N6LIM5A
.

omerlin on 21 Jun 2020

Hello,
I have narrowed the issue - this is related to UTC management.
I'm located in France so the time is UTC+2
I have loaded my data on the last 7 days using the back filling api. So all my data are there.
I'm doing sum_over_time() with a granularity of 1 day.
I expect when i query at 30 days, 90 days, 1 year ... to see always the same sum().
Let see with a small python program :

import requests
import time

def query_VM( nb_days):
    max_timestamp=1593036000
    min_timestamp=max_timestamp-nb_days*86400
    print( 'start={}, end={}, duration(days)={}'.format(time.ctime(min_timestamp),time.ctime(max_timestamp),(max_timestamp-min_timestamp)/(86400)))
    query="/api/v1/query_range?query=sum%20by(profile_id)%20(sum_over_time(operation_smdp_result%7Boperator%3D%22STC%22%20%2C%20operation%3D%22DP_DOWNLOAD_PROFILE%22%2C%20operation_result%3D~%22Succeeded%7CSucceed_With_Warning%22%2C%20target_mno%3D~%22MO4%7CSTC%7CTMNO3%22%2C%20profile_id%3D~%22profile1%22%7D))&start={}&end={}&step=86400".format(min_timestamp,max_timestamp)
    r=requests.get("http://localhost:28428/%s" % (query))
    s=0
    for k in r.json()['data']['result'][0]['values']:
        print( time.ctime(k[0]), int(k[1]) )
        s+= int(k[1])
    print( 'SUM=%d' % (s))

as you can see i have a tunnel to localhost:28428 that access directly to the VM API.
Let query on different period, 7 days, 30 days, 60 days, 90 days

query_VM(7)
start=Thu Jun 18 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=7.0
Fri Jun 19 00:00:00 2020 27401
Sat Jun 20 00:00:00 2020 70934
Sun Jun 21 00:00:00 2020 77001
Mon Jun 22 00:00:00 2020 80119
Tue Jun 23 00:00:00 2020 74928
Wed Jun 24 00:00:00 2020 69448
Thu Jun 25 00:00:00 2020 74243
SUM=474074

query_VM(30)
start=Tue May 26 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=30.0
Fri Jun 19 00:00:00 2020 27401
Sat Jun 20 00:00:00 2020 70934
Sun Jun 21 00:00:00 2020 77001
Mon Jun 22 00:00:00 2020 80119
Tue Jun 23 00:00:00 2020 74928
Wed Jun 24 00:00:00 2020 69448
Thu Jun 25 00:00:00 2020 74243
SUM=474074



md5-425e5ab8440378738b6a8a312be0d9cf



query_VM(60)
start=Sun Apr 26 00:00:00 2020, end=Thu Jun 25 00:00:00 2020, duration(days)=60.0
Fri Jun 19 02:00:00 2020 31454
Sat Jun 20 02:00:00 2020 73176
Sun Jun 21 02:00:00 2020 76248
Mon Jun 22 02:00:00 2020 80581
Tue Jun 23 02:00:00 2020 74453
Wed Jun 24 02:00:00 2020 70111
SUM=406023

Bingo ! Here is the "bug" - the date interval suddenly changed with a +2:00:00 - changing the interval counts and even the number of interval.
So the SUM is incorrect because somewhere the UTC time definition changed when going back in time.
I was first thinking it could be related to the time shift in France UTC+1 (Winter time) to UTC+2 (Summer time) ... but as it was the 28th of march - this is not the reason.
What do you think of this please ?
Thanks a lot for your attention
Olivier

omerlin on 25 Jun 2020

👀2

@omerlin , thanks for the detailed investigation!

VictoriaMetrics should store and process all the timestamps in UTC. I'll double check this and return back with findings.

valyala on 25 Jun 2020

@valyala, thanks for all your good work.
I know it's quite a lot of work to investigate such an issue, particularly reproducing issues. I have all the materials for that with back filling injection tool and error reproduction by script as shown above.
I will try to do it myself recompiling the code and try debugging even if i don't know Go programming. If you advise on tooling helper or method to debug go code efficiently let me know.

omerlin on 3 Jul 2020

@omerlin , see build instructions here. I'd suggest you exporting all the raw data that is used in the query via /api/v1/export (just pass match[]=operation_smdp_result%7Boperator%3D%22STC%22%20%2C%20operation%3D%22DP_DOWNLOAD_PROFILE%22%2C%20operation_result%3D~%22Succeeded%7CSucceed_With_Warning%22%2C%20target_mno%3D~%22MO4%7CSTC%7CTMNO3%22%2C%20profile_id%3D~%22profile1%22%7D query arg together with start and end args from /api/v1/query_range), then manually calculating sum_over_time for the returned data in Python script and comparing the results with the results returned by VictoriaMetrics . Note that /api/v1/export returns timestamps in milliseconds, not in seconds, so don't forget dividing them by 1000. Note also that VictoriaMetrics always returns timestamps in UTC.

valyala on 3 Jul 2020

@valyala, i'm working on it. Just one question about the /write API. We are using the InluxDB data format with the timestamp to do the backfilling. But do we need to have a TZ timestamp or a UTC timestamp ?

omerlin on 22 Jul 2020

do we need to have a TZ timestamp or a UTC timestamp ?

VictoriaMetrics treats all the timestamps in the ingested data as UTC.

valyala on 22 Jul 2020

@valyala i finally found the issue - this is not related to timezone but to the modulo function used to have cache buckets alignment.
PromQL query_range are processed into prometheus/prometheus.go - the start/end are altered by this code :

if mayCache {
start, end = promql.AdjustStartEnd(start, end, step)
}

As i'm using Grafana i cannot add a form &nocache=1 in my request and i'm entering in the promql.AdjustStartEnd function logic.
In my case, my step=86400 (one day in seconds)
Then if the number of days is considered as small ( 50 in your code) - the star, end are not changed.

func AdjustStartEnd(start, end, step int64) (int64, int64) {
points := (end-start)/step + 1
if points < minTimeseriesPointsForTimeRounding {
// Too small number of points for rounding.
return start, end
}

as line 26 we have :

// The minimum number of points per timeseries for enabling time rounding.
// This improves cache hit ratio for frequently requested queries over
// big time ranges.
const minTimeseriesPointsForTimeRounding = 50

then over 50 days, the start/end are altered with rounding logic inside the function AdjustStartEnd
The problem is then that the start/end alteration change completly the sum() computation ....

Do you think it could be possible to make the nocache activation per configuration (as the server configuration search.nocache is not taken into account for this kind of query ) ...

Btw ... when writing data using UTC and reading back with grafana i didn't have the issue because my start/step were modulo 86400 !!
Happy to finally understand.
Thanks for your help
Olivier
PS: i have created a small patch with a new parameter flag search.disableQueryCache ... but i'm sure you will propose something better like extending for instance the search.disableCache to the query_range() code also

omerlin on 29 Jul 2020

@omerlin , thanks for the investigation! The commit f0c678c41bc4804fcf4574220852c1633f579cfc stops adjusting of start and end args when VictoriaMetrics is started with -search.disableCache command-line flag. Could you build VictoriaMetrics from this commit according to these docs and verify whether it works as expected for your case?

valyala on 30 Jul 2020

@valyala, thanks again for reactivity ! I have tested and it's working perfectly has expected.
I suppose this fix will be part of the next 1.39.3, right ?

omerlin on 31 Jul 2020

🎉1

I suppose this fix will be part of the next 1.39.3, right ?

Yes.

valyala on 31 Jul 2020

The bugfix has been included in v1.39.3. The bugfix stops adjusting start and end args when -search.disableCache command-line flag is passed during VictoriaMetrics startup.

Closing the issue as fixed.

@omerlin , probably it would be better aligning start and end args to the step value like VictoriaMetrics does before passing them to VictoriaMetrics. In this case the client may take advantage of response caching in VictoriaMetrics.

valyala on 5 Aug 2020

@omerlin , probably it would be better aligning start and end args to the step value like VictoriaMetrics does before passing them to VictoriaMetrics. In this case the client may take advantage of response caching in VictoriaMetrics.

This is not possible easily as the request is sent by grafana using day granularity ... so the start and end are by default defined using local TZ ... as VictoriaMetrics is using UTC for all its calculation. So using last 30 days, last 90 days buttons ... the start and stop are not aligned with internal caching

omerlin on 31 Aug 2020

Was this page helpful?

0 / 5 - 0 ratings