Hello! I'm not sure this is a bug, so let's start with a question first.
ClickHouse 19.7.3.9
I've got two 100% identical CH clusters and a carbon-clickhouse daemon that pushes the metrics into both of them. It uploads the same RowBinary-formatted file to both clusters.
So the data which ends up in the table should be the same in both clusters.
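For reference, here is a minimal sketch of how the ingested data could be compared on both clusters before any rollup kicks in (the date filter is just an example; the table is defined below):
SELECT
    count() AS rows,
    sum(cityHash64(Path, Time, Value)) AS rows_hash
FROM graphite
WHERE Date = '2019-06-09'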
Rollup:
<graphite_rollup>
<default>
<function>avg</function>
<retention>
<age>0</age>
<precision>60</precision>
</retention>
<retention>
<age>2592000</age>
<precision>300</precision>
</retention>
</default>
</graphite_rollup>
Table:
CREATE TABLE IF NOT EXISTS default.graphite (
Path String CODEC(ZSTD),
Value Float64 CODEC(ZSTD),
Time UInt32 CODEC(ZSTD),
Date Date CODEC(ZSTD),
Timestamp UInt32 CODEC(ZSTD)
) ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMM(Date)
ORDER BY (Path, Time);
I'll just take some example metric and check the results:
SELECT
toDateTime(Time),
Value
FROM graphite
WHERE (Date = '2019-06-09') AND (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC')
ORDER BY Time ASC
LIMIT 10
Cluster1:
┌────toDateTime(Time)─┬─Value─┐
│ 2019-06-09 00:00:00 │    64 │
│ 2019-06-09 00:01:00 │    64 │
│ 2019-06-09 00:02:00 │    64 │
│ 2019-06-09 00:03:00 │    64 │
│ 2019-06-09 00:04:00 │    64 │
│ 2019-06-09 00:05:00 │    64 │
│ 2019-06-09 00:06:00 │    64 │
│ 2019-06-09 00:07:00 │    64 │
│ 2019-06-09 00:08:00 │    64 │
│ 2019-06-09 00:09:00 │    64 │
└─────────────────────┴───────┘
Cluster2:
┌────toDateTime(Time)─┬─Value─┐
│ 2019-06-09 00:00:00 │   748 │
│ 2019-06-09 00:01:00 │   684 │
│ 2019-06-09 00:02:00 │   636 │
│ 2019-06-09 00:03:00 │   628 │
│ 2019-06-09 00:04:00 │   600 │
│ 2019-06-09 00:05:00 │   712 │
│ 2019-06-09 00:06:00 │   624 │
│ 2019-06-09 00:07:00 │   628 │
│ 2019-06-09 00:08:00 │   624 │
│ 2019-06-09 00:09:00 │   608 │
└─────────────────────┴───────┘
This 64..64..64.. looks strange to me.
And the data differs only in some time ranges; mostly it's the same.
I ran OPTIMIZE FINAL on the table in both clusters just to be sure.
Any idea where this difference might come from?
OK, I've found out where the 64s come from: I get metrics every 30s, like this:
┌────toDateTime(Time)─┬─Value─┐
│ 2019-06-11 13:13:59 │    64 │
│ 2019-06-11 13:14:29 │  1168 │
│ 2019-06-11 13:14:59 │    64 │
│ 2019-06-11 13:15:29 │  1336 │
│ 2019-06-11 13:15:59 │    64 │
│ 2019-06-11 13:16:29 │  1168 │
└─────────────────────┴───────┘
So I get exactly 2 metrics every minute; when averaging to precision 60 I should get an average of (64+X)/2 all the time, but for some reason I don't.
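To make the expectation concrete, here is a sketch of the manual cross-check (same Path as above, run against rows that have not been rolled up yet):
SELECT
    intDiv(Time, 60) * 60 AS minute,
    avg(Value) AS expected_avg,
    count() AS points
FROM graphite
WHERE Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC'
GROUP BY minute
ORDER BY minute ASC
LIMIT 10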
Does GraphiteMergeTree use local time somewhere in the rollup so that NTP can interfere?
@Felixoid can you answer?
It could be because the function any is applied instead of avg.
I don't see the required Version column and the column description:
<path_column_name>Path</path_column_name>
<time_column_name>Time</time_column_name>
<value_column_name>Value</value_column_name>
<version_column_name>Version</version_column_name>
Check that <graphite_rollup> is identical on both clusters.
Thank you, Denis.
I'm also not sure what is meant by "carbon-clickhouse daemon that pushes the metrics into both of them". As far as I can see, carbon-clickhouse doesn't push metrics to multiple destinations.
@den-crane
It could be because the function any is applied instead of avg.
It is avg; I've quoted the graphite_rollup part of config.xml above.
I don't see the required Version column and the column description
The docs have been wrong for a long time.
By default CH expects Timestamp instead of Version.
Check that <graphite_rollup> is identical on both clusters
Checked, it's identical; the configuration is rolled out to both clusters by Ansible.
@Felixoid
As far as I can see, carbon-clickhouse doesn't push metrics to multiple destinations.
It does if you configure it to.
It caches the metrics in a local file in CH-compatible RowBinary format and then uploads it to any number of destinations; in my case it's 2 clusters.
As for local time: after reading the CH sources, the rollup does not take local time into account, it just rounds Time down to the nearest precision boundary (something like new_ts = ts / precision * precision), so to my knowledge NTP and other time issues should not matter here.
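In SQL terms the rounding is roughly equivalent to this (just an illustration, not the engine's actual code):
SELECT
    1560442386 AS ts,                   -- 2019-06-13 18:13:06
    intDiv(ts, 60) * 60 AS rounded_ts,  -- 1560442380 = 2019-06-13 18:13:00
    toDateTime(rounded_ts)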
OK, I've seen it in the code: it accepts a slice of upload configs.
Here's the only usage of time during a merge; it is used to select the precision: https://github.com/yandex/ClickHouse/blob/master/dbms/src/DataStreams/GraphiteRollupSortedBlockInputStream.cpp#L110
Are you sure that carbon-clickhouse manipulates the date locally? I don't see anything like that in the config or code after a quick search.
Here's the only usage of time during a merge; it is used to select the precision
Yeah, it's there just to store the row as-is (with precision=1) if the age is not yet reached. In my case it probably should not matter.
Are you sure that carbon-clickhouse manipulates the date locally? I don't see anything like that in the config or code after a quick search.
It does not manipulate the timestamp that goes into the Time column; it's taken from the incoming Carbon metric as-is.
Optionally it can put the Unix timestamp of when the metric was received into the Timestamp (aka Version) column, but I don't use that and it's always 0.
I'm now putting all the data into a ReplacingMergeTree table as-is and then transferring it manually into GraphiteMergeTree with INSERT ... SELECT and OPTIMIZE FINAL to try to catch this behavior, but so far it works perfectly...
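Roughly, the transfer looks like this (a sketch; the staging table name graphite_raw and the partition are just examples):
-- graphite_raw is the ReplacingMergeTree staging table, graphite is the GraphiteMergeTree target.
INSERT INTO graphite (Path, Value, Time, Date, Timestamp)
SELECT Path, Value, Time, Date, Timestamp
FROM graphite_raw;
-- Force the rollup so the result can be inspected right away.
OPTIMIZE TABLE graphite PARTITION 201906 FINAL;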
Well, by moving data between tables I couldn't reproduce the issue, but it probably takes some time to manifest itself.
I've recreated all tables from scratch and started ingesting metrics as usual. It took about 7 hours before it started merging with errors:
Cluster 2
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560338760 │ 2019-06-12 13:26:00 │   636 │
│ 1560338820 │ 2019-06-12 13:27:00 │  1192 │
│ 1560338880 │ 2019-06-12 13:28:00 │   664 │
│ 1560338940 │ 2019-06-12 13:29:00 │   624 │
│ 1560339000 │ 2019-06-12 13:30:00 │   760 │
│ 1560339060 │ 2019-06-12 13:31:00 │   668 │
...
... ok for several hours, then at 21:11 errors begin
...
│ 1560366480 │ 2019-06-12 21:08:00 │   636 │
│ 1560366540 │ 2019-06-12 21:09:00 │   624 │
│ 1560366600 │ 2019-06-12 21:10:00 │   812 │
│ 1560366660 │ 2019-06-12 21:11:00 │    64 │
│ 1560366720 │ 2019-06-12 21:12:00 │    64 │
│ 1560366780 │ 2019-06-12 21:13:00 │    64 │
│ 1560366840 │ 2019-06-12 21:14:00 │    64 │
│ 1560366900 │ 2019-06-12 21:15:00 │    64 │
│ 1560366960 │ 2019-06-12 21:16:00 │    64 │
│ 1560367020 │ 2019-06-12 21:17:00 │    64 │
...
... then at 02:04 ok again
...
│ 1560383940 │ 2019-06-13 01:59:00 │    64 │
│ 1560384000 │ 2019-06-13 02:00:00 │   760 │
│ 1560384060 │ 2019-06-13 02:01:00 │    96 │
│ 1560384120 │ 2019-06-13 02:02:00 │   616 │
│ 1560384180 │ 2019-06-13 02:03:00 │    64 │
│ 1560384240 │ 2019-06-13 02:04:00 │   624 │
│ 1560384300 │ 2019-06-13 02:05:00 │   712 │
│ 1560384360 │ 2019-06-13 02:06:00 │   652 │
│ 1560384420 │ 2019-06-13 02:07:00 │   636 │
│ 1560384480 │ 2019-06-13 02:08:00 │   636 │
...
Cluster 1 looks OK all the time:
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560338760 │ 2019-06-12 13:26:00 │   636 │
│ 1560338820 │ 2019-06-12 13:27:00 │  1192 │
│ 1560338880 │ 2019-06-12 13:28:00 │   664 │
│ 1560338940 │ 2019-06-12 13:29:00 │   624 │
│ 1560339000 │ 2019-06-12 13:30:00 │   760 │
│ 1560339060 │ 2019-06-12 13:31:00 │   668 │
...
│ 1560366480 │ 2019-06-12 21:08:00 │   636 │
│ 1560366540 │ 2019-06-12 21:09:00 │   624 │
│ 1560366600 │ 2019-06-12 21:10:00 │   812 │
│ 1560366660 │ 2019-06-12 21:11:00 │   636 │
│ 1560366720 │ 2019-06-12 21:12:00 │   628 │
│ 1560366780 │ 2019-06-12 21:13:00 │   628 │
│ 1560366840 │ 2019-06-12 21:14:00 │   628 │
│ 1560366900 │ 2019-06-12 21:15:00 │   692 │
│ 1560366960 │ 2019-06-12 21:16:00 │   636 │
│ 1560367020 │ 2019-06-12 21:17:00 │   624 │
...
│ 1560383940 │ 2019-06-13 01:59:00 │   616 │
│ 1560384000 │ 2019-06-13 02:00:00 │   760 │
│ 1560384060 │ 2019-06-13 02:01:00 │    96 │
│ 1560384120 │ 2019-06-13 02:02:00 │   616 │
│ 1560384180 │ 2019-06-13 02:03:00 │    64 │
│ 1560384240 │ 2019-06-13 02:04:00 │   624 │
│ 1560384300 │ 2019-06-13 02:05:00 │   712 │
│ 1560384360 │ 2019-06-13 02:06:00 │   652 │
│ 1560384420 │ 2019-06-13 02:07:00 │   636 │
│ 1560384480 │ 2019-06-13 02:08:00 │   636 │
...
I don't see any errors in the logs (neither around this particular time nor any other).
Any other ideas? I will probably try downgrading to 19.6 to see if it helps...
As an additional consistency check, you could create a materialized view with a MergeTree engine and see if the data is the same in both clusters.
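A minimal sketch of such a check view (the name graphite_check_mv is just an example; it only captures inserts made after it is created):
CREATE MATERIALIZED VIEW graphite_check_mv
ENGINE = MergeTree()
PARTITION BY toYYYYMM(Date)
ORDER BY (Path, Time)
AS SELECT * FROM graphite;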
@Felixoid
Thanks, I will try that as the next step.
For now I've created two tables in each cluster, one GraphiteMergeTree and one ReplacingMergeTree, and I'm ingesting the same data into all of them.
Got no other ideas for now...
With Timestamp=0, ReplacingMergeTree doesn't make a lot of sense, IMHO.
@Felixoid
It's there just in case a duplicate metric arrives with the same Path, Time (which are the sorting key). For debugging this problem it should not matter much, agreed.
@Felixoid
Well, it happened again in the same scenario.
Cluster2 GraphiteMergeTree:
│ 1560441780 │ 2019-06-13 18:03:00 │   636 │
│ 1560441840 │ 2019-06-13 18:04:00 │   612 │
│ 1560441900 │ 2019-06-13 18:05:00 │   696 │
│ 1560441960 │ 2019-06-13 18:06:00 │   636 │
│ 1560442020 │ 2019-06-13 18:07:00 │   664 │
│ 1560442080 │ 2019-06-13 18:08:00 │   636 │
│ 1560442140 │ 2019-06-13 18:09:00 │   636 │
│ 1560442200 │ 2019-06-13 18:10:00 │   772 │
│ 1560442260 │ 2019-06-13 18:11:00 │   636 │
│ 1560442320 │ 2019-06-13 18:12:00 │   628 │
│ 1560442380 │ 2019-06-13 18:13:00 │    64 │
│ 1560442440 │ 2019-06-13 18:14:00 │    64 │
│ 1560442500 │ 2019-06-13 18:15:00 │    64 │
│ 1560442560 │ 2019-06-13 18:16:00 │    96 │
│ 1560442620 │ 2019-06-13 18:17:00 │    64 │
│ 1560442680 │ 2019-06-13 18:18:00 │    64 │
│ 1560442740 │ 2019-06-13 18:19:00 │    64 │
│ 1560442800 │ 2019-06-13 18:20:00 │    64 │
│ 1560442860 │ 2019-06-13 18:21:00 │    64 │
│ 1560442920 │ 2019-06-13 18:22:00 │    64 │
│ 1560442980 │ 2019-06-13 18:23:00 │    64 │
│ 1560443040 │ 2019-06-13 18:24:00 │    64 │
But the adjacent ReplacingMergeTree table has good data.
Cluster2 ReplacingMergeTree:
│ 1560441786 │ 2019-06-13 18:03:06 │  1208 │
│ 1560441816 │ 2019-06-13 18:03:36 │    64 │
│ 1560441846 │ 2019-06-13 18:04:06 │  1160 │
│ 1560441876 │ 2019-06-13 18:04:36 │    64 │
│ 1560441906 │ 2019-06-13 18:05:06 │  1328 │
│ 1560441936 │ 2019-06-13 18:05:36 │    64 │
│ 1560441966 │ 2019-06-13 18:06:06 │  1208 │
│ 1560441996 │ 2019-06-13 18:06:36 │    64 │
│ 1560442026 │ 2019-06-13 18:07:06 │  1264 │
│ 1560442056 │ 2019-06-13 18:07:36 │    64 │
│ 1560442086 │ 2019-06-13 18:08:06 │  1208 │
│ 1560442116 │ 2019-06-13 18:08:36 │    64 │
│ 1560442146 │ 2019-06-13 18:09:06 │  1176 │
│ 1560442176 │ 2019-06-13 18:09:36 │    96 │
│ 1560442206 │ 2019-06-13 18:10:06 │  1480 │
│ 1560442236 │ 2019-06-13 18:10:36 │    64 │
│ 1560442266 │ 2019-06-13 18:11:06 │  1208 │
│ 1560442296 │ 2019-06-13 18:11:36 │    64 │
│ 1560442326 │ 2019-06-13 18:12:06 │  1192 │
│ 1560442356 │ 2019-06-13 18:12:36 │    64 │
│ 1560442386 │ 2019-06-13 18:13:06 │  1288 │
│ 1560442416 │ 2019-06-13 18:13:36 │    64 │
│ 1560442446 │ 2019-06-13 18:14:06 │  1176 │
│ 1560442476 │ 2019-06-13 18:14:36 │    64 │
│ 1560442506 │ 2019-06-13 18:15:06 │  1312 │
│ 1560442536 │ 2019-06-13 18:15:36 │    64 │
│ 1560442566 │ 2019-06-13 18:16:06 │  1176 │
│ 1560442596 │ 2019-06-13 18:16:36 │    96 │
│ 1560442626 │ 2019-06-13 18:17:06 │  1176 │
│ 1560442656 │ 2019-06-13 18:17:36 │    64 │
│ 1560442686 │ 2019-06-13 18:18:06 │  1208 │
│ 1560442716 │ 2019-06-13 18:18:36 │    64 │
│ 1560442746 │ 2019-06-13 18:19:06 │  1208 │
│ 1560442776 │ 2019-06-13 18:19:36 │    64 │
│ 1560442806 │ 2019-06-13 18:20:06 │  1432 │
│ 1560442836 │ 2019-06-13 18:20:36 │    64 │
│ 1560442866 │ 2019-06-13 18:21:06 │  1176 │
│ 1560442896 │ 2019-06-13 18:21:36 │    64 │
│ 1560442926 │ 2019-06-13 18:22:06 │  1176 │
│ 1560442956 │ 2019-06-13 18:22:36 │    64 │
│ 1560442986 │ 2019-06-13 18:23:06 │  1208 │
│ 1560443016 │ 2019-06-13 18:23:36 │    64 │
│ 1560443046 │ 2019-06-13 18:24:06 │  1208 │
│ 1560443076 │ 2019-06-13 18:24:36 │    64 │
Also, I've found that the same problem occurs on the first cluster as well, but for some reason at a much smaller scale; only some rows are affected.
Cluster1 GraphiteMergeTree:
│ 1560460440 │ 2019-06-13 23:14:00 │   624 │
│ 1560460500 │ 2019-06-13 23:15:00 │    64 │
│ 1560460560 │ 2019-06-13 23:16:00 │    96 │
│ 1560460620 │ 2019-06-13 23:17:00 │   632 │
│ 1560460680 │ 2019-06-13 23:18:00 │    64 │
│ 1560460740 │ 2019-06-13 23:19:00 │    64 │
│ 1560460800 │ 2019-06-13 23:20:00 │   776 │
│ 1560460860 │ 2019-06-13 23:21:00 │   640 │
│ 1560460920 │ 2019-06-13 23:22:00 │    64 │
│ 1560460980 │ 2019-06-13 23:23:00 │   636 │
Cluster1 ReplacingMergeTree:
│ 1560456846 │ 2019-06-13 22:14:06 │  1224 │
│ 1560456876 │ 2019-06-13 22:14:36 │    64 │
│ 1560456906 │ 2019-06-13 22:15:06 │  1352 │
│ 1560456936 │ 2019-06-13 22:15:36 │    96 │
│ 1560456966 │ 2019-06-13 22:16:06 │  1184 │
│ 1560456996 │ 2019-06-13 22:16:36 │    64 │
│ 1560457026 │ 2019-06-13 22:17:06 │  1192 │
│ 1560457056 │ 2019-06-13 22:17:36 │    64 │
│ 1560457086 │ 2019-06-13 22:18:06 │  1184 │
│ 1560457116 │ 2019-06-13 22:18:36 │    64 │
│ 1560457146 │ 2019-06-13 22:19:06 │  1184 │
│ 1560457176 │ 2019-06-13 22:19:36 │    64 │
│ 1560457206 │ 2019-06-13 22:20:06 │  1504 │
│ 1560457236 │ 2019-06-13 22:20:36 │    96 │
│ 1560457266 │ 2019-06-13 22:21:06 │  1200 │
│ 1560457296 │ 2019-06-13 22:21:36 │    64 │
│ 1560457326 │ 2019-06-13 22:22:06 │  1208 │
│ 1560457356 │ 2019-06-13 22:22:36 │    64 │
│ 1560457386 │ 2019-06-13 22:23:06 │  1208 │
│ 1560457416 │ 2019-06-13 22:23:36 │    64 │
So I would bet there is some kind of bug here after all, but why it manifests at different scales I don't know...
I will probably try downgrading to 19.6 to see if it helps...
There were no changes in GraphiteMergeTree after 19.5.
Maybe it would be worth checking whether Timestamp UInt32 DEFAULT toUInt32(now()) CODEC(ZSTD) changes anything.
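A sketch of that change on the existing table (assuming it can be applied in place; it would need to be run on every node):
ALTER TABLE default.graphite
    MODIFY COLUMN Timestamp UInt32 DEFAULT toUInt32(now()) CODEC(ZSTD);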
@Felixoid
Thanks, in my case carbon-clickhouse uploads zero there, so I'll reconfigure it to write a timestamp.
Aside from this, I've turned off replication (used GraphiteMergeTree instead of Replicated...). The averaging errors almost went away and are now in sync between the two clusters.
Cluster1:
SELECT
Time,
toDateTime(Time),
Value
FROM shard_02.graphite
WHERE (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC') AND (Value < 100)
ORDER BY Time ASC
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560501480 │ 2019-06-14 10:38:00 │    96 │
│ 1560501780 │ 2019-06-14 10:43:00 │    96 │
│ 1560546480 │ 2019-06-14 23:08:00 │    64 │
│ 1560546660 │ 2019-06-14 23:11:00 │    64 │
│ 1560547380 │ 2019-06-14 23:23:00 │    64 │
│ 1560553080 │ 2019-06-15 00:58:00 │    64 │
│ 1560627000 │ 2019-06-15 21:30:00 │    64 │
│ 1560627120 │ 2019-06-15 21:32:00 │    64 │
│ 1560627600 │ 2019-06-15 21:40:00 │    64 │
│ 1560630000 │ 2019-06-15 22:20:00 │    64 │
│ 1560675960 │ 2019-06-16 11:06:00 │    64 │
│ 1560724380 │ 2019-06-17 00:33:00 │    64 │
│ 1560742620 │ 2019-06-17 05:37:00 │    96 │
│ 1560758759 │ 2019-06-17 10:05:59 │    64 │
└────────────┴─────────────────────┴───────┘
14 rows in set. Elapsed: 0.006 sec. Processed 46.39 thousand rows, 1.72 MB (7.48 million rows/s., 276.79 MB/s.)
Cluster2:
SELECT
Time,
toDateTime(Time),
Value
FROM shard_02.graphite
WHERE (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC') AND (Value < 100)
ORDER BY Time ASC
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560501480 │ 2019-06-14 10:38:00 │    96 │
│ 1560501780 │ 2019-06-14 10:43:00 │    96 │
│ 1560546480 │ 2019-06-14 23:08:00 │    64 │
│ 1560546660 │ 2019-06-14 23:11:00 │    64 │
│ 1560547380 │ 2019-06-14 23:23:00 │    64 │
│ 1560553080 │ 2019-06-15 00:58:00 │    64 │
│ 1560627000 │ 2019-06-15 21:30:00 │    64 │
│ 1560627120 │ 2019-06-15 21:32:00 │    64 │
│ 1560627600 │ 2019-06-15 21:40:00 │    64 │
│ 1560630000 │ 2019-06-15 22:20:00 │    64 │
│ 1560675960 │ 2019-06-16 11:06:00 │    64 │
│ 1560724380 │ 2019-06-17 00:33:00 │    64 │
│ 1560742620 │ 2019-06-17 05:37:00 │    96 │
│ 1560758759 │ 2019-06-17 10:05:59 │    64 │
└────────────┴─────────────────────┴───────┘
14 rows in set. Elapsed: 0.008 sec. Processed 46.39 thousand rows, 1.72 MB (5.59 million rows/s., 206.86 MB/s.)
They're exactly the same.
Each of these intervals should have averaged to a Value > 500.
Looks like a bug to me after all.
@Felixoid
The timestamp didn't help either.
When the problem manifests, I see the following strange entries in the Graphite table:
│ 1560782520 │ 2019-06-17 16:42:00 │    64 │
│ 1560782580 │ 2019-06-17 16:43:00 │    64 │
│ 1560782580 │ 2019-06-17 16:43:00 │  1168 │
│ 1560782668 │ 2019-06-17 16:44:28 │  1184 │
└────────────┴─────────────────────┴───────┘
│ 1560782520 │ 2019-06-17 16:42:00 │    64 │
│ 1560782580 │ 2019-06-17 16:43:00 │    64 │
│ 1560782640 │ 2019-06-17 16:44:00 │    64 │
│ 1560782640 │ 2019-06-17 16:44:00 │  1184 │
└────────────┴─────────────────────┴───────┘
So, for a brief moment I have two rows with the same Time. Then the next merge probably consumes the second row and leaves only the one with 64. How can they appear?
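A query that should catch those duplicated (Path, Time) pairs while they still exist (just a sketch):
SELECT
    Path,
    Time,
    count() AS c,
    groupArray(Value) AS vals
FROM graphite
GROUP BY Path, Time
HAVING c > 1
ORDER BY Time DESC
LIMIT 20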
Maybe we should add the bug label so somebody can look into it someday...
Try adding an MV with plain MergeTree (from shard_02.graphite) and validate that these duplicates are not coming from carbon-clickhouse.
@den-crane
Will try, but I'm 99.9% sure they aren't.
Because:
One more thing: could you maybe try testing a CH version older than 19.4.0.49, to exclude #4426?
E.g. v19.3.6-stable from here: https://github.com/yandex/ClickHouse/tags?after=v19.1.13-testing
@Felixoid
Recreated the cluster from scratch using 19.3.6, will see by the end of the day whether it changes anything :)
@Felixoid
Nope, same behavior in 19.3.6, two rows with the same timestamp:
│ 1560866760 │ 2019-06-18 16:06:00 │    64 │
│ 1560866760 │ 2019-06-18 16:06:00 │  1160 │
└────────────┴─────────────────────┴───────┘
Then the row with value 1160 is discarded:
│ 1560866700 │ 2019-06-18 16:05:00 │    64 │
│ 1560866760 │ 2019-06-18 16:06:00 │    64 │
│ 1560866820 │ 2019-06-18 16:07:00 │    64 │
While in the adjacent MergeTree table everything is nice:
│ 1560866749 │ 2019-06-18 16:05:49 │    64 │
│ 1560866779 │ 2019-06-18 16:06:19 │  1160 │
│ 1560866809 │ 2019-06-18 16:06:49 │    64 │
│ 1560866839 │ 2019-06-18 16:07:19 │  1160 │
So your PR didn't break it :)
@Felixoid @den-crane
Guys, any ideas which direction to look in? Tested 19.8, same result of course, as the Graphite* part hasn't changed.
This issue manifests all over the place on my CH servers. It's not very visible if the metric doesn't change much over the averaging period (like bandwidth, which stays more or less the same over a minute), but in other cases the data is simply lost after the merge.
I can't pin down the circumstances under which it starts merging incorrectly, but right now on my test server it's wrong >60% of the time:
SELECT (count(*) / 1440) * 100
FROM graphite
WHERE (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC')
AND (Date = yesterday()) AND (Value < 100)
┌─multiply(divide(count(), 1440), 100)─┐
│                    61.80555555555556 │
└──────────────────────────────────────┘
This metric should always average to > 100 over a minute.
I can of course throw away GraphiteMergeTree and organize a chain of tables with different averaging periods, moving the data between them manually, but that looks shitty to me and I'll only go that way as a last resort...
Unfortunately, I don't know where to look in this case. To me it also looks like a bug, and we need help from some real engineers here.
@alexey-milovidov and @filimonov, could you consider this a moderate/major bug?
@alexey-milovidov @filimonov
It would be nice if we could fix it. I went through the Graphite merging functions, but without deeper knowledge of CH it's hard to find where the problem is...
@blind-oracle I need to reproduce your issue somehow. I tried but failed, everything worked fine. Maybe because my data is too fake.
@den-crane I'll try to create a reproducible setup and will get back.
It will probably involve pushing some fake metrics into CH directly or through carbon-clickhouse at around the same rate as we do...
I have the feeling that the issue depends on the ingest rate (in fact, on the frequency of background merges, I guess): the higher the rate, the faster/more frequently it appears.
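If it helps, the merge activity on the table can be inspected like this (a sketch; it assumes part_log is enabled in the server config, and the column set may differ between versions):
SELECT event_time, part_name, merged_from
FROM system.part_log
WHERE (table = 'graphite') AND (event_type = 'MergeParts')
ORDER BY event_time DESC
LIMIT 20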
@den-crane @Felixoid @alexey-milovidov @filimonov
Here's a very dirty script that does the job.
It inserts 6 groups of 35k metrics, one group every 5 seconds, first with value 1 and then the same groups with value 3.
In the end each metric gets two points per minute and should average to 2 over 60 seconds.
Tested on the official Yandex .deb build 19.8.3.8.
CREATE TABLE graphite_crap (
Path String,
Value Float64,
Time UInt32,
Date Date DEFAULT today(),
Timestamp UInt32
) ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Path, Time);
CREATE MATERIALIZED VIEW graphite_crap_mv
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Path, Time)
AS SELECT * FROM graphite_crap;
<graphite_rollup>
<default>
<function>avg</function>
<retention>
<age>0</age>
<precision>60</precision>
</retention>
</default>
</graphite_rollup>
#!/bin/bash
# Build a template file with 35k metric lines of the form "<i>P\tT\tT\tV",
# where P, T and V are placeholders substituted by sed below.
rm -f /tmp/metrics.txt
for i in {1..35000}
do
    echo -e "${i}P\tT\tT\tV" >> /tmp/metrics.txt
done
TS=`date +%s`
while true; do
    for V in 1 3              # each path gets value 1, then value 3
    do
        for P in 1 2 3 4 5 6  # 6 groups of 35k paths
        do
            # Substitute the path suffix, timestamp and value, then insert.
            sed -e "s/P/$P/" -e "s/T/$TS/g" -e "s/V/$V/" /tmp/metrics.txt | clickhouse-client --query='INSERT INTO graphite_crap (Path, Time, Timestamp, Value) FORMAT TSV'
            TS=$((TS+5))
            echo $TS
            sleep 5
        done
    done
done
It goes well at the beginning, but after around 15min I see:
SELECT
Value,
toDateTime(Time)
FROM graphite_crap
WHERE Path = '11'
ORDER BY Time ASC
...
│     2 │ 2019-06-20 22:16:00 │
│     3 │ 2019-06-20 22:17:00 │
│     3 │ 2019-06-20 22:18:00 │
│     2 │ 2019-06-20 22:19:00 │
...
22:17 and 22:18 average to 3, not 2.
While the MV has the correct data:
SELECT
Value,
toDateTime(Time)
FROM graphite_crap_mv
WHERE Path = '11'
ORDER BY Time ASC
...
│     1 │ 2019-06-20 22:17:21 │
│     3 │ 2019-06-20 22:17:51 │
│     1 │ 2019-06-20 22:18:21 │
│     3 │ 2019-06-20 22:18:51 │
...
I've got the same:
┌─Value─┬────toDateTime(Time)─┐
│     1 │ 2019-06-20 21:02:00 │
│     2 │ 2019-06-20 21:03:00 │
│     2 │ 2019-06-20 21:04:00 │
│     2 │ 2019-06-20 21:05:00 │
│     2 │ 2019-06-20 21:06:00 │
│     1 │ 2019-06-20 21:07:00 │
│     1 │ 2019-06-20 21:08:00 │
│     2 │ 2019-06-20 21:09:00 │
│     2 │ 2019-06-20 21:10:00 │
│     2 │ 2019-06-20 21:11:00 │
│     3 │ 2019-06-20 21:12:00 │
└───────┴─────────────────────┘
And created a table like this:
CREATE TABLE graphite_crap (
Path String,
Value Float64,
Time DateTime,
Version UInt64 DEFAULT 0,
Date Date DEFAULT today(),
Timestamp UInt32
) ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Path, Time);
And got:
SELECT
Value,
Time
FROM graphite_crap
WHERE Path = '11'
ORDER BY Time ASC
┌─Value─┬────────────────Time─┐
│     2 │ 2019-06-20 22:02:00 │
│     2 │ 2019-06-20 22:03:00 │
│     2 │ 2019-06-20 22:04:00 │
│     2 │ 2019-06-20 22:05:00 │
│     2 │ 2019-06-20 22:06:00 │
│     1 │ 2019-06-20 22:07:01 │
└───────┴─────────────────────┘
SELECT
Value,
Time
FROM graphite_crap_mv
WHERE Path = '11'
ORDER BY Time ASC
┌─Value─┬────────────────Time─┐
│     1 │ 2019-06-20 22:02:01 │
│     3 │ 2019-06-20 22:02:31 │
│     1 │ 2019-06-20 22:03:01 │
│     3 │ 2019-06-20 22:03:31 │
│     1 │ 2019-06-20 22:04:01 │
│     3 │ 2019-06-20 22:04:31 │
│     1 │ 2019-06-20 22:05:01 │
│     3 │ 2019-06-20 22:05:31 │
│     1 │ 2019-06-20 22:06:01 │
│     3 │ 2019-06-20 22:06:31 │
│     1 │ 2019-06-20 22:07:01 │
└───────┴─────────────────────┘
Oops, too early, it still has the issue.
┌─Value─┬────────────────Time─┐
│     2 │ 2019-06-20 22:02:00 │
│     2 │ 2019-06-20 22:03:00 │
│     2 │ 2019-06-20 22:04:00 │
│     2 │ 2019-06-20 22:05:00 │
│     2 │ 2019-06-20 22:06:00 │
│     3 │ 2019-06-20 22:07:00 │
│     3 │ 2019-06-20 22:08:00 │
│     3 │ 2019-06-20 22:09:00 │
│     2 │ 2019-06-20 22:10:00 │
│     2 │ 2019-06-20 22:11:00 │
│     2 │ 2019-06-20 22:12:00 │
│     2 │ 2019-06-20 22:13:00 │
└───────┴─────────────────────┘
It seems age=120 does the trick. Maybe avg works over incomplete data.
<retention>
<age>120</age>
<precision>60</precision>
</retention>
Yes, increasing the age works, but we should either fix this or forbid setting age <= precision :)
Well, age=0 is a perfectly valid value according to the documentation, and it is shown everywhere: https://clickhouse.yandex/docs/en/operations/table_engines/graphitemergetree/#configuration-example
@Felixoid Yep, that's why I consider this behavior a bug, and it's kinda strange that nobody has noticed it :-/
Either nobody uses age=0 (which I doubt) or nobody checks the numbers... because this happens sooner or later under any load.
No. Age=120 still has the discrepancies.
│     2 │ 2019-06-21 01:14:00 │
│     1 │ 2019-06-21 01:15:00 │
select avg(Value) a, toStartOfMinute(Time) t, Path from graphite_crap where Time >= '2019-06-20 22:55:00' and Time <= '2019-06-21 02:24:00' group by Path, t having a <> 2
3780000 rows in set.
select avg(Value) a, toStartOfMinute(Time) t, Path from graphite_crap_mv where Time >= '2019-06-20 22:55:00' and Time <= '2019-06-21 02:24:00' group by Path, t having a <> 2
0 rows in set.
I think the issue is in the design of how the rollup works. It may roll up incorrectly across different merges, when one minute is split between two different parts.
I saw weird records like:
│     3 │ 2019-06-21 01:14:00 │
│     1 │ 2019-06-21 01:15:28 │
The 1 comes earlier than the 3, but the 3 has already been rolled up and has time :00, while the 1 is still not rolled up.
Another problem: how do you roll up AVG across different merges? The table does not store sum/count.
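A tiny illustration (not the engine's actual code path) of why averaging without stored counts drifts:
-- True average of three raw points that belong to one minute:
SELECT avg(v) FROM (SELECT arrayJoin([1, 3, 3]) AS v)  -- 2.33...
-- If a first merge only saw two of them and wrote their average (2) into the :00 row,
-- a later merge that averages this row with the remaining point gets:
SELECT avg(v) FROM (SELECT arrayJoin([2, 3]) AS v)     -- 2.5, not 2.33...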
@den-crane
No. Age=120 still has the discrepancies.
For now, on my test server with a copy of the production metrics (~30k/sec), I don't see it. Before, the errors were all over the place. But I'll monitor it for a couple of days.
It may roll up incorrectly across different merges, when one minute is split between two different parts.
Well, in my case they're always in different parts: carbon-clickhouse is set to flush an INSERT every 5s, so metrics that arrive at a 30s interval always end up in different parts. And as far as I can see, that in itself is not a problem for the rollup, e.g.:
│  1208 │ 2019-06-21 21:18:00 │
│    64 │ 2019-06-21 21:18:32 │
The part in which the value 1208 arrived has already been merged with some other part (its time rounded to the precision), while the value 64 is still in a fresh, unmerged part. When they merge, the rollup will average them correctly.
Another problem: how do you roll up AVG across different merges? The table does not store sum/count.
Yep, it seems the result will differ depending on how the merges proceed. If all parts holding the values for a given precision period are merged at once, the result will be correct; otherwise it will deviate...
https://t.me/c/1425947904/1706
Denny Crane, [Jun 21, 2019 at 7:39:33 PM (2019-06-21, 7:55:42 PM)]:
Tried again, and here is the MV that watches graphite_rollup (AVG, age = 120, precision = 60):
graphite_crap_mv
│     1 │ 2019-06-21 21:13:28 │
│     3 │ 2019-06-21 21:13:58 │
│     1 │ 2019-06-21 21:14:28 │
│     3 │ 2019-06-21 21:14:58 │
│     1 │ 2019-06-21 21:15:28 │
│     3 │ 2019-06-21 21:15:58 │
It rolled up into
graphite_crap
│     2 │ 2019-06-21 21:13:00 │
│     3 │ 2019-06-21 21:14:00 │
│     2 │ 2019-06-21 21:15:00 │
And at the end there is this record:
graphite_crap
│     1 │ 2019-06-21 22:33:00 │
│     3 │ 2019-06-21 22:33:58 │
i.e. the "1" that was at 22:33:28 got rolled up into 22:33:00
graphite_crap_mv
│     1 │ 2019-06-21 22:33:28 │
│     3 │ 2019-06-21 22:33:58 │
After one more merge it became
│     3 │ 2019-06-21 22:33:00 │
So far I don't see merge errors under production load with age=120 precision=60; I'll keep monitoring and will also run the script again with the new settings.
Yes, using the script, it started merging incorrectly after about 30 minutes.
┌─Value─┬────toDateTime(Time)─┐
│     1 │ 2019-06-22 17:05:00 │
│     2 │ 2019-06-22 17:06:00 │
│     2 │ 2019-06-22 17:07:00 │
...
│     2 │ 2019-06-22 17:39:00 │
│     2 │ 2019-06-22 17:40:00 │
│     1 │ 2019-06-22 17:41:00 │
│     1 │ 2019-06-22 17:42:00 │
│     1 │ 2019-06-22 17:43:00 │
...
On the two servers with real metrics, meanwhile, it merges fine.
Errors showed up in production too, but very few.
│  1192 │ 2019-06-23 19:57:00 │
│    64 │ 2019-06-23 19:57:00 │
I have the feeling that CH runs two parallel merges on the same table, and as a result we get two values, each averaged to the start of the minute... But in theory it shouldn't run two merges on the same table in parallel. Or does it?
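One way to check the parallel-merges hypothesis while merges are running (a sketch):
SELECT database, table, elapsed, progress, num_parts, result_part_name
FROM system.merges
WHERE table = 'graphite'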