Hello! I'm not sure this is a bug, so let's start with a question first.
ClickHouse 19.7.3.9
I've got two 100% identical CH clusters and a carbon-clickhouse daemon that pushes the metrics into both of them. It uploads the same RowBinary-formatted file to both clusters.
So the data which ends up in the table should be the same in both clusters.
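For reference, here is a minimal sketch of how the ingested data could be compared on both clusters before any rollup kicks in (the date filter is just an example; the table is defined below):
SELECT
    count() AS rows,
    sum(cityHash64(Path, Time, Value)) AS rows_hash
FROM graphite
WHERE Date = '2019-06-09'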
Rollup:
<graphite_rollup>
<default>
<function>avg</function>
<retention>
<age>0</age>
<precision>60</precision>
</retention>
<retention>
<age>2592000</age>
<precision>300</precision>
</retention>
</default>
</graphite_rollup>
Table:
CREATE TABLE IF NOT EXISTS default.graphite (
Path String CODEC(ZSTD),
Value Float64 CODEC(ZSTD),
Time UInt32 CODEC(ZSTD),
Date Date CODEC(ZSTD),
Timestamp UInt32 CODEC(ZSTD)
) ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMM(Date)
ORDER BY (Path, Time);
I'll just take some example metric and check the results:
SELECT
toDateTime(Time),
Value
FROM graphite
WHERE (Date = '2019-06-09') AND (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC')
ORDER BY Time ASC
LIMIT 10
Cluster1:
┌────toDateTime(Time)─┬─Value─┐
│ 2019-06-09 00:00:00 │    64 │
│ 2019-06-09 00:01:00 │    64 │
│ 2019-06-09 00:02:00 │    64 │
│ 2019-06-09 00:03:00 │    64 │
│ 2019-06-09 00:04:00 │    64 │
│ 2019-06-09 00:05:00 │    64 │
│ 2019-06-09 00:06:00 │    64 │
│ 2019-06-09 00:07:00 │    64 │
│ 2019-06-09 00:08:00 │    64 │
│ 2019-06-09 00:09:00 │    64 │
└─────────────────────┴───────┘
Cluster2:
┌────toDateTime(Time)─┬─Value─┐
│ 2019-06-09 00:00:00 │   748 │
│ 2019-06-09 00:01:00 │   684 │
│ 2019-06-09 00:02:00 │   636 │
│ 2019-06-09 00:03:00 │   628 │
│ 2019-06-09 00:04:00 │   600 │
│ 2019-06-09 00:05:00 │   712 │
│ 2019-06-09 00:06:00 │   624 │
│ 2019-06-09 00:07:00 │   628 │
│ 2019-06-09 00:08:00 │   624 │
│ 2019-06-09 00:09:00 │   608 │
└─────────────────────┴───────┘
This 64..64..64.. looks strange to me.
And the data differs only in some time ranges; mostly it's the same.
I ran OPTIMIZE FINAL on the table in both clusters just to be sure.
Any idea where this difference might come from?
OK, I've found out where the 64s come from: I get metrics every 30s, like this:
┌────toDateTime(Time)─┬─Value─┐
│ 2019-06-11 13:13:59 │    64 │
│ 2019-06-11 13:14:29 │  1168 │
│ 2019-06-11 13:14:59 │    64 │
│ 2019-06-11 13:15:29 │  1336 │
│ 2019-06-11 13:15:59 │    64 │
│ 2019-06-11 13:16:29 │  1168 │
└─────────────────────┴───────┘
So I get exactly 2 metrics every minute; when averaging to precision 60 I should get an average of (64+X)/2 all the time, but for some reason I don't.
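To make the expectation concrete, here is a sketch of the manual cross-check (same Path as above, run against rows that have not been rolled up yet):
SELECT
    intDiv(Time, 60) * 60 AS minute,
    avg(Value) AS expected_avg,
    count() AS points
FROM graphite
WHERE Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC'
GROUP BY minute
ORDER BY minute ASC
LIMIT 10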
Does GraphiteMergeTree use local time somewhere in the rollup so that NTP can interfere?
@Felixoid can you answer?
It could be because the function any is applied instead of avg.
I don't see the required Version column and the column description:
<path_column_name>Path</path_column_name>
<time_column_name>Time</time_column_name>
<value_column_name>Value</value_column_name>
<version_column_name>Version</version_column_name>
Check that <graphite_rollup> is identical on both clusters.
Thank you, Denis.
I'm also not sure what is meant by "carbon-clickhouse daemon that pushes the metrics into both of them". As far as I can see, carbon-clickhouse doesn't push metrics to multiple destinations.
@den-crane
It could be because the function any is applied instead of avg.
It is avg; I've quoted the graphite_rollup part of config.xml above.
I don't see the required Version column and the column description
The docs have been wrong for a long time.
By default CH expects Timestamp instead of Version.
Check that <graphite_rollup> is identical on both clusters
Checked, it's identical; the configuration is rolled out to both clusters by Ansible.
@Felixoid
As far as I can see, carbon-clickhouse doesn't push metrics to multiple destinations.
It does if you configure it to.
It caches the metrics in a local file in CH-compatible RowBinary format and then uploads it to any number of destinations; in my case it's 2 clusters.
As for local time: after reading the CH sources, the rollup does not take local time into account, it just rounds Time down to the nearest precision boundary (something like new_ts = ts / precision * precision), so to my knowledge NTP and other time issues should not matter here.
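In SQL terms the rounding is roughly equivalent to this (just an illustration, not the engine's actual code):
SELECT
    1560442386 AS ts,                   -- 2019-06-13 18:13:06
    intDiv(ts, 60) * 60 AS rounded_ts,  -- 1560442380 = 2019-06-13 18:13:00
    toDateTime(rounded_ts)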
OK, I've seen it in the code: it accepts a slice of upload configs.
Here's the only usage of time during a merge; it is used to select the precision: https://github.com/yandex/ClickHouse/blob/master/dbms/src/DataStreams/GraphiteRollupSortedBlockInputStream.cpp#L110
Are you sure that carbon-clickhouse manipulates the date locally? I don't see anything like that in the config or code after a quick search.
Here's the only usage of time during a merge; it is used to select the precision
Yeah, it's there just to store the row as-is (with precision=1) if the age is not yet reached. In my case it probably should not matter.
Are you sure that carbon-clickhouse manipulates the date locally? I don't see anything like that in the config or code after a quick search.
It does not manipulate the timestamp that goes into the Time column; it's taken from the incoming Carbon metric as-is.
Optionally it can put the Unix timestamp of when the metric was received into the Timestamp (aka Version) column, but I don't use that and it's always 0.
I'm now putting all the data into a ReplacingMergeTree table as-is and then transferring it manually into GraphiteMergeTree with INSERT ... SELECT and OPTIMIZE FINAL to try to catch this behavior, but so far it works perfectly...
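Roughly, the transfer looks like this (a sketch; the staging table name graphite_raw and the partition are just examples):
-- graphite_raw is the ReplacingMergeTree staging table, graphite is the GraphiteMergeTree target.
INSERT INTO graphite (Path, Value, Time, Date, Timestamp)
SELECT Path, Value, Time, Date, Timestamp
FROM graphite_raw;
-- Force the rollup so the result can be inspected right away.
OPTIMIZE TABLE graphite PARTITION 201906 FINAL;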
Well, by moving data between tables I couldn't reproduce the issue, but it probably takes some time to manifest itself.
I've recreated all tables from scratch and started ingesting metrics as usual. It took about 7 hours before it started merging with errors:
Cluster 2
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560338760 │ 2019-06-12 13:26:00 │   636 │
│ 1560338820 │ 2019-06-12 13:27:00 │  1192 │
│ 1560338880 │ 2019-06-12 13:28:00 │   664 │
│ 1560338940 │ 2019-06-12 13:29:00 │   624 │
│ 1560339000 │ 2019-06-12 13:30:00 │   760 │
│ 1560339060 │ 2019-06-12 13:31:00 │   668 │
...
... ok for several hours, then at 21:11 errors begin
...
│ 1560366480 │ 2019-06-12 21:08:00 │   636 │
│ 1560366540 │ 2019-06-12 21:09:00 │   624 │
│ 1560366600 │ 2019-06-12 21:10:00 │   812 │
│ 1560366660 │ 2019-06-12 21:11:00 │    64 │
│ 1560366720 │ 2019-06-12 21:12:00 │    64 │
│ 1560366780 │ 2019-06-12 21:13:00 │    64 │
│ 1560366840 │ 2019-06-12 21:14:00 │    64 │
│ 1560366900 │ 2019-06-12 21:15:00 │    64 │
│ 1560366960 │ 2019-06-12 21:16:00 │    64 │
│ 1560367020 │ 2019-06-12 21:17:00 │    64 │
...
... then at 02:04 ok again
...
│ 1560383940 │ 2019-06-13 01:59:00 │    64 │
│ 1560384000 │ 2019-06-13 02:00:00 │   760 │
│ 1560384060 │ 2019-06-13 02:01:00 │    96 │
│ 1560384120 │ 2019-06-13 02:02:00 │   616 │
│ 1560384180 │ 2019-06-13 02:03:00 │    64 │
│ 1560384240 │ 2019-06-13 02:04:00 │   624 │
│ 1560384300 │ 2019-06-13 02:05:00 │   712 │
│ 1560384360 │ 2019-06-13 02:06:00 │   652 │
│ 1560384420 │ 2019-06-13 02:07:00 │   636 │
│ 1560384480 │ 2019-06-13 02:08:00 │   636 │
...
Cluster 1 looks OK all the time:
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560338760 │ 2019-06-12 13:26:00 │   636 │
│ 1560338820 │ 2019-06-12 13:27:00 │  1192 │
│ 1560338880 │ 2019-06-12 13:28:00 │   664 │
│ 1560338940 │ 2019-06-12 13:29:00 │   624 │
│ 1560339000 │ 2019-06-12 13:30:00 │   760 │
│ 1560339060 │ 2019-06-12 13:31:00 │   668 │
...
│ 1560366480 │ 2019-06-12 21:08:00 │   636 │
│ 1560366540 │ 2019-06-12 21:09:00 │   624 │
│ 1560366600 │ 2019-06-12 21:10:00 │   812 │
│ 1560366660 │ 2019-06-12 21:11:00 │   636 │
│ 1560366720 │ 2019-06-12 21:12:00 │   628 │
│ 1560366780 │ 2019-06-12 21:13:00 │   628 │
│ 1560366840 │ 2019-06-12 21:14:00 │   628 │
│ 1560366900 │ 2019-06-12 21:15:00 │   692 │
│ 1560366960 │ 2019-06-12 21:16:00 │   636 │
│ 1560367020 │ 2019-06-12 21:17:00 │   624 │
...
│ 1560383940 │ 2019-06-13 01:59:00 │   616 │
│ 1560384000 │ 2019-06-13 02:00:00 │   760 │
│ 1560384060 │ 2019-06-13 02:01:00 │    96 │
│ 1560384120 │ 2019-06-13 02:02:00 │   616 │
│ 1560384180 │ 2019-06-13 02:03:00 │    64 │
│ 1560384240 │ 2019-06-13 02:04:00 │   624 │
│ 1560384300 │ 2019-06-13 02:05:00 │   712 │
│ 1560384360 │ 2019-06-13 02:06:00 │   652 │
│ 1560384420 │ 2019-06-13 02:07:00 │   636 │
│ 1560384480 │ 2019-06-13 02:08:00 │   636 │
...
I don't see any errors in the logs (neither around this particular time nor any other).
Any other ideas? I will probably try downgrading to 19.6 to see if it helps...
As an additional consistency check, you could create a materialized view with a MergeTree engine and see if the data is the same in both clusters.
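A minimal sketch of such a check view (the name graphite_check_mv is just an example; it only captures inserts made after it is created):
CREATE MATERIALIZED VIEW graphite_check_mv
ENGINE = MergeTree()
PARTITION BY toYYYYMM(Date)
ORDER BY (Path, Time)
AS SELECT * FROM graphite;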
@Felixoid
Thanks, I will try that as the next step.
For now I've created two tables in each cluster, one GraphiteMergeTree and one ReplacingMergeTree, and I'm ingesting the same data into all of them.
Got no other ideas for now...
With Timestamp=0, ReplacingMergeTree doesn't make a lot of sense, IMHO.
@Felixoid
It's there just in case a duplicate metric arrives with the same Path, Time (which are the sorting key). For debugging this problem it should not matter much, agreed.
@Felixoid
Well, it happened again in the same scenario.
Cluster2 GraphiteMergeTree:
│ 1560441780 │ 2019-06-13 18:03:00 │   636 │
│ 1560441840 │ 2019-06-13 18:04:00 │   612 │
│ 1560441900 │ 2019-06-13 18:05:00 │   696 │
│ 1560441960 │ 2019-06-13 18:06:00 │   636 │
│ 1560442020 │ 2019-06-13 18:07:00 │   664 │
│ 1560442080 │ 2019-06-13 18:08:00 │   636 │
│ 1560442140 │ 2019-06-13 18:09:00 │   636 │
│ 1560442200 │ 2019-06-13 18:10:00 │   772 │
│ 1560442260 │ 2019-06-13 18:11:00 │   636 │
│ 1560442320 │ 2019-06-13 18:12:00 │   628 │
│ 1560442380 │ 2019-06-13 18:13:00 │    64 │
│ 1560442440 │ 2019-06-13 18:14:00 │    64 │
│ 1560442500 │ 2019-06-13 18:15:00 │    64 │
│ 1560442560 │ 2019-06-13 18:16:00 │    96 │
│ 1560442620 │ 2019-06-13 18:17:00 │    64 │
│ 1560442680 │ 2019-06-13 18:18:00 │    64 │
│ 1560442740 │ 2019-06-13 18:19:00 │    64 │
│ 1560442800 │ 2019-06-13 18:20:00 │    64 │
│ 1560442860 │ 2019-06-13 18:21:00 │    64 │
│ 1560442920 │ 2019-06-13 18:22:00 │    64 │
│ 1560442980 │ 2019-06-13 18:23:00 │    64 │
│ 1560443040 │ 2019-06-13 18:24:00 │    64 │
But the adjacent ReplacingMergeTree table has good data.
Cluster2 ReplacingMergeTree:
│ 1560441786 │ 2019-06-13 18:03:06 │  1208 │
│ 1560441816 │ 2019-06-13 18:03:36 │    64 │
│ 1560441846 │ 2019-06-13 18:04:06 │  1160 │
│ 1560441876 │ 2019-06-13 18:04:36 │    64 │
│ 1560441906 │ 2019-06-13 18:05:06 │  1328 │
│ 1560441936 │ 2019-06-13 18:05:36 │    64 │
│ 1560441966 │ 2019-06-13 18:06:06 │  1208 │
│ 1560441996 │ 2019-06-13 18:06:36 │    64 │
│ 1560442026 │ 2019-06-13 18:07:06 │  1264 │
│ 1560442056 │ 2019-06-13 18:07:36 │    64 │
│ 1560442086 │ 2019-06-13 18:08:06 │  1208 │
│ 1560442116 │ 2019-06-13 18:08:36 │    64 │
│ 1560442146 │ 2019-06-13 18:09:06 │  1176 │
│ 1560442176 │ 2019-06-13 18:09:36 │    96 │
│ 1560442206 │ 2019-06-13 18:10:06 │  1480 │
│ 1560442236 │ 2019-06-13 18:10:36 │    64 │
│ 1560442266 │ 2019-06-13 18:11:06 │  1208 │
│ 1560442296 │ 2019-06-13 18:11:36 │    64 │
│ 1560442326 │ 2019-06-13 18:12:06 │  1192 │
│ 1560442356 │ 2019-06-13 18:12:36 │    64 │
│ 1560442386 │ 2019-06-13 18:13:06 │  1288 │
│ 1560442416 │ 2019-06-13 18:13:36 │    64 │
│ 1560442446 │ 2019-06-13 18:14:06 │  1176 │
│ 1560442476 │ 2019-06-13 18:14:36 │    64 │
│ 1560442506 │ 2019-06-13 18:15:06 │  1312 │
│ 1560442536 │ 2019-06-13 18:15:36 │    64 │
│ 1560442566 │ 2019-06-13 18:16:06 │  1176 │
│ 1560442596 │ 2019-06-13 18:16:36 │    96 │
│ 1560442626 │ 2019-06-13 18:17:06 │  1176 │
│ 1560442656 │ 2019-06-13 18:17:36 │    64 │
│ 1560442686 │ 2019-06-13 18:18:06 │  1208 │
│ 1560442716 │ 2019-06-13 18:18:36 │    64 │
│ 1560442746 │ 2019-06-13 18:19:06 │  1208 │
│ 1560442776 │ 2019-06-13 18:19:36 │    64 │
│ 1560442806 │ 2019-06-13 18:20:06 │  1432 │
│ 1560442836 │ 2019-06-13 18:20:36 │    64 │
│ 1560442866 │ 2019-06-13 18:21:06 │  1176 │
│ 1560442896 │ 2019-06-13 18:21:36 │    64 │
│ 1560442926 │ 2019-06-13 18:22:06 │  1176 │
│ 1560442956 │ 2019-06-13 18:22:36 │    64 │
│ 1560442986 │ 2019-06-13 18:23:06 │  1208 │
│ 1560443016 │ 2019-06-13 18:23:36 │    64 │
│ 1560443046 │ 2019-06-13 18:24:06 │  1208 │
│ 1560443076 │ 2019-06-13 18:24:36 │    64 │
Also, I've found that the same problem occurs on the first cluster as well, but for some reason at a much smaller scale; only some rows are affected.
Cluster1 GraphiteMergeTree:
│ 1560460440 │ 2019-06-13 23:14:00 │   624 │
│ 1560460500 │ 2019-06-13 23:15:00 │    64 │
│ 1560460560 │ 2019-06-13 23:16:00 │    96 │
│ 1560460620 │ 2019-06-13 23:17:00 │   632 │
│ 1560460680 │ 2019-06-13 23:18:00 │    64 │
│ 1560460740 │ 2019-06-13 23:19:00 │    64 │
│ 1560460800 │ 2019-06-13 23:20:00 │   776 │
│ 1560460860 │ 2019-06-13 23:21:00 │   640 │
│ 1560460920 │ 2019-06-13 23:22:00 │    64 │
│ 1560460980 │ 2019-06-13 23:23:00 │   636 │
Cluster1 ReplacingMergeTree:
│ 1560456846 │ 2019-06-13 22:14:06 │  1224 │
│ 1560456876 │ 2019-06-13 22:14:36 │    64 │
│ 1560456906 │ 2019-06-13 22:15:06 │  1352 │
│ 1560456936 │ 2019-06-13 22:15:36 │    96 │
│ 1560456966 │ 2019-06-13 22:16:06 │  1184 │
│ 1560456996 │ 2019-06-13 22:16:36 │    64 │
│ 1560457026 │ 2019-06-13 22:17:06 │  1192 │
│ 1560457056 │ 2019-06-13 22:17:36 │    64 │
│ 1560457086 │ 2019-06-13 22:18:06 │  1184 │
│ 1560457116 │ 2019-06-13 22:18:36 │    64 │
│ 1560457146 │ 2019-06-13 22:19:06 │  1184 │
│ 1560457176 │ 2019-06-13 22:19:36 │    64 │
│ 1560457206 │ 2019-06-13 22:20:06 │  1504 │
│ 1560457236 │ 2019-06-13 22:20:36 │    96 │
│ 1560457266 │ 2019-06-13 22:21:06 │  1200 │
│ 1560457296 │ 2019-06-13 22:21:36 │    64 │
│ 1560457326 │ 2019-06-13 22:22:06 │  1208 │
│ 1560457356 │ 2019-06-13 22:22:36 │    64 │
│ 1560457386 │ 2019-06-13 22:23:06 │  1208 │
│ 1560457416 │ 2019-06-13 22:23:36 │    64 │
So I would bet there is some kind of bug here after all, but why it manifests at different scales I don't know...
I will probably try downgrading to 19.6 to see if it helps...
There were no changes in GraphiteMergeTree after 19.5.
Maybe it would be worth checking whether Timestamp UInt32 DEFAULT toUInt32(now()) CODEC(ZSTD) changes anything.
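A sketch of that change on the existing table (assuming it can be applied in place; it would need to be run on every node):
ALTER TABLE default.graphite
    MODIFY COLUMN Timestamp UInt32 DEFAULT toUInt32(now()) CODEC(ZSTD);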
@Felixoid
Thanks, in my case carbon-clickhouse uploads zero there, so I'll reconfigure it to write a timestamp.
Aside from this, I've turned off replication (used GraphiteMergeTree instead of Replicated...). The averaging errors almost went away and are now in sync between the two clusters.
Cluster1:
SELECT
Time,
toDateTime(Time),
Value
FROM shard_02.graphite
WHERE (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC') AND (Value < 100)
ORDER BY Time ASC
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560501480 │ 2019-06-14 10:38:00 │    96 │
│ 1560501780 │ 2019-06-14 10:43:00 │    96 │
│ 1560546480 │ 2019-06-14 23:08:00 │    64 │
│ 1560546660 │ 2019-06-14 23:11:00 │    64 │
│ 1560547380 │ 2019-06-14 23:23:00 │    64 │
│ 1560553080 │ 2019-06-15 00:58:00 │    64 │
│ 1560627000 │ 2019-06-15 21:30:00 │    64 │
│ 1560627120 │ 2019-06-15 21:32:00 │    64 │
│ 1560627600 │ 2019-06-15 21:40:00 │    64 │
│ 1560630000 │ 2019-06-15 22:20:00 │    64 │
│ 1560675960 │ 2019-06-16 11:06:00 │    64 │
│ 1560724380 │ 2019-06-17 00:33:00 │    64 │
│ 1560742620 │ 2019-06-17 05:37:00 │    96 │
│ 1560758759 │ 2019-06-17 10:05:59 │    64 │
└────────────┴─────────────────────┴───────┘
14 rows in set. Elapsed: 0.006 sec. Processed 46.39 thousand rows, 1.72 MB (7.48 million rows/s., 276.79 MB/s.)
Cluster2:
SELECT
Time,
toDateTime(Time),
Value
FROM shard_02.graphite
WHERE (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC') AND (Value < 100)
ORDER BY Time ASC
┌───────Time─┬────toDateTime(Time)─┬─Value─┐
│ 1560501480 │ 2019-06-14 10:38:00 │    96 │
│ 1560501780 │ 2019-06-14 10:43:00 │    96 │
│ 1560546480 │ 2019-06-14 23:08:00 │    64 │
│ 1560546660 │ 2019-06-14 23:11:00 │    64 │
│ 1560547380 │ 2019-06-14 23:23:00 │    64 │
│ 1560553080 │ 2019-06-15 00:58:00 │    64 │
│ 1560627000 │ 2019-06-15 21:30:00 │    64 │
│ 1560627120 │ 2019-06-15 21:32:00 │    64 │
│ 1560627600 │ 2019-06-15 21:40:00 │    64 │
│ 1560630000 │ 2019-06-15 22:20:00 │    64 │
│ 1560675960 │ 2019-06-16 11:06:00 │    64 │
│ 1560724380 │ 2019-06-17 00:33:00 │    64 │
│ 1560742620 │ 2019-06-17 05:37:00 │    96 │
│ 1560758759 │ 2019-06-17 10:05:59 │    64 │
└────────────┴─────────────────────┴───────┘
14 rows in set. Elapsed: 0.008 sec. Processed 46.39 thousand rows, 1.72 MB (5.59 million rows/s., 206.86 MB/s.)
They're exactly the same.
Each of these intervals should have averaged to a Value > 500.
Looks like a bug to me after all.
@Felixoid
The timestamp didn't help either.
When the problem manifests, I see the following strange entries in the Graphite table:
│ 1560782520 │ 2019-06-17 16:42:00 │    64 │
│ 1560782580 │ 2019-06-17 16:43:00 │    64 │
│ 1560782580 │ 2019-06-17 16:43:00 │  1168 │
│ 1560782668 │ 2019-06-17 16:44:28 │  1184 │
└────────────┴─────────────────────┴───────┘
│ 1560782520 │ 2019-06-17 16:42:00 │    64 │
│ 1560782580 │ 2019-06-17 16:43:00 │    64 │
│ 1560782640 │ 2019-06-17 16:44:00 │    64 │
│ 1560782640 │ 2019-06-17 16:44:00 │  1184 │
└────────────┴─────────────────────┴───────┘
So, for a brief moment I have two rows with the same Time. Then the next merge probably consumes the second row and leaves only the one with 64. How can they appear?
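A query that should catch those duplicated (Path, Time) pairs while they still exist (just a sketch):
SELECT
    Path,
    Time,
    count() AS c,
    groupArray(Value) AS vals
FROM graphite
GROUP BY Path, Time
HAVING c > 1
ORDER BY Time DESC
LIMIT 20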
Maybe we should add the bug label so somebody can look into it someday...
Try adding an MV with plain MergeTree (from shard_02.graphite) and validate that these duplicates are not coming from carbon-clickhouse.
@den-crane
Will try, but I'm 99.9% sure they aren't.
Because:
One more thing: could you maybe try testing a CH version older than 19.4.0.49, to exclude #4426?
E.g. v19.3.6-stable from here: https://github.com/yandex/ClickHouse/tags?after=v19.1.13-testing
@Felixoid
Recreated the cluster from scratch using 19.3.6, will see by the end of the day whether it changes anything :)
@Felixoid
Nope, same behavior in 19.3.6, two rows with the same timestamp:
│ 1560866760 │ 2019-06-18 16:06:00 │    64 │
│ 1560866760 │ 2019-06-18 16:06:00 │  1160 │
└────────────┴─────────────────────┴───────┘
Then the row with value 1160 is discarded:
│ 1560866700 │ 2019-06-18 16:05:00 │    64 │
│ 1560866760 │ 2019-06-18 16:06:00 │    64 │
│ 1560866820 │ 2019-06-18 16:07:00 │    64 │
While in the adjacent MergeTree table everything is nice:
│ 1560866749 │ 2019-06-18 16:05:49 │    64 │
│ 1560866779 │ 2019-06-18 16:06:19 │  1160 │
│ 1560866809 │ 2019-06-18 16:06:49 │    64 │
│ 1560866839 │ 2019-06-18 16:07:19 │  1160 │
So your PR didn't break it :)
@Felixoid @den-crane
Guys, any ideas which direction to look in? Tested 19.8, same result of course, as the Graphite* part hasn't changed.
This issue manifests all over the place on my CH servers. It's not very visible if the metric doesn't change much over the averaging period (like bandwidth, which stays more or less the same over a minute), but in other cases the data is simply lost after the merge.
I can't pin down the circumstances under which it starts merging incorrectly, but right now on my test server it's wrong >60% of the time:
SELECT (count(*) / 1440) * 100
FROM graphite
WHERE (Path = 'bw_out.bond3.nic.network.tvhosa-pez0402.4.HER.OT.PROD.SC')
AND (Date = yesterday()) AND (Value < 100)
┌─multiply(divide(count(), 1440), 100)─┐
│                    61.80555555555556 │
└──────────────────────────────────────┘
This metric should always average to > 100 over a minute.
I can of course throw away GraphiteMergeTree and organize a chain of tables with different averaging periods, moving the data between them manually, but that looks shitty to me and I'll only go that way as a last resort...
Unfortunately, I don't know where to look in this case. To me it also looks like a bug, and we need help from some real engineers here.
@alexey-milovidov and @filimonov, could you consider this a moderate/major bug?
@alexey-milovidov @filimonov
It would be nice if we could fix it. I went through the Graphite merging functions, but without deeper knowledge of CH it's hard to find where the problem is...
@blind-oracle I need to reproduce your issue somehow. I tried but failed, everything worked fine. Maybe because my data is too fake.
@den-crane I'll try to create a reproducible setup and will get back.
It will probably involve pushing some fake metrics into CH directly or through carbon-clickhouse at around the same rate as we do...
I have the feeling that the issue depends on the ingest rate (in fact, on the frequency of background merges, I guess): the higher the rate, the faster/more frequently it appears.
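If it helps, the merge activity on the table can be inspected like this (a sketch; it assumes part_log is enabled in the server config, and the column set may differ between versions):
SELECT event_time, part_name, merged_from
FROM system.part_log
WHERE (table = 'graphite') AND (event_type = 'MergeParts')
ORDER BY event_time DESC
LIMIT 20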
@den-crane @Felixoid @alexey-milovidov @filimonov
Here's a very dirty script that does the job.
It inserts 6 groups of 35k metrics, one group every 5 seconds, first with value 1 and then the same groups with value 3.
In the end each metric gets two points per minute and should average to 2 over 60 seconds.
Tested on the official Yandex .deb build 19.8.3.8.
CREATE TABLE graphite_crap (
Path String,
Value Float64,
Time UInt32,
Date Date DEFAULT today(),
Timestamp UInt32
) ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Path, Time);
CREATE MATERIALIZED VIEW graphite_crap_mv
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Path, Time)
AS SELECT * FROM graphite_crap;
<graphite_rollup>
<default>
<function>avg</function>
<retention>
<age>0</age>
<precision>60</precision>
</retention>
</default>
</graphite_rollup>
#!/bin/bash
# Build a template file with 35k metric lines of the form "<i>P\tT\tT\tV",
# where P, T and V are placeholders substituted by sed below.
rm -f /tmp/metrics.txt
for i in {1..35000}
do
    echo -e "${i}P\tT\tT\tV" >> /tmp/metrics.txt
done
TS=`date +%s`
while true; do
    for V in 1 3              # each path gets value 1, then value 3
    do
        for P in 1 2 3 4 5 6  # 6 groups of 35k paths
        do
            # Substitute the path suffix, timestamp and value, then insert.
            sed -e "s/P/$P/" -e "s/T/$TS/g" -e "s/V/$V/" /tmp/metrics.txt | clickhouse-client --query='INSERT INTO graphite_crap (Path, Time, Timestamp, Value) FORMAT TSV'
            TS=$((TS+5))
            echo $TS
            sleep 5
        done
    done
done
It goes well at the beginning, but after around 15min I see:
SELECT
Value,
toDateTime(Time)
FROM graphite_crap
WHERE Path = '11'
ORDER BY Time ASC
...
│     2 │ 2019-06-20 22:16:00 │
│     3 │ 2019-06-20 22:17:00 │
│     3 │ 2019-06-20 22:18:00 │
│     2 │ 2019-06-20 22:19:00 │
...
22:17 and 22:18 average to 3, not 2.
While the MV has the correct data:
SELECT
Value,
toDateTime(Time)
FROM graphite_crap_mv
WHERE Path = '11'
ORDER BY Time ASC
...
│     1 │ 2019-06-20 22:17:21 │
│     3 │ 2019-06-20 22:17:51 │
│     1 │ 2019-06-20 22:18:21 │
│     3 │ 2019-06-20 22:18:51 │
...
I've got the same:
┌─Value─┬────toDateTime(Time)─┐
│     1 │ 2019-06-20 21:02:00 │
│     2 │ 2019-06-20 21:03:00 │
│     2 │ 2019-06-20 21:04:00 │
│     2 │ 2019-06-20 21:05:00 │
│     2 │ 2019-06-20 21:06:00 │
│     1 │ 2019-06-20 21:07:00 │
│     1 │ 2019-06-20 21:08:00 │
│     2 │ 2019-06-20 21:09:00 │
│     2 │ 2019-06-20 21:10:00 │
│     2 │ 2019-06-20 21:11:00 │
│     3 │ 2019-06-20 21:12:00 │
└───────┴─────────────────────┘
And created a table like this:
CREATE TABLE graphite_crap (
Path String,
Value Float64,
Time DateTime,
Version UInt64 DEFAULT 0,
Date Date DEFAULT today(),
Timestamp UInt32
) ENGINE = GraphiteMergeTree('graphite_rollup')
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Path, Time);
And got:
SELECT
Value,
Time
FROM graphite_crap
WHERE Path = '11'
ORDER BY Time ASC
┌─Value─┬────────────────Time─┐
│     2 │ 2019-06-20 22:02:00 │
│     2 │ 2019-06-20 22:03:00 │
│     2 │ 2019-06-20 22:04:00 │
│     2 │ 2019-06-20 22:05:00 │
│     2 │ 2019-06-20 22:06:00 │
│     1 │ 2019-06-20 22:07:01 │
└───────┴─────────────────────┘
SELECT
Value,
Time
FROM graphite_crap_mv
WHERE Path = '11'
ORDER BY Time ASC
┌─Value─┬────────────────Time─┐
│     1 │ 2019-06-20 22:02:01 │
│     3 │ 2019-06-20 22:02:31 │
│     1 │ 2019-06-20 22:03:01 │
│     3 │ 2019-06-20 22:03:31 │
│     1 │ 2019-06-20 22:04:01 │
│     3 │ 2019-06-20 22:04:31 │
│     1 │ 2019-06-20 22:05:01 │
│     3 │ 2019-06-20 22:05:31 │
│     1 │ 2019-06-20 22:06:01 │
│     3 │ 2019-06-20 22:06:31 │
│     1 │ 2019-06-20 22:07:01 │
└───────┴─────────────────────┘
Oops, too early, it still has the issue.
┌─Value─┬────────────────Time─┐
│     2 │ 2019-06-20 22:02:00 │
│     2 │ 2019-06-20 22:03:00 │
│     2 │ 2019-06-20 22:04:00 │
│     2 │ 2019-06-20 22:05:00 │
│     2 │ 2019-06-20 22:06:00 │
│     3 │ 2019-06-20 22:07:00 │
│     3 │ 2019-06-20 22:08:00 │
│     3 │ 2019-06-20 22:09:00 │
│     2 │ 2019-06-20 22:10:00 │
│     2 │ 2019-06-20 22:11:00 │
│     2 │ 2019-06-20 22:12:00 │
│     2 │ 2019-06-20 22:13:00 │
└───────┴─────────────────────┘
It seems age=120 does the trick. Maybe avg works over incomplete data.
<retention>
<age>120</age>
<precision>60</precision>
</retention>
Yes, increasing the age works, but we should either fix this or forbid setting age <= precision :)
Well, age=0 is a perfectly valid value according to the documentation, and it is shown everywhere: https://clickhouse.yandex/docs/en/operations/table_engines/graphitemergetree/#configuration-example
@Felixoid Yep, that's why I consider this behavior a bug, and it's kinda strange that nobody has noticed it :-/
Either nobody uses age=0 (which I doubt) or nobody checks the numbers... because this happens sooner or later under any load.
No. Age=120 still has the discrepancies.
│     2 │ 2019-06-21 01:14:00 │
│     1 │ 2019-06-21 01:15:00 │
select avg(Value) a, toStartOfMinute(Time) t, Path from graphite_crap where Time >= '2019-06-20 22:55:00' and Time <= '2019-06-21 02:24:00' group by Path, t having a <> 2
3780000 rows in set.
select avg(Value) a, toStartOfMinute(Time) t, Path from graphite_crap_mv where Time >= '2019-06-20 22:55:00' and Time <= '2019-06-21 02:24:00' group by Path, t having a <> 2
0 rows in set.
I think the issue is in the design of how the rollup works. It may roll up incorrectly across different merges, when one minute is split between two different parts.
I saw weird records like:
│     3 │ 2019-06-21 01:14:00 │
│     1 │ 2019-06-21 01:15:28 │
The 1 comes earlier than the 3, but the 3 has already been rolled up and has time :00, while the 1 is still not rolled up.
Another problem: how do you roll up AVG across different merges? The table does not store sum/count.
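A tiny illustration (not the engine's actual code path) of why averaging without stored counts drifts:
-- True average of three raw points that belong to one minute:
SELECT avg(v) FROM (SELECT arrayJoin([1, 3, 3]) AS v)  -- 2.33...
-- If a first merge only saw two of them and wrote their average (2) into the :00 row,
-- a later merge that averages this row with the remaining point gets:
SELECT avg(v) FROM (SELECT arrayJoin([2, 3]) AS v)     -- 2.5, not 2.33...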
@den-crane
No. Age=120 still has the discrepancies.
For now, on my test server with a copy of the production metrics (~30k/sec), I don't see it. Before, the errors were all over the place. But I'll monitor it for a couple of days.
It may roll up incorrectly across different merges, when one minute is split between two different parts.
Well, in my case they're always in different parts: carbon-clickhouse is set to flush an INSERT every 5s, so metrics that arrive at a 30s interval always end up in different parts. And as far as I can see, that in itself is not a problem for the rollup, e.g.:
│  1208 │ 2019-06-21 21:18:00 │
│    64 │ 2019-06-21 21:18:32 │
The part in which the value 1208 arrived has already been merged with some other part (its time rounded to the precision), while the value 64 is still in a fresh, unmerged part. When they merge, the rollup will average them correctly.
Another problem: how do you roll up AVG across different merges? The table does not store sum/count.
Yep, it seems the result will differ depending on how the merges proceed. If all parts holding the values for a given precision period are merged at once, the result will be correct; otherwise it will deviate...
https://t.me/c/1425947904/1706
Denny Crane, [Jun 21, 2019 at 7:39:33 PM (2019-06-21, 7:55:42 PM)]:
Tried again, and here is the MV that watches graphite_rollup (AVG, age = 120, precision = 60):
graphite_crap_mv
│     1 │ 2019-06-21 21:13:28 │
│     3 │ 2019-06-21 21:13:58 │
│     1 │ 2019-06-21 21:14:28 │
│     3 │ 2019-06-21 21:14:58 │
│     1 │ 2019-06-21 21:15:28 │
│     3 │ 2019-06-21 21:15:58 │
It rolled up into
graphite_crap
│     2 │ 2019-06-21 21:13:00 │
│     3 │ 2019-06-21 21:14:00 │
│     2 │ 2019-06-21 21:15:00 │
And at the end there is this record:
graphite_crap
│     1 │ 2019-06-21 22:33:00 │
│     3 │ 2019-06-21 22:33:58 │
i.e. the "1" that was at 22:33:28 got rolled up into 22:33:00
graphite_crap_mv
│     1 │ 2019-06-21 22:33:28 │
│     3 │ 2019-06-21 22:33:58 │
After one more merge it became
│     3 │ 2019-06-21 22:33:00 │
So far I don't see merge errors under production load with age=120 precision=60; I'll keep monitoring and will also run the script again with the new settings.
Yes, using the script, it started merging incorrectly after about 30 minutes.
┌─Value─┬────toDateTime(Time)─┐
│     1 │ 2019-06-22 17:05:00 │
│     2 │ 2019-06-22 17:06:00 │
│     2 │ 2019-06-22 17:07:00 │
...
│     2 │ 2019-06-22 17:39:00 │
│     2 │ 2019-06-22 17:40:00 │
│     1 │ 2019-06-22 17:41:00 │
│     1 │ 2019-06-22 17:42:00 │
│     1 │ 2019-06-22 17:43:00 │
...
On the two servers with real metrics, meanwhile, it merges fine.
Errors showed up in production too, but very few.
│  1192 │ 2019-06-23 19:57:00 │
│    64 │ 2019-06-23 19:57:00 │
I have the feeling that CH runs two parallel merges on the same table, and as a result we get two values, each averaged to the start of the minute... But in theory it shouldn't run two merges on the same table in parallel. Or does it?
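One way to check the parallel-merges hypothesis while merges are running (a sketch):
SELECT database, table, elapsed, progress, num_parts, result_part_name
FROM system.merges
WHERE table = 'graphite'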