Describe the bug
ClickHouse is crashing every few hours (different nodes in the cluster each time; the same node seems to crash roughly every 10–12 hours). apport generates a crash report, shared here: https://gist.github.com/iameugenejo/2827a2c4e8b1a2b1743c47e369dc6930
There are no ClickHouse or syslog error logs when this happens.
How to reproduce
ClickHouse server version: 19.6.2.11
Which interface to use, if matters
ingestion is done through http interface as TSV format
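For reference, a minimal sketch of what TSV ingestion over the HTTP interface looks like; the `events` table name and the localhost endpoint are placeholders, not the reporter's actual schema:

```shell
# Build an HTTP INSERT URL for TSV ingestion; the `events` table and
# localhost:8123 endpoint are placeholders, not from the report.
# Spaces in the query string must be URL-encoded as %20.
URL='http://localhost:8123/?query=INSERT%20INTO%20events%20FORMAT%20TabSeparated'
echo "$URL"
# With a live server, a TSV batch is then posted with:
#   curl "$URL" --data-binary @batch.tsv
```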
Non-default settings, if any
┌─name──────────────────────────────────────┬─value────────┐
│ max_block_size                            │ 1000000000   │
│ min_insert_block_size_rows                │ 5000000      │
│ max_query_size                            │ 104857600    │
│ group_by_two_level_threshold              │ 0            │
│ group_by_two_level_threshold_bytes        │ 0            │
│ distributed_aggregation_memory_efficient  │ 1            │
│ max_bytes_before_external_group_by        │ 103079215104 │
│ max_memory_usage                          │ 94489280512  │
└───────────────────────────────────────────┴──────────────┘
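A table like the one above is produced by querying `system.settings`; the list of non-default settings for the current session can be reproduced with:

```sql
-- List every setting that differs from its default value.
SELECT name, value
FROM system.settings
WHERE changed;
```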
CREATE TABLE statements for all tables involved
N/A
Sample data for all these tables, use clickhouse-obfuscator if necessary
N/A
Queries to run that lead to unexpected result
N/A
Expected behavior
No crash
Error message and/or stacktrace
https://gist.github.com/iameugenejo/2827a2c4e8b1a2b1743c47e369dc6930
Additional context
OS is Ubuntu 16.04
We have 7 shards, 2-replicas each with 4 major replicated summing merge trees.
Disks used are SSDs in RAID-0 configuration
Ingestion throughput is 1.4 billion / day per replica
Disk usage is 70G / day per replica
ZooKeeper holds roughly 300k znodes.
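As a sanity check on the figures above, the ingestion and disk numbers imply roughly 50 bytes per row on disk per replica (a back-of-the-envelope estimate, assuming 70G means 70e9 bytes):

```shell
# Back-of-the-envelope: disk bytes per ingested row, per replica.
# 70 GB/day of disk growth over 1.4 billion rows/day.
awk 'BEGIN { printf "%.0f bytes/row\n", 70e9 / 1.4e9 }'
```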
Run `sudo dmesg | tail -300` and check for OOM messages like:
Jun 10 13:34:25 s-reserve kernel: [827998.184637] Killed process 24184 (clickhouse-serv) total-vm:19133724kB, anon-rss:9965712kB, file-rss:0kB, shmem-rss:0kB
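A quick way to filter the kernel ring buffer for just the OOM-killer lines (demonstrated here on a sample line, since `dmesg` needs a live kernel log and usually root):

```shell
# OOM-killer events match this pattern; demo on a sample log line.
LINE='[827998.184637] Killed process 24184 (clickhouse-serv)'
echo "$LINE" | grep -E 'Out of memory|Killed process'
# Against the real log: sudo dmesg | grep -E 'Out of memory|Killed process'
```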
Do you use CODEC(Delta) for Date columns, or columns with Int16 / Int8?
https://github.com/yandex/ClickHouse/issues/5480
https://github.com/yandex/ClickHouse/pull/5786
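For context, the codec pattern those issues refer to looks like this in a table definition (a hypothetical table, not one from this report):

```sql
-- Hypothetical table using the Delta codec on a Date column plus a
-- narrow integer column -- the combination linked to issue #5480.
CREATE TABLE codec_example
(
    d Date CODEC(Delta, LZ4),
    v Int16
)
ENGINE = MergeTree
ORDER BY d;
```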
I see this:
[Mon May 20 21:17:52 2019] Out of memory: Kill process 7820 (clickhouse-serv) score 975 or sacrifice child
[Mon May 20 21:17:52 2019] Killed process 7820 (clickhouse-serv) total-vm:392840976kB, anon-rss:127940644kB, file-rss:0kB, shmem-rss:0kB
but this isn't related: that OOM kill is from May, while the latest crash happened this morning, 8/13/2019.
And no, we don't use CODEC(Delta), Int16, or Int8 columns.
The raw core report has the following section:
ProcStatus:
Name: clickhouse-serv
Umask: 0027
State: D (disk sleep)
We updated to this version a little less than 3 months ago, after using 18.16.1 for a while. This never happened on 18.16.1, and it has been happening for the entire time we've been on the new version.
Should I attempt to upgrade ClickHouse to the latest release, or should I downgrade to 18.16.1?
Your crashes are in LFAlloc, which was an experimental feature. I think updating to 19.11 would be the best solution.
Thank you. I upgraded it to 19.11.7.40 and confirmed the issue is gone.
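After an upgrade, the running server version can be confirmed with:

```sql
-- Returns the server version string, e.g. 19.11.7.40.
SELECT version();
```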