Clickhouse: Crashing every few hours

Created on 13 Aug 2019  ·  6 comments  ·  Source: ClickHouse/ClickHouse

Describe the bug
ClickHouse is crashing every few hours (different nodes in the cluster at different times; the same node seems to crash every 10–12 hours). apport generates a crash report, shared here: https://gist.github.com/iameugenejo/2827a2c4e8b1a2b1743c47e369dc6930

There are no ClickHouse or syslog error entries when this happens.

How to reproduce

  • Which ClickHouse server version to use:
    19.6.2.11
  • Which interface to use, if it matters
    Ingestion is done through the HTTP interface in TSV format.

  • Non-default settings, if any

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ name                                     โ”ƒ value        โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ max_block_size                           โ”‚ 1000000000   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ min_insert_block_size_rows               โ”‚ 5000000      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ max_query_size                           โ”‚ 104857600    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ group_by_two_level_threshold             โ”‚ 0            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ group_by_two_level_threshold_bytes       โ”‚ 0            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ distributed_aggregation_memory_efficient โ”‚ 1            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ max_bytes_before_external_group_by       โ”‚ 103079215104 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ max_memory_usage                         โ”‚ 94489280512  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  • CREATE TABLE statements for all tables involved
    N/A

  • Sample data for all these tables, use clickhouse-obfuscator if necessary
    N/A

  • Queries to run that lead to unexpected result
    N/A
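For context, the two large byte-valued memory settings in the table above are exact GiB multiples; a quick shell check (a sketch, using plain POSIX arithmetic) confirms the conversion:

```shell
# Convert the two byte-valued settings to GiB (1 GiB = 2^30 = 1073741824 bytes).
echo $((103079215104 / 1073741824))   # max_bytes_before_external_group_by -> 96
echo $((94489280512 / 1073741824))    # max_memory_usage -> 88
```

Note that the external-aggregation spill threshold (96 GiB) sits above the per-query memory limit (88 GiB), so a query would hit `max_memory_usage` before spilling to disk; whether that is intentional may be worth double-checking.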

Expected behavior
No crash

Error message and/or stacktrace
https://gist.github.com/iameugenejo/2827a2c4e8b1a2b1743c47e369dc6930

Additional context
OS is Ubuntu 16.04

We have 7 shards with 2 replicas each, running 4 main ReplicatedSummingMergeTree tables.

Disks are SSDs in a RAID-0 configuration.

Ingestion throughput is 1.4 billion rows / day per replica.

Disk usage is 70G / day per replica

There are ~300k ZooKeeper znodes.

bug obsolete-version

Most helpful comment

Your crashes are in LFAlloc, which was an experimental feature. I think updating to 19.11 would be the best solution.

All 6 comments

sudo dmesg | tail -300
Check for OOM messages like:


Jun 10 13:34:25 s-reserve kernel: [827998.184637] Killed process 24184 (clickhouse-serv) 
  total-vm:19133724kB, anon-rss:9965712kB, file-rss:0kB, shmem-rss:0kB
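A self-contained sketch of that filter follows; the heredoc is a made-up sample standing in for real kernel output, so on the affected host pipe `sudo dmesg` into the same `grep` instead:

```shell
# Filter kernel log lines for OOM-killer activity.
# On a real host: sudo dmesg | grep -Ei 'out of memory|killed process'
grep -Ei 'out of memory|killed process' <<'EOF'
[827990.000001] systemd[1]: Started Session 42 of user ubuntu.
[827998.184600] Out of memory: Kill process 24184 (clickhouse-serv) score 975 or sacrifice child
[827998.184637] Killed process 24184 (clickhouse-serv) total-vm:19133724kB, anon-rss:9965712kB
EOF
```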

Do you use CODEC(Delta) for Date columns, or columns with Int16 / Int8?
https://github.com/yandex/ClickHouse/issues/5480
https://github.com/yandex/ClickHouse/pull/5786

I see this:

[Mon May 20 21:17:52 2019] Out of memory: Kill process 7820 (clickhouse-serv) score 975 or sacrifice child
[Mon May 20 21:17:52 2019] Killed process 7820 (clickhouse-serv) total-vm:392840976kB, anon-rss:127940644kB, file-rss:0kB, shmem-rss:0kB

but that isn't related; the latest crash happened this morning, 8/13/2019.

And no, we don't use CODEC(Delta), Int16, or Int8 columns.

The raw core report has the following section:

ProcStatus:
 Name:  clickhouse-serv
 Umask: 0027
 State: D (disk sleep)
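The `State: D` line means the process was in uninterruptible disk sleep when the report was taken. The same fields can be read live from procfs; as a self-contained sketch, this inspects the current shell's own PID, so substitute the clickhouse-server PID on a real host:

```shell
# Read the Name/State/Umask fields from /proc/<pid>/status.
# $$ (this shell) is used so the example runs anywhere; replace it with
# the server PID, e.g. "$(pidof clickhouse-server)", on the affected node.
# (Umask only appears on Linux >= 4.7.)
grep -E '^(Name|State|Umask):' "/proc/$$/status"
```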

We updated to this version a little under 3 months ago, after running 18.16.1 for a while. This never happened on 18.16.1, but it has been happening for the entire time we've been on the new version.

Should I try upgrading ClickHouse to the latest release, or should I downgrade to 18.16.1?

Your crashes are in LFAlloc, which was an experimental feature. I think updating to 19.11 would be the best solution.

Thank you. I upgraded it to 19.11.7.40 and confirmed the issue is gone.
