Describe the bug
ClickHouse is crashing every few hours (different nodes in the cluster each time; the same node seems to crash roughly every 10–12 hours). apport generates a crash report, shared here: https://gist.github.com/iameugenejo/2827a2c4e8b1a2b1743c47e369dc6930
There are no ClickHouse or syslog error logs when this happens.
How to reproduce
ClickHouse server version: 19.6.2.11
Which interface to use, if matters
ingestion is done through http interface as TSV format
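For reference, a minimal sketch of what TSV ingestion over the HTTP interface looks like; the `events` table name and the localhost endpoint are placeholders, not the reporter's actual schema:

```shell
# Build an HTTP INSERT URL for TSV ingestion; the `events` table and
# localhost:8123 endpoint are placeholders, not from the report.
# Spaces in the query string must be URL-encoded as %20.
URL='http://localhost:8123/?query=INSERT%20INTO%20events%20FORMAT%20TabSeparated'
echo "$URL"
# With a live server, a TSV batch is then posted with:
#   curl "$URL" --data-binary @batch.tsv
```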
Non-default settings, if any
┌─name──────────────────────────────────────┬─value────────┐
│ max_block_size                            │ 1000000000   │
│ min_insert_block_size_rows                │ 5000000      │
│ max_query_size                            │ 104857600    │
│ group_by_two_level_threshold              │ 0            │
│ group_by_two_level_threshold_bytes        │ 0            │
│ distributed_aggregation_memory_efficient  │ 1            │
│ max_bytes_before_external_group_by        │ 103079215104 │
│ max_memory_usage                          │ 94489280512  │
└───────────────────────────────────────────┴──────────────┘
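A table like the one above is produced by querying `system.settings`; the list of non-default settings for the current session can be reproduced with:

```sql
-- List every setting that differs from its default value.
SELECT name, value
FROM system.settings
WHERE changed;
```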
CREATE TABLE statements for all tables involved
N/A
Sample data for all these tables, use clickhouse-obfuscator if necessary
N/A
Queries to run that lead to unexpected result
N/A
Expected behavior
No crash
Error message and/or stacktrace
https://gist.github.com/iameugenejo/2827a2c4e8b1a2b1743c47e369dc6930
Additional context
OS is Ubuntu 16.04
We have 7 shards, 2-replicas each with 4 major replicated summing merge trees.
Disks used are SSDs in RAID-0 configuration
Ingestion throughput is 1.4 billion / day per replica
Disk usage is 70G / day per replica
ZooKeeper holds roughly 300k znodes.
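As a sanity check on the figures above, the ingestion and disk numbers imply roughly 50 bytes per row on disk per replica (a back-of-the-envelope estimate, assuming 70G means 70e9 bytes):

```shell
# Back-of-the-envelope: disk bytes per ingested row, per replica.
# 70 GB/day of disk growth over 1.4 billion rows/day.
awk 'BEGIN { printf "%.0f bytes/row\n", 70e9 / 1.4e9 }'
```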
Run `sudo dmesg | tail -300` and check for OOM messages like:
Jun 10 13:34:25 s-reserve kernel: [827998.184637] Killed process 24184 (clickhouse-serv) total-vm:19133724kB, anon-rss:9965712kB, file-rss:0kB, shmem-rss:0kB
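A quick way to filter the kernel ring buffer for just the OOM-killer lines (demonstrated here on a sample line, since `dmesg` needs a live kernel log and usually root):

```shell
# OOM-killer events match this pattern; demo on a sample log line.
LINE='[827998.184637] Killed process 24184 (clickhouse-serv)'
echo "$LINE" | grep -E 'Out of memory|Killed process'
# Against the real log: sudo dmesg | grep -E 'Out of memory|Killed process'
```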
Do you use CODEC(Delta) for Date columns, or columns with Int16 / Int8?
https://github.com/yandex/ClickHouse/issues/5480
https://github.com/yandex/ClickHouse/pull/5786
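For context, the codec pattern those issues refer to looks like this in a table definition (a hypothetical table, not one from this report):

```sql
-- Hypothetical table using the Delta codec on a Date column plus a
-- narrow integer column -- the combination linked to issue #5480.
CREATE TABLE codec_example
(
    d Date CODEC(Delta, LZ4),
    v Int16
)
ENGINE = MergeTree
ORDER BY d;
```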
I see this:
[Mon May 20 21:17:52 2019] Out of memory: Kill process 7820 (clickhouse-serv) score 975 or sacrifice child
[Mon May 20 21:17:52 2019] Killed process 7820 (clickhouse-serv) total-vm:392840976kB, anon-rss:127940644kB, file-rss:0kB, shmem-rss:0kB
but this isn't related: that OOM kill is from May, while the latest crash happened this morning, 8/13/2019.
And no, we don't use CODEC(Delta), Int16, or Int8 columns.
The raw core report has the following section:
ProcStatus:
Name: clickhouse-serv
Umask: 0027
State: D (disk sleep)
We updated to this version a little less than 3 months ago, after using 18.16.1 for a while. This never happened on 18.16.1, and it has been happening for the entire time we've been on the new version.
Should I attempt to upgrade ClickHouse to the latest release, or should I downgrade to 18.16.1?
Your crashes are in LFAlloc, which was an experimental feature. I think updating to 19.11 would be the best solution.
Thank you. I upgraded it to 19.11.7.40 and confirmed the issue is gone.
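After an upgrade, the running server version can be confirmed with:

```sql
-- Returns the server version string, e.g. 19.11.7.40.
SELECT version();
```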