ClickHouse: Too many open files while using a materialized view with a Kafka source

Created on 22 Sep 2018 · 22 comments · Source: ClickHouse/ClickHouse

Greetings.
I have an Ubuntu 18.04 host system, ClickHouse server version 18.12.17, default config.
Database structure creation:

create table a (EventDate Date) ENGINE MergeTree() partition by toYYYYMM(EventDate) order by (EventDate) settings index_granularity=8192;

create table k (EventDate Date) ENGINE=Kafka SETTINGS
  kafka_broker_list = 'localhost:9092,localhost:9093,localhost:9094',
  kafka_topic_list = 'a',
  kafka_group_name = 'group1',
  kafka_format = 'JSONEachRow';

create materialized view w to a as select * from k;
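
For completeness, a test row can be pushed into the topic so the view has something to consume; a minimal sketch (kafkacat here is just an example producer I happen to use, any Kafka producer works, and the JSON field matches the table definition above):

echo '{"EventDate":"2018-09-22"}' | kafkacat -P -b localhost:9092 -t a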

As soon as the materialized view is created, the number of file descriptors held by the ClickHouse server process starts growing every second until it hits the max open files limit.
The error log is empty, and the regular log contains only debug and trace info. However, syslog contains the following errors:

Sep 22 18:22:38 kernel: [25449.716131] netlink: 'TCPHandler': attribute type 1 has an invalid length.
Sep 22 18:22:38 kernel: [25449.729949] netlink: 'TCPHandler': attribute type 1 has an invalid length.
Sep 22 18:22:40 kernel: [25451.228134] netlink: 'ParalInputsProc': attribute type 1 has an invalid length.
Sep 22 18:22:40 kernel: [25451.228175] netlink: 'ParalInputsProc': attribute type 1 has an invalid length.
Sep 22 18:22:42 kernel: [25453.227433] netlink: 'ParalInputsProc': attribute type 1 has an invalid length.
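
The descriptor growth itself is easy to watch; a minimal sketch, assuming the default package install where the process is named clickhouse-server:

# print the server's open-descriptor count once per second
watch -n 1 'ls /proc/$(pidof clickhouse-server)/fd | wc -l'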

Am I missing something? Any help would be much appreciated.


All 22 comments

Does setting kafka_num_consumers = 1 in the creation of the Kafka table stop the errors?

https://clickhouse.yandex/docs/en/operations/table_engines/kafka/

Thank you for the response. I changed the Kafka table creation to:

CREATE TABLE default.k (EventDate Date) ENGINE = Kafka SETTINGS
  kafka_broker_list = 'localhost:9092,localhost:9093,localhost:9094',
  kafka_topic_list = 'a',
  kafka_group_name = 'group1',
  kafka_format = 'JSONEachRow',
  kafka_num_consumers = 1

Unfortunately, the problem is still present.

We are also seeing this issue. ClickHouse currently has 456,357 open files; they are mostly netlink sockets:

clickhous 5174 clickhouse *377u     sock                0,8       0t0 41551004 protocol: NETLINK
clickhous 5174 clickhouse *378u     sock                0,8       0t0 41551005 protocol: NETLINK
clickhous 5174 clickhouse *395u     sock                0,8       0t0 41551817 protocol: NETLINK
clickhous 5174 clickhouse *396u     sock                0,8       0t0 41551818 protocol: NETLINK

At this point the Kafka table engine + MV is unusable. Our ClickHouse version is 18.12.13 revision 54407. I'm happy to provide more info to help debug the problem.
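
If it helps, the leaked sockets can be counted separately from ordinary files; a rough sketch (5174 is the PID from the lsof output above):

# count only the netlink sockets held by the server process
sudo lsof -p 5174 | grep -c 'protocol: NETLINK'

# or count every open descriptor via procfs
ls /proc/5174/fd | wc -l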

We are hitting this problem too, sadly.

1) Can you try to consume Kafka messages with kafkacat from the same host where ClickHouse is installed?
2) Do you use ClickHouse from packages, or built from sources?
3) Can you also check clickhouse-server.log (not the err.log) for anything suspicious?
4) How many nodes do you have in your Kafka cluster?

  1. Kafkacat and the native consume/produce scripts from the Kafka distribution work normally with local Kafka.
  2. ClickHouse was installed from packages.
  3. Nothing suspicious in either log.
  4. 3 Kafka nodes.

Kafka is running via the following docker-compose file:

version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"

  kafka1:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      HOSTNAME_COMMAND: "route -n | awk '/UG[ \t]/{print $$2}'"
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_BROKER_ID: "1"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  kafka2:
    image: wurstmeister/kafka
    ports:
      - "9093:9092"
    environment:
      HOSTNAME_COMMAND: "route -n | awk '/UG[ \t]/{print $$2}'"
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_BROKER_ID: "2"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  kafka3:
    image: wurstmeister/kafka
    ports:
      - "9094:9092"
    environment:
      HOSTNAME_COMMAND: "route -n | awk '/UG[ \t]/{print $$2}'"
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_BROKER_ID: "3"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

The ClickHouse server is running on the host system, outside of the Docker network created above. If ClickHouse is added as a new service to the docker-compose file, everything works fine. So, as a workaround, for now we have moved the production deployment into a separate Docker container.
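
For reference, the workaround boils down to attaching ClickHouse to the same Docker network as the brokers; a rough sketch, where the network name (docker-compose defaults to <project-dir>_default) and the image tag are assumptions:

# find the compose network, then start ClickHouse inside it
docker network ls | grep _default
docker run -d --name clickhouse-server \
  --network <project-dir>_default \
  --ulimit nofile=262144:262144 \
  yandex/clickhouse-server

Inside that network the kafka_broker_list in the Kafka table also has to point at the service names (e.g. kafka1:9092) rather than localhost.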

Kafkacat and the native consume/produce scripts from the Kafka distribution work normally

Kafkacat is more representative for debugging (it uses the same lib for Kafka access as ClickHouse).

work normally with local Kafka.

What do you mean by local? Can you check consuming messages with kafkacat in the same configuration where ClickHouse has problems? I.e. (if I understand you correctly): local kafkacat consuming messages from the 3-node Docker-based cluster.
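
In other words, something along these lines, run on the host next to ClickHouse (a sketch using standard kafkacat flags):

# consume topic 'a' from the beginning via the same broker list ClickHouse uses,
# exiting once the end of the partitions is reached
kafkacat -C -b localhost:9092,localhost:9093,localhost:9094 -t a -o beginning -e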

@arstd, @snorecone, in your case is it also somehow related to Docker?

@filimonov no, we're not running with Docker at all. Kafka, ZooKeeper and ClickHouse are all running on EC2 instances (Ubuntu 16.04); ClickHouse is installed from packages, 18.12.13 revision 54407.

Sorry for the misunderstanding. I have only one Kafka setup in my local environment, the 3-node Docker-based cluster. Kafkacat successfully produces and consumes messages from this cluster. If the ClickHouse server runs on the host machine, the bug occurs; if ClickHouse is added as a service to the Docker network with Kafka, everything works fine.
The bug also occurs in our production environment, where we don't have any Docker at all. The host system there is Debian Jessie, with a 3-node Kafka cluster.

I have kind of the same issue, but my logs contain a huge number of errors.
I have Ubuntu 18.04.1, ClickHouse 18.14.9 revision 54409, installed from packages without Docker, and 3 Kafka nodes.

2018.10.24 12:38:08.038702 [ 22 ] {} <Error> r0.events (ReplicatedMergeTreeRestartingThread): void DB::ReplicatedMergeTreeRestartingThread::run(): Code: 999, e.displayText() = Coordination::Exception: All
connection tries failed while connecting to ZooKeeper. Addresses: 10.90.17.4:2181, 10.90.17.5:2181, 10.90.17.9:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.4:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.5:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.9:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.4:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.5:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.9:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.4:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.5:2181
Poco::Exception. Code: 1000, e.code() = 24, e.displayText() = I/O error: Too many open files, e.what() = I/O error, 10.90.17.9:2181
 (Connection loss), e.what() = Coordination::Exception, Stack trace:

0. /usr/bin/clickhouse-server(StackTrace::StackTrace()+0x16) [0x99e9626]
1. /usr/bin/clickhouse-server(Coordination::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)+0x31) [0x7cabf81]
2. /usr/bin/clickhouse-server(Coordination::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x1a5) [0x7cacd35]
3. /usr/bin/clickhouse-server(Coordination::ZooKeeper::connect(std::vector<Poco::Net::SocketAddress, std::allocator<Poco::Net::SocketAddress> > const&, Poco::Timespan)+0x40a) [0x7ccdc7a]
4. /usr/bin/clickhouse-server(Coordination::ZooKeeper::ZooKeeper(std::vector<Poco::Net::SocketAddress, std::allocator<Poco::Net::SocketAddress> > const&, std::__cxx11::basic_string<char, std::char_traits<c
har>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
const&, Poco::Timespan, Poco::Timespan, Poco::Timespan)+0x3e9) [0x7cce279]
5. /usr/bin/clickhouse-server(zkutil::ZooKeeper::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::a
llocator<char> > const&, int, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x390) [0x7cbf840]
6. /usr/bin/clickhouse-server(zkutil::ZooKeeper::startNewSession() const+0xd9) [0x7cc0309]
7. /usr/bin/clickhouse-server(DB::Context::getZooKeeper() const+0xe0) [0x7570d90]
8. /usr/bin/clickhouse-server(DB::ReplicatedMergeTreeRestartingThread::run()+0x79) [0x78d8b99]
9. /usr/bin/clickhouse-server(DB::BackgroundSchedulePool::TaskInfo::execute()+0xef) [0x796d23f]
10. /usr/bin/clickhouse-server(DB::BackgroundSchedulePool::threadFunction()+0xba) [0x796e00a]
11. /usr/bin/clickhouse-server() [0xa3da90f]
12. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f404ba826db]
13. /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f404b00188f]

When I try to connect to ClickHouse, I receive:

ClickHouse client version 18.14.9.
Connecting to localhost:9000.
Connected to ClickHouse server version 18.14.9 revision 54409.

prusech003 :) Cannot load data for command line suggestions: Code: 32, e.displayText() = DB::Exception: Attempt to read after eof: while receiving packet from localhost:9000, 127.0.0.1, e.what() = DB::Exception

Also, I have 2 GB of logs per node (3 nodes) from the last 12 hours; if you need something, let me know.

2018.10.26 17:11:45.063142 [ 616383 ] {} <Error> void DB::ThreadStatus::initPerformanceCounters(): Code: 412, e.displayText() = DB::Exception: Can't create PF_NETLINK socket, errno: 24, strerror: Too many open files, e.what() = DB::Exception, Stack trace:

0. /usr/bin/clickhouse-server(StackTrace::StackTrace()+0x16) [0x99e9626]
1. /usr/bin/clickhouse-server(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x22) [0x3087172]
2. /usr/bin/clickhouse-server(DB::throwFromErrno(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int)+0x1bc) [0x99ce23c]
3. /usr/bin/clickhouse-server(DB::TaskStatsInfoGetter::TaskStatsInfoGetter()+0x1bc) [0x99f0c1c]
4. /usr/bin/clickhouse-server(DB::ThreadStatus::initPerformanceCounters()+0x281) [0x99f21b1]
5. /usr/bin/clickhouse-server(DB::ThreadStatus::attachQuery(std::shared_ptr<DB::ThreadGroupStatus> const&, bool)+0x221) [0x767e971]
6. /usr/bin/clickhouse-server(DB::CurrentThread::attachTo(std::shared_ptr<DB::ThreadGroupStatus> const&)+0x3a) [0x767edea]
7. /usr/bin/clickhouse-server(DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0>::Handler, (DB::StreamUnionMode)0>::thread(std::shared_ptr<DB::ThreadGroupStatus>, unsigned long)+0x78) [0x75fd648]
8. /usr/bin/clickhouse-server(std::thread::_State_impl<std::thread::_Invoker<std::tuple<DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0>::Handler, (DB::StreamUnionMode)0>::process()::{lambda()#1}> > >::_M_run()+0x4f) [0x75fdc9f]
9. /usr/bin/clickhouse-server() [0xa3da90f]
10. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8f86c0f6db]
11. /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f8f8618e88f]

Try setcap cap_net_admin=-ep /usr/bin/clickhouse
NETLINK is used for collecting metrics from the kernel.
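
Roughly, applying and verifying the capability looks like this (a sketch; cap_net_admin+ep is the usual spelling, and the binary path assumes the package install):

# grant CAP_NET_ADMIN to the server binary, verify it, then restart the service
sudo setcap cap_net_admin+ep /usr/bin/clickhouse
getcap /usr/bin/clickhouse
sudo service clickhouse-server restart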

We got 'too many open files' when we added the net_admin capability to the ClickHouse binary. We are using Kafka too.

We use neither Kafka nor Docker (and we use only very basic features of ClickHouse), but I see lots of such messages (I don't see out-of-file-descriptors errors, though):

kernel netlink: 'TCPHandler': attribute type 1 has an invalid length.
kernel netlink: 'TCPHandler': attribute type 1 has an invalid length.
kernel netlink: 'TCPHandler': attribute type 1 has an invalid length.
kernel netlink: 'TCPHandler': attribute type 1 has an invalid length.
kernel netlink: 'AsyncBlockInput': attribute type 1 has an invalid length.
kernel netlink: 'ParalInputsProc': attribute type 1 has an invalid length.
kernel netlink: 'ParalInputsProc': attribute type 1 has an invalid length.

@fillest I see similar dmesg output on an Ubuntu 16.04.5 server which has clickhouse-server 18.14.15 installed. We are also only using basic features of ClickHouse.

Simply running clickhouse-client and connecting to localhost generates such error messages in dmesg:

[Mon Dec  3 13:31:06 2018] netlink: 'TCPHandler': attribute type 1 has an invalid length.
[Mon Dec  3 13:31:06 2018] netlink: 'AsyncBlockInput': attribute type 1 has an invalid length.
[Mon Dec  3 13:31:06 2018] netlink: 'ParalInputsProc': attribute type 1 has an invalid length.
[Mon Dec  3 13:31:06 2018] netlink: 'ParalInputsProc': attribute type 1 has an invalid length.
[Mon Dec  3 13:31:06 2018] netlink: 'AsyncBlockInput': attribute type 1 has an invalid length.
[Mon Dec  3 13:31:06 2018] netlink: 'TCPHandler': attribute type 1 has an invalid length.

The server is for test purposes and doesn't have much workload. The ClickHouse server has only 104 open files; connecting a client increases the open file count to 106.

The ClickHouse server log shows there's a query executed immediately after clickhouse-client connects:

SELECT DISTINCT arrayJoin(extractAll(name, '[\\w_]{2,}')) AS res FROM (SELECT name FROM system.functions UNION ALL SELECT name FROM system.table_engines UNION ALL SELECT name FROM system.formats UNION ALL SELECT name FROM system.table_functions UNION ALL SELECT name FROM system.data_type_families UNION ALL SELECT name FROM system.settings UNION ALL SELECT concat(func.name, comb.name) FROM system.functions AS func CROSS JOIN system.aggregate_function_combinators AS comb WHERE is_aggregate UNION ALL SELECT name FROM system.databases LIMIT 10000 UNION ALL SELECT DISTINCT name FROM system.tables LIMIT 10000 UNION ALL SELECT DISTINCT name FROM system.columns LIMIT 10000) WHERE notEmpty(res)

This query triggers the dmesg messages above, but it seems like any query can trigger them, e.g. SHOW DATABASES, SHOW TABLES, and any SELECT query on data tables.

@splichy I've tried the setcap command as you mentioned, but I still get the netlink log messages.

The netlink dmesg problem is solved by https://github.com/yandex/ClickHouse/pull/3749

@proller Thanks. The netlink error message disappears after applying your fix.

But there is still a problem with open netlink file descriptors:

> ps aux | grep clickhouse
clickho+ 23434 12.9 11.8 7978952 967100 ?      Ssl  Dec25 184:24 /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml

> sudo lsof -p 23434 | wc -l
184505

The number is constantly growing, and upon reaching the limit the server goes down.
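
As a stopgap until the leak itself is fixed, the limit of the running process can at least be checked and raised; a sketch, assuming the systemd-managed service from the packages (23434 is the PID from the ps output above):

# current limit of the running server
grep 'open files' /proc/23434/limits

# raise it via a systemd drop-in, then restart
sudo mkdir -p /etc/systemd/system/clickhouse-server.service.d
printf '[Service]\nLimitNOFILE=500000\n' | sudo tee /etc/systemd/system/clickhouse-server.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart clickhouse-server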

But there is still a problem with open netlink file descriptors
The number is constantly growing

#4017

Has anyone faced this issue after the new release?
Our ClickHouse version is 19.1.6.
However, the number of netlink file descriptors still keeps constantly increasing.

lsof -p <pid> | wc -l

The following has been observed: whenever the materialized view queries the consumer (queue) table, the file descriptor count increases by 4. When there is no materialized view, there is no increase.

#4017 is in the 19.1.6 release. However, this behavior still exists:

2019.02.12 01:12:45.289371 [ 44 ] {} StorageKafka (search_request_consumer): Started streaming to 1 attached views
2019.02.12 01:12:45.291034 [ 148 ] {} StorageKafka (search_request_consumer): Creating formatted reader
2019.02.12 01:12:45.291077 [ 150 ] {} StorageKafka (search_request_consumer): Creating formatted reader
2019.02.12 01:12:45.291045 [ 149 ] {} StorageKafka (search_request_consumer): Creating formatted reader
2019.02.12 01:12:45.291088 [ 151 ] {} StorageKafka (search_request_consumer): Creating formatted reader
2019.02.12 01:12:47.792008 [ 148 ] {} ThreadStatus: Thread 148 exited
2019.02.12 01:12:47.792104 [ 149 ] {} ThreadStatus: Thread 149 exited
2019.02.12 01:12:47.792170 [ 150 ] {} ThreadStatus: Thread 150 exited
2019.02.12 01:12:47.792293 [ 44 ] {} UnionBlockInputStream: Waiting for threads to finish
2019.02.12 01:12:47.792313 [ 151 ] {} ThreadStatus: Thread 151 exited

A bit more info: our setup uses the Kafka engine with kafka_num_consumers = 4.

With gdb breakpoints set at TaskStatsInfoGetter::TaskStatsInfoGetter() and TaskStatsInfoGetter::~TaskStatsInfoGetter(),
we can see that for each materialized view streaming interval the constructor is called 4 times and no destructor is called.
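
Roughly the gdb session used for this (a sketch; it assumes debug symbols for the clickhouse binary are available):

# attach to the running server and break on the taskstats helper's
# constructor/destructor; each streaming interval hits the constructor
# 4 times with no matching destructor calls
gdb -p "$(pidof clickhouse-server)" \
    -ex 'break DB::TaskStatsInfoGetter::TaskStatsInfoGetter' \
    -ex 'break DB::TaskStatsInfoGetter::~TaskStatsInfoGetter' \
    -ex continue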
