When we upgraded our ClickHouse cluster from version 19.16.2.2 to version 19.17.6.36, distributed SELECT queries stopped working and we saw the following error message in the log:
2020.01.02 14:49:53.874989 [ 99 ] {...} <Error> executeQuery: Code: 32, e.displayText() = DB::Exception: Attempt to read after eof (version 19.17.6.36 (official build)) (from [::ffff:192.168.0.14]:56306) (in query: SELECT offset, update FROM default.test WHERE (topic = 'test1') AND (partition = 1) ORDER BY offset DESC LIMIT 1), Stack trace:
0. 0x55e0ed21f0e0 StackTrace::StackTrace() /usr/bin/clickhouse
1. 0x55e0ed21eeb5 DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) /usr/bin/clickhouse
2. 0x55e0ecd5fec9 DB::throwReadAfterEOF() /usr/bin/clickhouse
3. 0x55e0ed2aa1cb DB::readVarUInt(unsigned long&, DB::ReadBuffer&) /usr/bin/clickhouse
4. 0x55e0ed2a323e DB::TCPHandler::isQueryCancelled() /usr/bin/clickhouse
5. 0x55e0ed2a5bfb DB::TCPHandler::processOrdinaryQuery() /usr/bin/clickhouse
6. 0x55e0ed2a8f56 DB::TCPHandler::runImpl() /usr/bin/clickhouse
7. 0x55e0ed2a914b DB::TCPHandler::run() /usr/bin/clickhouse
8. 0x55e0f11c2fe0 Poco::Net::TCPServerConnection::start() /usr/bin/clickhouse
9. 0x55e0f11c36fd Poco::Net::TCPServerDispatcher::run() /usr/bin/clickhouse
10. 0x55e0f2899871 Poco::PooledThread::run() /usr/bin/clickhouse
11. 0x55e0f289761c Poco::ThreadImpl::runnableEntry(void*) /usr/bin/clickhouse
12. 0x55e0f300d780 ? /usr/bin/clickhouse
13. 0x7f14ef11efa3 start_thread /lib/x86_64-linux-gnu/libpthread-2.28.so
14. 0x7f14ef0404cf clone /lib/x86_64-linux-gnu/libc-2.28.so
2020.01.02 14:50:06.953139 [ 99 ] {...} <Error> executeQuery: Code: 32, e.displayText() = DB::Exception: Attempt to read after eof (version 19.17.6.36 (official build)) (from [::ffff:192.168.0.14]:56316) (in query: SELECT offset, update FROM default.test WHERE (topic = 'test1') AND (partition = 1) ORDER BY offset DESC LIMIT 1), Stack trace:
0. 0x55e0ed21f0e0 StackTrace::StackTrace() /usr/bin/clickhouse
1. 0x55e0ed21eeb5 DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) /usr/bin/clickhouse
2. 0x55e0ecd5fec9 DB::throwReadAfterEOF() /usr/bin/clickhouse
3. 0x55e0ed2aa1cb DB::readVarUInt(unsigned long&, DB::ReadBuffer&) /usr/bin/clickhouse
4. 0x55e0ed2a323e DB::TCPHandler::isQueryCancelled() /usr/bin/clickhouse
5. 0x55e0ed2a5bfb DB::TCPHandler::processOrdinaryQuery() /usr/bin/clickhouse
6. 0x55e0ed2a8f56 DB::TCPHandler::runImpl() /usr/bin/clickhouse
7. 0x55e0ed2a914b DB::TCPHandler::run() /usr/bin/clickhouse
8. 0x55e0f11c2fe0 Poco::Net::TCPServerConnection::start() /usr/bin/clickhouse
9. 0x55e0f11c36fd Poco::Net::TCPServerDispatcher::run() /usr/bin/clickhouse
10. 0x55e0f2899871 Poco::PooledThread::run() /usr/bin/clickhouse
11. 0x55e0f289761c Poco::ThreadImpl::runnableEntry(void*) /usr/bin/clickhouse
12. 0x55e0f300d780 ? /usr/bin/clickhouse
13. 0x7f14ef11efa3 start_thread /lib/x86_64-linux-gnu/libpthread-2.28.so
14. 0x7f14ef0404cf clone /lib/x86_64-linux-gnu/libc-2.28.so
Switching back to version 19.17.5.18 solved the issue.
Example cluster config:
<cluster1>
    <shard>
        <replica>
            <host>ch01</host>
            <port>9440</port>
            <secure>1</secure>
        </replica>
    </shard>
    <shard>
        <replica>
            <host>ch02</host>
            <port>9440</port>
            <secure>1</secure>
        </replica>
    </shard>
    ...
</cluster1>
Unfortunately we could not replicate this issue in our test environment.
https://github.com/ClickHouse/ClickHouse/compare/v19.17.5.18-stable...v19.17.6.36-stable
This may be the reason: https://github.com/ClickHouse/ClickHouse/pull/8128 (didn't check).
After #8128, socket timeouts started being respected (they were previously ignored for TLS).
Can you try to increase default timeouts?
Version 19.17.6.36 and version 20.1.2.4 give the same errors even when increasing the timeouts:
<!-- Connection timeout if there are no replicas. default=10 -->
<connect_timeout>60</connect_timeout>
<!-- default=300 -->
<receive_timeout>600</receive_timeout>
<!-- default=300 -->
<send_timeout>600</send_timeout>
Here is a sample distributed query running on version 19.17.5.18:
...:~# clickhouse-client
ClickHouse client version 19.17.5.18 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.17.5 revision 54428.
... :) SELECT field_a, field_b, field_c FROM xx WHERE field_d = '...' AND field_e = 0 ORDER BY field_a DESC LIMIT 1 FORMAT CSV;
SELECT
field_a,
field_b,
field_c
FROM x_all
WHERE (field_d = '...') AND (field_e = 0)
ORDER BY field_a DESC
LIMIT 1
FORMAT CSV
123456789,"2020-01-22 15:30:47",123456789
1 rows in set. Elapsed: 2.109 sec. Processed 9.94 billion rows, 52.32 GB (4.71 billion rows/s., 24.81 GB/s.)
Here is the same distributed query on version 20.1.2.4; the error happens almost immediately:
...:~# clickhouse-client
ClickHouse client version 20.1.2.4 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.1.2 revision 54431.
... :) SELECT field_a, field_b, field_c FROM x_all WHERE field_d = '...' AND field_e = 0 ORDER BY field_a DESC LIMIT 1 FORMAT CSV;
SELECT
field_a,
field_b,
field_c
FROM x_all
WHERE (field_d = '...') AND (field_e = 0)
ORDER BY field_a DESC
LIMIT 1
FORMAT CSV
↙ Progress: 6.74 billion rows, 45.87 GB (3.49 billion rows/s., 23.73 GB/s.) 67%
Received exception from server (version 20.1.2):
Code: 209. DB::Exception: Received from localhost:9000. DB::Exception: Timeout exceeded while reading from socket (111.111.111.111:9440): while receiving packet from 02.xx:9440.
0 rows in set. Elapsed: 1.973 sec. Processed 6.74 billion rows, 45.87 GB (3.42 billion rows/s., 23.24 GB/s.)
We set up a test cluster with the same data model and some sample data on the same hardware.
We get the same error on the test cluster with TLS enabled (although less often) and don't see the timeout error with TLS disabled.
The sample query takes about 1 sec. with TLS enabled and around 0.05 sec. without TLS.
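The with/without-TLS timing comparison above can be reproduced with a small harness like this (a sketch under assumptions: clickhouse-client is on the PATH and the query is a placeholder; the real flags used are --secure and --query):

```python
import statistics
import subprocess
import time

def time_cmd(cmd, runs=5):
    """Return the median wall-clock time (seconds) of running `cmd` `runs` times."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def time_query(secure, query="SELECT 1"):
    """Time one clickhouse-client round-trip, with (--secure) or without TLS."""
    cmd = ["clickhouse-client"] + (["--secure"] if secure else []) + ["--query", query]
    return time_cmd(cmd)

if __name__ == "__main__":
    print("plain: %.3f s, tls: %.3f s" % (time_query(False), time_query(True)))
```

Repeated runs with the median reduce noise from connection setup; a large gap between the two numbers points at handshake cost (or, in this thread, at the entropy-gathering path taken during the handshake).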
Can you please check this: https://github.com/ClickHouse/ClickHouse/issues/5160#issuecomment-488498503
I have now tested the timeout parameter connect_timeout_with_failover_ms on the latest version 20.1.2.4, even setting it to 10000 ms. Still the same "Timeout exceeded" error (well before the 10 sec timeout). Please be aware that we do not use replicas in this cluster.
Queries run fine on the cluster if TLS is disabled.
After a while HTTPS client connections stopped working and we saw the following error message:
2020.01.27 15:19:08.159844 [ 104 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = SSL Exception: error:2400006E:random number generator::error retrieving entropy
0. 0xbc3a63c Poco::IOException::IOException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
1. 0x9ba0579 Poco::Net::NetException::NetException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
2. 0x9d18dd9 Poco::Net::SSLException::SSLException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
3. 0x9d1f698 Poco::Net::SecureSocketImpl::handleError(int) in /usr/bin/clickhouse
4. 0x9d1fc43 Poco::Net::SecureSocketImpl::completeHandshake() in /usr/bin/clickhouse
5. 0x9d201a0 Poco::Net::SecureSocketImpl::receiveBytes(void*, int, int) in /usr/bin/clickhouse
6. 0x9b93664 Poco::Net::HTTPSession::receive(char*, int) in /usr/bin/clickhouse
7. 0x9b936cb Poco::Net::HTTPSession::get() in /usr/bin/clickhouse
8. 0x9b8d602 Poco::Net::HTTPHeaderStreamBuf::readFromDevice(char*, long) in /usr/bin/clickhouse
9. 0x9b8d4b8 Poco::BasicBufferedStreamBuf<char, std::__1::char_traits<char>, Poco::Net::HTTPBufferAllocator>::underflow() in /usr/bin/clickhouse
10. 0xc64d03a std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow() in /usr/bin/clickhouse
11. 0xc654226 std::__1::basic_istream<char, std::__1::char_traits<char> >::get() in /usr/bin/clickhouse
12. 0x9b8fee3 Poco::Net::HTTPRequest::read(std::__1::basic_istream<char, std::__1::char_traits<char> >&) in /usr/bin/clickhouse
13. 0x9b92b21 Poco::Net::HTTPServerRequestImpl::HTTPServerRequestImpl(Poco::Net::HTTPServerResponseImpl&, Poco::Net::HTTPServerSession&, Poco::Net::HTTPServerParams*) in /usr/bin/clickhouse
14. 0x9bb8cc1 Poco::Net::HTTPServerConnection::run() in /usr/bin/clickhouse
15. 0x9bb5677 Poco::Net::TCPServerConnection::start() in /usr/bin/clickhouse
16. 0x9bb5a6d Poco::Net::TCPServerDispatcher::run() in /usr/bin/clickhouse
17. 0xbca44ff Poco::PooledThread::run() in /usr/bin/clickhouse
18. 0xbca15c8 Poco::ThreadImpl::runnableEntry(void*) in /usr/bin/clickhouse
19. 0xbca2e69 ? in /usr/bin/clickhouse
20. 0x7fa3 start_thread in /lib/x86_64-linux-gnu/libpthread-2.28.so
21. 0xf94cf clone in /lib/x86_64-linux-gnu/libc-2.28.so
(version 20.1.2.4 (official build)
Check cat /proc/sys/kernel/random/entropy_avail.
Try rng-tools or haveged on the test cluster (https://www.cyberciti.biz/open-source/debian-ubuntu-centos-linux-setup-additional-entropy-for-server-using-aveged-rng-tools-utils/) to check whether entropy starvation is the reason.
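The entropy check can be automated with a small watcher like this (a sketch; the 1000-bit threshold is an assumption, and the /proc path is Linux-only):

```python
import time

ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"  # Linux-only

def read_entropy(path=ENTROPY_PATH):
    """Return the kernel's current entropy-pool estimate in bits."""
    with open(path) as f:
        return int(f.read().strip())

def watch_entropy(threshold=1000, interval=5.0, path=ENTROPY_PATH):
    """Print a warning whenever the pool drops below `threshold` bits."""
    while True:
        avail = read_entropy(path)
        if avail < threshold:
            print("entropy low: %d bits" % avail)
        time.sleep(interval)
```

Running this alongside the load test would show whether the pool is actually draining when the TLS errors appear (in this thread it stayed around 3400, which argues against plain starvation).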
Also:
1) Do you face that issue only for Distributed queries?
2) Does clickhouse-client connected from the same node to the secure port work properly?
3) Do you see any errors in the logs of the node initiating the query, or in the logs of the shards, when the exception happens?
4) You say you're using HTTPS too; did you configure the ports for the Distributed engine correctly (didn't mix up https_port and tcp_port_secure)?
5) When you get that exception, are all cluster nodes updated and clickhouse-server restarted?
6) If you are familiar with docker/docker-compose, can you create a minimal sample setup reproducing that issue?
- check cat /proc/sys/kernel/random/entropy_avail. Try rng-tools or haveged at the test cluster (https://www.cyberciti.biz/open-source/debian-ubuntu-centos-linux-setup-additional-entropy-for-server-using-aveged-rng-tools-utils/) to check that entropy starvation is the reason.
We already have haveged running on our servers; cat /proc/sys/kernel/random/entropy_avail gives back numbers around 3400.
- do you face that issue only for Distributed queries?
Locally the query runs fine.
- clickhouse-client connected from the same node to secure port works properly?
When we run clickhouse-client -s, we get the following error message:
ClickHouse client version 20.1.2.4 (official build).
Connecting to localhost:9440 as user default.
Code: 210. DB::NetException: SSL Exception: error:14000438:SSL routines::tlsv1 alert internal error (localhost:9440)
In the server error log the following message appears:
2020.01.28 08:49:09.364718 [ 100 ] {} <Error> ServerErrorHandler: Code: 210, e.displayText() = DB::NetException: SSL Exception: error:2400006E:random number generator::error retrieving entropy, while reading from socket ([::1]:16308), Stack trace (when copying this message, always include the lines below):
0. 0xbc31d9c Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
1. 0x4f6ccd9 DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
2. 0x4cfc854 ? in /usr/bin/clickhouse
3. 0x4ff0111 DB::TCPHandler::runImpl() in /usr/bin/clickhouse
4. 0x4ff0d5c DB::TCPHandler::run() in /usr/bin/clickhouse
5. 0x9bb5677 Poco::Net::TCPServerConnection::start() in /usr/bin/clickhouse
6. 0x9bb5a6d Poco::Net::TCPServerDispatcher::run() in /usr/bin/clickhouse
7. 0xbca44ff Poco::PooledThread::run() in /usr/bin/clickhouse
8. 0xbca15c8 Poco::ThreadImpl::runnableEntry(void*) in /usr/bin/clickhouse
9. 0xbca2e69 ? in /usr/bin/clickhouse
10. 0x7fa3 start_thread in /lib/x86_64-linux-gnu/libpthread-2.28.so
11. 0xf94cf clone in /lib/x86_64-linux-gnu/libc-2.28.so
(version 20.1.2.4 (official build))
- Do you see any errors in the logs of the node initiating the query, or in the logs of the shards, when the exception happens?
See 2.
- You say you're using HTTPS too; did you configure the ports for the Distributed engine correctly (didn't mix up https_port and tcp_port_secure)?
Ports and the secure flag were set correctly.
- when you get that exception - are all cluster nodes updated & clickhouse-server restarted?
After updating and restarting the clickhouse-server process, we verified manually that the correct version is running on all the servers.
- if you are familiar with docker/docker-compose - can you create a minimal sample setup reproducing that issue?
Yes, we have a docker-compose setup with the same data model, plus scripts to inject random data and simulate load. We are able to replicate the entropy issue. It does not appear directly after startup but after some time (~1.5 h).
I saw that ClickHouse switched from LibreSSL to OpenSSL. To recap the issues:
Do you use any non-standard ssl-related settings?
If you have adjusted the list of ciphers (or other SSL-related settings) in the <server> section, then you should make the same adjustments for the client too, in the section just below (there is a misleading comment saying it's used only for dictionaries).
For clickhouse-client you need to adjust client settings separately (in /etc/clickhouse-client/config.xml)
Can you try to connect to ClickHouse via openssl command-line tool?
See https://www.feistyduck.com/library/openssl-cookbook/online/ch-testing-with-openssl.html
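An equivalent handshake check can also be scripted with Python's ssl module (a sketch; host and port are placeholders, and certificate verification is disabled because the test setup uses a self-signed certificate):

```python
import socket
import ssl

def tls_probe(host, port=9440, timeout=5.0):
    """Attempt a TLS handshake (like `openssl s_client`) and report the
    negotiated protocol and cipher, or the failure reason."""
    ctx = ssl.create_default_context()
    # The cluster in this thread uses a self-signed certificate,
    # so verification is disabled for the probe.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return ("ok", tls.version(), tls.cipher()[0])
    except ssl.SSLError as exc:
        # An entropy-starved server answers with "tlsv1 alert internal error".
        return ("ssl-error", exc.reason, None)
    except OSError as exc:
        return ("net-error", str(exc), None)
```

For example, tls_probe("ch01", 9440) would target the native secure port and tls_probe("click1", 8443) the HTTPS port; a healthy node returns the negotiated protocol and cipher, while a broken one returns the alert reason.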
We use the standard settings on the server and the client:
<disableProtocols>sslv2,sslv3</disableProtocols>
<preferServerCiphers>true</preferServerCiphers>
With the docker-compose test setup (3 ZooKeeper and 3 ClickHouse servers, debian:buster) and a minimal table, we are able to replicate the issue.
CREATE TABLE test ON CLUSTER cluster_test (
`datum` DateTime,
`id` Int32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(datum)
ORDER BY id;
CREATE TABLE test_all ON CLUSTER cluster_test
(
`datum` DateTime,
`id` Int32
)
ENGINE = Distributed(cluster_test, default, test, rand());
We insert incremental ids with a fixed date and run a simple query in a loop (Python process pool).
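The query loop can be sketched roughly like this (assumptions: the host/port in BASE and the sample query are placeholders, and verification is skipped because the test certificates are self-signed):

```python
import multiprocessing
import ssl
import urllib.parse
import urllib.request

BASE = "https://click1:8443"  # hypothetical HTTPS endpoint of a test node

def build_url(query, base=BASE):
    """URL-encode a query for ClickHouse's HTTP(S) interface."""
    return base + "/?" + urllib.parse.urlencode({"query": query})

def run_query(i):
    """One worker iteration: run the sample query and return the HTTP status."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # self-signed test certificate
    ctx.verify_mode = ssl.CERT_NONE
    url = build_url("SELECT count() FROM test_all WHERE id = %d" % (i % 1000))
    with urllib.request.urlopen(url, timeout=10, context=ctx) as resp:
        return resp.status

if __name__ == "__main__":
    # Roughly 20 querying processes per client mirrors the load under
    # which the entropy errors started appearing after some time.
    with multiprocessing.Pool(20) as pool:
        for status in pool.imap_unordered(run_query, range(100000)):
            if status != 200:
                print("non-200 response:", status)
```

Once the server hits the entropy error, each worker starts failing with the same "tlsv1 alert internal error" seen in the requests tracebacks later in this thread.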
In the beginning the TLS connections work; docker exec -it click1 openssl s_client -connect click1:8443 gives the following output:
CONNECTED(00000003)
Can't use SSL_get_servername
depth=0 CN = localhost
verify error:num=18:self signed certificate
verify return:1
depth=0 CN = localhost
verify return:1
---
Certificate chain
0 s:CN = localhost
i:CN = localhost
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIJAPT0ShPSztF4MA0GCSqGSIb3DQEBCwUAMBQxEjAQBgNV
BAMMCWxvY2FsaG9zdDAeFw0xOTA3MzAxMTQ4MzJaFw0yMDA3MjkxMTQ4MzJaMBQx
EjAQBgNVBAMMCWxvY2FsaG9zdDCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoC
ggEBAOPsB5PzD60NYzSpF0lRmtuyGTI9ofxfMJz3wrXxNbHzk/WeKoeQtgUVPkep
V61pDppFTpsnvr4+oO3pMGoWtIrabgL35Dd6DohJTtv5Vhk/SkpzX4b4RatMIPIP
3DU5y36xyaZZj9rgKUl8JdM7Q2gqo2SLrTovtNgatwBB3Y1kfj1jERUiBp5Xyk/h
fsogCmX+MmF6UdGbtUC3d+jsym+Am7peS9qBYV0Os4JTOAFMyYFIRTkmyKZeH+Ev
75g/hXDy2pb1zXIu3J+j/OXMiHeSHi7iMdA7AwefVjMQbDeHUNMCMs70U/id8FxW
ha/7Q27y5pD69Mdm95QwjQm/FGkCAwEAAaNTMFEwHQYDVR0OBBYEFBWuzAN+bi8i
tDTeGOVNeUj8BF2VMB8GA1UdIwQYMBaAFBWuzAN+bi8itDTeGOVNeUj8BF2VMA8G
A1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEBAH0B99VLSPzvQOFFVQVz
J68wL8bXiXR2/dPVVpsHkE+t9TDLZ73+AOK5UgpRD7/Ctckg7j5G+pG6IX1PYSGw
eGZ6+K9YWH2owh4y9QLT4G/jayD8Ri01r/GXJOeBijE/LNI3SurrLBs57uraeFv1
MIjFPJPd0g5AMptgoVEGnbxsAUK5p4DXfdG98zXOD/yTtzoH8xcWN+GE5DZdkjxe
3kcCVQxJ8tcBHYva/wiWUcwJWopj0EqYzKOiExGuMSioaQX62GKUqXqNMqQByVua
9Tn6CgiPD3zyO95+v3dm1tKD0k1V+p84Z+X8r9/Myf+WxrPFQxBwhtE9xTcg9A8B
iBc=
-----END CERTIFICATE-----
subject=CN = localhost
issuer=CN = localhost
---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 1452 bytes and written 699 bytes
Verification error: self signed certificate
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 18 (self signed certificate)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
Protocol : TLSv1.3
Cipher : TLS_AES_256_GCM_SHA384
Session-ID: 9F838585E9259C8F1B0B18F285383E1AFD91DA3E42ACB5AFB4258AD2EBEEE556
Session-ID-ctx:
Resumption PSK: 576CD1F3ACA4886A8A734F8E517AB38C0888268AF096EB6943DBB5927106E9CE97AF3C0FECC413FA11ABF80123C1F5DD
PSK identity: None
PSK identity hint: None
SRP username: None
TLS session ticket lifetime hint: 7200 (seconds)
TLS session ticket:
0000 - 46 90 54 c9 7b 0f 0e 16-2e d9 22 9f 65 9a a0 00 F.T.{.....".e...
0010 - ed e0 9b fe 1b 53 91 b3-62 83 30 c3 af c4 a1 bf .....S..b.0.....
0020 - 17 3d 8c 29 58 9a 24 4e-d3 88 50 98 90 61 ae 64 .=.)X.$N..P..a.d
0030 - 09 9c 74 67 f5 40 14 de-ae 1c c9 7b 1e e2 77 c1 ..tg.@.....{..w.
0040 - d3 40 42 a1 12 8e b9 fb-20 99 b0 f2 b8 45 e2 4b .@B..... ....E.K
0050 - 8b 1a 35 a3 35 88 b6 42-36 f7 b4 3f 4d 47 27 6f ..5.5..B6..?MG'o
0060 - 13 22 5a 97 81 46 19 d6-16 f4 22 9e 9c cb 5b 7d ."Z..F...."...[}
0070 - e8 1a 37 84 02 2e 3a 4f-c8 ed 0d 9d c5 83 12 55 ..7...:O.......U
0080 - 88 2d 55 86 e3 ce 61 a2-18 9e ff ae 28 6b eb 10 .-U...a.....(k..
0090 - 35 30 a0 f2 65 fc c2 4d-e4 53 0e bb 26 b4 9d 77 50..e..M.S..&..w
00a0 - 10 44 3c 2d 3b 58 c4 9f-a8 23 23 29 98 26 9a 25 .D<-;X...##).&.%
00b0 - 0e ed a0 e4 ab 92 75 53-40 3c a5 98 cf 21 5b 20 ......uS@<...![
00c0 - 52 7c 98 dd 53 ee 14 dc-0b b1 66 90 ad fb c1 53 R|..S.....f....S
Start Time: 1580217416
Timeout : 7200 (sec)
Verify return code: 18 (self signed certificate)
Extended master secret: no
Max Early Data: 0
---
read R BLOCK
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
Protocol : TLSv1.3
Cipher : TLS_AES_256_GCM_SHA384
Session-ID: 3E20FFF2B91760C6C9F24818ED4A14087DFB0C50468DB332471AC2B168835D22
Session-ID-ctx:
Resumption PSK: 334FE16C82762EE9713EA014D37FDFB90AA6BC242A112842AAE3D53AE01ACE4B60642382A7999387B0C009087C28557B
PSK identity: None
PSK identity hint: None
SRP username: None
TLS session ticket lifetime hint: 7200 (seconds)
TLS session ticket:
0000 - 46 90 54 c9 7b 0f 0e 16-2e d9 22 9f 65 9a a0 00 F.T.{.....".e...
0010 - 51 74 3e d4 af f4 9a 34-b0 df e8 9a c2 2d f3 fd Qt>....4.....-..
0020 - 1f 76 df 7a a4 41 51 25-51 78 17 7e ce 43 a6 0b .v.z.AQ%Qx.~.C..
0030 - 69 da 8e 0d f9 f3 46 06-17 da c7 50 5b 2d 3f b2 i.....F....P[-?.
0040 - c9 ca c1 9c de 50 37 5f-0b 10 73 48 2d 46 39 16 .....P7_..sH-F9.
0050 - 24 84 3c 62 b7 88 2e 5f-ff 44 eb b1 9b 44 f6 9b $.<b..._.D...D..
0060 - 89 86 2e 6f 46 26 bc f9-e7 94 a3 a7 e6 cb 61 d8 ...oF&........a.
0070 - a5 5d 7b 24 5c 77 ba 02-fd 7b 8d da ff 8c a5 4c .]{$\w...{.....L
0080 - 9d 23 5d 0c 4f a5 14 2f-e5 3e 1d 2d 04 07 a4 9f .#].O../.>.-....
0090 - 39 09 d7 da 78 c5 f7 c7-8f 88 6e ef 1a e9 9c 5e 9...x.....n....^
00a0 - e7 b9 61 cc 4a 37 7b a9-66 2c 45 61 bc c2 85 35 ..a.J7{.f,Ea...5
00b0 - be eb 04 60 e3 be 3c aa-bc 00 19 3a 28 80 ea 42 ...`..<....:(..B
00c0 - 1d 29 31 c4 be a2 b7 fa-bd d5 8b 2e 2d 5e fc fd .)1.........-^..
Start Time: 1580217416
Timeout : 7200 (sec)
Verify return code: 18 (self signed certificate)
Extended master secret: no
Max Early Data: 0
---
read R BLOCK
^C
After around 40 minutes, the TLS connections begin to fail and docker exec -it click1 openssl s_client -connect click1:8443 shows the following output:
CONNECTED(00000003)
140359067051136:error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error:../ssl/record/rec_layer_s3.c:1544:SSL alert number 80
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 293 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
All 3 server instances now show the following error:
2020.01.28 13:55:07.612275 [ 60 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = SSL Exception: error:2400006E:random number generator::error retrieving entropy
0. 0xbc3a63c Poco::IOException::IOException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
1. 0x9ba0579 Poco::Net::NetException::NetException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
2. 0x9d18dd9 Poco::Net::SSLException::SSLException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
3. 0x9d1f698 Poco::Net::SecureSocketImpl::handleError(int) in /usr/bin/clickhouse
4. 0x9d1fc43 Poco::Net::SecureSocketImpl::completeHandshake() in /usr/bin/clickhouse
5. 0x9d201a0 Poco::Net::SecureSocketImpl::receiveBytes(void*, int, int) in /usr/bin/clickhouse
6. 0x9b93664 Poco::Net::HTTPSession::receive(char*, int) in /usr/bin/clickhouse
7. 0x9b936cb Poco::Net::HTTPSession::get() in /usr/bin/clickhouse
8. 0x9b8d602 Poco::Net::HTTPHeaderStreamBuf::readFromDevice(char*, long) in /usr/bin/clickhouse
9. 0x9b8d4b8 Poco::BasicBufferedStreamBuf<char, std::__1::char_traits<char>, Poco::Net::HTTPBufferAllocator>::underflow() in /usr/bin/clickhouse
10. 0xc64d03a std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow() in /usr/bin/clickhouse
11. 0xc654226 std::__1::basic_istream<char, std::__1::char_traits<char> >::get() in /usr/bin/clickhouse
12. 0x9b8fee3 Poco::Net::HTTPRequest::read(std::__1::basic_istream<char, std::__1::char_traits<char> >&) in /usr/bin/clickhouse
13. 0x9b92b21 Poco::Net::HTTPServerRequestImpl::HTTPServerRequestImpl(Poco::Net::HTTPServerResponseImpl&, Poco::Net::HTTPServerSession&, Poco::Net::HTTPServerParams*) in /usr/bin/clickhouse
14. 0x9bb8cc1 Poco::Net::HTTPServerConnection::run() in /usr/bin/clickhouse
15. 0x9bb5677 Poco::Net::TCPServerConnection::start() in /usr/bin/clickhouse
16. 0x9bb5a6d Poco::Net::TCPServerDispatcher::run() in /usr/bin/clickhouse
17. 0xbca44ff Poco::PooledThread::run() in /usr/bin/clickhouse
18. 0xbca15c8 Poco::ThreadImpl::runnableEntry(void*) in /usr/bin/clickhouse
19. 0xbca2e69 ? in /usr/bin/clickhouse
20. 0x7fa3 start_thread in /lib/x86_64-linux-gnu/libpthread-2.28.so
21. 0xf94cf __clone in /lib/x86_64-linux-gnu/libc-2.28.so
(version 20.1.2.4 (official build)
cat /proc/sys/kernel/random/entropy_avail shows: 3812.
All further attempts to use TLS fail.
I was able to hit the issue faster by inserting only a little data in the beginning and then hitting the 3 servers with many queries (20 querying processes per server). The issue happened 14 minutes later.
Could be related to https://github.com/ClickHouse/ClickHouse/issues/4470#issuecomment-576466122
@dstruck what Linux distro do you use?
And the version of openssl?
dpkg -l 'openssl'
+++-==============-================-============-=
ii openssl 1.1.1d-0+deb10u2 amd64
- And the version of openssl?
At that point CH should use the bundled OpenSSL (unless you rebuilt ClickHouse with custom flags).
20.1 actually uses a trunk version, 3.0.0-dev, taken directly from the master branch with a few customizations.
https://github.com/ClickHouse-Extras/openssl/commits/c74e7895eb0d219d4007775eec134dd2bcd9d1ae
@alexey-milovidov maybe we should try switching to one of the mature & stable branches instead, for example https://github.com/openssl/openssl/tree/OpenSSL_1_1_1-stable (which is actively maintained and supported too)
See https://github.com/ClickHouse/ClickHouse/pull/8218
But that can affect only 20.1 (19.17 is still on LibreSSL).
Looking at just the TLS issue in https://github.com/ClickHouse/ClickHouse/issues/8886, I can see a failed call during the handshake.
Stracing the CH HTTP handler, I see:
09:04:36 shmget(0x72, 1, 000) = -1 ENOENT (No such file or directory)
09:04:36 uname({sysname="Linux", nodename="clickhouse-demo2", ...}) = 0
(note, I don't see this code path at all during a successful handshake)
0x72 is OPENSSL_RAND_SEED_DEVRANDOM_SHM_ID and probably comes from this line:
https://github.com/openssl/openssl/blob/OpenSSL_1_1_1d/crypto/rand/rand_unix.c#L391
What I don't understand is the code comment that follows that... "Check the kernel's version and fail if it is too recent."
Also, one of the customizations seems related: https://github.com/ClickHouse-Extras/openssl/commit/a38119230ea8491c28dd80963a8dfc3a4fffe213
Yeah, that doesn't exactly match the upstream commit https://github.com/openssl/openssl/commit/9484b67dfb0fc69326b4d94c2040751b205baa24#diff-764f070a63820217e9abd41ff93edc3bL104-R108
@dstruck Can you please recheck on 19.17.9? The fix for the timeouts was released there.
Will do. I can't build the Docker test environment at the moment; I get a BADSIG error (also on the servers). Reported the issue in a separate ticket: #9063
Works again. The BADSIG was probably reported temporarily due to updates to the repository.
Unfortunately I can only replicate the TLS issues with the 20.x series in the test setup. I have tested version 20.1.4.14 and the problem persists.
20.1.2.4
requests.exceptions.SSLError: HTTPSConnectionPool(host='localhost', port=9445): Max retries exceeded with url: /?query=SELECT+count%28%29+FROM+test_all+WHERE+id+%3D+10%3B (Caused by SSLError(SSLError(1, '[SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1056)')))
20.1.4.14
requests.exceptions.SSLError: HTTPSConnectionPool(host='localhost', port=9445): Max retries exceeded with url: /?query=SELECT+count%28%29+FROM+test_all+WHERE+id+%3D+10%3B (Caused by SSLError(SSLError(1, '[SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:2488)')))
That means the problem introduced by https://github.com/ClickHouse/ClickHouse/pull/8128 was successfully fixed by https://github.com/ClickHouse/ClickHouse/pull/9026.
The issue is still present in version 20.1.4.14.
Should I open a new issue for this? I can also provide a Docker test setup showing how to replicate the issue.
@dstruck yes, it will be backported, but not yet https://github.com/ClickHouse/ClickHouse/commit/7cd87125c72a1e779baa609606d6d8dae60b97e0 (see the current tags v20.2.1.2443-prestable v20.2.1.2442-testing v20.2.1.2425-testing v20.2.1.2412-testing v20.2.1.2394-testing v20.2.1.2386-testing v20.2.1.2382-testing v20.2.1.2379-testing v20.2.1.2372-testing v20.2.1.2369-testing v20.2.1.2362-testing v20.2.1.2359-testing v20.2.1.2354-testing v20.2.1.2352-testing v20.2.1.2343-testing v20.2.1.2337-testing)
Thank you for the feedback.
I will then wait and re-test when the backported version is released.
There is also another ssl issue in 20.1, reported here #8886, PR with some progress & discussion here: #8956
Feedback: latest stable version 20.1.6.30 runs fine with intra-cluster TLS connections (secure=1) and client -> clickhouse server TLS connections.