Describe the bug
Some tables get stuck in read-only mode and cannot recover from that state.
Detaching the read-only table times out every time.
Restarting the ClickHouse node helps, but it is very inconvenient.
How to reproduce
Expected behavior
The table can recover from read-only mode by itself, or I can reload the read-only table manually with a detach/attach operation.
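For reference, the manual reload would look like this (a sketch using the database and table names from the report below; in the stuck state reported here, the DETACH itself times out):

DETACH TABLE adshonor.locked_request_201904260200;
ATTACH TABLE adshonor.locked_request_201904260200;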
Error message and/or stacktrace
The ZooKeeper session has also expired; I suspect that is what put the table into read-only mode.
database: adshonor
table: locked_request_201904260200
is_leader: 0
is_readonly: 1
is_session_expired: 1
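(A status like the one above can be read from the system.replicas system table; a minimal query over its standard columns:)

SELECT database, table, is_leader, is_readonly, is_session_expired
FROM system.replicas
WHERE is_readonly OR is_session_expired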
ClickHouse server error log:
void DB::AsynchronousMetrics::update(): Cannot get replica delay for table: adshonor.locked_request_201904260200: Code: 242, e.displayText() = DB::Exception: Table is in readonly mode, Stack trace:
Additional context
I am not a ClickHouse expert (just a recent user), but I have seen a scenario like this before.
At that time, "XID overflow" was the direct cause of this issue.
(Restarting the server resolves the error, and that is what I did at the time, but some auto-recovery mechanism should be provided for this situation.)
Can @github1youlc check if there is an error message like "XID overflow" above that error?
Thank you for this clue; I found the occurrence in the clickhouse-server error log.
2019.04.10 01:12:49.884863 [ 35 ] {}
2019.04.10 01:12:49.954709 [ 43 ] {}
I figured out the problem in my case, and I made a fix here: 9527
ClickHouse$ git diff
diff --git a/contrib/ssl b/contrib/ssl
--- a/contrib/ssl
+++ b/contrib/ssl
@@ -1 +1 @@
-Subproject commit ba8de796195ff9d8bb0249ce289b83226b848b77
+Subproject commit ba8de796195ff9d8bb0249ce289b83226b848b77-dirty
diff --git a/dbms/src/Common/ZooKeeper/ZooKeeperImpl.cpp b/dbms/src/Common/ZooKeeper/ZooKeeperImpl.cpp
index 4abb97f..41e18b9 100644
--- a/dbms/src/Common/ZooKeeper/ZooKeeperImpl.cpp
+++ b/dbms/src/Common/ZooKeeper/ZooKeeperImpl.cpp
@@ -1430,6 +1430,8 @@ void ZooKeeper::pushRequest(RequestInfo && info)
if (!info.request->xid)
{
info.request->xid = next_xid.fetch_add(1);
+ if (info.request->xid == close_xid)
+ throw Exception("xid equal to close_xid", ZSESSIONEXPIRED);
if (info.request->xid < 0)
throw Exception("XID overflow", ZSESSIONEXPIRED);
}
diff --git a/dbms/src/Common/ZooKeeper/ZooKeeperImpl.h b/dbms/src/Common/ZooKeeper/ZooKeeperImpl.h
index 2486857..bfeac5e 100644
--- a/dbms/src/Common/ZooKeeper/ZooKeeperImpl.h
+++ b/dbms/src/Common/ZooKeeper/ZooKeeperImpl.h
@@ -180,7 +180,7 @@ private:
int64_t session_id = 0;
- std::atomic<XID> next_xid {1};
+ std::atomic<XID> next_xid {((1 << 30) - 128) << 1};
std::atomic<bool> expired {false};
std::mutex push_request_mutex;
diff --git a/debian/changelog b/debian/changelog
index 06ae50f..a1fd2e7 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,5 +1,5 @@
-clickhouse (19.5.3.1) unstable; urgency=low
+clickhouse (19.5.3) unstable; urgency=low
* Modified source code
- -- clickhouse-release <[email protected]> Mon, 15 Apr 2019 21:51:50 +0300
+ -- XXX <XXX@XXX> Wed, 22 May 2019 11:55:05 +0000
I modified the initial value of next_xid to make XID overflow easier to reproduce.
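(For context: ((1 << 30) - 128) << 1 = 2^31 - 256, i.e. 256 increments below the signed 32-bit overflow point, so next_xid turns negative after roughly 256 requests. A compile-time check of that arithmetic:)

#include <cstdint>

// The seed sits 256 below INT32_MAX + 1, so the "XID overflow" branch
// in pushRequest fires after ~256 allocations.
static_assert((((1 << 30) - 128) << 1) == INT32_MAX - 255, "seed is 2^31 - 256");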
I sent INSERTs into a replicated table about 180 times and got the following variations of errors:
2 <Response [500]> Code: 225, e.displayText() = DB::Exception: ZooKeeper session has been expired. (version 19.5.3.1)
3 <Response [500]> Code: 225, e.displayText() = DB::Exception: ZooKeeper session has been expired. (version 19.5.3.1)
... (NOTE: this part is caused by XID overflow in the 2nd request; it could happen because I also sent some other INSERTs to another table)
38 <Response [500]> Code: 225, e.displayText() = DB::Exception: ZooKeeper session has been expired. (version 19.5.3.1)
39 <Response [500]> Code: 225, e.displayText() = DB::Exception: ZooKeeper session has been expired. (version 19.5.3.1)
40 <Response [200]>
41 <Response [200]>
... (NOTE: 42 to 50 are all 200; likewise, all numbers omitted below are 200)
51 <Response [500]> Code: 244, e.displayText() = DB::Exception: Unrecoverable network error while adding block 13 with ID 'all_8286278104605411708_8426542356555497121': Session expired (version 19.5.3.1)
52 <Response [200]>
53 <Response [200]>
54 <Response [200]>
55 <Response [200]>
75 <Response [500]> Code: 244, e.displayText() = DB::Exception: Unrecoverable network error while adding block 37 with ID 'all_12886400267097872502_4893161622399995132': Session expired (version 19.5.3.1)
76 <Response [200]>
77 <Response [200]>
78 <Response [200]>
79 <Response [200]>
88 <Response [500]> Code: 242, e.displayText() = DB::Exception: Table is in readonly mode (version 19.5.3.1)
89 <Response [200]>
90 <Response [200]>
91 <Response [200]>
92 <Response [200]>
100 <Response [500]> Code: 244, e.displayText() = DB::Exception: Unrecoverable network error while adding block 61 with ID 'all_4654505216073472627_8127550766113632338': Session expired (version 19.5.3.1)
101 <Response [200]>
102 <Response [200]>
103 <Response [200]>
104 <Response [200]>
124 <Response [500]> Code: 244, e.displayText() = DB::Exception: Unrecoverable network error while adding block 85 with ID 'all_15103531304195667749_8570370559661181307': Session expired (version 19.5.3.1)
125 <Response [200]>
126 <Response [200]>
127 <Response [200]>
128 <Response [200]>
148 <Response [500]> Code: 244, e.displayText() = DB::Exception: Unrecoverable network error while adding block 109 with ID 'all_16942967266523965882_6650752932106655323': Session expired (version 19.5.3.1)
149 <Response [200]>
150 <Response [200]>
151 <Response [200]>
152 <Response [200]>
161 <Response [500]> Code: 242, e.displayText() = DB::Exception: Table is in readonly mode (version 19.5.3.1)
162 <Response [200]>
163 <Response [200]>
164 <Response [200]>
165 <Response [200]>
173 <Response [500]> Code: 244, e.displayText() = DB::Exception: Unrecoverable network error while adding block 133 with ID 'all_8196154836730591930_7365956032707223292': Session expired (version 19.5.3.1)
174 <Response [200]>
175 <Response [200]>
176 <Response [200]>
177 <Response [200]>
Note that the behavior of 'Table is in readonly mode' changes from "dead and never recovers" to "dies momentarily but revives soon".
I have some questions about this issue:
1. When an INSERT fails partially (i.e. on one replica/shard), how can we maintain data consistency? (Especially when we are using both sharding and replication?)
2. Is there a way to monitor the next_xid value of clickhouse-server (perhaps via metrics)? (My intention is to know about it and restart ClickHouse before the XID really overflows.)
3. Shouldn't the error be ZXIDOVERFLOW (or something like this) instead of ZSESSIONEXPIRED?
Any news? I have the same problem
@243f6a8885a308d313198a2e037
The fix that @github1youlc made is only for the deadlock. Now, when the xid overflows, the ZooKeeper session is forcefully expired and ClickHouse establishes a new ZooKeeper session; everything continues to work normally, just as after a network error.
You made changes to check what happens if the xid overflows very quickly. In that case, the ZooKeeper session is re-established all the time and the system cannot function normally.
We can avoid session expiration on xid overflow by simply allowing the xid to overflow but skipping the reserved values... you can use an atomic compare-and-swap instead of an atomic increment for this purpose. In normal circumstances, the xid cannot overflow more frequently than the operation timeout, so the overflow won't do harm.
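A minimal sketch of that idea, assuming XID is a signed 32-bit integer and that close_xid is the reserved session-close value from ZooKeeperImpl.h (the exact constant and the surrounding code are assumptions of this sketch, not the actual patch):

#include <atomic>
#include <cstdint>

using XID = int32_t;

// Reserved xid used for the session-close request (assumed to be
// 0x7FFFFFFF here, as in the ZooKeeperImpl versions discussed above).
constexpr XID close_xid = 0x7FFFFFFF;

std::atomic<XID> next_xid{1};

// Allocate an xid with compare-and-swap instead of fetch_add: when the
// counter would reach a reserved or negative value it wraps back to 1,
// so overflow never forces a session expiration.
XID allocate_xid()
{
    XID current = next_xid.load();
    while (true)
    {
        // Hand out a safe value; wrap around the reserved/negative range.
        XID candidate = (current <= 0 || current >= close_xid) ? 1 : current;
        if (next_xid.compare_exchange_weak(current, candidate + 1))
            return candidate;
        // CAS failed: `current` was reloaded with the fresh value; retry.
    }
}

With an allocator like this, the fetch_add in pushRequest and both overflow throws become unnecessary: the counter wraps at most once per ~2^31 requests, which, as noted above, is far less frequent than the operation timeout.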
@alexey-milovidov Thank you for your reaction.
I have recently been wondering what happens when INSERT, CREATE TABLE, or DROP TABLE queries fail partially due to ZSESSIONEXPIRED (in general, not only from xid overflow).
At the least, I ran into a "we can no longer drop/create a table with this name" situation after a partial failure of CREATE TABLE / DROP TABLE. (This means that xid overflow, which causes ZSESSIONEXPIRED, can be harmful.)
CREATE TABLE IF NOT EXISTS foo.bar ON CLUSTER replicated_cluster_m
(
`P` Int32,
`Q` Int64,
`R` Int16
)
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/foo/{shard}/bar', '{server}', (Q, R))
ORDER BY P
DROP TABLE IF EXISTS foo.bar ON CLUSTER replicated_cluster_m
Code: 999, e.displayText() = Coordination::Exception: xid equal to close_xid (Session expired) (version 19.5.3.1) on server server_1
Code: 253, e.displayText() = DB::Exception: Replica /clickhouse/tables/foo/1/bar/replicas/server_0 already exists. (version 19.5.3.1)
Code: 242, e.displayText() = DB::Exception: Can't drop readonly replicated table (need to drop data in ZooKeeper as well) (version 19.5.3.1)
Code: 253, e.displayText() = DB::Exception: Replica /clickhouse/tables/foo/2/bar/replicas/server_3 already exists. (version 19.5.3.1)
Code: 305, e.displayText() = DB::Exception: Table was not dropped because ZooKeeper session has expired. (version 19.5.3.1)
Code: 253, e.displayText() = DB::Exception: Replica /clickhouse/tables/foo/1/bar/replicas/server_0 already exists. (version 19.5.3.1)
What I want to ask are the following three questions:
1. During an INSERT there are several transactions with ZooKeeper (mentioned in https://clickhouse.yandex/docs/en/operations/table_engines/replication/). Can anything like the scenarios above also happen during an INSERT?
2. You said the xid cannot overflow more frequently than the operation timeout; does avoiding the overflow here mean we should restart clickhouse-server frequently (about once per week or so)? We have experienced xid overflow after about 40 days of operation.
3. Is there a way to check next_xid from outside?
p.s. Should I create a new issue about this?
@243f6a8885a308d313198a2e037 Sorry, I missed your answer.
This issue will be solved (at least for some of these cases) in https://github.com/yandex/ClickHouse/issues/6045