This is a continuation of https://discuss.linuxcontainers.org/t/how-to-fix-only-having-two-database-nodes-in-cluster/6417. It turns out I was never able to remove the faulty entry in raft_nodes, and I still have this problem.
Currently, when I run lxc cluster ls:
$ lxc cluster ls
+---------+--------------------------+----------+--------+-------------------+--------------+
| NAME | URL | DATABASE | STATE | MESSAGE | ARCHITECTURE |
+---------+--------------------------+----------+--------+-------------------+--------------+
| chino | https://172.16.0.6:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| cocoa | https://172.16.0.7:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| mayoi | https://172.16.0.16:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| nadeko | https://172.16.0.18:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| rize | https://172.16.0.8:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| shinobu | https://172.16.0.15:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| suruga | https://172.16.0.17:8443 | YES | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| tippy | https://172.16.0.5:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| tsubasa | https://172.16.0.14:8443 | YES | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
| tsukihi | https://172.16.0.20:8443 | NO | ONLINE | fully operational | x86_64 |
+---------+--------------------------+----------+--------+-------------------+--------------+
So there are only two database nodes. Checking raft_nodes on one database node (in this case, hitagi) shows the following:
$ lxd sql local 'select * from raft_nodes'
+----+------------------+------+
| id | address | role |
+----+------------------+------+
| 2 | 172.16.0.14:8443 | 0 |
| 3 | :8443 | 2 |
| 4 | 172.16.0.16:8443 | 2 |
| 5 | 172.16.0.20:8443 | 2 |
| 7 | 172.16.0.15:8443 | 2 |
| 9 | 172.16.0.18:8443 | 2 |
| 11 | 172.16.0.8:8443 | 2 |
| 12 | 172.16.0.5:8443 | 2 |
| 13 | 172.16.0.6:8443 | 2 |
| 14 | 172.16.0.7:8443 | 2 |
| 15 | 172.16.0.17:8443 | 0 |
+----+------------------+------+
Deleting the row works temporarily, but the row containing :8443 returns after log entries like the following:
t=2020-04-04T20:19:06+0900 lvl=info msg="Upgrading -1 nodes not part of raft configuration"
t=2020-04-04T20:19:06+0900 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[:8443:{ID:3 Address::8443 Role:spare}]"
These two lines appear every ten seconds in the log, regardless of whether I have removed the row.
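For anyone checking their own cluster for the same symptom, here is a minimal sketch that picks out raft_nodes rows whose address has no host part, like the :8443 entry with id 3. It assumes the output of lxd sql local has been captured as text; parse_table and rows_missing_host are hypothetical helper names, not part of LXD, and the sample data is an abbreviated copy of the table above.

```python
# Sketch only: parse_table and rows_missing_host are hypothetical helpers,
# not LXD APIs. RAFT_NODES is an abbreviated copy of the table in this report.
RAFT_NODES = """
+----+------------------+------+
| id | address          | role |
+----+------------------+------+
| 2  | 172.16.0.14:8443 | 0    |
| 3  | :8443            | 2    |
| 4  | 172.16.0.16:8443 | 2    |
| 15 | 172.16.0.17:8443 | 0    |
+----+------------------+------+
"""


def parse_table(text):
    """Turn an ASCII table as printed by `lxd sql` into a list of dicts."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, +----+ borders and the command line itself.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # first |...| line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def rows_missing_host(rows):
    """Return rows whose address has nothing before the last ':'."""
    return [row for row in rows if not row["address"].rpartition(":")[0]]


if __name__ == "__main__":
    for row in rows_missing_host(parse_table(RAFT_NODES)):
        print(row)
```

This only reports the suspicious row; it does not modify the database.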
Interestingly, the other database node has the following raft_nodes table:
$ lxd sql local 'select * from raft_nodes'
+----+------------------+------+
| id | address | role |
+----+------------------+------+
| 2 | 172.16.0.14:8443 | 0 |
| 4 | 172.16.0.16:8443 | 2 |
| 5 | 172.16.0.20:8443 | 2 |
| 7 | 172.16.0.15:8443 | 2 |
| 9 | 172.16.0.18:8443 | 2 |
| 11 | 172.16.0.8:8443 | 2 |
| 12 | 172.16.0.5:8443 | 2 |
| 13 | 172.16.0.6:8443 | 2 |
| 14 | 172.16.0.7:8443 | 2 |
| 15 | 172.16.0.17:8443 | 0 |
+----+------------------+------+
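The difference between the two nodes' tables can also be computed rather than spotted by eye. The following is a rough sketch under the same assumption that each node's lxd sql local output has been saved as text; parse_table and rows_only_in are hypothetical names, and NODE_A and NODE_B are abbreviated copies of the two tables shown in this report.

```python
# Sketch only: helper names are hypothetical, and the two tables are
# abbreviated copies of the output from the two database nodes.
NODE_A = """
| id | address          | role |
| 2  | 172.16.0.14:8443 | 0    |
| 3  | :8443            | 2    |
| 4  | 172.16.0.16:8443 | 2    |
| 15 | 172.16.0.17:8443 | 0    |
"""

NODE_B = """
| id | address          | role |
| 2  | 172.16.0.14:8443 | 0    |
| 4  | 172.16.0.16:8443 | 2    |
| 15 | 172.16.0.17:8443 | 0    |
"""


def parse_table(text):
    """Turn the |-delimited lines of a table into a list of dicts."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # ignore borders and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def rows_only_in(first, second):
    """Return rows of `first` whose (id, address, role) is absent from `second`."""
    def key(row):
        return (row["id"], row["address"], row["role"])

    seen = {key(row) for row in second}
    return [row for row in first if key(row) not in seen]


if __name__ == "__main__":
    a, b = parse_table(NODE_A), parse_table(NODE_B)
    print("only on first node:", rows_only_in(a, b))
    print("only on second node:", rows_only_in(b, a))
```

Comparing in both directions matters here, because the node holding the extra row is not always the same one.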
Furthermore, after some combinations of stopping, starting, and restarting these two database nodes, the malformed raft_nodes table sometimes moves to the other node.
Unfortunately, since this cluster was set up a while ago, I don't have details or a way to reproduce this behavior.
@freeekanayaka
I've pushed #7138 which should hopefully fix your problem.
Thank you!