Lxd: Cannot get rid of bad entry in raft_nodes

Created on 4 Apr 2020 · 3 Comments · Source: lxc/lxd

Required information

Distribution: Ubuntu
Distribution version: 18.04.4 LTS
The output of "lxc info" or if that fails:
- suruga: https://gist.github.com/mt-caret/c1a1e64f834e6f24ca2f59052a9841f9
- tsubasa: https://gist.github.com/mt-caret/e3b51138e8369d143a545c083904b480

Issue description

This is a continuation of https://discuss.linuxcontainers.org/t/how-to-fix-only-having-two-database-nodes-in-cluster/6417. It turns out that I was never able to remove the faulty entry in raft_nodes, and I still have this problem.

Currently, when I run lxc cluster ls:

$ lxc cluster ls
+---------+--------------------------+----------+--------+-------------------+--------------+
|  NAME   |           URL            | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+---------+--------------------------+----------+--------+-------------------+--------------+
| chino   | https://172.16.0.6:8443  | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| cocoa   | https://172.16.0.7:8443  | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| mayoi   | https://172.16.0.16:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| nadeko  | https://172.16.0.18:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| rize    | https://172.16.0.8:8443  | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| shinobu | https://172.16.0.15:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| suruga  | https://172.16.0.17:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| tippy   | https://172.16.0.5:8443  | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| tsubasa | https://172.16.0.14:8443 | YES      | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+
| tsukihi | https://172.16.0.20:8443 | NO       | ONLINE | fully operational | x86_64       |
+---------+--------------------------+----------+--------+-------------------+--------------+

So there are only two database nodes. Checking raft_nodes on one of the database nodes (in this case, hitagi) shows the following:

$ lxd sql local 'select * from raft_nodes'
+----+------------------+------+
| id |     address      | role |
+----+------------------+------+
| 2  | 172.16.0.14:8443 | 0    |
| 3  | :8443            | 2    |
| 4  | 172.16.0.16:8443 | 2    |
| 5  | 172.16.0.20:8443 | 2    |
| 7  | 172.16.0.15:8443 | 2    |
| 9  | 172.16.0.18:8443 | 2    |
| 11 | 172.16.0.8:8443  | 2    |
| 12 | 172.16.0.5:8443  | 2    |
| 13 | 172.16.0.6:8443  | 2    |
| 14 | 172.16.0.7:8443  | 2    |
| 15 | 172.16.0.17:8443 | 0    |
+----+------------------+------+

Deleting the row works temporarily, but the row containing :8443 returns after log entries like the following:

t=2020-04-04T20:19:06+0900 lvl=info msg="Upgrading -1 nodes not part of raft configuration"
t=2020-04-04T20:19:06+0900 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[:8443:{ID:3 Address::8443 Role:spare}]"

These two lines appear in the log every ten seconds, regardless of whether I have removed the row.
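
To match this error against the raft_nodes rows, the fields inside the map[...] part of the message can be extracted with a short script. This is only a sketch: the regular expression is inferred from the single log line quoted above, and parse_unaccounted is a hypothetical helper name, not part of LXD.

```python
import re

# Pattern inferred from the one log line quoted above (an assumption);
# other LXD versions may format this message differently.
ENTRY = re.compile(r"\{ID:(\d+) Address:(\S*) Role:(\w+)\}")


def parse_unaccounted(line):
    """Return (id, address, role) tuples for each raft node in the message."""
    return [(int(i), addr, role) for i, addr, role in ENTRY.findall(line)]


LOG_LINE = (
    't=2020-04-04T20:19:06+0900 lvl=eror msg="Unaccounted raft node(s) not found '
    "in 'nodes' table for heartbeat: map[:8443:{ID:3 Address::8443 Role:spare}]\""
)

for node_id, address, role in parse_unaccounted(LOG_LINE):
    # Everything before the last colon is the host part of the address.
    host = address.rsplit(":", 1)[0]
    note = "(empty host)" if not host else ""
    print(node_id, repr(address), role, note)
```

For the line above, this yields ID 3 with an address that has a port but no host, which corresponds to the row with id 3 in the raft_nodes table.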

Interestingly, the other database node has the following raft_nodes table:

$ lxd sql local 'select * from raft_nodes'
+----+------------------+------+
| id |     address      | role |
+----+------------------+------+
| 2  | 172.16.0.14:8443 | 0    |
| 4  | 172.16.0.16:8443 | 2    |
| 5  | 172.16.0.20:8443 | 2    |
| 7  | 172.16.0.15:8443 | 2    |
| 9  | 172.16.0.18:8443 | 2    |
| 11 | 172.16.0.8:8443  | 2    |
| 12 | 172.16.0.5:8443  | 2    |
| 13 | 172.16.0.6:8443  | 2    |
| 14 | 172.16.0.7:8443  | 2    |
| 15 | 172.16.0.17:8443 | 0    |
+----+------------------+------+

Furthermore, after some combinations of stopping, starting, and restarting these two database nodes, the node that has the weird raft_nodes table changes.
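
Since the node holding the bad row can change, comparing the two tables by script may be easier than comparing them by eye. The following is a minimal Python sketch, assuming the output of lxd sql local has been captured as text from each database node. The sample tables are abbreviated copies of the ones above, and parse_table, extra_rows, and hostless_rows are hypothetical helper names.

```python
def parse_table(text):
    """Parse the ASCII table printed by `lxd sql` into a list of row dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first |...| line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def extra_rows(a, b):
    """Rows of table a whose id does not appear in table b."""
    ids_b = {row["id"] for row in b}
    return [row for row in a if row["id"] not in ids_b]


def hostless_rows(rows):
    """Rows whose address has a port but no host, such as ':8443'."""
    return [row for row in rows if row["address"].startswith(":")]


# Abbreviated sample data copied from the two tables in this report.
NODE_A = """
+----+------------------+------+
| id |     address      | role |
+----+------------------+------+
| 2  | 172.16.0.14:8443 | 0    |
| 3  | :8443            | 2    |
| 4  | 172.16.0.16:8443 | 2    |
+----+------------------+------+
"""

NODE_B = """
+----+------------------+------+
| id |     address      | role |
+----+------------------+------+
| 2  | 172.16.0.14:8443 | 0    |
| 4  | 172.16.0.16:8443 | 2    |
+----+------------------+------+
"""

table_a = parse_table(NODE_A)
table_b = parse_table(NODE_B)
print("only on first node:", extra_rows(table_a, table_b))
print("rows without a host:", hostless_rows(table_a))
```

With the sample data, both checks single out the row with id 3 and address :8443.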

Steps to reproduce

Unfortunately, since this cluster was set up a while ago, I don't have details or a way to reproduce this behavior.

Information to attach

[ ] Any relevant kernel output (dmesg)
[ ] Container log (lxc info NAME --show-log)
[ ] Container configuration (lxc config show NAME --expanded)
[ ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
[ ] Output of the client with --debug
[ ] Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)