Lxd: LXD database - unexpected response type 6

Created on 18 Aug 2018 · 39Comments · Source: lxc/lxd

snap have updated automatically to 3.4, and nothing works anymore.

Distribution: ubuntu
- Distribution version: 18.04.1 LTS 32 bits on atom n270
- The output of "lxc info" or if that fails:
  /snap/bin/lxc info
  Error: Failed to fetch http://unix.socket/1.0: 500 Internal Server Error
- Kernel version: 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:55:48 UTC 2018 i686 i686 i686 GNU/Linux
- LXC version:
  /snap/bin/lxc --version
  3.4
- LXD version:
  snap list
  Name Version Rev Tracking Publisher Notes
  core 16-2.34.3 5146 stable canonical core
  lxd 3.4 8385 stable canonical -
- Storage backend in use: btrfs, compressed with lzo

Issue description

snap have updated lxd to 3.4. Since then, nothing works any more.

First, i found that all containers (2) was stopped.
-I tried to list them, it was very slof, but answered.
tried to sop: sudo snap stop lxd
-> it said it did it but when i listes processes, it was still there
- tried 2,3 times to stop lxd
  
  -> said ok every time
  
  I finally had to kill lxd
- started lxd again
  
  and now, nothing work

containers are stopped
lxc command fail:
/snap/bin/lxc info
Error: Failed to fetch http://unix.socket/1.0: 500 Internal Server Error
lxd take lot of cpu/sys
in logs, it loops:
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): cannot start a transaction within a transaction" t=2018-08-18T09:51:05+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-18T09:51:05+0200
lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: unexpected response type 7" t=2018-08-18T09:51:06+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): cannot start a transaction within a transaction" t=2018-08-18T09:51:07+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-18T09:51:07+0200
lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: unexpected response type 7" t=2018-08-18T09:51:08+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): cannot start a transaction within a transaction" t=2018-08-18T09:51:09+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-18T09:51:09+0200
lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: unexpected response type 7" t=2018-08-18T09:51:10+0200

Steps to reproduce

I have done nothing
I have done nothing
I have done nothing ;-) ;-) snap did...

Information to attach

[ ] Output of the client with --debug
/snap/bin/lxc --debug info
DBUG[08-18|10:11:00] Connecting to a local LXD over a Unix socket
DBUG[08-18|10:11:00] Sending request to LXD etag= method=GET url=http://unix.socket/1.0
Error: Failed to fetch http://unix.socket/1.0: 500 Internal Server Error

dmesg:

[255820.008382] audit: type=1400 audit(1534561683.200:135): apparmor="DENIED" operation="open" profile="snap.lxd.hook.configure" name="/var/lib/snapd/hostfs/usr/lib/os-release" pid=8759 comm="snap-exec" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[255820.071777] audit: type=1400 audit(1534561683.260:136): apparmor="DENIED" operation="open" profile="snap.lxd.hook.configure" name="/var/lib/snapd/hostfs/usr/lib/os-release" pid=8768 comm="snapctl" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[255820.133003] audit: type=1400 audit(1534561683.324:137): apparmor="DENIED" operation="open" profile="snap.lxd.hook.configure" name="/var/lib/snapd/hostfs/usr/lib/os-release" pid=8777 comm="snapctl" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[255820.197571] audit: type=1400 audit(1534561683.388:138): apparmor="DENIED" operation="open" profile="snap.lxd.hook.configure" name="/var/lib/snapd/hostfs/usr/lib/os-release" pid=8783 comm="snapctl" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[255820.256379] audit: type=1400 audit(1534561683.448:139): apparmor="DENIED" operation="open" profile="snap.lxd.hook.configure" name="/var/lib/snapd/hostfs/usr/lib/os-release" pid=8792 comm="snapctl" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[255820.313201] audit: type=1400 audit(1534561683.504:140): apparmor="DENIED" operation="open" profile="snap.lxd.hook.configure" name="/var/lib/snapd/hostfs/usr/lib/os-release" pid=8800 comm="snapctl" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[264675.302793] audit: type=1400 audit(1534570538.492:141): apparmor="DENIED" operation="capable" profile="snap.lxd.lxc" pid=12601 comm="lxc" capability=2 capname="dac_read_search"
[264675.302809] audit: type=1400 audit(1534570538.492:142): apparmor="DENIED" operation="capable" profile="snap.lxd.lxc" pid=12601 comm="lxc" capability=1 capname="dac_override"
[264675.309940] audit: type=1400 audit(1534570538.500:143): apparmor="DENIED" operation="capable" profile="snap.lxd.lxc" pid=12644 comm="cat" capability=2 capname="dac_read_search"
[264675.309959] audit: type=1400 audit(1534570538.500:144): apparmor="DENIED" operation="capable" profile="snap.lxd.lxc" pid=12644 comm="cat" capability=1 capname="dac_override"

[ ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
[ ] Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)

lxd-strace.zip
lxd-log.zip

Source

gaetanquentin

All 39 comments

This is a duplicate of #4944.

There's a good chance that the issue was not actually fixed as it supposed to be. @stgraber any way we can confirm that @gaetanquentin's snap revision was built with dqlite master 515e6f7 or later? I believe that's the case, but want to double check.

freeekanayaka on 18 Aug 2018

@freeekanayaka yeah, 8385 has those commits.

@gaetanquentin can you snap refresh lxd? That should get you up to 8393. I don't think it'll help with this particular problem, but should take care of any further leftover LXD processes.

Assuming the issue still happens then, can you make a tarball of /var/snap/lxd/common/lxd/database and send that to stgraber at ubuntu dot com?

stgraber on 18 Aug 2018

@stgraber

root@nas:/var/snap/lxd# snap refresh lxd
le snap "lxd" n'a aucune mise à jour disponible

root@nas:/var/snap/lxd# ls -la
total 4
drwxr-xr-x 1 root root 50 août 18 05:07 .
drwxr-xr-x 1 root root 14 févr. 12 2018 ..
drwxr-xr-x 1 root root 6 févr. 14 2018 8298
drwxr-xr-x 1 root root 6 févr. 14 2018 8351
drwxr-xr-x 1 root root 6 févr. 14 2018 8385
drwxr-xr-x 1 root root 92 août 18 08:34 common
lrwxrwxrwx 1 root root 4 août 18 05:07 current -> 8385

gaetanquentin on 18 Aug 2018

Ah, 8385 is current but i386, that's why the revs didn't line up.

stgraber on 19 Aug 2018

@gaetanquentin can you provide the content of /var/snap/lxd/common/lxd/logs/lxd.log?

stgraber on 19 Aug 2018

Ideally, starting at when the daemon starts.

stgraber on 19 Aug 2018

And assuming that just restarting the daemon doesn't fix this, it'd be useful if you could do:

snap set lxd daemon.debug=true
systemctl reload snap.lxd.daemon

Which would then make the logs much more verbose. If you do that and still run into the issue, then again, please provide lxd.log starting at the point where LXD started and until you start seeing the repetitive log entries.

stgraber on 19 Aug 2018

@gaetanquentin ah and a full ps fauxww prior to restarting the daemon (while it still gives you the 500 errors) would be nice too.

stgraber on 19 Aug 2018

strangely now it works again . always with the same version. (8385)
lxd.log were still attached in this thread, with lxd-strace.zip, a strace of lxd

i won't stop it again for now since i need the containers running.

gaetanquentin on 19 Aug 2018

@gaetanquentin sorry I missed the tarballs, okay, so the beginning of it looks like:

lvl=info msg="Updating images" t=2018-08-18T08:35:12+0200
lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): unexpected response type 7" t=2018-08-18T08:35:16+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-18T08:35:17+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): unexpected response type 7" t=2018-08-18T08:35:17+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-18T08:35:17+0200
lvl=info msg="Done updating instance types" t=2018-08-18T08:35:17+0200
lvl=warn msg="Failed to rollback transaction after error (unexpected response type 6): unexpected response type 7" t=2018-08-18T08:35:18+0200
err="unexpected response type 6" lvl=eror msg="Failed to expire logs" t=2018-08-18T08:35:18+0200
lvl=info msg="Done expiring log files" t=2018-08-18T08:35:18+0200
lvl=warn msg="Failed to rollback transaction after error (unexpected response type 6): unexpected response type 7" t=2018-08-18T08:35:18+0200
err="unexpected response type 6" fp=4d574fd7ecf8d7f758cb212d92e8fef12876a8c0f03e5048cabbf179482919c7 lvl=eror msg="Error loading image" t=2018-08-18T08:35:18+0200
lvl=info msg="Done updating images" t=2018-08-18T08:35:18+0200

@freeekanayaka

stgraber on 19 Aug 2018

@stgraber did you receive the database content on your email?

gaetanquentin on 20 Aug 2018

@gaetanquentin yes he did and he forwarded it to me. I could not reproduce the issue so far. Did it happen again btw? If so, it'd be great if you can attach the log again, since it might be that the failure mode shown in the first logs you sent is now fixed, but there's another failure mode which we still have to catch.

freeekanayaka on 20 Aug 2018

@freeekanayaka for the moment, containers are working and lxd commands run fast. Have you tried on a slow i386 archi?

gaetanquentin on 20 Aug 2018

sometimes there are errors:
lvl=dbug msg="Database error: &errors.errorString{s:\"sql: no rows in result set\"}" t=2018-08-20T09:31:34+0200

i am trying to stop a container, it does nothing, command is still running. in logs there is that:
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:08:51+0200 url=/1.0/operations/863c56f9-4b12-4a47-bddc-969f1366c069
action=shutdown created=2017-03-14T05:47:59+0100 ephemeral=false lvl=info msg="Shutting down container" name=showfinders t=2018-08-20T14:08:51+0200 timeout=-1s used=2018-08-19T15:31:55+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:12:05+0200 url=/1.0

with stop -f , it works

snap stop lxd, worked too
snap start lxd launched the lxd daemon

but after:
lxc list
Error: Failed to fetch http://unix.socket/1.0: 500 Internal Server Error

i had to wait a minute or more for the lxd answer commands again. but it very slow again, unusable.

in logs:
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:36+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:17:36+0200 url=/1.0
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:36+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:36+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:37+0200
lvl=dbug msg="Database error: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:37+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:37+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:37+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:37+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:17:37+0200 url=/1.0
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:38+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:38+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:38+0200
lvl=dbug msg="Database error: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:38+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:38+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:39+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:39+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:17:39+0200 url=/1.0
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:39+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:39+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:40+0200
lvl=dbug msg="Database error: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:40+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:40+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:40+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:40+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:17:40+0200 url=/1.0
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:41+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:41+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:41+0200
lvl=dbug msg="Database error: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:41+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:41+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:42+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:42+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:17:42+0200 url=/1.0
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:42+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:42+0200
lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:43+0200
lvl=dbug msg="Database error: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:43+0200
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-08-20T14:17:43+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-08-20T14:17:43+0200
lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: unexpected response type 6\"}" t=2018-08-20T14:17:43+0200

[ ...]

vl=info msg="Dqlite: connected address=0 attempt=0" t=2018-08-20T14:18:25+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:18:26+0200 url=/1.0
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:18:27+0200 url=/internal/ready
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:18:48+0200 url=/1.0
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:18:50+0200 url="/1.0/containers?recursion=2"
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:20:13+0200 url=/1.0
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:20:15+0200 url="/1.0/containers?recursion=2"
lvl=info msg="Raft: Starting snapshot up to 170473" t=2018-08-20T14:20:29+0200
lvl=info msg="Raft: Compacting logs from 168041 to 169964" t=2018-08-20T14:20:29+0200
lvl=info msg="Raft: Snapshot to 170473 complete" t=2018-08-20T14:20:30+0200
lvl=info msg="Dqlite: closing client" t=2018-08-20T14:20:36+0200
lvl=info msg="Dqlite: handling new connection (fd=28)" t=2018-08-20T14:20:36+0200
lvl=info msg="Dqlite: connected address=0 attempt=0" t=2018-08-20T14:20:36+0200
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:21:32+0200 url=/1.0
ip=@ lvl=dbug method=GET msg=handling t=2018-08-20T14:21:34+0200 url="/1.0/containers?recursion=2"

i think he is doing things at launch time, very time/io consuming. On slow systems, it make it hangs.

didn't you seen something on the strace zip?

gaetanquentin on 20 Aug 2018

There is a lot of io:

Total DISK READ : 0.00 B/s | Total DISK WRITE : 293.53 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 533.69 K/s

TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
11892 be/4 root 0.00 B/s 220.15 K/s 0.00 % 34.46 % lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd --debug
11869 be/4 root 0.00 B/s 73.38 K/s 0.00 % 10.54 % lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd --debug

gaetanquentin on 20 Aug 2018

now lxc list report this:

lxc list
+----------------+-------+------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+----------------+-------+------+------+------------+-----------+
| xxx | ERROR | | | PERSISTENT | |
+----------------+-------+------+------+------------+-----------+
| yyy | ERROR | | | PERSISTENT | |
+----------------+-------+------+------+------------+-----------+

gaetanquentin on 20 Aug 2018

@gaetanquentin could you please attach again the content of /var/snap/lxd/common/lxd/logs/lxd.log ? and possibly of the last rotated logs in the same directory

freeekanayaka on 20 Aug 2018

I have a similar problem. Rebooting the machine does not fix it.

Command:

LD_LIBRARY_PATH=/snap/lxd/current/lib/ LXD_DIR=/var/snap/lxd/common/lxd/ /snap/lxd/current/bin/lxd --verbose --debug --group lxd

Output:

INFO[08-20|23:04:09] LXD 3.4 is starting in normal mode       path=/var/snap/lxd/common/lxd
INFO[08-20|23:04:09] Kernel uid/gid map: 
INFO[08-20|23:04:09]  - u 0 0 4294967295 
INFO[08-20|23:04:09]  - g 0 0 4294967295 
INFO[08-20|23:04:09] Configured LXD uid/gid map: 
INFO[08-20|23:04:09]  - u 0 165536 65536 
INFO[08-20|23:04:09]  - g 0 165536 65536 
WARN[08-20|23:04:09] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[08-20|23:04:09] Initializing local database 
INFO[08-20|23:04:09] Initializing database gateway 
INFO[08-20|23:04:09] Start database node                      address= id=1
INFO[08-20|23:04:10] Raft: Restored from snapshot 1-9086-1534764785477 
INFO[08-20|23:04:10] Raft: Initial configuration (index=1): [{Suffrage:Voter ID:1 Address:0}] 
INFO[08-20|23:04:10] Raft: Node at 0 [Leader] entering Leader state 
INFO[08-20|23:04:10] Dqlite: starting event loop 
DBUG[08-20|23:04:10] Dqlite: accepting connections 
INFO[08-20|23:04:10] LXD isn't socket activated 
INFO[08-20|23:04:10] Starting /dev/lxd handler: 
INFO[08-20|23:04:10]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[08-20|23:04:10] REST API daemon: 
INFO[08-20|23:04:10]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[08-20|23:04:10] Initializing global database 
INFO[08-20|23:04:10] Dqlite: handling new connection (fd=18) 
INFO[08-20|23:04:10] Dqlite: connected address=0 attempt=0 
INFO[08-20|23:04:11] Initializing storage pools 
DBUG[08-20|23:04:11] Initializing and checking storage pool "default" 
DBUG[08-20|23:04:12] Checking BTRFS storage pool "default" 
INFO[08-20|23:04:12] Initializing networks 
INFO[08-20|23:04:15] Dqlite: closing client 
INFO[08-20|23:04:15] Dqlite: handling new connection (fd=20) 
INFO[08-20|23:04:16] Dqlite: connected address=0 attempt=0 
INFO[08-20|23:04:22] Pruning leftover image files 
INFO[08-20|23:04:22] Done pruning leftover image files 
INFO[08-20|23:04:22] Loading configuration 
DBUG[08-20|23:04:22] Initialized inotify with file descriptor 17 
INFO[08-20|23:04:30] Connected to MAAS controller 
INFO[08-20|23:04:30] Pruning expired images 
INFO[08-20|23:04:31] Done pruning expired images 
INFO[08-20|23:04:31] Expiring log files 
INFO[08-20|23:04:31] Updating instance types 
INFO[08-20|23:04:32] Done expiring log files 
INFO[08-20|23:04:41] Done updating instance types 
WARN[08-20|23:04:58] Failed to rollback transaction after error (Failed to fetch nodes: driver: bad connection): unexpected response type 7 
DBUG[08-20|23:04:58] Database error: Failed to fetch nodes: driver: bad connection 
DBUG[08-20|23:04:58] Retry failed db interaction (Failed to fetch nodes: driver: bad connection) 
WARN[08-20|23:04:59] Failed to rollback transaction after error (Failed to fetch nodes: unexpected response type 6): unexpected response type 7 
DBUG[08-20|23:04:59] Database error: Failed to fetch nodes: unexpected response type 6

I also tried to replace the database with /var/snap/lxd/common/lxd/database/global /var/snap/lxd/common/lxd/database/global.bak temporarily. It lets me use lxc but the paths have changed since the bak was taken. I can't start containers anyway. It seems the problem is around the global database.

evpo on 20 Aug 2018

lxd-logs.tar.gz

FYI, it hangs again:

lxc list
Error: Failed to fetch http://unix.socket/1.0: 500 Internal Server Error

gaetanquentin on 20 Aug 2018

I've been experiencing the same error, which I've been able to reproduce on several occasions by running watch lxc list for a period of several hours.

AdamIsrael on 20 Aug 2018

Just a quick update, we have a reproducer that lets me trigger this within seconds on a test system.
It appears to be tied to I/O performance against the LXD database, which explains why it's intermittent and why most of our users are immune (thanks to SSDs I guess).

stgraber on 20 Aug 2018

And we have what looks like a potential fix, expecting it to release in the next few hours.

stgraber on 20 Aug 2018

🎉1

The fix is now available in the stable channel, please all run snap refresh lxd.

stgraber on 20 Aug 2018

works fine, thanks.

gaetanquentin on 20 Aug 2018

👍1

It works for me too. Thanks

evpo on 21 Aug 2018

@stgraber unfortunately doesn't work for me. I'm on lxd 3.5 8774 stable.

leonklingele on 23 Sep 2018

[..]
Sep 23 14:03:26 [HOST] lxd.daemon[12593]: lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response
type 6): unexpected response type 7" t=2018-09-23T14:03:26+0200
Sep 23 14:03:26 [HOST] lxd.daemon[12593]: lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6"
 t=2018-09-23T14:03:26+0200
Sep 23 14:03:27 [HOST] lxd.daemon[12593]: lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response
type 6): unexpected response type 7" t=2018-09-23T14:03:27+0200
Sep 23 14:03:27 [HOST] lxd.daemon[12593]: lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6"
 t=2018-09-23T14:03:27+0200
Sep 23 14:03:28 [HOST] lxd.daemon[12593]: lvl=warn msg="Failed to rollback transaction after error (Failed to fetch nodes: unexpected response
type 6): unexpected response type 7" t=2018-09-23T14:03:28+0200
Sep 23 14:03:28 [HOST] lxd.daemon[12593]: lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6"
 t=2018-09-23T14:03:28+0200
[..]

leonklingele on 23 Sep 2018

Seeing this as well on
lxd 3.5 9010 stable canonical✓ -

~~~~
lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: unexpected response type 6" t=2018-10-06T16:09:34+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-10-06T16:09:35+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-10-06T16:09:35+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-10-06T16:09:35+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-10-06T16:09:35+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-10-06T16:09:35+0200
lvl=warn msg="Failed to rollback transaction after error (cannot fetch node config from database: unexpected response type 6): unexpected response type 7" t=2018-10-06T16:09:35+0200

~~~~

sabaini on 6 Oct 2018

@sabaini if you're still stuck in this state, try systemctl reload snap.lxd.daemon that will hopefully unstick you.

@freeekanayaka we've seen a few reports of this happening (have seen it on some of our machines), any ideas?

stgraber on 6 Oct 2018

👍1

I think this bug should be re-opened.

I've been seeing this occasionally, and just had a report of it on a jenkins slave, both running 3.6.

I can semi-consistently trigger it on my laptop, while I've been writing/testing integration tests that use pylxd to create and destroy containers. While doing so, I'm running a session with watch lxc list. If I do that long enough, I will hit the error and have to restart the service.

Are there any logs I can capture that'll help isolate this?

AdamIsrael on 15 Oct 2018

We're also still seeing this on Bionic, is the fix backported to the deb version available on Bionic? We are deploying with juju so we do not necessarily have the luxury of choosing to use the snap version of LXD.

fnordahl on 3 Dec 2018

3.0.3 currently in proposed has the updated dqlite, so hopefully that'll do the trick

stgraber on 3 Dec 2018

The same here on Xenial:

ii lxd 3.0.2-0ubuntu1~16.04.1 amd64 Container hypervisor based on LXC - daemon

tomposmiko on 7 Dec 2018

same here.. bionic.

jimisan on 24 Jan 2019

Anyone hitting this, please make sure you're on 3.0.3 or higher, then if it keeps happening, we can look into it.

stgraber on 24 Jan 2019

Got the same issue today on my BETA machine, but not on my PROD machine fortunately.
The BETA machine is much slower than PROD, in case this information might be useful.

the snap refresh lxd did fix the lxc listissue => unexpected response type 6
however I noticed there were the last 2 containers missing (the last 2 I've recently created, a couple of days ago). The list shows up 8 of 10 containers I expected.

Looks very weird to me.
By the way, the 2 hidden containers were still running in the background (ping worked).
However I could not enter in (lxc exec), probably because they were still hidden (not listed).

So, I had the idea to restart lxd, sudo systemctl restart snap.lxd.daemon.service. It seems OK apparently. Unfortunately, the last 2 containers don't exist anymore, no more ping. It looks like I've definitely erased these 2 (hidden) containers. Wow, frustrating.

These containers were 'staging' containers, so it's not a big deal. But I wonder what if it would happen on production containers. Feeling a bit scared.

pvincent974b on 24 Aug 2019

@pvincent974b what version of LXD was this machine running? The 3.0.x stable series or the 3.x development one?

freeekanayaka on 24 Aug 2019

@freeekanayaka
The 3.0.x stable series, I guess so.
Months ago, I've installed with this command snap install lxd.
Quite sure, I did not try a development branch.

Last news

Good news, I re-create the 2 containers by using the same names.
Noticed a message of type 'Error: Common start logic: The container is already running'
Then, exec -- bash into these containers, and all the previous data were still there.
Cool.

CRIU related !?

hum, weeks ago, I have started playing with CRIU.
I know it is still experimental. That may explained why the last 2 containers (CRIU enabled) I created were missing after upgrade.

Hope it helps

pvincent974b on 24 Aug 2019

@sabaini if you're still stuck in this state, try systemctl reload snap.lxd.daemon that will hopefully unstick you.

@freeekanayaka we've seen a few reports of this happening (have seen it on some of our machines), any ideas?

I have been getting a similar error. Any calls to lxc binary were just failing.

lxc --debug info
DBUG[09-05|10:21:34] Connecting to a local LXD over a Unix socket
DBUG[09-05|10:21:34] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
Error: failed to begin transaction: exec

t=2019-09-04T11:49:37+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: driver: bad connection"
t=2019-09-04T11:49:38+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: driver: bad connection"
t=2019-09-04T11:49:39+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: driver: bad connection"
t=2019-09-04T11:49:40+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: driver: bad connection"
t=2019-09-04T11:49:40+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: exec"
t=2019-09-04T11:49:41+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: exec"
t=2019-09-04T11:49:42+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: exec"
t=2019-09-04T11:49:43+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: exec"
t=2019-09-04T11:49:43+0000 lvl=eror msg="Failed to load containers for scheduled snapshots" err="failed to begin transaction: exec"
t=2019-09-04T11:49:43+0000 lvl=eror msg="Failed to load containers for snapshot expiry" err="failed to begin transaction: exec"

The containers in the cluster, however, were just fine with the web-services running on them responding to the external calls and pings.

The command by @stgraber fixed this issue.
systemctl reload snap.lxd.daemon