Lxd: Regression: "error: websocket: close 1006 (abnormal closure): unexpected EOF"

Created on 6 Sep 2017  ·  20Comments  ·  Source: lxc/lxd

Required information

  • Distribution: Ubuntu Xenial
  • Distribution version: 16.04.2
  • This issue is reproducible from 2.15 (built from commit https://github.com/lxc/lxd/commit/62908649b6e64e6f6cd15c709901e3dbacc90c3e to 2.17 as released in the current xenial-backports version 2.17-0ubuntu2~ubuntu16.04.1)
    driver: lxc
    driver_version: 2.0.8
    kernel: Linux
    kernel_architecture: x86_64
    kernel_version: 4.10.0-29-generic
    server: lxd
    storage: zfs

Issue description

The script that I am appending next works without issues in 2.12 , but fails to
run on 2.17 (xenial-backports) version.

I have bisected the regression and I tracked it down to the following offending
commit:

And the subsequent commit:

I built a LXD/LXC from the previous commit https://github.com/lxc/lxd/commit/665f9013f70ae99a92c6c8c8bd08892b5dfd6cbf
and the script works well as well as with any other previous version to the offending commits.

start.sh

#!/bin/bash
for x in `seq 1 20`; do 
./reproducer.sh test$x &
done
wait

reproducer.sh

#!/bin/bash

NAME=$1
lxc init ubuntu: $NAME
while pgrep -f 'lxc init' >/dev/null; do
echo "$NAME: wait for all lxc done"
sleep 2
done

lxc start $NAME

while true; do
lxc exec $NAME -- sleep 5
done

Please check the full reproducer description next.

Steps to reproduce

  1. Run lxd init and create a zfs pool as follows
|  NAME   | DESCRIPTION | DRIVER |             SOURCE             | USED BY |
+---------+-------------+--------+--------------------------------+---------+
| default |             | zfs    | /var/lib/lxd/disks/default.img | 19      |
+---------+-------------+--------+--------------------------------+---------+
  1. Run ./start.sh (described previously)
  2. Check for the output of the previous command and wait for the multiple occurrences
    of the error to appear:

2017-09-05 20:23:57 Creating test3
2017-09-05 20:23:57 Creating test15
2017-09-05 20:23:57 Creating test13
[...snip]
2017-09-05 20:23:58 Creating test4
2017-09-05 20:23:58 Creating test14
test11: wait for all lxc done
error: websocket: close 1006 (abnormal closure): unexpected EOF
2017-09-05 20:24:15 test8: wait for all lxc done
2017-09-05 20:24:17 test11: wait for all lxc done
2017-09-05 20:24:17 test8: wait for all lxc done
error: websocket: close 1006 (abnormal closure): unexpected EOF
test13: wait for all lxc done
[...]

Information to attach

  • Lsof of all the LXD processes.
  • Logs from LXC Client with --debug --verbose
  • Logs from LXD running with --debug --verbose
  • Strace for all the LXD processes

strace-lxd-monitor.log.txt
lxd-debug.log.txt
lxc-client-debug.log.txt
lsof-debug.log.txt

Incomplete

Most helpful comment

The same problem on many servers with lxd 3.0.3-0ubuntu1~18.04.1 on Ubuntu 18.04.3.
Happens occasionally, difficult to reproduce.

All 20 comments

Unable to reproduce this on my local system. Will try on a clean Ubuntu 16.04 next.

Same result on a fresh system:

stgraber@castiana:~/data/code/lxc/lxd-pkg-snap (latest-candidate)$ rssh [email protected]
Warning: Permanently added 'vm01.maas.mtl' (ECDSA) to the list of known hosts.
root@vm01:~# apt update
Hit:1 http://us.archive.ubuntu.com/ubuntu xenial InRelease
Hit:2 http://us.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:3 http://us.archive.ubuntu.com/ubuntu xenial-backports InRelease
Hit:4 http://us.archive.ubuntu.com/ubuntu xenial-security InRelease
Reading package lists... Done                     
Building dependency tree       
Reading state information... Done
2 packages can be upgraded. Run 'apt list --upgradable' to see them.
root@vm01:~# apt dist-upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  snapd ubuntu-core-launcher
2 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 10.7 MB of archives.
After this operation, 4,466 kB of additional disk space will be used.
Do you want to continue? [Y/n] 
Get:1 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 ubuntu-core-launcher amd64 2.27.5 [1,562 B]
Get:2 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 snapd amd64 2.27.5 [10.7 MB]
Fetched 10.7 MB in 0s (45.8 MB/s)
(Reading database ... 59856 files and directories currently installed.)
Preparing to unpack .../ubuntu-core-launcher_2.27.5_amd64.deb ...
Unpacking ubuntu-core-launcher (2.27.5) over (2.26.10) ...
Preparing to unpack .../snapd_2.27.5_amd64.deb ...
Warning: Stopping snapd.service, but it can still be activated by:
  snapd.socket
Unpacking snapd (2.27.5) over (2.26.10) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up snapd (2.27.5) ...
Installing new version of config file /etc/apparmor.d/usr.lib.snapd.snap-confine.real ...
Setting up ubuntu-core-launcher (2.27.5) ...
root@vm01:~# apt install -t xenial-backports lxd lxd-client
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  ebtables xdelta3
Suggested packages:
  criu lxd-tools
The following NEW packages will be installed:
  ebtables xdelta3
The following packages will be upgraded:
  lxd lxd-client
2 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 5,123 kB of archives.
After this operation, 1,945 kB disk space will be freed.
Do you want to continue? [Y/n] 
Get:1 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 ebtables amd64 2.0.10.4-3.4ubuntu2 [79.4 kB]
Get:2 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 lxd amd64 2.17-0ubuntu2~ubuntu16.04.1 [3,000 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu xenial-backports/main amd64 lxd-client amd64 2.17-0ubuntu2~ubuntu16.04.1 [1,976 kB]
Get:4 http://us.archive.ubuntu.com/ubuntu xenial/universe amd64 xdelta3 amd64 3.0.8-dfsg-1ubuntu2 [67.1 kB]
Fetched 5,123 kB in 0s (14.8 MB/s) 
Selecting previously unselected package ebtables.
(Reading database ... 59859 files and directories currently installed.)
Preparing to unpack .../ebtables_2.0.10.4-3.4ubuntu2_amd64.deb ...
Unpacking ebtables (2.0.10.4-3.4ubuntu2) ...
Preparing to unpack .../lxd_2.17-0ubuntu2~ubuntu16.04.1_amd64.deb ...
Warning: Stopping lxd.service, but it can still be activated by:
  lxd.socket
Unpacking lxd (2.17-0ubuntu2~ubuntu16.04.1) over (2.0.10-0ubuntu1~16.04.2) ...
Warning: Stopping lxd.service, but it can still be activated by:
  lxd.socket
Preparing to unpack .../lxd-client_2.17-0ubuntu2~ubuntu16.04.1_amd64.deb ...
Unpacking lxd-client (2.17-0ubuntu2~ubuntu16.04.1) over (2.0.10-0ubuntu1~16.04.2) ...
Selecting previously unselected package xdelta3.
Preparing to unpack .../xdelta3_3.0.8-dfsg-1ubuntu2_amd64.deb ...
Unpacking xdelta3 (3.0.8-dfsg-1ubuntu2) ...
Processing triggers for systemd (229-4ubuntu19) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up ebtables (2.0.10.4-3.4ubuntu2) ...
update-rc.d: warning: start and stop actions are no longer supported; falling back to defaults
Setting up lxd-client (2.17-0ubuntu2~ubuntu16.04.1) ...
Setting up xdelta3 (3.0.8-dfsg-1ubuntu2) ...
Setting up lxd (2.17-0ubuntu2~ubuntu16.04.1) ...
Installing new version of config file /etc/init.d/lxd ...
Removing obsolete conffile /etc/init/lxd.conf ...
Setting up lxd dnsmasq configuration.

Old bridge configuration detected in /etc/default/lxd-bridge, upgrading
Unsetting deprecated profile options
Attempting to kill current lxd-bridge
Stock LXD installation detected, resetting to new defaults
Cleaning up lxd-bridge state files

Processing triggers for systemd (229-4ubuntu19) ...
Processing triggers for ureadahead (0.100.0-19) ...
root@vm01:~# lxd init
Do you want to configure a new storage pool (yes/no) [default=yes]? 
Name of the new storage pool [default=default]: 
Name of the storage backend to use (dir, btrfs, lvm) [default=dir]: 
Would you like LXD to be available over the network (yes/no) [default=no]? 
Would you like stale cached images to be updated automatically (yes/no) [default=yes]? 
Would you like to create a new network bridge (yes/no) [default=yes]? 
What should the new bridge be called [default=lxdbr0]? 
What IPv4 address should be used (CIDR subnet notation, “auto” or “none”) [default=auto]? 
What IPv6 address should be used (CIDR subnet notation, “auto” or “none”) [default=auto]? 
LXD has been successfully configured.
root@vm01:~# mkdir a
root@vm01:~# rmdir a
root@vm01:~# cd /home/ubuntu/
root@vm01:/home/ubuntu# ls -lh
total 8.0K
-rwxr-xr-x 1 ubuntu ubuntu 209 Sep  6 04:04 reproducer.sh
-rwxr-xr-x 1 ubuntu ubuntu  77 Sep  6 04:04 start.sh
root@vm01:/home/ubuntu# ./start.sh 
If this is your first time using LXD, you should also run: lxd init
To start your first container, try: lxc launch ubuntu:16.04

Creating test1
Creating test2
Creating test10
Creating test16
Creating test13
Creating test19
Creating test12
Creating test8
Creating test3
Creating test6
Creating test20
Creating test14
Creating test17
Creating test18
Creating test4
Creating test9
Creating test11
Creating test7
Creating test5
Creating test15
test9: wait for all lxc done0% (8.54MB/s)   
test19: wait for all lxc done
test18: wait for all lxc done
test1: wait for all lxc done
test13: wait for all lxc done
test14: wait for all lxc done
test3: wait for all lxc done
test16: wait for all lxc done
test11: wait for all lxc done
test8: wait for all lxc done
test10: wait for all lxc done
test2: wait for all lxc done
test15: wait for all lxc done
test5: wait for all lxc done
test6: wait for all lxc done
test20: wait for all lxc done
test7: wait for all lxc done
test4: wait for all lxc done
test12: wait for all lxc done   

Been sitting there for 5 minutes or so, running "lxc exec" against the various containers.

What kind of load is your system under when those errors show up?

I've managed to get LXD to get some of those when I interact with over 100 containers in parallel, at first glance those are caused by the system being so loaded that it's not scheduling enough time for the websocket client to react to events before the daemon closes the session as a result of the process exitting.

It looks like in the past such errors were being silently dropped, the new client code is much stricter about catching such websocket errors and surfacing them back to the user.

My guess is that the old code would suffer from the exact same issue but rather than showing you an error, it would simply exit 0 and never report that some of the data got lost.

Increasing parallelism even more, we eventually even hit a bunch of "error: not found", which is what you'd expect if the client gets so busy that it fails to retrieve the operation data within 5s of the operation having completed (the client attempts to fetch state of an operation that's already been garbage collected).

@stgraber

Thanks for looking into this.

OK, I can't definitively reproduce the problem in a newly deployed xenial virtual machine, but on a production system with moderate load average (56 cores, 22 points of load average, 5 minutes window) it's fairly easy to get the "error: websocket: close 1006 (abnormal closure): unexpected EOF" errors during the lxc exec loop" on 2.17. Maybe as part of the reproducer we should stress and induct
some cpu load into the system.

As you already mentioned these errors were probably dismissed in earlier versions (before https://github.com/lxc/lxd/commit/e4bd65bbff29675a8e0e24c623c7038ab89e0bbe) and this
was getting unnoticed by the tools that make use of lxc exec in parallel.

Also, I am not versed yet on the lxd internals, but assuming that the idea of " being so loaded that it's not scheduling enough time for the websocket client to react to events before the daemon closes the session " is true, why is this happening? Shouldn't be the writer waiting for all the message/data to be flushed/read on the socket before ending the session?

The LXD daemon is pretty aggressive on websocket handling, if the client doesn't handle events in a timely manner, the connection gets dropped. That's needed so that we don't end up with a bunch of open connections on the server when a network problem occurs. It's obviously less relevant for clients using the unix socket, but the LXD daemon doesn't know how someone connected to the API.

@stgraber Could you please point me to the current timeouts/deadlines at the server and client side? I don't think my system was so overloaded as I'd expect ms latency as no swap was in use at the moment of the problem and the load average was high, but not extreme at the seconds levels ..

Also, if you look at the specific websocket error 1006 (abnormal closure) this seems
to indicate that the problem is that the connection was dropped without a proper close frame being sent [0] and according to the documentation this is typically caused by proxies that doesn't
handles the termination properly.

"Used to indicate that a connection was closed abnormally (that is, with no close frame being sent) when a status code is expected."

My understanding is that at that point the client was either reading or terminating the connection
when the server decided to abruptly close the connection without sending the close frame. Is that behavior expected?

[0] https://developer.mozilla.org/en-US/docs/Web/API/CloseEvent

Sent a pull request which fixes a race in the server side code. This fixes at least one cause for the message you've been getting at least on my test systems.

@niedbalski LXD never sends a close frame because it never closes that connection. A /1.0/events API call is never ending. The only way it can get disconnected is the server going down or network failure, both of which cause a network disconnection without a websocket close frame.

@stgraber

I just tested a fresh build of 2.17 with the patch for #3771 applied on top, and it seems
that the reported message (close 1006) isn't being returned anymore.

Thank you!

Closed? I still see that message with lxd --version 3.0.0.
Error: websocket: close 1006 (abnormal closure): unexpected EOF

I installed lxd using snap, container is running, connected to it with exec bash and leave the session open. Some time later the error occurs, the open terminal sessions close.

I suspect that when snaps refresh with a updated version, the LXD kick out the user in a session.

I was just doing one job using two containers on LXD, when after some minutes I was kicked out in both ...
Then without any reason I just tried to look what version I was running and what build, and I saw that it has updated my version in less than one minute.

I don't know how exactly how snaps manages this but now I just set a scheduled update for it.

Yes, I just noticed the same issue; it seems I get this error whenever the snap updates itself.

@mpontillo the only thing you can do about it is to schedule your updates to a specific day of month or
else compile your own lxd server

https://snapdocs.labix.org/configuration-options/87

Yes, my workaround for now is sudo snap set lxd refresh.timer=fri,23:00-01:00; this way I'm not interrupted with updates during the work week.

Great ... just notice that you can set fri1 or fri5 to set it for the first or last friday of the month.

I saw this today, on an unloaded system running lxd 3.16. It's not clear how to reproduce. Are there logs that can be attached?

I think one way to reproduce this is:

  • install a earlier version;
  • start a new job inside your container;
  • schedule the time of Snap to update your LXD software a few seconds or minutes later;

Then you will see it stop your container to do the update.

The same problem on many servers with lxd 3.0.3-0ubuntu1~18.04.1 on Ubuntu 18.04.3.
Happens occasionally, difficult to reproduce.

Was this page helpful?
0 / 5 - 0 ratings