Lxd: Container creation finishes but operation doesn't

Created on 7 Mar 2017 · 80Comments · Source: lxc/lxd

Information

Distribution: Ubuntu
Distribution version: 16.04
The output of "lxc info" or if that fails:
- Kernel version: 4.4.0-65-generic
- LXC version: 2.0.7
- LXD version: 2.10.1
- Storage backend in use: zfs 0.6.5.6-0ubuntu15

Issue

I'm working on a lxd api client to automatically create and start containers. I'm currently writing some automated tests and sometimes the test will time out because the operation of the containers creation wont ever finish.

But the container seems to be successfully created when listing existing containers. For debugging purposes, instead of just waiting for the /wait to finish I also added a simple status-polling every 5 seconds – it also continuously returns that the operation is still running.

It happens more often the more I run the same test, as soon as I restart the lxd service it works properly for a while again.

While running the service in terminal and killing it with ^C suddenly it decides to finish the operation before exiting.

What the test does:

Create persistent container
Run command
Create snapshot 1
Run command
Create snapshot 2
Run command
Create another persistent container based on snapshot 1 (this is where it gets stuck)

Logs

[x] any relevant kernel output (dmesg) – nothing logged
[X] container log (lxc info NAME --show-log)
lxc-info--show-log.txt
[X] main daemon log (/var/log/lxd.log)
lxd-log.txt
[x] output of the client with --debug – attached output of my custom client
custom-client--debug-out.txt
[X] output of the daemon with --debug
lxd--debug-out.txt

Source

patrickd-

All 80 comments

stgraber@dakara:~$ lxc launch ubuntu:16.04 blah
Creating blah
Starting blah
stgraber@dakara:~$ lxc exec blah -- echo abc
abc
stgraber@dakara:~$ lxc snapshot blah
stgraber@dakara:~$ lxc exec blah -- echo abc
abc
stgraber@dakara:~$ lxc snapshot blah
stgraber@dakara:~$ lxc exec blah -- echo abc
abc
stgraber@dakara:~$ lxc copy blah/snap1 blah1
stgraber@dakara:~$

So not super easy to reproduce.

stgraber on 7 Mar 2017

Looking at the log, I see "Copying ZFS container storage" but no matching "Copied ZFS container storage" which would imply that it's stuck in there, unless there is more debug output after what you pasted.

stgraber on 7 Mar 2017

@brauner any idea why we'd be stuck in there?

stgraber on 7 Mar 2017

@patrickd- can you paste "ps fauxww" when LXD is stuck? I wonder if there's a stuck zfs process somewhere in there.

stgraber on 7 Mar 2017

The code itself doesn't lend itself to being stuck, so no. Taking a look at the logs now.

brauner on 7 Mar 2017

unless there is more debug output after what you pasted.

I messed up and pasted the last part of the daemon output into the custom client log sorry!
heres the full daemon debug output
lxd--debug-out.txt

patrickd- on 7 Mar 2017

The container storage seems to be copied fine (according to the daemon log).

brauner on 7 Mar 2017

Ok, so that does hint at the zfs copy operation being held until ctrl+c somehow... I really wonder if there is any stuck zfs process then.

stgraber on 7 Mar 2017

@brauner well, it finishes copying when the daemon is being killed, so that suggests the subcommand was stuck and got unstuck by the signal... a zfs clone doesn't take 7 minutes to complete.

stgraber on 7 Mar 2017

Right, I just saw the ctrl+c.

brauner on 7 Mar 2017

@stgraber, we had a similar report about lxc exec being timed-out so I wonder whether this is lxc exec's fault somehow.

brauner on 7 Mar 2017

here's the ps output while it was stuck:
ps.txt

patrickd- on 7 Mar 2017

Ok, so it's not stuck waiting on the zfs subprocess then.

stgraber on 7 Mar 2017

I'm making you a build of what we currently have in LXD git master, if you can reproduce it with that, then I'll send you a build with a ton of debugging added to that particular ZFS function.

stgraber on 7 Mar 2017

👍1

btw, at some point there was a sleep 60 while it was stuck, not sure if that's correct

oot     22799  0.0  0.0  97448  5684 ?        Ss   23:26   0:00 [lxc monitor] /var/lib/lxd/containers spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f
165536   22819  0.5  0.0  37508  4088 ?        Ss   23:26   0:00  \_ /sbin/init
....
165536   23465  0.0  0.0   4508  1288 ?        S    23:26   0:00      \_ /bin/sh /etc/init.d/ondemand background
165536   23486  0.0  0.0   4380   656 ?        S    23:26   0:00          \_ sleep 60

patrickd- on 7 Mar 2017

https://dl.stgraber.org/lxd-3037
sha256: ad96813bb5ecc29dde483b48ee284682df3f128d7b1006f2c313300585970bdf

To run it:

systemctl stop lxd.service lxd.socket
./lxd-3037 --debug --group lxd

stgraber on 7 Mar 2017

That "sleep 60" is inside a container and is part of the normal Ubuntu boot sequence, nothing to be concerned about.

stgraber on 7 Mar 2017

I've got good and bad news, after running it a couple of times it never got stuck during creation again but it got stuck while getting the container state twice.

Attached is the full log of your build from creation until it gets stuck
lxd-3037-out.txt

patrickd- on 7 Mar 2017

Does it always get stuck at that spot?

stgraber on 8 Mar 2017

Based on the log, it looks like it's getting stuck in zfs again, this time, reading a filesystem property. If it always gets stuck at that spot, I'll give you another build with added debugging for that code path.

stgraber on 8 Mar 2017

Actually, with our new codebase, it should be pretty easy to have LXD log every single subprocess it spawns, so we can see if it's indeed the zfs command getting stuck somehow.

stgraber on 8 Mar 2017

Updated built at https://dl.stgraber.org/lxd-3037
sha256: f8ee6977effba85fe72364055e9bf98cf5e97b5ad0754e2e650d6b08b22b545c

stgraber on 8 Mar 2017

This will make LXD log (at ERROR level) every single command that it spawns as well as whether it succeeds or fails. If we get stuck in zfs, we should see it spawn the command but then never return.

stgraber on 8 Mar 2017

I kept rerunning and rerunning the test with lxd-3037 and I could never reproduce it again (wanted to give you the ps output which i forgot to make).

I then started to run the whole test suite and it got stuck in a different place this time

DBUG[03-08|00:06:10] Found cert                               k=0
DBUG[03-08|00:06:10] handling                                 ip=127.0.0.1:34152 method=POST url=/1.0/containers/spwn-3343e9f6-9eef-40cc-bbe5-494e7eea2669/exec
DBUG[03-08|00:06:10] 
    {
        "command": [
            "tail",
            "-f",
            "-n",
            "+1",
            "/.spwn.log"
        ],
        "environment": {},
        "interactive": true,
        "wait-for-websocket": true

(the } is also missing in the output)

here the ps while that was stuck
ps.txt

patrickd- on 8 Mar 2017

Can you paste the output of "dmesg" too?

Also, I'd like to point out that you're running the 4.4.0-65-generic kernel which was pulled out of updates due to a number of regressions affecting containers. I'm not sure that this is the source of your problem, but you may want to boot 4.4.0-64-generic and try again.

stgraber on 8 Mar 2017

dmesg.txt

I just updated to 4.4.0-66 do you know if this contains fixes for the container regressions?

patrickd- on 8 Mar 2017

Yes, 4.4.0-66 fixes that particular regression.

stgraber on 8 Mar 2017

reproduced it with new build:

DBUG[03-08|00:45:06] handling                                 ip=127.0.0.1:46336 method=GET url=/1.0/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f/state
DBUG[03-08|00:45:06] Initializing a ZFS driver. 
EROR[03-08|00:45:06] issue-3037: starting: zfs get -H -p -o value used lxd2/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f

patrickd- on 8 Mar 2017

executing that manually while it is stick instantly returns a value

$ sudo zfs get -H -p -o value used lxd2/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f
7637504

patrickd- on 8 Mar 2017

And do you see any trace of that "zfs" process? The log indicates that it spawned the command and that it hasn't yet returned, so it should be running...

stgraber on 8 Mar 2017

Is that on the -66 kernel?

stgraber on 8 Mar 2017

$ ps -lA | grep zfs
0 Z     0 11104 10325  0  80   0 -     0 -      pts/2    00:00:00 zfs <defunct>
$ lxc info | grep kernel
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.4.0-66-generic

patrickd- on 8 Mar 2017

where does that zfs process sit in the process tree (ps fauxww)? is it under lxd or somewhere else?

stgraber on 8 Mar 2017

It looks like the zfs command was properly spawned, it ran, exitted and then lxd never received the signal telling it that the process is done...

stgraber on 8 Mar 2017

Can you also post the whole debug output?

stgraber on 8 Mar 2017

patrickd  4868  0.9  0.1 663104 43752 ?        Sl   00:34   0:12          \_ /usr/lib/gnome-terminal/gnome-terminal-server
patrickd  4875  0.0  0.0  24892  6300 pts/2    Ss   00:34   0:00          |   \_ bash
root     10324  0.0  0.0  60456  4204 pts/2    S+   00:44   0:00          |   |   \_ sudo ./lxd-3037-2 --debug --group lxd
root     10325  0.3  0.0 1130020 32628 pts/2   Sl+  00:44   0:02          |   |       \_ ./lxd-3037-2 --debug --group lxd
root     11104  0.0  0.0      0     0 pts/2    Z+   00:45   0:00          |   |           \_ [zfs] <defunct>

patrickd- on 8 Mar 2017

lxd-debug.txt

patrickd- on 8 Mar 2017

@brauner I'm seeing a forkexec return at the exact same second as that zfs get is spawned in the log. Could it be that we're messing with the sig handler somehow?

stgraber on 8 Mar 2017

DBUG[03-08|00:45:06] Got error getting next reader websocket: close 1000 (normal), &{%!s(*os.file=&{19 master <nil>})}·
EROR[03-08|00:45:06] issue-3037: starting: zfs get -H -p -o value used lxd2/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f·

specifically

stgraber on 8 Mar 2017

@patrickd- so the problem is that LXD isn't getting the sigchld signal needed for it to finish execution of the child process. Based on timing, it could be that it's the code handling the /exec API which is somehow interfering with the signal handler and causing the failure. That or it's a Go signal handling bug, which unfortunately wouldn't be exactly new...

stgraber on 8 Mar 2017

lxc exec installs a signal.Notify() but it doesn't watch for SIGCHLD.

brauner on 8 Mar 2017

/me reads signal manpage of golang.

brauner on 8 Mar 2017

It states:

Notify causes package signal to relay incoming signals to c. If no signals are provided, all incoming signals will be relayed to c. Otherwise, just the provided signals will.

We provide it a set of signals and SIGCHLD is not among them.

brauner on 8 Mar 2017

Talking about https://github.com/lxc/lxd/blob/master/lxc/exec_unix.go#L27.

brauner on 8 Mar 2017

@brauner that's in the client though, we're only interested about the server code here since he's using a custom client.

stgraber on 8 Mar 2017

@brauner looking through the exec code, the part that I see as a potential problem is the FindProcess and procState handling. I'm not sure how those exactly work and worry that they may not be as thread safe as they seem...

stgraber on 8 Mar 2017

Hm, that would be a go bug then...

brauner on 8 Mar 2017

@brauner yeah but the fact that they mention so may os implementation details in the help isn't giving me a lot of confidence :)

stgraber on 8 Mar 2017

@stgraber, so what could happen is that the PID of one of the processes gets reused.

brauner on 8 Mar 2017

@brauner yeah, I'm not a big fan of FindProcess always succeeding and just returning you a random pid there... I'm not sure how that'd cause the problem though. If we somehow ended up waiting on the zfs process there, it wouldn't be in Z state...

stgraber on 8 Mar 2017

(However likely that is...) And FindProcess() finds the wrong process associated with that PID and that somehow messes up LXD.

brauner on 8 Mar 2017

I can try to remove the FindProcess() and directly wait on the pid. Don't know if that would mess with our current exec handling... Trying now...

brauner on 8 Mar 2017

👍1

It definitely wouldn't be the first time that we'd end up in signal-handling hell...

brauner on 8 Mar 2017

@brauner the most likely explanation for what we're seeing here is that we got two sigchld and Go didn't have time to process the first one before the second came in, causing it to be dropped because signals on Linux suck.

stgraber on 8 Mar 2017

Wouldn't they be able to buffer the pending signals.

brauner on 8 Mar 2017

I think I'll still try to take out the FindProcess() call and see if I can still have exec work as it does now.

brauner on 8 Mar 2017

Ah, I know why I used FindProcess() because the golang Wait() call requires to be called from a process struct...

brauner on 8 Mar 2017

The functions in syscall might help.

brauner on 8 Mar 2017

really smells like https://github.com/golang/go/issues/15437

stgraber on 8 Mar 2017

which isn't really a Go issue so much as a Linux issue but it still sucks :)

stgraber on 8 Mar 2017

Yeah, in cases like these I never know whether to be pleased that it isn't our fault or to not be please since we can't fix it properly... :)

brauner on 8 Mar 2017

hmm, anything I can do on api-client side? could throttling the api requests help?

patrickd- on 8 Mar 2017

@brauner hmm, so what I don't understand is that this should only ever be a problem if relying on SIGCHLD, not when explicitly calling wait() right?

stgraber on 8 Mar 2017

Yes, unless FindProcess() has some weird internals.

brauner on 8 Mar 2017

@brauner but what we see getting stuck in Z is a process that was spawned through exec.Command so that should definitely be using Wait() and not depend on sigchld...

stgraber on 8 Mar 2017

@patrickd- is there an easy way you can run your code without using /exec to see if you can get it to jam without it?

stgraber on 8 Mar 2017

@brauner I'm testing an approach which gets rid of that code entirely

stgraber on 8 Mar 2017

yes, I can pretty easily comment that out. will give it a try

patrickd- on 8 Mar 2017

reran test without /exec usage about 30 times without any problems (before it was like 1 of 5 runs failed)

getting some sleep now, thanks for looking into this, let me know if there's anything I can help with! :)

patrickd- on 8 Mar 2017

New binary for you. @patrickd-

https://dl.stgraber.org/lxd-3037
sha256: a1dbbecf2e6945e4ff0217363cff624adbb89d0bd828e03d07daa9bc8484dae1

stgraber on 8 Mar 2017

@patrickd- please try this binary with the full test including /exec stuff as that includes a fix for pid handling in there and may well fix the issue you got.

stgraber on 8 Mar 2017

alright, now i'm curious.. trying it now :grimacing:

patrickd- on 8 Mar 2017

👍1

@patrickd- very much appreciated. :)

brauner on 8 Mar 2017

i can no longer reproduce any issues with the latest build! seems to work great! :)
(never been able to run the complete test suite this many times without a single failure)

this one's going to bed in a goooood mood.. :)

patrickd- on 8 Mar 2017

yay, thanks for the tests!

stgraber on 8 Mar 2017

Great to hear. Closing. Please re-open in case that bug reappears. :)

brauner on 8 Mar 2017

Should keep open since the fix hasn't been merged yet :)

stgraber on 8 Mar 2017

brauner on 8 Mar 2017

big thanks for the super fast help and fix!

(btw, your work is powering the free online service simplytest.me enabling people to easily test Drupal modules and themes online)

patrickd- on 8 Mar 2017

👍1

Cool!

brauner on 8 Mar 2017

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Unknown configuration key "lxc.aa_profile"

sajoupa · 4Comments

Ubuntu-Core container image defaults to the edge channel, not stable

simos · 3Comments

lxc list output should be easier to parse

rrva · 5Comments

Container name isn't a valid hostname.

shaun-ba · 3Comments

Unprivileged containers fail to start

spacekookie · 3Comments