I'm working on a lxd api client to automatically create and start containers. I'm currently writing some automated tests and sometimes the test will time out because the operation of the containers creation wont ever finish.
But the container seems to be successfully created when listing existing containers. For debugging purposes, instead of just waiting for the /wait to finish I also added a simple status-polling every 5 seconds – it also continuously returns that the operation is still running.
It happens more often the more I run the same test, as soon as I restart the lxd service it works properly for a while again.
While running the service in terminal and killing it with ^C suddenly it decides to finish the operation before exiting.
What the test does:
stgraber@dakara:~$ lxc launch ubuntu:16.04 blah
Creating blah
Starting blah
stgraber@dakara:~$ lxc exec blah -- echo abc
abc
stgraber@dakara:~$ lxc snapshot blah
stgraber@dakara:~$ lxc exec blah -- echo abc
abc
stgraber@dakara:~$ lxc snapshot blah
stgraber@dakara:~$ lxc exec blah -- echo abc
abc
stgraber@dakara:~$ lxc copy blah/snap1 blah1
stgraber@dakara:~$
So not super easy to reproduce.
Looking at the log, I see "Copying ZFS container storage" but no matching "Copied ZFS container storage" which would imply that it's stuck in there, unless there is more debug output after what you pasted.
@brauner any idea why we'd be stuck in there?
@patrickd- can you paste "ps fauxww" when LXD is stuck? I wonder if there's a stuck zfs process somewhere in there.
The code itself doesn't lend itself to being stuck, so no. Taking a look at the logs now.
unless there is more debug output after what you pasted.
I messed up and pasted the last part of the daemon output into the custom client log sorry!
heres the full daemon debug output
lxd--debug-out.txt
The container storage seems to be copied fine (according to the daemon log).
Ok, so that does hint at the zfs copy operation being held until ctrl+c somehow... I really wonder if there is any stuck zfs process then.
@brauner well, it finishes copying when the daemon is being killed, so that suggests the subcommand was stuck and got unstuck by the signal... a zfs clone doesn't take 7 minutes to complete.
Right, I just saw the ctrl+c.
@stgraber, we had a similar report about lxc exec being timed-out so I wonder whether this is lxc exec's fault somehow.
here's the ps output while it was stuck:
ps.txt
Ok, so it's not stuck waiting on the zfs subprocess then.
I'm making you a build of what we currently have in LXD git master, if you can reproduce it with that, then I'll send you a build with a ton of debugging added to that particular ZFS function.
btw, at some point there was a sleep 60 while it was stuck, not sure if that's correct
oot 22799 0.0 0.0 97448 5684 ? Ss 23:26 0:00 [lxc monitor] /var/lib/lxd/containers spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f
165536 22819 0.5 0.0 37508 4088 ? Ss 23:26 0:00 \_ /sbin/init
....
165536 23465 0.0 0.0 4508 1288 ? S 23:26 0:00 \_ /bin/sh /etc/init.d/ondemand background
165536 23486 0.0 0.0 4380 656 ? S 23:26 0:00 \_ sleep 60
https://dl.stgraber.org/lxd-3037
sha256: ad96813bb5ecc29dde483b48ee284682df3f128d7b1006f2c313300585970bdf
To run it:
systemctl stop lxd.service lxd.socket
./lxd-3037 --debug --group lxd
That "sleep 60" is inside a container and is part of the normal Ubuntu boot sequence, nothing to be concerned about.
I've got good and bad news, after running it a couple of times it never got stuck during creation again but it got stuck while getting the container state twice.
Attached is the full log of your build from creation until it gets stuck
lxd-3037-out.txt
Does it always get stuck at that spot?
Based on the log, it looks like it's getting stuck in zfs again, this time, reading a filesystem property. If it always gets stuck at that spot, I'll give you another build with added debugging for that code path.
Actually, with our new codebase, it should be pretty easy to have LXD log every single subprocess it spawns, so we can see if it's indeed the zfs command getting stuck somehow.
Updated built at https://dl.stgraber.org/lxd-3037
sha256: f8ee6977effba85fe72364055e9bf98cf5e97b5ad0754e2e650d6b08b22b545c
This will make LXD log (at ERROR level) every single command that it spawns as well as whether it succeeds or fails. If we get stuck in zfs, we should see it spawn the command but then never return.
I kept rerunning and rerunning the test with lxd-3037 and I could never reproduce it again (wanted to give you the ps output which i forgot to make).
I then started to run the whole test suite and it got stuck in a different place this time
DBUG[03-08|00:06:10] Found cert k=0
DBUG[03-08|00:06:10] handling ip=127.0.0.1:34152 method=POST url=/1.0/containers/spwn-3343e9f6-9eef-40cc-bbe5-494e7eea2669/exec
DBUG[03-08|00:06:10]
{
"command": [
"tail",
"-f",
"-n",
"+1",
"/.spwn.log"
],
"environment": {},
"interactive": true,
"wait-for-websocket": true
(the } is also missing in the output)
here the ps while that was stuck
ps.txt
Can you paste the output of "dmesg" too?
Also, I'd like to point out that you're running the 4.4.0-65-generic kernel which was pulled out of updates due to a number of regressions affecting containers. I'm not sure that this is the source of your problem, but you may want to boot 4.4.0-64-generic and try again.
I just updated to 4.4.0-66 do you know if this contains fixes for the container regressions?
Yes, 4.4.0-66 fixes that particular regression.
reproduced it with new build:
DBUG[03-08|00:45:06] handling ip=127.0.0.1:46336 method=GET url=/1.0/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f/state
DBUG[03-08|00:45:06] Initializing a ZFS driver.
EROR[03-08|00:45:06] issue-3037: starting: zfs get -H -p -o value used lxd2/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f
executing that manually while it is stick instantly returns a value
$ sudo zfs get -H -p -o value used lxd2/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f
7637504
And do you see any trace of that "zfs" process? The log indicates that it spawned the command and that it hasn't yet returned, so it should be running...
Is that on the -66 kernel?
$ ps -lA | grep zfs
0 Z 0 11104 10325 0 80 0 - 0 - pts/2 00:00:00 zfs <defunct>
$ lxc info | grep kernel
kernel: Linux
kernel_architecture: x86_64
kernel_version: 4.4.0-66-generic
where does that zfs process sit in the process tree (ps fauxww)? is it under lxd or somewhere else?
It looks like the zfs command was properly spawned, it ran, exitted and then lxd never received the signal telling it that the process is done...
Can you also post the whole debug output?
patrickd 4868 0.9 0.1 663104 43752 ? Sl 00:34 0:12 \_ /usr/lib/gnome-terminal/gnome-terminal-server
patrickd 4875 0.0 0.0 24892 6300 pts/2 Ss 00:34 0:00 | \_ bash
root 10324 0.0 0.0 60456 4204 pts/2 S+ 00:44 0:00 | | \_ sudo ./lxd-3037-2 --debug --group lxd
root 10325 0.3 0.0 1130020 32628 pts/2 Sl+ 00:44 0:02 | | \_ ./lxd-3037-2 --debug --group lxd
root 11104 0.0 0.0 0 0 pts/2 Z+ 00:45 0:00 | | \_ [zfs] <defunct>
@brauner I'm seeing a forkexec return at the exact same second as that zfs get is spawned in the log. Could it be that we're messing with the sig handler somehow?
DBUG[03-08|00:45:06] Got error getting next reader websocket: close 1000 (normal), &{%!s(*os.file=&{19 master <nil>})}·
EROR[03-08|00:45:06] issue-3037: starting: zfs get -H -p -o value used lxd2/containers/spwn-5298241ee1af32f916ad47ec8e88d9e2acd5c82f·
specifically
@patrickd- so the problem is that LXD isn't getting the sigchld signal needed for it to finish execution of the child process. Based on timing, it could be that it's the code handling the /exec API which is somehow interfering with the signal handler and causing the failure. That or it's a Go signal handling bug, which unfortunately wouldn't be exactly new...
lxc exec installs a signal.Notify() but it doesn't watch for SIGCHLD.
/me reads signal manpage of golang.
It states:
Notify causes package signal to relay incoming signals to c. If no signals are provided, all incoming signals will be relayed to c. Otherwise, just the provided signals will.
We provide it a set of signals and SIGCHLD is not among them.
Talking about https://github.com/lxc/lxd/blob/master/lxc/exec_unix.go#L27.
@brauner that's in the client though, we're only interested about the server code here since he's using a custom client.
@brauner looking through the exec code, the part that I see as a potential problem is the FindProcess and procState handling. I'm not sure how those exactly work and worry that they may not be as thread safe as they seem...
Hm, that would be a go bug then...
@brauner yeah but the fact that they mention so may os implementation details in the help isn't giving me a lot of confidence :)
@stgraber, so what could happen is that the PID of one of the processes gets reused.
@brauner yeah, I'm not a big fan of FindProcess always succeeding and just returning you a random pid there... I'm not sure how that'd cause the problem though. If we somehow ended up waiting on the zfs process there, it wouldn't be in Z state...
(However likely that is...) And FindProcess() finds the wrong process associated with that PID and that somehow messes up LXD.
I can try to remove the FindProcess() and directly wait on the pid. Don't know if that would mess with our current exec handling... Trying now...
It definitely wouldn't be the first time that we'd end up in signal-handling hell...
@brauner the most likely explanation for what we're seeing here is that we got two sigchld and Go didn't have time to process the first one before the second came in, causing it to be dropped because signals on Linux suck.
Wouldn't they be able to buffer the pending signals.
I think I'll still try to take out the FindProcess() call and see if I can still have exec work as it does now.
Ah, I know why I used FindProcess() because the golang Wait() call requires to be called from a process struct...
The functions in syscall might help.
really smells like https://github.com/golang/go/issues/15437
which isn't really a Go issue so much as a Linux issue but it still sucks :)
Yeah, in cases like these I never know whether to be pleased that it isn't our fault or to not be please since we can't fix it properly... :)
hmm, anything I can do on api-client side? could throttling the api requests help?
@brauner hmm, so what I don't understand is that this should only ever be a problem if relying on SIGCHLD, not when explicitly calling wait() right?
Yes, unless FindProcess() has some weird internals.
@brauner but what we see getting stuck in Z is a process that was spawned through exec.Command so that should definitely be using Wait() and not depend on sigchld...
@patrickd- is there an easy way you can run your code without using /exec to see if you can get it to jam without it?
@brauner I'm testing an approach which gets rid of that code entirely
yes, I can pretty easily comment that out. will give it a try
reran test without /exec usage about 30 times without any problems (before it was like 1 of 5 runs failed)
getting some sleep now, thanks for looking into this, let me know if there's anything I can help with! :)
New binary for you. @patrickd-
https://dl.stgraber.org/lxd-3037
sha256: a1dbbecf2e6945e4ff0217363cff624adbb89d0bd828e03d07daa9bc8484dae1
@patrickd- please try this binary with the full test including /exec stuff as that includes a fix for pid handling in there and may well fix the issue you got.
alright, now i'm curious.. trying it now :grimacing:
@patrickd- very much appreciated. :)
i can no longer reproduce any issues with the latest build! seems to work great! :)
(never been able to run the complete test suite this many times without a single failure)
this one's going to bed in a goooood mood.. :)
yay, thanks for the tests!
Great to hear. Closing. Please re-open in case that bug reappears. :)
Should keep open since the fix hasn't been merged yet :)
:)
big thanks for the super fast help and fix!
(btw, your work is powering the free online service simplytest.me enabling people to easily test Drupal modules and themes online)
Cool!