Compose: docker-compose hangs

Created on 29 Apr 2019 · 41 comments · Source: docker/compose

Description of the issue

In a new server installation (on a VPS provided by OVH) running Ubuntu 18.x, the Docker engine itself works flawlessly and fast, but docker-compose sometimes hangs indefinitely for no apparent reason. This happens with every command: version, ps, up, logs and so on.
I tried everything regarding DNS and name resolution, even disabling systemd-resolved.
When it works it is fast. When it hangs I can see the /tmp/_Mxyz123123 directory being created.
Running with --verbose, it hangs before producing any output and then completes flawlessly.

Context information (for bug reports)

Output of docker-compose version

docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018

Output of docker version

Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker-compose config
(Make sure to add the relevant -f and other flags)

... not related to a specific docker-compose file

Steps to reproduce the issue

  1. docker-compose

Observed result

Sometimes hangs

Expected result

Runs fast.

Stacktrace / full error message

No stack trace, but I managed to strace the behavior.

[...]
openat(AT_FDCWD, "/tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile> <0.000038>
fstat(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000020>
write(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, "# Generated automatically from M"..., 69632) = 69632 <0.000130>
fchmod(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, 0700) = 0 <0.000050>
write(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, "type.o   -o Modules/xxsubtype$(E"..., 43) = 43 <0.000029>
close(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>) = 0 <0.000021>
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2088645a10) = 3110 <0.000276>
rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0 <0.000051>
rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 0x7ffe82f7d900, 8) = -1 EINVAL (Invalid argument) <0.000031>
rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0 <0.000028>
wait4(-1, 

[... here it hangs; after 114s the output continues ...]

wait4(-1, docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3110 <114.945565>
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3110, si_uid=0, si_status=0, si_utime=43, si_stime=6} ---
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 8) = 0 <0.000094>
rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 0x7ffe82f7d900, 8) = -1 EINVAL (Invalid argument) <0.000150>

[...]

Additional information

Linux vps[omit] 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 18.04.2 LTS

docker-compose installed with curl
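
For reference, "installed with curl" presumably refers to the standalone-binary install that the docs described at the time, roughly along these lines (version pinned here purely as an example):

    sudo curl -L "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose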

I have lot of other docker-compose environments working and this is the first time I see this behavior. Any idea on what to check or where to look?

kind/bug

Most helpful comment

Same behavior for me as well. I'm on a DigitalOcean droplet running Ubuntu 18.04.2 LTS with the 4.15.0-48-generic x86_64 kernel. Installed via curl today.

All 41 comments

@xcash Thank you for the ticket. We are going to investigate. Do you have any minimal docker-compose.yaml triggering that issue? Or does the compose file contain some specific elements (like volumes or dedicated network)?

@jcsirot No, it happens even when issuing a simple docker-compose version without any docker-compose.yml present, so it does not seem related to the compose file. It seems to hang before even checking for it; as in the strace above, it is still probing the Docker subsystem.

By the way, the compose file mounts local volumes, uses the default network, and is based on a bitnami/nginx image.

This happens to me as well, running on a Google Compute Engine instance with the same docker versions as above. Also Ubuntu 18.04. Even docker-compose --help does this, so no docker-compose.yaml is involved.

Same behavior for me as well. I'm on a DigitalOcean droplet running Ubuntu 18.04.2 LTS with the 4.15.0-48-generic x86_64 kernel. Installed via curl today.

Same problem for me, Linode with Ubuntu 18.04.2. Going back to version 1.23.0 fixed it, although I had to recreate the containers.

It seems like the problem is gone after the latest update of Docker itself... can you guys (@hschalamonek @bitbuck3t @will-emmerson) check it too?

@xcash I'm running on Docker version 18.09.6, build 481bc77
and it's still happening.

Ubuntu 18.04.2 LTS x64 with everything up to date

As a workaround: run a docker-compose command once, and after that issue your desired command.

@xcash was your strace capture (from the original issue) captured with -fff to follow forks/clones?

If I strace locally (running 1.24.0), which doesn't hang (there's a brief pause, ~1s), I do notice:

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f07acbaea10) = 16880
rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 0x7ffdb3f8c1f0, 8) = -1 EINVAL (Invalid argument)
rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 16880
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16880, si_uid=1000, si_status=0, si_utime=60, si_stime=8} ---
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 8) = 0
rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 0x7ffdb3f8c1f0, 8) = -1 EINVAL (Invalid argument)
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 8) = 0
openat(AT_FDCWD, "/tmp/_MEI9VTJSw", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3

e.g. at the point you have a hang I have a child which is spawned and then exits.

I think the /tmp/ stuff is the self-extracting stuff from https://github.com/pyinstaller/pyinstaller.

The first exec looks to be some sort of self exec, and it then execs /sbin/ldconfig twice. What I saw was (please excuse my ascii...):

+-- /tmp/dc.16879:execve("./docker-compose-Linux-x86_64", ["./docker-compose-Linux-x86_64", "version"], 0x7fff52d107c0 /* 56 vars */) = 0
  ` /tmp/dc.16879:clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f07acbaea10) = 16880
    `-- /tmp/dc.16880:execve("/home/ijc/development/Docker/compose/compose/docker-compose-Linux-x86_64", ["./docker-compose-Linux-x86_64", "version"], 0xfa0530 /* 58 vars */) = 0
     |
     +- /tmp/dc.16880:clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb8f4c91a10) = 16881
     |  `-- /tmp/dc.16881:execve("/sbin/ldconfig", ["/sbin/ldconfig", "-p"], 0x7fb8f425ef00 /* 2 vars */) = 0
     |
     `- /tmp/dc.16880:clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb8f4c91a10) = 16882
        `-- /tmp/dc.16882:execve("/sbin/ldconfig", ["/sbin/ldconfig", "-p"], 0x7fb8efbbb168 /* 2 vars */) = 0

I wonder if there is something which is slowing /sbin/ldconfig on your system unacceptably? e.g. an nfs or autofs mount in the paths it is hitting or something like that.

FWIW I was running strace -s 4096 -o /tmp/dc -fff ./docker-compose-Linux-x86_64 version. You might also want to add -tt to timestamp all the calls and see if there is an obviously slow bit.
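
As a rough aid for reading the timestamped trace: computing the delta between consecutive lines makes the slow syscall stand out immediately. A quick-and-dirty sketch (assuming the -tt format above; /tmp/dc.PID is a placeholder for one of the per-process files written by -fff):

    # Print the biggest gaps between consecutive syscalls in an `strace -tt` log.
    # Assumes all timestamps (field 1, HH:MM:SS.micros) fall within the same day.
    awk '{ split($1, t, ":"); s = t[1]*3600 + t[2]*60 + t[3];
           if (NR > 1) printf "%11.6f  %s\n", s - prev, $0;
           prev = s }' /tmp/dc.PID | sort -n | tail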

@xcash I'm running on Docker version 18.09.6, build 481bc77
and it's still happening.

Ubuntu 18.04.2 LTS x64 with _everything_ up to date

On that machine (the only one I had with that problem) it disappeared and we use the same version :(

# docker-compose version
docker-compose version 1.24.0, build 0aa59064

# docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false

@xcash was your strace capture (from the original issue) captured with -fff to follow forks/clones?

Sadly not.

e.g. at the point you have a hang I have a child which is spawned and then exits.

I think the /tmp/ stuff is the self-extracting stuff from https://github.com/pyinstaller/pyinstaller.

I've seen the self-extraction stuff before that line, in the output I omitted from the post.

I wonder if there is something which is slowing /sbin/ldconfig on your system unacceptably? e.g. an nfs or autofs mount in the paths it is hitting or something like that.

Considering it is a VPS, it's entirely possible something like that is happening behind the scenes.
I tried issuing a docker command (not compose) while docker-compose was hanging, and it worked seamlessly, but it could have been an operation that didn't hit the filesystem.

FWIW I was running strace -s 4096 -o /tmp/dc -fff ./docker-compose-Linux-x86_64 version. You might also want to add -tt to timestamp all the calls and see if there is an obviously slow bit.

Should this happen again I will use your strace options. :)

Thanks @xcash

Considering it is a VPS, it's entirely possible something like that is happening behind the scenes.

FWIW, one thing I considered mentioning before was that some of the lower-tier virtual machines on AWS (and I presume other clouds) are scheduled in such a way that if they use all their "credit" for a given period they essentially hang (the whole VM is descheduled) until they build up sufficient credit again.

I _think_ this is probably not the case for you, since only docker-compose is hanging, not the whole VM (AIUI, e.g. you see docker CLI commands working while in the hung state). Also, hangs of 114s like you are seeing are so close to triggering Linux's default hung-task check (120s) that I think someone would have mentioned it if that were relevant.

So I think this is almost certainly not relevant to this case (at least to your particular instance), but worth mentioning just in case.

Should this happen again I will use your strace options. :)

Thank you!

BTW I think running /sbin/ldconfig while you are in the "bad" state should also take a long time, which would be interesting to validate.
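
A minimal way to check that (only a sketch; it needs to be run while docker-compose is actually stuck) would be:

    # If ldconfig itself is slow, e.g. due to a hung NFS/autofs mount in its search path,
    # the elapsed time reported here will be large.
    time /sbin/ldconfig -p > /dev/null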

I am still having the same problem, with upgraded everything, but can confirm that running docker-compose on its own does seem to fix it.

@will-emmerson please can you try collecting some logs with strace (see my reply to @xcash above for the runes) while the problem is exhibiting itself.

I ran the following command
strace -s 4096 -o /tmp/dc -fff -tt docker-compose --version

and it created these files:

3220120 May 21 10:34 dc.29714
8457740 May 21 10:34 dc.29717
  27423 May 21 10:33 dc.29718
  27423 May 21 10:33 dc.29719

What should I do with them?

I also ran strace -c docker-compose --version and it output this:

    docker-compose version 1.24.0, build 0aa59064
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    62.83   0.046038       46038         1           wait4
    20.47   0.014998          88       170           write
    9.71    0.007118          82        87           unlink
    2.79    0.002041           8       248           read
    0.94    0.000689           4       191           openat
    0.83    0.000609           3       231        98 stat
    0.61    0.000447          56         8           munmap
    0.36    0.000266          22        12           rmdir
    0.31    0.000225           1       191           close
    0.23    0.000166           7        24           getdents
    0.22    0.000160          13        12           mkdir
    0.21    0.000153         153         1           clone
    0.20    0.000144           1       192           fstat
    0.19    0.000140           2        87           fchmod
    0.07    0.000053           1        90           lseek
    0.02    0.000015           2         9           brk
    0.01    0.000010           2         6         2 rt_sigaction
    0.01    0.000007           0        17           mmap
    0.00    0.000000           0         7           mprotect
    0.00    0.000000           0         5         5 access
    0.00    0.000000           0         1           getpid
    0.00    0.000000           0         1           execve
    0.00    0.000000           0         1           readlink
    0.00    0.000000           0         1           arch_prctl
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.073279                  1593       105 total

I've never used strace before.

@will-emmerson easiest is probably to tar (or zip) them up and attach them here, thanks.

I'm not sure I want to upload a bunch of logs to a public place when I don't really know what's in them.

@will-emmerson hrm, yes, that's fair. please zip/tar them and mail them to ijc (at) docker.com.

I've sent them.

Received, thanks.

So your process tree is:

29714 `docker-compose --version`
`-- 29717 self exec of `docker-compose --version`
    `-- 29718 exec `/sbin/ldconfig` → exec `/sbin/ldconfig.real`
    `-- 29719 exec `/sbin/ldconfig` → exec `/sbin/ldconfig.real`

The two processes 29718 and 29719 execute fully in <1s. e.g. dc.29718 is

6678-will-emmerson$ sed -n -e '1p;$p' dc.29718
10:33:59.478903 set_robust_list(0x7fa0b3052a20, 24) = 0
10:33:59.483282 +++ exited with 0 +++
6678-will-emmerson$ sed -n -e '1p;$p' dc.29719
10:33:59.597443 set_robust_list(0x7fa0b3052a20, 24) = 0
10:33:59.602154 +++ exited with 0 +++

Which I think rules out something slowing ldconfig down, at least in your case.

It's the two main processes which take some time:

6678-will-emmerson$ sed -n -e '1p;$p' dc.29714
10:33:58.821291 execve("/usr/local/bin/docker-compose", ["docker-compose", "--version"], 0x7ffea9630aa8 /* 23 vars */) = 0
10:34:49.283590 +++ exited with 0 +++
6678-will-emmerson$ sed -n -e '1p;$p' dc.29717
10:33:59.145695 execve("/usr/local/bin/docker-compose", ["docker-compose", "--version"], 0xa6d630 /* 25 vars */) = 0
10:34:49.268734 +++ exited with 0 +++

In dc.29714 (the parent) there is a gap of ~48s after 10:34:01.454543:

10:33:59.145224 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd813295a10) = 29717
10:33:59.145455 rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
10:33:59.145495 rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 0x7ffd2ab7dcd0, 8) = -1 EINVAL (Invalid argument)
10:33:59.145521 rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
10:33:59.145545 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.163587 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.163639 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.279720 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.279774 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.305847 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.305904 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.333863 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.333915 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.396106 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.396157 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.454491 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.454543 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29717
10:34:49.268835 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29717, si_uid=0, si_status=0, si_utime=44, si_stime=6} ---
10:34:49.268861 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 8) = 0
10:34:49.268891 rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 0x7ffd2ab7dcd0, 8) = -1 EINVAL (Invalid argument)
10:34:49.268916 rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 8) = 0

Which seems to be the parent waiting for the child.

In the child (dc.29717) the gap is after 10:34:01.454593:

10:33:59.629565 openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3
10:33:59.629613 poll([{fd=3, events=POLLIN}], 1, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.163675 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.163690 restart_syscall(<... resuming interrupted poll ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.279809 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.279823 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.305939 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.305952 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.333951 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.333965 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.396192 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.396205 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.454579 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.454593 restart_syscall(<... resuming interrupted restart_syscall ...>) = 1
10:34:49.023121 close(3)                = 0

Aha! So the process is blocked in a poll syscall (poll([{fd=3, events=POLLIN}], 1, -1)) where the only passed fd is fd 3, which is a recently opened fd onto /dev/random (openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3).

The /dev/random device is well known for blocking when there is insufficient entropy on the system. And VPSs/cloud VMs are well known for having a general shortage of entropy.

There is no use of /dev/random in compose (outside of a single unrelated test case), nor is there in docker-py, so this must be in one of our dependencies.

Although I cannot reproduce the hang I _can_ see the use of /dev/random in my own strace (my system must have plenty of entropy I guess).

The maddest bit is that after blocking waiting for /dev/random to become readable it immediately closes it and opens /dev/urandom (the non-blocking one) instead and reads from that.
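
For anyone wanting to confirm this on an affected machine, a rough check (on kernels of this era, where /dev/random still blocks when the entropy pool is starved) would be:

    # Reading /dev/random should stall on an entropy-starved system, while /dev/urandom never blocks.
    timeout 10 head -c 32 /dev/random > /dev/null && echo "/dev/random ok" || echo "/dev/random blocked or slow"
    head -c 32 /dev/urandom > /dev/null && echo "/dev/urandom ok"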

Well, thank you for those logs, they have at least given me a string to pull on and see where it takes me...

No problem, it makes sense as it's a new VPS with very little use. And installing haveged fixes the problem.

Not long before the /dev/random dance I described I see references to cryptography.hazmat.bindings._openssl.so in the traces.

With that knowledge this comment seems awfully telling:
https://github.com/pyca/cryptography/blob/master/src/_cffi_src/openssl/src/osrandom_engine.c#L123...L127

Which in turn seems to derive from:
https://github.com/pyca/cryptography/issues/4645
https://github.com/pyca/cryptography/pull/4656

So I feel pretty sure in declaring that this isn't a compose issue as such, but some combination of Python's behaviour and lack of entropy on some systems. Installing haveged or some other entropy daemon or otherwise ensuring you have sufficient entropy seems like the right answer.
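
On such a system, a minimal check-and-fix sketch (Ubuntu package and service names assumed; haveged is just one of several entropy daemons) looks like:

    # Before: on a starved VPS this is often a single- or double-digit number.
    cat /proc/sys/kernel/random/entropy_avail

    sudo apt-get update && sudo apt-get install -y haveged
    sudo systemctl enable --now haveged

    # After: this should now report a few thousand bits.
    cat /proc/sys/kernel/random/entropy_avail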

I'm going to close since this does not appear to be a bug in compose, thanks for your help in getting to the bottom of the issue.

FWIW https://lwn.net/Articles/693189/ is an interesting article on a related subject (I'm not sure whether or not this issue is the one being described there, but the symptoms are certainly similar).

@ijc congrats on the troubleshooting, Ian! :)
It makes sense even in my setup, as the problem disappeared after that VPS went into production and thus started generating entropy.

Yeah! I had entropy_avail at 6; after installing haveged it increased to ~2500, and now docker-compose commands run pretty fast.

Very strange issue. @ijc, thank you for your investigation; the cause was so unexpected, but interesting to discover.

I don't understand the conclusion of this issue. At the end of it all, what can I do to keep my docker-compose commands from hanging?

I don't understand the conclusion of this issue. At the end of it all, what can I do to keep my docker-compose commands from hanging?

You can try my patch here to keep away from entropy, pynacl, libsodium, paramiko and anything else that you don't need: https://github.com/docker/compose/issues/6552#issuecomment-508124668 .

I don't understand the conclusion of this issue. At the end of it all, what can I do to keep my docker-compose commands from hanging?

@sebastialonso After reading this thread I solved it by installing haveged (https://github.com/docker/compose/issues/6678#issuecomment-494398740) on my VPS to get more entropy.

Here is another link on the same subject: https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged

This isn't really fixed, you guys are just giving workarounds and patches... like the issue is still happening even on the latest version.

I don't mean any offense by my post, but there's a difference between a closed issue and one that is left hanging because the proper solution hasn't been found yet.

Even if the author is satisfied, it doesn't mean that we should ignore this issue, no?

Don't get me wrong.

Thanks @Phuker, but I can't find the file you mention on my machine.

@cfarvidson On my cloud servers I still don't have this issue; it is happening on my local machine. It seems really bad to me to have to install third-party software to work around a bug inside the docker-compose package, so I guess I'll just wait for some kind of answer.

Hi @M034B5

This isn't really fixed, you guys are just giving workarounds and patches... like the issue is still
happening even on the latest version.

This is a typical "won't fix" case, since this is not an issue in Docker or Compose; it's a general system entropy issue.
If software needs entropy and the system is not providing enough, that's an OS issue, and everything you do to bypass it is a general workaround.

I don't mean any offense by my post, but there's a difference between a closed issue and one that is left hanging because the proper solution hasn't been found yet.

Indeed. But here the proper solution is having a good amount of entropy provided by the operating system. :) You are searching for the "proper solution" in the wrong place.

Even if the author is satisfied, it doesn't mean that we should ignore this issue, no?
Don't get me wrong.

Don't get me wrong either, and I'm not the author. I'm only providing what I think is the right point of view on this issue and why it's closed. :)

@sebastialonso
As a recap on this issue: the problem is that docker-compose requires system entropy at startup.
Sometimes the OS does not provide enough "natural" entropy, for a number of reasons.
In those cases you need to "fix the OS" by generating more entropy, which is what haveged is built for.
Without that, docker-compose waits for enough entropy, and how long that takes depends on what the OS is doing in the background.

Thanks @xcash, I'll install this haveged tool to solve my issues.

If this was a problem for you please upvote https://github.com/docker/compose/issues/6931 so we can get better docs, which would have avoided lots of wasted time chasing down this problem.

No, it didn't. It turns out my entropy was already high enough before installing that package. I installed it anyway, but everything remains the same.

Why should "docker-compose --help" require system entropy?

That's a good point, and that's precisely what I'm experiencing.

Could be related to the amount of disk/memory the docker daemon needs to operate?

On Thu, 10 Oct 2019, 06:13 lazarcf, notifications@github.com wrote:

Why should "docker-compose --help" require system entropy?


Any update on this issue?

Installing haveged fixed this problem for me.

CentOS 7
docker 19.03.5
docker-compose 1.25.0

I encountered this issue but realized I was in a Python 3.8 virtualenv; I ran docker-compose up -d outside of it and no longer had any weird stalling issues.

I want to also confirm that installing haveged fixed our problem.

Using a DigitalOcean droplet, with:

  • ubuntu 18 LTS
  • docker 19.03.1
  • docker-compose 1.26.2
  • Python-based project, which probably depends heavily on entropy (SSL, etc.)

We used this article to install haveged
https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged

I hope this helps anyone just in case.

FWIW, installing jitterentropy-rngd on a small server running Ubuntu 20.04 (via apt install jitterentropy-rngd) resolved this problem for me.

Now cat /proc/sys/kernel/random/entropy_avail shows a reasonable amount of entropy available (even after a fresh boot) and the hangs from docker-compose --help or docker-compose --version are _significantly_ reduced.
