Log output of a postgres container stops after the container has been running for a while, causing subsequent database queries to hang because the process blocks when writing to stderr.
Output of "docker-compose version"
$ docker-compose version
docker-compose version 1.21.2, build a133471
$ python -c 'import docker; print docker.__version__'
3.3.0
Output of "docker version"
Client:
Version: 18.03.1-ce
API version: 1.37
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:17:20 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:15:30 2018
OS/Arch: linux/amd64
Experimental: false
Output of "docker-compose config"
networks:
  default:
    ipam:
      config:
      - subnet: 192.168.42.0/24
services:
  web:
    build:
      context: /usr/src/app
    environment:
      POSTGRES_1_PORT: tcp://postgres:5432
    expose:
    - '8000'
    links:
    - postgres
    ports:
    - 8000:8000/tcp
    tmpfs:
    - /tmp:rw,noexec,nosuid,nodev,size=1g,mode=777
    volumes:
    - /usr/src/app:/app:rw
    - state:/data:rw
    - runtime:/run/app:rw
  postgres:
    environment:
      PGDATA: /data/postgres
    image: postgres:9.3
    volumes:
    - state:/data:rw
    - runtime:/run/postgresql:rw
version: '2.2'
volumes:
  runtime: {}
  state: {}
12:46:20 write(2, "ERROR: "..., 119
If I run `sudo tail -F /proc/<pid>/fd/2` on the same process, the application test suite is no longer hung and resumes.
The application test suite should complete without hanging.
N/A
OS version / distribution, docker-compose install method, etc.: installed via `pip install docker-compose`.
The last known good version combination where I cannot reproduce the issue is docker-compose 1.20.1 + docker-py 3.1.4. I start to see the issue if I upgrade docker-py to 3.2.0. A bisect points to this commit as the culprit: https://github.com/docker/docker-py/commit/e9f31e1e277a39eb275cdd3cc29a42adf0bc5094
@tail Thanks for the report!
You seem to be in a unique position to help out with this, as you have a reproducible use case. Would you be willing to submit a patch? My guess is that reverting this changed line will take care of the issue, but you'd be able to verify that directly.
@shin- I'll give it a shot. That seems likely since I'm using Python 2. Out of curiosity, what was the check intending to do?
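For context, the condition in question is in docker-py's docker/utils/socket.py. A paraphrased, simplified sketch of the gate and the revert discussed here (not a verbatim copy of the library code; names are simplified):

import select

NpipeSocket = type(None)  # stand-in: the real class only exists on Windows

def read(sock, n=4096):
    # After the bisected commit, the wait was gated on Python 3:
    #
    #     if six.PY3 and not isinstance(sock, NpipeSocket):
    #
    # so on Python 2 the select() was skipped entirely. The revert being
    # tested in this thread drops `six.PY3 and`, so Python 2 also blocks in
    # select() until the socket is readable:
    if not isinstance(sock, NpipeSocket):
        select.select([sock], [], [])
    return sock.recv(n)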
Hi guys,
It seems I have the same problem, but it takes three days to appear, so I'm not completely sure. The context is applications that log a lot of output to stdout or stderr. Same "Observed result" as above.
pip install docker-compose
$ docker-compose version
docker-compose version 1.21.2, build a133471
docker-py version: 3.3.0
CPython version: 2.7.12
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
$ python -c 'import docker; print docker.__version__'
3.3.0
I reverted the changed line, and we haven't had the problem since the revert.
I'll add details if I find any.
I've also run into this issue. I'm currently testing with the referenced line reverted. Preliminarily, it seems to be working well. I'll continue testing it and report back early next week. In the meantime, here is some of my system info:
$ docker-compose version
docker-compose version 1.23.1, build b02f130
docker-py version: 3.5.1
CPython version: 2.7.5
OpenSSL version: OpenSSL 1.0.2k-fips 26 Jan 2017
$ docker version
Client:
Version: 18.06.0-ce
API version: 1.38
Go version: go1.10.3
Git commit: 0ffa825
Built: Wed Jul 18 19:08:18 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.0-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: 0ffa825
Built: Wed Jul 18 19:10:42 2018
OS/Arch: linux/amd64
Experimental: false
I've updated compose (and its dependencies), and I'm still seeing the same problem; the same fix still seems to work.
$ docker-compose version
docker-compose version 1.23.2, build 1110ad0
docker-py version: 3.7.0
CPython version: 2.7.5
OpenSSL version: OpenSSL 1.0.2k-fips 26 Jan 2017
I've also developed a simple synthetic test I can share; I believe it triggers the same problem as my application. I've run it with `docker-compose up`: it runs forever with the fix, but without the fix the python process exits when stderr can no longer be written to without blocking. The container shim and compose remain running, so you have to watch the process list to see python exit; I don't know whether that indicates a separate problem somewhere in the Docker ecosystem. The container does stop after compose exits, however.
$ docker-compose config
services:
  tester:
    command: python /tester.py
    image: python:2
    stop_signal: SIGINT
    volumes:
    - /PATH/TO/PROJECT/tester.py:/tester.py:ro
version: '3.0'
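The tester.py script itself isn't shown above; as a stand-in (my assumption, not the original script), something like the following floods stderr in the same way, so that once nothing drains the log stream the write eventually blocks, or fails in non-blocking mode:

#!/usr/bin/env python
# Hypothetical stand-in for the tester.py mounted above: keep writing to
# stderr so the buffer fills once the log reader stops draining it.
import sys
import time

i = 0
while True:
    i += 1
    sys.stderr.write("line %d: filling the stderr buffer\n" % i)
    sys.stderr.flush()
    time.sleep(0.01)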
Confirmed test case here on a work project: a test of an nginx config in which nginx is hammered with requests and spits logs to stdout; it freezes halfway through the test.
docker-compose version is 1.23.2.
Redirecting the logs to /dev/null "fixes" the problem. Confirmed that the problem does not occur on other machines running docker-compose 1.17.1 and 1.22.0.
@shin- why hasn't the six.PY3 check been backed out of docker-py? It's clearly reproducible, and removing `six.PY3 and` from the condition around the select block has been shown to fix the issue.
I'm a little confused as to why this doesn't impact every container running under docker-compose, but I've encountered this very issue, and dropping `six.PY3 and` does in fact fix it.
I don't understand why this select check shouldn't happen under Python 2.
@abrugh (or anyone else) please raise a PR with the fix.
Here is a minimal test case (courtesy of @bschlinker).
Program versions:
bschlinker@[red]:~/docker-compose-broken-1.20/$ docker-compose -v
docker-compose version 1.20.0, build ca8d3c6
bschlinker@[red]:~/docker-compose-broken-1.20/$ docker -v
Docker version 17.05.0-ce, build 89658be
OS information:
bschlinker@[red]:~/docker-compose-broken-1.20/$ lsb_release -dri
Distributor ID: Debian
Description: Debian GNU/Linux 8.10 (jessie)
Release: 8.10
Test environment:
`docker-compose.yml`:
version: '2.1'
services:
  testcontainer:
    build: ./testcontainer
`testcontainer/Dockerfile`:
FROM debian:jessie
ADD entrypoint.sh .
RUN chmod 0755 entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
`testcontainer/entrypoint.sh`:
#!/bin/bash
while true
do
  echo $(date +"%T")
  sleep 120
done
Starting the container:
bschlinker@[red]:~/docker-compose-broken-1.20$ /usr/bin/python /usr/local/bin/docker-compose -f docker-compose.yml up --no-deps --exit-code-from testcontainer testcontainer
WARNING: using --exit-code-from implies --abort-on-container-exit
Starting dockercomposebroken120_testcontainer_1 ... done
Attaching to dockercomposebroken120_testcontainer_1
testcontainer_1 | 17:55:15
We will never see another log message print from the container here. However, we can see that the container continues to print via `docker-compose logs`:
bschlinker@[red]:~/docker-compose-broken-1.20$ docker-compose logs
Attaching to dockercomposebroken120_testcontainer_1
testcontainer_1 | 17:55:15
testcontainer_1 | 17:57:15
testcontainer_1 | 17:59:15
testcontainer_1 | 18:01:15
testcontainer_1 | 18:03:15
testcontainer_1 | 18:05:15
testcontainer_1 | 18:07:16
testcontainer_1 | 18:09:16
testcontainer_1 | 18:11:16
After messages stop printing from the initial call to `docker-compose up`, we can see a queue begin to form on the underlying unix socket.
Capturing the PID of `docker-compose`:
bschlinker@[red]:~/docker-compose-broken-1.20/testcontainer$ sudo ps aux | grep testcontainer
bschlin+ 10537 0.5 1.1 232068 45616 pts/0 Sl+ 17:55 0:03 /usr/bin/python /usr/local/bin/docker-compose -f docker-compose.yml up --no-deps --exit-code-from testcontainer testcontainer
...
`Recv-Q` shows 68:
bschlinker@[red]:~/docker-compose-broken-1.20/testcontainer$ sudo ss -xp | grep 10537
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str ESTAB 68 0 * 120919 * 119928 users:(("python",pid=10537,fd=10))
A few minutes later, `Recv-Q` shows 102:
bschlinker@[red]:~/docker-compose-broken-1.20/testcontainer$ sudo ss -xp | tr -s " " | grep 10537
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str ESTAB 102 0 * 120919 * 119928 users:(("python",pid=10537,fd=10))
The above blocks compose after the first print on Python 2; the bug does not happen on Python 3.
FWIW, this happens with the journald logging driver as well.
Am I correct that if a log reader (`docker logs`, or any client using the API) hangs, the process writing the logs will eventually hang as well?
Might depend on whether the application does blocking vs nonblocking I/O. I expect applications doing blocking I/O will eventually get stuck when the kernel buffer fills up.
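A tiny self-contained illustration of that mechanism (a generic sketch, not code from compose or docker-py): a writer to a pipe whose reader never drains it stalls as soon as the kernel buffer, typically 64 KiB on Linux, fills up; in non-blocking mode the same write raises instead of hanging:

import os

r, w = os.pipe()              # nothing ever reads from r
# os.set_blocking(w, False)   # uncomment to get BlockingIOError instead of a hang

total = 0
try:
    while True:
        os.write(w, b"x" * 4096)  # with a blocking fd, this stalls forever
        total += 4096             # once the kernel pipe buffer is full
        print("wrote", total, "bytes")
except BlockingIOError:
    print("non-blocking write hit EAGAIN after", total, "bytes")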
Just had this happen to another instance of BIRD; it apparently got stuck writing a log message:
# strace -p 23390
strace: Process 23390 attached
write(2, "2019-04-15 18:16:05 <WARN> ", 27
I've seen many processes eventually hang after what looks like the stderr buffer filling up. Tailing the process's stderr file descriptor actually un-hangs the process.
A slightly simplified version of @InvalidInterrupt's test case for Python 2:
version: '3'
services:
  tester:
    image: ubuntu
    command: bash -c 'sleep 61 && yes'
    stop_signal: SIGKILL
This can be run with `docker-compose up`. It only reproduces on Python 2. After 61 seconds, the letters `y` stop being printed to the console, and the `yes` process gets stuck on
write(1, "y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n"..., 8192
(which can be seen in `htop`/`strace`).
Note the `sleep 61`: this doesn't reproduce (for me at least) with a shorter timeout, so a 60-second timeout/delay somewhere may be involved. Reading the logs of that container with `docker logs --tail 100 -f` shows that the container printed some letters `y` and then stopped. Also, we noticed that killing the `docker-compose` process with `SIGKILL` (or another unhandled-and-thus-fatal signal, say `SIGUSR2`) will un-block the hung process.
Also, if we instead kill the process running in the container, leaving the `docker-compose` process running, we observe the effect described by @InvalidInterrupt: the containerd shim and compose remain running, and `docker ps` shows the container as running, even though the containerd shim can be seen to have no child processes. An attempt to `docker kill` that container or `docker exec` into it will get stuck, and pressing `Ctrl+C` in the running `docker-compose` will lead to an HTTP timeout error; but, again, killing the compose process with a fatal signal resolves the issue.
The issue won't happen with the `-d` flag (`docker-compose up -d`).
Versions:
$ docker version
Client:
Version: 18.09.5
API version: 1.39
Go version: go1.10.8
Git commit: e8ff056
Built: Thu Apr 11 04:43:57 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.5
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: e8ff056
Built: Thu Apr 11 04:10:53 2019
OS/Arch: linux/amd64
Experimental: false
$ docker-compose version
docker-compose version 1.24.0, build 0aa5906
docker-py version: 3.7.2
CPython version: 2.7.15rc1
OpenSSL version: OpenSSL 1.1.0g 2 Nov 2017
We also encountered a similar problem that happens with both Python 2 and Python 3. It happens when there is a dependency between services and `docker-compose up` is run with only the dependent service explicitly specified. In this case, the dependency will also be started, but its logs will not be printed to the console; if that dependency logs a lot, it will eventually get stuck similarly to the previous case. Test case:
version: '3'
services:
  dependent:
    image: ubuntu
    command: bash -c 'while true; do date && echo just a note to show container is alive && sleep 1; done'
    links: [logger_dependency]
    stop_signal: SIGKILL
  logger_dependency:
    image: ubuntu
    command: 'yes'
    stop_signal: SIGKILL
It should be run with `docker-compose up dependent`. Here, `links` can be replaced with `depends_on`; the effect is the same. The problem won't happen with `docker-compose up` or `docker-compose up dependent logger_dependency`.
Again, the `yes` process can be seen stuck with `htop`/`strace` or `docker logs --tail 100 -f`. This happens on both Python 2 and Python 3.
Killing the `docker-compose` process will, again, fix it; killing `yes` won't stop the container, and the shim will stay running.
Neither of the described cases reproduces with `docker-compose run`.
Compose version:
$ docker-compose version
docker-compose version 1.24.0, build 0aa5906
docker-py version: 3.7.2
CPython version: 3.6.7
OpenSSL version: OpenSSL 1.1.0g 2 Nov 2017
(docker version is the same as for previous test)
Still reproduces with docker-py 4.0.1 on Python 3.6.8 (bionic), but it doesn't reproduce on Python 2.
@shin- What's the point of that PY3 check in docker/utils/socket.py?
Still occurring on CentOS 7:
-bash-4.2$ docker-compose version
docker-compose version 1.24.1, build 4667896
docker-py version: 3.7.3
CPython version: 2.7.5
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
-bash-4.2$ python --version
Python 2.7.5
I was able to work around it temporarily by adding
logging:
  driver: "none"
to each service defined in the docker-compose file.
I'm seeing the following when running tests within docker-compose on GitLab.com shared runners:
I tried raising COMPOSE_HTTP_TIMEOUT because of the connection between its 60-second default and @asterite3's finding that gaps of more than 60 seconds before further output trigger this issue.
This might or might not have been the same problem: https://github.com/docker/compose/issues/5973#issuecomment-404285106 (though it, perhaps incorrectly, talks about a different environment variable name).
Environment: Python 2, with docker-compose (1.24.1) and the docker daemon (19.03.3) themselves running in Docker, in a docker-in-docker setup through a TLS socket.
https://github.com/docker/docker-py/issues/2425 also has details about a similar issue occurring on Python 2 only. Switching to Python 3 could thus be a viable workaround. Deprecating docker-compose on Python 2 would probably be a good idea.
Python 2 will be deprecated in January 2020 anyway.
Hello, I wanted to add that we have a very similar issue in PyRDP on Python 3.6 (see linked issue) ~~with logging to stdout~~. I'm not sure if it's the exact same issue, but the call stacks we've collected point to execution blocking in a `write()` call to stdout. ~~Migrating from Python 2 might not be a viable workaround.~~ EDIT: I got confused between the docker-compose version and the PyRDP version. docker-compose was on Python 2.7. We're currently trying with the latest compose.
I'll be going through this issue on Monday and trying the suggested patches to see if they improve our situation. We're also in the process of trying to forgo compose log output to confirm that this is indeed the issue at hand.
EDIT: `tail -F /proc/<pid>/fd/2` did flush the buffers successfully. And I was mistaken, we were logging to `stderr` as well.
As far as I can tell, `docker-compose` always attaches to all started containers, but only reads the logs from those configured by the user (the main service when using `docker-compose run`). This likely builds up backpressure for all other containers as their output buffers fill but are never drained, because no log readers are set up by compose. Eventually, this backpressure causes the docker engine to stop processing the containers' stdout/stderr. That's why I see the issue with `docker-compose run` and `docker-compose up`, but not with `docker-compose up -d`, because it immediately detaches.
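If that theory is right, it should be reproducible without compose at all by attaching to a chatty container and then never reading from the attach socket. A sketch using docker-py (my construction, not from this thread; assumes a local daemon and the ubuntu image):

import time
import docker

client = docker.from_env()
container = client.containers.run("ubuntu", "yes", detach=True)

# Attach for stdout/stderr streaming, then deliberately never read from the
# socket, so daemon-side buffers fill and backpressure builds.
sock = container.attach_socket(params={"stdout": 1, "stderr": 1, "stream": 1})

try:
    # After a while, strace on the container's `yes` process should show it
    # blocked in write(1, "y\ny\n"..., ...), matching the reports above.
    time.sleep(300)
finally:
    container.remove(force=True)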
For `docker-compose run` I was able to fix it with the following patch, which prevents compose from attaching to dependency containers:
diff --git a/compose/cli/main.py b/compose/cli/main.py
index 250d06c9..69c838d0 100644
--- a/compose/cli/main.py
+++ b/compose/cli/main.py
@@ -1316,6 +1316,7 @@ def run_one_off_container(container_options, project, service, options, toplevel
         service_names=deps,
         start_deps=True,
         strategy=ConvergenceStrategy.never,
+        detached=True,
         rescale=False
     )
The `docker-compose up` logic seems more complicated; I haven't investigated it further because my use case only involved `run`.