Compose: Logs stop consuming for container, blocking writes to stderr

Created on 13 Jun 2018 · 23 Comments · Source: docker/compose

Description of the issue

Log output from a postgres container stops after it has been running for a while, which causes subsequent database queries to hang because postgres can no longer write to stderr.

Context information (for bug reports)

Output of "docker-compose version"
$ docker-compose version
docker-compose version 1.21.2, build a133471

$ python -c 'import docker; print docker.__version__'
3.3.0

Output of "docker version"



md5-61506c423f4e4ca73923778a79fba231



Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental:  false

Output of "docker-compose config"



md5-898c4d3d6eaca96980ebcbfcad3a486a



networks:
  default:
    ipam:
      config:
      - subnet: 192.168.42.0/24
services:
  web:
    build:
      context: /usr/src/app
    environment:
      POSTGRES_1_PORT: tcp://postgres:5432
    expose:
    - '8000'
    links:
    - postgres
    ports:
    - 8000:8000/tcp
    tmpfs:
    - /tmp:rw,noexec,nosuid,nodev,size=1g,mode=777
    volumes:
    - /usr/src/app:/app:rw
    - state:/data:rw
    - runtime:/run/app:rw
  postgres:
    environment:
      PGDATA: /data/postgres
    image: postgres:9.3
    volumes:
    - state:/data:rw
    - runtime:/run/postgresql:rw
version: '2.2'
volumes:
  runtime: {}
  state: {}

Observed result

strace shows the container process blocked mid-write to stderr:

12:46:20 write(2, "ERROR: "..., 119

If I run sudo tail -F /proc/<pid>/fd/2 on the same process, the application test suite is no longer hung and resumes.
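
Why tailing the fd helps: a blocked write resumes as soon as something drains the other end of the pipe or socket. Below is a minimal, self-contained sketch of that mechanism using a plain os.pipe (nothing Docker-specific; the buffer-size assumption is the Linux default of roughly 64 KiB):

import os
import threading
import time

# A pipe has a finite kernel buffer (typically 64 KiB on Linux).
r, w = os.pipe()

def writer():
    chunk = b"x" * 4096
    total = 0
    while True:
        os.write(w, chunk)            # blocks once the pipe buffer is full
        total += len(chunk)
        print("writer has written", total, "bytes")

threading.Thread(target=writer, daemon=True).start()

time.sleep(2)                          # by now the writer is stuck inside os.write()
print("draining the read end; the writer resumes:")
for _ in range(5):
    os.read(r, 65536)                  # this plays the role of `tail -F /proc/<pid>/fd/2`
    time.sleep(0.2)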

Expected result

The application test suite should complete without hanging.

Stacktrace / full error message

N/A

Additional information

OS version / distribution, docker-compose install method, etc.

  • Ubuntu 16.04 64-bit
  • pip install docker-compose

The last known good version combination where I cannot reproduce the issue is docker-compose 1.20.1 + docker-py 3.1.4. I start to see the issue if I upgrade docker-py to 3.2.0. A bisect points to this commit as the culprit: https://github.com/docker/docker-py/commit/e9f31e1e277a39eb275cdd3cc29a42adf0bc5094

docker-py  kind/bug  status/0-triage

All 23 comments

@tail Thanks for the report!

You seem to be in a unique position to help out with this as you have a reproducible use case. Would you be willing to submit a patch? My guess is that reverting this changed line will take care of the issue, but you'd be able to verify that directly.

@shin- I'll give it a shot. That seems likely since I'm using Python 2. Out of curiosity, what was the check intending to do?

Hi guys,

It seems I have the same problem, but it takes 3 days to appear, so I'm not completely sure. The context is applications that log a lot of output to stdout or stderr. Same "Observed result" as above.

  • Ubuntu 16.04 64-bit
  • pip install docker-compose
  • docker version: 18.05.0-ce
  • docker-compose
$ docker-compose version
docker-compose version 1.21.2, build a133471
docker-py version: 3.3.0
CPython version: 2.7.12
OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
$ python -c 'import docker; print docker.__version__'
3.3.0

I tried reverting the changed line; we haven't had the problem since I reverted it.

I'll add details if I find any.

I've also run into this issue. I'm currently testing with the referenced line reverted. Preliminarily, it seems to be working well. I'll continue testing it and report back early next week. In the meantime, here is some of my system info:

  • CentOS 7 Minimal
$ docker-compose version
docker-compose version 1.23.1, build b02f130
docker-py version: 3.5.1
CPython version: 2.7.5
OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017
$ docker version
Client:
 Version:           18.06.0-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        0ffa825
 Built:             Wed Jul 18 19:08:18 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.0-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       0ffa825
  Built:            Wed Jul 18 19:10:42 2018
  OS/Arch:          linux/amd64
  Experimental:     false

I've updated Compose (and its dependencies); I'm still seeing the same problem, and the same fix still works.

$ docker-compose version
docker-compose version 1.23.2, build 1110ad0
docker-py version: 3.7.0
CPython version: 2.7.5
OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017

I've also developed a simple synthetic test I can share; I believe it triggers the same problem as my application. When run with docker-compose up, it runs forever with the fix, but without the fix the python process exits once stderr can no longer be written to without blocking. The container shim and Compose remain running, so you have to watch the process list to see python exit; I don't know whether that indicates a separate problem somewhere in the Docker ecosystem. The container does stop after Compose exits, however.

$ docker-compose config
services:
  tester:
    command: python /tester.py
    image: python:2
    stop_signal: SIGINT
    volumes:
    - /PATH/TO/PROJECT/tester.py:/tester.py:ro
version: '3.0'

compose-stderr-test.zip
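
The attached script isn't reproduced above, so only as a hedged sketch: a stand-in that exercises the same path might put stderr into non-blocking mode and exit once a write would block, matching the "exit when stderr can no longer be written to without blocking" behaviour described. The file name and details below are illustrative, not the actual contents of the attachment.

# tester.py (illustrative stand-in, not the attachment's contents):
# spam stderr with timestamped lines; once nothing drains the stream,
# a non-blocking write fails with EAGAIN and the process exits.
import errno
import fcntl
import os
import time

fd = 2  # stderr
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

count = 0
while True:
    count += 1
    line = "%s line %d\n" % (time.strftime("%H:%M:%S"), count)
    try:
        os.write(fd, line.encode("ascii"))
    except OSError as e:
        if e.errno == errno.EAGAIN:
            raise SystemExit("stderr would block; exiting")
        raise
    time.sleep(0.01)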

Confirmed test case here on a work project: a test of an nginx config in which nginx is hammered with requests and spits logs to stdout, then freezes halfway through the test.

docker-compose version is 1.23.2.

Redirecting the logs to /dev/null "fixes" the problem. Confirmed that the problem does not occur on other machines running docker-compose 1.17.1 and 1.22.0.

@shin- why hasn't the six.PY3 check been backed out of docker-py? It's clearly reproducible, and removing six.PY3 from the condition around the select block has been shown to fix the issue.

I'm a little confused as to why this doesn't impact every container running under docker-compose, but I've encountered this very issue and dropping six.PY3 from the condition does in fact fix it.

I don't understand why this select check shouldn't happen under Python 2.
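
For readers following along, the check being discussed sits in docker-py's docker/utils/socket.py. Paraphrased from memory (the exact surrounding code may differ in your installed version), the gist and the local revert commenters applied look like this:

# Inside read() in docker/utils/socket.py, roughly as it looks after the
# commit bisected above (paraphrased, not an exact copy of the source):
if six.PY3 and not isinstance(socket, NpipeSocket):
    select.select([socket], [], [])

# The local revert that several commenters report fixing the hang simply
# drops the Python-3-only gate, so the select() wait also happens on Python 2:
if not isinstance(socket, NpipeSocket):
    select.select([socket], [], [])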

@abrugh (or anyone else) please raise a PR with the fix.

Here is a minimal test case (courtesy of @bschlinker).

Program versions:

bschlinker@[red]:~/docker-compose-broken-1.20/$ docker-compose -v
docker-compose version 1.20.0, build ca8d3c6
bschlinker@[red]:~/docker-compose-broken-1.20/$ docker -v
Docker version 17.05.0-ce, build 89658be

OS information:

bschlinker@[red]:~/docker-compose-broken-1.20/$ lsb_release -dri
Distributor ID: Debian
Description:    Debian GNU/Linux 8.10 (jessie)
Release:    8.10

Test environment:

docker-compose.yml:

version: '2.1'

services:
  testcontainer:
    build: ./testcontainer

testcontainer/Dockerfile:

FROM debian:jessie
ADD entrypoint.sh .
RUN chmod 0755 entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

testcontainer/entrypoint.sh:

#!/bin/bash
while true
do
    echo $(date +"%T")
    sleep 120
done

Starting the container:

bschlinker@[red]:~/docker-compose-broken-1.20$ /usr/bin/python /usr/local/bin/docker-compose -f docker-compose.yml up --no-deps --exit-code-from testcontainer testcontainer
WARNING: using --exit-code-from implies --abort-on-container-exit
Starting dockercomposebroken120_testcontainer_1 ... done
Attaching to dockercomposebroken120_testcontainer_1
testcontainer_1  | 17:55:15

We will never see another log message print from the container here. However, we can see that the container continues to print via docker-compose logs:

bschlinker@[red]:~/docker-compose-broken-1.20$ docker-compose logs
Attaching to dockercomposebroken120_testcontainer_1
testcontainer_1  | 17:55:15
testcontainer_1  | 17:57:15
testcontainer_1  | 17:59:15
testcontainer_1  | 18:01:15
testcontainer_1  | 18:03:15
testcontainer_1  | 18:05:15
testcontainer_1  | 18:07:16
testcontainer_1  | 18:09:16
testcontainer_1  | 18:11:16

After messages stop printing from the initial call to docker-compose up, we can see a queue begin to form on the underlying unix socket:

Capturing the PID of docker-compose:

bschlinker@[red]:~/docker-compose-broken-1.20/testcontainer$ sudo ps aux | grep testcontainer
bschlin+ 10537  0.5  1.1 232068 45616 pts/0    Sl+  17:55   0:03 /usr/bin/python /usr/local/bin/docker-compose -f docker-compose.yml up --no-deps --exit-code-from testcontainer testcontainer
...

Recv-Q shows 68:

bschlinker@[red]:~/docker-compose-broken-1.20/testcontainer$ sudo ss -xp | grep 10537
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str ESTAB 68 0 * 120919 * 119928  users:(("python",pid=10537,fd=10))

A few minutes later, Recv-Q shows 102:

bschlinker@[red]:~/docker-compose-broken-1.20/testcontainer$ sudo ss -xp | tr -s " " | grep 10537
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str ESTAB 102 0 * 120919 * 119928 users:(("python",pid=10537,fd=10))

The above blocks Compose after the first print on Python 2; the bug does not happen when using Python 3.

FWIW, this happens with the journald logging driver as well.

Am I correct that if a log reader (docker logs, or whatever client is using the API) hangs, the process writing logs will eventually hang as well?

Might depend on whether the application does blocking vs nonblocking I/O. I expect applications doing blocking I/O will eventually get stuck when the kernel buffer fills up.
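
To make that distinction concrete, here is a small sketch with a plain pipe (nothing Docker-specific): on a full buffer, a non-blocking write fails immediately with EAGAIN, which an application can react to, whereas the same write in blocking mode simply hangs, which is the state the strace output above shows.

import errno
import fcntl
import os

r, w = os.pipe()                        # nobody will ever read from r

# Fill the pipe buffer in non-blocking mode so we can observe both behaviours.
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)
try:
    while True:
        os.write(w, b"x" * 4096)
except OSError as e:
    assert e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
    print("non-blocking write: EAGAIN, the application can handle it")

# Back to blocking mode: the very same write now hangs until someone reads r,
# which is what the stuck containers are doing.
fcntl.fcntl(w, fcntl.F_SETFL, flags)
print("blocking write: this will hang...")
os.write(w, b"x")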

Just had this happen to another instance of BIRD; it apparently got stuck writing a log message:

# strace -p 23390
strace: Process 23390 attached
write(2, "2019-04-15 18:16:05 <WARN> ", 27

I've experienced many processes eventually hanging after what seems like the stderr buffer filling up. Tailing the process's stderr file descriptor actually un-hangs the process.

A slightly simplified version of the Python 2 test case by @InvalidInterrupt:

version: '3'
services:
  tester:
    image: ubuntu
    command: bash -c 'sleep 61 && yes'
    stop_signal: SIGKILL

This can be run with docker-compose up. It reproduces only on Python 2. After 61 seconds, the letters y stop being printed to the console and the yes process gets stuck on

write(1, "y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n"..., 8192

(which can be seen in htop/strace).
Note the sleep 61: this doesn't reproduce (for me at least) with a shorter timeout, so a 60-second timeout/delay somewhere may be involved here. Reading the logs of that container with docker logs --tail 100 -f shows that the container printed some y letters and then stopped. Also, we noticed that killing the docker-compose process with SIGKILL (or another unhandled and thus fatal signal, say SIGUSR2) will un-block the hung process.
If we instead kill the process running in the container, leaving the docker-compose process running, we observe the effect described by @InvalidInterrupt: the containerd shim and Compose remain running and docker ps shows the container as running, although the containerd shim has no child processes. Attempting to docker kill that container or docker exec into it gets stuck, and pressing Ctrl+C in the running docker-compose leads to an HTTP timeout error; but, again, killing the Compose process with a fatal signal resolves the issue.

The issue doesn't happen with the -d flag (docker-compose up -d).

Versions:

$ docker version
Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false
$ docker-compose version
docker-compose version 1.24.0, build 0aa5906
docker-py version: 3.7.2
CPython version: 2.7.15rc1
OpenSSL version: OpenSSL 1.1.0g  2 Nov 2017

We also encountered a similar problem that happens with both Python 2 and Python 3. It happens when there is a dependency between services and docker-compose up is run with only the dependent service explicitly specified. In this case, the dependency is also started, but its logs are not printed to the console; and, if that dependency logs a lot, it eventually gets stuck similarly to the previous case. Test case:

version: '3'
services:
  dependent:
    image: ubuntu
    command: bash -c 'while true; do date && echo just a note to show container is alive && sleep 1; done'
    links: [logger_dependency]
    stop_signal: SIGKILL
  logger_dependency:
    image: ubuntu
    command: 'yes'
    stop_signal: SIGKILL

It should be run with docker-compose up dependent. Here, links can be replaced with depends_on; the effect is the same. The problem doesn't happen with docker-compose up or docker-compose up dependent logger_dependency.
Again, the yes process can be seen stuck with htop/strace or docker logs --tail 100 -f. This happens on both Python 2 and Python 3.
Killing the docker-compose process will, again, fix that; killing yes won't stop the container, and the shim will stay running.

Neither of the described cases reproduces with docker-compose run.

Compose version:

$ docker-compose version
docker-compose version 1.24.0, build 0aa5906
docker-py version: 3.7.2
CPython version: 3.6.7
OpenSSL version: OpenSSL 1.1.0g  2 Nov 2017

(docker version is the same as for previous test)

Still reproduces with docker-py 4.0.1 on Python 3.6.8 (bionic), but it doesn't reproduce on Python 2.

@shin- What's the point of that PY3 check in docker/utils/socket.py?

Still occurring on CentOS 7:

-bash-4.2$ docker-compose version
docker-compose version 1.24.1, build 4667896
docker-py version: 3.7.3
CPython version: 2.7.5
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013

-bash-4.2$ python --version
Python 2.7.5

I was able to work around it temporarily by adding

logging:
    driver: "none"

to each service defined in the docker-compose file.

I'm seeing the following when running tests within docker-compose on Gitlab.com Shared Runners:

  • after any slightly longer period during which a service produces no log output (e.g. while waiting for another slow service to start up), no more log output from it is collected/printed, as in https://github.com/docker/compose/issues/6018#issuecomment-484864810
  • the period seems to be linked to COMPOSE_HTTP_TIMEOUT, i.e. raising it from the default 60s to e.g. 600s makes all output get collected and printed (none of my services stays quiet that long)

I tried raising COMPOSE_HTTP_TIMEOUT after noticing the connection between its 60s default and @asterite3's finding that intervals of more than 60s before further output trigger this issue.

This might or might not have been the same problem: https://github.com/docker/compose/issues/5973#issuecomment-404285106. Though it (incorrectly?) talks about a different environment variable name and

Python2, docker-compose (1.24.1) and the docker daemon (19.03.3) themselves running in Docker, with a docker-in-docker setup through a TLS socket.

https://github.com/docker/docker-py/issues/2425 also has details about a similar issue occurring on Python 2 only. Switching to Python 3 could thus be a viable workaround. Deprecating docker-compose on Python 2 would probably be a good idea.

Python 2 will be deprecated in January 2020 anyway.

Hello, I wanted to add that we have an issue very similar to this in PyRDP on Python 3.6 (see linked issue) ~with logging to stdout~. I'm not sure whether it's the exact same issue, but the call stacks we've collected point to execution blocking in a write() call to stdout. ~Migrating from Python 2 might not be a viable workaround.~ EDIT: I got confused between the docker-compose version and the PyRDP version; docker-compose was on Python 2.7. We're currently trying with the latest Compose.

I'll be going through this issue on Monday and trying the suggested patches to see if they improve our situation. We're also in the process of trying to forgo Compose log output to confirm that this is indeed the issue at hand.

EDIT:
tail -F /proc/<pid>/fd/2 did flush the buffers successfully.
And I was mistaken; we were logging to stderr as well.

As far as I can tell, docker-compose always attaches to all started containers, but only reads the logs from those configured by the user (the main service when using docker-compose run). This likely builds up backpressure for all other containers as their output buffers fill but are never drained, because no log readers are set up by Compose. Eventually, this backpressure causes the Docker engine to stop processing those containers' stdout/stderr. That's why I see the issue with docker-compose run and docker-compose up, but not with docker-compose up -d, which detaches immediately.
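
A rough way to poke at this hypothesis from the Python SDK side: attach to a chatty container's output stream and deliberately never read from it, then check with strace whether the process inside ends up blocked in write(). This is a hypothetical experiment (it assumes docker-py and a local daemon), not something taken from this thread:

import time

import docker

client = docker.from_env()

# A container that writes as fast as it can, like the `yes` test cases above.
container = client.containers.run("ubuntu", "yes", detach=True)

# Attach to its output stream but never read from the returned socket,
# mimicking a log consumer that has stopped draining.
sock = container.attach_socket(
    params={"stdout": 1, "stderr": 1, "stream": 1, "logs": 1}
)

try:
    # If the backpressure theory holds, `strace` on the host should eventually
    # show the `yes` process blocked in write(1, "y\ny\n"...).
    time.sleep(300)
finally:
    container.kill()
    container.remove(force=True)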

For docker-compose run I was able to fix it with the following patch, which prevents compose from attaching to dependency containers:

diff --git a/compose/cli/main.py b/compose/cli/main.py
index 250d06c9..69c838d0 100644
--- a/compose/cli/main.py
+++ b/compose/cli/main.py
@@ -1316,6 +1316,7 @@ def run_one_off_container(container_options, project, service, options, toplevel
                 service_names=deps,
                 start_deps=True,
                 strategy=ConvergenceStrategy.never,
+                detached=True,
                 rescale=False
             )

The docker-compose up logic seems more complicated; I haven't investigated it further because my use case only involved run.
