docker-py running container blocks access to "docker stats"

Created on 9 Sep 2016 · 19 Comments · Source: docker/docker-py

I am using docker-py to run a short-running (1m-2h) container in the background, and I'd like to monitor that container's stats to record and characterize its I/O, CPU, and memory usage every second or two. But it appears that while containers are running under docker-py I can't get stats, either from my code through docker-py (c.stats()) or from the command line with docker stats. Streaming or not, both hang until they time out or until the container happens to finish and unblock Docker.

I should note that I can call other functions on the docker-py Client, such as containers(). It is only the stats() call that hangs until the container exits, whether I call it without streaming or iterate over the generator with streaming. It is a similar story on the command line: docker ps works, docker stats hangs.

I'm using Ubuntu 14.04, docker-py version 1.9.0, and docker version 1.12.1, plain vanilla from the repos, connecting over a local domain socket as configured by default (/var/run/docker.sock).


All 19 comments

This also occurs on Ubuntu 16.04. Here's my code, slightly edited for brevity. Maybe I'm just doing it wrong and someone can correct me:

    import time
    import uuid

    from docker import Client  # docker-py 1.x

    # container_local_mnt, manifest and command are defined elsewhere (omitted for brevity)
    c = Client(base_url='unix://var/run/docker.sock', version='auto')

    container_name = str(uuid.uuid4())
    volumes = ['/mnt/container_local']
    host_config = c.create_host_config(
        binds={
            container_local_mnt: {
                'bind': volumes[0],
                'mode': 'rw'
            }
        })

    container_kwargs = {
        'image'    : manifest['image'],  # just a docker image name...
        'name'     : container_name,
        'hostname' : container_name,
        'command'  : [command],  # the program I'm trying to get stats on
        'entrypoint' : 'sh -c',
        'volumes' : volumes,
        'working_dir' : volumes[0],
        'host_config' : host_config,
        'detach'   : True,
        'network_disabled' : True,
        'environment' : {},
    }

    container = c.create_container(**container_kwargs)
    c.start(container=container['Id'])

    # just want to see it printed out for now...
    while container['Id'] in [i['Id'] for i in c.containers()]:
        # this call always times out and throws an exception unless the container exits first
        print(c.stats(container['Id'], stream=False))
        time.sleep(2)

    exit_code = c.wait(container=container)

This is very interesting. We had an almost identical bug report recently in our bug tracker:

https://bugzilla.redhat.com/show_bug.cgi?id=1374265

What I was able to figure out is that the engine is likely blocked on getting stats from cgroups (for some reason) and hence doesn't respond with anything.

Would it be possible to send an HTTP request via ncat to see what the response will be?

$ ncat -U /var/run/docker.sock
GET /v1.22/containers/580dc1eaae69/stats?stream=0 HTTP/1.1
Host: 127.0.0.1  # press <enter> twice and wait for response

HTTP/1.1 200 OK
Content-Type: application/json
Server: Docker/1.10.3 (linux)
Date: Fri, 09 Sep 2016 09:57:20 GMT
Transfer-Encoding: chunked

1127
{"read":"2016-09-09T09:57:20.896712812Z","precpu_stats":{"cpu_usage":{"total_usage":981229

In case of no response, we should likely reassign to docker/docker.
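For anyone without ncat at hand, the same probe can be sketched in Python using only the standard library. This is a rough sketch, not part of docker-py: the function names, the 10-second timeout, and the added `Connection: close` header are assumptions made for illustration, while the request line and socket path are taken from the session above.

```python
import socket


def build_stats_request(container_id, api_version="1.22"):
    """Build the raw one-shot stats request used in the ncat session."""
    return (
        "GET /v{0}/containers/{1}/stats?stream=0 HTTP/1.1\r\n"
        "Host: 127.0.0.1\r\n"
        "Connection: close\r\n"  # assumption: ask the engine to close after replying
        "\r\n"
    ).format(api_version, container_id).encode("ascii")


def probe_stats(container_id, sock_path="/var/run/docker.sock", timeout=10.0):
    """Return the first bytes of the reply, or None if nothing arrives in time."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(sock_path)
        sock.sendall(build_stats_request(container_id))
        return sock.recv(4096)
    except socket.timeout:
        # No reply within the timeout: the hang described in this issue.
        return None
    finally:
        sock.close()
```

A `None` result from `probe_stats(...)` would correspond to the "no response" case, which points at the engine rather than at docker-py.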

Yup, that bug report looks like exactly what I'm seeing, except I'm also seeing it on docker 1.12.1 on Ubuntu. Trying the ncat direct http test now, it appears to be hanging indefinitely (I'm assuming it'll return when the container finishes).

I'd like to point out that while this might be a bug in docker, it is triggered by some behavior in docker-py. If I run a container from the command line I can get the stats in parallel with no problem. Is it possible this happens because the docker-py client holds open the connection to the socket (i.e. doesn't close/reopen between function calls) and this doesn't play nicely with how docker stats expects to be able to stream data back to the client?

Come to think of it, I'm going to test exactly this by closing the client after starting the container, using a fresh client until the stats runs out, and opening a third one for everything after that.

> If I run a container from the command line I can get the stats in parallel no problem.

That's an interesting twist. So maybe some container configuration you specify interferes with cgroups, so docker is not able to get stats. Let me see...

I initially opened #1194 because I seemed to be having a similar problem, but this GitHub issue frames the problem much better than I did.

Reproduced!

And even found the thing which breaks stats. When I removed 'network_disabled' : True, I was able to get stats once again.

Will using {'network_disabled' : False} in the create_container call fix this?

@michaelbarton that was a response to the OP's first comment. When I used his container configuration I hit the issue; as soon as I removed the network_disabled entry, it started to work.

Doing {'network_disabled' : False} did not fix it for me, neither did explicitly closing the docker-py client and reconnecting before/after trying to get stats. I am beginning to think that Tomas is on the right track that there's something about the way the container is started that prevents docker from collecting the stats.

Wait! {'network_disabled' : False} did fix my problem! But it was masked by another important problem that might be related to the source of the bug. In my code I iterate until the container disappears from the list of running containers:

    while container['Id'] in [i['Id'] for i in c.containers()]:
        print(c.stats(container['Id'], stream=False))
        time.sleep(2)

With {'network_disabled' : False} that loops correctly _until the container is no longer running_. Then, and _I suspect there's a race condition here_, what should be the final call to stats() hangs. My solution is to set the client timeout to something smaller and less annoying like 15 seconds and do this:

    while container['Id'] in [i['Id'] for i in c.containers()]:
        try:
            stats = c.stats(container['Id'], stream=False)
        except Exception:
            # the final stats() call times out once the container has exited
            break
        print(stats)
        time.sleep(2)

So this is actually fixed in upstream: https://github.com/docker/docker/pull/25905

The problem is that the container has networking disabled (SandboxID is set to ""), so the stats collector fails to pick up networking stats (obviously) and won't publish _any_ stats to the channel, meaning the server never responds with any stats.

To work around this issue: enable networking.
To fix this issue: wait for upstream to release a new version of docker (> 1.12.1).
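If code has to run against both old and new engines, one option is to decide at runtime whether the workaround is still needed. The sketch below is only an illustration: the helper names are hypothetical, and treating every release up to and including 1.12.1 as affected is an assumption drawn from the version bound quoted above.

```python
import re

# Assumption: releases up to and including this one need the workaround.
LAST_AFFECTED = (1, 12, 1)


def parse_engine_version(version):
    """Turn a string such as '1.13.0-rc1' into a tuple such as (1, 13, 0)."""
    numbers = re.findall(r"\d+", version.split("-")[0])
    return tuple(int(n) for n in numbers[:3])


def stats_need_networking(version):
    """True if stats on this engine are expected to hang without networking."""
    return parse_engine_version(version) <= LAST_AFFECTED
```

With docker-py the version string should be available as `c.version()['Version']`, so `'network_disabled'` could be set to `not stats_need_networking(c.version()['Version'])` when stats are required.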

I think there is a chance of a race condition between checking whether a container is still running and then subsequently trying to get the Docker stats for it. An alternative might be to break if the container is no longer running, and just pass in the except block. Otherwise, I think there is a chance that there really is a network timeout error and you stop collecting metrics from the container while it is still running.

I mention this because I spent all of yesterday struggling with the same problem.
https://github.com/bioboxes/bioboxes-py/blob/master/biobox/cgroup.py
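The alternative described above could look roughly like the following. It is a sketch rather than the code from the linked repository: `poll_stats`, `get_stats`, and `is_running` are hypothetical names, and the two callables are passed in so the loop does not depend on a particular client.

```python
import time


def poll_stats(get_stats, is_running, interval=2.0, sleep=time.sleep):
    """Collect stats samples until the container is no longer running."""
    samples = []
    while is_running():
        try:
            samples.append(get_stats())
        except Exception:
            # A timeout alone does not prove the container exited, so do not
            # break here; the is_running() check above decides when to stop.
            pass
        sleep(interval)
    return samples
```

With the client from this thread, the callables might be `lambda: c.stats(container['Id'], stream=False)` and `lambda: container['Id'] in [i['Id'] for i in c.containers()]`. A genuine network timeout then costs one sample instead of ending collection early.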

@michaelbarton: yeah, there's definitely a race condition in my code, but I'd expect a graceful failure (i.e. stats of 0). I was thinking that I was actually racing against a bug in docker where trying to get the stats of a stopped container hangs. Which according to @TomasTomecek might be the case!

For anyone else tripping over this while we wait for docker to get fixed, I'm using the generator (streaming) version of stats() since it's a bit cleaner (you still have to catch the final timeout):

    c = Client(base_url='unix://var/run/docker.sock', version='auto', timeout=15)

    # ... create/start your container *WITH NETWORKING ENABLED* ...

    try:
        for stat in c.stats(container['Id']):
            print(stat)
    except Exception:
        pass  # Gross.

    # ... collect up container exit code, stdout, stderr, etc.
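To turn each sample into the CPU and memory numbers the original question asks about, something like the sketch below may help. The function name is hypothetical, the field names are those of the stats JSON returned by the engine, and the CPU formula mirrors my understanding of how the docker CLI derives its percentage, so treat it as an approximation.

```python
import json


def summarize(sample):
    """Reduce one stats sample (dict or JSON text) to a few headline numbers."""
    if not isinstance(sample, dict):
        if isinstance(sample, bytes):
            sample = sample.decode("utf-8")
        sample = json.loads(sample)
    cpu, pre = sample["cpu_stats"], sample["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    sys_delta = cpu.get("system_cpu_usage", 0) - pre.get("system_cpu_usage", 0)
    # Fall back to one CPU when the per-CPU breakdown is missing.
    ncpus = len(cpu["cpu_usage"].get("percpu_usage") or []) or 1
    percent = 0.0
    if cpu_delta > 0 and sys_delta > 0:
        percent = float(cpu_delta) / sys_delta * ncpus * 100.0
    return {
        "read": sample.get("read"),
        "cpu_percent": percent,
        "memory_bytes": sample.get("memory_stats", {}).get("usage"),
    }
```

Inside the streaming loop this would be called as `summarize(stat)`; accepting both dicts and JSON text covers either form the generator may yield, depending on whether decoding is enabled.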

@michaelbarton: Oh hey! You're in bioinformatics, too! Birds of a feather suffer through frustrating docker bugs together... :)

Thanks to both of you for your input here; it has been very useful in helping me update my code.

Glad you guys were able to figure it out! Thanks @TomasTomecek for helping get to the bottom of this. Since the problem has been identified as an engine bug, I'll go ahead and close this issue, but feel free to reopen if there's anything else.

Would you mind updating this issue again when the underlying engine bug is resolved? Thank you.

Was the Docker engine bug that caused this eventually resolved? Is there a linked issue?

It's been tagged for 1.13.0, which is the upcoming release. https://github.com/docker/docker/pull/25905

If you want, you can try the current 1.13 release candidate and see if it indeed solves your issue.
