Moby: Cannot remove network due to active endpoint, but cannot stop/remove containers

Created on 6 Jun 2016 · 58Comments · Source: moby/moby

Output of docker version:

Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:30:23 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 15
 Running: 13
 Paused: 0
 Stopped: 2
Images: 215
Server Version: 1.11.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 248
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host overlay
Kernel Version: 4.4.0-22-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.686 GiB
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Cluster store: consul://xxx
Cluster advertise: yyy

I am trying to delete a network with docker network rm <network>, but it complains with

Error response from daemon: network xxx_default has active endpoints

Indeed, when I run docker inspect xxx_default I got:

"Containers": {
            "ep-3dd9d8a572c1bfa877da875f3f0640dba9fe0bdf7ff6090a2171dcbebc926b55": {
                "Name": "release_diyaserver_1",
                "EndpointID": "3dd9d8a572c1bfa877da875f3f0640dba9fe0bdf7ff6090a2171dcbebc926b55",
                "MacAddress": "02:42:0a:00:03:04",
                "IPv4Address": "10.0.3.4/24",
                "IPv6Address": ""
            },
            "ep-da1587e9a9fed7d767d79e1ff724a6f6afe56126dae097d9967a9196022ad103": {
                "Name": "release_server-postgresql_1",
                "EndpointID": "da1587e9a9fed7d767d79e1ff724a6f6afe56126dae097d9967a9196022ad103",
                "MacAddress": "02:42:0a:00:03:03",
                "IPv4Address": "10.0.3.3/24",
                "IPv6Address": ""
            }
        }

But when I try to docker stop/rm any of these two containers (either by name or ID) I got:

Error response from daemon: No such container: release_diyaserver_1

So basically I'm stuck with a useless network, which I can't rm, and this is a real problem because I need to recreate the containers having that same name, but it complains with I try to recreate them.

Is there a way I can get out of this?

It's overlay networks, and I run consul as at the KV store.
There is only one consul node, on the same host (because I don't need the multi host network know)

Thx in advance.

arenetworking versio1.11

Source

nschoe

👍20

Most helpful comment

Can you try using --force to disconnect the container?

docker network disconnect --force <network> release_diyaserver_1

thaJeztah on 6 Jun 2016

👍50 ❤6

All 58 comments

Can you try using --force to disconnect the container?

docker network disconnect --force <network> release_diyaserver_1

thaJeztah on 6 Jun 2016

👍50 ❤6

Yes I had tried that and got a different error message from daemon. I don't recall it, unfortunately. But in the end it still did not work.

I ended up destroying my consul server and the consul Named Volume to erase the data and restarted the server.

But this is extremely unfortunate:I can't afford to do that too often :/

I think this has to do with the consul server and docker not correctly registering that a container is removed.

Don't know if other people had the same problem?

nschoe on 6 Jun 2016

We've seen this in situations where a daemon was not shutdown cleanly, but I don't recall if there were other situations.

ping @mavenugo any suggestions?

thaJeztah on 6 Jun 2016

Typically when you see containers in docker network inspect output with a ep- prefix, that means it can be either of 2 cases -

these are stale endpoints left over in the DB. For those cases, docker network disconnect should help.
these are remote endpoints seen in other nodes that are part of the overlay network. The only way to clean them up are from that specific host.

mavenugo on 6 Jun 2016

👍2

Thanks @mavenugo for coming here :-)

I've noticed the ep- prefix of the container and suspected that indeed, it was something like that.

However, I can confirm that docker network disconnect <network> ep-xxx did not solve the problem, because the daemon responded with no such container: ep-xxx.

And there were, in fact, no other nodes that are part of the overlay: this is a single-host overlay network (I don't need multi-host yet, but I do need a high number of subnets, which bridge cannot give me yet (see #21776).

For sanity let's time it happens, I will re-check with the --force option to docker networks disconnect, but I'm 90% sure I tried it and it failed :/

Thanks for your support!

nschoe on 6 Jun 2016

👍2

@nschoe let me close this issue for now, but happy to reopen if there's more information to investigate. Thanks for reporting!

thaJeztah on 6 Jun 2016

@thaJeztah Hi again, it was not long ^^

The issue happened again:

$ docker network inspect appsapps_default

[
    {
        "Name": "appsapps_default",
        "Id": "3cda39493fb0c42f966e3f8a9d4458b3574716062445a7f469f776dc680fa71d",
        "Scope": "global",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "10.0.2.0/24",
                    "Gateway": "10.0.2.1/24"
                }
            ]
        },
        "Internal": false,
        "Containers": {
            "83bfe13fa1fcb60f75e64f4df44f59029d2e2471ceac81177d23fa3b0cdd2b79": {
                "Name": "appsapps_diya-apps_1",
                "EndpointID": "9ae34cecd1a0190792c84842e510e9d6e363b37581aa8ccbac8bd2ea32f0966f",
                "MacAddress": "",
                "IPv4Address": "",
                "IPv6Address": ""
            },
            "fe4f82c07613d060f78c15a0e285b115e9cb560862af0077cb2d792993ab6e09": {
                "Name": "appsapps_portal-db_1",
                "EndpointID": "44072162d4bc012f9529ee78d257281b557a98b2336bc344a053283cd1d07e76",
                "MacAddress": "",
                "IPv4Address": "",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

And neither containers exist when I do docker ps -a.

Then I tried this:

$ docker stop appsapps_diya-apps_1

Error response from daemon: No such container: appsapps_diya-apps_1

$ docker rm appsapps_diya-apps_1 

Error response from daemon: No such container: appsapps_diya-apps_1

Then network disconnect:

$ docker network disconnect appsapps_default appsapps_diya-apps_1

Error response from daemon: No such container: appsapps_diya-apps_1

and with the -f:

$ docker network disconnect -f appsapps_default appsapps_diya-apps_1

Error response from daemon: unable to force delete endpoint appsapps_diya-apps_1: invalid endpoint locator identifier

Note: I have tried every above command by replacing the container name with the container ID, I got the same result.
Weird thing: the container ID doesn't start with ep- this time.

Note that this is an overlay network, and I have only had this problem with overlay networks.

Any idea? Solution?

nschoe on 8 Jun 2016

👍1

No ideas, but let me reopen the issue

thaJeztah on 8 Jun 2016

Thanks.

For information, I have been trying many things right now: restarting the docker daemon, restarting the consul container, I feel the only solution left is to destroy the consul instance (with its Named Volume), loosing everything.

Then restarting the docker daemon and recreating a consul server.

Some food for thoughts:

is the fact that my consul server is in a container a problem?
is the fact that I have only one consul server a problem? (I remind you that I haven't setup a multi host configuration, the consul is only here because I need it to have overlay networks)

nschoe on 8 Jun 2016

Same issue.
Everything is similar to nschoe.
Containers runnig by compose in swarm cluster.

nejtr0n on 22 Jun 2016

@nschoe
Regarding the error you get during the force endpoint disconnect:

Error response from daemon: unable to force delete endpoint appsapps_diya-apps_1: invalid endpoint locator identifier

specifically regarding invalid endpoint locator identifier, this could be due to the fact you started the daemon without --cluster-advertise option. Can you confirm that is the case ?
Thanks

aboch on 22 Jun 2016

Cluster-advertise is present for me:
/usr/bin/docker daemon -H tcp://10.0.0.1:35100 -H unix:///var/run/docker.sock --cluster-advertise=10.0.0.1:35100 --cluster-store=consul://10.0.0.1:35004 --label dc=dc1 --label phy=phy1 --label cluster_id=1 --dns=10.0.0.1

nejtr0n on 23 Jun 2016

@aboch thanks for answering, but unfortunately, no, I confirm that I _have_ the --cluster-advertise set:

--cluster-store=consul://127.0.0.1:8500 --cluster-advertise=enp0s31f6:2376

nschoe on 23 Jun 2016

@nejtr0n @nschoe I went through the code and I was not able to find yet how the invalid endpoint locator identifier issue would occur if the daemon instance which first created the container was started with the cluster-advertise option.

At the moment, I can only think the first time the container was created, the daemon was not started with the advertise option (Passing the option to subsequent daemon start would not fix it). Let me know if this possibly happened during your workflow.
Thanks.

aboch on 23 Jun 2016

@aboch I cannot be 100% sure that I did not do that ---though I'm still pretty sure--- so I won't say I did not.

However, today I've got the same problem again: I cannot connect my container to my network because "service endpoint with name diya-weather-server already exists", which I can confirm by running docker network inspect on the said network, note that, as before, the container's ID in this case doesn't have the ep- prefix, but a plain hash.

However I cannot disconnect even with -f this non-existing container, neither by name nor by ID.

So I'm really stuck because I don't know what to do :/

I have checked in /var/lib/docker/containers but there is no directory whose name is the hash of the non-existing container so I can't even manually delete it :/

nschoe on 30 Jun 2016

i'm try to remove swarm overlay endpoints, system say ok, but on step when i try remove the overlay network i get some error too:

Error response from daemon: network es-swarm-overlay has active endpoints

westsouthnight on 1 Jul 2016

I also seem to have having this issue. It seems the containers were originally created on a node that no longer exists, so unfortunately I can't try what @mavenugo said. Any ideas?

I did think creating a node with the same name MAY work, but I wanted to confirm it would before I spent time setting one up!

adamlc on 20 Jul 2016

Hi again, I have some "news".
I keep getting this error and this has become a real problem since once of our production servers was shutdown, and restarted, all (docker) systems were unable to restart due to this error.

I have dug deeper into the problem and I think I have found a way to reproduce the error (at least I have been doing this three or four times and it failed every time, so I suppose this counts as reproducible!).

NOTE: I previously thought this was a problem due to the fact that I was running a single-node consul cluster and this node was itself a docker on the host I was making crash. But I have successfully ruled that out: I have created the consul cluster on a remote server, and the consul node that I run on my docker host is only a consul client that connects to the remote consul server.
So I'm going to show the steps to reproduce the error with a remote server, but I'm confident it will be the same if you don't have a remote server: you just have to create the consul server on the same host---this was the way I was before.

So, first versions and stuff:

$ docker info

Containers: 3
 Running: 1
 Paused: 0
 Stopped: 2
Images: 111
Server Version: 1.11.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 143
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: overlay bridge null host
Kernel Version: 4.4.0-31-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.616 GiB
Name: nschoe-PC
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Cluster store: consul://127.0.0.1:8500

$ docker version

Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 22:00:43 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 22:00:43 2016
 OS/Arch:      linux/amd64

I'm running on Ubuntu 16.04.

Steps to reproduce the error

# On my remote server (same docker version), I create a consul server inside a docker container.
Here is the docker-compose.yml file:

version: '2'
services:
    consul:
        image: consul
        hostname: "node-remote"
        command: "consul agent -server -bootstrap-expect 1 -data-dir /consul/data -bind 192.168.0.1"
        volumes:
            - consul-kv-store:/consul/data
        network_mode: "host"
        restart: always

volumes:
    consul-kv-store:
        driver: local

So nothing fancy: I simply create a consul node server, which bootstraps itself (because only 1 server is needed). Just in case you're wondering about 192.168.0.1, it's because I've set up a OpenVPN tunnel between the remote and my computer, and it's its interface.

# On my local computer, I start a consul client that connects to this remote server. The compose file is:

version: '2'
services:
    consul:
        image: consul
        hostname: "node-nschoePC"
        command: "consul agent -data-dir /consul/data -bind 192.168.0.6 -join 192.168.0.1"
        volumes:
            - consul-kv-store:/consul/data
        network_mode: "host"
        restart: always

volumes:
    consul-kv-store:
        driver: local

Very similar: simply a consul client.

# My docker host daemon is configured with such (systemd service file:)

[Unit]
Description=Docker Application Container Engine (insecure registry and consul)
After=network-online.target docker.socket openvpn.service

[Service]
TasksMax=infinity
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// --cluster-store=consul://127.0.0.1:8500

Nothing fancy: I set the TaskMax to infinity because usually I need to create a big number of stacks and it quickly reaches the max default number.
The interesting line is the --cluster-store=consul://127.0.0.1:8500 which instructs the docker daemon to contact the consul cluster. This is the address of our dockerized consul client.
Note that we have After=openvpn.service to make sure docker waits for the VPN tunnel to be effective before trying to start and reach the consul server.

# Now the containers stuff

First create some overlay networks:

docker network create --driver overlay network1
docker network create --driver overlay network2

Check that there have been created (but we did not have any error so...)

docker network ls
NETWORK ID          NAME                DRIVER
ce53302f8b87        bridge              bridge               
03d2e9971ea8        docker_gwbridge     bridge              
96bdb8ad89f9        host                host                
d15fa9fe60e1        network1            overlay             
4cbcac116962        network2            overlay             
25f20e6716fd        none                null

Second, create containers using the overlay network:

docker run -d -t --name cont1 --net network1 --restart=always ubuntu bash
docker run -d -t --name cont2 --net network2 --restart=always ubuntu bash

Just two containers, each one using an overlay network. The -t option and the bash command keeps them running. Note the use of --restart=always

Last, issue a machine reboot: reboot.

Upon machine reboot,

docker ps

NAMES               STATUS
consul_consul_1     2 minutes

So my consul container restarted correctly: but now my two containers cont1
and cont2. When I query with the -a option:

docker ps -a

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS                       PORTS               NAMES
5c1c9354b3dd        ubuntu              "bash"                   2 minutes ago       Exited (128) 2 minutes ago                       cont1
c14fd5737e10        ubuntu              "bash"                   3 minutes ago       Exited (128) 2 minutes ago                       cont2
48c31a4f9a74        consul              "docker-entrypoint.sh"   48 minutes ago      Up About a minute                                consul_consul_1

So both of my containers have exited with return code 128 (I'm not sure what this means, btw).

When trying to see the logs, nothing seems anormal:

docker logs cont1

root@5c1c9354b3dd:/# exit

It seems to have exited gracefully on host restart. But I have the errors that I mention in my first post: I cannot start the container because I've got the "network already has endpoint with name cont1", then I try deleting the container, and disconnecting from the network. but then I've got the error saying there is not such container.

The only solution here is to log on the remote server running the consul server (which has not crashed of course), run docker-compose down -v to delete the persistent volume, and restart the consul.

But there are many things that I don't understand: why won't the containers automatically restart upon reboot? They _have_ been created with --restart-always, and besides, the container running the consul client _does_ reboot! Why note cont1 and cont2?
And then, why is there this broken state? I previously thought it was due to the fact that the docker daemon was shutdown before it could commit the changes to it's consul server---since it was itself a docker. But now the server is outside, so it seems it _still_ doesn't have a chance to commit the changes?

Does this mean that running overlay network makes docker installations not reliable and especially not reliable to host crash?

What worries me too, is that it was a graceful shutdown, run with reboot, what will it be when the kernel panics, or the machine crashes so hards that it has to restart?

Reading the messages again I see @aboch suggested that it might be the --cluster-advertise parameter. I am re-running my tests right now and will keep you up to date, but with this new intel, do you guys see anything that might be the problem?

nschoe on 27 Jul 2016

👍5

I may have some additional info on the subject, namely: why won't the containers start on reboot, where the consul node doest restart.

Running docker inspect on a container that was stopped with Exit code 128 and which was supposed to restart showed me that I had:

"State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "network client3_default not found",
            "StartedAt": "2016-07-28T08:21:50.134417614Z",
            "FinishedAt": "2016-07-28T08:22:27.323382758Z"
        },

The interesting line here is the Error: apparently it did not start because it did not find the overlay network. I suspect a race condition here between the dockerized consul agent and my docker containers. They all have restart policies, but I suppose by the time the docker containers start, the dockerized consul cluster is not yet ready, so the overlay networks are not yet available and all fail to start.

I think I'm doomed here and I need to have my consul agent outside of docker so that I can make it a systemd service and have docker wait for the consul agent to be started before starting itself.

This is one possibility. I'm wondering about another one now, that I'm going to test right now: have the consul server on the remote, but no local consul agent on the host. Rather than making my docker daemon contact the cluster store at 127.0.0.1:8500, I'll try making it contact the cluster store at 192.168.0.1:8500 (the remote server).

I'll keep you up.

nschoe on 28 Jul 2016

Sorry for multi posting, but I think the more info we have, the easiest it will be to diagnose the problem. Actually I think I have confirmed the whole problem I have is due to a combination of factors:

I'm using overlay network, so I need a KV store
I'm using a dockerized (consul) KV store on the host
I have restart policies for the container using the overlay network

And actually, this all seems clear: overlay network information is stored in the KV store. If we use restart policies on all container (containers using the overlay network and containers running the KV store), they will restart when the host reboots.
The problem is that we cannot specify some priorities between containers starting. This is normal and I understand it: normally, it's the application's responsibility to check for its service availability, not docker's.

But here it's not an application problem: it's docker not being able to start the container because it's overlay network is not ready, and its overlay network is not ready because the container running the KV store is not finished starting yet.

So what I think is best is for docker to have some retry capabilities: if it cannot start the container because the network doesn;t exist and it is configure to use a --cluster-store, then it should wait then retry.

nschoe on 28 Jul 2016

Running those services in a container is a known problem; same with (e.g.) a container that is used for collecting logs. We've had a proposal for introducing a concept of "system containers", but possibly we can think of implementing such services through the new "plugin" feature, that's in 1.12 experimental

thaJeztah on 28 Jul 2016

@thaJeztah ah thanks for answering!

Okay so this is what I was suspecting---still happy I diagnosed this, this is tricky!

I don't know yet of the plugin feature, I haven't tried 1.12 yet. I'd suggest that it is removed from docker documentation, because reading it, running the consul agent in a container looks like a good idea, which is not really :/

I'm currently trying to have the consul cluster on a remote host, and making it possible to query this remote store directly.

Thanks.

nschoe on 28 Jul 2016

Okay so here I am again, with some (bad) news.

I have configured my remote consul server so that it can be queried remotely (on the port 8500). I have then configured my docker daemon with --cluster-store pointing to this remote consul server.

I have then created overlay networks and container that joined this overlay network, on my host. They were running, with a restart policy, then I simply rebooted the host with reboot.

When it started back up, no container were started, docker ps -a indicated that error code was 128, as usual.

When inspecting the container, I've got the error service endpoint with name xxx already exists. And indeed, when I run docker network inspect <network-name> I've got my containers that appear connected.

I can run docker network disconnect <network-name> <container-name>, it doesn't return an error, but then when I run docker network inspect <network-name> it is still marked as connected!
And in the output given by docker network inspect, the container's ID begin with ep-xxxx.

So it's sort of frustrating because unless I'm doing something wrong: it is currently impossible to have auto restart capabilities when running overlay network.

I don't know the cause: it seems like hen the docker daemon stops (gracefully), it doesn't "unregister" or disconnect the containers from the network. I hope this helps to pinpoint the error location, because that is a huge problem not having auto restart capabilities :/

nschoe on 28 Jul 2016

@nschoe sorry that you have to go through these issues. when you see ep-xxxx, that means that particular node couldn't find a container that matches the name of the endpoint. It could either be the container with the same name exists in a remote node. Or the endpoint is infact stale in the local node. If you believe the endpoint is stale in the local node, try using the -f option in docker network disconnect -f to forcefully remove a container from a network.

But regardless of the above workaround, i would like to understand why your containers fail the auto-restart policy. Can you get the daemon logs during the bootup and also highlight the container-id / endpoint-id that we are talking about ?

mavenugo on 28 Jul 2016

@mavenugo thanks for commenting :-)
Yeah I'm sorry too, but hey, I know this is pretty complex setup and Docker is still young. It doesn't "owe" me anything, I'm just trying to give as much information as I can to help Docker improve :-)

Okay for the ep-xxx prefix, so yeah I'll try using -f if needed. But actually that's not really the problem: since I have to intervene by hand, I'm as good as deleting the consul cluster and restarting it.

What I really need is full auto-restart capabilities: in case the server is shutdown and restarts I'd like my whole stack to pop itself back to life, as restart policies are supposed to do :-)

Anyway, I'd like to understand why they fail too. So my last setup was this:

I had a cluster server on a remote server, that I configured so that I could query the 8500 port on its public IP (this is through an encrypted VPN so as not to leak anything).
I have confirmed that I could indeed query the remote cluster server
I have configured my local docker daemon with --cluster-store=192.168.0.1:8500, _i.e._ the ip of the remote cluster server.
there is no consul cluster running on the host (just to be clear). So I have completely removed the previous problem cause: there is no race conditions now between the cluster container and the other containers. The consul cluster on the remote host doesn't not fail or crash. As soon as the docker daemon starts it is able to query the consul cluster.

So even with that setup, the containers using the overlay network fail to restart. And I have observed with docker network inspect that there were _still_ nodes connected to the overlay network after reboot. So my best guess---this is still a guess---is that when the docker daemon exits, it doesn't notify the cluster, or not correctly, or doesn't disconnect nodes from the network, I'm not sure how this is supposed to work.

I will try yo give you the docker daemon logs now. Maybe that will shed some new lights. Stay tuned :-)

nschoe on 28 Jul 2016

@nschoe thanks for your understanding and helping to improve docker. Yes. we had problems with running kv-store as a container as well on the local machine where overlay containers are running due to the container bootup order. But @cpuguy83 addressed that using https://github.com/docker/docker/pull/22561.

Regardless, with your current setup, container bootup order shouldnt matter. When the daemon comes up, it will try to do the cleanup of stale containers (left behind during previous daemon down) and will result in cleaning up the states stored in kv-store as well. Hence, it will be very useful if you start the daemon in debug mode -D and capture the bootup logs in its entirety. Please start from a clean cluster / daemon states to repeat the reproduction so that we have an upto-date and cleaner debug logs to look into.

BTW, All these problems are solved in 1.12.0 with swarm-mode. No more external KV-Stores ... yayyyy !!!

mavenugo on 28 Jul 2016

So I wanted to capture the daemon logs, but it is starting to be painful to reboot the whole machine every time, so I stopped my systemd docker service, and started launching the docker daemon by hand, capturing the output.

So I started the daemon with sudo docker daemon --cluster-store=consul://192.168.0.1:8500 >> docker_logs 2>&1 and I've got this awful lot of logs: link here

Note that at this point, I have _no_ containers. It was before I tried my experiment, I was just looking at the logs to see if my captured failed. So... is this normal to have all these errors at the beginning of the launch?

By the way it was not launched with -D. Is this normal?

EDIT: the docker daemon start is very long: it is unresponsive while it dumps all those logs. Typically, docker ps hangs until all these startup error logs are dumped, and then resolves. But this takes easily 30 seconds. Is my system in a dirty state somehow? (Note that I have 0 containers while running docker ps -a).

nschoe on 28 Jul 2016

Ah @mavenugo you beat me to it. I saw that @cpuguy83 was working on it, did see the PR, though.

I saw that 1.12 brought many new features, but I'm not ready to switch to it yet: this is my production setup, and I'm not yet familiar with 1.12 and its retro compatibility to switch just yet :-)

@mavenugo the funny thing is: I'm don't even need overlay networks for swarm cluster: I am not trying to build swarm cluster. The _only_ reason I'm using overlay driver network is because our production setup launches dozens of containers stacks, each stack being comprised of two or three containers that are linked inside a custom network. The problem is that with this, one can only create about 25 stacks, because then, docker runs out of subnets. This is due to the fact that by default, bridge subnets are /16, which is much too large for me, /24 would have been enough and there would have been way more subnets available.

I saw that overlay networks were /24 so I switched to them. But the truth is: all my containers are on the same host, so technically I would not need overlay network. _By the way, will 1.12 swarm-mode help me with this?_

But anyway, this is very good that you found a way to get rid of external KV-store.

nschoe on 28 Jul 2016

Okay someone must be kidding me... I have tried several times creating my stacks, stopping the daemon and restarting it, and the container restarted. Damn it, I cannot seem to reproduce the error now that I am logging of course...

I'm still including the logs before and after restart, _maybe_ this will help anyway. I'd like to know if my setup is in a broken state or now before pursuing: can anyone tell me if these loooong stacks of error messages on startup are normal or not?

Thanks in advances.

nschoe on 28 Jul 2016

Hi, it's me again.

It's about @cpuguy83's #22561 PR. I've tried it and it doesn't seem to work on my machine.

Since I was worried about all those error messages on docker daemon startup, what I did is completely uninstall docker, rm -rf /var/lib/docker/ afterwards and started from a clean, fresh install.

So I have confirmed that none of these messages appeared again on docker startup, and it was _much_ faster to start. I have tried creating containers / deleting them, restarting, etc. No more error messages in the logs.

Then I setup a basic consul server on the host, in a container, for information, the compose file is:

version: '2'
services:
    consul:
        image: consul
        hostname: "node-nschoePC"
        command: "consul agent -server -bootstrap-expect 1 -data-dir /consul/data -bind 127.0.0.1"
        volumes:
            - consul-kv-store:/consul/data
        network_mode: "host"
        restart: always

volumes:
    consul-kv-store:
        driver: local

And I have set up the service file with --cluster-store:consul://127.0.0.1:8500.

Then I created an overlay network with docker network create -d overlay network1, then I made two containers join this network with docker run -dt --name=cont1 --net=network1 --restart=always ubuntu bash (same with --name=cont2).

I have confirmed they were running, waited a little, then sudo systemctl restart docker, and I now I had those error / warning message logs back in the journal.

Here are the logs from the moment it began to shut down to the moment it finished starting back up. At this point, I had one consul container, one overlay network and two simple containers joined to this network. That's all. I find these are rather verbose logs for just a fresh install, no?

-- Unit docker.service has begun shutting down.
Jul 29 09:42:49 nschoe-PC docker[30019]: time="2016-07-29T09:42:49+02:00" level=info msg="stopping containerd after receiving terminated"
Jul 29 09:42:49 nschoe-PC docker[30019]: time="2016-07-29T09:42:49.916328501+02:00" level=info msg="Processing signal 'terminated'"
Jul 29 09:42:49 nschoe-PC docker[30019]: time="2016-07-29T09:42:49+02:00" level=fatal msg="containerd: serve grpc" error="accept unix /var/run/docker/libcontainerd/docker-containerd.sock: use of closed network connection"
Jul 29 09:42:49 nschoe-PC docker[30019]: time="2016-07-29T09:42:49.917241049+02:00" level=error msg="failed to receive event from containerd: rpc error: code = 13 desc = \"transport is closing\""
Jul 29 09:42:52 nschoe-PC docker[30019]: time="2016-07-29T09:42:52.920909523+02:00" level=info msg="New containerd process, pid: 32416\n"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54.797855093+02:00" level=warning msg="container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652 restart canceled"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54.799715931+02:00" level=warning msg="Failed getting network for ep 2572e177590718adc1b23a120b2a5d2d1400ccebf659ef873574987470aa3607 during sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54.799787943+02:00" level=error msg="Error deleting sandbox id 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 for container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652: could not cleanup all the endpoints in container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652 / sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54.811410767+02:00" level=warning msg="container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136 restart canceled"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54.812236849+02:00" level=warning msg="Failed getting network for ep fb05ac904b932f714a45645db73bd8b30c618f4ef652e0beb001f13dcc165fca during sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54.812274681+02:00" level=error msg="Error deleting sandbox id c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 for container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136: could not cleanup all the endpoints in container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136 / sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54.817222420+02:00" level=warning msg="container 58441707ee642951847c581de767c90e54f0352deae5966c2a1d21098a673221 restart canceled"
Jul 29 09:42:54 nschoe-PC docker[30019]: time="2016-07-29T09:42:54+02:00" level=info msg="stopping containerd after receiving terminated"
Jul 29 09:42:55 nschoe-PC systemd[1]: Stopped Docker Application Container Engine.
-- Subject: Unit docker.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has finished shutting down.
Jul 29 09:42:55 nschoe-PC systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has begun starting up.
Jul 29 09:42:56 nschoe-PC docker[32466]: time="2016-07-29T09:42:56.002295997+02:00" level=info msg="New containerd process, pid: 32473\n"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.068018828+02:00" level=info msg="[graphdriver] using prior storage driver \"aufs\""
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.076361679+02:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.085021001+02:00" level=info msg="Firewalld running: false"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.201057143+02:00" level=error msg="getNetworkFromStore for nid dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 failed while trying to build sandbox for cleanup: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.201794118+02:00" level=info msg="Removing stale sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 (53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136)"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.201948815+02:00" level=warning msg="Failed getting network for ep fb05ac904b932f714a45645db73bd8b30c618f4ef652e0beb001f13dcc165fca during sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.201971568+02:00" level=error msg="failed to delete sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 while trying to cleanup: could not cleanup all the endpoints in container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136 / sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.214721471+02:00" level=error msg="getNetworkFromStore for nid dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 failed while trying to build sandbox for cleanup: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.215064451+02:00" level=info msg="Removing stale sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 (43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652)"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.215213244+02:00" level=warning msg="Failed getting network for ep 2572e177590718adc1b23a120b2a5d2d1400ccebf659ef873574987470aa3607 during sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.215243355+02:00" level=error msg="failed to delete sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 while trying to cleanup: could not cleanup all the endpoints in container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652 / sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.215708202+02:00" level=info msg="Removing stale endpoint gateway_43a1512bd1f1 (7942a27c73036b2e04feabaaddb3f9f075366ed718214c7c9632a355307bfb33)"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.216123025+02:00" level=warning msg="driver error disconnecting container gateway_43a1512bd1f1 : endpoint not found: 7942a27c73036b2e04feabaaddb3f9f075366ed718214c7c9632a355307bfb33"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.230567712+02:00" level=warning msg="failed to leave sandbox for endpoint gateway_43a1512bd1f1 : container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652: endpoint create on GW Network failed: service endpoint with name gateway_43a1512bd1f1 already exists"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.243278703+02:00" level=warning msg="driver error deleting endpoint gateway_43a1512bd1f1 : endpoint not found: 7942a27c73036b2e04feabaaddb3f9f075366ed718214c7c9632a355307bfb33"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.254341479+02:00" level=info msg="Removing stale endpoint gateway_53960e0bb487 (dc4215323012a33db5cfc4d64830a4cec16b0ce0b76918ac3231cd0920da9ed4)"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.255607271+02:00" level=warning msg="driver error disconnecting container gateway_53960e0bb487 : endpoint not found: dc4215323012a33db5cfc4d64830a4cec16b0ce0b76918ac3231cd0920da9ed4"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.269705728+02:00" level=warning msg="failed to leave sandbox for endpoint gateway_53960e0bb487 : container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136: endpoint create on GW Network failed: service endpoint with name gateway_53960e0bb487 already exists"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.284557267+02:00" level=warning msg="driver error deleting endpoint gateway_53960e0bb487 : endpoint not found: dc4215323012a33db5cfc4d64830a4cec16b0ce0b76918ac3231cd0920da9ed4"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.292203592+02:00" level=info msg="Fixing inconsistent endpoint_cnt for network docker_gwbridge. Expected=0, Actual=1"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.356580666+02:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.462473042+02:00" level=info msg="Loading containers: start."
Jul 29 09:42:57 nschoe-PC docker[32466]: ...time="2016-07-29T09:42:57.497541983+02:00" level=warning msg="Failed getting network for ep fb05ac904b932f714a45645db73bd8b30c618f4ef652e0beb001f13dcc165fca during sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.497574069+02:00" level=error msg="failed to cleanup up stale network sandbox for container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.499774574+02:00" level=warning msg="Failed getting network for ep 2572e177590718adc1b23a120b2a5d2d1400ccebf659ef873574987470aa3607 during sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.499816225+02:00" level=error msg="failed to cleanup up stale network sandbox for container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.500657403+02:00" level=warning msg="Failed getting network for ep fb05ac904b932f714a45645db73bd8b30c618f4ef652e0beb001f13dcc165fca during sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.500701826+02:00" level=error msg="Error deleting sandbox id c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01 for container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136: could not cleanup all the endpoints in container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136 / sandbox c6872ce8d79e37c24230837961513493510f7c856aafdb2eaafe6a71fb9fdc01"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.500751956+02:00" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136/shm: invalid argument"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.501607907+02:00" level=warning msg="Failed getting network for ep 2572e177590718adc1b23a120b2a5d2d1400ccebf659ef873574987470aa3607 during sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 delete: network dac2cf5864ab6a4cfbdd76d506914ac9a9e646c3191ded08d54cd45e3b171568 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.501657138+02:00" level=error msg="Error deleting sandbox id 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04 for container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652: could not cleanup all the endpoints in container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652 / sandbox 7d99d9ceb146f191bc89c22cf3361230e2c2fde460f90d2fb7bca7bb65844d04"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.501699802+02:00" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652/shm: invalid argument"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.528088680+02:00" level=error msg="Failed to start container 53960e0bb487775bd50462a8a929e97c34e6bca03ee9e62b929501bb170b6136: network network1 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.560397307+02:00" level=error msg="Failed to start container 43a1512bd1f11374f7a941cc33667076334ca9090b53245fd348ed7aa61d6652: network network1 not found"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.667518816+02:00" level=info msg="Loading containers: done."
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.667538641+02:00" level=info msg="Daemon has completed initialization"
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.667551642+02:00" level=info msg="Docker daemon" commit=b9f10c9 graphdriver=aufs version=1.11.2
Jul 29 09:42:57 nschoe-PC docker[32466]: time="2016-07-29T09:42:57.671920206+02:00" level=info msg="API listen on /var/run/docker.sock"
Jul 29 09:42:57 nschoe-PC systemd[1]: Started Docker Application Container Engine.
-- Subject: Unit docker.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has finished starting up.
-- 
-- The start-up result is done.

Then when running docker ps, only the container running the consul server was restarted. cont1 and cont2 have exit code 128, and docker inspect cont2 gives me "Error": "network network1 not found" in section State.

So from what I understand, it's back to: the consul cluster is not yet ready when the container want to start, but from what I understand, the PR should have fixed this, right?

Any clues?

nschoe on 29 Jul 2016

I had the same issue with the swarm running on top of etcd, but was able to recover from it. Issue might have been caused by reboot of the CoreOS box.

Recovery procedure was: find trouble network endpoint in etcd and delete it. Here is what I did:

Using docker network inspect <network> find trouble endpoint and it's ID:

"ep-abaeb8d676bb077553d254b563931bfae0d38275d9aaf2888298ea34a49d0bb3": {
             "Name": "topbeat_[...]",
              "EndpointID": "abaeb8d676bb077553d254b563931bfae0d38275d9aaf2888298ea34a49d0bb3",
             "MacAddress": "02:42:0a:00:00:1f",
             "IPv4Address": "10.0.0.31/24",
             "IPv6Address": ""
          },

Use etcdctl ls --recursive /docker/network to find endpoint in etcd and delete it:

etcdctl rm /docker/network/v1.0/endpoint/[...]/abaeb8d676bb077553d254b563931bfae0d38275d9aaf2888298ea34a49d0bb3

dmytro on 31 Aug 2016

👍5

I have a similar problem: docker 1.12.5, swarm 1.2.5, etcd 3.0.15

Steps to reproduce the problem:

run a compose file (docker-compose up -d) with several containers and a overlay network (defined in the compose file)
stop the service with "docker-compose down" => got this error: "ERROR: Error response from daemon: network has active endpoints"

Moreover, if I inspect the overlay network (docker network inspect ) I do not see any active endpoints.

If I start the service again (docker-compose up -d) the service is started without problems and if I inspect the network I can see the endpoints.

previ on 17 Dec 2016

This sometimes happens when you try shutdown with env loaded from a node not master/manager.

optimum-web on 13 Feb 2017

I see the same issue on Docker 1.13.1 using a 2-host overlay network with a 3-node Consul 0.7.4 cluster.
I can reproduce the issue by forcibly shutting down one of the Docker hosts.

The result is that after I start the host (and the container) I can see the endpoint with docker network inspect however it shows the old container's ID.
docker network disconnect -f doesn't remove the container, it gives an error message that the endpoint doesn't exist (it is using the new container ID I assume).

It would be great if container ID could be used in network disconnect and that would not be validated against the containers on the host.

balintbako on 15 Feb 2017

👍4

^ same issue here

Platzii on 16 Feb 2017

And same issue here.

Error response from daemon: network gefahr has active endpoints

docker network inspect gefahr

[ { "Name": "gefahr", "Id": "zp1hd8cmb9h5i1fkiylvsifag", "Created": "2017-02-12T11:23:49.363874538Z", "Scope": "local", "Driver": "overlay", "EnableIPv6": false, "IPAM": { "Driver": "default", "Options": null, "Config": [ { "Subnet": "10.0.1.0/24", "Gateway": "10.0.1.1" } ] }, "Internal": false, "Attachable": false, "Containers": { "1c99e6519e0b8e536d2bf211efddfc5db0f088b556af2a37b8a75bfde276a26e": { "Name": "SERVICE_LAGER_lager-server.ssl36jlfbv3ghfy78tml2cohu.293uxoxt0azts51jj74l91jy4", "EndpointID": "cf854ada7777e68b3ae6a61f349cdb34400265f0ec9b1432d3b10ff4d4e0f5c5", "MacAddress": "02:42:0a:00:01:17", "IPv4Address": "10.0.1.23/24", "IPv6Address": "" }, "ac40f8922601ef04bc173abcde918bea964b47c725d525681696b3aabcc26bba": { "Name": "SERVICE_LAGER_angebot-db.ssl36jlfbv3ghfy78tml2cohu.tbssz0dfqalrtedq7tou6kzx9", "EndpointID": "69d48195d4adf0613a05e5172037e7c20928b3b24969be131e6ea017100583e0", "MacAddress": "02:42:0a:00:01:03", "IPv4Address": "10.0.1.3/24", "IPv6Address": "" }, "e55c75e75518be6bddc074ae52b8896971bbd557a214185a159a4d91b983f555": { "Name": "SERVICE_LAGER_lager-db.ssl36jlfbv3ghfy78tml2cohu.gjbtm2q202j7gy31g1mdjm6a7", "EndpointID": "6249ea4fc5031f2fcd09249fc2b8f0951bdbf02997aee043fca1b68febfb5d3f", "MacAddress": "02:42:0a:00:01:10", "IPv4Address": "10.0.1.16/24", "IPv6Address": "" } }, "Options": { "com.docker.network.driver.overlay.vxlanid_list": "4098" }, "Labels": {}, "Peers": [ { "Name": "Lenhart00-bf0334fafc55", "IP": "139.6.102.107" } ] } ]

Can't disconnect Endpoints because these Containers are not existent anymore.

michaellenhart on 24 Mar 2017

@michaellenhart use docker network disconnect -f [network] [container name|id] to disconnect a not exist container from network.

BSWANG on 27 Mar 2017

👎8 👍3

@BSWANG As already stated the --force flag doesn't work anymore. In my case it is version 1.12, but has worked with 1.10 and (but I'm not sure) 1.11.

Update:

To remove the network in the case that a docker disconnect rm -f <network> <container> results in an error you have to pick the network ID (e.g. zp1hd8cmb9h5i1fkiylvsifag for the mentioned _gefahr_ network. Then go into consul K/V_Store (assuming you use consul and not etcd), navigate to kv/docker/network/v1.0/network/ and kv/docker/network/v1.0/overlay/ and remove the entry with the found ID from both directories.
After this, the network should not be listed anymore.
I've not observed any side effects, but can't ensure this.

hcguersoy on 4 Apr 2017

👍1

@hcguersoy I did the same earlier to recover the broken network and it did work but this is not a production grade method.

balintbako on 4 Apr 2017

I had exactly the same problem. All I did was restart the docker service and then was able to remove all stuck containers.

avijendr on 5 Apr 2017

Restart docker daemon seemed to work for me as well and Im on 17.04.0-ce-rc1

alfredcs on 27 Apr 2017

👍7 🎉1

same issue

ohcrider on 3 May 2017

Hello there, I'm srry for responding so late.
Well, I have changed the docker Version to

Client:
Version: 17.03.1-ce
API version: 1.27
Go version: go1.7.5
Git commit: c6d412e
Built: Mon Mar 27 17:14:09 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.1-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: c6d412e
Built: Mon Mar 27 17:14:09 2017
OS/Arch: linux/amd64
Experimental: false

Sometimes I get the same error message "Error response from daemon: network lager has active endpoints" by removing an network using command docker network rm 8bde, but I can disconnect active endpoints even if the containers are not "existent" anymore.
I use the command

docker network disconnect -f <network> <container name>

Example:
docker network inspect lager

Shows all active endpoints even if the containers are not available anymore, I think it's called zombie containers ;-)

[
    {
        "Name": "lager",
        "Id": "8bde2c60085a5bae1989f36dd55ea89767a60690da537ad23b0570adc19ebdfb",
        "Created": "2017-04-04T06:49:13.531850943Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.19.0.0/16",
                    "Gateway": "172.19.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Containers": {
            "252cd073f4f84a4e165a662193a62c44a4fa4ba38c36dc5ee640fc5cb94fa728": {
                "Name": "deploy_lager-client_1",
                "EndpointID": "6c2acca22cfc73f3de47a0711ee6000b7909b0adaf1d427bb91beb0592b7dc66",
                "MacAddress": "02:42:ac:13:00:07",
                "IPv4Address": "172.19.0.7/16",
                "IPv6Address": ""
            },

If I want to disconnect the Service deploy_lager-client_1
I use docker network disconnect -f 8bde deploy_lager-client_1

Or you can use docker network disconnect -f lager deploy_lager-client_1

You can not use the container id, you must instead use the Container Name.
This docker network disconnect -f lager 252cd is NOT working.

If you have removed all active endpoints you can delete the network.

michaellenhart on 8 May 2017

👍1

Hi, I have the same issue with these versions:

# docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:23:11 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:23:11 2016
 OS/Arch:      linux/amd64

and, in a second node:
calico/node:v1.1.3
etcdctl version: 3.1.7

Steps to reproduce:

docker network create --driver calico --ipam-driver calico-ipam myNet
docker run -d -t --name myContainer --net myNet --restart=always ubuntu bash
systemctl stop docker
systemctl start docker
docker rm -f myContainer
docker network rm myNet
Error response from daemon: network myNet has active endpoints
docker network disconnect -f myNet myContainer
Error response from daemon: unable to force delete endpoint myContainer: invalid endpoint locator identifier

The main problem, as stated in a previous comment, is that docker doesn't remove the etcd entries, so my workaround is to delete them by hand, removing the stopped containers and finally, the created network:

NETWORK_TO_DELETE="myNet"
NET_ID=$(docker network inspect --format '{{.ID}}' ${NETWORK_TO_DELETE})
ENDPOINTS_TO_DELETE=$(etcdctl ls /docker/network/v1.0/endpoint/${NET_ID} | sed 's/.*\///g')
for ep in ${ENDPOINTS_TO_DELETE}; do docker rm $ep && etcdctl rm /docker/network/v1.0/endpoint/${NET_ID}/$ep ;done; && etcdctl rm /docker/network/v1.0/endpoint_count/${NET_ID} && etcdctl rm /docker/network/v1.0/network/${NET_ID}

I hope it helps someone.. :)

stg- on 22 May 2017

👍2

Just want to emphasize that the only working solution for etcd in this thread is the one given by @stg- above.

To explain it a bit more in depth, there are two steps required:

Remove all zombie endpoints from the network directory
Delete or set the endpoint count to zero

apeschel on 20 Jun 2017

I've had the same issue. We're using Consul (a very old version, 0.5.2 so we couldn't just use the CLI) and I got the problem fixed via the http api like this:

network_id = $(docker network inspect [NETWORK_NAME] --format '{{ .Id }}') \
endpoint_id = sudo docker network inspect [NETWORK_NAME] --format '{{json .Containers }}' \
  | jq 'to_entries|map(.value | select(.Name == "[ENDPOINT_NAME]") | .EndpointID)|@sh'
curl --request DELETE \
http://localhost:8500/v1/kv/docker/network/v1.0/endpoint/$network_id/$endpoint_id/

(Replace [NETWORK_NAME] and [ENDPOINT_NAME] with your values or simply set network_id and endpoint_id through copy & paste from docker network inspect ...)

kaikuchn on 31 Jul 2017

Got the same error using docker-compose: network my_container has active endpoints. After that there is no way to stop/kill the active container with docker stop even forcing the command, thus resulting in a active but detached container process.

loretoparisi on 12 Oct 2017

schmunk42 on 17 Oct 2017

So I was not seeing this issue until I recently updated to 17.12.0-ce.

We are deploying stacks onto docker swarms, running series of tests against containers in the stack, removing the stack and then immediately deploying a 2nd stack and running set of tests. Then removing and then deploying a 3rd stack. Somewhere (and it is random) between the remove and deploy of stacks we see this issue.

The process is randomly (but frequently) failing to remove the stack (5 out of 6 containers are removed from the stack). Which container fails to be removed is random.

When I try to manually remove the last container, I get the:

Failed to remove network vjbo7hqulyrf1p0uk0ka2nstk: Error response from daemon: network test_default id vjbo7hqulyrf1p0uk0ka2nstk has active endpoints
Failed to remove some resources from stack: test

I then try to remove the test_default network manually - with same message regarding "has active endpoints".

I tried restarting the docker daemon - which hung. I was forced to reboot the system.

This is definitely an issue for us - it is regularly breaking our CI/CD process.

RobWillis on 22 Feb 2018

👍1

Result of docker version:

Client:
 Version:   18.03.0-ce
 API version:   1.30 (downgraded from 1.37)
 Go version:    go1.9.4
 Git commit:    0520e24
 Built: Wed Mar 21 23:06:22 2018
 OS/Arch:   darwin/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:  ucp/2.2.5
  API version:  1.30 (minimum version 1.20)
  Go version:   go1.8.3
  Git commit:   42d28d140
  Built:    Wed Jan 17 04:44:14 UTC 2018
  OS/Arch:  linux/amd64
  Experimental: false

We also experienced this problem. When we tried to remove our network with docker network rm <network_id> we got this:

[root@server centos]# docker network rm y9ru2bnofd7y
Error response from daemon: network myapp-prod_myapp-prod id y9ru2bnofd7ytdr8kjjm7a01v has active endpoints

When we inspect with docker network inspect <network_id> we get:

[root@server centos]# docker network inspect y9ru2bnofd7y
[
    {
        "Name": "myapp-prod_myapp-prod",
        "Id": "y9ru2bnofd7ytdr8kjjm7a01v",
        "Created": "2018-04-04T05:27:15.897789997Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.12.0/24",
                    "Gateway": "10.0.12.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "f8e6a5676b73dc649f323deb72b0a2691f05e4f5f146815b5eb92bd099fdb90e": {
                "Name": "myapp-prod_app-extranet.1.r4elcwsxnlpu4kscrbn0h3zxw",
                "EndpointID": "f1d31c6c85a23cb6b1d1f59b4890f3bdc5ebdffe114cfa0593828c3ec3dc296e",
                "MacAddress": "02:42:0a:00:0c:90",
                "IPv4Address": "10.0.12.144/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4142"
        },
        "Labels": {
            "com.docker.stack.namespace": "myapp-prod",
            "com.docker.ucp.access.label": "/Shared/Private/deploy",
            "com.docker.ucp.collection": "21694651-5a5d-4f93-8d39-3c74807ea70d",
            "com.docker.ucp.collection.21694651-5a5d-4f93-8d39-3c74807ea70d": "true",
            "com.docker.ucp.collection.private": "true",
            "com.docker.ucp.collection.root": "true",
            "com.docker.ucp.collection.shared": "true",
            "com.docker.ucp.collection.swarm": "true"
        },
        "Peers": [
            {
                "Name": "ip-xxx-xxx-xxx-xxx.ec2.internal-id",
                "IP": "xxx.xxx.xxx.xxx"
            }
        ]
    }
]

So, according to the network there is still a container out there. First, we look for the container with docker ps -a | grep myapp and it does not exist:

[root@server centos]# docker ps -a | grep myapp
319d03200304        dtr.myregistry.io/dev/logistics:latest                  "./run-gunicorn.sh"      2 hours ago         Up 2 hours              8000/tcp                  myapp-dev_app-logistics.1.6zqzfk011b41gai80urbaao4d
ad6565232e69        dtr.myregistry.io/dev/proxy:latest                      "/bin/sh -c 'servi..."   6 hours ago         Up 6 hours              80/tcp                    myapp-dev_proxy.1.l598rw0nw4gg7b5qz0nb9j0vg
a7ee62618f24        dtr.myregistry.io/dev/redis:4.0.2                       "docker-entrypoint..."   9 hours ago         Up 9 hours              6379/tcp                  myapp-dev_app-logistics-redis.1.vqo8czmx2kbrhdeejh5153c5a
b0601d50ae79        dtr.myregistry.io/dev/user-api:dev-7ae88c6              "dotnet user-a..."      14 hours ago         Up 14 hours             8000/tcp                  user-api-dev_user-api.1.0ymlai010ttj6xtto2xsyqoxu
267de3e27f65        dtr.myregistry.io/dev/user-api:prod-11e54b9             "dotnet user-a..."      21 hours ago         Up 21 hours             8000/tcp                  user-api-prod_user-api.1.cmjrap524aojku3240bb5jwjm
d30389e81a7c        dtr.myregistry.io/dev/svc-access:prod-b38c752           "docker-php-entryp..."  21 hours ago         Up 21 hours             80/tcp                    svc-access-prod_svc-dc-access.1.mtw174i8n2ofcuk2racafvsmk
db91df87f58a        dtr.myregistry.io/dev/account-management:dev-917752f    "dotnet myacco..."      21 hours ago         Up 21 hours             8000/tcp                  app-account-management-dev_app-account-management.1.bu2qdvdfgg95q883ncsnin5sn

So we attempt to stop/remove this container we get:

Error response from daemon: No such container: f8e6a5676b73

We were stuck with this network we couldn't remove. This broke our CI/CD process. Deploying the stack failed because it could not create the myapp-prod_myapp-prod as defined in the stack file since one already existed.

We were able to eventually remove the network after disconnecting the zombie container with docker network disconnect --force myapp-prod_myapp-prod myapp-prod_app-extranet.1.r4elcwsxnlpu4kscrbn0h3zxw. Thanks @thaJeztah for that tip! After that the network easily removed with docker network rm <id>.

While we do have a manual workaround, this is still an irritating issue for us.

jdkelley on 5 Apr 2018

👍4

Ran into the same problem with stale networks.

docker network prune saved the day!! Works like a charm!

Sweeps networks that are unused -- stale containers leaves unused networks, thus prune worked

vincentr-dev on 12 Jul 2018

Here, docker network prune did not remove the network because it still thought there were containers, due to the remaining endpoints, however there weren't, and these endpioints are completely invalid stale hanging markers.

Like I said here -> https://github.com/moby/moby/issues/17217 it still happens on Docker version 18.06.1-ce, build e68fc7a, I think one of these tickets might be a duplicate of the other (or was that already decided?)

It's as above - needs a network disconnect with force, with the name from the docker inspect call of the network - IDs won't be found.

danwdart on 4 Sep 2018

👍5

When no command works then do
sudo service docker restart
your problem will be solved

NagatoPeinI1 on 15 Nov 2018

👎6 😄5 🎉4

I am using Docker desktop for mac Version 2.0.0.0-mac81 (29211) channel stable, docker engine 18.09.0.
I am going through the same issues. I confirm it's impossible to disconnect/remove the broken network.
A docker system prune doesn't remove the broken network.
The only solution seems to restart the docker daemon, which is quite annoying since it requires manual intervention.

mtazzari on 4 Dec 2018

If I wanted to fix issues by restarting things, I'd have stayed on Microsoft Windows.

shukriadams on 25 Jan 2019

😄38 🚀2 👍2 😕1

just for the log: same issue here, pretty annoying. usually (?) needs some unwanted event, like VM freeze/crash.
strange enough, the bad status can even survive a VM restart, still needs a service restart for cleanup.

heidricha on 29 Oct 2019

Restart docker daemon seemed to work for me as well and Im on 17.04.0-ce-rc1

same here. restarting the docker service also removed the unremovable network endpoint, and after that I could also remove the network.

Peneheals on 6 Sep 2020

In my case, docker rm -f will stop the 'unstoppable' containers, but leaves them listed as connected to the network.

docker network disconnect [-f] <net> <container> fails with 'no such container' or 'endpoint not found' (according to -f or not), but even after that they're still listed as connected.

This seems only to happen with containers that have healthchecks, and prior to rm -f (after an attempted docker-compose down) they are '(unhealthy)'.