Rancher Version:
1.2.0
Docker Version:
1.12.3
OS and where are the hosts located? (cloud, bare metal, etc):
Vultr VPS, rancher hosted on Joyent
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single external DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle
Steps to Reproduce:
Unsure; however, creating a stack or upgrading (starting) containers somehow causes the upgrade to wait forever, or until a restart
Results:
Service/Stack/Container shows as Waiting: allocated [container:1i4142] (with an appropriate ID); a restart unlocks the issue temporarily, but it soon starts again.
Logs show constant errors:
2016-12-22 16:06:45,009 ERROR [e1b43cf0-8ac5-4d0b-b7d3-5e9f8d14d2a0:168835] [volumeStoragePoolMap:2224] [volumestoragepoolmap.remove] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=673]: Error response from daemon: Unable to remove filesystem for c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c: remove /var/lib/docker/containers/c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c/shm: device or resource busy
2016-12-22 16:06:45,015 ERROR [:] [] [] [] [ecutorService-5] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=673]: Error response from daemon: Unable to remove filesystem for c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c: remove /var/lib/docker/containers/c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c/shm: device or resource busy
2016-12-22 16:06:45,138 ERROR [8fdbd6b5-875b-46a2-81cc-adaffb4d01f5:169888] [volumeStoragePoolMap:2257] [volumestoragepoolmap.remove] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=668]: Error response from daemon: Unable to remove filesystem for c0e4a882533de17793caa222d510460dce87a5851dd9ec28f2835b4b543b816c: remove /var/lib/docker/containers/c0e4a882533de17793caa222d510460dce87a5851dd9ec28f2835b4b543b816c/shm: device or resource busy
Expected:
Handling containers would not time out or get locked
I manually deleted the hosts containing the containers listed above; however, my Rancher implementation is still in an invalid state:
INFO[0000] Setting log level logLevel=info
INFO[0000] Starting go-machine-service... gitcommit=v0.34.1
INFO[0000] Waiting for handler registration (1/2)
2016-12-23 04:56:50,576 ERROR [e904d620-1aca-4a89-9903-07c2516d981d:272072] [instance:7871] [instance.start->(InstanceStart)] [] [ecutorService-8] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [7871]
2016-12-23 04:56:52,624 ERROR [11c37ae3-8c95-46dc-943f-d422299c9564:272068] [instance:7869] [instance.start->(InstanceStart)] [] [ecutorService-2] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [7869]
FATA[0017] Exiting go-machine-service: Timed out waiting for transtion.
2016/12/23 04:56:53 http: proxy error: net/http: request canceled
INFO[0000] Setting log level logLevel=info
INFO[0000] Starting go-machine-service... gitcommit=v0.34.1
INFO[0000] Waiting for handler registration (1/2)
FATA[0019] Exiting go-machine-service: Timed out waiting for transtion.
2016/12/23 04:57:13 http: proxy error: net/http: request canceled
INFO[0000] Setting log level logLevel=info
INFO[0000] Starting go-machine-service... gitcommit=v0.34.1
INFO[0000] Waiting for handler registration (1/2)
These logs just loop
I am facing the same problem
This has become a serious issue for me. I regularly get into a state where Rancher is unable to complete upgrades and ends up in an invalid state. I believe the issue is that a certain number of filesystem resources error out, locking up certain threads in the application, blocking other work from being done. I had to tear down my entire setup causing downtime to recover from this.
If you're actually on 1.2.0, the first step would be to update to 1.2.2. A huge amount of stuff changed between 1.1.x and 1.2, and there are several important fixes in the patches since.
Unable to remove filesystem for ...: remove /var/lib/docker/containers/.../shm: device or resource busy is a Docker bug that there's little we can do about directly, but one of the changes in 1.2.1 or 1.2.2 was to make it less likely for Rancher to trigger it.
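As a hedged diagnostic sketch (not from this thread): the "device or resource busy" error on a container's /shm usually means some other process's mount namespace still pins /var/lib/docker/containers/&lt;id&gt;/shm. The helper name and the overridable proc root below are illustrative assumptions, used only so the parsing can be exercised outside /proc:

```shell
# find_shm_holders CID [PROC_ROOT]
# Prints the PIDs whose mountinfo still references the given container's
# shm mount. PROC_ROOT defaults to /proc; it is a parameter only so the
# parsing logic can be tested against a fixture tree.
find_shm_holders() {
    cid="$1"
    proc_root="${2:-/proc}"
    grep -sl "docker/containers/${cid}/shm" "${proc_root}"/[0-9]*/mountinfo \
        | sed 's|.*/\([0-9][0-9]*\)/mountinfo$|\1|'
}

# On an affected host (container ID taken from the error message above):
# find_shm_holders c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c
```

Restarting whichever process still holds the mount (often another container that inherited it) may let the removal proceed without a full host reboot.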
I had the same issue two days ago; the only solution I found was to remove my Rancher environment with its hosts and recreate it from scratch. But unfortunately, today the same issue happened again on this env. I have a second environment whose host is the one that runs Rancher itself, and on that one there is no problem. I tried to catch some logs, but the only info I have is the message "Activating (Waiting: allocated [container:........."
I'm running Rancher 1.2.2 on Ubuntu Trusty and Debian Jessie with Docker 1.11.1
I tried restarting hosts to resolve the Docker bug but it doesn't help. The action I've taken now is to discard hosts where this happens, create new ones, and limp by.
Yes, it also works if you remove the host from the UI, then remove the agent container (I'm using a custom host), and then add a new host. But it's really annoying.
Thanks for the tip, I'll give that a go if I can prevent the VM deletion
Just installed v1.3.0, still the same problem ;)
EDIT: My bad, it looks like it works; it just failed to start a previously created container that was stuck waiting. Will see in a few days.
Just ran into this as well:
Rancher Version:
1.2.2
Docker Version:
1.11.2
OS and where are the hosts located? (cloud, bare metal, etc):
Amazon AWS with Amazon Linux 2016.09
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single external DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle
After trying to add in new hosts to fix the issue, things got even worse:
Steps:
I'm not quite sure what exactly happened, but all the container stacks started giving Waiting: allocated messages and went into Unhealthy or Updating-Active.

Trying to stop/start a service puts it into Deactivating, and it eventually stops or just hangs.

The new hosts that were added to the cluster either failed to register or when it did register it didn't bring up any containers.
Failed host registration:
INFO: Running Agent Registration Process, CATTLE_URL=https://rancher.demandbase.com/v1
INFO: Attempting to connect to: https://rancher.demandbase.com/v1
INFO: https://rancher.demandbase.com/v1 is accessible
INFO: Inspecting host capabilities
INFO: Boot2Docker: false
INFO: Host writable: true
INFO: Token: xxxxxxxx
INFO: Running registration
Traceback (most recent call last):
  File "./register.py", line 31, in <module>
    r = client.wait_success(r, timeout=300)
  File "/usr/local/lib/python2.7/dist-packages/cattle.py", line 15, in wait_success
    obj = self.wait_transitioning(obj, timeout)
  File "/usr/local/lib/python2.7/dist-packages/cattle.py", line 33, in wait_transitioning
    raise Exception(msg)
Exception: Timeout waiting for [register:1r4989] to be done after 301.394496918 seconds
An identical host brought up at the same time registered successfully, but then didn't run any stacks/containers, including system containers:

Removing and restarting the Rancher master container didn't change anything, and I'm hesitant to completely remove the environment and start from scratch. The other two Cattle environments we have are not showing any errors, although I haven't deployed any new containers or hosts to them and am afraid the same thing may happen.
This is the first time I've seen a Rancher cluster completely go haywire like this and am wondering if it's related to the 1.2 upgrade.
I can provide the cattle-debug and cattle-error logs from the master as well if needed.
Woke up this morning and the cluster was in the same state. Tried to deploy to another environment that hasn't been modified for a few weeks and it is also showing the same behavior.

Just wanted to comment and say deploys of new containers, even load balancer containers, are taking upwards of 10 minutes, sitting in the "Waiting: allocated" state for the majority of the time. Running 1.3.0. This happens for fresh container deploys and for upgrades. Even taking a service down is taking longer, erroring, or otherwise just not completing.
The robustness of Rancher/Cattle has seemed to decrease since 1.1.4, especially for basic actions like handling containers. Restarting the server process/container seems to help albeit for a small amount of time until it gets unhappy again :(
Just experienced this... the deployment was behind an ALB. The problem turned out to be the way rancher server starts its internal LB: adding --advertise-address starts the rancher server in HA mode, which uses traefik, while a non-HA rancher server uses websocket-proxy (I'd imagine that would work well with an ELB + proxy protocol). Starting the HA server behind an ALB made everything work as expected here. @ibuildthecloud sound about right?
For anyone still having issues with 1.3.0, downgrading to 1.2.2 does seem to work fine. All my stacks/services are still there and happy, and new containers (and upgraded ones) actually get deployed properly.
Same problem here. Stack service is stuck in the middle of deployment (upgrading) with " Waiting: allocated [container:1i23696]"
Rancher 1.2.0, Docker 1.12.1 on Ubuntu 16.04 Xenial.
Will try to upgrade to 1.2.2 and see if this fixes it.
@aemneina's suggestion fixed it for me:
rancher/server --advertise-address {RancherServerIP}
Issues:
1) Slow to start a Docker image
2) Image would get stuck at Waiting/Activating - even though the container had completed with an IP address, it was still stuck in Activating. Ports didn't get mapped since it was still waiting for the image.
Ubuntu 16.04 Xenial, Docker 1.12.x - 1.13.0 also running 16.04 Xenial, Rancher 1.3.0-1.3.3
Also note that I'm not behind any VPC or ALB. All docker hosts are public facing. I'm not running in HA at all. Just using the --advertise-address solves it.
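For reference, a hedged sketch of how the workaround mentioned in the comments above is typically applied when starting the server; the image tag, port mapping, and IP are placeholders based on standard Rancher 1.x setups, not taken from this thread:

```shell
# Sketch only: start rancher/server with --advertise-address, as the
# comments above suggest. 203.0.113.10 is a placeholder for the address
# your hosts can actually reach; ports are the usual 1.x defaults.
docker run -d --restart=unless-stopped -p 8080:8080 \
  rancher/server \
  --advertise-address 203.0.113.10
```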
Having the same issue as the others, but sadly @noogen's fix didn't work for me.
The only difference I have is that I have one environment with hosts which are on our local network, and that environment does not have any problems (these are dedicated machines, not VPS instances).
Dedicated rancher server with an external database, and the hosts are directly connected to the public internet.
| Useful | Info |
| :-- | :-- |
|Versions|Rancher v1.3.4 Cattle: v0.175.10 UI: v1.3.6 |
|Access|github admin|
|Orchestration|Cattle|
|Route|container.ports|
For me upgrading to 1.4.0 and using --advertise-address helped. I don't know for how long, because 1.3.4 stopped working properly, despite using --advertise-address.
Does anyone have another potential solution to this issue? Because the only (very unpleasant) solution I have at the moment is restarting the hosts every few days...
@joostliketoast We upgraded Rancher from 1.1.x to 1.3.3 and also stopped using environments with hosts from different networks. According to https://github.com/rancher/lb-controller/pull/56 the healthcheck containers of every host need to be able to communicate with another host on the default loadbalancer port, which is tcp/42 (or tcp/41, check the haproxy config file inside a LB container). I have nowhere found this in the documentation and was kind of surprising news. We have since created new environments for each host subnet (so there is no firewall between the hosts). Since then I haven't seen any problems with the LB service anymore.
@Napsty strange, the hosts have complete access between each other (the ones that are on the same Rancher environment).
I would suspect that they wouldn't work at all if they had problems with that port, instead of giving Waiting: allocated after a few days when I try to create or upgrade a container...
But I'll try and open that port for the rancher server to see if it has any effect.
Edit: no effect, and rebooting the Docker server doesn't help either.
So far not a fan of Rancher's new networking code; the stability of the software seems to have gone downhill...
Tail of the rancher server cattle-error.log:
2017-02-20 12:35:19,770 WARN [:] [] [] [] [cutorService-10] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1240]
2017-02-20 12:35:19,773 WARN [:] [] [] [] [ecutorService-8] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1241]
2017-02-20 12:35:19,777 WARN [:] [] [] [] [ecutorService-7] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1242]
2017-02-20 12:35:19,778 WARN [:] [] [] [] [cutorService-13] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1243]
2017-02-20 12:35:19,780 WARN [:] [] [] [] [ecutorService-6] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1244]
2017-02-20 12:35:19,785 WARN [:] [] [] [] [ecutorService-2] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1245]
2017-02-20 12:35:19,787 WARN [:] [] [] [] [cutorService-12] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1246]
2017-02-20 12:35:19,787 WARN [:] [] [] [] [ecutorService-1] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1247]
2017-02-20 12:35:19,789 WARN [:] [] [] [] [ecutorService-3] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1248]
2017-02-20 12:35:19,791 WARN [:] [] [] [] [ecutorService-9] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1249]
2017-02-20 12:35:19,794 WARN [:] [] [] [] [cutorService-14] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1250]
2017-02-20 12:35:19,798 WARN [:] [] [] [] [cutorService-11] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1251]
2017-02-20 12:35:19,799 WARN [:] [] [] [] [ecutorService-4] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1252]
2017-02-20 12:35:19,808 WARN [:] [] [] [] [cutorService-10] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1253]
2017-02-20 12:35:19,815 WARN [:] [] [] [] [ecutorService-6] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1254]
2017-02-20 12:35:31,913 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [3]
2017-02-20 12:35:36,915 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [4]
2017-02-20 12:35:41,917 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [5]
2017-02-20 12:35:46,918 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [6]
2017-02-20 12:35:46,920 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [34200]
2017-02-20 12:35:51,937 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [7]
2017-02-20 12:35:56,942 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [8]
2017-02-20 12:36:01,945 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [9]
2017-02-20 12:36:32,549 ERROR [:] [] [] [] [cutorService-31] [o.a.c.m.context.NoExceptionRunnable ] Expected state running but got removed
Tail of cattle-debug.log:
2017-02-20 12:44:17,125 INFO [:] [] [] [] [ecutorService-6] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volume.activate:8624441] on [304248] : Timeout
2017-02-20 12:44:31,678 INFO [<REMOVED UID>:8802860] [nic:297994] [nic.purge->(NicPurge)] [] [cutorService-22] [i.c.p.r.p.i.ResourcePoolManagerImpl ] Releasing [02:53:27:a3:97:b9] id [39460] to pool [network:54] from owner [nic:297994]
2017-02-20 12:44:31,678 INFO [<REMOVED UID>:8802859] [nic:297982] [nic.purge->(NicPurge)] [] [cutorService-44] [i.c.p.r.p.i.ResourcePoolManagerImpl ] Releasing [02:f2:7a:33:ba:13] id [39438] to pool [network:53] from owner [nic:297982]
2017-02-20 12:44:32,090 INFO [:] [] [] [] [ecutorService-1] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366028] on [27786] : Timeout
2017-02-20 12:44:32,092 INFO [:] [] [] [] [cutorService-20] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356084] on [27716] : Timeout
2017-02-20 12:44:32,092 INFO [:] [] [] [] [cutorService-21] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366026] on [27784] : Timeout
2017-02-20 12:44:32,092 INFO [:] [] [] [] [cutorService-41] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8365452] on [14528] : Timeout
2017-02-20 12:44:32,093 INFO [:] [] [] [] [cutorService-29] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355981] on [27722] : Timeout
2017-02-20 12:45:17,103 INFO [:] [] [] [] [cutorService-47] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volume.activate:8624441] on [304248] : Timeout
2017-02-20 12:45:17,150 INFO [:] [] [] [] [cutorService-23] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356083] on [27717] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-25] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355982] on [27723] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-15] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355978] on [27674] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-46] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8357592] on [27726] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-17] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355980] on [27675] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-35] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356436] on [27724] : Timeout
2017-02-20 12:45:17,154 INFO [:] [] [] [] [cutorService-45] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8367945] on [27708] : Timeout
2017-02-20 12:45:32,083 INFO [:] [] [] [] [cutorService-34] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8365452] on [14528] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-42] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356084] on [27716] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-16] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355981] on [27722] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-40] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366028] on [27786] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-13] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366026] on [27784] : Timeout
2017-02-20 12:46:17,090 INFO [:] [] [] [] [cutorService-19] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356436] on [27724] : Timeout
2017-02-20 12:46:17,091 INFO [:] [] [] [] [cutorService-50] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volume.activate:8624441] on [304248] : Timeout
2017-02-20 12:46:17,093 INFO [:] [] [] [] [cutorService-48] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356083] on [27717] : Timeout
2017-02-20 12:46:17,093 INFO [:] [] [] [] [cutorService-38] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355978] on [27674] : Timeout
2017-02-20 12:46:17,093 INFO [:] [] [] [] [ecutorService-5] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8367945] on [27708] : Timeout
2017-02-20 12:46:17,094 INFO [:] [] [] [] [cutorService-36] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355980] on [27675] : Timeout
2017-02-20 12:46:17,130 INFO [:] [] [] [] [cutorService-39] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8357592] on [27726] : Timeout
2017-02-20 12:46:17,130 INFO [:] [] [] [] [cutorService-43] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355982] on [27723] : Timeout
2017-02-20 12:46:32,091 INFO [:] [] [] [] [ecutorService-3] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366026] on [27784] : Timeout
2017-02-20 12:46:32,093 INFO [:] [] [] [] [cutorService-30] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355981] on [27722] : Timeout
2017-02-20 12:46:32,093 INFO [:] [] [] [] [ecutorService-7] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356084] on [27716] : Timeout
2017-02-20 12:46:32,093 INFO [:] [] [] [] [cutorService-31] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366028] on [27786] : Timeout
2017-02-20 12:46:32,094 INFO [:] [] [] [] [cutorService-10] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8365452] on [14528] : Timeout
Same bug for me since December on all Rancher versions; no solution.
We ended up downgrading back to 1.2.0 and have been fine ever since.
The port 42 check occurs over the overlay network and does not require that port be open on the host.
There is a conflation of different issues here:
1) @InAnimaTe 1.2.0 is not the answer :). The fact that it 'works' in your case would lead me to believe you're running a RHEL/CentOS-based OS for your agents? Typically SELinux blocks containers from sharing the network, so ipsec breaks with a restrictive policy in place. An updated kernel for RHEL/CentOS also helps.
2) @joostliketoast can you run the cleanup utility (docker run rancher/cleanup-1-1:v0.1.2, which I believe is the latest) against your db? Also, running a modern version of Rancher helps here; 1.3.4 or 1.4.1 would be ideal. Make sure you upgrade infrastructure services one at a time (network services, ipsec, healthcheck, scheduler... in that order).
@aemneina I didn't say it was the answer and I don't advise people use an old version of software. Was merely stating that in some magical realm, rolling back to 1.2.0 fixed the problems we were having entirely. (I've since edited my comment not to sound like I'm providing a solution; sorry about that ;)
We're running latest RancherOS btw.
@InAnimaTe is there documentation on using the rancher/cleanup utility and when it should be used? We haven't done anything to our external DB since we did the v1.1 -> v1.2 migration.
I'd like to upgrade to a newer version of Rancher, but we seem somewhat stable with v1.2.2 (may go to v1.2.3) for now. Looking at the compatibility matrix we should be okay using v1.2 with Docker v1.12.6 until December?
Since we were hitting this bug a lot before your help with adding --advertise-address and moving to Ubuntu, if we run into it again I'll add as much detail as I can to this issue.
So the cleanup stuff is a bit new to me and I haven't used it. Regarding docs on that, I'm not seeing much from a quick search (other than the hub image) so Rancher staff/devs would be better poised to answer.
Okay, after running stable again for a week, this combination of changes seems to have worked for me:
--advertise-address as mentioned before
We also met this issue frequently after we upgraded to 1.3.0.
Many containers are stuck at Waiting: allocated, but they are actually already running.
And we got a lot of volumestoragepoolmap.remove processes with UNKNOWN EXCEPTION.
Our hosts were added to the environment at different times, so the Docker versions are different, including 1.11.2, 1.12.1, 1.12.3, 1.13.0 and 1.13.1.
Is this related to the Docker version, or do we just need to update the rancher server to 1.3.4 or 1.4.1?
| Useful | Info |
| :-- | :-- |
|Versions|Rancher v1.3.0 Cattle: v0.175.2 UI: v1.3.5 |
|Access|github admin|
|Orchestration|Cattle|
|Route|stacks.index|
Okay, I have to come back to my previous statement.
After running without a hitch for about 2 to 3 weeks, during which I haven't upgraded Rancher or Docker on the hosts, I got the Waiting: allocated again today while upgrading a container.
To the Rancher development team: is there anything you need to further debug this issue? Because this is proving very troublesome in our production environment, and the only solution to this problem I have found is to physically restart the hosts.
Had the same issue a few hours ago, with rancher v1.4.3, on one of our Cattle environments. The only solution was to reboot all servers on the environment. =/
Having the same issue. My environment is Rancher running on RancherOS, and reboot does not help at all. This is a total stopper for us :(
Yep, have the same. And very often containers just hang in the "starting" stage after upgrades.
Just experiencing this myself; I think I'm going to have to kill HA and go back to non-HA.
I had the same issue (Rancher 1.5.9, host on Debian 9), but things seem to be back to normal after a docker service restart on the host. No need to reboot it.
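The restart-based workaround reported in this and earlier comments amounts to bouncing the Docker daemon on the affected host rather than rebooting the whole machine (sketch only; assumes a systemd host, sysvinit hosts would use service docker restart instead):

```shell
# Bounce the Docker daemon instead of rebooting the whole host.
sudo systemctl restart docker
# Then confirm the Rancher agent container came back up:
sudo docker ps --filter name=rancher-agent
```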
Thanks manuito, good to know. Would it be possible for people who are affected to report whether they still have the issue in newer versions, or would that be too noisy/chatty? It would help people who are considering upgrades, and also help find a resolution, such as telling people to upgrade vs. leaving this issue open.
I'm currently running stable with the Rancher 1.3.3 server and am held up by two things: this issue, and seeing upgrade issues with the latest versions. This is also because both Docker and Rancher are moving at "fast and furious" speed, a good thing; and I don't want to be too far behind, resulting in an even worse upgrade experience.
I have the same issue; I cannot deploy a new stack on some hosts.
I just restart docker on the host and it's back to normal, but this is not a good enough fix for me.
Rancher 1.6.3
Ubuntu 16.04.2
Docker: 17.03.1-ce
I haven't seen this problem since I upgraded Docker to a supported version on all my hosts. You can check the compatibility in Rancher's documentation: http://rancher.com/docs/rancher/v1.6/en/hosts/
@richardlt what is your rancher + docker version ? I have the problem even if my docker version is a supported one (Rancher 1.5.9 / docker 17.03.0-ce - see http://rancher.com/docs/rancher/v1.5/en/hosts/)
@manuito I'm running Rancher 1.4.1, which is a little bit old, with Docker 1.12.3. What's interesting is that when I started answering in this issue I was on Rancher 1.2.x. Updating Rancher didn't solve the bug; I finally managed to shut down my servers, install a recommended Docker version, and it works. Sometimes with my current setup the "waiting for allocated container" message appears for a good reason, for example when a host is unreachable and then back online. I can fix that quickly by restarting the Rancher network manager container.
Having this issue with latest rancher stable. Version mix:
Might be unrelated, but all of my hosts disconnect over time too.
Having the same issue:
Rancher 1.2.1
Docker version 1.12.3
Restarting the docker service doesn't work at all.
We resolved the problem by upgrading rancher from 1.2.1 to 1.3.4 and the Docker version of the host from 1.12.3 to the latest version, 17.06-ce.
We hope this problem doesn't happen again.
Happy to report that I have been running the latest stable version of Rancher Server (1.6.5) and 17.06.0-ce hosts for a week without any issue, knock on wood.
All are running Ubuntu 16.04.02 LTS. Updated all my servers. Didn't really go through the upgrade route; I just created a new setup and moved some of the cattle over to the new ranch. I have that luxury because most of my cattle are just worker engines (prerender, image resize proxy, etc...).
I'm running v1.6.6 + Docker version 1.12.6, build a82d35e, and have this issue now...
Maybe it happens when I'm trying to connect to NFS... but I think there is a timeout problem...
I am running Docker 17.06-ce and Rancher v1.6.8 and I experienced the same issue, but I discovered the problem was because of my host machine's memory. Just like @noogen said: recreate it. So always download the docker-compose and rancher-compose files of your stack, so that moving to a new deployment won't be a challenge.
Subscribing!
This is a major hurt point for us. Upgrading to a later version of Rancher seems to help, but after an undetermined amount of time, some service that we try to upgrade will fail with this error again.
Removing the host machine that the service tries to start on, cleaning the Rancher state folder /var/lib/rancher/state, and re-registering it with Rancher also seems to work, but what a sledgehammer of a solution... :(
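As a hedged sketch, the host-level "sledgehammer" described above usually looks something like the following; the container names match the default Rancher 1.x agent naming, and the exact registration command must be copied from Add Host in the UI (it embeds an environment-specific token), so it is elided here:

```shell
# 1. Deactivate and delete the host in the Rancher UI first.
# 2. On the host, remove the agent containers and Rancher state:
sudo docker rm -f rancher-agent rancher-agent-state 2>/dev/null
sudo rm -rf /var/lib/rancher/state
# 3. Re-register using the command shown under Infrastructure > Hosts >
#    Add Host in the UI (elided; it is unique per environment).
```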
Echoing stratosgear and others here - this also happened to us. We've little idea what caused the issue, so I apologise that this is probably not helpful, but this is a big issue and we couldn't find any worthwhile logs explaining what was happening.
After a failed upgrade of our production API service, we first attempted to reboot the rancher server, and when this didn't resolve the issue we were left with little choice but to restart what we thought was the bad host where this non-starting container had been running. This was a terrible idea and it basically brought down our entire cluster as rancher attempted to heal our deployment, failing to spin up any more containers on all other hosts in the env and also failing to recognise when the original host had restarted.
The only way we managed to recover from this was to remove all nodes from the environment, remove /var/lib/rancher/state from each as suggested here, and then re-enroll each of the 3 hosts.
Only once the last host was removed did rancher start attempting to deploy containers again to the newly cleaned hosts. Once we got to this point rancher was able to heal our deployment in a couple of seconds (I imagine because most images were still on most servers).
We're on the latest stable rancher - v1.6.10 - running Ubuntu 14.04.5 LTS on 3 x dedicated servers for this env. Docker version 17.06.2-ce, build cec0b72.
I had the same issue and it looks like restarting "metadata" and "network-manager" services in "infrastructure" -> "network services" helped me.
P.S. Rancher 1.4.0
I also just went through the same issues, nearly all stacks were locked and did not come up again.
I tried many of the things mentioned here; downgrading was not an option for me, so finally I updated Rancher to the latest stable version, 1.6.16.
I removed all hosts from the Cattle environment and set up everything from scratch using the same database. After that, I still had the same issues. I tried to create a new environment, but I had the same issues there.
In the logs, I could find a database issue: "max_allowed_packet" was too small, as described in #9357, so I fixed that first.
After that, I came across #6859 and realized that all my unhealthy stacks were using convoy-nfs. I believe that this was the root cause of my issues. After migrating from convoy-nfs to rancher-nfs, all stacks came up again and are now back fully functional.
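For anyone hitting the same max_allowed_packet symptom referenced via #9357, it is a MySQL server setting; a hedged config sketch (the 64M value is illustrative, not taken from this thread - size it to your own payloads):

```ini
# my.cnf on the MySQL server backing Rancher (restart mysqld afterwards)
[mysqld]
max_allowed_packet = 64M
```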
With the release of Rancher 2.0, development on v1.6 is only limited to critical bug fixes and security patches.