Rancher Version:
1.2.0
Docker Version:
1.12.3
OS and where are the hosts located? (cloud, bare metal, etc):
Vultr VPS, rancher hosted on Joyent
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single external DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle
Steps to Reproduce:
Unsure; however, creating a stack or upgrading (starting) containers somehow causes the upgrade to wait forever, or until a restart
Results:
Service/Stack/Container shows as Waiting: allocated [container:1i4142] (with an appropriate ID); a restart unlocks the issue temporarily, but it soon starts again.
Logs show constant errors:
2016-12-22 16:06:45,009 ERROR [e1b43cf0-8ac5-4d0b-b7d3-5e9f8d14d2a0:168835] [volumeStoragePoolMap:2224] [volumestoragepoolmap.remove] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=673]: Error response from daemon: Unable to remove filesystem for c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c: remove /var/lib/docker/containers/c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c/shm: device or resource busy
2016-12-22 16:06:45,015 ERROR [:] [] [] [] [ecutorService-5] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=673]: Error response from daemon: Unable to remove filesystem for c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c: remove /var/lib/docker/containers/c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c/shm: device or resource busy
2016-12-22 16:06:45,138 ERROR [8fdbd6b5-875b-46a2-81cc-adaffb4d01f5:169888] [volumeStoragePoolMap:2257] [volumestoragepoolmap.remove] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=668]: Error response from daemon: Unable to remove filesystem for c0e4a882533de17793caa222d510460dce87a5851dd9ec28f2835b4b543b816c: remove /var/lib/docker/containers/c0e4a882533de17793caa222d510460dce87a5851dd9ec28f2835b4b543b816c/shm: device or resource busy
Expected:
Handling containers would not time out or get locked
I manually deleted the hosts containing the containers listed above; however, my Rancher implementation is still in an invalid state:
INFO[0000] Setting log level logLevel=info
INFO[0000] Starting go-machine-service... gitcommit=v0.34.1
INFO[0000] Waiting for handler registration (1/2)
2016-12-23 04:56:50,576 ERROR [e904d620-1aca-4a89-9903-07c2516d981d:272072] [instance:7871] [instance.start->(InstanceStart)] [] [ecutorService-8] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [7871]
2016-12-23 04:56:52,624 ERROR [11c37ae3-8c95-46dc-943f-d422299c9564:272068] [instance:7869] [instance.start->(InstanceStart)] [] [ecutorService-2] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [7869]
FATA[0017] Exiting go-machine-service: Timed out waiting for transtion.
2016/12/23 04:56:53 http: proxy error: net/http: request canceled
INFO[0000] Setting log level logLevel=info
INFO[0000] Starting go-machine-service... gitcommit=v0.34.1
INFO[0000] Waiting for handler registration (1/2)
FATA[0019] Exiting go-machine-service: Timed out waiting for transtion.
2016/12/23 04:57:13 http: proxy error: net/http: request canceled
INFO[0000] Setting log level logLevel=info
INFO[0000] Starting go-machine-service... gitcommit=v0.34.1
INFO[0000] Waiting for handler registration (1/2)
These logs just loop
I am facing the same problem
This has become a serious issue for me. I regularly get into a state where Rancher is unable to complete upgrades and ends up in an invalid state. I believe the issue is that a certain number of filesystem resources error out, locking up certain threads in the application, blocking other work from being done. I had to tear down my entire setup causing downtime to recover from this.
If you're actually on 1.2.0, the first step would be to update to 1.2.2. A huge amount of stuff changed between 1.1.x and 1.2, and there are several important fixes in the patches since.
Unable to remove filesystem for ...: remove /var/lib/docker/containers/.../shm: device or resource busy is a Docker bug that there's little we can do about directly, but one of the changes in 1.2.1 or 1.2.2 was to make it less likely for Rancher to trigger it.
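As a hedged diagnostic sketch (not from this thread): the "device or resource busy" error on a container's /shm usually means some other process's mount namespace still pins /var/lib/docker/containers/&lt;id&gt;/shm. The helper name and the overridable proc root below are illustrative assumptions, used only so the parsing can be exercised outside /proc:

```shell
# find_shm_holders CID [PROC_ROOT]
# Prints the PIDs whose mountinfo still references the given container's
# shm mount. PROC_ROOT defaults to /proc; it is a parameter only so the
# parsing logic can be tested against a fixture tree.
find_shm_holders() {
    cid="$1"
    proc_root="${2:-/proc}"
    grep -sl "docker/containers/${cid}/shm" "${proc_root}"/[0-9]*/mountinfo \
        | sed 's|.*/\([0-9][0-9]*\)/mountinfo$|\1|'
}

# On an affected host (container ID taken from the error message above):
# find_shm_holders c34a47c06dc8a438beb7f932a102c86a0b3fd2b46ebcad6974d55fb98f24834c
```

Restarting whichever process still holds the mount (often another container that inherited it) may let the removal proceed without a full host reboot.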
I had the same issue two days ago; the only solution I found was to remove my Rancher environment with its hosts and recreate it from scratch. But unfortunately, today the same issue happened again on this env. I have a second environment whose host is the one that runs Rancher itself, and on that one there is no problem. I tried to catch some logs, but the only info I have is the message "Activating (Waiting: allocated [container:........."
I'm running Rancher 1.2.2 on Ubuntu Trusty and Debian Jessie with Docker 1.11.1
I tried restarting hosts to resolve the Docker bug but it doesn't help. The action I've taken now is to discard hosts where this happens, create new ones, and limp by.
Yes, it also works if you remove the host from the UI, then remove the agent container (I'm using a custom host), and then add a new host. But it's really annoying.
Thanks for the tip, I'll give that a go if I can prevent the VM deletion
Just installed v1.3.0, still the same problem ;)
EDIT: My bad, it looks like it works; it just failed to start a previously created container that was stuck waiting. Will see in a few days.
Just ran into this as well:
Rancher Version:
1.2.2
Docker Version:
1.11.2
OS and where are the hosts located? (cloud, bare metal, etc):
Amazon AWS with Amazon Linux 2016.09
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single external DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle
After trying to add in new hosts to fix the issue, things got even worse:
Steps:
I'm not quite sure what exactly happened, but all the container stacks started giving Waiting: allocated messages and went into Unhealthy or Updating-Active.

Trying to stop/start a service puts it into Deactivating, and it eventually stops or just hangs.

The new hosts that were added to the cluster either failed to register or when it did register it didn't bring up any containers.
Failed host registration:
INFO: Running Agent Registration Process, CATTLE_URL=https://rancher.demandbase.com/v1
INFO: Attempting to connect to: https://rancher.demandbase.com/v1
INFO: https://rancher.demandbase.com/v1 is accessible
INFO: Inspecting host capabilities
INFO: Boot2Docker: false
INFO: Host writable: true
INFO: Token: xxxxxxxx
INFO: Running registration
Traceback (most recent call last):
  File "./register.py", line 31, in <module>
    r = client.wait_success(r, timeout=300)
  File "/usr/local/lib/python2.7/dist-packages/cattle.py", line 15, in wait_success
    obj = self.wait_transitioning(obj, timeout)
  File "/usr/local/lib/python2.7/dist-packages/cattle.py", line 33, in wait_transitioning
    raise Exception(msg)
Exception: Timeout waiting for [register:1r4989] to be done after 301.394496918 seconds
An identical host brought up at the same time registered successfully, but then didn't run any stacks/containers, including system containers:

Removing and restarting the Rancher master container didn't change anything, and I'm hesitant to completely remove the environment and start from scratch. The other two Cattle environments we have are not showing any errors, although I haven't deployed any new containers or hosts to them and am afraid the same thing may happen.
This is the first time I've seen a Rancher cluster completely go haywire like this and am wondering if it's related to the 1.2 upgrade.
I can provide the cattle-debug and cattle-error logs from the master as well if needed.
Woke up this morning and the cluster was in the same state. Tried to deploy to another environment that hasn't been modified for a few weeks and it is also showing the same behavior.

Just wanted to comment and say deploys of new containers, even load balancer containers, are taking upwards of 10 minutes, sitting in the "Waiting: allocated" state for the majority of the time. Running 1.3.0. This happens for fresh container deploys and for upgrades. Even taking a service down is taking longer, erroring, or otherwise just not completing.
The robustness of Rancher/Cattle has seemed to decrease since 1.1.4, especially for basic actions like handling containers. Restarting the server process/container seems to help albeit for a small amount of time until it gets unhappy again :(
Just experienced this... the deployment was behind an ALB. The problem turned out to be the way rancher server starts its internal LB: adding --advertise-address starts the rancher server in HA mode, which uses traefik, while a non-HA rancher server uses websocket-proxy (I'd imagine that would work well with an ELB + proxy protocol). Starting the HA server behind an ALB made everything work as expected here. @ibuildthecloud sound about right?
For anyone still having issues with 1.3.0, downgrading to 1.2.2 does seem to work fine. All my stacks/services are still there and happy, and new containers (and upgraded ones) actually get deployed properly.
Same problem here. Stack service is stuck in the middle of deployment (upgrading) with " Waiting: allocated [container:1i23696]"
Rancher 1.2.0, Docker 1.12.1 on Ubuntu 16.04 Xenial.
Will try to upgrade to 1.2.2 and see if this fixes it.
@aemneina's suggestion fixed it for me:
rancher/server --advertise-address {RancherServerIP}
Issues:
1) Slow to start a Docker image
2) Image would get stuck at Waiting/Activating - even though the container had completed with an IP address, it was still stuck in Activating. Ports didn't get mapped since it was still waiting for the image.
Ubuntu 16.04 Xenial, Docker 1.12.x - 1.13.0 also running 16.04 Xenial, Rancher 1.3.0-1.3.3
Also note that I'm not behind any VPC or ALB. All docker hosts are public facing. I'm not running in HA at all. Just using the --advertise-address solves it.
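For reference, a hedged sketch of how the workaround mentioned in the comments above is typically applied when starting the server; the image tag, port mapping, and IP are placeholders based on standard Rancher 1.x setups, not taken from this thread:

```shell
# Sketch only: start rancher/server with --advertise-address, as the
# comments above suggest. 203.0.113.10 is a placeholder for the address
# your hosts can actually reach; ports are the usual 1.x defaults.
docker run -d --restart=unless-stopped -p 8080:8080 \
  rancher/server \
  --advertise-address 203.0.113.10
```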
Having the same issue as the others, but sadly @noogen's fix didn't work for me.
The only difference I have is that I have one environment with hosts which are on our local network, and that environment does not have any problems (these are dedicated machines, not VPS instances).
Dedicated rancher server with an external database, and the hosts are directly connected to the public internet.
| Useful | Info |
| :-- | :-- |
|Versions|Rancher v1.3.4 Cattle: v0.175.10 UI: v1.3.6 |
|Access|github admin|
|Orchestration|Cattle|
|Route|container.ports|
For me upgrading to 1.4.0 and using --advertise-address helped. I don't know for how long, because 1.3.4 stopped working properly, despite using --advertise-address.
Does anyone have another potential solution to this issue? Because the only (very unpleasant) solution I have at the moment is restarting the hosts every few days...
@joostliketoast We upgraded Rancher from 1.1.x to 1.3.3 and also stopped using environments with hosts from different networks. According to https://github.com/rancher/lb-controller/pull/56 the healthcheck containers of every host need to be able to communicate with another host on the default loadbalancer port, which is tcp/42 (or tcp/41, check the haproxy config file inside a LB container). I have nowhere found this in the documentation and was kind of surprising news. We have since created new environments for each host subnet (so there is no firewall between the hosts). Since then I haven't seen any problems with the LB service anymore.
@Napsty strange, the hosts have complete access between each other (the ones that are on the same Rancher environment).
I would suspect that they wouldn't work at all if they had problems with that port, instead of giving Waiting: allocated after a few days when I try to create or upgrade a container...
But I'll try and open that port for the rancher server to see if it has any effect.
Edit: no effect, and rebooting the Docker server doesn't help either.
So far not a fan of Rancher's new networking code; the stability of the software seems to have gone downhill...
Tail of the rancher server cattle-error.log:
2017-02-20 12:35:19,770 WARN [:] [] [] [] [cutorService-10] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1240]
2017-02-20 12:35:19,773 WARN [:] [] [] [] [ecutorService-8] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1241]
2017-02-20 12:35:19,777 WARN [:] [] [] [] [ecutorService-7] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1242]
2017-02-20 12:35:19,778 WARN [:] [] [] [] [cutorService-13] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1243]
2017-02-20 12:35:19,780 WARN [:] [] [] [] [ecutorService-6] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1244]
2017-02-20 12:35:19,785 WARN [:] [] [] [] [ecutorService-2] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1245]
2017-02-20 12:35:19,787 WARN [:] [] [] [] [cutorService-12] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1246]
2017-02-20 12:35:19,787 WARN [:] [] [] [] [ecutorService-1] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1247]
2017-02-20 12:35:19,789 WARN [:] [] [] [] [ecutorService-3] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1248]
2017-02-20 12:35:19,791 WARN [:] [] [] [] [ecutorService-9] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1249]
2017-02-20 12:35:19,794 WARN [:] [] [] [] [cutorService-14] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1250]
2017-02-20 12:35:19,798 WARN [:] [] [] [] [cutorService-11] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1251]
2017-02-20 12:35:19,799 WARN [:] [] [] [] [ecutorService-4] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1252]
2017-02-20 12:35:19,808 WARN [:] [] [] [] [cutorService-10] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1253]
2017-02-20 12:35:19,815 WARN [:] [] [] [] [ecutorService-6] [i.c.p.core.cleanup.BadDataCleanup ] Removing invalid resource [MountRecord:1254]
2017-02-20 12:35:31,913 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [3]
2017-02-20 12:35:36,915 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [4]
2017-02-20 12:35:41,917 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [5]
2017-02-20 12:35:46,918 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [6]
2017-02-20 12:35:46,920 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [34200]
2017-02-20 12:35:51,937 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [7]
2017-02-20 12:35:56,942 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [8]
2017-02-20 12:36:01,945 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [34200] count [9]
2017-02-20 12:36:32,549 ERROR [:] [] [] [] [cutorService-31] [o.a.c.m.context.NoExceptionRunnable ] Expected state running but got removed
Tail of cattle-debug.log:
2017-02-20 12:44:17,125 INFO [:] [] [] [] [ecutorService-6] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volume.activate:8624441] on [304248] : Timeout
2017-02-20 12:44:31,678 INFO [<REMOVED UID>:8802860] [nic:297994] [nic.purge->(NicPurge)] [] [cutorService-22] [i.c.p.r.p.i.ResourcePoolManagerImpl ] Releasing [02:53:27:a3:97:b9] id [39460] to pool [network:54] from owner [nic:297994]
2017-02-20 12:44:31,678 INFO [<REMOVED UID>:8802859] [nic:297982] [nic.purge->(NicPurge)] [] [cutorService-44] [i.c.p.r.p.i.ResourcePoolManagerImpl ] Releasing [02:f2:7a:33:ba:13] id [39438] to pool [network:53] from owner [nic:297982]
2017-02-20 12:44:32,090 INFO [:] [] [] [] [ecutorService-1] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366028] on [27786] : Timeout
2017-02-20 12:44:32,092 INFO [:] [] [] [] [cutorService-20] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356084] on [27716] : Timeout
2017-02-20 12:44:32,092 INFO [:] [] [] [] [cutorService-21] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366026] on [27784] : Timeout
2017-02-20 12:44:32,092 INFO [:] [] [] [] [cutorService-41] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8365452] on [14528] : Timeout
2017-02-20 12:44:32,093 INFO [:] [] [] [] [cutorService-29] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355981] on [27722] : Timeout
2017-02-20 12:45:17,103 INFO [:] [] [] [] [cutorService-47] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volume.activate:8624441] on [304248] : Timeout
2017-02-20 12:45:17,150 INFO [:] [] [] [] [cutorService-23] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356083] on [27717] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-25] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355982] on [27723] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-15] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355978] on [27674] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-46] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8357592] on [27726] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-17] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355980] on [27675] : Timeout
2017-02-20 12:45:17,152 INFO [:] [] [] [] [cutorService-35] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356436] on [27724] : Timeout
2017-02-20 12:45:17,154 INFO [:] [] [] [] [cutorService-45] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8367945] on [27708] : Timeout
2017-02-20 12:45:32,083 INFO [:] [] [] [] [cutorService-34] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8365452] on [14528] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-42] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356084] on [27716] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-16] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355981] on [27722] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-40] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366028] on [27786] : Timeout
2017-02-20 12:45:32,084 INFO [:] [] [] [] [cutorService-13] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366026] on [27784] : Timeout
2017-02-20 12:46:17,090 INFO [:] [] [] [] [cutorService-19] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356436] on [27724] : Timeout
2017-02-20 12:46:17,091 INFO [:] [] [] [] [cutorService-50] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volume.activate:8624441] on [304248] : Timeout
2017-02-20 12:46:17,093 INFO [:] [] [] [] [cutorService-48] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356083] on [27717] : Timeout
2017-02-20 12:46:17,093 INFO [:] [] [] [] [cutorService-38] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355978] on [27674] : Timeout
2017-02-20 12:46:17,093 INFO [:] [] [] [] [ecutorService-5] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8367945] on [27708] : Timeout
2017-02-20 12:46:17,094 INFO [:] [] [] [] [cutorService-36] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355980] on [27675] : Timeout
2017-02-20 12:46:17,130 INFO [:] [] [] [] [cutorService-39] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8357592] on [27726] : Timeout
2017-02-20 12:46:17,130 INFO [:] [] [] [] [cutorService-43] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355982] on [27723] : Timeout
2017-02-20 12:46:32,091 INFO [:] [] [] [] [ecutorService-3] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366026] on [27784] : Timeout
2017-02-20 12:46:32,093 INFO [:] [] [] [] [cutorService-30] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8355981] on [27722] : Timeout
2017-02-20 12:46:32,093 INFO [:] [] [] [] [ecutorService-7] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8356084] on [27716] : Timeout
2017-02-20 12:46:32,093 INFO [:] [] [] [] [cutorService-31] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8366028] on [27786] : Timeout
2017-02-20 12:46:32,094 INFO [:] [] [] [] [cutorService-10] [.e.s.i.ProcessInstanceDispatcherImpl] Timeout on process [volumestoragepoolmap.remove:8365452] on [14528] : Timeout
Same bug for me since December on all Rancher versions; no solution.
We ended up downgrading back to 1.2.0 and have been fine ever since.
The port 42 check occurs over the overlay network and does not require that port be open on the host.
There is a conflation of different issues here:
1) @InAnimaTe 1.2.0 is not the answer :). The fact that it 'works' in your case would lead me to believe you're running a RHEL/CentOS-based OS for your agents? Typically SELinux blocks containers from sharing the network, so ipsec breaks with a restrictive policy in place. An updated kernel for RHEL/CentOS also helps.
2) @joostliketoast can you run the cleanup utility (docker run rancher/cleanup-1-1:v0.1.2, which I believe is the latest) against your db? Also, running a modern version of Rancher helps here; 1.3.4 or 1.4.1 would be ideal. Make sure you upgrade infrastructure services one at a time (network services, ipsec, healthcheck, scheduler... in that order).
@aemneina I didn't say it was the answer and I don't advise people use an old version of software. Was merely stating that in some magical realm, rolling back to 1.2.0 fixed the problems we were having entirely. (I've since edited my comment not to sound like I'm providing a solution; sorry about that ;)
We're running latest RancherOS btw.
@InAnimaTe is there documentation on using the rancher/cleanup utility and when it should be used? We haven't done anything to our external DB since we did the v1.1 -> v1.2 migration.
I'd like to upgrade to a newer version of Rancher, but we seem somewhat stable with v1.2.2 (may go to v1.2.3) for now. Looking at the compatibility matrix we should be okay using v1.2 with Docker v1.12.6 until December?
Since we were hitting this bug a lot before your help with adding --advertise-address and moving to Ubuntu, if we run into it again I'll add as much detail as I can to this issue.
So the cleanup stuff is a bit new to me and I haven't used it. Regarding docs on that, I'm not seeing much from a quick search (other than the hub image) so Rancher staff/devs would be better poised to answer.
Okay, after running stable again for a week, this combination of changes seems to have worked for me:
--advertise-address as mentioned before
We also met this issue frequently after we upgraded to 1.3.0.
Many containers are stuck at Waiting: allocated, but they are actually already running.
And we got a lot of volumestoragepoolmap.remove processes with UNKNOWN EXCEPTION.
Our hosts were added to the environment at different times, so the Docker versions are different, including 1.11.2, 1.12.1, 1.12.3, 1.13.0 and 1.13.1.
Is this related to the Docker version, or do we just need to update the rancher server to 1.3.4 or 1.4.1?
| Useful | Info |
| :-- | :-- |
|Versions|Rancher v1.3.0 Cattle: v0.175.2 UI: v1.3.5 |
|Access|github admin|
|Orchestration|Cattle|
|Route|stacks.index|
Okay, I have to come back to my previous statement.
After running without a hitch for about 2 to 3 weeks, during which I haven't upgraded Rancher or Docker on the hosts, I got the Waiting: allocated again today while upgrading a container.
To the Rancher development team: is there anything you need to further debug this issue? Because this is proving very troublesome in our production environment, and the only solution to this problem I have found is to physically restart the hosts.
Had the same issue a few hours ago, with rancher v1.4.3, on one of our Cattle environments. The only solution was to reboot all servers on the environment. =/
Having the same issue. My environment is Rancher running on RancherOS, and reboot does not help at all. This is a total stopper for us :(
Yep, have the same. And very often containers just hang in the "starting" stage after upgrades.
Just experiencing this myself; I think I'm going to have to kill HA and go back to non-HA.
I had the same issue (Rancher 1.5.9, host on Debian 9), but things seem to be back to normal after a docker service restart on the host. No need to reboot it.
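The restart-based workaround reported in this and earlier comments amounts to bouncing the Docker daemon on the affected host rather than rebooting the whole machine (sketch only; assumes a systemd host, sysvinit hosts would use service docker restart instead):

```shell
# Bounce the Docker daemon instead of rebooting the whole host.
sudo systemctl restart docker
# Then confirm the Rancher agent container came back up:
sudo docker ps --filter name=rancher-agent
```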
Thanks manuito, good to know. Would it be possible for people who are affected to report whether they still have the issue in newer versions, or would that be too noisy/chatty? It would help people who are considering upgrades, and also help find a resolution, such as telling people to upgrade vs. leaving this issue open.
I'm currently running stable with the Rancher 1.3.3 server and am held up by two things: this issue, and seeing upgrade issues with the latest versions. This is also because both Docker and Rancher are moving at "fast and furious" speed, a good thing; and I don't want to be too far behind, resulting in an even worse upgrade experience.
I have the same issue; I cannot deploy a new stack on some hosts.
I just restart docker on the host and it's back to normal, but this is not a good enough fix for me.
Rancher 1.6.3
Ubuntu 16.04.2
Docker: 17.03.1-ce
I haven't seen this problem since I upgraded Docker to a supported version on all my hosts. You can check the compatibility in Rancher's documentation: http://rancher.com/docs/rancher/v1.6/en/hosts/
@richardlt what is your rancher + docker version ? I have the problem even if my docker version is a supported one (Rancher 1.5.9 / docker 17.03.0-ce - see http://rancher.com/docs/rancher/v1.5/en/hosts/)
@manuito I'm running Rancher 1.4.1, which is a little bit old, with Docker 1.12.3. What's interesting is that when I started answering in this issue I was on Rancher 1.2.x. Updating Rancher didn't solve the bug; I finally managed to shut down my servers, install a recommended Docker version, and it works. Sometimes with my current setup the "waiting for allocated container" message appears for a good reason, for example when a host is unreachable and then back online. I can fix that quickly by restarting the Rancher network manager container.
Having this issue with latest rancher stable. Version mix:
Might be unrelated, but all of my hosts disconnect over time too.
Having the same issue:
Rancher 1.2.1
Docker version 1.12.3
Restarting the docker service doesn't work at all.
We resolved the problem by upgrading rancher from 1.2.1 to 1.3.4 and the Docker version of the host from 1.12.3 to the latest version, 17.06-ce.
We hope this problem doesn't happen again.
Happy to report that I have been running the latest stable version of Rancher Server (1.6.5) and 17.06.0-ce hosts for a week without any issue, knock on wood.
All are running Ubuntu 16.04.02 LTS. Updated all my servers. Didn't really go through the upgrade route; I just created a new setup and moved some of the cattle over to the new ranch. I have that luxury because most of my cattle are just worker engines (prerender, image resize proxy, etc...).
I'm running v1.6.6 + Docker version 1.12.6, build a82d35e, and have this issue now...
Maybe it happens when I'm trying to connect to NFS... but I think there is a timeout problem...
I am running Docker 17.06-ce and Rancher v1.6.8 and I experienced the same issue, but I discovered the problem was because of my host machine's memory. Just like @noogen said: recreate it. So always download the docker-compose and rancher-compose files of your stack, so that moving to a new deployment won't be a challenge.
Subscribing!
This is a major hurt point for us. Upgrading to a later version of Rancher seems to help, but after an undetermined amount of time, some service that we try to upgrade will fail with this error again.
Removing the host machine that the service tries to start on, cleaning the Rancher state folder /var/lib/rancher/state, and re-registering it with Rancher also seems to work, but what a sledgehammer of a solution... :(
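As a hedged sketch, the host-level "sledgehammer" described above usually looks something like the following; the container names match the default Rancher 1.x agent naming, and the exact registration command must be copied from Add Host in the UI (it embeds an environment-specific token), so it is elided here:

```shell
# 1. Deactivate and delete the host in the Rancher UI first.
# 2. On the host, remove the agent containers and Rancher state:
sudo docker rm -f rancher-agent rancher-agent-state 2>/dev/null
sudo rm -rf /var/lib/rancher/state
# 3. Re-register using the command shown under Infrastructure > Hosts >
#    Add Host in the UI (elided; it is unique per environment).
```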
Echoing stratosgear and others here - this also happened to us. We've little idea what caused the issue, so I apologise that this is probably not helpful, but this is a big issue and we couldn't find any worthwhile logs explaining what was happening.
After a failed upgrade of our production API service, we first attempted to reboot the rancher server, and when this didn't resolve the issue we were left with little choice but to restart what we thought was the bad host where this non-starting container had been running. This was a terrible idea and it basically brought down our entire cluster as rancher attempted to heal our deployment, failing to spin up any more containers on all other hosts in the env and also failing to recognise when the original host had restarted.
The only way we managed to recover from this was to remove all nodes from the environment, remove /var/lib/rancher/state from each as suggested here, and then re-enroll each of the 3 hosts.
Only once the last host was removed did rancher start attempting to deploy containers again to the newly cleaned hosts. Once we got to this point rancher was able to heal our deployment in a couple of seconds (I imagine because most images were still on most servers).
We're on the latest stable rancher - v1.6.10 - running Ubuntu 14.04.5 LTS on 3 x dedicated servers for this env. Docker version 17.06.2-ce, build cec0b72.
I had the same issue and it looks like restarting "metadata" and "network-manager" services in "infrastructure" -> "network services" helped me.
P.S. Rancher 1.4.0
I also just went through the same issues, nearly all stacks were locked and did not come up again.
I tried many of the things mentioned here; downgrading was not an option for me, so finally I updated Rancher to the latest stable version, 1.6.16.
I removed all hosts from the Cattle environment and set up everything from scratch using the same database. After that, I still had the same issues. I tried to create a new environment, but I had the same issues there.
In the logs, I could find a database issue: "max_allowed_packet" was too small, as described in #9357, so I fixed that first.
After that, I came across #6859 and realized that all my unhealthy stacks were using convoy-nfs. I believe that this was the root cause of my issues. After migrating from convoy-nfs to rancher-nfs, all stacks came up again and are now back fully functional.
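For anyone hitting the same max_allowed_packet symptom referenced via #9357, it is a MySQL server setting; a hedged config sketch (the 64M value is illustrative, not taken from this thread - size it to your own payloads):

```ini
# my.cnf on the MySQL server backing Rancher (restart mysqld afterwards)
[mysqld]
max_allowed_packet = 64M
```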
With the release of Rancher 2.0, development on v1.6 is only limited to critical bug fixes and security patches.