We have a problem running a MongoDB cluster under Rancher 1.3. Previously we used Rancher 1.1 and this worked fine. Now, for no apparent reason, Rancher keeps restarting at least one node of the MongoDB cluster, claiming the service is incomplete. Below is a fragment of the Rancher log (read from the bottom up, as entries are in reverse chronological order):
01:19:17 PM INFO service.trigger.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0
01:19:17 PM INFO service.trigger.info Service already reconciled
01:19:16 PM INFO service.trigger Re-evaluating state
01:19:16 PM INFO service.trigger (1 sec) Re-evaluating state
01:19:16 PM INFO service.trigger.info Service reconciled: Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0
01:19:16 PM INFO service.update.info Service already reconciled
01:19:16 PM INFO service.update Updating service
01:19:16 PM INFO service.update.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0
01:19:16 PM INFO service.trigger.exception Busy processing [SERVICE.280] will try later
01:19:03 PM INFO service.update Updating service
01:19:03 PM INFO service.update.exception Busy processing [SERVICE.280] will try later
01:19:02 PM INFO service.trigger.wait (14 sec) Waiting for instances to start
01:19:02 PM INFO service.instance.create Creating extra service instance
01:19:02 PM INFO service.instance.create Creating extra service instance
01:19:01 PM INFO service.trigger (15 sec) Re-evaluating state
01:19:01 PM INFO service.trigger.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 1
The problem always starts with the log entry Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 1.
However, at the very same time nothing interesting is happening in mongo; it is suddenly restarted by something external, i.e. Rancher. This happens a few times per hour, without any apparent reason (log below in chronological order):
2017-01-22T13:06:11.957+0000 I NETWORK [conn2362] end connection 10.42.191.72:55615 (24 connections now open)
2017-01-22T13:06:14.848+0000 I NETWORK [initandlisten] connection accepted from 10.42.191.72:55635 #2363 (25 connections now open)
2017-01-22T13:06:14.849+0000 I NETWORK [conn2363] end connection 10.42.191.72:55635 (24 connections now open)
(Nothing unusual up to this point. Now look at the timestamps: what follows happens right after Rancher logged service.trigger.info with Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 1.)
->
2017-01-22T13:06:15.243+0000 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-01-22T13:06:15.244+0000 I FTDC [signalProcessingThread] Shutting down full-time diagnostic data capture
2017-01-22T13:06:15.253+0000 I REPL [signalProcessingThread] Stopping replication applier threads
2017-01-22T13:06:15.556+0000 I STORAGE [conn105] got request after shutdown()
2017-01-22T13:06:15.871+0000 I STORAGE [conn91] got request after shutdown()
2017-01-22T13:06:15.874+0000 I STORAGE [conn86] got request after shutdown()
2017-01-22T13:06:15.887+0000 I STORAGE [conn82] got request after shutdown()
2017-01-22T13:06:15.941+0000 I STORAGE [conn83] got request after shutdown()
2017-01-22T13:06:16.009+0000 I STORAGE [conn85] got request after shutdown()
2017-01-22T13:06:16.020+0000 I STORAGE [conn84] got request after shutdown()
2017-01-22T13:06:16.108+0000 I STORAGE [conn75] got request after shutdown()
2017-01-22T13:06:16.133+0000 I STORAGE [conn87] got request after shutdown()
Rancher Versions:
Server: 1.3.1 (running on Ubuntu 14.04 LTS)
healthcheck: v0.2.0
ipsec: net:v0.8.1
network-services: network-manager:v0.4.0
scheduler: scheduler:v0.5.1
Docker Version:
Reproduced on Docker 1.10.3 and on 1.12.0
OS and where are the hosts located? (cloud, bare metal, etc):
The whole environment runs on a private cloud.
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single-node Rancher with an external MySQL database.
Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle only
Steps to Reproduce:
Deploy a MongoDB cluster, which by default runs at scale 3. Wait around 1 hour and look at the Rancher logs for the mongo stack.
You don't even need to use mongo; Rancher decides on its own that there is a problem.
Results:
Mongo is restarted a few times an hour. Sometimes this corrupts mongo's local database, because the system does not wait long enough for a graceful mongod shutdown (which probably takes longer than the default 10 s stop timeout while mongo is in use); a possible mitigation is sketched below.
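This won't stop the spurious restarts, but it may prevent the corruption when they happen: raise Docker's stop timeout so mongod gets more than the default 10 s to shut down cleanly after SIGTERM. A minimal sketch, assuming a compose v2 file; the service name, image tag, and 1-minute grace period are illustrative, not taken from the actual stack:

```yaml
version: '2'
services:
  mongo-cluster:
    image: mongo:3.2
    command: mongod --replSet rs0
    # Docker sends SIGTERM on stop and escalates to SIGKILL after this
    # period; the default 10s can be too short for mongod to flush
    # journals and step down under load (matching the 'got request
    # after shutdown()' entries in the log above).
    stop_grace_period: 1m
```

The equivalent for a manual stop is `docker stop -t 60 <container>`.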
Same issue here with Rancher 1.4.1, with the same config as luman75.
02:12:29 PM INFO service.trigger.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0
02:12:29 PM INFO service.trigger.info Service already reconciled
02:12:29 PM INFO service.trigger Re-evaluating state
02:12:25 PM INFO service.update.info Service reconciled: Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0
02:12:20 PM INFO service.trigger Re-evaluating state
02:12:20 PM INFO service.trigger.exception Busy processing [SERVICE.229] will try later
02:12:18 PM INFO service.instance.create Creating extra service instance
02:12:18 PM INFO service.instance.create Creating extra service instance
02:12:17 PM INFO service.instance.create Creating extra service instance
02:12:17 PM INFO service.instance.create Creating extra service instance
02:12:17 PM INFO service.instance.create Creating extra service instance
02:12:17 PM INFO service.update.wait (7 sec) Waiting for instances to start
02:12:17 PM INFO service.instance.create Creating extra service instance
02:12:16 PM INFO service.update (9 sec) Updating service
02:12:16 PM INFO service.update.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 3
02:12:16 PM INFO service.trigger Re-evaluating state
02:12:16 PM INFO service.trigger.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 3
12:57:26 PM INFO service.trigger.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0
12:57:26 PM INFO service.trigger.info Service already reconciled
It upgraded itself and recreated all the containers, killing my replica-set.
That's happening regularly; I can't say right now what the interval between restarts is.
I'm using this configuration to launch my cluster: https://github.com/aestetype/rancher-mongodb-cluster
It's a fork of https://github.com/rancher/catalog-dockerfiles/pull/92 with a small modification to use only hostnames in the replica set config (see the sketch below).
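For context on that modification: in Cattle, hostname-based addressing between the containers of a service is typically done with the hostname_override label, so each member registers under its container name instead of an IP. A minimal sketch (the service name and image are illustrative, not taken from the linked repo):

```yaml
version: '2'
services:
  mongo-cluster:
    image: mongo:3.2
    command: mongod --replSet rs0
    labels:
      # Set each container's hostname to its Rancher container name,
      # giving replica set members stable, DNS-resolvable names such
      # as mongo-cluster-1 instead of ephemeral IPs.
      io.rancher.container.hostname_override: container_name
```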
Edit: It looks like my MongoDB services restart while I'm using the app linked to them. It triggers a
3/3/2017 1:41:58 AM Fri, 03 Mar 2017 01:41:58 GMT mongo-oplog unknow error {"name":"MongoError","message":"server mongo-cluster-1.rancher.internal:27017 timed out"}
And then the MongoDB cluster restarts and I have to manually restart the app because it hangs without a MongoDB connection :/
Investigating...
We are seeing this as well in v1.4.1, running a single instance of MongoDB:
Requested: 1, Created: 1, Unhealthy: 0, Bad: 0, Incomplete: 1
Re-evaluating state
Creating extra service instance
Updating service
Waiting for instances to start
Creating extra service instance
Requested: 1, Created: 1, Unhealthy: 0, Bad: 0, Incomplete: 1
Re-evaluating state
Busy processing [SERVICE.100] will try later
Service reconciled: Requested: 1, Created: 1, Unhealthy: 0, Bad: 0, Incomplete: 0
Occasionally, after the mongo service updates, our server seems to lose its connection to it and the stack needs to be restarted.
We figured out why this was happening on our end - we had installed the Janitor service, and it was removing the data volumes we'd set up to 'start once'. When the data volumes were removed, Rancher upgraded our data service, and this occasionally led to a dropped connection from the server. We followed the instructions here to ensure the Janitor service ignored our data volumes, and are seeing that our system is stable now:
https://github.com/rancher/community-catalog/tree/master/templates/janitor
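For anyone else hitting this: the fix boils down to telling the janitor which containers and images to leave alone. A hedged sketch of the janitor service entry, assuming the KEEP_IMAGES/KEEP_CONTAINERS environment variables of the meltwater/docker-cleanup image the template is based on (check the linked README for the exact options your template version exposes); the mongo patterns are illustrative:

```yaml
version: '2'
services:
  janitor:
    image: meltwater/docker-cleanup
    environment:
      # Exclude these image patterns from cleanup so that stopped
      # 'start once' data-volume containers are not removed.
      # Patterns are illustrative; see the template README.
      KEEP_IMAGES: "mongo:*"
      KEEP_CONTAINERS: "mongo:*"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker:/var/lib/docker
```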