Rancher continuously restarts MongoDB

Created on 23 Jan 2017 · 3 comments · Source: rancher/rancher

We have a problem running a MongoDB cluster under Rancher 1.3. Previously we used Rancher 1.1 and this worked fine. Now Rancher keeps restarting at least one node of the MongoDB cluster for no apparent reason, claiming the service is incomplete. Below is a fragment of the Rancher log (read from the bottom up, as it is in reverse chronological order):

01:19:17 PM INFO    service.trigger.info    Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0    
01:19:17 PM INFO    service.trigger.info    Service already reconciled   
01:19:16 PM INFO    service.trigger Re-evaluating state  
01:19:16 PM INFO    service.trigger (1 sec) Re-evaluating state  
01:19:16 PM INFO    service.trigger.info    Service reconciled: Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0    
01:19:16 PM INFO    service.update.info Service already reconciled   
01:19:16 PM INFO    service.update  Updating service     
01:19:16 PM INFO    service.update.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0    
01:19:16 PM INFO    service.trigger.exception   Busy processing [SERVICE.280] will try later     
01:19:03 PM INFO    service.update  Updating service     
01:19:03 PM INFO    service.update.exception    Busy processing [SERVICE.280] will try later     
01:19:02 PM INFO    service.trigger.wait (14 sec)   Waiting for instances to start   
01:19:02 PM INFO    service.instance.create Creating extra service instance  
01:19:02 PM INFO    service.instance.create Creating extra service instance  
01:19:01 PM INFO    service.trigger (15 sec)    Re-evaluating state  
01:19:01 PM INFO    service.trigger.info    Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 1

The problem always starts with the log entry Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 1.

However, at the very same time nothing interesting is happening inside Mongo; it is suddenly restarted by something external, i.e. Rancher (this log is in natural order):

The MongoDB cluster service is being restarted by Rancher, without any apparent reason, a few times per hour.

2017-01-22T13:06:11.957+0000 I NETWORK  [conn2362] end connection 10.42.191.72:55615 (24 connections now open)
2017-01-22T13:06:14.848+0000 I NETWORK  [initandlisten] connection accepted from 10.42.191.72:55635 #2363 (25 connections now open)
2017-01-22T13:06:14.849+0000 I NETWORK  [conn2363] end connection 10.42.191.72:55635 (24 connections now open)

(Nothing unusual up to this point. Now look at the timestamps: what follows comes right after Rancher logged service.trigger.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 1.)

2017-01-22T13:06:15.243+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-01-22T13:06:15.244+0000 I FTDC     [signalProcessingThread] Shutting down full-time diagnostic data capture
2017-01-22T13:06:15.253+0000 I REPL     [signalProcessingThread] Stopping replication applier threads
2017-01-22T13:06:15.556+0000 I STORAGE  [conn105] got request after shutdown()
2017-01-22T13:06:15.871+0000 I STORAGE  [conn91] got request after shutdown()
2017-01-22T13:06:15.874+0000 I STORAGE  [conn86] got request after shutdown()
2017-01-22T13:06:15.887+0000 I STORAGE  [conn82] got request after shutdown()
2017-01-22T13:06:15.941+0000 I STORAGE  [conn83] got request after shutdown()
2017-01-22T13:06:16.009+0000 I STORAGE  [conn85] got request after shutdown()
2017-01-22T13:06:16.020+0000 I STORAGE  [conn84] got request after shutdown()
2017-01-22T13:06:16.108+0000 I STORAGE  [conn75] got request after shutdown()
2017-01-22T13:06:16.133+0000 I STORAGE  [conn87] got request after shutdown()

Rancher Versions: 1.3.1
Server: Ubuntu 14.04 LTS
healthcheck: v0.2.0
ipsec: net:v0.8.1
network-services:network-manager:v0.4.0
scheduler:scheduler:v0.5.1

Docker Version:
Reproduced on Docker 1.10.3 and on 1.12.0

OS and where are the hosts located? (cloud, bare metal, etc):
The whole environment runs in a private cloud

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Rancher runs on a single node with an external MySQL database

Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle only

Steps to Reproduce:
Deploy a MongoDB cluster, which by default runs at scale 3. Wait around 1 hour and look at the Rancher logs for the Mongo stack.
You don't even need to use Mongo; Rancher makes the restart decision entirely on its own, and that is the problem. A minimal sketch of such a deployment follows.
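This sketch is loosely based on the catalog MongoDB template; the image tag, the stack/service names, and the anti-affinity value are illustrative assumptions, not the exact catalog contents.

docker-compose.yml:

mongo-cluster:
  image: mongo:3.2                     # illustrative tag
  command: mongod --replSet rs0
  labels:
    # soft anti-affinity so the three replicas land on different hosts;
    # the label key is a standard Rancher scheduling label, the value is illustrative
    io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=mongo/mongo-cluster

rancher-compose.yml:

mongo-cluster:
  scale: 3   # the default scale at which the restarts appear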

Results:
Mongo is restarted a few times an hour. Sometimes this leads to corruption of Mongo's local database, because the system does not wait long enough for a graceful Mongo shutdown (which probably takes longer than 10 seconds when Mongo is in use).
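A possible mitigation for the corruption (though not for the restarts themselves) is to give mongod more than Docker's default 10-second stop timeout before it is killed. A sketch in compose file format 2.1 syntax; whether the Rancher version in use honors stop_grace_period is an assumption to verify:

version: '2.1'
services:
  mongo-cluster:
    image: mongo:3.2
    command: mongod --replSet rs0
    # Docker sends SIGTERM (the signal 15 seen in the Mongo log above),
    # then SIGKILL after this grace period; the default is 10s
    stop_grace_period: 60s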

All 3 comments

Same issue here with Rancher 1.4.1, with the same config as luman75.

02:12:29 PM INFO    service.trigger.info    Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0    
02:12:29 PM INFO    service.trigger.info    Service already reconciled   
02:12:29 PM INFO    service.trigger Re-evaluating state  
02:12:25 PM INFO    service.update.info Service reconciled: Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0    
02:12:20 PM INFO    service.trigger Re-evaluating state  
02:12:20 PM INFO    service.trigger.exception   Busy processing [SERVICE.229] will try later     
02:12:18 PM INFO    service.instance.create Creating extra service instance  
02:12:18 PM INFO    service.instance.create Creating extra service instance  
02:12:17 PM INFO    service.instance.create Creating extra service instance  
02:12:17 PM INFO    service.instance.create Creating extra service instance  
02:12:17 PM INFO    service.instance.create Creating extra service instance  
02:12:17 PM INFO    service.update.wait (7 sec) Waiting for instances to start   
02:12:17 PM INFO    service.instance.create Creating extra service instance  
02:12:16 PM INFO    service.update (9 sec)  Updating service     
02:12:16 PM INFO    service.update.info Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 3    
02:12:16 PM INFO    service.trigger Re-evaluating state  
02:12:16 PM INFO    service.trigger.info    Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 3    
12:57:26 PM INFO    service.trigger.info    Requested: 3, Created: 3, Unhealthy: 0, Bad: 0, Incomplete: 0    
12:57:26 PM INFO    service.trigger.info    Service already reconciled

It upgraded the service by itself and recreated all the containers, killing my replica set.
That happens regularly; I can't yet say how long the interval between restarts is.

I'm using this configuration to launch my cluster: https://github.com/aestetype/rancher-mongodb-cluster
It's a fork of https://github.com/rancher/catalog-dockerfiles/pull/92 with a small modification to use only hostnames in the replica set config.

Edit: It looks like my MongoDB services restart when I'm using the app linked to them. It triggers a

3/3/2017 1:41:58 AM  Fri, 03 Mar 2017 01:41:58 GMT mongo-oplog unknow error {"name":"MongoError","message":"server mongo-cluster-1.rancher.internal:27017 timed out"}

And then the MongoDB cluster restarts, and I have to manually restart the app because it hangs without a MongoDB connection :/

Investigating...

We are seeing this as well in v1.4.1, running a single instance of MongoDB:

Requested: 1, Created: 1, Unhealthy: 0, Bad: 0, Incomplete: 1
Re-evaluating state
Creating extra service instance
Updating service
Waiting for instances to start
Creating extra service instance
Requested: 1, Created: 1, Unhealthy: 0, Bad: 0, Incomplete: 1
Re-evaluating state
Busy processing [SERVICE.100] will try later
Service reconciled: Requested: 1, Created: 1, Unhealthy: 0, Bad: 0, Incomplete: 0

Occasionally, after the Mongo service updates, our server seems to lose its connection to Mongo and the stack needs to be restarted.

We figured out why this was happening on our end - we had installed the Janitor service, and it was removing the data volumes we'd set up to 'start once'. When the data volumes were removed, Rancher upgraded our data service, and this occasionally led to a dropped connection from the server. We followed the instructions here to ensure the Janitor service ignored our data volumes, and are seeing that our system is stable now:

https://github.com/rancher/community-catalog/tree/master/templates/janitor
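For anyone applying the same fix: it amounts to telling Janitor which stopped containers to leave alone. Below is a sketch of a Janitor service entry with a keep pattern; KEEP_CONTAINERS comes from the meltwater/docker-cleanup image the template wraps, and the exact option names should be checked against the README linked above:

janitor:
  image: meltwater/docker-cleanup:latest
  environment:
    # comma-separated image patterns whose stopped containers are kept,
    # e.g. the 'start once' data-volume containers mentioned above
    KEEP_CONTAINERS: "*mongo*"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /var/lib/docker:/var/lib/docker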

