Mongoose: Failing to automatically reconnect to new PRIMARY after a replica set failover, from Mongoose (MongoDB, NodeJS Driver)

Created on 19 Jan 2015 · 36 Comments · Source: Automattic/mongoose

I am not sure whether this is a bug or a misconfiguration on my end. How can I solve this?

I made a simple NodeJS app with Mongoose as the MongoDB driver and connected it to a MongoDB replica set. The app works fine until I shut down the current PRIMARY. When the PRIMARY goes down, the replica set automatically elects a new PRIMARY, but after that the Node application no longer responds to DB queries.

CODE: DB Connection

var options = {
    server: {
        socketOptions: {
            keepAlive: 1,
            connectTimeoutMS: 30000,
            socketTimeoutMS: 90000
        }
    },
    replset: {
        socketOptions: {
            keepAlive: 1,
            connectTimeoutMS: 30000,
            socketTimeoutMS: 90000
        },
        rs_name: 'rs0'
    }
};

var uri = "mongodb://xxx.xxx.xxx.xxx:27017,xxx.xxx.xxx.xxx:27017,xxx.xxx.xxx.xxx:27017/rstest";

 mongoose.connect(uri,options);

CODE: DB Query

router.get('/test', function (req, res) {

    var testmodel = new testModel('test');
    testmodel.save(function (err, doc, numberAffected) {
        if (err) {
            console.log('ERROR: ' + err);
            res.status(404).end();
        } else {
            console.log('Response sent');
            res.status(200).end();
        }
    });
});

Steps Followed

1. Created a MongoDB replica set across three VMs.
2. Created a simple NodeJS app (Express + Mongoose) with a test API as above.
3. Sent GET requests to '/test' continuously, at a short interval, from a local system (a sketch of such a client is below).
4. Took the PRIMARY instance down.
5. The application stopped responding to requests.

Versions: "express": "4.10.6", "mongodb": "1.4.23", "mongoose": "3.8.21"

A sample app that I have done for debugging this issue is available at https://[email protected]/melvingeorge/nodejsmongorssample.git
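
For reference, a minimal polling client like the one described in step 3 might look like this (the port and interval are assumptions, not taken from the repository):

var http = require('http');

// Hit the test endpoint once a second and log every response status,
// so a failover shows up as errors or timeouts in the output.
setInterval(function () {
  http.get('http://localhost:3000/test', function (res) {
    res.resume(); // drain the response so the socket is freed
    console.log(new Date().toISOString() + ' status: ' + res.statusCode);
  }).on('error', function (err) {
    console.log(new Date().toISOString() + ' request failed: ' + err.message);
  });
}, 1000);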

Labels: can't reproduce, help

Most helpful comment

There was a bug with the mongodb driver that affected connections to new replica set primaries. It was fixed in https://jira.mongodb.org/browse/NODE-818

However, mongoose versions from 4.6.0 to 4.7.2 inclusive were using a version of the mongodb driver that had the bug.

This affected my project, which is running mongodb 3.0.12. If you have issues where mongoose fails to reconnect after a step down, check what version of mongoose you're using.

Hopefully this helps someone.

All 36 comments

Heya,

So in theory, this should work fine once a new primary is elected. Looking at your code, I don't see where you're using the configuration options you specified. Frankly, it doesn't look like you're actually connecting with mongoose at all; it looks like you're connecting with monk and default options. The following example works fine for me even when I kill off the primary:

var mongoose = require('mongoose');

mongoose.connect('mongodb://hostname:27000,hostname:27001,hostname:27002');

mongoose.model('Parent', new mongoose.Schema({ name: String }));

var http = require('http');
http.createServer(function (req, res) {
  mongoose.model('Parent').findOne(function(error) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Error: ' + error + '\n');
  });
}).listen(1337, '127.0.0.1');
console.log('Server running at http://127.0.0.1:1337/');

Can you verify if the above code continues to be responsive with your setup when a primary gets taken down?

I tried this code and still have the same issue.

var mongoose = require('mongoose');
mongoose.connect('mongodb://104.155.199.26,104.155.201.26,104.155.237.251/rstest');

var pingSchema = mongoose.Schema({ time:Number });
var pingModel = mongoose.model('pings', pingSchema);

var http = require('http');
http.createServer(function (req, res) {

    var time = Date.now();
    var testmodel = new pingModel({ time: time });
    testmodel.save(function (err, doc, numberAffected) {
        if (err) {
            console.log('ERROR: Write ' + err);
            res.writeHead(500, {'Content-Type': 'text/plain'});
            res.end('Error ' + err + '\n');
        } else {
            res.writeHead(200, {'Content-Type': 'text/plain'});
            res.end(JSON.stringify(doc));
        }
    });

}).listen(1337, '127.0.0.1');
console.log('Server running at http://127.0.0.1:1337/');


In some cases it keeps working the first time the PRIMARY goes down. But once all servers are back up, if I take down the new PRIMARY, it stops working. All requests after that time out.

Updated the repository with the above code. To reproduce the issue I created a mocha test (also available in the repository) that continuously pings the URL and checks that it receives a 200 status code.

The first time the PRIMARY went down:
[screenshot]

When the PRIMARY went down for the second time:
[screenshot]
The error message is "ERROR: Write Error: connection closed", as given above.

I have the same issue. I've noticed this happens to applications with a high write rate. In other words, if your application is trying to make too many write operations while the primary is down, it will eventually throw the "connection closed" error.

Hmmm, I can't seem to reproduce this in my environment. Once I kill the primary and then bring it back, writes continue even if I kill off the new primary once the election's done. Here are a few questions you can answer to give me a better idea of what's going on on your end:

1) Do you ever have 2 members of the replica set down at the same time?
2) What version of mongodb and how many repl set members?
3) What version of mongoose?

I had an issue similar to yours while dealing with replicas. In my case I had 1 primary node with a priority of 10, 1 secondary with a priority of 0 (for analytics), and an arbiter. My writes would fail after reconnecting the primary instance, and I went through a lot trying to fix it. Here's the most important thing I learnt:

When my primary is down or unreachable, there has to be another member eligible to become primary (at least 2 members in the set must have a priority >= 1). If I have only arbiters, hidden members, or members with a priority of 0, queries get stuck: even after I reconnect my primary and it gets re-elected as primary, my client is unable to complete write queries. Read queries would still work, but writes wouldn't.
This is what I faced with mongoose, even with keepAlive, auto_reconnect, and all the socket and connection timeout MS options set.

Hopefully this helps.

That's not really something mongoose can help with. If you have a 3 node replica set with an arbiter and a priority 0 secondary you may as well save yourself the effort and get rid of the arbiter. Mongodb can't elect a new primary unless there's a replica set member up that's eligible to become primary, and neither priority 0 secondaries nor arbiters are eligible to become primaries. Without a primary, you can't do write operations.

Yes @vkarpov15, I agree that there can't be an election. But the problem is that when the primary comes back up, mongoose does not reconnect/allow writes unless you manually restart the mongoose connection.
Isn't this abnormal? Shouldn't the mongoose connection auto-reconnect to the primary once there is a re-election?
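
For anyone trying to observe this, a minimal sketch (hostnames are placeholders) that logs mongoose's connection events can show whether the driver ever recovers after an election; a healthy failover should eventually emit 'reconnected':

var mongoose = require('mongoose');

// Log every state change on the default connection so failover
// behavior is visible in the console.
['connecting', 'connected', 'disconnected', 'reconnected', 'error'].forEach(function (event) {
  mongoose.connection.on(event, function (arg) {
    console.log(new Date().toISOString() + ' connection event: ' + event +
      (arg instanceof Error ? ' (' + arg.message + ')' : ''));
  });
});

mongoose.connect('mongodb://host1:27017,host2:27017,host3:27017/rstest');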

@arv1989 can you try changing your replset connection to this:

replset: {
    socketOptions: {
        keepAlive: 1,
        connectTimeoutMS: 30000,
        socketTimeoutMS: 0
    }
}

i.e. changing socketTimeoutMS: 90000 to socketTimeoutMS: 0, according to this post
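
In context, the full options object with that change would look something like this (a sketch based on the original poster's config, not a guaranteed fix):

var options = {
  server: {
    socketOptions: { keepAlive: 1, connectTimeoutMS: 30000, socketTimeoutMS: 0 }
  },
  replset: {
    // socketTimeoutMS: 0 disables the inactivity timeout, so idle
    // sockets are not closed between requests.
    socketOptions: { keepAlive: 1, connectTimeoutMS: 30000, socketTimeoutMS: 0 },
    rs_name: 'rs0'
  }
};

mongoose.connect(uri, options);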

@dhatawesomedude what version of mongoose are you using? If mongoose doesn't properly reconnect after the primary is back up that would be a bug

Oh, I thought it was abnormal. I'm on [email protected].

Hi guys, I am having a similar problem, but I thought it might be a problem with the new MongoDB 3.0! Are any of you working with MongoDB 3.0? Everything is fine the first time the Primary goes down: another member gets elected Primary and the old one rejoins as Secondary. But if you stepDown the current Primary again, it won't work, even though MongoDB elected a Primary.

Thanks

@maziyarpanahi I'm on 2.8, not 3.0. Perhaps you could post your replica set config here, and your mongoose/mongodb connection config too.

Agreed with @dhatawesomedude, that would be helpful. A script that reproduces the behavior you're seeing would be even more helpful :)

I'm having the same issue as @maziyarpanahi with Mongoose 4.0.1 and MongoDB 3.0.2. My setup includes a 3-node replica set. The connection string is: mongodb://ip1:27017,ip2:27017,ip3:27017/app.

When I bring down a primary the first time (after the last application restart), Mongoose is able to reconnect to the newly elected primary. However, if I keep bringing down new primaries, usually by the second or third time Mongoose will fail to reconnect. Both reads and writes return the error "no primary server available." It keeps failing until I restart the application.

Before bringing down a primary, I always check that all nodes are up and that the failed node has already transitioned back to secondary. In other words, no 2 nodes are ever down at the same time. In addition, I always connect to the shell of each node and verify the result of rs.status() to make sure everything's okay before testing reads and writes.

In short, MongoDB always fails over correctly, but Mongoose (or the Mongo driver) sometimes fails to reconnect.

@vkarpov15 any thought on this?

Not yet, been busy debugging people's 4.0.x issues. I'd be happy to see a standalone script that repros this because "sometimes fails to reconnect" makes me think that this issue is more subtle than that. I'll take a closer look later this week.

So I have set up this simple script:

var mongoose = require('mongoose');

mongoose.connect('mongodb://[3 ip addresses]/test');

var M = mongoose.model('gh2603', {});

setInterval(function() {
  M.findOne({}, function(error) {
    console.log('Got error: ' + error);
  });
}, 1000);

With a 3-node replica set, I killed the primary 3 times and haven't gotten a single error from a read. Can you check whether this script reproduces your issue?
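
Note the script above only exercises reads. Since several reports in this thread are specifically about writes failing, a variant that also exercises writes (same model, one insert per second) might be worth running too:

// Attempt a write on the same interval; during an election these may
// fail with 'no primary server available' and should recover afterwards.
setInterval(function () {
  M.create({}, function (error) {
    console.log('Got write error: ' + error);
  });
}, 1000);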

No activity for a while, re-open if this is still an issue.

The issue got resolved when I created a subnet in Azure and used local addresses (internal IPs in the Virtual Network / VPN) instead of public-facing IPs or domain names. The issue still happens when connecting from a remote location to the DB hosted in Azure.

@arv1989 I don't think that is the case. I use private IPs for all my nodes and I still have the issue.

@dhatawesomedude I don't mean private IPs; I mean internal IP addresses in the Virtual Network (in my case an Azure Virtual Network that I created).

Would just like to pipe in to say I have this issue as well. My replica set is managed by MongoDB MMS, so I didn't really understand what was happening, other than that the DB would stop responding randomly, and it seemed to correspond with a larger-than-usual number of connections to the web app. Also using Azure. Not really sure what to do about it at this point...

@lukejaylarsen please provide additional context, code samples, etc.

OK, so in an attempt to get more context, I ran the script you posted and switched primaries a number of times with no issue. All I can say for sure is that it's pretty unpredictable for me: the database will just stop responding after a day or a week, and when I check the MMS logs there is a 'Replica Set elected a new primary' notification that looks to be from around the time it stopped working.

I can't guarantee that Mongoose is causing the problem, but the symptoms seem consistent with the other posters here. It happens when running the web server locally as well (although I only have experience with running the replica set on Azure).

Sorry I can't be more specific at this stage. I will monitor the problem closely over the next few days and report back...

I think the issue is related to network connection handling. The issue doesn't happen when the DB server and web server are on the same local network or connected via a virtual network/VPN. It happens when you connect through a public-facing connection: once the network is idle for some time, the connection fails to resume (from the driver's perspective the connection is still active, so it won't create a new one), but queries from the driver never reach the DB.
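
If this idle-connection theory is right, one workaround to try (a sketch, assuming the default mongoose connection is in scope) is an application-level heartbeat so a public-facing connection never sits idle:

// Ping the server every 30 seconds through the underlying native
// driver so the socket never goes idle long enough to be dropped.
setInterval(function () {
  if (mongoose.connection.db) {
    mongoose.connection.db.admin().ping(function (err) {
      if (err) console.log('heartbeat failed: ' + err.message);
    });
  }
}, 30000);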


The issue happens even when no election activity occurred during that time.

Just to share an update... I created a virtual network for my site to connect to, as suggested, and it seems to have solved the problem for my live site. The connection still starts timing out after an unspecified amount of time when using a regular connection (e.g. connecting to the DB from a local server), but restarting the server always fixes it.

Experiencing the same problems.

In my case automatic reconnection is turned off, and I check the mongoose connection on every request.
When I shut down a mongodb node normally, all is OK.
If the failure comes from an external source (hard reset/reboot), it doesn't reconnect.
I suspect the connection state is left unchanged when mongodb exits abnormally.
Will investigate.
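
A per-request connection check like the one described could be an Express middleware along these lines (a sketch; the app variable is assumed, and readyState 1 means connected):

// Reject requests early when mongoose is not connected, instead of
// letting queries buffer and eventually time out.
app.use(function (req, res, next) {
  if (mongoose.connection.readyState !== 1) {
    return res.status(503).send('Database unavailable');
  }
  next();
});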

I am experiencing the same problem. I have a setup (see below), and after I do a second stepDown it becomes read-only. That is:

  • I start with all DBs up, stepDown the primary, and turn off the secondary --> it works OK for reads and writes
  • Then I stepDown again and turn off the secondary --> reads are OK, writes are NOT OK

I then have to restart my application for everything to go back to normal.

Here is my setup:

    "_id" : "rs0", 
    "version" : 15, 
    "members" : [
        {
            "_id" : 0, 
            "host" : "node0:27017", 
            "arbiterOnly" : false, 
            "buildIndexes" : true, 
            "hidden" : false, 
            "priority" : 1, 
            "tags" : {

            }, 
            "slaveDelay" : 0, 
            "votes" : 1
        }, 
        {
            "_id" : 1, 
            "host" : "node1:27017", 
            "arbiterOnly" : false, 
            "buildIndexes" : true, 
            "hidden" : false, 
            "priority" : 1, 
            "tags" : {

            }, 
            "slaveDelay" : 0, 
            "votes" : 1
        }, 
        {
            "_id" : 2, 
            "host" : "node2:30000", 
            "arbiterOnly" : true, 
            "buildIndexes" : true, 
            "hidden" : false, 
            "priority" : 1, 
            "tags" : {

            }, 
            "slaveDelay" : 0, 
            "votes" : 1
        }
    ], 
    "settings" : {
        "chainingAllowed" : true, 
        "heartbeatTimeoutSecs" : 10, 
        "getLastErrorModes" : {

        }, 
        "getLastErrorDefaults" : {
            "w" : 1, 
            "wtimeout" : 0
        }
    }
}

Here are my connection string and options:

        dbHost = "mongodb://myuser:mypass@IP1:27017,IP2:27017/mydb?replicaSet=rs0";
{
            server: { auto_reconnect: true, poolSize: 16, socketOptions: { keepAlive: 120, connectTimeoutMS: 5 * 60 * 60 * 1000, socketTimeoutMS: 5 * 60 * 1000 } },
            replset: { poolSize: 16, socketOptions: { keepAlive: 120, connectTimeoutMS : 5 * 60 * 60 * 1000, socketTimeoutMS: 5 * 60 * 1000 } },
            db: { read_preference: 'primaryPreferred', w: 1, j: true, wtimeout: 10000  }
        }

Here is how I connect

    var conTry = 0;
    var _connect = function(goNext) {
        mongoose.connect(serverConfig.db.host, serverConfig.db.options, function(err) {
            if(err) {
                console.log('Trying... ' + conTry);
                if(conTry <= 10) {
                    conTry++;
                    setTimeout(function() {
                        _connect();
                    }, 1000)
                } else {
                    console.dir(err);
                    console.log("DB connection error");
                }
            } else {
                console.log('DB connected');
                var conn = mongoose.connection;

                conn.on('disconnected', function() {
                    console.log('MongoDB disconnected!');
                    _connect();
                });

                conn.on('SIGINT', function() {
                    conn.close(function () {
                        console.log('Force to close the MongoDB conection');
                        _connect();
                    });
                });
            }
        });
    }

    _connect();

@abernardobr you need to wait for the replica set to re-elect a primary. Writes will fail until the election is done and a new primary is established, and then writes should work normally with no intervention required on your front. MongoDB 3.2 has much faster election times than previous versions, so I'd recommend upgrading if you want to reduce the time your writes fail.
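
For writes that land inside the election window, one option (a sketch, with an illustrative retry budget and delay) is to retry the save after a short pause rather than failing the request outright:

// Retry a save a few times to ride out a primary election.
function saveWithRetry(doc, retriesLeft, callback) {
  doc.save(function (err, saved) {
    if (err && retriesLeft > 0) {
      // Wait a couple of seconds for the new primary to be elected.
      return setTimeout(function () {
        saveWithRetry(doc, retriesLeft - 1, callback);
      }, 2000);
    }
    callback(err, saved);
  });
}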

@vkarpov15 thanks for your reply. I am not sure this is a waiting-period issue; it cannot take hours to re-elect a primary. After it happens twice, it simply will not return to a writable state.

But, for sure, moving to 3.2 is on our todo list.

Are you adding a 3rd member to the repl set after the first step down? If you have a 3-node replica set and take down 2 members, writes will never work: in order to become primary, a node must be able to reach a strict majority of the replica set (2 of 3 nodes in this case).

@abernardobr One thing I noticed about your code is that you never set conTry back to 0 upon successful connection, so at some point your code will stop trying to reconnect because you hit your conTry limit.
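
A patched version of that helper (same structure, just resetting the counter on success) might look like:

var conTry = 0;
var _connect = function () {
  mongoose.connect(serverConfig.db.host, serverConfig.db.options, function (err) {
    if (err) {
      console.log('Trying... ' + conTry);
      if (conTry <= 10) {
        conTry++;
        setTimeout(_connect, 1000);
      } else {
        console.dir(err);
        console.log('DB connection error');
      }
      return;
    }
    conTry = 0; // reset so a later disconnect gets a fresh retry budget
    console.log('DB connected');
    // once() instead of on(), so handlers don't stack across reconnects
    mongoose.connection.once('disconnected', function () {
      console.log('MongoDB disconnected!');
      _connect();
    });
  });
};

_connect();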

I am also facing the same issue, using the Casbah Mongo driver in Scala with MongoDB version 3.0.12. In a 3-member replica set, if the primary reboots and one of the other nodes becomes primary, the driver goes into a loop with the logs below.

If I force the earlier primary, which is now a secondary, to become primary again, things start working.

2016-11-25 11:59:56.755 INFO [pool-2-thread-1] o.m.d.cluster - No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, all=[ServerDescription{address=ip-172-21-2-184:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 12]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=45766945, setName='s-1', canonicalAddress=ip-172-21-2-184:27017, hosts=[ip-172-22-3-210:27017, ip-172-21-3-177:27017, ip-172-21-2-184:27017], passives=[], arbiters=[], primary='ip-172-22-3-210:27017', tagSet=TagSet{[]}}, ServerDescription{address=ip-172-21-3-177:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 12]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=1451901, setName='s-1', canonicalAddress=ip-172-21-3-177:27017, hosts=[ip-172-22-3-210:27017, ip-172-21-3-177:27017, ip-172-21-2-184:27017], passives=[], arbiters=[], primary='ip-172-22-3-210:27017', tagSet=TagSet{[]}}, ServerDescription{address=ip-172-22-3-210:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.SocketTimeoutException: connect timed out}}]}. Waiting for 30000 ms before timing out
2016-11-25 12:00:26.756 ERROR [pool-2-thread-1] c.v.v.f.s.SnapshotHandlersRegistry - Failed executing VmStorageHandler.onPropertyWrite for vm for request: 00ec0a2c-2090-4ee1-af87-f9022302b5e2
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, servers=[{address=ip-172-21-2-184:27017, type=REPLICA_SET_SECONDARY, roundTripTime=0.6 ms, state=CONNECTED}, {address=ip-172-21-3-177:27017, type=REPLICA_SET_SECONDARY, roundTripTime=1.4 ms, state=CONNECTED}, {address=ip-172-22-3-210:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.SocketTimeoutException: connect timed out}}]

Is there a configuration parameter I am missing, or is this a bug?

I have a similar issue.

I used this Docker setup to create a replica set:
http://www.sohamkamani.com/blog/2016/06/30/docker-mongo-replica-set/

I connect to mongo first.

When I pause the primary replica set container, rs.status() shows that one of the other members becomes primary, so mongo itself has already selected a new primary.

But there is no response from mongoose when trying to access it via express.

Even when I unpause the previous container it doesn't work; it disconnects and closes my connection some time later.

I haven't set any replica set configuration when connecting to mongo (only the mongo URI). I expect the default to be the same as the mongo client with HA, right?

https://mongodb.github.io/node-mongodb-native/driver-articles/mongoclient.html
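
One thing worth ruling out (an assumption on my part, not a confirmed fix): name the replica set explicitly in the URI instead of relying on driver defaults, for example:

// 'rs0' is a placeholder for the actual set name (shown as "set" in rs.status()).
mongoose.connect('mongodb://host1:27017,host2:27017,host3:27017/test?replicaSet=rs0');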

There was a bug with the mongodb driver that affected connections to new replica set primaries. It was fixed in https://jira.mongodb.org/browse/NODE-818

However, mongoose versions from 4.6.0 to 4.7.2 inclusive were using a version of the mongodb driver that had the bug.

This affected my project, which is running mongodb 3.0.12. If you have issues where mongoose fails to reconnect after a step down, check what version of mongoose you're using.

Hopefully this helps someone.
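
A quick way to check which versions a project actually resolves (a small sketch; run it from the project root):

var mongoose = require('mongoose');

// The affected range was mongoose 4.6.0-4.7.2 inclusive, which bundled
// the driver bug fixed in NODE-818.
console.log('mongoose:', mongoose.version);
console.log('mongodb driver:', require('mongodb/package.json').version);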
