Mongoose: Cursor not found - while iterating over sharded collection

Created on 23 May 2018 · 16 Comments · Source: Automattic/mongoose

Do you want to request a feature or report a bug?
Bug

What is the current behavior?
Hi all,
I ran into a very weird issue. I have a MongoDB cluster (latest 3.6.5) with 2 shards, each a replica set.
One sharded collection with about 300,000 entries randomly fails with

Unhandled rejection MongoError: Cursor not found (namespace: 'hocknfinder.users', id: 1684357377346700205).
    at /home/simon/Dev/Web/hokify-server/node_modules/mongodb-core/lib/connection/pool.js:598:61
    at authenticateStragglers (/home/simon/Dev/Web/hokify-server/node_modules/mongodb-core/lib/connection/pool.js:516:16)
    at Connection.messageHandler (/home/simon/Dev/Web/hokify-server/node_modules/mongodb-core/lib/connection/pool.js:552:5)
    at emitMessageHandler (/home/simon/Dev/Web/hokify-server/node_modules/mongodb-core/lib/connection/connection.js:309:10)
    at Socket.<anonymous> (/home/simon/Dev/Web/hokify-server/node_modules/mongodb-core/lib/connection/connection.js:452:17)
    at Socket.emit (events.js:182:13)
    at Socket.EventEmitter.emit (domain.js:442:20)
    at addChunk (_stream_readable.js:277:12)
    at readableAddChunk (_stream_readable.js:262:11)
    at Socket.Readable.push (_stream_readable.js:217:10)
    at TCP.onread (net.js:638:20)

I set cursorTimeoutMillis on mongos to 30 minutes (18000000) quite a while ago to overcome this kind of issue, but now the error is back, sometimes within seconds (it never takes longer than 5 minutes until the exception occurs). The problem is: I'm unable to narrow it down.

The failing script opens a find() cursor and iterates over it. Every row is pushed into a data array; as soon as this array has 25,000 entries, it is sent to another server via HTTP and emptied again. While the upload is running, the cursor fills it up again.

I tried to narrow it down, but as soon as I change the upload code (e.g. not using the library and just doing a regular HTTP request), things behave differently. Sometimes the script even finishes without any errors. Therefore I guess it must be related to the network somehow. It seems that every time the "PUT" HTTP call completes, it somehow confuses the MongoDB driver. The "Cursor not found" error is most of the time followed by the "upload successful" message.

What I tried so far:

  • Setting poolSize to 1: no change.
  • Increasing any kind of timeout.
  • Setting debug to true to see what's going on ==> then it does not fail, but it never finishes either! It gets stuck in an "add getMore to queue" call or something.
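For reference, the driver knobs mentioned above are all passed through the mongoose connection options. A minimal sketch of such an options object (the values here are examples taken from the attempts described in this thread, not recommendations; option names match the mongodb driver 3.x API):

```javascript
// Example connection options for mongoose 5.x / mongodb driver 3.x.
// Values are illustrative only, taken from the attempts described above.
const opts = {
  poolSize: 1,             // tried: a single connection per pool
  connectTimeoutMS: 30000,
  socketTimeoutMS: 360000, // raised well above the expected cursor lifetime
  reconnectTries: 30,
  reconnectInterval: 1000
};

// Would be used like:
// mongoose.connect('mongodb://mongos-host:27017/mydb', opts);
```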

If the current behavior is a bug, please provide the steps to reproduce.

var Promise = require('bluebird'); // bluebird
var mongoose = require('mongoose');
mongoose.Promise = Promise;
mongoose.connect(database.url, database.opts);

const MyTable = models('user'); // a mongoose model

const bulkSize = 25000;
const request = require('request-promise-native');
const MongooseStream = MyTable.find({})
    .lean()
    .cursor({ batchSize: bulkSize });

let data = [];
console.log("start run...")
async function uploadData(commited, csv, retried) {
    console.log(`uploading part ${commited}... (${csv.length/1024/1024} mb)`);

    const req = {
        url: 'https://hokify.com/non-existing-endpoint',
        headers: { 'Content-Type': 'text/csv' },
        body: csv,
        method: 'PUT'
    };

    return request(req)
        .catch(() => {}) // this fails for sure due to the "fake" endpoint
}

let commited = 0;

return MongooseStream.eachAsync(async (obj) => {
    if(data.length % 300 == 0) console.log(Math.round(data.length / bulkSize * 100) + "%\r");

    data.push(obj);

    if (data.length > bulkSize) {
        const clonedData = data;
        data = [];
        console.log('processing...', new Date(), clonedData.length);
        return uploadData(++commited, JSON.stringify(clonedData))
            .then(() => {
                console.log("processing partially done");
            })
    }
}, {parallel: 10})
    .then(async () => {
        console.log('ALL DONE', data.length + ' entries left to process');
    });

This script behaves differently depending on the timeout values:
1.) If I lower the timeout values to:
keepAlive: 30000,
connectTimeoutMS: 3000,
socketTimeoutMS: 36000,
reconnectTries: 3,
reconnectInterval: 100,
it NEVER finishes; it gets stuck.

2.) If I set the timeout values to:
connectTimeoutMS: 30000,
socketTimeoutMS: 360000,
reconnectTries: 30,
reconnectInterval: 1000,

it sometimes finishes, sometimes hangs, and very rarely I can reproduce the Cursor not found error.

3.) Increasing the values from here on doesn't change much.

What is the expected behavior?
The script should just finish fine; when all is good, the total execution takes no more than 7 minutes. Therefore there should never be any kind of cursor timeout.

Please mention your node.js, mongoose and MongoDB version.
node v10.1.0 (but also happening with node 8)
mongoose 5.1.2
mongodb 3.6.5

needs repro script

All 16 comments

Update 1:

  • Even if I add MongooseStream.addCursorFlag('noCursorTimeout', true) to a created cursor, it keeps failing with "Cursor not found".
  • It's not possible to debug this issue. I run the script with the only difference being these 3 lines:
const Logger = require('mongodb').Logger;
Logger.setLevel('debug');
Logger.filter('class', ['Server']);

and it works perfectly. As soon as I remove these 3 lines, it fails again. Either this is related to some "nextTick" thing, or to something else regarding networking. But why does it just not fail if I enable logging? The only difference would be the slightly slower execution...

Update 2:
I finally managed to trigger it. I'm not quite sure if it is related, but I limited the CPU usage of the node process.
What I get is the following last few lines:

[DEBUG-Cursor:28416] 1527097368034 schedule getMore call for query [{"bson":{},"ns":"hocknfinder.$cmd","query":{"$query":{"find":"users","filter":{},"projection":{"internal.campaign":0,"internal.manualReviewResult":0,"internal.scorings":0},"batchSize":25000,"returnKey":false,"showRecordId":false,"$clusterTime":{"clusterTime":"6558832231165460488","signature":{"hash":"nefFuBLx8mYI2Juo3BW+vgD78UY=","keyId":"6522044801188102147"}},"lsid":{"id":"feZw+h8/T/CvBSed98lSRA=="}},"$readPreference":{"mode":"primaryPreferred"}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":179,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":true,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}] { type: 'debug',
  message: 'schedule getMore call for query [{"bson":{},"ns":"hocknfinder.$cmd","query":{"$query":{"find":"users","filter":{},"projection":{"internal.campaign":0,"internal.manualReviewResult":0,"internal.scorings":0},"batchSize":25000,"returnKey":false,"showRecordId":false,"$clusterTime":{"clusterTime":"6558832231165460488","signature":{"hash":"nefFuBLx8mYI2Juo3BW+vgD78UY=","keyId":"6522044801188102147"}},"lsid":{"id":"feZw+h8/T/CvBSed98lSRA=="}},"$readPreference":{"mode":"primaryPreferred"}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":179,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":true,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}]',
  className: 'Cursor',
  pid: 28416,
  date: 1527097368034 }
[DEBUG-Cursor:28416] 1527097368064 schedule getMore call for query [{"bson":{},"ns":"hocknfinder.$cmd","query":{"$query":{"find":"users","filter":{},"projection":{"internal.campaign":0,"internal.manualReviewResult":0,"internal.scorings":0},"batchSize":25000,"returnKey":false,"showRecordId":false,"$clusterTime":{"clusterTime":"6558832231165460488","signature":{"hash":"nefFuBLx8mYI2Juo3BW+vgD78UY=","keyId":"6522044801188102147"}},"lsid":{"id":"feZw+h8/T/CvBSed98lSRA=="}},"$readPreference":{"mode":"primaryPreferred"}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":179,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":true,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}] { type: 'debug',
  message: 'schedule getMore call for query [{"bson":{},"ns":"hocknfinder.$cmd","query":{"$query":{"find":"users","filter":{},"projection":{"internal.campaign":0,"internal.manualReviewResult":0,"internal.scorings":0},"batchSize":25000,"returnKey":false,"showRecordId":false,"$clusterTime":{"clusterTime":"6558832231165460488","signature":{"hash":"nefFuBLx8mYI2Juo3BW+vgD78UY=","keyId":"6522044801188102147"}},"lsid":{"id":"feZw+h8/T/CvBSed98lSRA=="}},"$readPreference":{"mode":"primaryPreferred"}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":179,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":true,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}]',
  className: 'Cursor',
  pid: 28416,
  date: 1527097368064 }
Unhandled rejection MongoError: Cursor not found (namespace: 'hocknfinder.users', id: 1684357378727690897).

... error

and immediately afterwards around 10 more getMore calls:

[DEBUG-Cursor:28416] 1527097368105 schedule getMore call for query [{"bson":{},"ns":"hocknfinder.$cmd","query":{"$query":{"find":"users","filter":{},"projection":{"internal.campaign":0,"internal.manualReviewResult":0,"internal.scorings":0},"batchSize":25000,"returnKey":false,"showRecordId":false,"$clusterTime":{"clusterTime":"6558832231165460488","signature":{"hash":"nefFuBLx8mYI2Juo3BW+vgD78UY=","keyId":"6522044801188102147"}},"lsid":{"id":"feZw+h8/T/CvBSed98lSRA=="}},"$readPreference":{"mode":"primaryPreferred"}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":179,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":true,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}] { type: 'debug',
  message: 'schedule getMore call for query [{"bson":{},"ns":"hocknfinder.$cmd","query":{"$query":{"find":"users","filter":{},"projection":{"internal.campaign":0,"internal.manualReviewResult":0,"internal.scorings":0},"batchSize":25000,"returnKey":false,"showRecordId":false,"$clusterTime":{"clusterTime":"6558832231165460488","signature":{"hash":"nefFuBLx8mYI2Juo3BW+vgD78UY=","keyId":"6522044801188102147"}},"lsid":{"id":"feZw+h8/T/CvBSed98lSRA=="}},"$readPreference":{"mode":"primaryPreferred"}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":179,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":true,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}]',
  className: 'Cursor',
  pid: 28416,
  date: 1527097368105 }

Any ideas? Is it possible that this is related to "session" support? Because with mongoose 5.0.14 this is not happening.

Hi. We're seeing the same thing with Mongoose 5.0.18 and not with Mongoose 5.0.14 for a standalone server on Mongo 3.6.4.

A rough sketch of our script:

// make mongoose connection, wait for it to open, etc.

// get model from connection

myModel
  .find()
  .populate(....)
  .exec((err, data) => {
    // occasional 'MongoError: cursor id ____ not found' error
  })

The "cursor not found" error was happening randomly for us even though the query itself was taking less than a second to return. We played around with cursorTimeoutMillis and noCursorTimeout but still kept seeing the same error randomly.

Eventually we stumbled upon this ticket - https://jira.mongodb.org/browse/SERVER-34810
We were then able to reproduce the error more consistently by issuing/spamming the db.adminCommand({refreshLogicalSessionCacheNow: 1}); command after starting our script.

Given the above, our theory is that it's related to sessions as you hypothesized @simllll. Curious to see if refreshing the session cache also makes a difference for you. For now we may just stick with an older version of Mongoose.

Hi nguo,
thanks for your reply. I just tried to find some similarity, but I guess this scenario is specific to a single mongo host only. I'm connecting to mongo through mongos, and issuing the refresh command returns only:

db.adminCommand({refreshLogicalSessionCacheNow: 1});
{
    "ok" : 0,
    "errmsg" : "no such cmd: refreshLogicalSessionCacheNow",
}

in return.

Any idea how I can check whether this is related? Unfortunately, I didn't find any documentation about refreshLogicalSessionCacheNow. Is there an equivalent command for mongos?

Thanks

@simllll here are a couple of other ideas you might try: reduce the batchSize, or get rid of the parallel option to eachAsync(). Also, approximately how long does the HTTP upload take?
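The buffering logic in the repro script above is independent of the cursor, so extracting it into a small helper makes it easy to vary the batch size when testing these suggestions. A sketch (the helper and its names are mine, not part of the mongoose API):

```javascript
// Collects items into fixed-size batches and invokes `flush` for each
// full batch; flushRemainder() handles the final partial batch.
function makeBatcher(batchSize, flush) {
  let buffer = [];
  return {
    async push(item) {
      buffer.push(item);
      if (buffer.length >= batchSize) {
        const batch = buffer;
        buffer = [];
        await flush(batch);
      }
    },
    async flushRemainder() {
      if (buffer.length > 0) {
        const batch = buffer;
        buffer = [];
        await flush(batch);
      }
    }
  };
}

// With the cursor from the repro script, usage would look roughly like:
// const batcher = makeBatcher(25000, batch => uploadData(++commited, JSON.stringify(batch)));
// await MongooseStream.eachAsync(doc => batcher.push(doc)); // no `parallel` option
// await batcher.flushRemainder();
```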

@vkarpov15 thanks, I already tried parallel: 1 and a bunch of different batchSizes. I guess the issue here is either mongodb related or somewhere "deep" in the driver. I tried to reproduce it with the CPU limitation today but couldn't figure out any relation to it. But it keeps happening. Especially weird because I've set the noCursorTimeout flag, which I also debugged and which is indeed part of the query.

And regarding your question about the upload: it takes about 50% of the time until the next data set is complete, around 30-40 seconds.

Is there any option to disable the "session" support temporarily to see if this is related?
EDIT: I just remembered your post about .find({}, null, { skipSessions: true }). I'm trying right now whether it makes any difference.

EDIT2: Seems not related to sessions (in case skipSessions does what I think it does): Unhandled rejection MongoError: Cursor not found (namespace: 'hocknfinder.users', id: 6724850416504880802) after not even one minute.

EDIT3: I had never managed to get the cursor timeout within the first "batch"; it comes in the second or any following one. Therefore it even helps to increase the batchSize, quite the opposite of what I expected. Well, I just increased the batch size to over 60,000 and then I also see it in the first batch, so my last statement is no longer true.

EDIT4: I can definitely verify that it is not happening with mongoose 5.0.14. I'm now trying to track it down to the mongodb-core or mongodb package.

Thanks

Very interesting outcome:

mongoose 5.0.14 works
mongoose 5.1.2 doesn't

running mongoose 5.1.2 with mongodb & mongodb-core 3.0.4 works
running mongoose 5.1.2 with mongodb & mongodb-core 3.0.5 results in:
Unhandled rejection MongoError: Cursor session id (none) is not the same as the operation context's session id (0b03db89-0000-4653-a369-3a744c84bfa7 - J9HIfMrQRlAhC0lvltFaplCk1dIVYfrjXGoZMXBckbE=)
Fixing this error manually by patching lib/bulk/common.js with https://github.com/mongodb-js/mongodb-core/compare/v3.0.7...v3.0.8#diff-95549f1db7facd9411362e6554f63e2f
results again in Unhandled rejection MongoError: Cursor not found (namespace: 'hocknfinder.users', id: 7904791698994979006).

So far, all I can say is that it has to do with mongodb-core. If I look at the changesets from 3.0.4 to 3.0.5 (https://github.com/mongodb-js/mongodb-core/compare/v3.0.4...v3.0.5), the only changes in there are about closing sessions. Can anyone who understands the code have a look and see what could cause the weird behaviour? Thanks :)

Simon

I've opened a JIRA issue. The issue is definitely somewhere in mongo's implicit session support, which was added between 3.0.4 and 3.0.5 of the native driver.

https://jira.mongodb.org/browse/NODE-1482

Thanks

Thanks for digging into this @simllll , I'll look into this as well. I have a hypothesis that maybe this has something to do with mongoose not passing down an implicitly created session when calling next() with eachAsync().

Another workaround for you to try would be to use the for-loop mechanism for iterating through a mongoose cursor. Since you're using async/await, this should be pretty easy to try, and would help see if it's worth digging into eachAsync().
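The for-loop mechanism referred to here can be sketched as follows. The stub cursor below stands in for a real mongoose QueryCursor, whose next() likewise resolves to null once the cursor is exhausted; the helper names are mine, for illustration only:

```javascript
// Stub standing in for `MyTable.find({}).lean().cursor()`.
// A real mongoose QueryCursor's next() also resolves to null at the end.
function makeStubCursor(docs) {
  let i = 0;
  return {
    next: async () => (i < docs.length ? docs[i++] : null)
  };
}

// The suggested pattern: await next() in a plain for loop instead of
// eachAsync(), so each document is handled strictly sequentially.
async function iterate(cursor, handle) {
  for (let doc = await cursor.next(); doc != null; doc = await cursor.next()) {
    await handle(doc);
  }
}

// Usage sketch:
// await iterate(makeStubCursor([{ a: 1 }, { a: 2 }]), async doc => console.log(doc));
```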

Thanks @vkarpov15 for your hints, I just tried the "for loop" method, but unfortunately it didn't change anything, still runs into "Cursor not found".

Thanks
Simon

Had the same issue on a single instance.
Downgrading to mongoose 5.0.8 & mongodb 3.0.3 fixed it for now; waiting for the core fix with 5.1.3 & 3.0.8 (respectively).

@simllll , thanks for digging into this issue!

@vkarpov15 looks like MongoDB fixed the underlying issue (in versions 3.6.6, 4.0.1, 4.1.1) https://jira.mongodb.org/browse/SERVER-34810

I'm also watching this issue closely. Right now there is no new release of mongodb though. As soon as they release a new version, I will post an update how it turns out.

Update: It looks like this bug was not related to SERVER-34810. I'm trying to reproduce it with the latest mongoose and just can't. Looks like the mongodb-core update solved this issue? So in the end it was maybe indeed "only" a driver bug?

Interesting, not sure why this issue went away but I'm glad it did :+1: I'll close this for now, re-open if you manage to repro it again.

Yes :) thanks for your support! Still no more issues on my side.
