Waterline version:0.11.2
Node version:v6.3.1
NPM version:3.10.5
Operating system:Ubuntu 16.04.1 LTS
Hi all,
Yep, I too roll my eyes each time someone says they've found a Sails/Waterline memory leak, but this time... I've found one. Though I'm pretty sure it's "caused" by a dependency, it only manifests itself in a production environment.
What I'm trying to do is migrate our DB from MySQL to PostgreSQL, so I'm running a loop over one model with 1M+ rows and pushing the records into another model. Each time I create a DB connection, another instance of tough-cookie is loaded into memory, and after a few thousand records Node runs out of memory. A heap dump reveals a few hundred thousand copies of tough-cookie's TLD array: https://github.com/SalesforceEng/tough-cookie/blob/c11a2d11d12348a35ef595c809e30e641a804a7d/lib/pubsuffix.js#L95-L96
This large-scale migration runs fine locally, but as soon as I'm accessing remote DBs the problem manifests - which is an issue in production.
Memory usage in the log below is measured with process.memoryUsage().rss:
*** Starting Sails in production ***
2016-08-04T01:34:02.835Z - info: ---- TIME --- MEM --- # --- Record ID
2016-08-04T01:34:02.835Z - info: ---- 01s --- 203MB --- 50 --- kwthz1c7Zj
2016-08-04T01:34:03.781Z - info: ---- 02s --- 221MB --- 100 --- CIjVS9wtxO
2016-08-04T01:34:04.654Z - info: ---- 03s --- 238MB --- 150 --- 1FUnEpVpOv
2016-08-04T01:34:05.546Z - info: ---- 04s --- 256MB --- 200 --- UMkcytCmhz
....
2016-08-04T01:34:19.627Z - info: ---- 18s --- 466MB --- 1000 --- tpban03BNs
2016-08-04T01:34:21.024Z - info: ---- 19s --- 507MB --- 1050 --- umdBBfLEjC
2016-08-04T01:34:21.896Z - info: ---- 20s --- 524MB --- 1100 --- yEmHHxAWyv
2016-08-04T01:34:22.796Z - info: ---- 21s --- 542MB --- 1150 --- rgDK9scqum
....
2016-08-04T01:34:38.375Z - info: ---- 37s --- 676MB --- 2000 --- txRr4nf0WW
2016-08-04T01:34:39.691Z - info: ---- 38s --- 712MB --- 2050 --- exypA9PjVC
2016-08-04T01:34:40.582Z - info: ---- 39s --- 729MB --- 2100 --- q4yUjZ4ct5
2016-08-04T01:34:41.482Z - info: ---- 40s --- 746MB --- 2150 --- 5AoIUcFSK1
....
2016-08-04T01:37:20.393Z - info: ---- 3m 19s --- 2531MB --- 9450 --- T7Pbquad
2016-08-04T01:37:21.305Z - info: ---- 3m 20s --- 2541MB --- 9500 --- hRsfquad
2016-08-04T01:37:22.337Z - info: ---- 3m 21s --- 2553MB --- 9550 --- Z4Ikquad
2016-08-04T01:37:23.296Z - info: ---- 3m 22s --- 2564MB --- 9600 --- rVGiquad
2016-08-04T01:37:24.372Z - info: ---- 3m 23s --- 2577MB --- 9650 --- T1nXquad
2016-08-04T01:37:25.366Z - info: ---- 3m 24s --- 2588MB --- 9700 --- oMsyquad
2016-08-04T01:37:26.381Z - info: ---- 3m 25s --- 2599MB --- 9750 --- PNzMquad
2016-08-04T01:37:27.397Z - info: ---- 3m 26s --- 2611MB --- 9800 --- rsjsquad
2016-08-04T01:37:28.377Z - info: ---- 3m 27s --- 2622MB --- 9850 --- nkJBquad
2016-08-04T01:37:29.434Z - info: ---- 3m 28s --- 2634MB --- 9900 --- lAT5quad
2016-08-04T01:37:30.463Z - info: ---- 3m 29s --- 2651MB --- 9950 --- L5yXquad
2016-08-04T01:37:31.514Z - info: ---- 3m 30s --- 2668MB --- 10000 --- Bml9quad
The code:
doMigration: function (req, res) {
// doMigration moves records from one model to another
//
//
// Guest is a model backed by MySQL
// GuestPostgres is a model backed by PostgreSQL to which I'm migrating
//
// fast === false does row by row and property by property to better highlight invalid data
res.send('Started');
let progress = 0;
let timeBegin = Date.now();
let fast = true;
let leftPad = function (input = '', length = 5, char = ' ') {
return (Array(length).join(char) + input).slice(-length);
};
let bytesToString = function (bytes) {
if (bytes > (1024 * 1024)) {
return Math.floor(bytes / 1048576) + 'MB';
} else {
return Math.floor(bytes / 1024) + 'KB';
}
};
let duration = function (timeBegin) {
let seconds = Math.floor((Date.now() - timeBegin) / 1000);
if (seconds > 3600) {
return Math.floor(seconds / 3600) + 'h ' + Math.floor((seconds % 3600) / 60) + 'm ' + leftPad(seconds % 60, 2, '0') + 's';
} else if (seconds > 60) {
return Math.floor(seconds / 60) + 'm ' + leftPad(seconds % 60, 2, '0') + 's';
} else {
return leftPad(seconds, 2, '0') + 's';
}
};
let updateIndvProps = function (guest, callbackUpdate) {
if (fast) {
/////////////////////////////////////////////////////////////////////////////////////////////
// Update whole guest (FAST METHOD)
GuestPostgres.update(
{
TMP_id: guest.TMP_id,
},
guest,
updateError => callbackUpdate(updateError)
);
/////////////////////////////////////////////////////////////////////////////////////////////
} else {
/////////////////////////////////////////////////////////////////////////////////////////////
// One by one populate each column (SLOW METHOD BUT SHOWS ERRORS WELL)
async.eachLimit(
_.without(Object.keys(guest), 'TMP_id'),
1,
(prop, callbackProp) => {
let updateObj = {};
updateObj[prop] = guest[prop];
GuestPostgres.update(
{
TMP_id: guest.TMP_id,
},
updateObj,
(updateError) => {
if (updateError) sails.log.info('error on ' + guest.pool + ' - ' + guest.TMP_id + ' - prop: ' + prop + ' - ' + JSON.stringify(updateObj));
return callbackProp(updateError);
}
);
},
error => callbackUpdate(error)
);
/////////////////////////////////////////////////////////////////////////////////////////////
}
};
let upsertGuest = function (guest, callbackMaster) {
// Upsert - create if not exists
GuestPostgres.findOne({
TMP_id: guest.TMP_id,
}).exec((error, guestPoorFind) => {
if (error) return callbackMaster(error);
if (!guestPoorFind) {
// Doesn't exist yet in PG so first create
if (fast) {
// FAST
GuestPostgres.create(guest).exec(error => callbackMaster(error));
} else {
// SLOW METHOD
GuestPostgres.create({
TMP_id: guest.TMP_id,
}).exec(error => {
if (error) return callbackMaster(error);
updateIndvProps(guest, callbackMaster);
});
}
} else {
updateIndvProps(guest, callbackMaster);
}
});
};
//
// the main meat of the migration
//
// Get all clients
Client.find({}).sort('pool ASC').exec((error, clients) => {
async.eachLimit(
clients,
1, // one client at a time
(client, callbackClient) => {
let guestsChunkSize = 1000;
let guestsReturnedOnThisFetch = 0;
let guestsReturnedInTotal = 0;
async.doWhilst(
doWhilstCB => {
// Get all entries for that client
Guest.find({
pool: client.pool,
})
.limit(guestsChunkSize)
.skip(guestsReturnedInTotal)
.exec((error, guests) => {
guestsReturnedOnThisFetch = guests.length;
guestsReturnedInTotal += guests.length;
// calc longest client.pool length for leftPad
// sails.log.info(_.chain(days).map(client => client.pool.length).reduce((a, b) => a > b ? a : b).valueOf());
async.eachLimit(
guests,
// 1, // one at a time
20, // twenty guests at a time
(guest, callbackMaster) => {
progress++;
if (progress % (fast ? 50 : 1) === 0) {
let memUsed = bytesToString(process.memoryUsage().rss);
sails.log.info('----' + leftPad(duration(timeBegin), 10) + ' --- ' + leftPad(memUsed, 6) + ' --- ' + leftPad(progress, 7) + ' --- ' + leftPad(guest.TMP_id, 40) + ' --- ' + guest.pool);
}
return upsertGuest(guest, callbackMaster);
},
error => doWhilstCB(error)
);
});
},
() => guestsReturnedOnThisFetch === guestsChunkSize,
error => callbackClient(error)
);
},
error => {
if (error) {
console.error(error);
} else {
sails.log.info('All done!');
sails.log.info('Time taken: ' + ((Date.now() - timeBegin) / 1000) + 's');
}
}
);
});
},
Things I have looked into:
- /myController/doMigration is triggered via a single cURL command from bash, receiving an immediate res.send('Started'); so no issues with sockets / sessions.
- config/env/production.js external hostnames have been tried with both hostname and IP address, no difference.
- If I tunnel through a bastion host, to make all DBs appear as if they were on 'localhost', while in production mode the leak persists.
Only when I tunnel through a bastion host AND run in development mode does the leak go away, and the 1M+ row migration can run to completion. There is no "leak per row": memory fluctuates between 200-450MB, but GC picks everything up.
For example, with NODE_ENV=development, RAM usage after 200,000 rows is fine:
....
2016-08-05T11:36:53.315Z - info: ---- 43m 28s --- 247MB --- 200000 --- 7zd8lJNBir --- evasbackpackers
....
A hunch would suggest it could be related to https://github.com/SalesforceEng/tough-cookie/issues/48, as the heap dump shows the same data.
What I can't understand is how/why running with NODE_ENV=production vs. development would make a difference.
What more information can I offer? Heapdumps? Spin up some RDS instances and populate them with garbage data to replicate?
--- Wulf
Hi @wulfsolter! It looks like you may have removed some required elements from the initial comment template, without which I can't verify that this post meets our contribution guidelines. To re-open this issue, please copy the template from here, paste it at the beginning of your initial comment, and follow the instructions in the text. Then post a new comment (e.g. "ok, fixed!") so that I know to go back and check.
Sorry to be a hassle, but following these instructions ensures that we can help you in the best way possible and keep the Sails project running smoothly.
_If you feel this message is in error, or you want to debate the merits of my existence (sniffle), please contact [email protected]._
@sailsbot I only made an edit for lingual clarity. Take a chill pill :p
Hi @wulfsolter! It looks like you may have removed some required elements from the initial comment template, without which I can't verify that this post meets our contribution guidelines. To re-open this issue, please copy the template from here, paste it at the beginning of your initial comment, and follow the instructions in the text. Then post a new comment (e.g. "ok, fixed!") so that I know to go back and check.
Sorry to be a hassle, but following these instructions ensures that we can help you in the best way possible and keep the Sails project running smoothly.
_If you feel this message is in error, or you want to debate the merits of my existence (sniffle), please contact [email protected]._
Seriously @sailsbot .... give a guy a break
@wulfsolter Thanks for taking the time to write this up. I'll take a look at what tough-cookie even is and get back to you.
I'm the tough cookie.
@sailsbot you are not either
@wulfsolter so it looks like tough-cookie is a dependency of request, which is a dependency of codeclimate-test-reporter, which is a dev dependency of waterline-adapter-tests. But I can't find anywhere that tough-cookie is a production dependency of Waterline or either of those two adapters. In Sails, it's a dependency of request again-- but request is just a devDependency of Sails. So yeah, I'm not sure what's going on there :/
Heap dumps are tricky (I remember getting endlessly confused by ours for Treeline when we were living in California.) The cleanest way I've been able to get to the bottom of leaks thus far is just by forcing the gc() to run and comparing valleys. Would you mind giving the process I laid out in https://github.com/balderdashy/sails/issues/3782#issuecomment-238679052 a go?
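The "compare valleys" approach from that issue can be sketched roughly like this (function and variable names here are made up for illustration; run with node --expose-gc so global.gc exists):

```javascript
// Sketch of the "force gc(), compare valleys" leak check.
// Run with: node --expose-gc leak-check.js
function rssAfterGc() {
  if (global.gc) global.gc(); // only exposed when run with --expose-gc
  return process.memoryUsage().rss;
}

const baseline = rssAfterGc();

// ...perform one batch of the suspect work here (e.g. one chunk of the
// migration), then measure again:
const afterBatch = rssAfterGc();

// If the post-gc "valley" keeps climbing batch after batch, something is
// being retained; if it returns to baseline, it was just GC pressure.
console.log('RSS delta after gc():', afterBatch - baseline, 'bytes');
```

The key point is to compare the post-gc lows rather than the raw RSS peaks, since a healthy process under load can legitimately spike between collections.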
@wulfsolter something that just jumped out at me is this:
// Get all clients
Client.find({}).sort('pool ASC').exec((error, clients) => {
If you have 1 million rows in your production database, and assuming the table in question is the one for your Client model, then this problem is likely because it's fetching _all 1 million of those records_ and trying to fit them in memory.
Assuming Client isn't the one with tons of records, though, it's probably something else. I see you're paginating through guests:
// Get all entries for that client
Guest.find({ pool: client.pool })
.limit(guestsChunkSize)
.skip(guestsReturnedInTotal)
.exec((error, guests) => {...
...but that bit all seems reasonable enough to me.
@wulfsolter You might try a variation on the gc() test I linked to in that issue; e.g. next to where you have:
if (progress % (fast ? 50 : 1) === 0) {
You could add something like:
if (progress % 1000 === 0) {
if (!global.gc) { return callbackMaster(new Error('Garbage collector must be enabled to run this exciting code. But gc() is not exposed. Try lifting your app via \'node --expose-gc app.js\'.')); }
global.gc();
var ramReportAfterGc = process.memoryUsage();
console.log('the garbage man has arrived, and here\'s what she has to say:', '\n', 'After gc():', ramReportAfterGc);
}
Sorry, I should've included my explorations into garbage collection - GC made no difference.
Client has a few hundred, Guest is the one with the loads of records, so my code is pulling an individual client, then one by one looping through the client's guests.
I'll get around to making a repro repo shortly - hopefully sooner than @sailsbot's threatened 30-day issue closure. Once I'd done the bastion-host workaround and the migration completed without overflowing, my immediate task was done, but I can't help suspecting that some of the memory pressure I've felt elsewhere has been related to model-intensive code.
I hope you have a fabulous pre-birthday week @mikermcneil, and that @sailsbot gets you something nice :)
@wulfsolter cool, let us know if you're able to isolate the problem. And thanks!
(btw one other random thing that sticks out-- if guest.TMP_id was ever set to something that might match many records, then .update() would try to load all of those Guest records into RAM. That would only be possible if the data was different though. So if you tested in development mode with the same data, then there's no way it could be that-- just don't want to assume)
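To illustrate that concern with a plain in-memory analogue (hypothetical data, not the actual Waterline internals): an update criterion behaves like a filter, so a non-unique TMP_id would pull every matching record into RAM, not just one.

```javascript
// Plain-JS analogue of "update() loads every record matching the criteria".
// Hypothetical data: two guests accidentally share the same TMP_id.
const guests = [
  { TMP_id: 'a1', name: 'x' },
  { TMP_id: 'a1', name: 'y' }, // duplicate id: both rows would be affected
  { TMP_id: 'b2', name: 'z' },
];

// The criteria { TMP_id: 'a1' } select ALL matching rows, not just one:
const matched = guests.filter(g => g.TMP_id === 'a1');
console.log(matched.length); // 2
```

So long as TMP_id is genuinely unique in the source data, each update touches exactly one row and this isn't the culprit.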
In the meantime, added the "performance" label to satisfy Sailsbot.
@mikermcneil About the performance and memory issue, it might be related to this https://github.com/balderdashy/sails-mongo/pull/369
Note that I recently updated sails-hook-dev to allow for easy gc() usage for testing stuff like this (also more troubleshooting tips here: http://mikermcneil.com/post/148716477837/omg-is-that-a-memory-leak)
Hi @wulfsolter,
I am wondering if you have made the upgrade to Sails V1 and if so have you noticed any new leakage?
As a side note we are currently consolidating Sails issues into one repo to help keep a better eye on things.
~Cheers!
Hi @raqem,
Yes, we're on sails@^1.1.0 and not seeing any more major problems like this one. Plenty of crashes and restarts, but that's what pm2 is for:
┌─[node@WORKER1]-[2019-04-08 04:12 - 1554696756]-[/var/www/backend]
└─[$]-> pm2 ls
┌──────────┬────┬─────────┬───────┬────────┬─────────┬────────┬───────┬──────────┬──────┬──────────┐
│ App name │ id │ mode    │ pid   │ status │ restart │ uptime │ cpu   │ mem      │ user │ watching │
├──────────┼────┼─────────┼───────┼────────┼─────────┼────────┼───────┼──────────┼──────┼──────────┤
│ API      │ 0  │ cluster │ 10479 │ online │ 27      │ 5h     │ 1.5%  │ 340.3 MB │ node │ disabled │
│ API      │ 1  │ cluster │ 10484 │ online │ 195     │ 5h     │ 23.9% │ 520.2 MB │ node │ disabled │
│ API      │ 2  │ cluster │ 10714 │ online │ 25      │ 5h     │ 2.1%  │ 371.4 MB │ node │ disabled │
│ API      │ 3  │ cluster │ 10742 │ online │ 26      │ 5h     │ 2.4%  │ 399.6 MB │ node │ disabled │
│ API      │ 4  │ cluster │ 10807 │ online │ 121     │ 5h     │ 1.8%  │ 341.7 MB │ node │ disabled │
│ API      │ 5  │ cluster │ 10841 │ online │ 24      │ 5h     │ 20%   │ 472.3 MB │ node │ disabled │
│ API      │ 6  │ cluster │ 10899 │ online │ 31      │ 5h     │ 2.2%  │ 316.9 MB │ node │ disabled │
│ API      │ 7  │ cluster │ 10933 │ online │ 28      │ 5h     │ 2%    │ 332.8 MB │ node │ disabled │
└──────────┴────┴─────────┴───────┴────────┴─────────┴────────┴───────┴──────────┴──────┴──────────┘