Socket.io: Default uWebSocket Engine in 2.0.1 causing 100% CPU spin on server

Created on 30 May 2017  路  23Comments  路  Source: socketio/socket.io

I want to:

  • [x] report a bug
  • [ ] request a feature

Current behaviour

Version 2.0.1 exhibits a problem for me when using the default engine: uWebSockets, pushing the Node.js server process into 100% CPU

Steps to reproduce (if the current behaviour is a bug)

I get the problem consistently if I have an application running in the browser with a web-socket connection in place. I then reload the application in the browser (which forces the old socket to be removed and a new one put in place). At this point the server goes into a spinning loop, eating 100% of the CPU.

Replacing the engine with "ws" fixes the issue.

Setup

  • OS: Ubuntu 16.04
  • browser: Chrome 57
  • socket.io version: 2.0.1

Other information (e.g. stacktraces, related issues, suggestions how to fix)

Here's what node debug reports when I pause during 100% CPU utilisation:
ubuntu@ip-172-30-1-57:~/qewd$ node debug localhost:7000
connecting to localhost:7000 ... ok
debug> pause
break in node_modules/uws/uws.js:8
6 const DEFAULT_PAYLOAD_LIMIT = 16777216;
7

8 function noop() {}
9
10 function abortConnection(socket, code, name) {
debug>

It appears to be stuck in a loop calling this function

Most helpful comment

Hi,

I saw symptoms similar to the ones described above, namely:

  1. socket handler code generating a child-process fork
  2. socket disconnect occurs before fork exits
  3. node thrashing repeatedly trying to reap the socket somewhere in Socket.onclose()
  4. event loop does not fully block because the reaper fails, and retries, fails, and retries in a loop until the fork is killed. CPU appears pegged at 100% but in reality the event loop is just getting clogged with these retries. Node server still able to respond to requests but super slow (and gets slower with more volume of course)
  5. kill the fork via cmd line kill signal or via eventual code process.exit and boom, socket cleanup finishes and node stops thrashing.

I was able to work around this by handling the Socket "disconnecting" event (new in 2.x?) and cleaning up associated forks before the rest of the onclose() executes. See below for the onclose() impl to see why this works. EventEmitter emits disconnecting before anything else is done, so you can safely kill all relevant child processes in an event handler here.

Socket.prototype.onclose = function(reason){
  if (!this.connected) return this;
  debug('closing socket - reason %s', reason);
  this.emit('disconnecting', reason);
  this.leaveAll();
  this.nsp.remove(this);
  this.client.remove(this);
  this.connected = false;
  this.disconnected = true;
  delete this.nsp.connected[this.id];
  this.emit('disconnect', reason);
};

Hope this helps anyone else having this issue currently.

@darrachequesne I'd recommend a closer look at the onclose flow and see if there is a place where retry logic can be handled better during socket cleanup. Maybe an increasing retry interval would be acceptable, followed by a thrown Error after n retries? At least then it wouldn't blow up the event loop with constant retries.

* Just in case this helps, I wanted to note that this issue does not exist in 1.3.5. I didn't notice anything in the diffs between 1.3.5 and 2.0.4 that could be the culprit, but could be inside a dependency, as many of these were upgraded in 2.x

DETAILS: Node.js 8.4.0, socket-io 2.0.4 running on Debian 8.9

All 23 comments

Just a follow-up - I've tried to get some further diagnostic info on what's actually happening, but no joy. All I can tell you is that it does seem to be related to a socket disconnection event and when using the uWebSockets engine. The noop() function in uws.js starts being fired repeatedly - I've been unable to find out what's calling it: noop.caller just returns null, and there's no stack trace if you force an error within noop(). I don't believe I'm doing anything unusual within my app the way I use web-sockets, and switching to "ws" as the engine returns it to normality. If you can give me any suggestions to try out to help diagnose the problem, do let me know, and I'll see what I can find out.

@robtweed I am not able to reproduce (same OS, same browser, same socket.io version). Can you reproduce consistently? Could you provide some code please?

OK I eventually tracked it down. It happens if you've forked a child process, and the process environment of the master process is inherited by the child. If socket.io is loaded in the master process environment at the time the child process is forked, then reloading a socket client throws the master process into the noop() spin.

Initially everything is fine - as soon as you reload the browser, causing the original socket to be closed and a new one created, the master process goes berserk - for me it happens 100% of the time.

For example, I've managed to strip it back to this, so hopefully it's replicable:

var express = require('express');
var app = express();
server = app.listen(8080);
var io = require('socket.io')(server);
app.use('/', express.static(__dirname + '/www/'));
var cp = require('child_process');

var workerStarted = false;

function startWorker() {
  var workerProcess = cp.fork(__dirname + '/workerKeepAlive.js', null, { env: process.env });
  workerStarted = true;
}

io.on('connection', function (socket) {
    console.log('socket connected: ' + socket.id);
    var handleMessage = function(data) {
      console.log(socket.id + ' handling message: ' + JSON.stringify(data));

      if (!workerStarted) startWorker();  // start up the child process

      var type = data.type; 
      var responseObj;
      if (type === 'ewd-register' || type === 'ewd-reregister') {
        console.log('ewd-register ' + socket.id);
        data.socketId = socket.id;
        data.ipAddress = socket.request.connection.remoteAddress;
        responseObj = {"type":"ewd-register","finished":true,"message":{"token":"3fc58f5f-9ed3-4165-bdd2-650eca1ed9b6"}}
      }
      else {
        responseObj = {
          type: type,
          finished: true,
          message: {
            hello: 'world'
          }
        };
      }

      socket.emit('ewdjs', responseObj);
    };

    socket.on('ewdjs', handleMessage);

    socket.on('disconnect', function() {
      console.log('socket ' + socket.id + ' disconnected');
    });
});

Here's the very simple child process keepalive script:

console.log('child process started: ' + process.pid);
setTimeout(function() {
  console.log('child process ' + process.pid + ' terminating');
  process.exit();
}, 20000);

Note: if you start the child process before socket.io is loaded, it behaves OK. Start it once socket.io is loaded and the problem occurs.

I'd hazard a guess that it's an IPC-related problem, with the child process having a handle to the socket.io instance. Switch the socket.io engine to "ws" and the problem goes away.

For my situation, I need to be able to start child processes after socket.io has started, and the run-time environment of the child process needs to be the same as the parent process (although I don't need the socket.io environment in the worker)

Just a quick follow-up - it turns out the process.env issue is a red herring - it even fails with this:

cp.fork(__dirname + '/workerKeepAlive.js');

The critical thing is when the child process is started - start it inside the
io.on('connection', function (socket) {..}
closure and you get the CPU runaway happening when you reload the browser

Start it before this and it's OK

Hi! I'm experiencing a similar issue, with slightly different conditions :

1- the client connects via socket.io
2- the server spawns a child process with same env via a require('child_process').exec() call
3- the client diconnects before the child proces exits

Then the server CPU goes 100%.
If the child process ends before disconnection, nothing happens.
As reported above, the option {wsEngine: 'ws'} fixes the issue.

Any resolution to this sorted out? It would be good to be able to switch to uws as the engine

I also run into this problem. I set to use polling only for now.

Me too.
This problem is happen in Linux. (Debian stretch, Ubuntu 16.04)
However, that does not happen in macOS.

Hi, It seems that this bug has disappeared on my side with recent updates.

@valette could you please elaborate? I just checked npm and the latest version was published 4 months ago.

Are you referring to the latest develop branch here in this repo?

no, I do not know exactly, but something changed and the bug is no longer visible (maybe an update in core node?). I am not using any development branch, only released packages.

I load tested the library with wsEngine set to ws, after only 500~ worker kill/fork the 100% cpu problem comes back and keep haunting me, this is basically a no go sign for me now.

I'm putting together demo code and will try to host it somewhere for others to play with.

Just so you know, i'm having the same issue with a 30 worker environment. Same patern. At least now that I have found your issue, i'm know i'm not crazy. I'll keep making some tests and post anything I can find.

same issue for me,
same OS and I am using React Native on client side

wsEngine: 'ws'
does not fix the issue for me...
any other ideas?

Hi,

I saw symptoms similar to the ones described above, namely:

  1. socket handler code generating a child-process fork
  2. socket disconnect occurs before fork exits
  3. node thrashing repeatedly trying to reap the socket somewhere in Socket.onclose()
  4. event loop does not fully block because the reaper fails, and retries, fails, and retries in a loop until the fork is killed. CPU appears pegged at 100% but in reality the event loop is just getting clogged with these retries. Node server still able to respond to requests but super slow (and gets slower with more volume of course)
  5. kill the fork via cmd line kill signal or via eventual code process.exit and boom, socket cleanup finishes and node stops thrashing.

I was able to work around this by handling the Socket "disconnecting" event (new in 2.x?) and cleaning up associated forks before the rest of the onclose() executes. See below for the onclose() impl to see why this works. EventEmitter emits disconnecting before anything else is done, so you can safely kill all relevant child processes in an event handler here.

Socket.prototype.onclose = function(reason){
  if (!this.connected) return this;
  debug('closing socket - reason %s', reason);
  this.emit('disconnecting', reason);
  this.leaveAll();
  this.nsp.remove(this);
  this.client.remove(this);
  this.connected = false;
  this.disconnected = true;
  delete this.nsp.connected[this.id];
  this.emit('disconnect', reason);
};

Hope this helps anyone else having this issue currently.

@darrachequesne I'd recommend a closer look at the onclose flow and see if there is a place where retry logic can be handled better during socket cleanup. Maybe an increasing retry interval would be acceptable, followed by a thrown Error after n retries? At least then it wouldn't blow up the event loop with constant retries.

* Just in case this helps, I wanted to note that this issue does not exist in 1.3.5. I didn't notice anything in the diffs between 1.3.5 and 2.0.4 that could be the culprit, but could be inside a dependency, as many of these were upgraded in 2.x

DETAILS: Node.js 8.4.0, socket-io 2.0.4 running on Debian 8.9

@juergengunz Hello, Can it be used in the react native project? I plan to use it. Please advise.

Note: ws will be restored as default wsEngine starting version 2.1.0

I am having the same issue, even without using child_process. This only happens after quite a long time of the server being active!

Is still an issue with node >= 11.1.0 ? After a chrome-devtools inspect... It's look like a nodejs infinite loop.

Looks like we are having the same issue still in 2020.
Socket.io version 2.2.0
Node.js version 12.15.0

As I can see this is now default { "wsEngine": "ws" } so not sure if it is related to the same issue.
One specific thing is that we are using socket.io-redis as adapter for running multiple instances.

Same issue here. I'm also using redis adapter. Thinking of use ws instead.

Same issue. I'm also using redis adapter.

Was this page helpful?
0 / 5 - 0 ratings