Node: IPC can freeze the process

Created on 11 Jul 2016 · 19 Comments · Source: nodejs/node

Since we upgraded our app from 0.10.38 to v6, we have experienced a lot of problems with IPC messaging.

Essentially, we have a web application with a few workers that handle all the requests. The workers contain caches to speed up requests, and these caches are synchronized between processes with IPC messaging. We were also using log4js as a logging library, with the clustered appender that uses IPC to send all child logs back to the master so that a single process handles the logs.

All was working fine under 0.10.38, but when we upgraded to 6.0.0 (and then 6.2.0) our app kept crashing under various circumstances.

We soon realized that sending too much data (or sending it too fast) through IPC was freezing our application.

We began refactoring our entire app to keep IPC use to a strict minimum.

  • We created a custom logging process that receives logs over TCP instead of IPC
  • We refactored our entire master/worker setup so that the workers load all the information on their own, restricting IPC messages to "trigger" messages instead of sending all the data.
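
A minimal sketch of the "trigger" message idea, assuming each worker can reload a cache entry from its own data source. The names createWorkerCache, loadValue and the { type: 'invalidate', key } message shape are hypothetical and are not taken from the original application:

```javascript
// Sketch only: a per-worker cache that is kept in sync with small
// "trigger" messages instead of shipping the cached data over IPC.
// `loadValue` is a hypothetical function that fetches a value from the
// worker's own data source (database, file, ...).
function createWorkerCache(loadValue) {
  const cache = new Map();
  return {
    // Return the cached value, loading it locally on a miss.
    get(key) {
      if (!cache.has(key)) {
        cache.set(key, loadValue(key));
      }
      return cache.get(key);
    },
    // Handle a trigger message: drop the entry so the next get() reloads it.
    handleMessage(msg) {
      if (msg && msg.type === 'invalidate') {
        cache.delete(msg.key);
      }
    },
  };
}

// Possible wiring (assumed, not from the original app):
//   worker:  process.on('message', (msg) => cache.handleMessage(msg));
//   master:  worker.send({ type: 'invalidate', key: 'user:42' });
```

With this shape, the IPC payload stays a few bytes per change regardless of how large the cached values are.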

All those changes are good for our application, since they reduced the dependencies between master and workers and gave us a better separation of responsibilities. But I still see this as a flaw in Node.js: IPC is a fairly simple mechanism for exchanging information between workers, yet it seems so fragile now that we are afraid of using it.

I attached a simple script that reproduces the problem. It is not a real scenario, just a test case I created to reproduce the problem of the application no longer responding.

ipc_test_scripts.zip

On my laptop, the app crashes at startup (or before the first log) with 5 forks (maybe because I have 4 physical cores).

At first I tested with 3 workers and it froze after 5-10 minutes (the CPU usage of all processes goes down to 0 and there's no more log output).

If I remove the "bacon ipsum" from the worker message, it works (might freeze after a while)
If I increase the message interval from 1ms to 10ms, it works (might freeze after a while)
If I spawn only 4 workers it works (will probably freeze after 5-10 minutes)

If I execute it with 0.10.38 it works (for as long as I let it run).

So if you play with the timings, size of messages and/or number of forks, you should be able to reproduce the problem.

One thing I observed is that IPC messaging seems to have improved in performance big time from 0.10 to 6. If I run the test with 3 workers for 10 seconds with 0.10.38, the master handles only 1902 messages; with 6.3.0, in the same 10 seconds, the master handles 25514 messages (roughly 13 times as many).

I also tested it with 4.4.7: it freezes at startup with 5 forks and after 4 minutes with 4 forks.

My specs :
NodeJS Windows 6.3.0 64 bits (bug)
NodeJS Windows 6.2.0 64 bits (bug)
NodeJS Windows 4.4.7 64 bits (bug)
NodeJS Windows 0.10.38 64 bits (OK)

child_process libuv windows

All 19 comments

I just tested your example unmodified on a recent 7.0.0-pre build on Windows 7 x64 with 4 physical hyperthreaded cores for around 8 minutes, handling around 2.2 million messages with no issues. Maybe it's already fixed in the repo, just not on the 6.3.0 release?
EDIT: Running with 9 workers (tested for 5 minutes) was also no problem.

@httpdigest
Have you tested it with 6.3.0? Have you been able to reproduce the problem?

Great if it's fixed on the v7 branch, but since v6 becomes the next LTS in October, I assume this kind of bug fix should be backported!

I confirmed that the provided repro works under Windows 10 with node v0.10.38 and v0.10.46, but stops immediately with v6.3.0. Tested master under Windows 2012 and it stops after printing the "still alive" message once or twice. I will add this to my backlog and investigate when possible.

cc @nodejs/platform-windows

I'm able to reproduce the hang on Windows using the scripts provided by @cvillemure. In a debugger I found the main thread of the parent process was stuck waiting for a pipe write to complete. Normally writes to an IPC pipe complete immediately, but when the pipe gets full because messages are written into it faster than they are read out on the other side, the writes block until space is available. While I can't figure out how to debug the child processes, I assume they are similarly blocked trying to write to their IPC pipe, so that the parent and child processes are waiting on each other.
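
The mutual wait described above can be illustrated with a toy model. This is not libuv code: it assumes each direction of the channel has a fixed buffer capacity (the 65536 used in the comments is an arbitrary illustrative number) and that a process blocked in a write never gets back to its event loop to read. The function name simulateBurst is hypothetical:

```javascript
// Toy model of two processes that both start with a burst of blocking
// writes. A write that does not fit in the pipe blocks the writer until
// the other side reads.
function simulateBurst(capacity, parentBytes, childBytes) {
  // The parent blocks if its burst does not fit in the parent -> child pipe.
  const parentBlocked = parentBytes > capacity;
  // The child blocks if its burst does not fit in the child -> parent pipe.
  const childBlocked = childBytes > capacity;
  // A blocked writer cannot read, so the pipe it should be draining stays
  // full. If both sides are blocked, neither can ever make room.
  if (parentBlocked && childBlocked) {
    return 'deadlock';
  }
  // Otherwise at least one side finishes writing, returns to its event
  // loop, reads, and thereby unblocks the other side.
  return 'progress';
}

// simulateBurst(65536, 100000, 100000) -> both sides stuck in write
// simulateBurst(65536, 100000, 10)     -> child drains the parent's pipe
```

The model also suggests why smaller messages, slower intervals or fewer workers only delay the freeze: they make it less likely, but not impossible, that both pipes fill at the same moment.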

According to the documentation, the correct behavior should be for the send() method to return false instead of waiting indefinitely on a blocked pipe:

child.send() will return false if the channel has closed or when the backlog of unsent messages exceeds a threshold that makes it unwise to send more.

A possibly related issue is that as of node v4.0, process.send() operations are asynchronous on unix, but they appear to be still synchronous on Windows. There is a comment in the Windows pipe code that specifically mentions that IPC writes are intentionally blocking, which I don't understand.

I don't see a v7.0.0 branch, but I'm using binaries built from the latest sources from the master branch. (On Windows 10 x64.)

We are seeing a similar issue in VS Code where the application just hangs after sufficient amount of data is sent between a node process and its forked child.

I would appreciate if someone can enlighten me about the boolean return value of process.send. According to the docs:

child.send() will return false if the channel has closed or when the backlog of unsent messages exceeds a threshold that makes it unwise to send more.

In my reproducible case I see false being returned from process.send, and if I stop sending data at that point I do not run into the freeze. But it is unclear to me how to proceed from that point. Was the message only partially sent? When is it safe again to send further messages?

Our understanding now is to use process.send in the following way:

  • if the return value is true, just continue sending
  • if the return value is false, assume the message was not delivered and store the message in a buffer
  • wait for the callback of process.send to return before sending additional data

Can someone confirm the following pseudo code?

var buffer = [];

function send(msg) {
    if (buffer.length > 0) {
        buffer.push(msg);
        return; // wait for the pending process.send to finish before sending
    }

    var res = process.send(msg, () => {
        // send buffer now that we are good again to send
        var bufferCopy = buffer.slice(0);
        buffer = [];
        bufferCopy.forEach(b => send(b));
    });

    if (!res) {
        buffer.push(msg); // start adding the message to the buffer if send failed
    }
}

What worries me is that here I am assuming that the callback is a good place to continue sending data to the process but according to the docs this is not clear to me:

The optional callback is a function that is invoked after the message is sent but before the child may have received it.

Basically I am missing a way to find out when is a good time to start sending messages again after receiving false from process.send.

@bpasero The return value indicates whether node.js was able to send the message right away (true) or had to buffer it (false). You can keep sending messages and node.js will dutifully buffer them but that may result in unbounded memory growth so as a rule of thumb, when process.send(message, callback) returns false, you should back off until the callback is called. You don't need to resend the message.
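
A minimal sketch of that rule of thumb, assuming a channel object with a send(message, callback) method that returns a boolean (process in a forked child, or a ChildProcess in the parent). createSender is a hypothetical helper, not a Node.js API, and per the discussion in this thread it limits memory growth but is not expected to prevent the Windows freeze:

```javascript
// Sketch only: back off after send() returns false, resume when the
// callback of that same send fires. Messages are never re-sent, because
// a `false` return means "queued by Node.js", not "dropped".
function createSender(channel) {
  const pending = [];     // messages held back while we are backing off
  let waitingFor = null;  // token of the send whose callback we wait for

  function flush() {
    while (waitingFor === null && pending.length > 0) {
      const msg = pending.shift();
      const token = {};
      let done = false;
      const accepted = channel.send(msg, () => {
        done = true;
        // Only the callback of the send that returned false ends the back-off.
        if (waitingFor === token) {
          waitingFor = null;
          flush();
        }
      });
      // Back off unless the callback has already run.
      if (accepted === false && !done) {
        waitingFor = token;
      }
    }
  }

  return function send(msg) {
    pending.push(msg);
    flush();
  };
}

// Possible usage in a forked child (assumed):
//   const send = createSender(process);
//   send({ type: 'log', line: 'hello' });
```

The local pending array is still unbounded here; a real application would probably cap it or drop low-priority messages once it grows too large.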

@bnoordhuis thanks for the explanation, this should probably go into the docs of child_process.send.

I will try to follow that approach, however I would be surprised if the node-process deadlock is fixed with that approach. It may just make it less likely to happen.

Super simple repro for me:

index.js

var cp = require("child_process");

var res = cp.fork("./fork.js");

var largeObj = {};
for (var i = 0; i < 10000; i++) {
    largeObj[i] = "foo bar";
}

setInterval(function () {
    console.log("PING (main side)")
}, 1000);

for (var i = 0; i < 2; i++) {
    var result = res.send(JSON.stringify(largeObj), function() {
        console.log("Done sending from main side");
    });

    console.log("Result from sending: " + result);
}

fork.js

setInterval(function () {
    console.log("PING (fork side)")
}, 1000);

var largeObj = {};
for (var i = 0; i < 10000; i++) {
    largeObj[i] = "foo bar";
}

for (var i = 0; i < 2; i++) {
    process.send(JSON.stringify(largeObj));
}

All it needs is data large enough to cause the process.send call to return false, with both sides sending data.

As expected, using process.send more gracefully by checking the return code and only continuing to send data when the callback is hit does not solve the freeze. It just makes it a little bit less likely because you end up sending data after a setTimeout(0).

@bpasero I've filed https://github.com/libuv/libuv/issues/1099. It's on my radar but I'm not much of a Windows programmer. If you want a speedy resolution, maybe you can have one of your programmers look at it.

Thanks. A workaround that seems to prevent this issue for us is to always send messages in sequence, each one from the callback of the previous process.send call. On Windows at least, this basically means that each message gets sent after a process.nextTick.
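
The workaround could look roughly like the sketch below, which keeps at most one message in flight and sends the next one from the callback of the previous send. createSerialSender is a hypothetical name, and the channel is again assumed to be anything with a send(message, callback) method, such as process or a forked ChildProcess:

```javascript
// Sketch only: strictly sequential sending, one message in flight at a time.
function createSerialSender(channel) {
  const queue = [];      // messages waiting for their turn
  let inFlight = false;  // true while a send callback is outstanding

  function next() {
    if (inFlight || queue.length === 0) {
      return;
    }
    inFlight = true;
    // The return value is ignored on purpose: the next message is only
    // sent once this callback has fired.
    channel.send(queue.shift(), () => {
      inFlight = false;
      next();
    });
  }

  return function send(msg) {
    queue.push(msg);
    next();
  };
}

// Possible usage in a forked child (assumed):
//   const send = createSerialSender(process);
//   send({ type: 'cache-update', key: 'user:42' });
```

This trades throughput for safety, since every message waits for the previous callback, and it reflects what was reported to help in this thread rather than a guaranteed fix.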

You don't need to resend the message.

@bnoordhuis so Node.JS takes care of resending the messages?

It isn't "resending" it, because it didn't "fail to get sent". It just got queued. The return value is for flow control, so you know you are sending faster than data can be written out to the socket; it isn't an indication that data was dropped.

@sam-github thanks for clarifying that

Should this remain open?

Since this is a libuv issue, I'll take the liberty of closing this. Libuv PRs welcome.

Even though this isn't resolved yet, thank you so much for documenting this! Was losing my mind when my root node process was freezing up.

Glad to know it's Windows-specific, not a Node.js/IPC issue.

Double thanks to @cvillemure for packaging a test case. Helped me quickly confirm the issue I'm experiencing is the same, and not unique to my app.

How about re-opening this issue until this fix in libuv lands in Node.JS? Now I had to run the test script to check if it works already (it doesn't)
