Node: IPC can freeze the process

Created on 11 Jul 2016 · 19 Comments · Source: nodejs/node

Since we upgraded our app from 0.10.38 to v6, we have experienced a lot of problems with IPC messaging.

Essentially, we have a web application with a few workers to handle all the requests. The workers contain caches to speed up the requests, and these caches are synchronized between processes with IPC messaging. We were also using log4js as a logging library with the clustered appender, which uses IPC to send all child logs back to the master so that a single process handles the logs.

All was working fine under 0.10.38, but when we upgraded to 6.0.0 (and then 6.2.0) our app kept crashing under various circumstances.

We soon realized that sending too much data (or sending it too fast) through IPC was freezing our application.

We began refactoring our entire app to keep IPC use to a strict minimum.

We created a custom logging process that receives logs over TCP instead of IPC
We refactored our entire master/worker process so the workers could load all the information on their own, and restricted IPC messages to "trigger" messages only instead of sending all the data.

All those changes are good for our application, since they reduced the dependencies between master and workers and gave us a better separation of responsibilities. But I still see this as a flaw in Node.JS: IPC is a fairly simple mechanism for exchanging information between workers, yet it now seems so fragile that we are afraid of using it.

I attached a simple script that reproduces the problem. It is not a real scenario, just a test case I created to reproduce the problem of the application that stops responding.

ipc_test_scripts.zip

On my laptop, the app crashes at startup (or before the first log) with 5 forks (maybe because I have 4 physical cores)

At first I tested with 3 workers and it froze after 5-10 minutes (CPU usage of all processes goes down to 0 and there's no more log output)

If I remove the "bacon ipsum" from the worker message, it works (might freeze after a while)
If I increase the message interval from 1ms to 10ms, it works (might freeze after a while)
If I spawn only 4 workers it works (will probably freeze after 5-10 minutes)

If I execute it with 0.10.38 it works (for as long as I ran it)

So if you play with the timings, size of messages and/or number of forks, you should be able to reproduce the problem.

One thing I observed is that IPC messaging performance seems to have improved a great deal from 0.10 to 6. If I run the test with 3 workers for 10 seconds, the master handles only 1902 messages with 0.10.38, compared with 25514 messages in the same 10 seconds with 6.3.0.

I also tested it with 4.4.7: it freezes at startup with 5 forks and after 4 minutes with 4 forks

My specs:
NodeJS Windows 6.3.0 64 bits (bug)
NodeJS Windows 6.2.0 64 bits (bug)
NodeJS Windows 4.4.7 64 bits (bug)
NodeJS Windows 0.10.38 64 bits (OK)

child_process libuv windows

cvillemure

👍5

Most helpful comment

How about re-opening this issue until this fix in libuv lands in Node.JS? Now I had to run the test script to check if it works already (it doesn't)

rogierschouten on 12 Jun 2018

👍2

All 19 comments

I just tested your example unmodified on a recent 7.0.0-pre build on Windows 7 x64 with 4 physical hyperthreaded cores for around 8 minutes, handling around 2.2 million messages with no issues. Maybe it's already fixed in the repo, just not in the 6.3.0 release?
EDIT: Running with 9 workers (tested for 5 minutes) was also no problem.

httpdigest on 14 Jul 2016

@httpdigest
Have you tested it with 6.3.0? Have you been able to reproduce the problem?

Great if it's fixed on the V7 branch, but since V6 is the next LTS from October I assume this kind of bug fix should be backported!

cvillemure on 14 Jul 2016

👍1

I confirmed that the provided repro works under Windows 10 with node v0.10.38 and v0.10.46, but stops immediately with v6.3.0. Tested master under Windows 2012 and it stops after printing the "still alive" message once or twice. I will add this to my backlog and investigate when possible.

cc @nodejs/platform-windows

joaocgreis on 15 Jul 2016

I'm able to reproduce the hang on Windows using the scripts provided by @cvillemure. In a debugger I found the main thread of the parent process was stuck waiting for a pipe write to complete. Normally writes to an IPC pipe complete immediately, but when the pipe gets full because messages are written into it faster than they are read out on the other side, the writes block until space is available. While I can't figure out how to debug the child processes, I assume they are similarly blocked trying to write to their IPC pipe, so that the parent and child processes are waiting on each other.

According to the documentation, the correct behavior should be for the send() method to return false instead of waiting indefinitely on a blocked pipe:

child.send() will return false if the channel has closed or when the backlog of unsent messages exceeds a threshold that makes it unwise to send more.

A possibly related issue is that as of node v4.0, process.send() operations are asynchronous on Unix, but they appear to be still synchronous on Windows. There is a comment in the Windows pipe code, which I don't understand, that specifically mentions that IPC writes are intentionally blocking.

I don't see a v7.0.0 branch, but I'm using binaries built from the latest sources from the master branch. (On Windows 10 x64.)

jasongin on 18 Aug 2016

We are seeing a similar issue in VS Code where the application just hangs after a sufficient amount of data is sent between a node process and its forked child.

I would appreciate if someone can enlighten me about the boolean return value of process.send. According to the docs:

child.send() will return false if the channel has closed or when the backlog of unsent messages exceeds a threshold that makes it unwise to send more.

In my reproducible case I see false being returned from process.send, and if I stop sending data at that point I do not run into the freeze. But it is unclear to me how to proceed from that point. Was the message only partially sent? When is it safe to send further messages again?

bpasero on 14 Oct 2016

Our understanding now is to use process.send in the following way:

if the return value is true, just continue sending
if the return value is false, assume the message was not delivered and store the message in a buffer
wait for the callback of process.send to return before sending additional data

Can someone confirm the following pseudo code?

var buffer = [];

function send(msg) {
    if (buffer.length > 0) {
        buffer.push(msg);
        return; // wait for the pending process.send to finish before sending
    }

    var res = process.send(msg, () => {
        // send buffer now that we are good again to send
        var bufferCopy = buffer.slice(0);
        buffer = [];
        bufferCopy.forEach(b => send(b));
    });

    if (!res) {
        buffer.push(msg); // start adding the message to the buffer if send failed
    }
}

What worries me is that I am assuming the callback is a good place to continue sending data to the process, but the docs do not make this clear to me:

The optional callback is a function that is invoked after the message is sent but before the child may have received it.

Basically I am missing a way to find out when is a good time to start sending messages again after receiving false from process.send.

bpasero on 14 Oct 2016

@bpasero The return value indicates whether node.js was able to send the message right away (true) or had to buffer it (false). You can keep sending messages and node.js will dutifully buffer them but that may result in unbounded memory growth so as a rule of thumb, when process.send(message, callback) returns false, you should back off until the callback is called. You don't need to resend the message.
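A minimal sketch of that rule of thumb (not part of the original comment). The helper name createThrottledSender and the injected rawSend parameter are hypothetical. The sketch assumes rawSend follows the process.send(message, callback) contract described above: a false return means the message was queued by node.js, and the callback is invoked asynchronously, after the return value is known. Messages are held back only while a false return is still waiting for its callback, and nothing is ever resent.

```javascript
// Hypothetical helper, not a Node.js API. `rawSend` is any function with the
// shape of process.send(message, callback) that returns false when the
// message had to be queued instead of being written right away.
function createThrottledSender(rawSend) {
    var held = [];          // messages we hold back while backing off
    var backingOff = false; // true between a `false` return and its callback

    function sendNow(msg) {
        var queuedByNode = false;
        var accepted = rawSend(msg, function onSent() {
            // Only the send that triggered the back-off resumes the flow.
            // Assumes the callback runs after rawSend has returned.
            if (queuedByNode) {
                backingOff = false;
                drain();
            }
        });
        if (accepted === false) {
            // The message is already queued; do not send it again.
            queuedByNode = true;
            backingOff = true;
        }
    }

    function drain() {
        // Release held messages until one of them triggers a new back-off.
        while (!backingOff && held.length > 0) {
            sendNow(held.shift());
        }
    }

    return function send(msg) {
        if (backingOff) {
            held.push(msg);
        } else {
            sendNow(msg);
        }
    };
}
```

In a forked child this could be wired up as var send = createThrottledSender(process.send.bind(process)). The sketch ignores the error argument of the callback and does not bound the size of the held array, so a real implementation would still need to handle a closed channel and decide what to do when the backlog keeps growing.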

bnoordhuis on 14 Oct 2016

@bnoordhuis thanks for the explanation, this should probably go into the docs of child_process.send.

I will try to follow that approach, however I would be surprised if the node-process deadlock is fixed with that approach. It may just make it less likely to happen.

bpasero on 14 Oct 2016

👍2

Super simple repro for me:

index.js

var cp = require("child_process");

var res = cp.fork("./fork.js");

var largeObj = {};
for (var i = 0; i < 10000; i++) {
    largeObj[i] = "foo bar";
}

setInterval(function () {
    console.log("PING (main side)")
}, 1000);

for (var i = 0; i < 2; i++) {
    var result = res.send(JSON.stringify(largeObj), function() {
        console.log("Done sending from main side");
    });

    console.log("Result from sending: " + result);
}

fork.js

setInterval(function () {
    console.log("PING (fork side)")
}, 1000);

var largeObj = {};
for (var i = 0; i < 10000; i++) {
    largeObj[i] = "foo bar";
}

for (var i = 0; i < 2; i++) {
    process.send(JSON.stringify(largeObj));
}

All it needs is data large enough to make the process.send call return false, and both sides need to be sending data.

bpasero on 14 Oct 2016

As expected, using process.send more gracefully by checking the return code and only continuing to send data when the callback is hit does not solve the freeze. It just makes it a little bit less likely because you end up sending data after a setTimeout(0).

bpasero on 15 Oct 2016

@bpasero I've filed https://github.com/libuv/libuv/issues/1099. It's on my radar but I'm not much of a Windows programmer. If you want a speedy resolution, maybe you can have one of your programmers look at it.

bnoordhuis on 17 Oct 2016

Thanks. A workaround that seems to prevent this issue for us is to always send messages in sequence, sending each one from the callback of the previous process.send call. On Windows at least, this basically means that each message gets sent after a process.nextTick.
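A rough sketch of that workaround (not part of the original comment). The helper name createSerialSender and the injected rawSend parameter are hypothetical. rawSend is assumed to have the shape of process.send(message, callback), and at most one message is in flight at a time.

```javascript
// Hypothetical helper, not a Node.js API. Each message is handed to `rawSend`
// only from the callback of the previous send, so sends never overlap.
function createSerialSender(rawSend) {
    var queue = [];       // messages waiting for their turn
    var inFlight = false; // true while a send is waiting for its callback

    function next() {
        if (queue.length === 0) {
            inFlight = false;
            return;
        }
        inFlight = true;
        // The callback of this send triggers the next one. The return value
        // is ignored because we wait for the callback in every case.
        rawSend(queue.shift(), next);
    }

    return function send(msg) {
        queue.push(msg);
        if (!inFlight) {
            next();
        }
    };
}
```

In a forked child this could be used as var send = createSerialSender(process.send.bind(process)). The trade-off is throughput: every message waits for the previous callback, even when the channel is not congested. The thread reports that this made the freeze go away in one application, not that it removes the underlying libuv issue.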

bpasero on 17 Oct 2016

You don't need to resend the message.

@bnoordhuis so Node.JS takes care of resending the messages?

christian-bromann on 12 Dec 2016

It isn't "resending" it, because it didn't "fail to get sent". It just got queued. The return code is for flow control, so you know you are sending faster than data can be written out to the socket; it isn't an indication that data was dropped.

sam-github on 12 Dec 2016

👍2

@sam-github thanks for clarifying that

christian-bromann on 12 Dec 2016

Should this remain open?

Trott on 15 Jul 2017

Since this is a libuv issue, I'll take the liberty of closing this. Libuv PRs welcome.

bnoordhuis on 17 Jul 2017

Even though this isn't resolved yet, thank you so much for documenting this! Was losing my mind when my root node process was freezing up.

Glad to know it's Windows-specific, not a Node.js/IPC issue.

Double thanks to @cvillemure for packaging a test case. Helped me quickly confirm the issue I'm experiencing is the same, and not unique to my app.