Node: Investigate test-child-process-fork-closed-channel-segfault

Created on 19 May 2018  路  18Comments  路  Source: nodejs/node

https://ci.nodejs.org/job/node-test-binary-windows/17414/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=3/console

not ok 51 parallel/test-child-process-fork-closed-channel-segfault
  ---
  duration_ms: 0.259
  severity: fail
  exitcode: 1
  stack: |-
    c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:70
                throw err;
                ^

    Error: write EMFILE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at Worker.send (internal/cluster/worker.js:40:28)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:36:16)
        at Object.onceWrapper (events.js:273:13)
        at Socket.emit (events.js:182:13)
        at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1147:10)

Likely related:

not ok 52 parallel/test-child-process-fork-net
  ---
  duration_ms: 0.354
  severity: fail
  exitcode: 1
  stack: |-
    PARENT: server listening
    CHILD: server listening
    CLIENT: connected
    PARENT: got connection
    CLIENT: connected
    CHILD: got connection
    CLIENT: closed
    CHILD: got connection
    CHILD: got connection
    CLIENT: closed
    CLIENT: connected
    CLIENT: connected
    CLIENT: closed
    CLIENT: closed
    PARENT: server closed
    CHILD: got socket
    CLIENT: got data
    CLIENT: closed
    testSocket, listening
    events.js:167
          throw er; // Unhandled 'error' event
          ^

    Error: write EPIPE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at SocketListSend._request (internal/socket_list.js:20:16)
        at SocketListSend.close (internal/socket_list.js:40:10)
        at Server.close (net.js:1624:24)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-net.js:179:16)
        at Socket.emit (events.js:182:13)
        at TCP._handle.close [as _onclose] (net.js:596:12)
    Emitted 'error' event at:
        at process.nextTick (internal/child_process.js:745:39)
        at process._tickCallback (internal/process/next_tick.js:61:11)
CI / flaky test child_process test windows

Most helpful comment

ok - I guess know the reason for failure (though cannot test now, about to leave for the day.)

https://github.com/nodejs/node/issues/3635#issuecomment-157714683 (refefenced in the test case) was applicable to this line - connection establishment to a server whose state is arbitrary, is prone to 3 types of errors, and hence we checked and skipped them in the client socket's error handler.

This was almost 3 years back!

Now it turns out to be that this line too is prone to the same problem, under similar circumstances - and that makes sense, as the server can close any time between client connecting and worker sending messages.

In summary, we need an error callback for worker object too, that has the same code as in the client socket callback.

All 18 comments

For the first one, EMFILE is predicted to occur according to this comment but not sure why it wasn't captured in the test.

@gireeshpunathil The test checks that err.message is 'Channel closed' and it's not for EMFILE. Not sure if that's a recent change to the message or what. The code that checks for that string was added in https://github.com/nodejs/node/pull/8954 by @santigimeno (when the test had a different name--it was subsequently renamed).

@nodejs/testing @nodejs/platform-windows @nodejs/child_process

(Interestingly, the string was 'channel closed' in the original PR but is now 'Channel closed' with different capitalization.)

And another one (again both tests failed - I wonder if the first test has influence on the second one).

https://ci.nodejs.org/job/node-test-binary-windows/17446/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=3/console

One issue with test-child-process-fork-net is that is uses both console.log() and console.error(). The test runner does not preserve the interleaving of messages from stdout and stderr, so the messages as shown in CI are not the order they appear in when you run the test from the command line. PR to fix that aspect of things: https://github.com/nodejs/node/pull/20873 A 馃憤 to fast-track that would be great.

Better output for test-child-process-fork-net (stdout and stderr properly interleaved):

not ok 52 parallel/test-child-process-fork-net
  ---
  duration_ms: 0.324
  severity: fail
  exitcode: 1
  stack: |-
    PARENT: server listening
    CHILD: server listening
    CLIENT: connected
    PARENT: got connection
    CLIENT: connected
    CHILD: got connection
    CLIENT: closed
    CHILD: got connection
    CHILD: got connection
    CLIENT: closed
    CLIENT: connected
    CLIENT: connected
    CLIENT: closed
    CLIENT: closed
    PARENT: server closed
    testSocket, listening
    CHILD: got socket
    CLIENT: got data
    CLIENT: closed
    events.js:167
          throw er; // Unhandled 'error' event
          ^

    Error: write EPIPE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at SocketListSend._request (internal/socket_list.js:20:16)
        at SocketListSend.close (internal/socket_list.js:40:10)
        at Server.close (net.js:1624:24)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-net.js:178:16)
        at Socket.emit (events.js:182:13)
        at TCP._handle.close [as _onclose] (net.js:596:12)
    Emitted 'error' event at:
        at process.nextTick (internal/child_process.js:745:39)
        at process._tickCallback (internal/process/next_tick.js:61:11)
  ...

I tried to run Windows stress tests for this to see how often it fails but after compiling for a long time, the builds hit a three-minute timeout and don't run the tests. See https://ci.nodejs.org/job/node-stress-single-test/1858/nodes=win2016-1p-vs2017/console and https://ci.nodejs.org/job/node-stress-single-test/1857/nodes=win2016-1p-vs2017/console.

19:31:05      Creating library c:\workspace\node-stress-single-test\nodes\win2016-1p-vs2017\Release\mksnapshot.lib and object c:\workspace\node-stress-single-test\nodes\win2016-1p-vs2017\Release\mksnapshot.exp
19:31:05   Generating code
19:34:05 Build timed out (after 3 minutes). Marking the build as aborted.
19:34:05 Build was aborted
19:34:05 Notifying upstream projects of job completion
19:34:05 Finished: ABORTED

@nodejs/build

On my Win 10 box test-child-process-fork-closed-channel-segfault fails about 6% times about parallel/test-child-process-fork-net fails about once every 30k runs.

@bzoz any chance you could have a look at test-http2-pipe and see if it still fails under load? (unrelated to this issue, of course) If so, could you build with --debug-http2 --debug-nghttp2 and post the debugging info from a failed run to this issue: https://github.com/nodejs/node/issues/20750

It would really be appreciated!

@apapirovski sure, I'll take a look

Just my 2c but this test has been flaky since its introduction, only very infrequently replicates the condition on the originally bugged version (Node 4.0.0) and is unlikely to occur exactly like this again. As such, I think we should scrap the test and move the one important check (assert.strictEqual(worker.process.channel, null);) into another test that uses child_process in a similar manner.

If we try to fix it yet again, it'll be a 6th PR to make this test less flaky. There are currently 5 separate error conditions that we ignore in this. We might be adding a sixth. IMO it's more trouble than it's worth.

Failures showing up on windows for 10.2.0 release

https://ci.nodejs.org/job/node-test-binary-windows/17534/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=3/console

Failure again on 10.2.1

https://ci.nodejs.org/job/node-test-binary-windows/17566/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=3/console

Failure on node-daily-master task:

https://ci.nodejs.org/job/node-test-binary-windows/17579/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=1/console

Host: test-azure_msft-win2016-x64-5

not ok 51 parallel/test-child-process-fork-closed-channel-segfault
  ---
  duration_ms: 0.299
  severity: fail
  exitcode: 1
  stack: |-
    c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:70
                throw err;
                ^

    Error: write EMFILE
        at ChildProcess.target._send (internal/child_process.js:741:20)
        at ChildProcess.target.send (internal/child_process.js:625:19)
        at Worker.send (internal/cluster/worker.js:40:28)
        at Socket.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-fork-closed-channel-segfault.js:36:16)
        at Object.onceWrapper (events.js:273:13)
        at Socket.emit (events.js:182:13)
        at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1147:10)
  ...

ok - I guess know the reason for failure (though cannot test now, about to leave for the day.)

https://github.com/nodejs/node/issues/3635#issuecomment-157714683 (refefenced in the test case) was applicable to this line - connection establishment to a server whose state is arbitrary, is prone to 3 types of errors, and hence we checked and skipped them in the client socket's error handler.

This was almost 3 years back!

Now it turns out to be that this line too is prone to the same problem, under similar circumstances - and that makes sense, as the server can close any time between client connecting and worker sending messages.

In summary, we need an error callback for worker object too, that has the same code as in the client socket callback.

Is parallel/test-child-process-fork-net actually related or should we open another issue to fix it?

@mmarchini - looking the failing stack, I believe it is the same issue. So you can apply the same patch as in #20973 over there too, if you are going for a PR. However, I have found issues with it, and please wait while we fix this.

@gireeshpunathil do you mind if I open a PR to fix parallel/test-child-process-fork-net as well?

Was this page helpful?
0 / 5 - 0 ratings