Node: Flaky test: parallel/test-tls-ticket-cluster

Created on 27 Aug 2019 · 10 comments · Source: nodejs/node

  • Version: master
  • Platform: test-macstadium-macos10.11-x64-1
  • Subsystem: test

Console output

04:34:35 not ok 2094 parallel/test-tls-ticket-cluster
04:34:35   ---
04:34:35   duration_ms: 0.579
04:34:35   severity: fail
04:34:35   exitcode: 1
04:34:35   stack: |-
04:34:35     [master] got "listening"
04:34:35     [master] connecting 60380 session? false
04:34:35     [worker] connection reused? false
04:34:35     [master] got "not-reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [master] got "listening"
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [master] connecting 60380 session? true
04:34:35     [worker] connection reused? true
04:34:35     [master] got "reused"
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     [worker] got "die"
04:34:35     [worker] server close
04:34:35     [worker] exit
04:34:35     events.js:186
04:34:35           throw er; // Unhandled 'error' event
04:34:35           ^
04:34:35     
04:34:35     Error: write EPIPE
04:34:35         at ChildProcess.target._send (internal/child_process.js:813:20)
04:34:35         at ChildProcess.target.send (internal/child_process.js:683:19)
04:34:35         at sendHelper (internal/cluster/utils.js:22:15)
04:34:35         at send (internal/cluster/master.js:351:10)
04:34:35         at internal/cluster/master.js:317:5
04:34:35         at done (internal/cluster/round_robin_handle.js:46:7)
04:34:35         at RoundRobinHandle.add (internal/cluster/round_robin_handle.js:55:12)
04:34:35         at queryServer (internal/cluster/master.js:311:10)
04:34:35         at Worker.onmessage (internal/cluster/master.js:246:5)
04:34:35         at ChildProcess.onInternalMessage (internal/cluster/utils.js:43:8)
04:34:35     Emitted 'error' event on Worker instance at:
04:34:35         at ChildProcess.<anonymous> (internal/cluster/worker.js:27:12)
04:34:35         at ChildProcess.emit (events.js:209:13)
04:34:35         at internal/child_process.js:817:39
04:34:35         at processTicksAndRejections (internal/process/task_queues.js:75:11) {
04:34:35       errno: -32,
04:34:35       code: 'EPIPE',
04:34:35       syscall: 'write'
04:34:35     }
Labels: CI / flaky test, child_process, macos, tls


All 10 comments

/cc @nodejs/cluster ?

Also on FreeBSD:

https://ci.nodejs.org/job/node-test-commit-freebsd/28120/nodes=freebsd11-x64/console

00:15:03 not ok 1983 parallel/test-tls-ticket-cluster
00:15:03   ---
00:15:03   duration_ms: 0.865
00:15:03   severity: fail
00:15:03   exitcode: 1
00:15:03   stack: |-
00:15:03     [master] got "listening"
00:15:03     [master] connecting 28298 session? false
00:15:03     [worker] connection reused? false
00:15:03     [master] got "not-reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [master] connecting 28298 session? true
00:15:03     [worker] connection reused? true
00:15:03     [master] got "reused"
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [master] worker died
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     [master] worker died
00:15:03     [worker] got "die"
00:15:03     [worker] server close
00:15:03     [worker] exit
00:15:03     events.js:186
00:15:03           throw er; // Unhandled 'error' event
00:15:03           ^
00:15:03     
00:15:03     Error: write EPIPE
00:15:03         at ChildProcess.target._send (internal/child_process.js:813:20)
00:15:03         at ChildProcess.target.send (internal/child_process.js:683:19)
00:15:03         at sendHelper (internal/cluster/utils.js:22:15)
00:15:03         at send (internal/cluster/master.js:351:10)
00:15:03         at internal/cluster/master.js:317:5
00:15:03         at Server.done (internal/cluster/round_robin_handle.js:46:7)
00:15:03         at Object.onceWrapper (events.js:298:28)
00:15:03         at Server.emit (events.js:214:15)
00:15:03         at emitListeningNT (net.js:1332:10)
00:15:03         at processTicksAndRejections (internal/process/task_queues.js:79:21)
00:15:03     Emitted 'error' event on Worker instance at:
00:15:03         at ChildProcess.<anonymous> (internal/cluster/worker.js:27:12)
00:15:03         at ChildProcess.emit (events.js:209:13)
00:15:03         at internal/child_process.js:817:39
00:15:03         at processTicksAndRejections (internal/process/task_queues.js:75:11) {
00:15:03       errno: -32,
00:15:03       code: 'EPIPE',
00:15:03       syscall: 'write'
00:15:03     }
00:15:03   ...

I'm so far unable to replicate this locally, but EPIPE can happen if you try to write to a closed stream, and there have been some changes to streams lately... @nodejs/streams? @ronag?

I can reproduce with

$ ./tools/test.py -J --repeat=500 test/parallel/test-tls-ticket-cluster.js

and bisecting points to 43677325e1925c04156f97ebf8a9226f2bae5d79 as first bad commit.

@nodejs/libuv

That probably means libuv/libuv@ee24ce900e5714c950b248da2bdd311b01c983be is responsible (cc @addaleax) but that also means it's probably exposing a latent bug in Node.js (edit: or the test), not introducing a new one.

The sole uv_try_write() caller is below and that err == UV_EAGAIN check looks highly suspect:
https://github.com/nodejs/node/blob/98b718572f9c136e7a11dc60e984d873d08346cc/src/stream_wrap.cc#L350-L354

It looks more like issues with the cluster/child process code to me… currently, synchronous write errors from .send() are reported (but there is an ancient undocumented option to turn that off, swallowErrors), while asynchronous write errors are always discarded silently.

It generally also looks like there’s very little error handling in the cluster module in general, and in particular, there’s this bit that’s also in the stack trace:

https://github.com/nodejs/node/blob/98b718572f9c136e7a11dc60e984d873d08346cc/lib/internal/cluster/round_robin_handle.js#L45-L48

Judging from the commit message of 0da4e0e8439a453d197484402f9f14798c6e00d0 (which added swallowErrors), I’d think it makes sense to add that flag for internal cluster messages, too.

/cc @cjihrig

The sole uv_try_write() caller is below and that err == UV_EAGAIN check looks highly suspect:

I think UV_ENOSYS and UV_EAGAIN are precisely the return values that indicate that it makes sense to use uv_write() after a failed uv_try_write(), correct?

I think UV_ENOSYS and UV_EAGAIN are precisely the return values that indicate that it makes sense to use uv_write() after a failed uv_try_write(), correct?

Yes, but before libuv/libuv@ee24ce9 that path was always taken. The if (err < 0) below was dead code but now it's not.

This failure is showing up on v10.x now, specifically on centos7. Did this get figured out on master? Should we mark it as flaky?

This failure is showing up on v10.x now, specifically on centos7. Did this get figured out on master? Should we mark it as flaky?

Using ncu-ci walk, I can only find one recent example, and it's a node-daily-v10.x-staging daily job.

Saw this one failing on OS X today: https://ci.nodejs.org/job/node-test-commit-osx/33696/nodes=osx1015/testReport/junit/(root)/test/parallel_test_tls_ticket_cluster/
