Node: flaky: test-child-process-pipe-dataflow

Created on 7 Feb 2019  路  27Comments  路  Source: nodejs/node

Job: https://ci.nodejs.org/job/node-test-binary-windows/23616/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=3/testReport/junit/(root)/test/parallel_test_child_process_pipe_dataflow/
Worker: https://ci.nodejs.org/computer/test-azure_msft-win10-x64-3/
Code:
https://github.com/gireeshpunathil/node/blob/b1f82e4342f8a630b1ef83cd33781a725428f569/test/parallel/test-child-process-pipe-dataflow.js#L44-L46
Message:

assert.js:86
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:

1 !== 0

    at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-pipe-dataflow.js:45:12)
    at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows\test\common\index.js:367:15)
    at ChildProcess.emit (events.js:197:13)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:254:12)
CI / flaky test test windows

All 27 comments

if grep found less bytes than expected, that should result in another assertion.
if grep did not find any bytes, then it should hang (our pipe is fused into its stdin) leading to timeout
if grep shows up exit code 1 would mean strange: Is it that grep is not available on the system? in which case the failure should be consistent!

Is this system picked part of regular CI? @refack - can you please check if it has grep in the first place? My test semantics expects UNIX simulated grep, not the windows native one (though I am not sure how different they are in terms of arguments). thanks!

ran a 3K times on the failing system but no reproduce.

It turns out that the assumptions about grep were not right - it works like in any other system, and there is nothing special. At the moment the best thing I believe is to catch the standard error streams of the child process as well, so that we know what happened to those.

https://ci.nodejs.org/job/node-test-binary-windows/24562/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/console

test-azure_msft-win10-x64-4

00:22:57 not ok 61 parallel/test-child-process-pipe-dataflow
00:22:57   ---
00:22:57   duration_ms: 0.175
00:22:57   severity: fail
00:22:57   exitcode: 1
00:22:57   stack: |-
00:22:57     got unexpected data from child #0:
00:22:57     cat
00:22:57     got unexpected data from child #0:
00:22:57     :
00:22:57     got unexpected data from child #0:
00:22:57      
00:22:57     got unexpected data from child #0:
00:22:57     write error
00:22:57     got unexpected data from child #0:
00:22:57     : 
00:22:57     got unexpected data from child #0:
00:22:57     No space left on device
00:22:57     got unexpected data from child #0:
00:22:57     
00:22:57     
00:22:57     assert.js:85
00:22:57       throw new AssertionError(obj);
00:22:57       ^
00:22:57     
00:22:57     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
00:22:57     
00:22:57     1 !== 0
00:22:57     
00:22:57         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-pipe-dataflow.js:42:14)
00:22:57         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows\test\common\index.js:358:15)
00:22:57         at ChildProcess.emit (events.js:198:13)
00:22:57         at Process.ChildProcess._handle.onexit (internal/child_process.js:254:12)
00:22:57   ...

P.S. I want to add a lint rule that looks for asserts with no message in loops 馃

https://ci.nodejs.org/job/node-test-binary-windows/24683/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=1/console

test-azure_msft-win10-x64-5

10:32:10 not ok 62 parallel/test-child-process-pipe-dataflow
10:32:10   ---
10:32:10   duration_ms: 0.212
10:32:10   severity: fail
10:32:10   exitcode: 1
10:32:10   stack: |-
10:32:10     assert.js:85
10:32:10       throw new AssertionError(obj);
10:32:10       ^
10:32:10     
10:32:10     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
10:32:10     
10:32:10     1 !== 0
10:32:10     
10:32:10         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-child-process-pipe-dataflow.js:42:14)
10:32:10         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows\test\common\index.js:358:15)
10:32:10         at ChildProcess.emit (events.js:198:13)
10:32:10         at Process.ChildProcess._handle.onexit (internal/child_process.js:254:12)
10:32:10   ...

looking for suggestions here!
the need for data flow integrity check between a -> b -> c is valid, and essential. However, selection of a, b, c and their interplay (being independent OS processes) seem to be proving less robust and less trustworthy?

https://ci.nodejs.org/job/node-test-binary-windows-2/890/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=1/console

test-azure_msft-win10-x64-2

18:23:33 not ok 130 parallel/test-child-process-pipe-dataflow
18:23:33   ---
18:23:33   duration_ms: 0.147
18:23:33   severity: fail
18:23:33   exitcode: 1
18:23:33   stack: |-
18:23:33     assert.js:89
18:23:33       throw new AssertionError(obj);
18:23:33       ^
18:23:33     
18:23:33     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
18:23:33     
18:23:33     1 !== 0
18:23:33     
18:23:33         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows-2\test\parallel\test-child-process-pipe-dataflow.js:46:14)
18:23:33         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows-2\test\common\index.js:374:15)
18:23:33         at ChildProcess.emit (events.js:200:13)
18:23:33         at Process.ChildProcess._handle.onexit (internal/child_process.js:272:12) {
18:23:33       generatedMessage: true,
18:23:33       code: 'ERR_ASSERTION',
18:23:33       actual: 1,
18:23:33       expected: 0,
18:23:33       operator: 'strictEqual'
18:23:33     }
18:23:33   ...

as there was no associated stderr messages from child, I would assume this is an early exit of one of the child process (with exit code 1) for an unknown reason. I know this info does not help much, but just stating, to localize the problem statement.

I'll open a PR to add more debugging data so that if this happens again, at least we'll know which subprocess is failing.

I'll open a PR to add more debugging data so that if this happens again, at least we'll know which subprocess is failing.

https://github.com/nodejs/node/pull/27790

https://ci.nodejs.org/job/node-test-binary-windows-2/4422/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/console

test-azure_msft-win10-x64-2

00:11:03 not ok 135 parallel/test-child-process-pipe-dataflow
00:11:03   ---
00:11:03   duration_ms: 0.212
00:11:03   severity: fail
00:11:03   exitcode: 1
00:11:03   stack: |-
00:11:03     Can't clean tmpdir: c:\workspace\node-test-binary-windows-2\test\.tmp.134
00:11:03     Files blocking: [ 'data.txt' ]
00:11:03     
00:11:03     assert.js:93
00:11:03       throw new AssertionError(obj);
00:11:03       ^
00:11:03     
00:11:03     AssertionError [ERR_ASSERTION]: child 1 exited with code 1
00:11:03         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows-2\test\parallel\test-child-process-pipe-dataflow.js:49:14)
00:11:03         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows-2\test\common\index.js:372:15)
00:11:03         at ChildProcess.emit (events.js:223:5)
00:11:03         at Process.ChildProcess._handle.onexit (internal/child_process.js:270:12) {
00:11:03       generatedMessage: false,
00:11:03       code: 'ERR_ASSERTION',
00:11:03       actual: 1,
00:11:03       expected: 0,
00:11:03       operator: 'strictEqual'
00:11:03     }
00:11:03   ...

this does not look like the usual symptom this test was manifesting - cat failure or grep failure. There was an infra failure yesterday (25/11) - could this be an indirect side effect of that?

this does not look like the usual symptom this test was manifesting - cat failure or grep failure. There was an infra failure yesterday (25/11) - could this be an indirect side effect of that?

It happened again a few hours ago:

https://ci.nodejs.org/job/node-test-binary-windows-2/4716/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/console

test-azure_msft-win10-x64-2

00:09:26 not ok 135 parallel/test-child-process-pipe-dataflow
00:09:26   ---
00:09:26   duration_ms: 0.194
00:09:26   severity: fail
00:09:26   exitcode: 1
00:09:26   stack: |-
00:09:26     assert.js:98
00:09:26       throw new AssertionError(obj);
00:09:26       ^
00:09:26     
00:09:26     AssertionError [ERR_ASSERTION]: child 1 exited with code 1
00:09:26         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows-2\test\parallel\test-child-process-pipe-dataflow.js:49:14)
00:09:26         at ChildProcess.<anonymous> (c:\workspace\node-test-binary-windows-2\test\common\index.js:372:15)
00:09:26         at ChildProcess.emit (events.js:219:5)
00:09:26         at Process.ChildProcess._handle.onexit (internal/child_process.js:274:12) {
00:09:26       generatedMessage: false,
00:09:26       code: 'ERR_ASSERTION',
00:09:26       actual: 1,
00:09:26       expected: 0,
00:09:26       operator: 'strictEqual'
00:09:26     }
00:09:26   ...

Since stdio can still be open when the subprocess exits, I wonder if we're losing information by asserting so early on the error code?

seen again: https://ci.nodejs.org/job/node-test-binary-windows-js-suites/RUN_SUBSET=1,nodes=win2012r2-COMPILED_BY-vs2019-x86/2012/console

09:04:40 not ok 78 parallel/test-child-process-pipe-dataflow
09:04:40   ---
09:04:40   duration_ms: 0.903
09:04:40   severity: fail
09:04:40   exitcode: 1
09:04:40   stack: |-
09:04:40     assert.js:102
09:04:40       throw new AssertionError(obj);
09:04:40       ^
09:04:40     
09:04:40     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
09:04:40     + actual - expected
09:04:40     
09:04:40     + '131072'
09:04:40     - '1048576'
09:04:40         at Socket.<anonymous> (C:\workspace\node-test-binary-windows-js-suites\node\test\parallel\test-child-process-pipe-dataflow.js:64:12)
09:04:40         at Socket.<anonymous> (C:\workspace\node-test-binary-windows-js-suites\node\test\common\index.js:362:15)
09:04:40         at Socket.emit (events.js:321:20)
09:04:40         at addChunk (_stream_readable.js:297:12)
09:04:40         at readableAddChunk (_stream_readable.js:273:9)
09:04:40         at Socket.Readable.push (_stream_readable.js:214:10)
09:04:40         at Pipe.onStreamRead (internal/stream_base_commons.js:186:23) {
09:04:40       generatedMessage: true,
09:04:40       code: 'ERR_ASSERTION',
09:04:40       actual: '131072',
09:04:40       expected: '1048576',
09:04:40       operator: 'strictEqual'
09:04:40     }

this is a new assertion: basically saying that some data got lost in the pipes.

could this have any relation with https://github.com/nodejs/node/issues/31317? @ronag @lpinca
(I wrote the test, so I can explain it if need be)

could this have any relation with #31317? @ronag @lpinca

I don't think so. I will try to dig into it.

I've had weird behaviour from Pipe.onStreamRead (internal/stream_base_commons.js:186:23) in general on vs2019.

See, https://github.com/nodejs/node/pull/31806#discussion_r382938926.

thanks @ronag for looking into this!

Not sure even that weirdness can explain this failure:

In the linked comment you seem to be getting extract callbacks, beyond the end-of-file?
Here we seem to have stopped reading in a connected pipe - cat <file> | grep | wc kind of a pipe chain. Without debugging too much, looking at the truncation (only 128 KB passed through the pipe instead of 1MB) , I can say one of the party closed their stream - could be:

  • write end of cat,
  • read end of grep,
  • write end of grep,
  • read end of wc

Let me think further on what could have happened or what can be done to troubleshoot.

I've got a Windows VM with that platform. If I get some time I might try to dig into this as well over the weekend. If so, I'll send you a mail and maybe we can do a collaborative effort.

@gireeshpunathil @ronag I can't reproduce the issue locally (Windows 10).

./tools/test.py -J --repeat=1000 test/parallel/test-child-process-pipe-dataflow.js

Node.js was built with Visual Studio Build Tools 2019.

@lpinca - since the last conversation on this, 4fedb702b221469a65f0f2bc267ae08c8159e72a has landed, which would have potentially fixed it.

Given that you have this rare system combination, if you have time, can you try without this commit and see if it breaks? that will be a great validation of the issue!

I tried reverting https://github.com/nodejs/node/commit/4fedb702b221469a65f0f2bc267ae08c8159e72a and I had a couple of failures over multiple runs:

$ ./tools/test.py -J --repeat=1000 test/parallel/test-child-process-pipe-dataflow.js
=== release test-child-process-pipe-dataflow ===
Path: parallel/test-child-process-pipe-dataflow
assert.js:102
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:

'0' !== '1048576'

    at Socket.<anonymous> (C:\Users\luigi\Development\node\test\parallel\test-child-process-pipe-dataflow.js:64:12)
    at Socket.<anonymous> (C:\Users\luigi\Development\node\test\common\index.js:362:15)
    at Socket.emit (events.js:315:20)
    at addChunk (_stream_readable.js:296:12)
    at readableAddChunk (_stream_readable.js:272:9)
    at Socket.Readable.push (_stream_readable.js:213:10)
    at Pipe.onStreamRead (internal/stream_base_commons.js:186:23) {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: '0',
  expected: '1048576',
  operator: 'strictEqual'
}
Command: C:\Users\luigi\Development\node\out\Release\node.exe C:\Users\luigi\Development\node\test\parallel\test-child-process-pipe-da
taflow.js
=== release test-child-process-pipe-dataflow ===
Path: parallel/test-child-process-pipe-dataflow
assert.js:102
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:

'0' !== '1048576'

    at Socket.<anonymous> (C:\Users\luigi\Development\node\test\parallel\test-child-process-pipe-dataflow.js:64:12)
    at Socket.<anonymous> (C:\Users\luigi\Development\node\test\common\index.js:362:15)
    at Socket.emit (events.js:315:20)
    at addChunk (_stream_readable.js:296:12)
    at readableAddChunk (_stream_readable.js:272:9)
    at Socket.Readable.push (_stream_readable.js:213:10)
    at Pipe.onStreamRead (internal/stream_base_commons.js:186:23) {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: '0',
  expected: '1048576',
  operator: 'strictEqual'
}
Command: C:\Users\luigi\Development\node\out\Release\node.exe C:\Users\luigi\Development\node\test\parallel\test-child-process-pipe-da
taflow.js
[01:24|% 100|+ 998|-   2]: Done

I went back to current master and after several runs I got a failure. It seems the issue was not completely fixed by https://github.com/nodejs/node/commit/4fedb702b221469a65f0f2bc267ae08c8159e72a. The failure rate is very small, it's very hard to reproduce.

EDIT:

On this machine (current master) it has a failure rate of 0.5 %

[11:20|% 100|+ 4975|-  25]: Done

Saw this one yesterday in https://ci.nodejs.org/job/node-test-binary-windows-js-suites/3294/RUN_SUBSET=1,nodes=win2016-COMPILED_BY-vs2017/testReport/junit/(root)/test/parallel_test_child_process_pipe_dataflow/:

Error Message
fail (1)
Stacktrace
assert.js:102
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:

'0' !== '1048577'

    at Socket.<anonymous> (C:\workspace\node-test-binary-windows-js-suites\node\test\parallel\test-child-process-pipe-dataflow.js:66:12)
    at Socket.<anonymous> (C:\workspace\node-test-binary-windows-js-suites\node\test\common\index.js:362:15)
    at Socket.emit (events.js:315:20)
    at addChunk (_stream_readable.js:299:12)
    at readableAddChunk (_stream_readable.js:275:9)
    at Socket.Readable.push (_stream_readable.js:216:10)
    at Pipe.onStreamRead (internal/stream_base_commons.js:191:23) {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: '0',
  expected: '1048577',
  operator: 'strictEqual'
}

Was this solved in https://github.com/nodejs/node/commit/776905ef99ec414700d6d12bf9c17ce788b7f369? @gireeshpunathil is this still something we need to look into?

looking at the dates of landing that PR and the latest issue report (march 13 and April 20) I would say it is not resolved fully. However, we might be seeing a different issue here:

1 !== 0 // exit code of one of the child bad
131072 < 1048576 // truncated data in one of the child

I guess we are not seeing these anymore, while

0 !== 1048576 || 1048577 // full data loss in the pipes

start seeing this, which looks to be a new issue to me.

  • abrupt child exit cannot be, as we have checks in place
  • stream errors cannot be, as those should have resulted in child abnormal exits?
  • pre-mature stream close that was absorbed adn not thrown - possible
Was this page helpful?
0 / 5 - 0 ratings

Related issues

sandeepks1 picture sandeepks1  路  3Comments

Brekmister picture Brekmister  路  3Comments

filipesilvaa picture filipesilvaa  路  3Comments

addaleax picture addaleax  路  3Comments

danielstaleiny picture danielstaleiny  路  3Comments