Node: test: investigate flakiness of `parallel/test-async-wrap-getasyncid`

Created on 14 May 2017  ·  35Comments  ·  Source: nodejs/node

  • Version: master
  • Platform: Windows ~smartos15-64 ++~
  • Subsystem: test,async-hooks

file:

https://github.com/nodejs/node/blob/master/test/parallel/test-async-wrap-getasyncid.js

falakings

  1. https://ci.nodejs.org/job/node-test-binary-windows/8471/
  2. https://ci.nodejs.org/job/node-test-binary-windows/8362/
  3. https://ci.nodejs.org/job/node-test-binary-windows/8402/

TAP output

2   parallel/test-async-wrap-getasyncid 
duration_ms 0.232
severity    fail
stack   
Mismatched <anonymous> function calls. Expected 1, actual 0.
    at Object.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\3\VS_VERSION\vs2015\label\win2008r2\test\parallel\test-async-wrap-getasyncid.js:155:42)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Function.Module.runMain (module.js:605:10)
    at startup (bootstrap_node.js:144:16)
    at bootstrap_node.js:561:3
(node:5848) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.

Hypothesis

Seems like a net.server sometimes doesn't complete it's creation before the test exits...
const server = net.createServer(common.mustCall((socket) => {

CI / flaky test arm async_wrap test windows

Most helpful comment

@refack @Trott I can reproduce on SmartOS, will look into this. I would be surprised if I could make any significant progress on this before Monday though, is that ok?

All 35 comments

@addaleax @AndreasMadsen is it possible that the anonymous scope + eager gc calls, cleans up the server before it has a chance to (h)->flags = UV__HANDLE_REF?

As far as I understand common.mustCall(), all missed functions are listed on exit. If you were right, that would mean that the .listen() handler wouldn’t get called, but it’s not listed so it seems safe to assume it does get called?

The only explanation that makes sense to me, is that the client connects, sends data, and shutdown itself before the actual server received any connection. Thus the libuv loop doesn't know that there a connection incoming and will just exit the process.

We could wait until the server received its connection before running handle.shutdown() or alternatively simply remove the common.mustCall from net.createServer().

In any case, the net.createServer() should actually not call socket.end(); or socket.destroy(); since that creates a race conditon between socket.destroy(); and handle.shutdown().

safe to assume it does get called?

Yep...

We could wait until the server received its connection before running handle.shutdown()

👍 handle.shutdown() only in req.on('end')
I'm on it.

handle.shutdown() only in req.on('end')

We need to wait for both server.on('connection') and wreq.oncomplete. In particular, req.on('end') will not happen at all before handle.shutdown().

@refack was this issue solved? I don't remember.

I think someone somewhere said something about #12854 might solve this... 🤔
Anyway I did not see it flake in a while.
Running another stress https://ci.nodejs.org/job/node-stress-single-test/nodes=win10/1232/

I think the script path was spelled wrong, rerunning: https://ci.nodejs.org/job/node-stress-single-test/1240/

✔️ More rigorous stress (-j 16 parallel/test-async-wrap-getasyncid X 100): https://ci.nodejs.org/job/node-stress-single-test/1242/nodes=win10/

Closing

Hello form smartos15-64

    9   parallel/test-async-wrap-getasyncid 
duration_ms 0.627
severity    fail
stack   
Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
    at Object.exports.mustCall (/home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos15-64/test/common/index.js:483:10)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos15-64/test/parallel/test-async-wrap-getasyncid.js:162:42)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Function.Module.runMain (module.js:605:10)
    at startup (bootstrap_node.js:158:16)
(node:53941) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.

I am getting a timeout of this test today on MacOS 10.12.4 and can reproduce this throughout the day (I have not rebooted my machine yet):

▶ tools/test.py parallel/test-async-wrap-getasyncid
=== release test-async-wrap-getasyncid ===
Path: parallel/test-async-wrap-getasyncid
(node:28564) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.
Command: out/Release/node /Users/joyee/projects/node/test/parallel/test-async-wrap-getasyncid.js
--- TIMEOUT ---
[01:00|% 100|+   0|-   1]: Done

I am getting a timeout of this test today on MacOS 10.12.4 and can reproduce this throughout the day (I have not rebooted my machine yet):

Do you have a way of knowing which case without breaking the spell? (especially interested if it's the one starting at L160)

node-test-binary-windows > #9523 > 0,vs2015,win2008r2:

not ok 6 parallel/test-async-wrap-getasyncid
  ---
  duration_ms: 0.272
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
        at Object.exports.mustCall (c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win2008r2\test\common\index.js:486:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win2008r2\test\parallel\test-async-wrap-getasyncid.js:162:42)
        at Module._compile (module.js:569:30)
        at Object.Module._extensions..js (module.js:580:10)
        at Module.load (module.js:503:32)
        at tryModuleLoad (module.js:466:12)
        at Function.Module._load (module.js:458:3)
        at Function.Module.runMain (module.js:605:10)
        at startup (bootstrap_node.js:158:16)
    (node:3320) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.
  ...

@refack Bumped into this one again, this line

dns.resolve6('::1', () => {})

causes a hang on my machine when ipv6 is disabled automatically since my current ISP does not support that. I guess I could not stably reproduce this before because I keep moving around...

BTW dig ::1 AAAA hangs on my machine as well

@refack Bumped into this one again, this line

dns.resolve6('::1', () => {})

That should be simpler to fix then the flakiness, especially since this is /internet/ behaviour...
Ref: https://github.com/nodejs/node/pull/13883

macOS:

not ok 74 parallel/test-async-wrap-getasyncid
  ---
  duration_ms: 1.20
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
        at Object.exports.mustCall (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1010/test/common/index.js:484:10)
        at Object.<anonymous> (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1010/test/parallel/test-async-wrap-getasyncid.js:162:42)
        at Module._compile (module.js:569:30)
        at Object.Module._extensions..js (module.js:580:10)
        at Module.load (module.js:503:32)
        at tryModuleLoad (module.js:466:12)
        at Function.Module._load (module.js:458:3)
        at Function.Module.runMain (module.js:605:10)
        at startup (bootstrap_node.js:158:16)
    (node:76986) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.
  ...

https://ci.nodejs.org/job/node-test-commit-osx/11049/nodes=osx1010/

Just tested it 1000 times on my own machine, no error :(

Just tested it 1000 times on my own machine, no error :(

I've looked at the code 10 times, I can't see a reason. Maybe we should just shuffle thing around in that block... It might work...

@refack We could start out by extracting the offending test to a separate file. Generally speaking, I actually don't like having this many logically separate tests in one file, it makes it hard to debug.

@refack We could start out by extracting the offending test to a separate file. Generally speaking, I actually don't like having this many logically separate tests in one file, it makes it hard to debug.

I'll get to it after my nap 😪

@refack We could start out by extracting the offending test to a separate file. Generally speaking, I actually don't like having this many logically separate tests in one file, it makes it hard to debug.

I'll get to it after my nap 😪

Well I napped on it, and I'm now thinking there's probably a non-async-hooks bug somewhere in there, so I'll _copy_ the flaky block to a new file, as I don't want to make this disappear without a solution. Maybe do a lite pummel in the new test (loop a 1000 times or something)

On CI, with this recent recurrence (perhaps since disabling snapshots?), we're (so far) only seeing it on SmartOS, right?

/cc @nodejs/platform-smartos

On CI, with this recent recurrence (perhaps since disabling snapshots?), we're (so far) only seeing it on SmartOS, right?

Nope. macOS fail is post snapshotpocalypse https://ci.nodejs.org/job/node-test-commit-osx/11049/nodes=osx1010/
/cc @nodejs/platform-macos

@refack @Trott I can reproduce on SmartOS, will look into this. I would be surprised if I could make any significant progress on this before Monday though, is that ok?

@misterdjules of course. Even Amazing. This issue has been there for a 2 months with relatively low frequency of failures. And we're stumped.

From a quick look, it seems to me that there is a race between the client and the server.

In the test's current form, the server is closed by the _client_, because I would think the test currently assumes that if the client successfully wrote data on the client connection and closed it, then the server must have called its connection callback.

However, it is not necessarily the case, and I would think the server instance should be closed only when we can be sure that the server ran its connection callback which is...in the connection callback.

I'm currently stress testing the following patch:

diff --git a/test/parallel/test-async-wrap-getasyncid.js b/test/parallel/test-async-wrap-getasyncid.js
index 6b9d33b..778ca4c 100644
--- a/test/parallel/test-async-wrap-getasyncid.js
+++ b/test/parallel/test-async-wrap-getasyncid.js
@@ -160,6 +160,7 @@ if (common.hasCrypto) {
   const stream_wrap = process.binding('stream_wrap');
   const tcp_wrap = process.binding('tcp_wrap');
   const server = net.createServer(common.mustCall((socket) => {
+    server.close();
     socket.on('data', (x) => {
       socket.end();
       socket.destroy();
@@ -176,7 +177,6 @@ if (common.hasCrypto) {

     sreq.oncomplete = common.mustCall(() => {
       handle.close();
-      server.close();
     });

     wreq.handle = handle;

on SmartOS and I'll report results back. In the meantime I'd appreciate initial feedback on the analysis above.

I'll report results back.

parallel/test-async-wrap-getasyncid has been running with the change mentionned above successfully for > 12 hours without any failure on the same SmartOS VM on which I was able to consistently reproduce the problem < 1 hour. I'll submit a PR asap, which might be no sooner than Monday (today is a day off for me).

FWIW, I ran parallel/test-async-wrap-getasyncid with the changes in #14329 for > 48 hours without a single failure on SmartOS.

Fixed by 2da1af0a9320d613ca0107ed39087e23b9719369.

Windows VS2017 is back.
https://ci.nodejs.org/job/node-test-binary-windows/11616/COMPILED_BY=vs2017,RUNNER=win2016,RUN_SUBSET=2/console

not ok 479 sequential/test-async-wrap-getasyncid
  ---
  duration_ms: 0.215
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED
  ...

And ARM:

not ok 263 sequential/test-async-wrap-getasyncid
  ---
  duration_ms: 1.736
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (/home/iojs/build/workspace/node-test-binary-arm/test/common/index.js:485:10)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-binary-arm/test/sequential/test-async-wrap-getasyncid.js:19:35)
        at Module._compile (module.js:600:30)
        at Object.Module._extensions..js (module.js:611:10)
        at Module.load (module.js:521:32)
        at tryModuleLoad (module.js:484:12)
        at Function.Module._load (module.js:476:3)
        at Function.Module.runMain (module.js:641:10)
        at startup (bootstrap_node.js:187:16)
    (node:1457) [DEP0064] DeprecationWarning: tls.createSecurePair() is deprecated. Please use tls.Socket instead.
    (node:1457) [INSPECTOR_ASYNC_STACK_TRACES_NOT_AVAILABLE] Warning: Warning: Async stack traces in debugger are not available on 32bit platforms. The feature is disabled.
  ...

https://ci.nodejs.org/job/node-test-binary-arm/10781/RUN_SUBSET=1,label=pi2-raspbian-wheezy/console

@refack I'm thinking that it might be beneficial to open separate issues for these 2 new occurrences. The reason is that the windows failure is a crash, and so seems like a different problem than the original issue.

The failure on ARM seems more similar to the original issue, but it's still different (2 actual calls vs 0 actual call in the original issue), and so the cause might be different.

Being able to discuss these separate problems in separate issues could help make discussions less confusing.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

willnwhite picture willnwhite  ·  3Comments

fanjunzhi picture fanjunzhi  ·  3Comments

srl295 picture srl295  ·  3Comments

cong88 picture cong88  ·  3Comments

addaleax picture addaleax  ·  3Comments