Node: test: flaky parallel/test-net-connect-local-error on macOS & freeBSD

Created on 10 May 2017 · 13Comments · Source: nodejs/node

Version: master
Platform: macOS,freeBSD
Subsystem: test

https://ci.nodejs.org/job/node-test-commit-freebsd/8933/nodes=freebsd11-x64/tapResults/
https://ci.nodejs.org/job/node-test-commit-osx/9696/nodes=osx1010/tapResults/

738 parallel/test-net-connect-local-error   
duration_ms 
0.162
severity    
fail
stack   
Mismatched onError function calls. Expected 1, actual 0.
    at Object.<anonymous> (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1010/test/parallel/test-net-connect-local-error.js:15:27)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Function.Module.runMain (module.js:605:10)
    at startup (bootstrap_node.js:144:16)
    at bootstrap_node.js:537:3

freebsd macos net test

Source

refack

Most helpful comment

Also: common.PORT and common.PORT + 1 in sequential is safe (assuming no non-Node.js processes grab the ports) so you could just consider moving to sequential if the logic gets too hairy otherwise. It's probably not anyone's first choice for a solution, but it probably shouldn't be the last choice either :-D

Also also, and this starts to enter rant territory so feel free to tune out right now: There seems to be a reluctance to move tests to sequential out of concerns for making the test suite run longer. I know I've certainly been reluctant. I believe now that these concerns are massively overblown. First, a lot of CI hosts (most?) don't run any tests in parallel and these include our absolute far-and-away slowest hosts. So moving something to sequential will have no impact whatsoever on the test suite duration for the slowest hosts. And on the faster hosts, we're generally talking about shaving off something on the order of a tenth of a second. And a lot of that benefit is probably lost by the introduction of somewhat more complicated code to make it safe to run in parallel. So we're making the code more complex but not actually getting any benefit out of it. Or at least, that's my theory. It would be fun to compare how long the test suite takes if everything in parallel were moved to sequential. I might do that right now.

UPDATE (still in rant mode, so feel free to skip): Ran the test benchmark described above. Normal parallel test run on my computer was about 90 seconds. Running the same tests sequentially took about 250 seconds. That's pretty good, but it's worth noting that with 1333 tests, each test moved from parallel to sequential will cost about 120ms. The question becomes: Is the added code complexity worth saving 120ms on a test suite that (once you add in the addon tests and tests in sequential and so on) takes about 2.5 minutes to run. For any significant complexity increase, the answer is always going to be "no".

Trott on 14 May 2017

👍2

All 13 comments

Can I take this one?

sebastianplesciuc on 10 May 2017

@sebastianplesciuc you like these ones 😉 . dibs is yours.
P.S. I've been told that the use of common.PORT is fragile in macOS / freeBSD.
There's a general goal to use 0 as port (if possible) and then figure out which port the OS assigned.

refack on 10 May 2017

👍1

Ref: #12376

refack on 10 May 2017

I think that's the case here as well.

I think there are two issues with this, one is that server.close() might be called before the client's callback which is the error on line 15 I believe.

Another issue might arise with port + 1. We have no guarantee that this is an empty port when this runs. A way of fixing this might be creating another server with port 0 and assigning it's port to the client socket in it's close callback. Thoughts? Is it guaranteed to be in use while the callback is running from net.Server close event?

sebastianplesciuc on 10 May 2017

The close is definitely in a bad place
When net.Server.close call the cb the server will not answer anymore, so it's a good time to try to connect to it (to get an error)
Another option is to loop port + 1...5, if that's simpler.

Worse case we can move it to /sequential/ where it will run alone.

refack on 10 May 2017

@sebastianplesciuc You might appreciate this. I've been looking at this issue's brother #12951.
These tests are run in parallel first this one, then the other.
In the other one there's a server that's supposed to receive 6 connections, instead it received 7.
I wonder where that 7th request comes from 🤣

refack on 12 May 2017

@refack I'm not sure I get it :) You're suggesting that the connection in that test comes from this one? How does that work?

sebastianplesciuc on 13 May 2017

@refack OOOOH! I think I get it now. So when that test fails, this test fails because it connects from this test to the server from that test. So fixing this test would fix that one?

sebastianplesciuc on 13 May 2017

😄1

Yep

refack on 13 May 2017

@refack

P.S. I've been told that the use of common.PORT is fragile in macOS / freeBSD.
There's a general goal to use 0 as port (if possible) and then figure out which port the OS assigned.

Using 0 to choose a random port can be fragile on BSD systems (including macOS) too, because of how SO_REUSEADDR works in their implementations of TCP. Recently I observed test/parallel/test-http-extra-response failing on my machine trying to communicate with the server of an autocompletion plugin for Vim instead of the server created in the test itself. The problem is that when a process binds to port 0 with SO_REUSEADDR, the kernel may give it a port that another process is already listening on if one of them binds to a concrete address (like 127.0.0.1) and the other — to a wildcard one. The logic that the kernel uses to route requests between them is not very clear to me.

/cc @nodejs/testing

aqrln on 14 May 2017

P.S. I've been told that the use of common.PORT is fragile in macOS / freeBSD.

Hmmm, this sounds like there's a misunderstanding somewhere. There's nothing macOS or FreeBSD specific about the common.PORT stuff as far as I know. And there's nothing fragile about common.PORT.

The reason we don't typically want to use common.PORT in parallel is to prevent this scenario:

Test A uses port 0 for something. Test B uses common.PORT for something. Test A starts to run first. The OS gives it an open port that just happens to be the same value as common.PORT will be when Test B runs. Test B starts to run but Test A is already using the port it requests via common.PORT and so Test B fails.

If Test B were in sequential rather than parallel, then there wouldn't be a problem.

The port collision scenario described above may sound unlikely, but it has been observed in CI results (or at least things that look suspiciously like it).

Trott on 14 May 2017

👍2

There seems to be a reluctance to move tests to sequential ... each test moved from parallel to sequential will cost about 120ms

@Trott IMHO it's not just the perf cost; it's the implicit assumption that testing in parallel is more vigorous, hence more rigorous. which is kinda true. A good mechanism could be to add a fourth flag to parallel.status: TRY_SEQ, to flag the harness that for a particular test is it fail in parallel, decision should be relegated until the test is run alone.
[that was me being grandiloquent]

refack on 15 May 2017

Was this page helpful?

0 / 5 - 0 ratings