https://ci.nodejs.org/job/node-test-commit-linux/16566/nodes=debian8-x86/console
not ok 1585 parallel/test-tls-buffersize
---
duration_ms: 0.213
severity: fail
stack: |-
I'm not sure if there's much point anymore in creating separate issues for all of these debian8-x86, ubuntu1404 and ubuntu1204 failures without a stack trace. They seem to affect literally all of the tests so we might eventually have an issue for every single test that exists on each of those platforms.
/cc @nodejs/build It would be nice if someone could look into this. The lack of a stack trace or signal code is puzzling.
agree on the lack of hints / artifacts associated with these failure types - they don't help us in any way other than telling us that the test failed. My debugging method in those cases has always been to run the test a thousand times locally in the hope of reproducing it, reduce the code for as long as the problem persists, etc. (time consuming and often unsuccessful).
My question is: what type of failure is this - an assertion failure / unforeseen exception / crash / bad exit caught by the Python driver / forced failure? Is it a tap.assert(false) or something?
@nodejs/build someone with test access should get into one of these machines straight after a failure and run the test manually to see what's up.
@apapirovski
They seem to affect literally all of the tests so we might eventually have an issue for every single test that exists on each of those platforms
Are you saying this is more than just that one test that fails like this?
@rvagg check the test issues with the flaky label. Probably more than 50% are similar to this one.
Are these tests resource intensive in any way? Is it possible that they crash when there are zombie jobs running in the background that conflict with them in some fashion - maybe sharing resources, or a zombie holding on to too much memory? I've been seeing a few make test-ci runs and their associated children left zombied on some Linux machines recently; not many, but enough that I could imagine it's connected.
I've observed some similar failures on Docker containers today:
https://ci.nodejs.org/job/node-test-commit-linux/16729/nodes=alpine37-container-x64/ running on https://ci.nodejs.org/computer/test-softlayer-alpine37_container-x64-1 which runs on test-softlayer-ubuntu1604_docker-x64-1
not ok 1778 async-hooks/test-graph.intervals
---
duration_ms: 0.240
severity: crashed
stack: |-
oh no!
exit code: CRASHED (Signal: 11)
https://ci.nodejs.org/job/node-test-commit-linux-containered/2544/nodes=ubuntu1604_sharedlibs_openssl110_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2
not ok 1931 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max
---
duration_ms: 21.243
severity: crashed
stack: |-
oh no!
exit code: CRASHED (Signal: 9)
That same container had a different failure on its previous run. It doesn't look the same, but perhaps it's somehow related:
https://ci.nodejs.org/job/node-test-commit-linux-containered/2543/nodes=ubuntu1604_sharedlibs_fips20_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2
00:12:29.798 not ok 691 parallel/test-http-client-timeout-agent
00:12:29.798 ---
00:12:29.798 duration_ms: 0.824
00:12:29.798 severity: fail
00:12:29.798 stack: |-
00:12:29.798 res#0 data:0
00:12:29.798 res#0 end
00:12:29.798 res#2 data:2
00:12:29.799 res#2 end
00:12:29.799 res#4 data:4
00:12:29.799 res#4 end
00:12:29.799 res#6 data:6
00:12:29.799 res#6 end
00:12:29.799 res#8 data:8
00:12:29.799 res#8 end
00:12:29.799 res#10 data:10
00:12:29.799 res#10 end
00:12:29.799 res#12 data:12
00:12:29.799 res#12 end
00:12:29.800 res#14 data:14
00:12:29.800 res#14 end
00:12:29.800 res#16 data:16
00:12:29.800 res#16 end
00:12:29.800 res#18 data:18
00:12:29.800 res#18 end
00:12:29.800 res#20 data:20
00:12:29.800 res#20 end
00:12:29.800 res#22 data:22
00:12:29.800 res#22 end
00:12:29.801 res#24 data:24
00:12:29.801 res#24 end
00:12:29.801 res#26 data:26
00:12:29.801 res#26 end
00:12:29.801 res#28 data:28
00:12:29.801 res#28 end
00:12:29.801 req#1 timeout
00:12:29.801 req#3 timeout
00:12:29.801 req#5 timeout
00:12:29.801 req#7 timeout
00:12:29.802 req#9 timeout
00:12:29.802 req#11 timeout
00:12:29.802 req#13 timeout
00:12:29.802 req#15 timeout
00:12:29.802 req#17 timeout
00:12:29.802 req#19 timeout
00:12:29.802 req#21 timeout
00:12:29.802 req#23 timeout
00:12:29.802 req#25 timeout
00:12:29.803 req#27 timeout
00:12:29.803 req#29 timeout
00:12:29.803 req#0 timeout
00:12:29.803 req#28 close
00:12:29.803 req#26 close
00:12:29.803 req#24 close
00:12:29.803 req#22 close
00:12:29.803 req#20 close
00:12:29.803 req#18 close
00:12:29.803 req#16 close
00:12:29.804 req#14 close
00:12:29.804 req#12 close
00:12:29.804 req#10 close
00:12:29.804 req#8 close
00:12:29.804 req#6 close
00:12:29.804 req#4 close
00:12:29.804 req#2 close
00:12:29.804 req#0 close
00:12:29.804 req#29 error
00:12:29.804 req#29 close
00:12:29.804 req#27 error
00:12:29.805 req#27 close
00:12:29.805 req#25 error
00:12:29.805 req#25 close
00:12:29.805 req#23 error
00:12:29.805 req#23 close
00:12:29.805 req#21 error
00:12:29.805 req#21 close
00:12:29.805 req#19 error
00:12:29.805 req#19 close
00:12:29.805 req#17 error
00:12:29.806 req#17 close
00:12:29.806 req#15 error
00:12:29.806 req#15 close
00:12:29.806 req#13 error
00:12:29.806 req#13 close
00:12:29.806 req#11 error
00:12:29.806 req#11 close
00:12:29.806 req#9 error
00:12:29.806 req#9 close
00:12:29.806 req#7 error
00:12:29.807 req#7 close
00:12:29.807 req#5 error
00:12:29.807 req#5 close
00:12:29.807 req#3 error
00:12:29.807 req#3 close
00:12:29.807 req#1 error
00:12:29.808 req#1 close
00:12:29.808 done=31 sent=30
00:12:29.808 assert.js:74
00:12:29.808 throw new AssertionError(obj);
00:12:29.808 ^
00:12:29.808
00:12:29.808 AssertionError [ERR_ASSERTION]: timeout on http request called too much
00:12:29.809 at process.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux-containered/nodes/ubuntu1604_sharedlibs_fips20_x64/test/parallel/test-http-client-timeout-agent.js:94:10)
00:12:29.810 at process.emit (events.js:136:15)
I couldn't find anything abnormal on these containers or the hosts running them.
@rvagg I think the stringbytes failures are memory-overcommit related. I raised the issue before of potentially needing to turn overcommit off on the CI machines (at least the Alpine ones): those tests try to assess whether enough memory is available before running, but with memory overcommit they don't get a representative answer and proceed to crash.
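Roughly what goes wrong, as a generic illustration (a minimal Python sketch, not the actual test code - the allocation size and the /proc/meminfo pre-check are just stand-ins):

```python
import mmap

GiB = 1024 ** 3
needed = 4 * GiB  # stand-in for the large buffer the test wants to allocate

# Pre-check: ask the kernel how much memory looks available and skip if it
# is not enough. This is the kind of guard the tests rely on.
with open('/proc/meminfo') as f:
    meminfo = dict(line.split(':', 1) for line in f)
available = int(meminfo['MemAvailable'].split()[0]) * 1024  # kB -> bytes

if available < needed:
    print('skipping: not enough memory reported')
else:
    # With overcommit enabled the kernel grants this mapping even when it
    # cannot actually back it, so the pre-check above is not a reliable
    # guard; the process only gets killed (SIGKILL, i.e. the "Signal: 9"
    # seen above) once it really touches the pages.
    buf = mmap.mmap(-1, needed)
    for off in range(0, needed, mmap.PAGESIZE):
        buf[off] = 1
```

With overcommit disabled (vm.overcommit_memory=2) the mmap itself fails with ENOMEM up front, which a test can detect and skip on, instead of the process being killed part-way through.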
The one referenced in this issue is peculiar though, because it doesn't get killed by a signal but it also doesn't have a stack trace...
https://github.com/nodejs/node/issues/18998
https://github.com/nodejs/node/issues/18997
https://github.com/nodejs/node/issues/18996
https://github.com/nodejs/node/issues/18995
https://github.com/nodejs/node/issues/18735
https://github.com/nodejs/node/issues/18733
https://github.com/nodejs/node/issues/18732
https://github.com/nodejs/node/issues/18663
https://github.com/nodejs/node/issues/18504
https://github.com/nodejs/node/issues/18496
There are a variety of subsystems represented here and I don't really see much in common.
@nodejs/collaborators we're going to need some more help getting to the bottom of this. Errors of this type are showing up pretty frequently, and not for one specific test. Just looking at the last 4 node-test-commit-linux failures shows what look to be similar failures (4 failures out of the 6 at the top right now):
https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16867/nodes=debian8-x86/console
https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16866/nodes=debian8-x86/console
The test-http-client-timeout-agent failure mentioned above is also showing up regularly; I'm not sure whether it's related or not.
Build has been working to iron out the frustrating Jenkins-related failures, but these kinds of errors are now one of the major blockers for getting CI back to green, so your help in getting us there would be appreciated!
Sorry, I got those numbers wrong. There are 5 failures in the last 7 builds: 3 of them have the weird no-output crashes and two of them have the test-http-client-timeout-agent failure on Alpine. But you don't have to go back much further to find test-http-client-timeout-agent failing on ARM64 as well. There's at least one bug being exposed on Linux here.
@rvagg Is it possible to turn on ulimit -c <limit> on the problematic hosts and store the core dump + the Node.js binary for a period of time (or up to n GB of space) so we can download them and debug these?
@joyeecheung ok, done on the Debian 8 and Debian 9 hosts. It's a bit more tricky on the Alpine ones so I haven't bothered for now. Next thing is to keep an eye out for these crashes.
I've only set the limit to unlimited and set a reminder that we should undo it at some point. Is it actually possible to limit the total space taken up by core files rather than just the size of the individual files? Or to set a max age without scripting something in cron?
had one on test-digitalocean-debian8-x64-1 today https://ci.nodejs.org/job/node-test-commit-linux/16972/nodes=debian8-64/
not ok 2029 sequential/test-http-server-keep-alive-timeout-slow-server
---
duration_ms: 0.212
severity: fail
stack: |-
but no core, at least I can't find one; perhaps one was made in the working directory but destroyed by the git clean on the next run. It's also possible that LimitCORE did nothing since we could be using a too-old version of systemd on Debian 8; the coredumpctl utilities aren't even available by default. So I've gone ahead and put jessie-backports on there and upgraded systemd and systemd-coredump. _I believe_ that coredumpctl list should now show any cores that are dumped on those machines. I'll keep an eye out, but I've never used the systemd coredump utilities before so I'm flying a little blind here.
@rvagg I believe if the test crashes then the output would be severity: crashed followed by the signal that the process crashed with? This looks like another test that fails without output; there's probably something wrong in tools/test.py that fails to grab the output from the test.
@joyeecheung ahhh good point, I didn't realise that. test.py checks `self.output.exit_code < 0 and self.output.exit_code != -signal.SIGABRT` - why would it not treat abort() as a "crash" though?
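For reference, my rough reading of that check (an illustrative Python sketch, not the actual tools/test.py logic):

```python
import signal

def severity(exit_code):
    # A negative exit code from the Python runner means the child was
    # killed by a signal; -exit_code is the signal number.
    if exit_code < 0 and exit_code != -signal.SIGABRT:
        return 'crashed'  # shows up as "CRASHED (Signal: N)" in the TAP output
    # abort()/SIGABRT and ordinary non-zero exits both land here as a plain
    # failure, so the only clue left is whatever made it to stderr.
    return 'ok' if exit_code == 0 else 'fail'
```

If that reading is right, the no-output failures in this issue exited non-zero without being signalled at all, which makes the missing stderr even stranger.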
Anyway, I can see core files stacking up, and it looks like _lots_ of them per test run on each of these servers; is that normal? I've never bothered getting into core dump analysis so I'm a bit out of my depth.
Also, I've enabled timestamps on the console output of node-test-commit-linux so we can match up test failures with core files since systemd keeps them nicely stored with a timestamp.
@rvagg I am able to get the core dumps from test-digitalocean-debian8-x64-1 but I don't have the accompanying Node.js binary that generated them... File names look like core.node.1000.567551fcf4104b21b9fb317a3ca0da3b.<pid>.<time?>; there seems to be a sha in the file name but I am unable to find the corresponding commit...
Sorry @joyeecheung, I think maybe they've not been kept between runs; it looks like systemd isn't saving the binaries either. Perhaps we need to save the binary after each run as well, although that'll have to be done in Jenkins.
I'm out of action for the next 2 days sadly so I won't be able to help again until at least mid-week.
It seems like we still have failures that do not produce any stack traces. @Trott you got a couple of those recently.
@BridgeAR @Trott Do you know which machines are affected most by this? This has been going on for a while now, and I think I’d like to request access from build to try and debug those…
@addaleax Not sure, but I'll try to take note. Here's another one from today on ppcle-ubuntu1404...
https://ci.nodejs.org/job/node-test-commit-plinux/16540/nodes=ppcle-ubuntu1404/console
not ok 1178 parallel/test-net-socket-timeout
---
duration_ms: 0.510
severity: fail
stack: |-
I am thinking maybe node-report could be useful, but I still think there's a bug in tools/test.py that fails to collect the stderr/stdout from failed tests.
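One way the output can go missing even without a harness bug (an illustrative Python sketch of the usual spawn-and-capture pattern, not the actual tools/test.py code, and the test path is just an example): anything the child has queued for stdout/stderr but not yet flushed when it dies never reaches the parent's pipes.

```python
import subprocess

# Spawn the test with its stdio redirected into pipes, the way a test
# harness typically does, and collect whatever arrives before it exits.
proc = subprocess.Popen(
    ['node', 'test/parallel/test-tls-buffersize.js'],  # example invocation
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()

# If the child exits (or is killed) with output still sitting in its own
# userspace buffers, `out` and `err` end up empty even though the exit
# code is non-zero, which is what these reports look like.
print('exit code:', proc.returncode)
print('captured stderr:', err.decode() or '<empty>')
```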
Closing this in favor of #19903, but feel free to re-open or comment if you think that's the wrong thing to do.