https://ci.nodejs.org/job/node-test-commit-linux/16566/nodes=debian8-x86/console
not ok 1585 parallel/test-tls-buffersize
---
duration_ms: 0.213
severity: fail
stack: |-
I'm not sure if there's much point anymore in creating separate issues for all of these debian8-x86, ubuntu1404 and ubuntu1204 failures without a stack trace. They seem to affect literally all of the tests so we might eventually have an issue for every single test that exists on each of those platforms.
/cc @nodejs/build It would be nice if someone could look into this. The lack of a stack trace or signal code is puzzling.
agree on the lack of hints / artifacts associated with these failure types - they don't help us in any way other than telling us that the test failed. My debugging method in those cases has always been to run the test a thousand times locally in the hope of reproducing it, reduce the code for as long as the problem persists, etc. (time consuming and often unsuccessful).
My question is: what type of failure is this - an assertion failure / unforeseen exception / crash / bad exit caught by the Python driver / forced failure? Is it a tap.assert(false) or something?
@nodejs/build someone with test access should get into one of these machines straight after a failure and run the test manually to see what's up.
@apapirovski
They seem to affect literally all of the tests so we might eventually have an issue for every single test that exists on each of those platforms
Are you saying this is more than just that one test that fails like this?
@rvagg check the test issues with the flaky label. Probably more than 50% are similar to this one.
Are these tests resource intensive in any way? Is it possible that they crash when there are zombie jobs running in the background that conflict with them in some fashion - maybe sharing resources, or a zombie holding on to too much memory? I've been seeing a few make test-ci runs and their associated children left zombied on some Linux machines recently; not many, but enough that I could imagine it's connected.
I've observed some similar failures on Docker containers today:
https://ci.nodejs.org/job/node-test-commit-linux/16729/nodes=alpine37-container-x64/ running on https://ci.nodejs.org/computer/test-softlayer-alpine37_container-x64-1 which runs on test-softlayer-ubuntu1604_docker-x64-1
not ok 1778 async-hooks/test-graph.intervals
---
duration_ms: 0.240
severity: crashed
stack: |-
oh no!
exit code: CRASHED (Signal: 11)
https://ci.nodejs.org/job/node-test-commit-linux-containered/2544/nodes=ubuntu1604_sharedlibs_openssl110_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2
not ok 1931 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max
---
duration_ms: 21.243
severity: crashed
stack: |-
oh no!
exit code: CRASHED (Signal: 9)
That same container had a different failure on its previous run. It doesn't look the same, but perhaps it's somehow related:
https://ci.nodejs.org/job/node-test-commit-linux-containered/2543/nodes=ubuntu1604_sharedlibs_fips20_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2
00:12:29.798 not ok 691 parallel/test-http-client-timeout-agent
00:12:29.798 ---
00:12:29.798 duration_ms: 0.824
00:12:29.798 severity: fail
00:12:29.798 stack: |-
00:12:29.798 res#0 data:0
00:12:29.798 res#0 end
00:12:29.798 res#2 data:2
00:12:29.799 res#2 end
00:12:29.799 res#4 data:4
00:12:29.799 res#4 end
00:12:29.799 res#6 data:6
00:12:29.799 res#6 end
00:12:29.799 res#8 data:8
00:12:29.799 res#8 end
00:12:29.799 res#10 data:10
00:12:29.799 res#10 end
00:12:29.799 res#12 data:12
00:12:29.799 res#12 end
00:12:29.800 res#14 data:14
00:12:29.800 res#14 end
00:12:29.800 res#16 data:16
00:12:29.800 res#16 end
00:12:29.800 res#18 data:18
00:12:29.800 res#18 end
00:12:29.800 res#20 data:20
00:12:29.800 res#20 end
00:12:29.800 res#22 data:22
00:12:29.800 res#22 end
00:12:29.801 res#24 data:24
00:12:29.801 res#24 end
00:12:29.801 res#26 data:26
00:12:29.801 res#26 end
00:12:29.801 res#28 data:28
00:12:29.801 res#28 end
00:12:29.801 req#1 timeout
00:12:29.801 req#3 timeout
00:12:29.801 req#5 timeout
00:12:29.801 req#7 timeout
00:12:29.802 req#9 timeout
00:12:29.802 req#11 timeout
00:12:29.802 req#13 timeout
00:12:29.802 req#15 timeout
00:12:29.802 req#17 timeout
00:12:29.802 req#19 timeout
00:12:29.802 req#21 timeout
00:12:29.802 req#23 timeout
00:12:29.802 req#25 timeout
00:12:29.803 req#27 timeout
00:12:29.803 req#29 timeout
00:12:29.803 req#0 timeout
00:12:29.803 req#28 close
00:12:29.803 req#26 close
00:12:29.803 req#24 close
00:12:29.803 req#22 close
00:12:29.803 req#20 close
00:12:29.803 req#18 close
00:12:29.803 req#16 close
00:12:29.804 req#14 close
00:12:29.804 req#12 close
00:12:29.804 req#10 close
00:12:29.804 req#8 close
00:12:29.804 req#6 close
00:12:29.804 req#4 close
00:12:29.804 req#2 close
00:12:29.804 req#0 close
00:12:29.804 req#29 error
00:12:29.804 req#29 close
00:12:29.804 req#27 error
00:12:29.805 req#27 close
00:12:29.805 req#25 error
00:12:29.805 req#25 close
00:12:29.805 req#23 error
00:12:29.805 req#23 close
00:12:29.805 req#21 error
00:12:29.805 req#21 close
00:12:29.805 req#19 error
00:12:29.805 req#19 close
00:12:29.805 req#17 error
00:12:29.806 req#17 close
00:12:29.806 req#15 error
00:12:29.806 req#15 close
00:12:29.806 req#13 error
00:12:29.806 req#13 close
00:12:29.806 req#11 error
00:12:29.806 req#11 close
00:12:29.806 req#9 error
00:12:29.806 req#9 close
00:12:29.806 req#7 error
00:12:29.807 req#7 close
00:12:29.807 req#5 error
00:12:29.807 req#5 close
00:12:29.807 req#3 error
00:12:29.807 req#3 close
00:12:29.807 req#1 error
00:12:29.808 req#1 close
00:12:29.808 done=31 sent=30
00:12:29.808 assert.js:74
00:12:29.808 throw new AssertionError(obj);
00:12:29.808 ^
00:12:29.808
00:12:29.808 AssertionError [ERR_ASSERTION]: timeout on http request called too much
00:12:29.809 at process.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux-containered/nodes/ubuntu1604_sharedlibs_fips20_x64/test/parallel/test-http-client-timeout-agent.js:94:10)
00:12:29.810 at process.emit (events.js:136:15)
I couldn't find anything abnormal on these containers or the hosts running them.
@rvagg I think the stringbytes failures are memory-overcommit related. I raised the issue before of potentially needing to turn overcommit off on the CI machines (at least the Alpine ones): those tests try to assess whether enough memory is available before running, but with memory overcommit they don't get a representative answer and proceed to crash.
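Roughly what goes wrong, as a generic illustration (a minimal Python sketch, not the actual test code - the allocation size and the /proc/meminfo pre-check are just stand-ins):

```python
import mmap

GiB = 1024 ** 3
needed = 4 * GiB  # stand-in for the large buffer the test wants to allocate

# Pre-check: ask the kernel how much memory looks available and skip if it
# is not enough. This is the kind of guard the tests rely on.
with open('/proc/meminfo') as f:
    meminfo = dict(line.split(':', 1) for line in f)
available = int(meminfo['MemAvailable'].split()[0]) * 1024  # kB -> bytes

if available < needed:
    print('skipping: not enough memory reported')
else:
    # With overcommit enabled the kernel grants this mapping even when it
    # cannot actually back it, so the pre-check above is not a reliable
    # guard; the process only gets killed (SIGKILL, i.e. the "Signal: 9"
    # seen above) once it really touches the pages.
    buf = mmap.mmap(-1, needed)
    for off in range(0, needed, mmap.PAGESIZE):
        buf[off] = 1
```

With overcommit disabled (vm.overcommit_memory=2) the mmap itself fails with ENOMEM up front, which a test can detect and skip on, instead of the process being killed part-way through.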
The one referenced in this issue is peculiar though, because it doesn't get killed by a signal but it also doesn't have a stack trace...
https://github.com/nodejs/node/issues/18998
https://github.com/nodejs/node/issues/18997
https://github.com/nodejs/node/issues/18996
https://github.com/nodejs/node/issues/18995
https://github.com/nodejs/node/issues/18735
https://github.com/nodejs/node/issues/18733
https://github.com/nodejs/node/issues/18732
https://github.com/nodejs/node/issues/18663
https://github.com/nodejs/node/issues/18504
https://github.com/nodejs/node/issues/18496
There are a variety of subsystems represented here and I don't really see much in common.
@nodejs/collaborators we're going to need some more help getting to the bottom of this. Errors of this type are showing up pretty frequently, and not for one specific test. Just looking at the last 4 node-test-commit-linux failures shows what look to be similar failures (4 failures out of the 6 at the top right now):
https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16867/nodes=debian8-x86/console
https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16866/nodes=debian8-x86/console
The test-http-client-timeout-agent failure mentioned above is also showing up regularly; I'm not sure whether it's related or not.
Build has been working to iron out the frustrating Jenkins-related failures, but these kinds of errors are now one of the major blockers for getting CI back to green, so your help in getting us there would be appreciated!
Sorry, I got those numbers wrong. There are 5 failures in the last 7 builds: 3 of them have the weird no-output crashes and two of them have the test-http-client-timeout-agent failure on Alpine. But you don't have to go back much further to find test-http-client-timeout-agent failing on ARM64 as well. There's at least one bug being exposed on Linux here.
@rvagg Is it possible to turn on ulimit -c <limit> on the problematic hosts and store the core dump + the Node.js binary for a period of time (or up to n GB of space) so we can download them and debug these?
@joyeecheung ok, done on the Debian 8 and Debian 9 hosts. It's a bit more tricky on the Alpine ones so I haven't bothered for now. Next thing is to keep an eye out for these crashes.
I've only set the limit to unlimited and set a reminder that we should undo it at some point. Is it actually possible to limit the total space taken up by core files rather than just the size of the individual files? Or to set a max age without scripting something in cron?
had one on test-digitalocean-debian8-x64-1 today https://ci.nodejs.org/job/node-test-commit-linux/16972/nodes=debian8-64/
not ok 2029 sequential/test-http-server-keep-alive-timeout-slow-server
---
duration_ms: 0.212
severity: fail
stack: |-
but no core, at least I can't find one; perhaps one was made in the working directory but destroyed by the git clean on the next run. It's also possible that LimitCORE did nothing since we could be using a too-old version of systemd on Debian 8; the coredumpctl utilities aren't even available by default. So I've gone ahead and put jessie-backports on there and upgraded systemd and systemd-coredump. _I believe_ that coredumpctl list should now show any cores that are dumped on those machines. I'll keep an eye out, but I've never used the systemd coredump utilities before so I'm flying a little blind here.
@rvagg I believe if the test crashes then the output would be severity: crashed followed by the signal that the process crashed with? This looks like another test that fails without output; there's probably something wrong in tools/test.py that fails to grab the output from the test.
@joyeecheung ahhh good point, I didn't realise that. test.py checks `self.output.exit_code < 0 and self.output.exit_code != -signal.SIGABRT` - why would it not treat abort() as a "crash" though?
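For reference, my rough reading of that check (an illustrative Python sketch, not the actual tools/test.py logic):

```python
import signal

def severity(exit_code):
    # A negative exit code from the Python runner means the child was
    # killed by a signal; -exit_code is the signal number.
    if exit_code < 0 and exit_code != -signal.SIGABRT:
        return 'crashed'  # shows up as "CRASHED (Signal: N)" in the TAP output
    # abort()/SIGABRT and ordinary non-zero exits both land here as a plain
    # failure, so the only clue left is whatever made it to stderr.
    return 'ok' if exit_code == 0 else 'fail'
```

If that reading is right, the no-output failures in this issue exited non-zero without being signalled at all, which makes the missing stderr even stranger.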
Anyway, I can see core files stacking up, and it looks like _lots_ of them per test run on each of these servers; is that normal? I've never bothered getting into core dump analysis so I'm a bit out of my depth.
Also, I've enabled timestamps on the console output of node-test-commit-linux so we can match up test failures with core files since systemd keeps them nicely stored with a timestamp.
@rvagg I am able to get the core dumps from test-digitalocean-debian8-x64-1 but I don't have the accompanying Node.js binary that generated them... File names look like core.node.1000.567551fcf4104b21b9fb317a3ca0da3b.<pid>.<time?>; there seems to be a sha in the file name but I am unable to find the corresponding commit...
Sorry @joyeecheung, I think maybe they've not been kept between runs; it looks like systemd isn't saving the binaries either. Perhaps we need to save the binary after each run as well, although that'll have to be done in Jenkins.
I'm out of action for the next 2 days sadly so I won't be able to help again until at least mid-week.
It seems like we still have failures that do not produce any stack traces. @Trott you got a couple of those recently.
@BridgeAR @Trott Do you know which machines are affected most by this? This has been going on for a while now, and I think I’d like to request access from build to try and debug those…
@addaleax Not sure, but I'll try to take note. Here's another one from today on ppcle-ubuntu1404...
https://ci.nodejs.org/job/node-test-commit-plinux/16540/nodes=ppcle-ubuntu1404/console
not ok 1178 parallel/test-net-socket-timeout
---
duration_ms: 0.510
severity: fail
stack: |-
I am thinking maybe node-report could be useful, but I still think there's a bug in tools/test.py that fails to collect the stderr/stdout from failed tests.
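One way the output can go missing even without a harness bug (an illustrative Python sketch of the usual spawn-and-capture pattern, not the actual tools/test.py code, and the test path is just an example): anything the child has queued for stdout/stderr but not yet flushed when it dies never reaches the parent's pipes.

```python
import subprocess

# Spawn the test with its stdio redirected into pipes, the way a test
# harness typically does, and collect whatever arrives before it exits.
proc = subprocess.Popen(
    ['node', 'test/parallel/test-tls-buffersize.js'],  # example invocation
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()

# If the child exits (or is killed) with output still sitting in its own
# userspace buffers, `out` and `err` end up empty even though the exit
# code is non-zero, which is what these reports look like.
print('exit code:', proc.returncode)
print('captured stderr:', err.decode() or '<empty>')
```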
Closing this in favor of #19903, but feel free to re-open or comment if you think that's the wrong thing to do.