I just added Alpine 3.8 to CI and removed 3.6 in the process, shifting alpine-last-latest-x64 to Alpine 3.7 and giving alpine-latest-x64 to Alpine 3.8. It tested well on my local machine across our major branches, but now that it's enabled in CI we're getting consistent errors on all test runs:
15:57:44 not ok 647 parallel/test-cluster-master-error
15:57:44 ---
15:57:44 duration_ms: 120.63
15:57:44 severity: fail
15:57:44 exitcode: -15
15:57:44 stack: |-
15:57:44 timeout
15:57:44 ...
15:57:44 not ok 648 parallel/test-cluster-master-kill
15:57:44 ---
15:57:44 duration_ms: 120.92
15:57:44 severity: fail
15:57:44 exitcode: -15
15:57:44 stack: |-
15:57:44 timeout
What should we do with this? Do I remove it from CI for now or can someone spend some time investigating?
The Dockerfile is here, minus the template strings, which you can easily stub out to set this up locally if you want to give that a go. I'm not sure what the difference is with my local machine, but perhaps I wasn't testing on the latest master and there's something new, or perhaps there's a Docker difference.
Quick thing to try: Does moving the tests to sequential cause them to pass in CI on this platform? (I would create a branch and try myself but I've got non-computer things to focus on for the next few hours.)
(Also: Our code now prints stdout and stderr on timeouts so someone can try adding a bunch of console.log() statements to see where things are hanging up.)
Testing move to sequential
https://ci.nodejs.org/job/node-test-commit-linux/20716/
BTW: before the switch to 3.8 these tests took < 2s

> Testing move to sequential
> https://ci.nodejs.org/job/node-test-commit-linux/20716/
https://ci.nodejs.org/job/node-test-commit-linux/20716/nodes=alpine-latest-x64/console
Still fails.
Running with some logging: https://ci.nodejs.org/job/node-test-commit-linux/20725/nodes=alpine-latest-x64/console
The logging shows that the test succeeds except that pollWorkers() in both tests never detects that the PID has exited. I wonder if the bug is in common.isAlive(pid). That code uses process.kill(pid, 'SIGCONT'), so maybe there's something about SIGCONT and permissions or something in the Docker container?
@nodejs/docker
common.isAlive(pid) expects process.kill(pid, 'SIGCONT') to throw if pid does not exist. @rvagg @refack or someone who knows how to log into the Docker container running on CI, any chance you can see if ./node -e 'process.kill(12345)' throws or not (after checking that there is no pid 12345 running)?
bash-4.4$ ./node -e 'process.kill(12345)'
internal/process/per_thread.js:194
throw errnoException(err, 'kill');
^
Error: kill ESRCH
at process.kill (internal/process/per_thread.js:194:13)
at [eval]:1:9
at Script.runInThisContext (vm.js:88:20)
at Object.runInThisContext (vm.js:285:38)
at Object.<anonymous> ([eval]-wrapper:6:22)
at Module._compile (internal/modules/cjs/loader.js:689:30)
at evalScript (internal/bootstrap/node.js:563:27)
at startup (internal/bootstrap/node.js:248:9)
at bootstrapNodeJSCore (internal/bootstrap/node.js:596:3)
Also, @Trott (or anybody else with the nodejs_build_test SSH key), I had written up some docs about how to restart/SSH into the CI docker containers (in case more testing is needed, and you want it to be self-service :))
Since Alpine runs on musl, whose support level is experimental, I've removed the label from the main CI job and created a dedicated job for stabilization efforts - https://ci.nodejs.org/job/node-test-commit-alpine38/
Reporting from within the danger zone:
bash-4.4$ ./node test/parallel/test-cluster-master-error.js
message received: { cmd: 'worker', workerPID: 26420 }
message received: { cmd: 'worker', workerPID: 26421 }
exited with code 1
polling
polling
bash-4.4$ ./node test/parallel/test-cluster-master-kill.js
message received: { pid: 26447 }
exited with code 0
polling
polling
bash-4.4$
bash-4.4$ /usr/bin/python tools/test.py -j 4 -p tap --logfile test.tap --mode=release --flaky-tests=run parallel/test-cluster-master*
TAP version 13
1..2
ok 1 parallel/test-cluster-master-error
---
duration_ms: 0.713
...
ok 2 parallel/test-cluster-master-kill
---
duration_ms: 1.713
...
more like twilight zone.
Is the failure specific to running the test in a docker container?
> Is the failure specific to running the test in a docker container?
We test Alpine only running as a Docker container, so I guess we don't know...
I would love to help but it looks like a pretty steep learning curve figuring out all the build stuff.
Btw, where is the test suite to test?
make test
which in turn runs:
/usr/bin/python tools/test.py -j 4 -p tap --logfile test.tap \
--mode=release --flaky-tests=run \
default addons addons-napi doctool
(it assumes the node binary to test is out/Release/node)
Is there a way to specify the binary path to test?
AFAIR, no. When needed I copy/symlink a binary to that location.
New observation: after running a CI job there seem to be multiple node zombies around:
iojs 1 0.7 0.7 1692000 123156 ? Ssl 01:27 5:37 java -Xmx128m -jar /home/iojs/slave.jar
iojs 1544 0.0 0.0 0 0 ? Z 01:38 0:00 [node] <defunct>
iojs 10025 0.0 0.0 0 0 ? Z 01:42 0:00 [node] <defunct>
iojs 10064 0.0 0.0 0 0 ? Z 01:42 0:00 [node] <defunct>
iojs 10690 0.0 0.0 0 0 ? Z 01:43 0:00 [node] <defunct>
iojs 11805 0.0 0.0 0 0 ? Z 01:43 0:00 [node] <defunct>
iojs 11830 0.0 0.0 0 0 ? Z 01:43 0:00 [node] <defunct>
iojs 11836 0.0 0.0 0 0 ? Z 01:43 0:00 [node] <defunct>
iojs 14805 0.0 0.0 6380 1932 pts/0 Ss 13:04 0:00 /bin/bash
iojs 16049 0.0 0.0 5700 616 pts/0 R+ 13:31 0:00 ps axu
iojs 20074 0.0 0.0 0 0 ? Zs 01:33 0:00 [node] <defunct>
iojs 22336 0.0 0.0 0 0 ? Z 01:34 0:00 [node] <defunct>
iojs 22456 0.0 0.0 0 0 ? Z 01:34 0:00 [node] <defunct>
iojs 22457 0.0 0.0 0 0 ? Z 01:34 0:00 [node] <defunct>
iojs 22480 0.0 0.0 0 0 ? Z 01:34 0:00 [node] <defunct>
iojs 22752 0.0 0.0 0 0 ? Z 01:34 0:00 [node] <defunct>
iojs 26383 0.0 0.0 0 0 ? Z 01:36 0:00 [ls] <defunct>
iojs 31062 0.0 0.0 0 0 ? Z 01:37 0:00 [node] <defunct>
iojs 31080 0.0 0.0 0 0 ? Zs 01:37 0:00 [node] <defunct>
iojs 31094 0.0 0.0 0 0 ? Zs 01:37 0:00 [node] <defunct>
That might make the polling fail...
Ok, I created the following Dockerfile to run the test suite:
FROM node:10-alpine
RUN apk add --no-cache --update \
curl \
python \
&& curl -L --compressed https://api.github.com/repos/nodejs/node/tarball -o node.tar.gz
RUN mkdir -p /node/out/Release \
&& tar -xf node.tar.gz --strip 1 -C /node \
&& ln -s /usr/local/bin/node /node/out/Release/node
RUN cd /node \
&& python tools/test.py -j 4 -p tap --logfile test.tap \
--mode=release --flaky-tests=run \
default addons addons-napi doctool
Hmmm I get a lot of failures
not ok 16 async-hooks/test-fsreqcallback-access
not ok 18 async-hooks/test-fsreqcallback-readFile
not ok 47 async-hooks/test-timers.setInterval
not ok 95 parallel/test-assert
not ok 96 parallel/test-assert-checktag
not ok 97 parallel/test-assert-deep
not ok 140 parallel/test-benchmark-misc
not ok 149 parallel/test-benchmark-util
not ok 190 parallel/test-buffer-readint
not ok 191 parallel/test-buffer-readuint
not ok 204 parallel/test-buffer-writeint
not ok 205 parallel/test-buffer-writeuint
not ok 272 parallel/test-child-process-spawnsync-shell
not ok 359 parallel/test-console
not ok 374 parallel/test-constants
not ok 377 parallel/test-crypto-authenticated
not ok 380 parallel/test-crypto-cipher-decipher
not ok 404 parallel/test-crypto-scrypt
not ok 419 parallel/test-dgram-bind-fd
not ok 420 parallel/test-dgram-bind-fd-error
not ok 426 parallel/test-dgram-cluster-bind-error
not ok 429 parallel/test-dgram-create-socket-handle-fd
not ok 457 parallel/test-dgram-send-error
not ok 464 parallel/test-dgram-socket-buffer-size
not ok 471 parallel/test-dns-lookup
not ok 472 parallel/test-dns-memory-error
not ok 478 parallel/test-dns-setservers-type-check
not ok 587 parallel/test-fs-access
not ok 598 parallel/test-fs-copyfile
not ok 600 parallel/test-fs-error-messages
not ok 612 parallel/test-fs-mkdir
not ok 626 parallel/test-fs-promises
not ok 638 parallel/test-fs-read
not ok 639 parallel/test-fs-read-empty-buffer
not ok 643 parallel/test-fs-read-stream
not ok 649 parallel/test-fs-read-stream-inherit
not ok 651 parallel/test-fs-read-stream-throw-type-error
not ok 656 parallel/test-fs-readdir-types
not ok 686 parallel/test-fs-sync-fd-leak
not ok 698 parallel/test-fs-watch-enoent
not ok 747 parallel/test-http-abort-stream-end
not ok 791 parallel/test-http-client-immediate-error
not ok 824 parallel/test-http-correct-hostname
not ok 827 parallel/test-http-debug
not ok 829 parallel/test-http-deprecated-urls
not ok 861 parallel/test-http-invalid-urls
not ok 902 parallel/test-http-req-res-close
not ok 904 parallel/test-http-request-arguments
not ok 1044 parallel/test-http2-debug
not ok 1115 parallel/test-http2-server-push-stream
not ok 1188 parallel/test-https-request-arguments
not ok 1195 parallel/test-https-strict
not ok 1215 parallel/test-internal-errors
not ok 1219 parallel/test-internal-module-wrap
not ok 1300 parallel/test-net-end-close
not ok 1354 parallel/test-cluster-master-error
not ok 1355 parallel/test-cluster-master-kill
not ok 1397 parallel/test-performanceobserver
not ok 1407 parallel/test-priority-queue
not ok 1413 parallel/test-process-chdir
not ok 1415 parallel/test-process-chdir-errormessage
not ok 1423 parallel/test-process-emit-warning-from-native
not ok 1424 parallel/test-process-emitwarning
not ok 1429 parallel/test-process-euid-egid
not ok 1447 parallel/test-process-hrtime
not ok 1459 parallel/test-process-setgroups
not ok 1462 parallel/test-process-uid-gid
not ok 1463 parallel/test-process-umask
not ok 1488 parallel/test-repl
not ok 1490 parallel/test-repl-colors
not ok 1524 parallel/test-repl-sigint
not ok 1525 parallel/test-repl-sigint-nested-eval
not ok 1532 parallel/test-repl-top-level-await
not ok 1619 parallel/test-stream-pipeline
not ok 1627 parallel/test-stream-readable-hwm-0
not ok 1697 parallel/test-tcp-wrap
not ok 1715 parallel/test-timers-immediate-unref
not ok 1720 parallel/test-timers-interval-throw
not ok 1724 parallel/test-timers-now
not ok 1725 parallel/test-timers-ordering
not ok 1739 parallel/test-timers-unref
not ok 1762 parallel/test-tls-check-server-identity
not ok 1806 parallel/test-tls-handshake-error
not ok 1849 parallel/test-tls-set-ciphers-error
not ok 1852 parallel/test-tls-snicallback-error
not ok 1883 parallel/test-trace-events-async-hooks
not ok 1884 parallel/test-trace-events-binding
not ok 1886 parallel/test-trace-events-category-used
not ok 1888 parallel/test-trace-events-metadata
not ok 1899 parallel/test-ttywrap-invalid-fd
not ok 1917 parallel/test-util-inspect
not ok 1918 parallel/test-util-inspect-bigint
not ok 1927 parallel/test-uv-binding-constant
not ok 1928 parallel/test-uv-errno
not ok 1975 parallel/test-vm-options-validation
not ok 1986 parallel/test-vm-sigint
not ok 1989 parallel/test-vm-sigint-existing-handler
not ok 1991 parallel/test-vm-timeout
not ok 2102 addons/symlinked-module/test
not ok 2103 addons/zlib-binding/test
not ok 2104 addons/hello-world/test
not ok 2105 addons/at-exit/test
not ok 2106 addons/node-module-version/test
not ok 2107 addons/make-callback-recurse/test
not ok 2108 addons/uv-handle-leak/test
not ok 2109 addons/async-hooks-promise/test
not ok 2110 addons/parse-encoding/test
not ok 2111 addons/hello-world-esm/test
not ok 2112 addons/dlopen-ping-pong/test
not ok 2113 addons/async-hello-world/test
not ok 2114 addons/repl-domain-abort/test
not ok 2115 addons/make-callback-domain-warning/test
not ok 2116 addons/not-a-binding/test
not ok 2117 addons/make-callback/test
not ok 2118 addons/async-hooks-id/test
not ok 2120 addons/heap-profiler/test
not ok 2121 addons/hello-world-function-export/test
not ok 2122 addons/async-resource/test
not ok 2123 addons/buffer-free-callback/test
not ok 2124 addons/openssl-binding/test
not ok 2125 addons/load-long-path/test
not ok 2126 addons/errno-exception/test
not ok 2127 addons/callback-scope/test
not ok 2128 addons/new-target/test
not ok 2129 addons/null-buffer-neuter/test
not ok 2130 addons/callback-scope/test-async-hooks
not ok 2131 addons/async-hello-world/test-makecallback
not ok 2132 addons/async-hello-world/test-makecallback-uncaught
not ok 2133 addons/callback-scope/test-resolve-async
not ok 2134 addons/stringbytes-external-exceed-max/test-stringbytes-external-at-max
not ok 2135 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max
not ok 2136 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max-by-1-ascii
not ok 2137 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max-by-1-base64
not ok 2138 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max-by-1-binary
not ok 2139 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max-by-1-hex
not ok 2140 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max-by-1-utf8
not ok 2141 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max-by-2
not ok 2142 addons-napi/6_object_wrap/test
not ok 2143 addons-napi/1_hello_world/test
not ok 2144 addons-napi/test_bigint/test
not ok 2145 addons-napi/test_async/test
not ok 2146 addons-napi/test_typedarray/test
not ok 2147 addons-napi/test_threadsafe_function/test
not ok 2148 addons-napi/3_callbacks/test
not ok 2149 addons-napi/8_passing_wrapped/test
not ok 2150 addons-napi/7_factory_wrap/test
not ok 2151 addons-napi/2_function_arguments/test
not ok 2152 addons-napi/test_env_sharing/test
not ok 2153 addons-napi/test_uv_loop/test
not ok 2154 addons-napi/test_error/test
not ok 2155 addons-napi/test_array/test
not ok 2156 addons-napi/test_fatal/test
not ok 2157 addons-napi/test_make_callback/test
not ok 2158 addons-napi/test_null_init/test
not ok 2159 addons-napi/5_function_factory/test
not ok 2160 addons-napi/test_object/test
not ok 2161 addons-napi/test_promise/test
not ok 2162 addons-napi/test_constructor/test
not ok 2163 addons-napi/test_new_target/test
not ok 2164 addons-napi/test_reference/test
not ok 2165 addons-napi/test_make_callback_recurse/test
not ok 2166 addons-napi/test_function/test
not ok 2167 addons-napi/test_fatal_exception/test
not ok 2168 addons-napi/test_cleanup_hook/test
not ok 2169 addons-napi/test_exception/test
not ok 2170 addons-napi/test_dataview/test
not ok 2171 addons-napi/test_properties/test
not ok 2172 addons-napi/test_conversions/test
not ok 2173 addons-napi/test_string/test
not ok 2174 addons-napi/test_buffer/test
not ok 2175 addons-napi/4_object_factory/test
not ok 2176 addons-napi/test_handle_scope/test
not ok 2177 addons-napi/test_callback_scope/test
not ok 2178 addons-napi/test_general/test
not ok 2179 addons-napi/test_number/test
not ok 2180 addons-napi/test_async/test-async-hooks
not ok 2181 addons-napi/test_make_callback/test-async-hooks
not ok 2182 addons-napi/test_callback_scope/test-async-hooks
not ok 2183 addons-napi/test_async/test-loop
not ok 2184 addons-napi/test_callback_scope/test-resolve-async
not ok 2185 addons-napi/test_async/test-uncaught
not ok 2186 addons-napi/test_symbol/test1
not ok 2187 addons-napi/test_symbol/test2
not ok 2188 addons-napi/test_fatal/test2
not ok 2189 addons-napi/test_constructor/test2
not ok 2190 addons-napi/test_symbol/test3
not ok 2191 addons-napi/test_general/testGlobals
not ok 2192 addons-napi/test_general/testInstanceOf
not ok 2193 addons-napi/test_general/testNapiRun
not ok 2194 addons-napi/test_general/testNapiStatus
not ok 2195 doctool/test-doctool-html
not ok 2196 doctool/test-doctool-json
not ok 2197 doctool/test-make-doc
not ok 2217 message/assert_throws_stack
not ok 2220 message/error_exit
not ok 2240 message/timeout_throw
not ok 2254 pseudo-tty/test-assert-colors
not ok 2257 pseudo-tty/test-readable-tty-keepalive
not ok 2270 sequential/test-async-wrap-getasyncid
not ok 2367 sequential/test-tls-connect
not ok 2370 sequential/test-vm-timeout-rethrow
This is what we use to set up the alpine38 docker container
https://github.com/nodejs/build/blob/master/ansible/roles/docker/templates/alpine38.Dockerfile.j2
@joyeecheung I saw it, but I couldn't figure out what it was doing with Java etc. My setup runs the tests on the latest Node 10 version on Alpine 3.8 without anything else installed.
@LaurentGoderre Were you running tests from this repo's master branch using the Node 10 releases? Those are probably impossible to pass, since master's tests need to be run with binaries built from master's source (11-pre).
The Java stuff is the setup for Jenkins workers.
Ooooh, that is a good point!!!
@joyeecheung I ran the v10.x tests and I still get a lot of failures
This is still an issue, I imagine?
Ohh, yeah.
The twist is 3.9 is due any day now.
https://bugs.alpinelinux.org/versions/127
I've re-enabled alpine 3.8 because it's fallen off the radar and we're letting our large alpine user base down by not addressing this.
This needs to get marked as flaky (it's not actually flaky, it's a consistent failure) or fixed. Some help would be appreciated.
For those looking to help, a Dockerfile that will get you the equivalent of what we use in CI is:
FROM alpine:3.8
ENV LC_ALL C
ENV USER iojs
ENV SHELL /bin/bash
ENV HOME /home/iojs
ENV PATH /usr/lib/ccache/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ENV NODE_COMMON_PIPE /home/iojs/test.pipe
ENV NODE_TEST_DIR /home/iojs/tmp
ENV OSTYPE linux-gnu
ENV OSVARIANT docker
ENV DESTCPU x64
ENV ARCH x64
RUN apk add --no-cache --upgrade apk-tools
RUN apk add --no-cache libstdc++
RUN apk add --no-cache --virtual .build-deps \
shadow \
binutils-gold \
curl \
g++ \
gcc \
gnupg \
libgcc \
linux-headers \
make \
paxctl \
python \
tar \
ccache \
openjdk8 \
git \
procps \
openssh-client \
py2-pip \
bash \
automake \
libtool \
autoconf
RUN addgroup -g 1000 iojs
RUN adduser -G iojs -D -u 1000 iojs
RUN mkdir /home/iojs/tmp
VOLUME /home/iojs/ /home/iojs/.ccache
USER iojs:iojs
(assuming your UID and GID are both 1000, otherwise change them to match)
Then, from in your node repo clone, you can run:
docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache node-alpine:3.8 bash
And you should have a working environment where you can compile and test in /home/iojs/node.
FYI this runs on top of Ubuntu 16.04 in our infra. These containers are _fresh_ as I've just reprovisioned all of our Docker infra over the last couple of days, so this isn't about the process table filling up.
I can't reproduce locally on 18.04 using the same container config. One other difference is that we run from within Jenkins, so there's an additional layer to the process tree, although I'm not sure why that would matter.
OK, can repro locally, it's because of the process hierarchy inside the container. You need to remove bash from the hierarchy, which is present whenever you try to reproduce these things manually. Grab the Dockerfile above (edited just now to add a RUN mkdir /home/iojs/tmp), put it in /tmp/alpine38/ and run docker build -t node-alpine:3.8 /tmp/alpine38/. Then inside a clone of this repo do:
./configure:
docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 ./configure
make
docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 make -j8
run the two tests
docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 /usr/bin/python tools/test.py parallel/test-cluster-master*
yields this output after waiting for timeouts:
docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 /usr/bin/python tools/test.py parallel/test-cluster-master*
This reinforces the need to fix this as launching your application with minimal layers inside your minimal container is a thing that folks do with Docker / Alpine. (Aside from the fact that you shouldn't be using cluster).
Running the tests directly works too btw, you just need to kill it manually:
docker run -ti -v $(pwd):/home/iojs/node -v /tmp/alpine.ccache:/home/iojs/.ccache -w /home/iojs/node/ node-alpine:3.8 ./node test/parallel/test-cluster-master-error.js
(Unassigning refack. Don't want to discourage others from jumping in on this.)
Since this appears to be a bug in cluster or in Alpine, I'm going to remove the CI/flaky test label too.
Both tests use common.isAlive() and if that is always returning true on this OS, then we'll be seeing these things timeout.
Here's the source for common.isAlive():
function isAlive(pid) {
try {
process.kill(pid, 'SIGCONT');
return true;
} catch {
return false;
}
}
Any chance 'SIGCONT' does not behave here as it does on other operating systems?
The addition of 'SIGCONT' via common.isAlive() was introduced by @jasnell in https://github.com/nodejs/node/commit/baa0ffdab37. Prior to that, it had been using 0 rather than SIGCONT. Might be worth seeing if changing 'SIGCONT' to 0 fixes these two tests in this environment. (It might break other tests, or make tests that should fail pass or something, though.)
EDIT: Changing to 0 seems to fix nothing...oh well...
Probably too optimistic to think that https://github.com/nodejs/node/pull/24756 will fix it without introducing other issues, but let's see...
EDIT: Indeed, too optimistic...didn't work...
Whee! I have Docker installed and setup and I can replicate this. It's like I'm living in the FUTURE or something. Or at least the RECENT PAST. Can't do it right now, but will investigate more in a little bit if no one beats me to it (which: please do!).
As far as I can tell, the worker really does exit, but in the master process, process._kill(pid, 'SIGCONT') does not return an error and ps -aux shows the worker process still in the table but "defunct":
iojs 19 3.7 0.0 0 0 pts/0 Z+ 23:00 0:00 [node] <defunct>
This is I guess basically what refack noted on August 15.
PPID is process 1 which is the node process:
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 1000 1 0 4 80 0 - 64478 - pts/0 00:00:00 node
0 Z 1000 19 1 1 80 0 - 0 - pts/0 00:00:00 node <defunct>
I think test-cluster-master-kill.js may be slightly misleadingly named. kill() doesn't really come into it. But I can see why it was named that. Here's what it does:
- child_process.fork() to create a node subprocess
- cluster to launch a worker
- http server so that it will run forever if left on its own

I'm not sure if this is a bug in Node.js or an artifact of the way the test is run as rvagg described above. Is there some system call Node.js should be making to let the OS know to remove the pid from the process table? (If so, why does it only matter in this one edge case?) Or is this just what happens when you kinda sorta bypass some normal operating system stuff?
@nodejs/cluster @nodejs/child_process
I guess probably not a terrible time to re-ping @nodejs/docker too...
I bet some clever libuv folks might have a clue here too since we're getting pretty low @cjihrig @bnoordhuis
@nodejs/libuv
So I think this is all about the fact that the killed child processes aren't reaped properly, because we don't have init or a similar reaping parent process in the chain. I believe I can fix this by putting bash as the parent to Jenkins in the container, something like:
CMD [ "bash", "-c", "cd /home/iojs \
&& curl https://ci.nodejs.org/jnlpJars/slave.jar -O \
&& java -Xmx{{ server_ram|default('128m') }} \
-jar /home/{{ server_user }}/slave.jar \
-jnlpUrl {{ jenkins_url }}/computer/{{ item.name }}/slave-agent.jnlp \
-secret {{ item.secret }}" ]
Here's the bit I don't quite know the answer to: is this bypassing something that should be Node's responsibility? Why aren't we experiencing this on any of our other Docker containers where we do the same thing and execute Jenkins in the same way? I can't find anything special about Alpine 3.9 that would lead to different behaviour. I don't want to be putting a bandaid on something that's a genuine problem on our end.
Actually, I solved a similar problem of non-reaping on the ARM machines running Docker by using --init, which starts a proper init at the root of the container to deal with reaping. But my questions above still stand.
--init seems to have done the trick in this instance https://ci.nodejs.org/job/node-test-commit-linux/nodes=alpine-latest-x64/
I wouldn't mind thoughts from folks more expert in system-level concerns on why this might be a unique problem on a specific distro+version, and on whether it suggests problems on our end. Otherwise, we can probably close this and wait to see if we get issues reported about it.
docker containers not having a functional init is a common source of problems, at least when the container runs code that creates sub-sub-processes that don't get waited on by the sub-process. I recall reviewing C code for a mini-reaper of @rmg that was used as a runner. I wonder if the --init option is newish? Providing it seems the right thing to do for containers with complex sub-process management, no matter what the distro.
Hypothetically, if the order of process termination varies: when the sub-process is allowed to run just marginally past the sub-sub-process's death, it gets the chance to wait on its child processes and they won't become orphaned. If the sub-process terminates before the sub-sub-process exit statuses are available, they become orphaned and reparented to init. Maybe process run/scheduling differences are what we are seeing.
I can buy that as an explanation, I suppose. It's just strange that we're only seeing this on one of the dockerised platforms. Granted, Ubuntu 16.04 is used for the majority, but we've also had Alpine in there for a while now without seeing this. Maybe it's a minor musl change that's impacted timing in some subtle but reliable way.