Version: master
Platform: AIX
Subsystem: io
parallel/test-stdio-closed fails on AIX, pulling this out of https://github.com/nodejs/node/issues/7973 to avoid confusion since issues are not related.
not ok 855 parallel/test-stdio-closed # TODO : Fix flaky test
#
# assert.js:85
# throw new assert.AssertionError({
# ^
# AssertionError: 1 == 42
# at ChildProcess.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-stdio-closed.js:25:10)
# at ChildProcess.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/common.js:416:15)
# at emitTwo (events.js:106:13)
# at ChildProcess.emit (events.js:191:7)
# at Process.ChildProcess._handle.onexit (internal/child_process.js:215:12)
---
duration_ms: 0.378
...
@nodejs/testing @Fishrock123
If it helps anyone troubleshoot or if it gives anyone an idea as to what's going on, the process.stdout.write() is failing (on AIX only) with:
EBADF: bad file descriptor, write
/cc gireeshpunathil who might have more information to add since their last comment.
Also, if it helps anyone, here's the modified version of the test I used to capture that info:
'use strict';
const common = require('../common');
const assert = require('assert');
const spawn = require('child_process').spawn;
if (common.isWindows) {
common.skip('platform not supported.');
return;
}
if (process.argv[2] === 'child') {
try {
process.stdout.write('stdout', function() {
try {
process.stderr.write('stderr', function() {
process.exit(42);
});
} catch (e) {
process.exit(200);
}
});
} catch (e) {
process.send(e.code);
process.send(e.message);
process.exit(20);
}
return;
}
// Run the script in a shell but close stdout and stderr.
const cmd = `"${process.execPath}" "${__filename}" child 1>&- 2>&-`;
const proc = spawn(
'/bin/sh', ['-c', cmd], { stdio: ['ipc', 'inherit', 'inherit'] }
);
proc.on('message', (msg) => {
console.log(msg);
});
proc.on('exit', common.mustCall(function(exitCode) {
assert.strictEqual(exitCode, 42);
}));
Here's the results. I believe the EBADF is at the end because the test runner prints stderr first, then stdout. Probably should have used console.error() instead of console.log() but whatever.
#
# assert.js:85
# throw new assert.AssertionError({
# ^
# AssertionError: 20 === 42
# at ChildProcess.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-stdio-closed.js:41:10)
# at ChildProcess.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/common.js:416:15)
# at emitTwo (events.js:106:13)
# at ChildProcess.emit (events.js:191:7)
# at Process.ChildProcess._handle.onexit (internal/child_process.js:215:12)
# EBADF
# EBADF: bad file descriptor, write
Thanks @Trott - this info will be really useful, I will investigate.
ok, putting things into perspective:
Three processes: a node parent spawns a non-node child (sh) which closes its streams (1 and 2) and then spawns a node child. The grandchild attempts to write to its stream, and then returns an error code.
In Linux, the grandchild node apparently succeeded in writing into stdout, while AIX failed with EBADF (fd closed | invalid)
So the question is, who closed the grandchild node's stream in AIX.
resorting to truss (AIX) and strace (Linux), I see that the parent shell does not perform fork and exec, instead only exec - means the closed streams of sh is inherited to the node grandchild:
Linux:
pid 18948] close(1) = 0
[pid 18948] execve("./node", ["./node", "k.js", "child"], [/* 30 vars */]) = 0
AIX:
3932586: 8913259: close(1) = 0
3932586: 8913259: execve("./node", 0x20010EF8, 0x20011428) argc: 3
This is also visible if I add a timeout in the grandchild and examine the processes:
Linux:
bash-4.1$ ps
PID TTY TIME CMD
16440 pts/0 00:00:00 node
16446 pts/0 00:00:00 node
16457 pts/0 00:00:00 ps
17838 pts/0 00:00:00 csh
17855 pts/0 00:00:00 bash
AIX:
bash-4.3$ ps
PID TTY TIME CMD
5177436 pts/0 0:00 ps
5570730 pts/0 0:00 -ksh
3211596 pts/0 0:00 bash
2097736 pts/0 0:00 ./node k.js
2425576 pts/0 0:00 ./node k.js child
As we can see, there is no /bin/sh in the list. That process itself exec'ed into node.
So, for the grandchild node, fd 1 is inherently closed.
I wrote a C grandchild and used the same test case to invoke it to see how it behaves:
bash-4.3$ cat foo.c
#include <stdio.h>
#include <stdlib.h>
int main() {
fprintf(stdout, "hello\n");
exit(42);
}
Linux:
fstat(1, 0x7fff2453d760) = -1 EBADF (Bad file descriptor)
write(1, "hello\n", 6) = -1 EBADF (Bad file descriptor)
exit_group(42) = ?
AIX:
3735848: 11272557: kioctl(1, 22528, 0x00000000, 0x00000000) = 0 Err#9 EBADF
3735848: 11272557: kwrite(1, " h e l l o\n", 6) Err#9 EBADF
23735848: 11272557: _exit(42)
Here Linux and AIX behaved in the same manner, and the reason is well understood - the grandchild's fd is closed at inception.
So, Linux node seem to have special method to handle closed standard streams: either re-incarnate them through /dev/null device, or suppress the error code from propagation. AIX may be missing this.
At this point, I want to question the Linux behavior: Functionally, a node process spawned with standard streams suppressed, should behave in a console-less manner, and attempts to write to them should err. Any thoughts?
@gireeshpunathil node.cc has code in it designed to detect if any of the stdio streams are closed and (if so) open them on /dev/null.
This test is designed specifically to test that code. And it seems that either that code is not running on AIX (maybe the fstat() call works differently) or else the open() call in that code is returning the correct file descriptor but (for some reason) that file descriptor is not writable.
Any ideas?
Thanks @Trott ! I was looking for such a location. I will now figure out what is happening out there in AIX.
@gireeshpunathil What is the result without spawn on aix?
$ node -e "process.stdout.write('aaa')" 1>&-
@Trott (This is just guess) The code with fstat() is surrounded by #ifdef __POSIX__. Not defined in the aix build?
EDIT: I also found g++ ... '-D__POSIX__' ../src/node.cc line from CI console output. So forget about my guess.
The code with fstat() is surrounded by
#ifdef __POSIX__. Not defined in the aix build?
__POSIX__ should be defined for all non-Windows platforms, according to node.gyp.
I tested that at first, but it succeeded all the time:
bash-4.3$ ./node -e "process.stdout.write('aaa')" 1>&-
bash-4.3$ echo $?
0
bash-4.3$
__POSIX__ is indeed defined in AIX, and I am live debugging into it at the moment. Will get to the bottom shortly, hopefully!
I guess I found the infliction point:
bash-4.3$ cat foo.c
#include <stdio.h>
#include <fcntl.h>
int main() {
struct stat ignored;
int ret = fstat(1, &ignored);
fprintf(stderr, "%d\n", ret);
}
Linux:
bash-4.1$ sh -c ./a.out >&-
-1
bash-4.1$
AIX:
bash-4.3$ sh -c ./a.out 1>&-
0
bash-4.3$
Need to map it back into the problem context, and figure out how this led to the issue. Also need to dig into fstat manuals to understand the disparity.
seems fstat() is there to check stdio fds open.
on aix it return 0, so the code will not open stdio on /dev/null. Then we write at still closed fd. results in EBADF.
ok, sounds reasonable explanation, thanks @veshboo!
I recommend to have #include <sys/stat.h> for sure.
on aix it return 0
Sounds like that, but why?
will continue debugging tomorrow.
Regarding this working on AIX even though the test here is failing:
./node -e "process.stdout.write('aaa')" 1>&-
What about this instead?:
./node -e "process.stdout.write('aaa', () => { console.error('write was successful-ish'); })" 1>&-
I wade into C++ observations with trepidation as I feel like a complete imposter on this stuff, but HERE WE GO:
fstat() on a closed stream is undefined, so AIX and Linux are both right to do what they do?lseek() or fseek() or ftell() or something like that and see if we can make it work across POSIX platforms?/cc @bnoordhuis who introduced both the node.cc code and the test in https://github.com/nodejs/node/pull/875 as a means to prevent information leaks.
Also:
If the consensus turns out to be that this is a bug in AIX, then inserting code to skip the test on AIX (similar to the code already there that skips it for Windows) is a sufficient fix from my perspective.
EDIT: Upon further reflection, skipping might not be OK if it means the potential information leak is there. On the other hand, perhaps the information leak is there too on Windows so...
¯_(ツ)_/¯
Is it possible that behavior of
fstat()on a closed stream is undefined, so AIX and Linux are both right to do what they do?
man 3posix fstat (aka http://pubs.opengroup.org/onlinepubs/009695399/functions/fstat.html, which I _think_ is authoritative) says:
The fstat() function shall fail if:
EBADF The fildes argument is not a valid file descriptor.
So, yes, that may be a bug in AIX then. I could also imagine that something in the background opens another FD before the fstat() call is executed? Like, the pthread operations or the setvbuf() calls in node_main.cc?
@gireeshpunathil What kind of file descriptor does struct stat ignored say it is? It would be good to test that with both your C snippet and the code in node.cc.
@bnoordhuis I'm seeing AIX report a regular file (S_ISREG(ignored.st_mode) is true) when the fstat() unexpectedly succeeds on a file descriptor for a closed stderr.
I have machine access issues right now, hoping to resume my work soon.
The one liner suggested by @Trott breaks in AIX, and is sufficient to differentiate from Linux.
bash-4.3$ ./node -e "process.stdout.write('aaa', () => { console.error('write was successful-ish'); })" 1>&-
fs.js:786
return binding.writeBuffer(fd, buffer, offset, length, position);
^
Error: EBADF: bad file descriptor, write
at Error (native)
at Object.fs.writeSync (fs.js:786:20)
at SyncWriteStream.write (fs.js:2019:6)
at [eval]:1:16
at ContextifyScript.Script.runInThisContext (vm.js:25:33)
at Object.exports.runInThisContext (vm.js:77:17)
at Object.<anonymous> ([eval]-wrapper:6:22)
at Module._compile (module.js:556:32)
at bootstrap_node.js:308:29
at _combinedTickCallback (internal/process/next_tick.js:67:7)
Two scenarios I tested (one with the C program and one with the original node grandchild). Though the final manifestation is same (fstat returns 0), there are differences:
bash-4.3$ cat foo.c
#include <stdio.h>
#include <fcntl.h>
int main() {
struct stat ignored;
memset(&ignored, 0, sizeof(struct stat));
int ret = fstat(1, &ignored);
fprintf(stderr, "%d\n", ret);
}
bash-4.3$ sh -c ./foo >&-
0
bash-4.3$
The internal fields of stat showed:
dev: 655364
ino: 4150
mode: 8630
nlink: 1
flag: 0
uid: 0
gid: 0
rdev: 131074
size: 0
atime: 1473484577
If I modify the above code thus, to make it a self contained code without help of /bin/sh:
bash-4.3$ cat foo.c
#include <stdio.h>
#include <fcntl.h>
int main() {
struct stat ignored;
memset(&ignored, 0, sizeof(struct stat));
int fd = open("/dev/null", O_RDONLY);
dup2(fd, 1);
int ret = fstat(1, &ignored);
fprintf(stderr, "%d\n", ret);
}
I get exactly the same output, which means we know what the shell does when we direct it to close the stdout: re-attaches stdout with /dev/null.
Node program showed this:
dev: 5
ino: 29685
mode: 33188
nlink: 1
flag: 0
uid: 2
gid: 2
rdev: 0
size: 6325
atime: 1473483702
Please note the dev: 5. This is strange.
In both the cases the child is spawned in the same manner: execing the parent
In both the cases, the stdout of the child is actually the stdout of the parent.
However:
In C case, the stdout of the parent was a tty, which got closed, and linked to null device.
In Node case, the stdout of the parent shell (which in turn is a child of main node process), is either an un-named pipe (socket pair) when spawned through default option, or a tty, when spawned through 'inherit' option.
I invoked the fd introspection method uv_guess_handle from the problem site, and captured it in a log file, and the fd type shows 17 (UV_FILE).
Debugging further, but suggestions welcome.
Perl and Ruby seem to deal with closed stdio handles similarly to the way Node.js does. It would be interesting to know if Perl and Ruby handle it successfully on AIX or not.
Assuming you have Perl and Ruby installed, do these (which are my results on OS X) return errors or do these succeed?:
$ perl -e 'print "hi\n"' 1>&-
$ echo $?
0
$ ruby -e 'print "hi\n"' 1>&-
$ echo $?
0
$
If either of them succeed on AIX, it may be worth looking at their source code to see what they're doing differently from Node.js and/or libuv.
Also: Python (on OS X, not AIX):
$ python -c 'print "hi"' 1>&-
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
$ echo $?
0
$
Prints a bunch of stuff that look like error messages but doesn't return an error code. Might be interesting to know what it does on AIX as well.
I get exactly the same output, which means we know what the shell does when we direct it to close the stdout: re-attaches stdout with /dev/null.
For the record, bash on Linux doesn’t seem to do this:
$ ls -l /proc/self/fd 2>&-
total 0
lrwx------ 1 sqrt sqrt 64 Sep 11 19:18 0 -> /dev/pts/18
lrwx------ 1 sqrt sqrt 64 Sep 11 19:18 1 -> /dev/pts/18
lr-x------ 1 sqrt sqrt 64 Sep 11 19:18 2 -> /proc/31578/fd
re: perl and ruby behavior: I get the same behavior in AIX for perl. I don't have ruby in this box, need to get one.
re: python behavior: Same is the case in AIX.
re: bash behavior:
$ ./a.out >&-
$ ls /proc/4129134/fd
0 2
$ ./a.out >&- 2>&-
$ ls /proc/2622134/fd
0
Though I am not sure whether this is related to the issue at hand - whether bash or sh, we are directly invoking the child with the streams closed, so to me, it is not depending on the shell behavior at all. Am I missing something?
According to me, there are two issues out there which needs to be understood:
Once this difference is understood, I guess we need a way to contain both the scenarios, through a newer introspection technique (than fstat) which will identify the fd as a bad descriptor. Then on, it will flow in the same route as that of other POSIX platforms.
Hmmm...so, at least as a last resort, if Perl is successfully avoiding EBADF on Perl in this situation, perhaps we can try to look to see what they're doing and consider doing something similar? I imagine the relevant bits are probably in https://github.com/Perl/perl5/blob/49fc490652d8b428d67872fae3acb10f0b43cff7/perlio.c somewhere...
This is the only code that I can find in perlio that could be performing some kind of skip on AIX:
# elif defined(_AIX) || defined(__osf__) || defined(__irix__)
f->_file = -1;
return 1;
Anyone understand what this code is actually doing?
@GeorgeAdams95 The code (in Perl source) we'd need to look for is not necessarily AIX-specific. I suspect they manage closed stdio streams differently altogether.
The code you pasted invalidates an fd. I don't think that's the code we're looking for. (In the case we're talking about here, the stdout stream is already closed.)
My C is mediocre, but I suspect what we are looking for is in this function: https://github.com/Perl/perl5/blob/49fc490652d8b428d67872fae3acb10f0b43cff7/perlio.c#L1486-L1571 Although if @gireeshpunathil or someone else solves this problem before anyone sorts this avenue out, great! :-D (And if I'm just creating noise here that won't get us anywhere, apologies.)
Perl and Ruby seem to deal with closed stdio handles similarly to the way Node.js does. It would be interesting to know if Perl and Ruby handle it successfully on AIX or not.
Are you sure? On Linux:
$ strace perl -e 'print "hi\n"' 1>&-
…
write(1, "hi\n", 3) = -1 EBADF (Bad file descriptor)
exit_group(0) = ?
+++ exited with 0 +++
Are you sure?
Well, I thought I was, but now, uh, yeah, not so much...
From IBM Knowledge Center - stat.h File
st_dev - Device that contains a directory entry for this file
st_rdev - ID of the device. This field is defined only for block or character special files.
In case of special files such as /dev/(tty, null, zero, ...), we should use st_rdev field to know their devices.
So we need to find out the device whose st_rdev is 131074.
#include <stdio.h>
#include <sys/sysmacros.h>
int main() {
printf("131074: major=%d, minor=%d\n", major(131074), minor(131074));
return 0;
}
Then (with substitutions of major and minor)
ls -l /dev | grep "\<major\>, *\<minor\>"
I tried changing this:
for (int fd = STDIN_FILENO; fd <= STDERR_FILENO; fd += 1) {
struct stat ignored;
if (fstat(fd, &ignored) == 0)
continue;
// Anything but EBADF means something is seriously wrong. We don't
// have to special-case EINTR, fstat() is not interruptible.
if (errno != EBADF)
ABORT();
if (fd != open("/dev/null", O_RDWR))
ABORT();
}
to this:
int fd = -1;
while (fd <= STDERR_FILENO) {
fd = open("/dev/null", O_RDWR);
}
close(fd);
And that _still_ failed on AIX.
I'd be delighted to see someone more knowledgable craft a solution, but at this point, I'm wondering if the best option might not be one of these:
...or...
...or...
Well, I thought I was, but now, uh, yeah, not so much...
Perl does it for stdin but not stdout:
$ strace -e close,fstat,ioctl,open,read,write perl -e 'print "."' 0>&- 1>&-
# <shared library loading omitted>
open("/dev/urandom", O_RDONLY) = 0
read(0, "\232~|\341", 4) = 4
close(0) = 0
ioctl(0, TCGETS, 0x7ffe92faf220) = -1 EBADF (Bad file descriptor)
ioctl(1, TCGETS, 0x7ffe92faf220) = -1 EBADF (Bad file descriptor)
ioctl(2, TCGETS, {B9600 opost isig icanon echo ...}) = 0
open("/dev/null", O_RDONLY) = 0
ioctl(0, TCGETS, 0x7ffe92faf340) = -1 ENOTTY (Inappropriate ioctl for device)
fstat(0, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
write(1, ".", 1) = -1 EBADF (Bad file descriptor)
+++ exited with 0 +++
@bnoordhuis Do you have an opinion as to which of these is the right way to approach solving this issue?
I think it's a good feature to have; I wrote the code for a reason. :-)
The reason to make sure fds 0-2 refer to _something_ is that it's a potential data leak otherwise. Imagine an attacker manages to open a socket with fd 1 - they will receive everything that the application prints through console.log().
@bnoordhuis Would I be correct to conclude that if it turns out it can't be reliably implemented on AIX, then skipping the test on AIX would be your preference and leave the feature alone on other POSIX instances?
Correct.
I suppose another option is to alter the test to confirm that if a stdio stream is closed on startup, that it can't be re-opened. It does mean accepting a behavior difference: Linux (for example) will silently discard writes to the "closed" stream (which will really be re-opened but pointed at /dev/null rather than closed) while AIX will throw EBADF if you try to write to the closed stream. But the main thing is that trying to attach to fd's 0, 1, and 2 all fail.
fcntl(fd, F_GETFD, 0) instead of fstat. Anyone already tried?
fcntl(fd, F_GETFD, 0) instead of fstat. Anyone already tried?
Switched the fstat to this: if (fcntl(fd, F_GETFD, 0) != -1)
make test still passes on my OS X machine.
Still fails on AIX: https://ci.nodejs.org/job/node-stress-single-test/918/nodes=aix61-ppc64/console
Given the results from a couple days ago, I don't think any tweak to the fstat() call has a chance of working by itself. Maybe it successfully detects the closed stdio stream and maybe it doesn't, but it won't fix it. The subsequent open() call won't have the desired result, so we'd still have to fix that somehow as well...
There is ksh process in AIX ps result in @gireeshpunathil's comment.
On AIX, /bin/sh seems to be a link to ksh.
How about changing test-stdio-closed.js to use bash?
How about changing test-stdio-closed.js to use bash?
I'm afraid that appears to have no effect on the test outcome.
https://ci.nodejs.org/job/node-stress-single-test/920/nodes=aix61-ppc64/console
not ok 1 parallel/test-stdio-closed
#
# assert.js:85
# throw new assert.AssertionError({
# ^
# AssertionError: 1 == 42
# at ChildProcess.<anonymous> (/home/iojs/build/workspace/node-stress-single-test/nodes/aix61-ppc64/test/parallel/test-stdio-closed.js:25:10)
# at ChildProcess.<anonymous> (/home/iojs/build/workspace/node-stress-single-test/nodes/aix61-ppc64/test/common.js:416:15)
# at emitTwo (events.js:106:13)
# at ChildProcess.emit (events.js:191:7)
# at Process.ChildProcess._handle.onexit (internal/child_process.js:215:12)
---
duration_ms: 0.279
@Trott while I still want us to better figure out if there is a bug on AIX, I think that:
Given that I'm ok with your suggestion of allowing "AIX throwing EBADF" as a pass for the test on AIX only or disabling the test for AIX with my preference being the former. If we do that we'd still investigate to see if we can come to a more complete understanding/fix.
@gireeshpunathil @bnoordhuis thoughts ?
Most helpful comment
I think it's a good feature to have; I wrote the code for a reason. :-)
The reason to make sure fds 0-2 refer to _something_ is that it's a potential data leak otherwise. Imagine an attacker manages to open a socket with fd 1 - they will receive everything that the application prints through
console.log().