Reading a child_process.spawn stdout stream seems to take ~4x longer than other, "similar" methods of I/O streaming with similarly sized data.
Running on Node.js v4.2.1, Mac OS X 10.11, Macbook Air 1.7GHz
As a baseline, the following program, where file is a 472MB gzip file and stdout is set to ignore, takes 6.5 secs with the gzip process using ~100% CPU:
var spawn = require('child_process').spawn

var file = 'file.gz' // path to the 472MB gzip file (file.gz, as in the shell commands below)

var dataLength = 0
var time = process.hrtime()
var gzip = spawn('gzip', ['-cd', file], {stdio: ['ignore', 'ignore', 'ignore']})

if (gzip.stdout != null) {
  gzip.stdout.on('data', function(data) {
    if (!dataLength) dataLength = data.length
  })
}

gzip.on('close', function(code) {
  var diff = process.hrtime(time)
  console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
  if (dataLength) console.log('buffer length was %d', dataLength)
})
$ node spawn-bench.js
stream took 6.497701762 seconds
If I set the stdout option from ignore to pipe, it suddenly takes 27 secs, with gzip only using ~35% CPU and node using ~75%:
var spawn = require('child_process').spawn

var file = 'file.gz' // same setup as above, but with stdout set to 'pipe'

var dataLength = 0
var time = process.hrtime()
var gzip = spawn('gzip', ['-cd', file], {stdio: ['ignore', 'pipe', 'ignore']})

if (gzip.stdout != null) {
  gzip.stdout.on('data', function(data) {
    if (!dataLength) dataLength = data.length
  })
}

gzip.on('close', function(code) {
  var diff = process.hrtime(time)
  console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
  if (dataLength) console.log('buffer length was %d', dataLength)
})
$ node spawn-bench.js
stream took 27.406851714 seconds
buffer length was 8192
Fine, so that 4x overhead could just be standard pipe/stream overhead, but if I pipe in from stdin, there's really not much overhead at all and it finishes in 7.2 secs, with gzip using ~95% CPU and node ~30%:
var dataLength = 0
var time = process.hrtime()

process.stdin.on('data', function(data) {
  if (!dataLength) dataLength = data.length
})

process.stdin.on('end', function() {
  var diff = process.hrtime(time)
  console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
  if (dataLength) console.log('buffer length was %d', dataLength)
})
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.2209929479999995 seconds
buffer length was 65536
Similarly, if I process the entire gzip file in node using zlib, I get relatively ok performance too, taking 9.8 secs:
var fs = require('fs')
var zlib = require('zlib')

var file = 'file.gz' // same 472MB gzip file

var dataLength = 0
var time = process.hrtime()

fs.createReadStream(file, {highWaterMark: 32 * 1024 * 1024})
  .pipe(zlib.createGunzip({chunkSize: 32 * 1024 * 1024}))
  .on('data', function(data) {
    if (!dataLength) dataLength = data.length
  })
  .on('end', function() {
    var diff = process.hrtime(time)
    console.log('stream took %d seconds', diff[0] + diff[1] / 1e9)
    if (dataLength) console.log('buffer length was %d', dataLength)
  })
$ node spawn-bench.js
stream took 9.836914587 seconds
buffer length was 33554432
So one thing I suspected was the stream buffer size – you can see that when spawning it's only 8KB, whereas it's 64KB when piping from stdin, and I got better performance upping it even further (to 32MB) when reading from the file and gunzipping in node.
However, I couldn't see any way to change this with spawn, so I can't really determine if this is the cause.
Is there something else I'm missing? A 4x slowdown seems to be far greater than expected.
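To poke at that buffer-size hypothesis, here's a small sketch (my addition, not part of the benchmark above) that tallies the chunk sizes coming off a spawned gzip's stdout and prints the JS stream's highWaterMark for comparison – if the chunks stay at 8KB regardless of the water mark, the limit must sit below the stream layer:

var spawn = require('child_process').spawn

var sizes = {}
var gzip = spawn('gzip', ['-cd', 'file.gz'], {stdio: ['ignore', 'pipe', 'ignore']})

gzip.stdout.on('data', function(data) {
  sizes[data.length] = (sizes[data.length] || 0) + 1
})

gzip.on('close', function() {
  // _readableState is internal, but handy for a one-off inspection
  console.log('stream highWaterMark:', gzip.stdout._readableState.highWaterMark)
  console.log('chunk size -> count:', sizes)
})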
I ran with --prof and analyzed... can't really tell what's going on here.
Is this mostly IPC overhead? _openat, _mach_port_extract_member, etc?
Statistical profiling result from isolate-0x101804600-v8.log, (147497 ticks, 602 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
602 0.4%
[Shared libraries]:
ticks total nonlib name
438 0.3% 0.0% /usr/lib/system/libsystem_malloc.dylib
373 0.3% 0.0% /usr/lib/system/libsystem_pthread.dylib
1 0.0% 0.0% /usr/lib/system/libsystem_platform.dylib
1 0.0% 0.0% /usr/lib/system/libsystem_c.dylib
[JavaScript]:
ticks total nonlib name
569 0.4% 0.4% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
487 0.3% 0.3% LazyCompile: *emit events.js:117:44
467 0.3% 0.3% LazyCompile: *nextTick node.js:475:22
464 0.3% 0.3% Stub: LoadICTrampolineStub
405 0.3% 0.3% LazyCompile: *onread net.js:501:16
322 0.2% 0.2% LazyCompile: *_tickCallback node.js:333:27
...
[C++]:
ticks total nonlib name
56380 38.2% 38.4% _openat
44780 30.4% 30.5% _mach_port_extract_member
10645 7.2% 7.3% ___rename_ext
8456 5.7% 5.8% _szone_free_definite_size
907 0.6% 0.6% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
...
[GC]:
ticks total nonlib name
6978 4.7%
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
56380 38.2% _openat
44780 30.4% _mach_port_extract_member
10645 7.2% ___rename_ext
8456 5.7% _szone_free_definite_size
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
56380 38.2% 56380 38.2% _openat
44780 30.4% 44780 30.4% _mach_port_extract_member
10645 7.2% 10645 7.2% ___rename_ext
8455 5.7% 8455 5.7% _szone_free_definite_size
4699 3.2% 405 0.3% LazyCompile: *onread net.js:501:16
3879 2.6% 566 0.4% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
812 0.6% 92 0.1% LazyCompile: *maybeReadMore _stream_readable.js:422:23
582 0.4% 405 0.3% LazyCompile: *nextTick node.js:475:22
515 0.3% 436 0.3% LazyCompile: *emit events.js:117:44
452 0.3% 452 0.3% Stub: LoadICTrampolineStub
186 0.1% 186 0.1% Stub: InstanceofStub
152 0.1% 99 0.1% LazyCompile: *howMuchToRead _stream_readable.js:211:23
332 0.2% 69 0.0% LazyCompile: *readableAddChunk _stream_readable.js:123:26
...
FWIW, I can't reproduce on OS X 10.8. The fact that the process spends nearly 40% of its time in openat() is suspicious.
@bnoordhuis interesting – is there something I can do to narrow this down then?
Perhaps this is an OSX-only issue, even OSX-10.11
I'll test the same setup on a Linux box in a sec
Maybe you can put a dtrace probe on openat() that prints the filename and the time spent inside the system call? As root:
$ dtrace -q \
-n 'syscall::openat_nocancel:entry /pid == $target/ { self->timestamp = timestamp; self->path = copyinstr(arg1); }' \
-n 'syscall::openat_nocancel:return /pid == $target/ { printf("%d %s\n", timestamp - self->timestamp, self->path) }' \
-c 'node test.js'
I'm just guessing at the name of the openat probe; 10.8 doesn't have that system call. dtrace -l | grep -w syscall prints the available probes.
Well, that appears to be the correct syscall afaict:
$ sudo dtrace -l | grep -w syscall | grep openat
1076 syscall openat entry
1077 syscall openat return
1078 syscall openat_nocancel entry
1079 syscall openat_nocancel return
But running the command you gave gives me no (extra) output. If I remove the -q I get:
$ sudo dtrace -n 'syscall::openat_nocancel:entry /pid == $target/ { self->timestamp = timestamp; self->path = copyinstr(arg1); }' -n 'syscall::openat_nocancel:return /pid == $target/ { printf("%d %s\n", timestamp - self->timestamp, self->path) }' -c 'node spawn-bench.js'
dtrace: description 'syscall::openat_nocancel:entry ' matched 1 probe
dtrace: description 'syscall::openat_nocancel:return ' matched 1 probe
stream took 25.471454835 seconds
buffer length was 8192
dtrace: pid 56635 has exited
(should I be looking somewhere else for the dtrace output?)
I was using https://github.com/sidorares/node-tick to process the results of --prof – running it using node-tick-processor --mac – just in case there's the possibility that it's resolving the wrong syscalls? There were no errors/warnings though.
Oh, and replacing openat_nocancel with openat in the dtrace command had no effect either.
Can you use tools/v8-prof/tick-processor.js? The format of the log file is pretty closely tied to the version of V8. If the tick processor and the log file don't match up, you can get bogus results.
@sidorares Don't know if node-tick already does that but newer versions of V8 start the log with a v8-version marker that you can use to print a warning when the tick processor doesn't match. Feel free to steal the logic from here.
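For what it's worth, a rough sketch of that check (my own guess at the mechanics – it assumes the marker is a line like v8-version,4,5,103,35,0 near the top of the log):

var fs = require('fs')

// Look for the v8-version marker in the first few lines of an isolate-*.log
var head = fs.readFileSync(process.argv[2], 'utf8').split('\n').slice(0, 20)
var logVersion = null
head.forEach(function(line) {
  if (line.indexOf('v8-version,') === 0) logVersion = line.slice('v8-version,'.length).replace(/,/g, '.')
})

if (!logVersion) {
  console.warn('no v8-version marker found (older V8 log)')
} else if (logVersion.indexOf(process.versions.v8) !== 0 && process.versions.v8.indexOf(logVersion) !== 0) {
  console.warn('log came from V8 %s but this node has V8 %s – tick output may be bogus',
    logVersion, process.versions.v8)
}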
@bnoordhuis I kinda remember a similar discussion happening here :smile: https://github.com/sidorares/node-tick/issues/10#issuecomment-70250111
Same (similar) result with tools/v8-prof/tick-processor.js. I had deleted the old profile, but a newly generated one looks much the same, largely made up of _mach_port_extract_member, _openat, ___rename_ext and _free:
$ node tools/v8-prof/tick-processor.js ../isolate-0x102800000-v8.log
Statistical profiling result from ../isolate-0x102800000-v8.log, (19119 ticks, 43 unaccounted, 0 excluded).
[Shared libraries]:
ticks total nonlib name
79 0.4% /usr/lib/system/libsystem_malloc.dylib
50 0.3% /usr/lib/system/libsystem_pthread.dylib
[JavaScript]:
ticks total nonlib name
82 0.4% 0.4% Stub: LoadICTrampolineStub
66 0.3% 0.3% LazyCompile: *nextTick node.js:475:22
61 0.3% 0.3% LazyCompile: *onread net.js:501:16
55 0.3% 0.3% LazyCompile: *emit events.js:117:44
45 0.2% 0.2% LazyCompile: *Readable.read _stream_readable.js:250:35
44 0.2% 0.2% LazyCompile: *_tickCallback node.js:333:27
27 0.1% 0.1% Stub: InstanceofStub
...
[C++]:
ticks total nonlib name
6310 33.0% 33.2% _mach_port_extract_member
5364 28.1% 28.2% _openat
2030 10.6% 10.7% ___rename_ext
1308 6.8% 6.9% _free
203 1.1% 1.1% _malloc_zone_valloc
187 1.0% 1.0% _malloc_zone_malloc
146 0.8% 0.8% v8::internal::Heap::RegisterNewArrayBuffer(bool, void*, unsigned long)
86 0.4% 0.5% v8::Context::Enter()
86 0.4% 0.5% _uv__io_poll
...
[Summary]:
ticks total nonlib name
800 4.2% 4.2% JavaScript
18147 94.9% 95.6% C++
1109 5.8% 5.8% GC
129 0.7% Shared libraries
43 0.2% Unaccounted
[C++ entry points]:
ticks cpp total name
27 25.0% 0.1% v8::internal::Builtins::~Builtins()
18 16.7% 0.1% v8::internal::Runtime_CompileLazy(int, v8::internal::Object**, v8::internal::Isolate*)
10 9.3% 0.1% v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*)
9 8.3% 0.0% v8::internal::Runtime_Interrupt(int, v8::internal::Object**, v8::internal::Isolate*)
8 7.4% 0.0% v8::internal::StoreIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
8 7.4% 0.0% v8::internal::LoadIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
7 6.5% 0.0% v8::internal::Runtime_CompileOptimized(int, v8::internal::Object**, v8::internal::Isolate*)
4 3.7% 0.0% v8::internal::KeyedStoreIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
3 2.8% 0.0% v8::Isolate::RunMicrotasks()
2 1.9% 0.0% v8::internal::ToBooleanIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
2 1.9% 0.0% v8::internal::KeyedLoadIC_Miss(int, v8::internal::Object**, v8::internal::Isolate*)
...
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
6310 33.0% _mach_port_extract_member
5364 28.1% _openat
2030 10.6% ___rename_ext
1308 6.8% _free
@bnoordhuis is there a way you know of to increase the buffer size for spawn stdio streams/pipes?
In an attempt to see all of the syscalls made here, I ran dtruss -c -f node <file> and it seems that when stdio is set to ['pipe', 'pipe', 'pipe'], madvise is called significantly more often:
24418 times with pipe, compared with 66 times with ignore.
Working hypothesis: the openat and __rename_ext calls (maybe mach_port_extract_member too) are caused by V8 logging to disk.
@mhart Can you try the dtrace command again while passing --prof to node this time?
FWIW, I tested with 0.12.7 and 0.10.40 on the same machine – I do get slightly better results – still at least 2x slower than reading from stdin though. Would love to up the buffer sizes to see if that helps.
$ nave use 0.12.7 node spawn-bench.js
Already installed: 0.12.7
using 0.12.7
stream took 19.746245263 seconds
buffer length was 8192
$ nave use 0.10.40 node spawn-bench.js
Already installed: 0.10.40
using 0.10.40
stream took 15.013791288 seconds
buffer length was 8192
@bnoordhuis will try that dtrace command now
No difference in the dtrace output using --prof – FWIW there's no difference in execution time either – bizarrely it seems that passing --prof even reduces the time by a few seconds...!? Small sample size I know, but it seems pretty consistent.
$ node spawn-bench.js
stream took 26.928994887 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.790958391 seconds
buffer length was 8192
$ node spawn-bench.js
stream took 25.437598346 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.1429554 seconds
buffer length was 8192
$ node spawn-bench.js
stream took 25.366320914 seconds
buffer length was 8192
$ node --prof spawn-bench.js
stream took 22.147164474 seconds
buffer length was 8192
(ok, maybe forget the --prof-being-faster observation, it's a little inconsistent – in any case, it certainly doesn't appear to be contributing negatively to performance)
Interestingly on an EC2 Ubuntu Linux instance, using node+zlib with large buffers is actually markedly faster than the default gzip command line (which is kinda cool):
$ uname -a
Linux 3.19.0-30-generic #34-Ubuntu SMP Fri Oct 2 22:08:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ node --version
v4.2.1
# using zlib and 32MB buffers:
$ node spawn-bench.js
stream took 9.823702408 seconds
bytes read: 2674732313
buffer length: 33554432
$ time gzip -cd file.gz > /dev/null
real 0m14.706s
user 0m14.584s
sys 0m0.120s
And there appears to be no difference between spawn and stdin times – the spawn buffer is larger on Linux it seems, 64k:
# using spawn with `ignore`:
$ node spawn-bench.js
stream took 14.649728433 seconds
# using spawn with `pipe`:
$ node spawn-bench.js
stream took 15.488220565 seconds
bytes read: 2674732313
buffer length: 65536
# changing to read from stdin:
$ gzip -cd file.gz | node spawn-bench.js
stream took 15.568545432 seconds
bytes read: 2674732313
buffer length: 65536
(these results are consistent after I run them a few times)
So: there's not much difference between ignore and pipe on Linux (~1 sec).
Again, I'd really love to play around with the buffer sizes here to see if it's stream/pipe overhead that's causing the issue, so any pointers on how to set them for spawn would be great.
So 10.11 may be _partly_ to blame. I just tested on a machine w/ Mac OS X 10.10.5 – a 2GHz mid-2012 MB Air, and I saw more like a 2-3x slowdown instead of 4x.
spawn w/ ignore (~6.7 secs):
$ node spawn-bench.js
stream took 6.734095193 seconds
$ node spawn-bench.js
stream took 6.760609026 seconds
$ node spawn-bench.js
stream took 6.768254683 seconds
spawn w/ pipe (~18-19 secs):
$ node spawn-bench.js
stream took 18.098372379 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 18.301364698 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 19.535337659 seconds
bytes read: 2674732313
buffer length: 8192
piping to stdin (~7.7 secs):
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.753027612 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.710982234 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 7.70945793 seconds
bytes read: 2674732313
buffer length: 65536
And the profile for the spawn w/ pipe case seems largely taken up by ___bsdthread_ctl, ___thread_selfusage and _connectx:
$ node-tick-processor --mac isolate-0x101804c00-v8.log
Statistical profiling result from isolate-0x101804c00-v8.log, (15488 ticks, 17 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
17 0.1%
[Shared libraries]:
ticks total nonlib name
47 0.3% 0.0% /usr/lib/system/libsystem_pthread.dylib
20 0.1% 0.0% /usr/lib/system/libsystem_malloc.dylib
1 0.0% 0.0% /usr/lib/system/libsystem_platform.dylib
[JavaScript]:
ticks total nonlib name
55 0.4% 0.4% LazyCompile: *emit events.js:117:44
53 0.3% 0.3% LazyCompile: *_tickCallback node.js:333:27
49 0.3% 0.3% Stub: LoadICTrampolineStub
46 0.3% 0.3% LazyCompile: *nextTick node.js:475:22
42 0.3% 0.3% LazyCompile: *Readable.read _stream_readable.js:250:35
32 0.2% 0.2% LazyCompile: *onread net.js:501:16
24 0.2% 0.2% Stub: InstanceofStub
16 0.1% 0.1% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
16 0.1% 0.1% LazyCompile: *howMuchToRead _stream_readable.js:211:23
15 0.1% 0.1% Stub: GrowArrayElementsStub
15 0.1% 0.1% Stub: CallICStub(args(2), FUNCTION,
12 0.1% 0.1% Stub: LoadICStub
11 0.1% 0.1% Stub: CallFunctionStub_Args3
11 0.1% 0.1% LazyCompile: *readableAddChunk _stream_readable.js:123:26
11 0.1% 0.1% Builtin: ArgumentsAdaptorTrampoline
10 0.1% 0.1% Stub: CompareICStub {1}
...
[C++]:
ticks total nonlib name
5694 36.8% 36.9% ___bsdthread_ctl
3420 22.1% 22.2% ___thread_selfusage
2711 17.5% 17.6% _connectx
741 4.8% 4.8% _szone_free
434 2.8% 2.8% _szone_pressure_relief
115 0.7% 0.7% _szone_free_definite_size
89 0.6% 0.6% _szone_malloc_should_clear
74 0.5% 0.5% _mach_msg_destroy
50 0.3% 0.3% _uv__stream_io
47 0.3% 0.3% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
32 0.2% 0.2% _uv__io_poll
31 0.2% 0.2% __ZN2v88internal14LookupIterator14LookupInHolderEPNS0_3MapEPNS0_10JSReceiverE
29 0.2% 0.2% _vm_region_recurse_64
27 0.2% 0.2% __ZN2v88internal8JSObject12SetPrototypeENS0_6HandleIS1_EENS2_INS0_6ObjectEEEb
27 0.2% 0.2% __ZN2v87Context4ExitEv
26 0.2% 0.2% _tiny_malloc_from_free_list
25 0.2% 0.2% _platform_task_update_threads
25 0.2% 0.2% __simple_asl_msg_set
24 0.2% 0.2% __ZN2v88internal7Context14native_contextEv
24 0.2% 0.2% __ZN2v87Context5EnterEv
23 0.1% 0.1% __ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPS5_
21 0.1% 0.1% __ZN4node9AsyncWrap12MakeCallbackEN2v85LocalINS1_8FunctionEEEiPNS2_INS1_5ValueEEE
21 0.1% 0.1% __ZN4node16ContextifyScript3NewERKN2v820FunctionCallbackInfoINS1_5ValueEEE
21 0.1% 0.1% __ZN2v88internal15TimerEventScopeINS0_17TimerEventExecuteEE13LogTimerEventENS0_6Logger8StartEndE
20 0.1% 0.1% __simple_asl_send
20 0.1% 0.1% __malloc_initialize
...
[GC]:
ticks total nonlib name
1097 7.1%
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
5694 36.8% ___bsdthread_ctl
3420 22.1% ___thread_selfusage
2711 17.5% _connectx
741 4.8% _szone_free
434 2.8% _szone_pressure_relief
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
5694 36.8% 5694 36.8% ___bsdthread_ctl
3419 22.1% 3419 22.1% ___thread_selfusage
2711 17.5% 2711 17.5% _connectx
740 4.8% 740 4.8% _szone_free
482 3.1% 32 0.2% LazyCompile: *onread net.js:501:16
295 1.9% 14 0.1% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
59 0.4% 5 0.0% LazyCompile: *maybeReadMore _stream_readable.js:422:23
40 0.3% 30 0.2% LazyCompile: *nextTick node.js:475:22
55 0.4% 32 0.2% LazyCompile: *emit events.js:117:44
20 0.1% 2 0.0% LazyCompile: ~<anonymous> /Users/sofijastefanovic/spawn-bench.js:37:36
30 0.2% 30 0.2% Stub: LoadICTrampolineStub
28 0.2% 20 0.1% LazyCompile: *Readable.read _stream_readable.js:250:35
135 0.9% 10 0.1% LazyCompile: *readableAddChunk _stream_readable.js:123:26
34 0.2% 19 0.1% LazyCompile: *Readable.read _stream_readable.js:250:35
24 0.2% 23 0.1% LazyCompile: *emit events.js:117:44
21 0.1% 5 0.0% LazyCompile: ~emit events.js:117:44
19 0.1% 12 0.1% LazyCompile: *nextTick node.js:475:22
434 2.8% 434 2.8% _szone_pressure_relief
113 0.7% 113 0.7% _szone_free_definite_size
107 0.7% 53 0.3% LazyCompile: *_tickCallback node.js:333:27
30 0.2% 7 0.0% LazyCompile: doNTCallback2 node.js:436:27
87 0.6% 87 0.6% _szone_malloc_should_clear
74 0.5% 74 0.5% _mach_msg_destroy
55 0.4% 0 0.0% Function: ~<anonymous> node.js:10:10
53 0.3% 0 0.0% LazyCompile: ~startup node.js:13:19
23 0.1% 0 0.0% LazyCompile: ~Module.runMain module.js:465:26
22 0.1% 0 0.0% LazyCompile: Module._load module.js:271:24
19 0.1% 0 0.0% LazyCompile: ~Module.load module.js:346:33
18 0.1% 0 0.0% LazyCompile: ~Module._extensions..js module.js:440:37
16 0.1% 0 0.0% LazyCompile: ~Module._compile module.js:379:37
50 0.3% 50 0.3% _uv__stream_io
47 0.3% 47 0.3% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
43 0.3% 43 0.3% /usr/lib/system/libsystem_pthread.dylib
32 0.2% 32 0.2% _uv__io_poll
31 0.2% 31 0.2% __ZN2v88internal14LookupIterator14LookupInHolderEPNS0_3MapEPNS0_10JSReceiverE
29 0.2% 29 0.2% _vm_region_recurse_64
27 0.2% 27 0.2% __ZN2v88internal8JSObject12SetPrototypeENS0_6HandleIS1_EENS2_INS0_6ObjectEEEb
27 0.2% 27 0.2% __ZN2v87Context4ExitEv
...
Might have to scratch the 10.11 theory – it seems to exist on 10.9 too. I spun up a macincloud instance running OS X 10.9.5 on a 2.5GHz i5.
The issues are even more pronounced on this setup: seeing a 5-7x slowdown compared w/ stdin here.
Here I generated the gzip file using cat /dev/random | head -c 2674732313 | gzip > file.gz – and the time to ungzip is quite quick:
Using spawn w/ ignore (~2 secs):
$ node spawn-bench.js
stream took 2.036773606 seconds
$ node spawn-bench.js
stream took 2.049106005 seconds
$ node spawn-bench.js
stream took 2.00025435 seconds
Using stdin and streaming (~3.5 secs):
$ gzip -cd file.gz | node spawn-bench.js
stream took 3.46003934 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 3.537773093 seconds
bytes read: 2674732313
buffer length: 65536
$ gzip -cd file.gz | node spawn-bench.js
stream took 3.562273549 seconds
bytes read: 2674732313
buffer length: 65536
Using spawn w/ pipe (~19 secs):
$ node spawn-bench.js
stream took 18.548235443 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 20.300110516 seconds
bytes read: 2674732313
buffer length: 8192
$ node spawn-bench.js
stream took 19.733407683 seconds
bytes read: 2674732313
buffer length: 8192
And here's the profile of spawn w/ pipe:
$ node-tick-processor --mac isolate-0x101804c00-v8.log
Statistical profiling result from isolate-0x101804c00-v8.log, (15787 ticks, 29 unaccounted, 0 excluded).
[Unknown]:
ticks total nonlib name
29 0.2%
[Shared libraries]:
ticks total nonlib name
17 0.1% 0.0% /usr/lib/system/libsystem_pthread.dylib
1 0.0% 0.0% /usr/lib/system/libsystem_c.dylib
[JavaScript]:
ticks total nonlib name
71 0.4% 0.5% LazyCompile: *emit events.js:117:44
67 0.4% 0.4% LazyCompile: *_tickCallback node.js:333:27
52 0.3% 0.3% Stub: LoadICTrampolineStub
52 0.3% 0.3% LazyCompile: *Readable.read _stream_readable.js:250:35
43 0.3% 0.3% LazyCompile: *onread net.js:501:16
43 0.3% 0.3% LazyCompile: *nextTick node.js:475:22
40 0.3% 0.3% Stub: InstanceofStub
28 0.2% 0.2% Stub: JSEntryStub
28 0.2% 0.2% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
27 0.2% 0.2% Stub: GrowArrayElementsStub
26 0.2% 0.2% LazyCompile: *isNaN native v8natives.js:29:21
...
[C++]:
ticks total nonlib name
6488 41.1% 41.1% ___initgroups
2701 17.1% 17.1% _proc_listpidspath
1992 12.6% 12.6% _mprotect
295 1.9% 1.9% _purgeable_ptr_in_use_enumerator
257 1.6% 1.6% _purgeable_print
252 1.6% 1.6% _szone_error
205 1.3% 1.3% _szone_good_size
91 0.6% 0.6% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
...
[GC]:
ticks total nonlib name
1361 8.6%
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
6488 41.1% ___initgroups
2701 17.1% _proc_listpidspath
1992 12.6% _mprotect
[Top down (heavy) profile]:
Note: callees occupying less than 0.1% are not shown.
inclusive self name
ticks total ticks total
6487 41.1% 6487 41.1% ___initgroups
2701 17.1% 2701 17.1% _proc_listpidspath
1992 12.6% 1992 12.6% _mprotect
664 4.2% 43 0.3% LazyCompile: *onread net.js:501:16
407 2.6% 27 0.2% LazyCompile: ~readableAddChunk _stream_readable.js:123:26
80 0.5% 40 0.3% LazyCompile: *emit events.js:117:44
35 0.2% 6 0.0% LazyCompile: ~<anonymous> /Users/user113524/spawn-bench.js:37:36
69 0.4% 8 0.1% LazyCompile: *maybeReadMore _stream_readable.js:422:23
42 0.3% 22 0.1% LazyCompile: *nextTick node.js:475:22
20 0.1% 20 0.1% Stub: GrowArrayElementsStub
52 0.3% 28 0.2% LazyCompile: *Readable.read _stream_readable.js:250:35
22 0.1% 7 0.0% LazyCompile: *howMuchToRead _stream_readable.js:211:23
37 0.2% 37 0.2% Stub: LoadICTrampolineStub
180 1.1% 16 0.1% LazyCompile: *readableAddChunk _stream_readable.js:123:26
37 0.2% 17 0.1% LazyCompile: *Readable.read _stream_readable.js:250:35
36 0.2% 31 0.2% LazyCompile: *emit events.js:117:44
26 0.2% 17 0.1% LazyCompile: *nextTick node.js:475:22
20 0.1% 6 0.0% LazyCompile: ~emit events.js:117:44
19 0.1% 19 0.1% Stub: InstanceofStub
295 1.9% 295 1.9% _purgeable_ptr_in_use_enumerator
257 1.6% 257 1.6% _purgeable_print
252 1.6% 252 1.6% _szone_error
205 1.3% 205 1.3% _szone_good_size
158 1.0% 67 0.4% LazyCompile: *_tickCallback node.js:333:27
40 0.3% 12 0.1% LazyCompile: doNTCallback2 node.js:436:27
16 0.1% 4 0.0% LazyCompile: ~maybeReadMore_ _stream_readable.js:429:24
91 0.6% 91 0.6% __ZN2v88internal4Heap22RegisterNewArrayBufferEbPvm
88 0.6% 88 0.6% _bitarray_zap_first_set
87 0.6% 87 0.6% _large_entry_free_no_lock
83 0.5% 83 0.5% _hex
81 0.5% 81 0.5% _szone_malloc_should_clear
...
Also seems that regressions from 0.10.40 to 0.12.7 to 4.2.1 are partly to blame. On this 10.9.5 setup I get the following for spawn w/ pipe:
So, it definitely exists in OS X 10.9, and at least half of the slowdown appears to be from regressions from 0.10 where it performs "only" 2-3x slower than an equivalent stdin pipe.
That said, that factor of 2-3x for 0.10.40 is still much greater than any overhead 4.2.1 causes on Linux, so something weird is happening in OS X land.
So I removed gzip from the equation and ran some experiments with different node versions on my OS X 10.11 machine comparing spawn with stdin, streaming 1GB. Here are the results.
(higher values = worse performance = :crying_cat_face:)

Read further below for how I generated this data, but here are some things to note:
To create this, I generated a file using head -c 1000000000 /dev/random > file.dat and then for each version of node (inc v0.8.28, then v0.10.40 and upwards) I ran the following programs five times each and averaged the results in the chart:
spawn.js:
var spawn = require('child_process').spawn

spawnPipe(function(err, time) {
  if (err) return console.error(err.stack || err)
  console.log([process.version, 'spawn', time].join(','))
})

function spawnPipe(cb) {
  var bufferLength = 0, bytesRead = 0
  var time = process.hrtime()
  var gzip = spawn('cat', ['file.dat'], {stdio: ['ignore', 'pipe', 'ignore']})

  if (gzip.stdout != null) {
    gzip.stdout.on('data', function(data) {
      if (!bufferLength) bufferLength = data.length
      bytesRead += data.length
    })
  }

  gzip.on('error', cb)

  gzip.on('close', function(code) {
    if (code !== 0) return cb(new Error('gzip exited with: ' + code))
    if (bytesRead != 1000000000) return cb(new Error('Incorrect bytes read: ' + bytesRead))
    var diff = process.hrtime(time)
    cb(null, diff[0] + diff[1] / 1e9)
  })
}
stdin.js (using cat file.dat | node stdin.js):
stdinPipe(function(err, time) {
  if (err) return console.error(err.stack || err)
  console.log([process.version, 'stdin', time].join(','))
})

function stdinPipe(cb) {
  var bufferLength = 0, bytesRead = 0
  var time = process.hrtime()

  process.stdin.resume()

  process.stdin.on('data', function(data) {
    if (!bufferLength) bufferLength = data.length
    bytesRead += data.length
  })

  process.stdin.on('error', cb)

  process.stdin.on('end', function() {
    if (bytesRead != 1000000000) return cb(new Error('Incorrect bytes read: ' + bytesRead))
    var diff = process.hrtime(time)
    cb(null, diff[0] + diff[1] / 1e9)
  })
}
I invoked these using the following bash script to generate the CSV data:
#!/bin/bash
VERSIONS="4.2.1 0.12.7 0.10.40 0.8.28 0.11.16 3.3.1 2.5.0 1.8.4 \
0.11.0 0.11.1 0.11.2 \
0.11.3 0.11.4 0.11.5 0.11.6 0.11.7 0.11.8 0.11.9 0.11.10 0.11.11 \
0.11.12 0.11.13 0.11.14 0.11.15 0.12.0 0.12.1 0.12.2 0.12.3 \
0.12.4 0.12.5 0.12.6 4.0.0 4.1.0 4.1.1 4.1.2 4.2.0 \
1.0.0 1.0.1 1.0.2 1.0.3 1.0.4 1.1.0 1.2.0 1.3.0 1.4.1 \
1.4.2 1.4.3 1.5.0 1.5.1 1.6.0 1.6.1 1.6.2 1.6.3 1.6.4 \
1.7.1 1.8.1 1.8.2 1.8.3 2.0.0 2.0.1 2.0.2 2.1.0 \
2.2.0 2.2.1 2.3.0 2.3.1 2.3.2 2.3.3 2.3.4 2.4.0 \
3.0.0 3.1.0 3.2.0 3.3.0"
for v in $VERSIONS; do
  for i in 0 1 2 3 4 5; do
    nave use $v node spawn.js
    nave use $v sh -c 'cat file.dat | node stdin.js'
  done
done
This doesn't shed much light on why Mac OS X would be any different from Linux (or if something happened from OS X 10.8 to 10.9) – but it does highlight where some pretty serious regressions occurred for spawn on OS X itself.
@bnoordhuis @mhart I added version marker support but haven't published yet - need to properly extract all historical tickprocessor.js versions and map which one to use ( and add heuristic to detect which to use if no marker )
Here's another version of the same data, showing processing speed this time (so higher is better).
That drop from v0.11.3 to v0.11.4 is just... ouch.

(Updated: there's less variance if I use the max instead of the mean)
(as a hunch, any chance that massive drop in performance from 0.11.3 to 0.11.4 is due to the removal of the SlabAllocator in https://github.com/nodejs/node/commit/ec90e6e80adbb9c6f31158c319eba4e35edda845 ?)
It's possible. The OS X implementation of malloc is pretty naive.
A graph of Linux performance for comparison. Ran on an EC2 c3.large instance, generated file using fallocate -l 1000000000 file.dat

No disparity really exists here (indeed spawn is usually _faster_ than stdin), and the v0.11.3 to v0.11.4 leap doesn't exist. Rather precipitous drop entering v3.x though.
Still think the buffer size difference must be a factor on OS X here. Smaller buffer for spawn = more allocations = crappy performance.
Extracted the v0.11.3 to v0.11.4 regression into a separate issue
Extracted the regression from v3.0.0 onwards into a separate issue
With the two performance regressions noted in this issue extracted into #3475 and #3477, does it make sense to close this one?
@Trott it doesn't make sense to close – this is a separate issue – specifically about the gap that you can see in https://github.com/nodejs/node/issues/3429#issuecomment-149424339 regardless of version
The other two issues you referred to are version-specific
So basically, on Mac there's a huge difference between spawn and stdin that doesn't exist on Linux, and there always has been
@mhart If it's not onerous, any chance we can get some graphs for more recent versions of Node.js and/or macOS to see if this is still persisting? (I'm guessing it is.)
@Trott I'm guessing it is too – instructions to repro are here: https://github.com/nodejs/node/issues/3429#issuecomment-149387804
Yep, still similar on latest macOS/Node.js – reading a spawn stream still much slower than reading a stdin stream:

reading a spawn stream still much slower than reading a stdin stream:
Guess we can't turn it around and say "Oh, it's a feature! stdin is much faster on OS X than spawn!" Just kidding. I think.
/cc @nodejs/platform-macos @nodejs/benchmarking
So @Trott what do you think about my buffer size theory then?
one thing I suspected was the stream buffer size – you can see when spawning it's only 8k, whereas it's 64k when piping from stdin, and I had better performance upping it even further (to 32MB)
any pointers on how to set them for spawn would be great
/cc @cjihrig
@mhart was this affected by the TF+I landing?
@MylesBorins I mean... I doubt it. But you can always double check! https://github.com/nodejs/node/issues/3429#issuecomment-149387804
Nah, no difference from v8.1.3 – spawn is still only 60% of the performance of stdin:

(again, on Linux there's no difference)
And just to confirm: manual gc'ing doesn't change the spawn/stdin gap, it just gets rid of the performance regression from v0.10.48:

easily recreatable with this:
$ cat 3429.js
var spawn = require('child_process').spawn

var chunk = 'g'.repeat(65535)

var cp = spawn('echo', [chunk])
var count = 0

cp.stdout.on('data', (d) => {
  count++
})

cp.on('close', () => {
  console.log(`${require('os').platform()}: ${count}`)
})
Linux: 1
darwin: 8
The key difference here is that Linux reads 64KB (or sometimes even 1MB) chunks in one shot – probably limited only by the socket buffer limits – while on mac it is always 8KB, so many more round trips are needed to complete the task.
To check whether echo or cat is rate-limiting this from within, eliminate them and bring in node as the child too:
var spawn = require('child_process').spawn
var fs = require('fs')

var chunk = 'g'.repeat(65535)

if (process.argv[2] === 'child') {
  console.log(chunk)
} else {
  var cp = spawn(process.execPath, [process.argv[1], 'child'])
  var count = 0

  cp.stdout.on('data', (d) => {
    count++
  })

  cp.on('close', () => {
    console.log(`${require('os').platform()}: ${count}`)
  })
}
But again, the same pattern is observed - the data is chunked at the descriptor level.
30358/0x938fb: write(0x1, "gggggggggggggggg", 0xC000) = 8192 0
30356/0x938e0: kevent(0x3, 0x7FFF5FBF6C00, 0x0) = 1 0
30356/0x938e0: read(0xD, "gggggggggggggggg", 0x10000) = 8192 0
Comparing with the good case – command | node, where stdin is used as opposed to stdout – bash performs the piping, which results in large chunks flowing through:
30572/0x94ea0: read(0x0, "ggggggggggggggggg", 0x10000) = 65536 0
In the slow case, fd 1 is a socket.
In the good case, fd 0 is a FIFO (a named pipe).
This is the same on macOS and Linux, but the underlying resource-utilization characteristics seem to be different.
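A quick way to confirm that (a sketch of my own, not part of the original trace) is to fstat the descriptor inside the child and report its type:

var spawn = require('child_process').spawn
var fs = require('fs')

if (process.argv[2] === 'child') {
  // Inside the child, fd 1 is whatever the parent wired up for stdout
  var st = fs.fstatSync(1)
  process.stderr.write('stdout: isSocket=' + st.isSocket() + ' isFIFO=' + st.isFIFO() + '\n')
} else {
  spawn(process.execPath, [process.argv[1], 'child'], {stdio: ['ignore', 'pipe', 'inherit']})
}

With spawn's 'pipe' stdio this reports a socket (libuv creates a socketpair for it), whereas under shell piping the descriptor shows up as a FIFO. The libuv patch below swaps that socketpair for a plain pipe: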
diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..93d83b48e1 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -200,7 +200,8 @@ static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
if (container->data.stream->type != UV_NAMED_PIPE)
return UV_EINVAL;
else
- return uv__make_socketpair(fds, 0);
+ return uv__make_pipe(fds, 0);
This patch makes the darwin case on par with Linux. However, I haven't run any other tests or thought through the implications of doing so; I just wanted to complete the investigation and confirm my findings above.
I added these values in /etc/sysctl.conf, but it does not seem to help; probably the variables are different for UNIX domain sockets:
kern.ipc.maxsockbuf=8000000
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=2097152
So the bottom line is that socket buffers are smaller and slower than pipes on darwin, at least in their default configurations.
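For reference, the darwin defaults can be inspected directly (macOS-only; these are the same net.local sysctls that show up in the /etc/sysctl.conf experiments further down):

// Print the default UNIX-domain stream-socket buffer sizes on darwin
var execSync = require('child_process').execSync
console.log(execSync('sysctl net.local.stream.sendspace net.local.stream.recvspace').toString())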
/cc @nodejs/child_process
@gireeshpunathil yeah, I figured the buffer size had a large part to play in the slowdown.
The only problem is I can't see how to adjust it for child_process.spawn like I can with other streaming methods.
Let us wait to hear from the team. I will also try to understand the differences between FIFOs and sockets in terms of buffering, and which OS tunables control them.
@gireeshpunathil I don't think you can pass ancillary data via pipes. Maybe that's the reason for using sockets instead of pipes.
@nodejs/libuv (for @gireeshpunathil's questions/observations above)
I don't think you can pass ancillary data via pipes. Maybe that's the reason of using sockets instead of pipes.
That's right. Sending file descriptors only works over UNIX domain sockets.
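For context, a minimal sketch (mine, not from this thread) of the fd passing that forces the UNIX-domain-socket choice: subprocess.send() can hand a listening server to a child only because the channel can carry that kind of ancillary data.

var fork = require('child_process').fork
var net = require('net')

if (process.send) {
  // child: receives the listening socket's handle over the IPC channel
  process.on('message', function(msg, server) {
    if (msg === 'server') server.on('connection', function(c) { c.end('handled by child\n') })
  })
} else {
  var child = fork(__filename)
  var server = net.createServer().listen(0, function() {
    child.send('server', server) // only possible because the channel is a socketpair
  })
}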
I'm not aware of socket options that let you tweak the size of the send and receive buffers; I'm reasonably sure there aren't any.
I'm not aware of socket options that let you tweak the size of the send and receive buffers; I'm reasonably sure there aren't any.
@bnoordhuis Does that mean this should likely be closed with something along the lines of "Buffer size is necessarily OS dependent, so this is just the way it is."? Or am I reading too much into what you're saying?
No, I think that sums it up. I'll close this out.
Libuv could tickle the kernel in ways that might eke out a few percent more throughput but not in any way that's documented or stable. The next Apple patch could undo that optimization again.
FWIW, Node.js on my 2012 MacBook Air can push almost 200M/sec to wc -c, which seems plenty fast to me:
let { WriteWrap } = process.binding('stream_wrap')
let { spawn } = require('child_process')

let proc = spawn('sh', ['-c', 'wc -c'], {stdio: ['ipc', 'inherit']}) // 'ipc' or 'pipe'
let data = Buffer.from('.'.repeat(1 << 20))

let i = 192, next = () => {
  if (--i < 0) return proc.channel ? proc.disconnect() : proc.stdin.end()
  let req = new WriteWrap()
  req.async = false
  req.handle = proc.channel || proc.stdin._handle
  req.oncomplete = next
  req.handle.writeBuffer(req, data)
}

next()
$ time ./out/Release/node 3429.js
201326592
real 0m1.024s
user 0m0.964s
sys 0m0.315s
We should look into reducing that user CPU time but that isn't specific to MacOS.
I'm not aware of socket options that let you tweak the size of the send and receive buffers; I'm reasonably sure there aren't any.
So SO_SNDBUF and SO_RCVBUF options wouldn't work for this, right? (Just asking, I haven't used them)
No, with qualifications. It's not that you can't set them, but they don't have any real effect.
I initially figured it might help with blocking reads and writes (because UNIX domain sockets in XNU do have a concept of high and low water marks) but there's no measurable impact.
FWIW, I was able to successfully apply a 64KB setsockopt on the socketpair, and it improves things:
with existing logic:
v9.11.1,spawn,3.435861797
v9.11.1,spawn,2.351160667
v9.11.1,spawn,2.348283289
v9.11.1,spawn,2.534070136
v9.11.1,spawn,2.424818775
with 64KB SNDBUF and RCVBUF on the socketpair:
v10.0.0-pre,spawn,0.590655946
v10.0.0-pre,spawn,0.568137925
v10.0.0-pre,spawn,0.563020876
v10.0.0-pre,spawn,0.564211097
v10.0.0-pre,spawn,0.594333673
29985/0x133545: writev(0x1, 0x7FCA69C02430, 0x2) = 65536 0
29983/0x13352d: read(0xD, "ggggggggggggggggg", 0x10000) = 8192 0
29983/0x13352d: kevent(0x3, 0x7FFF5FBF6C00, 0x0) = 1 0
29983/0x13352d: read(0xD, "ggggggggggggggggg", 0x10000) = 8192 0
Reads still come in at only 8KB, but it looks like the kernel buffer holds more data, so Node was able to read 8 times back-to-back before making the callback, and the performance is not lost.
In the absence of clear documentation on OS tunables for UNIX domain sockets, I agree this change is not reasonable.
That's interesting. Does my benchmarklet from https://github.com/nodejs/node/issues/3429#issuecomment-382325325 give different numbers on your system? I tried what you tried but it made no difference.
Diff so we're talking about the same thing:
diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..767a8cd940 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -144,6 +144,20 @@ skip:
uv__nonblock(fds[1], 1);
}
+ {
+ int fd, err, opt;
+
+ for (fd = 0; fd < 2; ++fd) {
+ opt = 65535;
+ if ((err = setsockopt(fds[fd], SOL_SOCKET, SO_SNDBUF, &opt, sizeof(opt))))
+ return err;
+
+ opt = 65535;
+ if ((err = setsockopt(fds[fd], SOL_SOCKET, SO_RCVBUF, &opt, sizeof(opt))))
+ return err;
+ }
+ }
+
return 0;
}
mine is this:
diff --git a/deps/uv/src/unix/process.c b/deps/uv/src/unix/process.c
index 3a3cfd6f09..724fc7b660 100644
--- a/deps/uv/src/unix/process.c
+++ b/deps/uv/src/unix/process.c
@@ -188,6 +188,8 @@ skip:
static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
int mask;
int fd;
+ int ret;
+ int sz = 64 * 1024;
mask = UV_IGNORE | UV_CREATE_PIPE | UV_INHERIT_FD | UV_INHERIT_STREAM;
@@ -199,8 +201,20 @@ static int uv__process_init_stdio(uv_stdio_container_t* container, int fds[2]) {
assert(container->data.stream != NULL);
if (container->data.stream->type != UV_NAMED_PIPE)
return UV_EINVAL;
- else
- return uv__make_socketpair(fds, 0);
+ else {
+ ret = uv__make_socketpair(fds, 0);
+ if(0 == ret) {
+ ret = setsockopt(fds[0], SOL_SOCKET, SO_RCVBUF, &sz, 4);
+ if(0 == ret)
+ ret = setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &sz, 4);
+ if(0 == ret)
+ ret = setsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &sz, 4);
+ if(0 == ret) {
+ ret = setsockopt(fds[1], SOL_SOCKET, SO_SNDBUF, &sz, 4);
+ }
+ }
+ return ret;
+ }
Plus, I have this (really unsure whether it had any impact or not):
$ cat /etc/sysctl.conf
kern.ipc.maxsockbuf=8000000
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=2097152
net.inet.tcp.autorcvbufmax=33554432
net.inet.tcp.autosndbufmax=33554432
net.local.stream.recvspace=65536
net.local.stream.sendspace=65536
net.inet.tcp.slowlink_wsize=65536
net.inet.tcp.autosndbufinc=65536
net.inet.raw.maxdgram=65536
net.inet.raw.recvspace=65536
net.inet6.ip6.maxfrags=65536
net.stats.recvspace=65536
will try your benchmark now.
$ time ./out/Release/node 3429.js
201326592
real 0m0.753s
user 0m0.696s
sys 0m0.185s
$
pretty much the same, with or without the patch.
Right, so at least that matches up. Tweaking net.local.stream.recvspace and net.local.stream.sendspace had no discernible impact for me either.
I think we've tried enough that we can safely say there's no _easy_ fix. :-)