This is currently set for the stack executable in stack.cabal.
It would be nice if -N _were_ a good default, but it still isn't. There are problems, such as GHC issue 8224, that cause rather bad behavior on, e.g., 32, 40, or 64 cores, which are not uncommon now on x86, especially in the HPC case.
This is especially problematic for stack bench, where the stack process itself can pollute the machine and interfere with the measurements being performed.
First, my question would be: what heavy lifting does the stack process itself do that speeds up with multiple HECs? I thought subprocesses were always responsible for the heavy lifting.
Second, if there is parallel work to be done, it seems like GHC really needs a setting which provides the following behavior:
use min 4 getNumProcessors capabilities, i.e. keep -N below 5 but never above the actual core count.

The problem right now is that if we change -N to -N4, then we use 4 threads even when running on single- or dual-core platforms. And yet that's the only current way to prevent >=32 HECs at the other extreme.
In the short term the above policy could be implemented using setNumCapabilities. Care is warranted though because in the past I've managed to crash GHC that way, and I still think it doesn't get heavy use/testing.
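For concreteness, here is a minimal sketch of that policy (my own illustration, not stack's actual startup code; the cap of 4 is just an example value), run before any parallel work starts:

import Control.Concurrent (setNumCapabilities)
import GHC.Conc (getNumProcessors)

main :: IO ()
main = do
  procs <- getNumProcessors
  -- Never use more than 4 HECs, and never more than the machine has.
  setNumCapabilities (min 4 procs)
  -- ... the rest of the program then runs with the capped capability count

The executable still needs to be linked with -threaded for multiple capabilities to exist at all.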
Do you have a concrete example of -N causing problems in stack?
One way to try to generate one is to measure the CPU and system time consumed by the stack process itself on 32+ core platforms, and compare this to running stack manually with, e.g., +RTS -N1 -RTS or -N4.
The more subtle thing is to try to measure this only during the window of time that a benchmark subprocess is active -- that is, whether stack consumes any system or CPU time during that interval.
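To sketch the kind of measurement I mean (a toy illustration only; getCPUTime lumps user and system time together, and covers just the calling process):

import System.CPUTime (getCPUTime)
import System.Process (callCommand)

main :: IO ()
main = do
  t0 <- getCPUTime                 -- CPU time this process has used so far
  callCommand "sleep 5"            -- stand-in for a benchmark subprocess
  t1 <- getCPUTime
  let secs = fromIntegral (t1 - t0) / 1e12 :: Double
  putStrLn ("Parent CPU time while the child ran: " ++ show secs ++ "s")

Run the same thing at +RTS -N1 and at the default -N on a 32+ core box; ideally the parent should burn essentially zero CPU while it is just waiting on the child.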
As we switch more projects to stack that are running on such platforms, I'll try to capture a specific example.
But this issue is also a question -- why is stack designed that way? What benefit do multiple HECs currently provide?
Currently: probably none. But stack can sometimes do truly significant CPU computations, especially if dependency solving ends up moving in-process.
I'm still not understanding why this is causing anyone problems, and it's not like it's difficult to disable if needed in corner cases (+RTS -N1 -RTS), so I'm leaning towards closing as wontfix.
Ah, I wonder if anyone has parallelized dependency solving -- sounds like a fun project.
Ok. I think there's actually a good argument that if more people choose -N, more problems will be exposed down the line: more GHC issues will be filed, instead of just the lonely 8224, and eventually we will fix the GHC RTS issues.
As someone whose job is to work on parallelism, though, I think that in this day and age any hoped-for parallelism that isn't actually benchmarked and quantified is probably not going to yield a benefit (even if it seems like there are enough parallel annotations/forks).
Here's a concrete example/reproducer for reference. Take a random project, not even a big one. (This one is the "hydra print" repo -- pretty small.)
Here's rebuilding an already built project. Thus, there's no actual work, rather checking mod times and all that. It takes less than two seconds:
$ time stack build +RTS -N1
real 0m1.482s
user 0m1.161s
sys 0m0.291s
This is a 32 core machine (small, considering you can get 72-core xeons from Intel these days, not even counting XeonPhi ;-)), so when we run stack build with no RTS flags we get the same result as -N32, which is as follows:
$ time stack build
real 0m3.711s
user 1m0.889s
sys 0m21.423s
Ok, so it's more than 2X slower in real time. But my bigger concern is the 51X increase in user time.
Also, note that the above severely underestimates the problem. This machine has hyperthreading turned off in the BIOS. GHC does not attempt to distinguish hyperthreads from real cores, so in a normal configuration of this machine it would run -N64. To get an idea, here is a slightly unfair example of that same stack build running at +RTS -N64 (on the same machine without rebooting; with hyperthreading active it may do slightly better, but I bet not much):
$ time stack build +RTS -N64
real 0m14.966s
user 6m27.205s
sys 0m42.697s
Ouch: roughly 150X more system time and over 300X more user time, stretching out a quick one-second task into more than six minutes of CPU time. This is not so great for energy usage, or for sharing nicely with other processes if it's a busy machine.
Again, I would love it if I could recommend -N without reservation. Other work-stealing RTS's, especially Intel Cilk Plus, are well engineered so as to make that their default. But GHC is just not there yet in this respect.
Also, I would argue that this is not really a corner case, because it's not specific to the project. It's not like certain projects can put +RTS -N1 in their scripts and call it a day. Rather, it's a function of what machine it runs on.
Maybe it's something that could be set in the global stack config, and then people could set it on a per-machine basis. (And stack would have to enact the policy by using setNumCapabilities when it starts up.)
I think this may also add ~30 seconds of time to many or most travis jobs using stack, so I'd like to reopen this for a while to elicit further comment.
Specifically, the wiki recommends using caching and performing an --only-snapshot install before the script runs to build up that cache:
stack build --only-snapshot --no-terminal
But after the first run, this line does nothing -- and in this build, that's 31.7 seconds of nothing.
In this build, which adds +RTS -N1 -RTS to the stack build line above, the same thing takes only 0.73 seconds.
ooohh... that's where the 32 seconds come from. Been wondering about that for a long time.
Thanks for identifying that. But do you have an explanation why?
I think given that stack currently is a purely sequential process with concurrent subprocesses (and likely this will be true for a long time still), the default that uses fewer system resources and does so more predictably is the right default.
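That shape is worth spelling out. Roughly (an illustration, not stack's actual code), the driver looks like this: the Haskell threads spend nearly all their time blocked waiting on external processes, so they don't need many HECs even though the subprocesses themselves can use every core.

import Control.Concurrent.Async (mapConcurrently_)
import System.Process (callProcess)

-- Hypothetical driver: each package is built by an external ghc process.
-- The OS schedules those children across cores regardless of this
-- process's -N setting; the Haskell threads here mostly just wait.
buildAll :: [FilePath] -> IO ()
buildAll pkgs = mapConcurrently_ build pkgs
  where
    build pkg = callProcess "ghc" ["--make", pkg]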
I did a quick test on travis:
default RTS flags: 4min 12sec https://travis-ci.org/futurice/fum2github/builds/73670708
+RTS -N2 -RTS: 2min 35sec https://travis-ci.org/futurice/fum2github/builds/73701330
+RTS -N1 -RTS: 2min 36sec https://travis-ci.org/futurice/fum2github/builds/73701834
I'd really like to know why this happens!
Well, I'm afraid I don't really know why. We were starting to investigate on GHC issue 8224, which is the relevant place for discussing the excessive system time. (But there's also excessive CPU time here...)
A simple property that would be great to have is that a program that never calls forkIO/par runs just as well with -N32 as with -N1. That is, as in medicine, our parallel runtime should follow the dictum "Primum non nocere": first, do no harm. We don't yet manage that for the GHC RTS, and we need to fix it.
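As a way to check that property, here's the kind of purely sequential micro-benchmark I have in mind (a sketch of mine, not from any existing suite); compile with -threaded -rtsopts and compare +RTS -N1 -s against +RTS -N32 -s:

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Allocation-heavy but completely sequential: no forkIO, no par.
-- Ideally -N32 should behave no worse than -N1 on this program.
main :: IO ()
main = print (go 0 1)
  where
    go :: Int -> Int -> Int
    go !acc !n
      | n > 2000000 = acc
      | otherwise   = go (acc + foldl' (+) 0 [1 .. n `mod` 100]) (n + 1)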
Again, other runtimes like Cilk can sleep non-main worker threads very quickly and efficiently, and there's no reason we can't do the same. I'm personally not familiar with the relevant RTS code, but I think a basic recipe would be: spin only very briefly, then put idle worker threads to sleep on a condition variable (pthread_cond_wait).

I'm not going to be able to commit anything for a few days, but if someone wants to change the .cabal file, please go ahead.
Ok, a good & simple fix might be to just not specify anything (removing -N rather than changing to -N1). In principle, when we make it good enough, maybe GHC itself could use more than one core by default in GHC 7.12, 7.14, or whenever, and stack could inherit that behavior. Or at such a time as it yields a speedup, some degree of parallelism can be turned back on.
I don't have commit access, but I went ahead and put this little one-liner on PR #708.
I left parallelism on for the integration tests... not sure if it helps there and won't matter for most users.
more GHC issues will be filed, instead of just the lonely 8224
There is also https://ghc.haskell.org/trac/ghc/ticket/9221, which is currently being looked at.
My guess is that this is due to GC synchronisation. If you use +RTS -N, that Haskell process really wants the whole machine to itself - but stack is a build tool, it's running lots of other processes on the box. The default heap settings use a small nursery (-A512k), which means the stack process will be GC'ing quite often, and each time it synchronises N threads. During the synchronisation phase, threads spin while calling yield (spinning consumes CPU time, yield consumes system time).
The option +RTS -qi1 was intended to help with this situation, by having idle HECs not participate in GC. It may be that +RTS -N -qi1 would help (I'd be really interested to know whether it does). This is heading in the direction that Ryan mentioned - making it so that the RTS doesn't make things worse when there's no parallelism.
The other thing that tends to help a lot is +RTS -A128m, to use a larger heap size and hence GC less often.
But I suspect what you really want is just +RTS -N1, since stack doesn't need parallelism.
For reference: https://github.com/ghc/ghc/commit/a02eb298d3f6089e51a43307ffb37e3a8076c8fd
commit a02eb298d3f6089e51a43307ffb37e3a8076c8fd
Author: Simon Marlow <>
Date: Fri Dec 9 10:35:46 2011 +0000
New flag +RTS -qi<n>, avoid waking up idle Capabilities to do parallel GC
This is an experimental tweak to the parallel GC that avoids waking up
a Capability to do parallel GC if we know that the capability has been
idle for a (tunable) number of GC cycles. The idea is that if you're
only using a few Capabilities, there's no point waking up the ones
that aren't busy.
e.g. +RTS -qi3
says "A Capability will participate in parallel GC if it was running
at all since the last 3 GC cycles."
Results are a bit hit and miss, and I don't completely understand why
yet. Hence, for now it is turned off by default, and also not
documented except in the +RTS -? output.
Yep, I should have included the +RTS -s output. GC is definitely a culprit here: 11% productivity on the machine mentioned above doing the "empty build" of stack itself (already built):
$ time stack build +RTS -N32 -s
736,798,176 bytes allocated in the heap
400,998,408 bytes copied during GC
83,424,472 bytes maximum residency (11 sample(s))
9,577,416 bytes maximum slop
174 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1193 colls, 1193 par 37.11s 1.23s 0.0010s 0.0168s
Gen 1 11 colls, 10 par 41.67s 1.36s 0.1239s 0.3762s
Parallel GC work balance: 0.36% (serial 0%, perfect 100%)
TASKS: 72 (1 bound, 69 peak workers (71 total), using -N32)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.04s ( 0.02s elapsed)
MUT time 9.90s ( 1.58s elapsed)
GC time 78.78s ( 2.60s elapsed)
EXIT time 0.04s ( 0.02s elapsed)
Total time 88.76s ( 4.21s elapsed)
Alloc rate 74,450,707 bytes per MUT second
Productivity 11.2% of total user, 235.9% of total elapsed
gc_alloc_block_sync: 1237799
whitehole_spin: 0
gen[0].sync: 226
gen[1].sync: 15387
real 0m4.240s
user 1m10.179s
sys 0m18.968s
In fact, stack only gets 41% productivity with default heap settings and +RTS -N1. That improves to 67% with -A128M (0.37s of GC), which drops to 0.23s at -A256M; beyond that, GC time keeps falling slightly but total real time starts heading up.
At -N32, a bigger nursery helps a lot:
$ time stack build +RTS -N32 -s -A512K
real 0m4.428s
user 1m21.322s
sys 0m18.655s
$ time stack build +RTS -N32 -s -A2M
real 0m3.290s
user 0m46.251s
sys 0m16.162s
$ time stack build +RTS -N32 -s -A32M
real 0m2.480s
user 0m19.536s
sys 0m16.045s
$ time stack build +RTS -N32 -s -A64M
real 0m2.350s
user 0m16.089s
sys 0m9.653s
$ time stack build +RTS -N32 -s -A128M
real 0m2.312s
user 0m7.382s
sys 0m9.629s
$ time stack build +RTS -N32 -s -A256M
real 0m2.177s
user 0m3.587s
sys 0m5.916s
$ time stack build +RTS -N32 -s -A512M
real 0m2.213s
user 0m1.400s
sys 0m0.949s
And as for -qi1: not a big difference for some reason. It does somewhat worse at the larger nursery sizes, but it does help mitigate the problem at the small nursery sizes. Hmm...
$ stack --version
Version 0.1.2.6, Git revision 863b976e5542b731873fadb3893717b78203b3bc
$ time stack build +RTS -N32 -s -A512K -qi1
real 0m3.286s
user 0m44.021s
sys 0m5.371s
$ time stack build +RTS -N32 -s -A32M -qi1
real 0m3.182s
user 0m34.130s
sys 0m6.648s
$ time stack build +RTS -N32 -s -A128M -qi1
real 0m2.326s
user 0m14.986s
sys 0m3.980s
$ time stack build +RTS -N32 -s -A256M -qi1
real 0m2.163s
user 0m3.255s
sys 0m5.181s
$ time stack build +RTS -N32 -s -A512M -qi1
real 0m2.216s
user 0m1.395s
sys 0m0.966s
Finally, adding in the remaining GC flags: -qa and -qm don't make much difference, but CPU time does become more variable.
A combination of -qi and -qb basically makes the problem go away:
$ time stack build +RTS -N32 -s -A512K -qi1 -qb
real 0m1.993s
user 0m2.001s
sys 0m0.409s
Neither flag alone really does the trick. But, disabling parallel GC entirely works fine, -qg alone:
$ time stack build +RTS -N32 -s -A512K -qg
real 0m1.932s
user 0m1.741s
sys 0m0.403s
@simonmar, is there a backoff approach that would possibly work here? I can understand why spinning can reduce latency in the "good" case (where we own all processors). But when things start getting out of sync, e.g. due to OS preemption, can we then change behavior to:
back off and sleep instead of spinning, fall back to non-parallel collection (-qg), or some combination of the two?

(FYI, as @ezyang and @gcampax can attest, in our recent Compact Normal Form work, parallel GC was also responsible for some really high-variance results, nondeterministic even on a dedicated machine with capabilities <= numProcs. That must be the load balancing algorithm? @ezyang said it was a problem originating with mutable arrays shared between cores, I believe.)
This ticket has been closed for the past 20h. @rrnewton this is really great data you've been collecting. Would hate for it to be lost. Perhaps continue the discussion in a GHC ticket?
Hah, yes, indeed. I think the barrier to switching before typing was just that trac is a bit more painful than github ;-).
GHC 8 will have a new +RTS option: -Nmax<n>:
" -N[<n>] Use <n> processors (default: 1, -N alone determines",
" the number of processors to use automatically)",
" -Nmax[<n>] Use up to n processors automatically",
-maxN currently, not -Nmax.