This is currently set for the stack executable in stack.cabal.
It would be nice if -N _were_ a good default, but it still isn't. There are problems, such as GHC issue 8224, that cause rather bad behavior on, e.g., 32, 40, or 64 cores, which are not uncommon now on x86, especially in the HPC case.
This is especially problematic for stack bench, where the stack process itself can pollute the machine and interfere with the measurements being performed.
First, my question would be: what heavy lifting does the stack process itself do that speeds up with multiple HECs? I thought subprocesses were always responsible for the heavy lifting.
Second, if there is parallel work to be done, it seems like GHC really needs a setting which provides the following behavior:
use min 4 getNumProcessors capabilities, i.e. keep -N below 5 but never above the actual core count.

The problem right now is that if we change -N to -N4, then we use 4 threads even when running on single- or dual-core platforms. And yet that's the only current way to prevent >=32 HECs at the other extreme.
In the short term the above policy could be implemented using setNumCapabilities. Care is warranted though because in the past I've managed to crash GHC that way, and I still think it doesn't get heavy use/testing.
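For concreteness, here is a minimal sketch of that policy (my own illustration, not stack's actual startup code; the cap of 4 is just an example value), run before any parallel work starts:

import Control.Concurrent (setNumCapabilities)
import GHC.Conc (getNumProcessors)

main :: IO ()
main = do
  procs <- getNumProcessors
  -- Never use more than 4 HECs, and never more than the machine has.
  setNumCapabilities (min 4 procs)
  -- ... the rest of the program then runs with the capped capability count

The executable still needs to be linked with -threaded for multiple capabilities to exist at all.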
Do you have a concrete example of -N causing problems in stack?
One way to try to generate one is to measure the CPU and system time consumed by the stack process itself on 32+ core platforms, and compare this to running stack manually with, e.g., +RTS -N1 -RTS or -N4.
The more subtle thing is to try to measure this only during the window of time that a benchmark subprocess is active -- that is, whether stack consumes any system or CPU time during that interval.
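To sketch the kind of measurement I mean (a toy illustration only; getCPUTime lumps user and system time together, and covers just the calling process):

import System.CPUTime (getCPUTime)
import System.Process (callCommand)

main :: IO ()
main = do
  t0 <- getCPUTime                 -- CPU time this process has used so far
  callCommand "sleep 5"            -- stand-in for a benchmark subprocess
  t1 <- getCPUTime
  let secs = fromIntegral (t1 - t0) / 1e12 :: Double
  putStrLn ("Parent CPU time while the child ran: " ++ show secs ++ "s")

Run the same thing at +RTS -N1 and at the default -N on a 32+ core box; ideally the parent should burn essentially zero CPU while it is just waiting on the child.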
As we switch more projects to stack that are running on such platforms, I'll try to capture a specific example.
But this issue is also a question -- why is stack designed that way? What benefit do multiple HECs currently provide?
Currently: probably none. But stack can sometimes do truly significant CPU computations, especially if dependency solving ends up moving in-process.
I'm still not understanding why this is causing anyone problems, and it's not like it's difficult to disable if needed in corner cases (+RTS -N1 -RTS), so I'm leaning towards closing as wontfix.
Ah, I wonder if anyone has parallelized dependency solving -- sounds like a fun project.
Ok. I think there's actually a good argument that if more people choose -N, more problems will be exposed down the line: more GHC issues will be filed, instead of just the lonely 8224, and eventually we will fix the GHC RTS issues.
As someone whose job is to work on parallelism, though, I think that in this day and age any hoped-for parallelism that isn't actually benchmarked and quantified is probably not going to yield a benefit (even if it seems like there are enough parallel annotations/forks).
Here's a concrete example/reproducer for reference. Take a random project, not even a big one. (This one is the "hydra print" repo -- pretty small.)
Here's rebuilding an already built project. Thus, there's no actual work, rather checking mod times and all that. It takes less than two seconds:
$ time stack build +RTS -N1
real 0m1.482s
user 0m1.161s
sys 0m0.291s
This is a 32 core machine (small, considering you can get 72-core xeons from Intel these days, not even counting XeonPhi ;-)), so when we run stack build with no RTS flags we get the same result as -N32, which is as follows:
$ time stack build
real 0m3.711s
user 1m0.889s
sys 0m21.423s
Ok, so it's more than 2X slower in real time. But my bigger concern is the 51X increase in user time.
Also, note that the above severely underestimates the problem. This machine has hyperthreading turned off in the BIOS. GHC does not attempt to distinguish hyperthreads from real cores, so in a normal configuration of this machine it would run -N64. To get an idea, here is a slightly unfair example of that same stack build running at +RTS -N64 (on the same machine without rebooting; with hyperthreading active it may do slightly better, but I bet not much):
$ time stack build +RTS -N64
real 0m14.966s
user 6m27.205s
sys 0m42.697s
Ouch: roughly 150X more system time and over 300X more user time, stretching out a quick one-second task into more than six minutes of CPU time. This is not so great for energy usage, or for sharing nicely with other processes if it's a busy machine.
Again, I would love it if I could recommend -N without reservation. Other work-stealing RTS's, especially Intel Cilk Plus, are well engineered so as to make that their default. But GHC is just not there yet in this respect.
Also, I would argue that this is not really a corner case, because it's not specific to the project. It's not like certain projects can put +RTS -N1 in their scripts and call it a day. Rather, it's a function of what machine it runs on.
Maybe it's something that could be set in the global stack config, and then people could set it on a per-machine basis. (And stack would have to enact the policy by using setNumCapabilities when it starts up.)
I think this may also add ~30 seconds of time to many or most travis jobs using stack, so I'd like to reopen this for a while to elicit further comment.
Specifically, the wiki recommends using caching and performing an --only-snapshot install before the script runs to build up that cache:
stack build --only-snapshot --no-terminal
But after the first run, this line does nothing -- and in this build, that's 31.7 seconds of nothing.
In this build, which adds +RTS -N1 -RTS to the stack build line above, the same thing takes only 0.73 seconds.
ooohh... that's where the 32 seconds come from. Been wondering about that for a long time.
Thanks for identifying that. But do you have an explanation why?
I think given that stack currently is a purely sequential process with concurrent subprocesses (and likely this will be true for a long time still), the default that uses fewer system resources and does so more predictably is the right default.
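That shape is worth spelling out. Roughly (an illustration, not stack's actual code), the driver looks like this: the Haskell threads spend nearly all their time blocked waiting on external processes, so they don't need many HECs even though the subprocesses themselves can use every core.

import Control.Concurrent.Async (mapConcurrently_)
import System.Process (callProcess)

-- Hypothetical driver: each package is built by an external ghc process.
-- The OS schedules those children across cores regardless of this
-- process's -N setting; the Haskell threads here mostly just wait.
buildAll :: [FilePath] -> IO ()
buildAll pkgs = mapConcurrently_ build pkgs
  where
    build pkg = callProcess "ghc" ["--make", pkg]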
I did a quick test on travis:
default RTS flags: 4min 12sec https://travis-ci.org/futurice/fum2github/builds/73670708
+RTS -N2 -RTS: 2min 35sec https://travis-ci.org/futurice/fum2github/builds/73701330
+RTS -N1 -RTS: 2min 36sec https://travis-ci.org/futurice/fum2github/builds/73701834
I'd really like to know why this happens!
Well, I'm afraid I don't really know why. We were starting to investigate on GHC issue 8224, which is the relevant place for discussing the excessive system time. (But there's also excessive CPU time here...)
A simple property that would be great to have is that a program that never calls forkIO/par runs just as well with -N32 as with -N1. That is, as in medicine, our parallel runtime should follow the dictum "Primum non nocere": first, do no harm. We don't yet manage that for the GHC RTS, and we need to fix it.
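As a way to check that property, here's the kind of purely sequential micro-benchmark I have in mind (a sketch of mine, not from any existing suite); compile with -threaded -rtsopts and compare +RTS -N1 -s against +RTS -N32 -s:

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Allocation-heavy but completely sequential: no forkIO, no par.
-- Ideally -N32 should behave no worse than -N1 on this program.
main :: IO ()
main = print (go 0 1)
  where
    go :: Int -> Int -> Int
    go !acc !n
      | n > 2000000 = acc
      | otherwise   = go (acc + foldl' (+) 0 [1 .. n `mod` 100]) (n + 1)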
Again, other runtimes like Cilk can sleep non-main worker threads very quickly and efficiently, and there's no reason we can't do the same. I'm personally not familiar with the relevant RTS code, but I think a basic recipe would be: spin only very briefly, then put idle worker threads to sleep on a condition variable (pthread_cond_wait).

I'm not going to be able to commit anything for a few days, but if someone wants to change the .cabal file, please go ahead.
Ok, a good & simple fix might be to just not specify anything (removing -N rather than changing to -N1). In principle, when we make it good enough, maybe GHC itself could use more than one core by default in GHC 7.12, 7.14, or whenever, and stack could inherit that behavior. Or at such a time as it yields a speedup, some degree of parallelism can be turned back on.
I don't have commit access, but I went ahead and put this little one-liner on PR #708.
I left parallelism on for the integration tests... not sure if it helps there and won't matter for most users.
more GHC issues will be filed, instead of just the lonely 8224
There is also https://ghc.haskell.org/trac/ghc/ticket/9221, which is currently being looked at.
My guess is that this is due to GC synchronisation. If you use +RTS -N, that Haskell process really wants the whole machine to itself - but stack is a build tool, it's running lots of other processes on the box. The default heap settings use a small nursery (-A512k), which means the stack process will be GC'ing quite often, and each time it synchronises N threads. During the synchronisation phase, threads spin while calling yield (spinning consumes CPU time, yield consumes system time).
The option +RTS -qi1 was intended to help with this situation, by having idle HECs not participate in GC. It may be that +RTS -N -qi1 would help (I'd be really interested to know whether it does). This is heading in the direction that Ryan mentioned - making it so that the RTS doesn't make things worse when there's no parallelism.
The other thing that tends to help a lot is +RTS -A128m, to use a larger heap size and hence GC less often.
But I suspect what you really want is just +RTS -N1, since stack doesn't need parallelism.
For reference: https://github.com/ghc/ghc/commit/a02eb298d3f6089e51a43307ffb37e3a8076c8fd
commit a02eb298d3f6089e51a43307ffb37e3a8076c8fd
Author: Simon Marlow <>
Date: Fri Dec 9 10:35:46 2011 +0000
New flag +RTS -qi<n>, avoid waking up idle Capabilities to do parallel GC
This is an experimental tweak to the parallel GC that avoids waking up
a Capability to do parallel GC if we know that the capability has been
idle for a (tunable) number of GC cycles. The idea is that if you're
only using a few Capabilities, there's no point waking up the ones
that aren't busy.
e.g. +RTS -qi3
says "A Capability will participate in parallel GC if it was running
at all since the last 3 GC cycles."
Results are a bit hit and miss, and I don't completely understand why
yet. Hence, for now it is turned off by default, and also not
documented except in the +RTS -? output.
Yep, I should have included the +RTS -s output. GC is definitely a culprit here: 11% productivity on the machine mentioned above doing the "empty build" of stack itself (already built):
$ time stack build +RTS -N32 -s
736,798,176 bytes allocated in the heap
400,998,408 bytes copied during GC
83,424,472 bytes maximum residency (11 sample(s))
9,577,416 bytes maximum slop
174 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1193 colls, 1193 par 37.11s 1.23s 0.0010s 0.0168s
Gen 1 11 colls, 10 par 41.67s 1.36s 0.1239s 0.3762s
Parallel GC work balance: 0.36% (serial 0%, perfect 100%)
TASKS: 72 (1 bound, 69 peak workers (71 total), using -N32)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.04s ( 0.02s elapsed)
MUT time 9.90s ( 1.58s elapsed)
GC time 78.78s ( 2.60s elapsed)
EXIT time 0.04s ( 0.02s elapsed)
Total time 88.76s ( 4.21s elapsed)
Alloc rate 74,450,707 bytes per MUT second
Productivity 11.2% of total user, 235.9% of total elapsed
gc_alloc_block_sync: 1237799
whitehole_spin: 0
gen[0].sync: 226
gen[1].sync: 15387
real 0m4.240s
user 1m10.179s
sys 0m18.968s
In fact, stack only gets 41% productivity with default heap settings and +RTS -N1. That improves to 67% with -A128M (0.37s of GC), which drops to 0.23s at -A256M; beyond that, GC time keeps falling slightly but total real time starts heading up.
At -N32, a bigger nursery helps a lot:
$ time stack build +RTS -N32 -s -A512K
real 0m4.428s
user 1m21.322s
sys 0m18.655s
$ time stack build +RTS -N32 -s -A2M
real 0m3.290s
user 0m46.251s
sys 0m16.162s
$ time stack build +RTS -N32 -s -A32M
real 0m2.480s
user 0m19.536s
sys 0m16.045s
$ time stack build +RTS -N32 -s -A64M
real 0m2.350s
user 0m16.089s
sys 0m9.653s
$ time stack build +RTS -N32 -s -A128M
real 0m2.312s
user 0m7.382s
sys 0m9.629s
$ time stack build +RTS -N32 -s -A256M
real 0m2.177s
user 0m3.587s
sys 0m5.916s
$ time stack build +RTS -N32 -s -A512M
real 0m2.213s
user 0m1.400s
sys 0m0.949s
And as for -qi1: not a big difference for some reason. It does somewhat worse at the larger nursery sizes, but it does help mitigate the problem at the small nursery sizes. Hmm...
$ stack --version
Version 0.1.2.6, Git revision 863b976e5542b731873fadb3893717b78203b3bc
$ time stack build +RTS -N32 -s -A512K -qi1
real 0m3.286s
user 0m44.021s
sys 0m5.371s
$ time stack build +RTS -N32 -s -A32M -qi1
real 0m3.182s
user 0m34.130s
sys 0m6.648s
$ time stack build +RTS -N32 -s -A128M -qi1
real 0m2.326s
user 0m14.986s
sys 0m3.980s
$ time stack build +RTS -N32 -s -A256M -qi1
real 0m2.163s
user 0m3.255s
sys 0m5.181s
$ time stack build +RTS -N32 -s -A512M -qi1
real 0m2.216s
user 0m1.395s
sys 0m0.966s
Finally, adding in the remaining GC flags: -qa and -qm don't make much difference, but CPU time does become more variable.
A combination of -qi and -qb basically makes the problem go away:
$ time stack build +RTS -N32 -s -A512K -qi1 -qb
real 0m1.993s
user 0m2.001s
sys 0m0.409s
Neither flag alone really does the trick. But, disabling parallel GC entirely works fine, -qg alone:
$ time stack build +RTS -N32 -s -A512K -qg
real 0m1.932s
user 0m1.741s
sys 0m0.403s
@simonmar, is there a backoff approach that would possibly work here? I can understand why spinning can reduce latency in the "good" case (where we own all processors). But when things start getting out of sync, e.g. due to OS preemption, can we then change behavior to:
back off and sleep instead of spinning, fall back to non-parallel collection (-qg), or some combination of the two?

(FYI, as @ezyang and @gcampax can attest, in our recent Compact Normal Form work, parallel GC was also responsible for some really high-variance results, nondeterministic even on a dedicated machine with capabilities <= numProcs. That must be the load balancing algorithm? @ezyang said it was a problem originating with mutable arrays shared between cores, I believe.)
This ticket has been closed for the past 20h. @rrnewton this is really great data you've been collecting. Would hate for it to be lost. Perhaps continue the discussion in a GHC ticket?
Hah, yes, indeed. I think the barrier to switching before typing was just that trac is a bit more painful than github ;-).
GHC 8 will have a new +RTS option: -Nmax<n>:
" -N[<n>] Use <n> processors (default: 1, -N alone determines",
" the number of processors to use automatically)",
" -Nmax[<n>] Use up to n processors automatically",
-maxN currently, not -Nmax.