Rakudo: RFC: make default $degree in hyper/race depend on number of cores

Created on 18 Apr 2018 · 4 Comments · Source: rakudo/rakudo

Just spotted that on my 24-core box, setting :degree to 24 instead of the default 4 makes the program nearly twice as fast.

cpan@perlbuild4~/R/rakudo (master)$ lscpu | grep '^CPU('
CPU(s):                24
cpan@perlbuild4~/R/rakudo (master)$ ./perl6 -e '(^∞).hyper.grep(*.is-prime)[10_001].say; say now - ENTER now'
104759
7.87614461
cpan@perlbuild4~/R/rakudo (master)$ ./perl6 -e '(^∞).hyper(:24degree).grep(*.is-prime)[10_001].say; say now - ENTER now'
104759
4.6141672

At the same time, setting :degree to 24 on my 2-core box made the program slower than the default value did:

zoffix@leliana~$ lscpu | grep '^CPU('
CPU(s):                2
zoffix@leliana~$ perl6 -e '(^∞).hyper.grep(*.is-prime)[10_001].say; say now - ENTER now'
104759
20.69858647
zoffix@leliana~$ perl6 -e '(^∞).hyper(:24degree).grep(*.is-prime)[10_001].say; say now - ENTER now'
104759
23.1678845

Maybe we should set the default $degree on .hyper/.race to some precalculated value based on the number of cores on the box, to make this sort of program more "portable"?

ASYNC RFC performance

zoffixznet

Most helpful comment

There's the short, medium, and longer term on this.

In the short term, it's an easy patch to make degree match the core count. I'm not sure why the / 2 suggestion; I'd just make it the direct core count.

In the medium term - and this was part of the discussion @lizmat mentioned from GPW - I think we'll want to pick defaults a bit more smartly for certain source types. While for Seq we can't assume much, if one does .hyper or .race on a List or Array that's already reified, then we can let the array size inform our choice of batch and/or degree. Generally, people putting hyper or race into their code are making a judgement, based upon their expectations of the data, that the parallelization will be worth it. At this point, I think we can assume their judgement is decent. I have a concrete case where the array may have as few as 2 elements, but the work to do on each is significant; at the moment I have to pass :1batch for hyper to be useful. It'd be nice if it said "OK, well, the array is tiny but you said hyper, so let's make the batches tiny". Note that these cases can be handled by implementing hyper and race methods on List, Range, and array.
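A minimal sketch of the "let the array size inform the batch" idea, done by hand at the call site. The helper batch-for is hypothetical and not part of Rakudo, and the fallback of 64 is an assumption of this sketch; it only shows how a tiny, already-reified array could end up with tiny batches.

```raku
# Hypothetical helper (not a Rakudo API): derive :batch from the element
# count so that even a tiny array is spread across all workers.
sub batch-for(Int $elems, Int $degree, Int :$default = 64) {
    # ceil($elems / $degree), using integer arithmetic only
    my $per-worker = ($elems + $degree - 1) div $degree;
    # never below 1, never above the assumed default batch size
    max(1, min($default, $per-worker));
}

my @jobs   = 1, 2;                              # only 2 elements, each "expensive"
my $degree = 4;
my $batch  = batch-for(@jobs.elems, $degree);   # 1 for this input

# Stand-in for the significant per-element work.
my @done = @jobs.hyper(:$batch, :$degree).map({ $_ * $_ });
say "batch=$batch results=@done[]";
```

With two elements and four workers the helper picks a batch of 1, which is the value that currently has to be passed by hand as :1batch; for large inputs it falls back to the assumed default.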

Longer term, we could start trying to do adaptive stuff where we tune it automatically using timing data. That'll be somewhat tricky to get right, and I think it's probably premature to do it now, because ideally we'd have a bunch of real-world use examples of hyper/race to throw at our algorithm and see how it deals with them. In the best case, we can start telling people to prefer the defaults, because most of the time they'll match or beat the manual tuning. That's quite some engineering away, though.

jnthn on 18 Apr 2018

👍2

All 4 comments

FWIW, I discussed this with jnthn at the GPW: we need a better way of handling :batch and :degree, so that it can auto-adapt depending on external load / internal load.

Meanwhile, I think defaulting :degree to Kernel.cpu-cores / 2 would be a sane intermediate step.

I think it’s too early to cast anything else in stone just yet.
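Until the default changes, the same effect can be had at the call site. The sketch below is one way to do it, not a proposed patch: it reads the core count from $*KERNEL.cpu-cores and passes half of it as :degree. The integer division and the floor of 1 are assumptions of this sketch, added so that odd or single-core counts still yield a usable whole number.

```raku
# Sketch only: choose :degree from the core count instead of hard-coding it.
my $cores = $*KERNEL.cpu-cores;       # logical CPUs as seen by Rakudo

# Half the cores, using integer division, but never fewer than one worker.
my $degree = max(1, $cores div 2);

say "cores: $cores, degree: $degree";

# Same workload as in the report above, with the computed degree.
say (^∞).hyper(:$degree).grep(*.is-prime)[10_001];
```

Replacing `$cores div 2` with `$cores` gives the "direct core count" variant, so the same one-liner can be timed on a 2-core and a 24-core box without editing the degree by hand.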

lizmat on 18 Apr 2018

👍1

On 18 Apr 2018, at 17:26, Jonathan Worthington wrote:
In the short term, it's an easy patch to make degree match the core count. I'm not sure why the / 2 suggestion; I'd just make it the direct core count.

This is based on the observation that on hyper-threaded CPUs, it doesn’t make much sense to use all virtual CPUs for the same kind of work.

This is on my Intel i7:

$ time perl6 -e 'say ^Inf .hyper(:4degree).grep( *.is-prime ).skip(10000).head'
104743

real 0m9.770s
user 0m33.395s
sys 0m0.169s

$ time perl6 -e 'say ^Inf .hyper(:8degree).grep( *.is-prime ).skip(10000).head'
104743

real 0m9.066s
user 0m51.810s
sys 0m0.199s

Note that the :8degree run was only about 7% faster in wallclock time (9.07s vs 9.77s), while it used more than 1.5x as much CPU time (51.8s vs 33.4s user).

Perhaps cpu-cores / 2 + 1 would be better?

$ time perl6 -e 'say ^Inf .hyper(:5degree).grep( *.is-prime ).skip(10000).head'
104743

real 0m9.100s
user 0m36.887s
sys 0m0.181s

This gave essentially the same wallclock time as the :8degree run (9.10s vs 9.07s), while using a lot less CPU (36.9s vs 51.8s user).
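To see how the three candidate defaults from this thread compare, here is a small sketch that evaluates them for a few core counts. The sub name candidates, the use of integer division, and the floor of 1 for the halved value are assumptions made for illustration; none of this is Rakudo code.

```raku
# Candidate default degrees discussed in this thread, for a given core count.
sub candidates(Int $cores) {
    %(
        all           => $cores,                  # direct core count
        half          => max(1, $cores div 2),    # cpu-cores / 2
        half-plus-one => $cores div 2 + 1,        # cpu-cores / 2 + 1
    )
}

for 2, 8, 24 -> $cores {
    my %c = candidates($cores);
    say "$cores cores: all=%c<all> half=%c<half> half+1=%c<half-plus-one>";
}
```

For 8 logical CPUs this yields 8, 4 and 5, which are the three degrees timed above, assuming that i7 exposes 8 logical CPUs (the thread does not state the count).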

lizmat on 18 Apr 2018

I think we should avoid making assumptions about CPU behavior. You tested 1 CPU with 1 OS and 1 kind of workload. A more complex workload on Linux running on an AMD Ryzen may show completely different characteristics. I think in general just using whatever core count we get is a good rule of thumb until we have the really smart batcher.