At present, on Linux systems that have a /proc/cpuinfo file, we compute the number of CPU cores by counting the processor entries and then dividing by the number of hardware threads per core. We get the number of hardware threads per package from the 'siblings' lines and the number of cores per package from the 'cpu cores' lines. The code assumes hardware homogeneity and produces an internal error if the 'cpu cores' lines are not all the same or the 'siblings' lines are not all the same. This internal error (on the 'cpu cores' lines) is being reported here and in #9170.
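For reference, the computation described above looks roughly like this (a minimal standalone sketch, not the actual chplsys.c code):

```c
/* Count 'processor' entries and divide by hardware threads per core,
 * where threads per core is taken to be siblings / "cpu cores".
 * Like the real code, this assumes the machine is homogeneous. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE* f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return 1;

    char line[256];
    int numPUs = 0, siblings = 0, coresPerPkg = 0;
    while (fgets(line, sizeof(line), f) != NULL) {
        char* colon = strchr(line, ':');
        if (strncmp(line, "processor", 9) == 0)
            numPUs++;
        else if (strncmp(line, "siblings", 8) == 0 && colon != NULL)
            sscanf(colon + 1, "%d", &siblings);
        else if (strncmp(line, "cpu cores", 9) == 0 && colon != NULL)
            sscanf(colon + 1, "%d", &coresPerPkg);
    }
    fclose(f);

    int threadsPerCore = (coresPerPkg > 0) ? siblings / coresPerPkg : 1;
    if (threadsPerCore < 1)
        threadsPerCore = 1;
    printf("logical PUs: %d, physical cores: %d\n",
           numPUs, numPUs / threadsPerCore);
    return 0;
}
```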
We need to improve our core counting on Linux. The first step is a spike to determine what best practice is. This seems like a reasonable place to start.
A quick list of candidate solutions is:

- /proc/cpuinfo processing
- lscpu(1)
- hwloc third-party package

There may be others.
A solution that worked equally well for ARM-based Linux systems would get extra credit, since for those we recently had to add special processing (see here).
If we can script this, I would use it instead of https://github.com/awallace-cray/chapel/commit/df2375006ec53c9f6eedc9a4882daf512c5e6084, which is slightly hacky.
Absent anything better, I intend to change all those to check an env variable and (if found) take the min of the env variable and the multiprocessing.cpu_count() value.
@awallace-cray: This one is about counting physical and logical cores in the runtime, as input information for decision-making in the tasking layers and ultimately the default dataParTasksPerLocale for running Chapel programs. We could use similar techniques, although in the scripting world we have access to Python-based resources that wouldn't be appropriate in the runtime.
@dmk42, in response to your question yesterday: the sched affinity mask doesn't seem to affect what one sees in the /sys/devices/system/cpu/... information. For example:
$ taskset 0x5 cat /sys/devices/system/cpu/cpu0/topology/core_siblings
000f,ff000fff
$ taskset 0x5 cat /sys/devices/system/cpu/cpu0/topology/thread_siblings
0000,01000001
It also doesn't affect what is seen in /proc/cpuinfo, of course.
The sched affinity mask definitely does, or at least can, affect what one sees via hwloc:
$ taskset 0x5 <path>/lstopo --no-caches --no-io --restrict binding
Machine (126GB total) + NUMANode L#0 (P#0 126GB) + Package L#0
Core L#0 + PU L#0 (P#0)
Core L#1 + PU L#1 (P#2)
It's the --restrict binding option that enables the affinity limiting for the display. There is a corresponding way to do the same thing in the programmatic API as well.
There are straightforward hwloc calls to get the numbers of physical and logical processors, via hwloc_get_nbobjs_by_type() with object types HWLOC_OBJ_CORE and HWLOC_OBJ_PU, respectively.
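For example (a minimal sketch, assuming hwloc is installed and its header is on the include path):

```c
/* Count physical cores and logical PUs (hardware threads) with hwloc. */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int numCores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int numPUs   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    printf("physical cores: %d, logical PUs: %d\n", numCores, numPUs);

    hwloc_topology_destroy(topo);
    return 0;
}
```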
hwloc_topology_get_allowed_cpuset() will produce a cpuset referring to just the accessible PUs (hardware threads), and applying hwloc_bitmap_weight() will give the number of them. Counting just the accessible cores is a little more work. We would iterate over the cpuset using hwloc_bitmap_foreach_begin(), call hwloc_get_pu_obj_by_os_index() for each set bit to get the hardware thread and then hwloc_get_ancestor_obj_by_type() to get the core, and use hwloc_topology_{get,set}_userdata() to implement a marking algorithm to count each core just once.
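A sketch of that scheme follows, with the one liberty that it marks already-counted cores in a scratch hwloc bitmap rather than via the topology userdata mentioned above:

```c
/* Count accessible PUs and accessible cores, i.e. restricted to the
 * allowed cpuset. */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);
    int accPUs = hwloc_bitmap_weight(allowed);

    hwloc_bitmap_t seenCores = hwloc_bitmap_alloc();
    int accCores = 0;
    unsigned i;
    hwloc_bitmap_foreach_begin(i, allowed) {
        hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, i);
        if (pu == NULL)
            continue;
        hwloc_obj_t core =
            hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu);
        /* Count each core only the first time one of its PUs is seen. */
        if (core != NULL &&
            !hwloc_bitmap_isset(seenCores, core->logical_index)) {
            hwloc_bitmap_set(seenCores, core->logical_index);
            accCores++;
        }
    } hwloc_bitmap_foreach_end();

    printf("accessible PUs: %d, accessible cores: %d\n", accPUs, accCores);

    hwloc_bitmap_free(seenCores);
    hwloc_topology_destroy(topo);
    return 0;
}
```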
The hwloc lstopo program produces correct output on the ARM64 systems where limited /proc/cpuinfo file contents required the chplsys.c code to fall back to looking at the /sys/devices/system/cpu/... hierarchy.
It seems best to just use hwloc when we have it. Currently this is any time CHPL_TASKS=qthreads, which is the default except under FreeBSD, NetBSD, and Cygwin, i.e. nearly always. At the same time it probably doesn't make sense to start using hwloc when we don't currently, as for example on Linux in "quickstart" mode where we have CHPL_TASKS=fifo. So we still need to keep the existing chplsys.c code that scans the system files, and we also need to fix its assumption that the PU/core ratio within the affinity mask is the same as it is overall.
Currently Qthreads already loads an hwloc topology. That's not a cheap operation. If we're going to load one we want to have Qthreads use ours instead of loading it again. Note that this is not required functionally; there is no functional problem with loading 2 copies of the hwloc topology. It's just for performance.
(The following plan is due to @ronawho.) We can have Qthreads use the Chapel hwloc topology by duplicating one of the Qthreads internal affinity implementation source files under third-party/qthread/qthread-src/src/affinity (the default one, say), calling it hwloc-chpl (say), and modifying it to call the Chapel runtime to get the topology pointer. Then we just add the option --with-topology=hwloc-chpl to the qthreads configure step, and it will use our new affinity implementation.
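To illustrate the shape of that change (hypothetical sketch only; chpl_topo_getHwlocTopology() below is an assumed Chapel runtime accessor, and the surrounding qthreads details are elided):

```c
/* Hypothetical core of an "hwloc-chpl" affinity file: where the stock
 * hwloc affinity implementation initializes and loads its own topology,
 * the copied file would instead ask the Chapel runtime for the topology
 * it has already loaded. */
#include <hwloc.h>

extern void* chpl_topo_getHwlocTopology(void);  /* assumed runtime hook */

static hwloc_topology_t topology = NULL;

static void setup_topology(void) {
    /* Stock version:
     *   hwloc_topology_init(&topology);
     *   hwloc_topology_load(topology);
     */
    topology = (hwloc_topology_t) chpl_topo_getHwlocTopology();
}
```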
Based on comparing lstopo output under taskset to /proc/cpuinfo and /sys/devices/system/cpu/... file contents, it appears that the sched affinity mask bit numbers correspond to the 'processor' lines in the former and to the mask bit numbers and list values in the latter. So, applying the affinity mask to information gathered from those sources should be accurate. Note, however, that currently we only look at /sys/devices/system/cpu/cpu0/..., making the same assumption there as we do elsewhere that the PU/core ratio under the affinity mask is the same as it is overall. We'll need to augment that code to traverse the entire /sys/devices/system/cpu/... hierarchy to get accurate information. However, this is only an issue on ARM with CHPL_TASKS!=qthreads and thus CHPL_HWLOC=none, since if we have hwloc we'll get accurate topology information from that.
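A sketch of what that traversal might look like (standalone, not the chplsys.c code; it identifies a physical core by its (physical_package_id, core_id) pair and applies the process affinity mask from sched_getaffinity()):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    /* Mark (package, core) pairs already counted; the array bounds are
     * arbitrary for the purposes of the sketch. */
    static bool seen[64][1024];
    int accPUs = 0, accCores = 0;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &mask))
            continue;
        accPUs++;

        char path[128];
        int pkg = 0, core = -1;
        FILE* f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                 cpu);
        if ((f = fopen(path, "r")) != NULL) {
            if (fscanf(f, "%d", &pkg) != 1 || pkg < 0)
                pkg = 0;
            fclose(f);
        }

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        if ((f = fopen(path, "r")) != NULL) {
            if (fscanf(f, "%d", &core) != 1)
                core = -1;
            fclose(f);
        }

        if (core >= 0 && pkg < 64 && core < 1024 && !seen[pkg][core]) {
            seen[pkg][core] = true;
            accCores++;
        }
    }

    printf("accessible PUs: %d, accessible cores: %d\n", accPUs, accCores);
    return 0;
}
```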
Here is my recommendation in summary form:
1. Use hwloc to determine the physical and logical processor counts whenever we have it (i.e. whenever CHPL_TASKS=qthreads).
2. Have Qthreads use the hwloc topology that the Chapel runtime loads, rather than loading its own.
3. Extend the existing chplsys.c code that reads the /proc/cpuinfo and /sys/devices/system/cpu/... files to take sched affinity information into account in a more integrated way.

The first 2 should be done together (one PR) in order not to create a disruption in program startup performance, but the 3rd one can be done separately.
Oops, I left out one thing I wanted to say. The description block mentions the possibility of using lscpu(1) to get this information. I've been viewing that as a last resort because of the cost of starting a subprocess and running another program. It looks like we can get accurate info without that, so I don't think it needs further investigation.
I agree. I also looked at lscpu in connection with this once, and it looks like the most fragile approach.