Chapel: Should Chapel support memory-conscious programming?

Created on 18 Jul 2018 · 8 comments · Source: chapel-lang/chapel

Should Chapel provide tools for querying the amount of memory in use on a node, and collectively across the entire cluster? Currently you can keep track of individual memory allocations via the memory tracker, but that synchronizes all allocations, which would presumably be a severe bottleneck for some applications. What about being able to query the amount of used memory without requiring such coarse-grained synchronization? Perhaps a single atomic counter, or a small set of them, would do.

With such support, users could design their own heuristics for allocating and distributing their memory. For example, they could distribute data toward locales with lower memory consumption, or manage their own memory pools based on how much memory is left on the current locale. I think there can only be a benefit here. For a prototype implementation, perhaps the atomic counter could be incremented in chpl_memhook_malloc_pre and decremented in chpl_memhook_free_pre?
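
To make that concrete, here is a rough sketch of the kind of placement heuristic I have in mind (just an illustration, not a working design). It leans on memoryUsed() and locale.physicalMemory() from the standard Memory module, which today require running with --memTrack; the emptiestLocale helper and its "most headroom wins" policy are invented for the example.

use Memory;

// Hypothetical helper: return the locale that currently has the most free
// memory, based on a per-locale query of physical vs. used memory.
proc emptiestLocale(): locale {
  var bestLoc  = Locales[0];
  var bestFree = min(int);
  for loc in Locales {
    on loc {
      // memoryUsed() reports bytes currently allocated on the locale it
      // runs on; today it requires executing with --memTrack.
      const free = loc.physicalMemory() - (memoryUsed(): int);
      if free > bestFree {
        bestFree = free;
        bestLoc  = loc;
      }
    }
  }
  return bestLoc;
}

// Place the next large temporary wherever there is the most headroom.
on emptiestLocale() {
  var scratch: [1..1000000] real;
  // ... fill and use scratch ...
}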

Runtime Tools Feature Request

All 8 comments

Related to https://github.com/chapel-lang/chapel/issues/9509

Misc thought -- this is also related to heap profiling/stats-gathering. We disable jemalloc stats by default for a slight perf boost, but you can enable them by building jemalloc with CHPL_JEMALLOC_ENABLE_STATS=true set: https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling (note that this is not a "supported" mode; I'm just offering it as a related thought that you can use at your own risk).

Thank you for the suggestion, @ronawho. I'll definitely be sure to try this later; it would be cool to see if I can get a nice memory-aware prototype running, but that's kind of a ways off at this point.

There is currently an intra-node memory counter that is incremented for allocations and decremented for frees. Its value is returned by memoryUsed(). Currently this is only done when memory tracking is enabled, and conflict avoidance is provided by the memory tracking lock. It would be straightforward to make this atomic and hoist the operations on it out from under the memTracking test.
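
For illustration, assuming the program is run with --memTrack, here is a minimal sketch of querying that counter on every locale and summing the results into a cluster-wide total (the variable names are just for the example):

use Memory;

// Sum each locale's bytes-in-use into one cluster-wide total.
var totalBytes: atomic uint;
coforall loc in Locales do
  on loc do
    totalBytes.add(memoryUsed());
writeln("cluster-wide bytes in use: ", totalBytes.read());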

I wouldn't characterize the current memory tracking as a "severe bottleneck". There may be codes that interact badly with it, but keep in mind that it just adds marginal cost to the already expensive memory allocation and deallocation operations, so if turning it on hurts performance the app is probably managing its memory use poorly already.

For fun, I just tried running an ISx binary I had lying around, with and without memory tracking turned on. It was built in the XC-16 performance-testing configuration, and I ran it in that configuration once with the usual execopts and then again with --memTrack --memLeaks added. Here is the result for the "regular" run:

CHPL_RT_COMM_UGNI_BLOCKING_CQ=y ./isx -nl 16 --mode=weakISO --useSubTimers=true
scaling mode = weakISO
total keys = 60129542144
keys per bucket = 134217728 (2**27)
maxKeyVal = 3670016
bucketWidth = 8192 (2**13)
numTrials = 1
numBuckets = 448

Verification successful!

averages across locales of min across trials (min..max):
input = 0.974733 (0.865558..1.04315)
bucket count = 1.12918 (1.12243..1.15517)
bucket offset = 7.00223e-06 (4e-06..1.4e-05)
bucketize = 1.78894 (1.66901..1.96901)
exchange = 2.83353 (2.62916..3.0738)
exchange only = 2.57988 (0.146861..2.83481)
exchange barrier = 0.253648 (2.9e-05..2.92413)
count keys = 0.322768 (0.285525..0.384699)
total = 7.04916 (6.90111..7.15713)

and here is the result for the memLeaks run:

CHPL_RT_COMM_UGNI_BLOCKING_CQ=y ./isx -nl 16 --mode=weakISO --useSubTimers=true --memTrack --memLeaks
scaling mode = weakISO
total keys = 60129542144
keys per bucket = 134217728 (2**27)
maxKeyVal = 3670016
bucketWidth = 8192 (2**13)
numTrials = 1
numBuckets = 448

Verification successful!

averages across locales of min across trials (min..max):
input = 0.974679 (0.877704..1.04117)
bucket count = 1.12968 (1.1225..1.15362)
bucket offset = 9.85937e-06 (5e-06..2.9e-05)
bucketize = 1.79164 (1.6675..1.97462)
exchange = 2.84902 (2.63982..3.10529)
exchange only = 2.57667 (0.09463..2.8349)
exchange barrier = 0.272348 (3.3e-05..2.99753)
count keys = 0.319396 (0.288657..0.379546)
total = 7.06443 (6.91318..7.18183)

There's hardly a difference. (The --memLeaks output came out at the end, after the performance info printed by the app itself, so I didn't include it here.)

"...so if turning it on hurts performance the app is probably managing its memory use poorly already."

I wouldn't say that this would necessarily be a result of poor memory management. Taking a quick glance over at isx.chpl, most of the computation is done on Chapel arrays, right? What about applications built on data structures and work queues that use linked lists (unrolled or otherwise)? Plus there are issues with recycling memory, such as the ABA problem for non-blocking data structures, so avoiding allocations may not always be the best solution.

While it is nice to see that it has only a marginal performance cost for certain applications, I am a bit concerned about others. I don't have any specific applications or benchmarks to back up my claim that it will be a performance bottleneck, though, so I'll withdraw that assertion.

We're not asserting that memtracking is fast today, just not as slow as you might worry :)

We can and should have something much lighter weight, like a simple atomic counter. The current mutex lock/unlock and hash-table manipulation isn't cheap, but Greg's point was that allocation isn't cheap either, so the overhead of the current scheme shouldn't be disproportionately large.
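
(As a user-level sketch of the same idea, and not the runtime change being discussed here: a program can already keep its own atomic tally of the bytes it knowingly allocates, which is essentially all the proposed counter would do, just maintained by the allocation hooks instead. The noteAlloc/noteFree helpers below are made up for the example.)

// Application-level tally of bytes this program knows it has allocated on
// the current locale.
var bytesInUse: atomic int;

proc noteAlloc(nbytes: int) { bytesInUse.add(nbytes); }
proc noteFree(nbytes: int)  { bytesInUse.sub(nbytes); }

config const n = 10000000;
noteAlloc(n * numBytes(real));
{
  var tmp: [1..n] real;     // a large temporary that the program tracks itself
  writeln("tracked bytes in use: ", bytesInUse.read());
}
noteFree(n * numBytes(real));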

I'm really just saying that if an application's performance were strongly affected by turning on memory tracking then that would be a clue that its performance was being driven by allocation and deallocation costs in the first place, and working on that might be the profitable thing to do.

As a data point for the bottleneck side: binary-trees is around 100x slower with memtracking. Binary-trees is effectively an allocation microbenchmark that should be close to a worst-case scenario for memtracking, since it performs millions of concurrent allocations:

time ./binarytrees --n=18 >/dev/null

real    0m0.538s
time ./binarytrees --n=18 --memTrack >/dev/null

real    0m54.713s