The atomic methods for `Atomic` call `pointer_from_objref` and then invoke the operation with `llvmcall`. If we exposed that pointer API, we could do atomic operations on arrays more easily. Is there any reason to dispatch the atomic operations only on the `Atomic` type?
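For context, here is a minimal sketch of the pattern described above (a pointer to the `Atomic` wrapper plus an LLVM `atomicrmw`). This is a simplified illustration, not the actual Base implementation, and `my_atomic_add!` is a made-up name:

```julia
using Base.Threads: Atomic

# Simplified illustration: get a pointer to the Atomic wrapper object,
# then run an atomicrmw add on it via llvmcall.
function my_atomic_add!(x::Atomic{Int64}, v::Int64)
    ptr = convert(Ptr{Int64}, pointer_from_objref(x))
    GC.@preserve x begin
        Core.Intrinsics.llvmcall("""
            %ptr = inttoptr i64 %0 to i64*
            %rv = atomicrmw add i64* %ptr, i64 %1 acq_rel
            ret i64 %rv
            """, Int64, Tuple{Ptr{Int64}, Int64}, ptr, v)
    end
end
```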
Ah, I see, this differs from #29943. That is asking for atomic operations on pointers, while this is about doing atomic operations via pointers to other structures like arrays. I will try to clarify the title.
This recently came up here: https://discourse.julialang.org/t/parallelizing-a-histogram/41260/9
Would be great to have `atomic_add!(A::Array, x, idxs)`, etc.
Quoting my comment in the discourse thread:
OK, so I uploaded the code I used for trying out atomic increments on array elements: https://github.com/tkf/ParallelIncrements.jl
(Comment added: I just realized that the name of the package is completely misleading since it was about single thread performance. Initially, I was thinking about trying it with multiple threads.)
This package contains a function `ref = atomicref(::Array, indices...)` which can be used for atomic operations like `Threads.atomic_add!(ref, x)`. As you can see in the README, the overhead is 10x on my laptop. It can go up to 100x and possibly more, depending on the kind of optimization the compiler can do. It might be possible to get a lower overhead using other memory orderings, but I haven't tried it yet.
I'm not a big fan of this direction, but I'm interested in whatever is beneficial for HPC in Julia. So I'm sharing it in case some people want to try it out (and maybe find some performance bugs I made in the sample code or something).
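A hedged usage sketch of the `atomicref` API described in the quoted comment, based only on the signature given there (not verified against the package):

```julia
using ParallelIncrements: atomicref
using Base.Threads: atomic_add!

A = zeros(Int, 1000)
ref = atomicref(A, 42)   # reference to the element A[42]
atomic_add!(ref, 1)      # atomically increment A[42]
```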
Came across this - that's on a GPU, though: https://devblogs.nvidia.com/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
A slowdown of 10x may actually be quite acceptable in some cases, compared to the alternative.
Obviously, when you deal with things like small histograms (histograms are just a nice example use case for this), it's most efficient for every thread to histogram separately in first-level cache and then merge afterwards. But if the histogram is large (multi-dimensional, many bins), atomics with a 10x overhead (compared to non-atomic writes) may still be faster than maintaining one histogram per thread, especially if the number of threads is large. Just a hunch, of course. I guess it'll come down to memory bandwidths, latencies and caching behavior in the individual use case ... hard to predict.
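For reference, a minimal sketch of the per-thread-histogram-then-merge alternative mentioned above; the function name and the flat-index input are made up for illustration:

```julia
using Base.Threads

# Each thread fills its own histogram (stays in its cache); merge at the end.
function hist_per_thread(idxs::Vector{Int}, nbins::Int)
    locals = [zeros(Int64, nbins) for _ in 1:nthreads()]
    @threads for i in idxs
        locals[threadid()][i] += 1
    end
    return reduce(+, locals)
end
```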
@tfk, do you think the `atomicref` comes with some additional overhead, compared to having an `atomic_add` that directly operates on array and index, or will the compiler optimize that away?
I agree there might be some situations where this can be beneficial. I uploaded the code because I thought it'd be great if people could explore the use of it.
But, let me also note that my benchmark is about single-thread performance. If you are using multiple threads, atomic can hurt even more because it has to invalidate the cache of all CPUs. So, I'd guess 10x is more like a lower bound of the overhead.
`atomicref` comes with some additional overhead, compared to having an `atomic_add` that directly operates on array and index

I didn't see anything suspicious about it when I quickly looked at the LLVM IR.
(BTW, you might want to use @tkf to ping me, to avoid pinging another user.)
If you are using multiple threads, atomic can hurt even more because it has to invalidate the cache of all CPUs.
Yes, I've been thinking about that - however, if the access pattern is scattered access to a fairly large amount of memory, the threads would, for the most part, not share cache lines, right?
Ah, yes, you might be right.
For a histogram-type operation, I'd expect a (histogram-shaped) distribution of cache conflicts, resulting in a large slowdown for each additional thread you add, starting at 10x slower just for going from single-threaded to 1-thread-but-threadsafe-with-atomics. Unlike @tkf, I have no numbers though, and I'm just somewhat parroting his comment.
This is actually relevant, because I'm currently thinking about not (yet) having atomics for individual array elements (just too likely to be slow and/or incorrect), so that you should perhaps instead just use a lock over the whole array. https://docs.google.com/document/d/e/2PACX-1vT7Ibthj9WyM8s5bcQbiKsVK6MtvzqmnPFMy-bcjZLlbqv55a0_sTJ99AkbvPIZk3t7MbhZ57NzaIzC/pub
(Note: the idealized cost of a lock is on the order of a couple of atomic operations, since that's about all it takes to implement one; it's important to realize that the cost of a lock and an atomic are therefore roughly the same.)
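To make that suggestion concrete, here is a minimal sketch of protecting a shared histogram with a single lock; the function name, the `idxs` input (a collection of index tuples), and the choice of `SpinLock` are assumptions for illustration:

```julia
using Base.Threads

# One lock guards the whole output array; each increment takes the lock briefly.
function fill_hist_locked!(hist::Array{Int64}, idxs)
    lck = SpinLock()
    @threads for I in idxs
        lock(lck)
        try
            hist[I...] += 1
        finally
            unlock(lck)
        end
    end
    return hist
end
```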
Well, you kinda dash my hopes for efficient scattered atomic adds (maybe I can try to do some parallel benchmarking to get us numbers, though). :-)
But I'm very excited about the ideas in your manifest!
I wonder what the best strategy is for implementing a histogram-like function with a very large output array using threads. Maybe (1) create a batch of local index lists (for the entries to be incremented) in each thread, grouped by sub-regions of the output array, and then (2) send them to "writer" tasks that write them to their sub-region of the output array? This way, I guess we can distribute the contention a bit by dividing it by the number of sub-regions (the contention in this strategy would come from the queues used to communicate with each writer task).
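A rough sketch of that idea, assuming a 1-D output array, a flat list of indices to increment, one `Channel` plus writer task per sub-region, and made-up names throughout:

```julia
using Base.Threads

function hist_subregion_writers!(hist::Vector{Int64}, idxs::Vector{Int}; nregions = nthreads())
    len = cld(length(hist), nregions)            # size of each sub-region
    region(i) = cld(i, len)                      # which sub-region owns index i
    chans = [Channel{Vector{Int}}(32) for _ in 1:nregions]
    writers = map(1:nregions) do r
        Threads.@spawn for batch in chans[r]     # one writer task per sub-region
            for i in batch
                hist[i] += 1                     # no race: only this task touches region r
            end
        end
    end
    @threads for chunk in collect(Iterators.partition(idxs, 10_000))
        batches = [Int[] for _ in 1:nregions]    # group this chunk's indices by region
        for i in chunk
            push!(batches[region(i)], i)
        end
        for r in 1:nregions
            isempty(batches[r]) || put!(chans[r], batches[r])
        end
    end
    foreach(close, chans)                        # let the writer loops finish
    foreach(wait, writers)
    return hist
end
```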
@vtjnash, you were right, of course: using a global lock is indeed fastest. Here's a benchmark I ran (using Julia v1.5.0-beta1.0): https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790

I'm filling a histogram with 201 x 201 x 201 bins with 10^6 `(randn(rng), randn(rng), randn(rng))` variates, using a separate `Random123.Philox4x` RNG for each thread.
I defined

```julia
Base.@propagate_inbounds function atomic_addindex!(A::Array{Int64}, v::U, I...) where {U}
    T = Int64
    i = LinearIndices(size(A))[I...]   # linear index of the target element
    v_conv = convert(T, v)
    ptr = pointer(A, i)                # raw pointer to A[i]
    # atomicrmw add with acquire-release ordering on that element
    Base.Threads.llvmcall("%ptr = inttoptr i64 %0 to i64*\n%rv = atomicrmw add i64* %ptr, i64 %1 acq_rel\nret i64 %rv\n", T, Tuple{Ptr{T},T}, ptr, v_conv)::T
    A
end
```
I hope that's correct.
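A quick single-threaded sanity check (illustrative only; the array size just mirrors the benchmark below):

```julia
A = zeros(Int64, 201, 201, 201)
atomic_addindex!(A, 1, 17, 3, 99)   # atomically increment A[17, 3, 99]
@assert A[17, 3, 99] == 1
```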
With `JULIA_NUM_THREADS=64` (Epyc-2 CPU, 64 physical cores, single socket), I get:

- `atomic_addindex!`: 519.649 ms
- `SpinLock()`: 4.784 ms

So with the global lock, I get a 43x speedup on 64 threads - I wouldn't have expected it to scale that well.
The random number generation itself is about 20 ms using one thread and 630 μs using all 64 threads, so it's only a relatively small fraction of the total time.
When I run with `JULIA_NUM_THREADS=1`:

- `atomic_addindex!`: 224.736 ms
- `SpinLock()`: 195.301 ms

So if there's a bit of work to do (computing the random numbers and the hist-bin index), the overhead of `atomic_addindex!` has little impact single-threaded. But with many threads, even on a large array, atomic access seems to cause way too much cache invalidation/contention.
Yay for global locks! :-)