Julia: atomic operations on array elements and object fields

Created on 30 Jun 2019 · 13 comments · Source: JuliaLang/julia

The atomic methods for Atomic call pointer_from_objref and then perform the operation with llvmcall. If we exposed that pointer API, we could do atomic operations on arrays more easily. Is there any reason to dispatch the atomic operations only on the Atomic type?
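
For context, roughly the difference in question (a minimal sketch; the array call is hypothetical and does not exist today):

using Base.Threads

# What exists today: atomic operations dispatch on the Atomic wrapper type.
acc = Atomic{Int}(0)
atomic_add!(acc, 1)      # works: acc isa Atomic{Int}

# What this issue asks about: atomic operations on an element of a plain array.
A = zeros(Int, 16)
# atomic_add!(A, 1, 3)   # hypothetical; needs a pointer-based element API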

multithreading

All 13 comments

Ah, I see, this differs from #29943. That is asking for atomic operations on pointers, while this is about doing atomic operations via pointers to other structures like arrays. I will try to clarify the title.

This recently came up here: https://discourse.julialang.org/t/parallelizing-a-histogram/41260/9

Would be great to have atomic_add!(A::Array, x, idxs), etc.

Quoting my comment in the discourse thread:

OK, so I uploaded the code I used for trying out atomic increments on array elements: https://github.com/tkf/ParallelIncrements.jl

(Comment added: I just realized that the name of the package is completely misleading since it was about single thread performance. Initially, I was thinking about trying it with multiple threads.)

This package contains a function ref = atomicref(::Array, indices...) which can be used for atomic operations like Threads.atomic_add!(ref, x).

As you can see in the README, the overhead is 10x on my laptop. It can go up to 100x and possibly more, depending on the kind of optimization the compiler can do. It might be possible to get a lower overhead using other memory orderings, but I haven't tried it yet.

I'm not a big fan of this direction, but I'm interested in whatever is beneficial for HPC in Julia. So I'm just sharing it in case some people want to try it out (and maybe find some performance bugs I made in the sample code or something).
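
For concreteness, usage of the atomicref API described in the quoted comment would look roughly like this (a hypothetical sketch based on the README's description, assuming atomicref is importable as shown):

using ParallelIncrements: atomicref
using Base.Threads: atomic_add!

A = zeros(Int, 100)
ref = atomicref(A, 42)   # an atomic "view" of A[42]
atomic_add!(ref, 1)      # atomically increments A[42]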

A slowdown of 10x may actually be quite acceptable in some cases, compared to the alternative.

Obviously, when you deal with things like small histograms (histograms are just a nice example use case for this), it's most efficient for every thread to histogram separately in first-level cache and then merge afterwards. But if the histogram is large (multi-dimensional, many bins), atomics with a 10x overhead (compared to non-atomic writes) may still be faster than maintaining one histogram per thread, especially if the number of threads is large. Just a hunch, of course. I guess it'll come down to memory bandwidth, latencies, and caching behavior in the individual use case ... hard to predict.
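
A minimal sketch of the per-thread-histogram-then-merge pattern mentioned above (binindex is a hypothetical helper mapping a sample to its bin; the threadid-based indexing was the common idiom at the time):

using Base.Threads

function threaded_histogram(samples::AbstractVector, nbins::Int, binindex)
    # One private histogram per thread: no atomics, no shared cache lines during filling.
    locals = [zeros(Int, nbins) for _ in 1:nthreads()]
    @threads for i in eachindex(samples)
        h = locals[threadid()]
        @inbounds h[binindex(samples[i])] += 1
    end
    return reduce(+, locals)   # merge the per-thread histograms at the end
end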

@tfk, do you think the atomicref comes with some additional overhead, compared to having an atomic_add that directly operates on array and index, or will the compiler optimize that away?

I agree there might be some situations where this can be beneficial. I uploaded the code because I thought it'd be great if people could explore its use.

But let me also note that my benchmark is about single-thread performance. If you are using multiple threads, atomics can hurt even more because they have to invalidate the cache of all CPUs. So I'd guess 10x is more like a lower bound on the overhead.

atomicref comes with some additional overhead, compared to having an atomic_add that directly operates on array and index

I didn't see anything suspicious about it when I quickly looked at the LLVM IR.

(BTW, you might want to use @tkf to ping me, to avoid pinging another user.)

If you are using multiple threads, atomics can hurt even more because they have to invalidate the cache of all CPUs.

Yes, I've been thinking about that - however, if the access pattern is scattered access to a fairly large amount of memory, the threads would, for the most part, not share cache lines, right?

Ah, yes, you might be right.

For a histogram-type operation, I'd expect a (histogram-shaped) distribution of cache conflicts, resulting in a large slowdown for each additional thread you add, starting at 10x slower just for going from single-threaded to one thread but thread-safe with atomics. Unlike @tkf, I have no numbers though, and I'm just somewhat parroting his comment.

This is actually relevant, because I'm currently thinking about not (yet) having atomics for individual array elements (just too likely to be slow and/or incorrect), so that you should perhaps instead just use a lock over the whole array. https://docs.google.com/document/d/e/2PACX-1vT7Ibthj9WyM8s5bcQbiKsVK6MtvzqmnPFMy-bcjZLlbqv55a0_sTJ99AkbvPIZk3t7MbhZ57NzaIzC/pub

(note: the idealized cost of a lock is on the order of a couple of atomic operations, since that's about all it takes to implement one: it's important to realize that the cost of a lock and the cost of an atomic are therefore roughly the same.)
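
To illustrate that point, here is a toy test-and-set spin lock built from the existing Threads.Atomic primitives (an illustrative sketch only, not how Base.Threads.SpinLock is actually implemented):

using Base.Threads: Atomic, atomic_cas!, atomic_xchg!

struct ToySpinLock
    state::Atomic{Int}
    ToySpinLock() = new(Atomic{Int}(0))
end

function toy_lock!(l::ToySpinLock)
    # One atomic compare-and-swap per acquisition (retried under contention).
    while atomic_cas!(l.state, 0, 1) != 0
    end
end

# One atomic exchange to release.
toy_unlock!(l::ToySpinLock) = (atomic_xchg!(l.state, 0); nothing)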

Well, you kinda dash my hopes for efficient scattered atomic adds (maybe I can try to do some parallel benchmarking to get us numbers, though). :-)

But I'm very excited about the ideas in your manifest!

I wonder what the best strategy is for implementing a histogram-like function with a very large output array using threads. Maybe (1) create batches of local index lists (for the entries to be incremented) in each thread, grouped by sub-regions of the output array, and then (2) send them to "writer" tasks that write them to their sub-region of the output array? This way, I guess we can distribute the contention a bit by dividing it by the number of sub-regions (the contention in this strategy would come from the queues used to communicate with each writer task).
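
A rough sketch of that idea, with one writer task owning each sub-region of the output (all names, the batch size, and the binindex helper are hypothetical and untuned):

using Base.Threads

function region_histogram(samples, nbins::Int, binindex; nregions::Int = nthreads())
    out = zeros(Int, nbins)
    regionsize = cld(nbins, nregions)
    chans = [Channel{Vector{Int}}(Inf) for _ in 1:nregions]

    # Writer tasks: each owns one sub-region of `out`, so its writes need no lock.
    writers = map(1:nregions) do r
        Threads.@spawn for batch in chans[r]
            for i in batch
                out[i] += 1
            end
        end
    end

    # Producers: bin the samples, group the indices by region, send batches to the writers.
    @threads for chunk in collect(Iterators.partition(samples, 10_000))
        batches = [Int[] for _ in 1:nregions]
        for x in chunk
            i = binindex(x)
            push!(batches[cld(i, regionsize)], i)
        end
        for r in 1:nregions
            isempty(batches[r]) || put!(chans[r], batches[r])
        end
    end

    foreach(close, chans)   # writers drain their queues and finish
    foreach(wait, writers)
    return out
end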

@vtjnash, you were right, of course: Using a global lock is indeed fastest. Here's a benchmark I ran
(using Julia v1.5.0-beta1.0):

https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790

I'm filling a histogram with 201 x 201 x 201 bins with 10^6 (randn(rng), randn(rng), randn(rng)) variates, using a separate Random123.Philox4x RNG for each thread.

I defined

Base.@propagate_inbounds function atomic_addindex!(A::Array{Int64}, v::U, I...) where {U}
    T = Int64
    i = LinearIndices(size(A))[I...]  # linear index of the target element
    v_conv = convert(T, v)
    ptr = pointer(A, i)               # raw pointer to A[i]
    # atomicrmw add on that element, with acquire-release ordering
    Base.Threads.llvmcall("%ptr = inttoptr i64 %0 to i64*\n%rv = atomicrmw add i64* %ptr, i64 %1 acq_rel\nret i64 %rv\n", T, Tuple{Ptr{T},T}, ptr, v_conv)::T
    A
end

I hope that's correct.
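
For reference, incrementing one bin of the 3-D histogram described below would then look like this (bin indices made up):

H = zeros(Int64, 201, 201, 201)
atomic_addindex!(H, 1, 17, 101, 53)   # atomically performs H[17, 101, 53] += 1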

With JULIA_NUM_THREADS=64 (Epyc-2 CPU, 64 physical cores, single socket), I get:

  • No multi-threading: 205.771 ms
  • Using atomic_addindex!: 519.649 ms
  • Using a global SpinLock(): 4.784 ms

So with the global lock, I get a 43x speedup on 64 threads - I wouldn't have expected it to scale that well.
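
The lock-protected variant boils down to something like the following (a hedged reconstruction, not the actual gist code; bin3d and the plain randn calls are stand-ins for the gist's binning and per-thread Random123.Philox4x setup):

function fill_hist_locked!(H::Array{Int,3}, nsamples::Int)
    lk = Threads.SpinLock()
    Threads.@threads for n in 1:nsamples
        x, y, z = randn(), randn(), randn()   # stand-in for the per-thread Philox4x RNG
        i, j, k = bin3d(x, y, z, size(H))     # hypothetical helper mapping a sample to bin indices
        lock(lk)
        try
            @inbounds H[i, j, k] += 1
        finally
            unlock(lk)
        end
    end
    return H
end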

The random number generation itself is about 20 ms using one thread and 630 μs using all 64 threads, so it's only a relatively small fraction of the total time.

When I run with JULIA_NUM_THREADS=1:

  • No multi-threading: 203.243 ms
  • Using atomic_addindex!: 224.736 ms
  • Using a global SpinLock(): 195.301 ms

So if there's a bit of work to do (computing the random numbers and hist-bin index), the overhead of atomic_addindex! has little impact, single threaded. But with many threads, even on a large array, atomic access seems to cause way too much cache invalidation/contention.

Yay for global locks! :-)
