This SO answer shows a 20% speed-up on binary_search with prefetching enabled at the cost of:
We should consider adding this optimization if it turns out to be worth it.
It seems that there's no support for the `llvm.prefetch` intrinsic in `librustc_trans/intrinsic.rs`. Is this intentional?
One can probably work around that using the `link_llvm_intrinsics` feature.
Do you know how I can enable the `link_llvm_intrinsics` feature when building the compiler, or do I have to enable it in every crate explicitly?
I declared an external function like this:

```rust
extern {
    #[link_name = "llvm.prefetch"]
    fn prefetch(address: *const i8, rw: i32, locality: i32, cache_type: i32);
}
```
and used it in the binary_search. But when building other crates the compiler bails out with:

```
Cannot invoke an intrinsic other than donothing, patchpoint or statepoint
  invoke void @llvm.prefetch(i8* %37, i32 0, i32 1, i32 1)
          to label %bb13 unwind label %cleanup
```
Do you know how I can enable the `link_llvm_intrinsics` feature when building the compiler?
I've never enabled/disabled features in the compiler crates (in my own crates I just add it to the root of each crate that uses the feature). Maybe @alexcrichton knows, or knows somebody who does.
The error there looks unrelated to the `link_llvm_intrinsics` feature being enabled or disabled; that specifically looks like a codegen error. I'm not particularly sure why that'd show up, though. It may mean that support needs to be added in `intrinsic.rs` to translate the intrinsic.
Ok, I'll try to add the `prefetch` intrinsic. Was it intentionally left out?
Nah, it probably just hasn't been bound yet; very little is intentionally left out!
Hmm, I'm stuck adding the `prefetch` intrinsic. I added it in:

- `libcore/intrinsics.rs`
- `librustc_trans/context.rs`
- `librustc_trans/intrinsic.rs`
- `librustc_typeck/check/intrinsic.rs`

But I still get the error:
```
error[E0093]: unrecognized intrinsic function: `prefetch`
   --> src/libcore/intrinsics.rs:569:5
    |
569 |     pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unrecognized intrinsic
```
Is there some bootstrapping necessary?
The bootstrap compiler won't recognize the intrinsic, so you need to guard the definition with `#[cfg(not(stage1))]`.
A similar problem will show up when you try to use the intrinsic. I guess the easiest way to solve that without code duplication is to define a `#[cfg(stage1)]` stub that does nothing.
@rkruppe You mean stage0, right?
Yes, I confused "which stage is the compiler" with "which stage are we compiling the source code for".
Ok, I added the prefetch intrinsic and benchmarked a small stand-alone binsearch.
https://gist.github.com/hirschenberger/dcf3fc6f2b7ddad08c889adfa8f5f1cf
With this setup, I got the following results:

```
test bench_binsearch ... bench: 105 ns/iter (+/- 0)
test bench_binsearch ... bench:  95 ns/iter (+/- 0)
```

That's approx. 10%; I don't know if it's worth it or whether this microbenchmark is significant. I think the compiler is already doing good work. Here's the output of perf for both versions:
```
    320.577 L1-dcache-load-misses  # 0,46% of all L1-dcache hits  [66,26%]
 69.887.806 L1-dcache-loads                                       [67,23%]
     35.613 L1-dcache-prefetches                                  [66,60%]
```

```
    190.783 L1-dcache-load-misses  # 0,14% of all L1-dcache hits  [67,13%]
133.433.668 L1-dcache-loads                                       [67,00%]
     51.848 L1-dcache-prefetches                                  [65,94%]
```
Update:
I did some more tests and got the runtime down to:

```
test bench_binsearch ... bench: 81 ns/iter (+/- 0)
```
The change prefetches the destination slice after every iteration.
@hirschenberger that's a 20% perf improvement!
Some numbers using `slice::binary_search`:

```
test bench_binsearch ... bench: 149 ns/iter (+/- 1)
test bench_binsearch ... bench:  97 ns/iter (+/- 8)
```

@hirschenberger Is the benchmark using prefetching correctly?
It looks like it always prefetches `s` (which would be the first element of the range being considered) instead of something like `s + (s.len() >> 1)`.
Also, I find it somewhat surprising that there is only one prefetch invocation: for binary search you usually prefetch both possible choices (see the SO answer from the OP).
The `s` slice is prefetched for the `split_at` call in the next iteration. Prefetching `s.as_ptr().offset(s.len() >> 1)` gives slightly worse performance.
I also tested prefetching the head and tail splits as in the SO answer, but got no performance benefit. I think that by using `split_at`, the two slices are already in the cache.
I tested several prefetching patterns and this one got the best results.
> The `s` slice is prefetched for the `split_at` call in the next iteration.
`split_at` does not access the contents of the `s` slice, so it should not be affected by the prefetching. The only place that should perform data accesses within the slice is `tail[0]`.
Could you post the LLVM IR (and/or assembly output) in a gist?
> Prefetching `s.as_ptr().offset(s.len() >> 1)` gives slightly worse performance.
Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?
Also, notice that since the benchmark is performing a binary search on the same value at each iteration, the sequence of data accesses will never change, therefore they are likely to be in cache after the first iteration: the vector is 1e7 usize elements, but normal binary search will only touch ~24 of them, which should easily fit even in a small L1 cache (prefetching would double this number).
Here is a Gist with the LLVM-IR and ASM of the prefetching binsearch.
> Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?
It performs worse: 120us vs. 82us.
> Also, notice that since the benchmark is performing a binary search on the same value at each iteration, the sequence of data accesses will never change, therefore they are likely to be in cache after the first iteration: the vector is 1e7 usize elements, but normal binary search will only touch ~24 of them, which should easily fit even in a small L1 cache (prefetching would double this number).
I don't understand what you mean. Isn't the slice split in half at every iteration, sometimes using the head, sometimes the tail for further iteration, which makes it unpredictable for the HW-prefetcher and is the reason why manual prefetching is beneficial?
The main loop in the gist is:
```asm
.LBB0_3:
        subq    %rdi, %rsi
        je      .LBB0_4              ; tail.is_empty()
        leaq    (%rax,%rdi,8), %rbx  ; the address being accessed
        ; rax points to s[0], rdi is head.len() -> rbx points to tail[0]
        cmpq    $2222222, (%rbx)     ; *** the memory access ***
        movl    $1, %ecx
        cmovaq  %r10, %rcx
        movl    $0, %edx
        cmovneq %rcx, %rdx
        cmpq    $-1, %rdx
        je      .LBB0_8              ; Less, continue on tail
        testq   %rdx, %rdx
        je      .LBB0_11             ; Equal, exit loop
        movq    %rdi, %rsi           ; s = head (implemented as s.len = head.len)
        jmp     .LBB0_9              ; Greater, continue on head
        .p2align 4, 0x90
.LBB0_8:
        leaq    1(%rdi,%r11), %r11   # tail[1..]
        addq    $8, %rbx             ; add (the size of) one element to the start pointer
        decq    %rsi                 ; subtract 1 from the length
        movq    %rbx, %rax           ; and move it to s instead of tail
.LBB0_9:
        ; rax is the start pointer of s, rsi its length
        prefetcht0 (%rax)            ; *** the prefetch ***
        movq    %rsi, %rdi
        shrq    %rdi
        cmpq    %rdi, %rsi
        jae     .LBB0_3              ; LLVM was unable to remove the bounds checks
        jmp     .LBB0_10             ; panic on illegal access
        .p2align 4, 0x90
```
(the # comments are mine).
As expected, the only memory accesses it performs are those from `cmpq $2222222, (%rbx)`, but they are not at the address `%rax`. Instead, they are at `(%rax,%rdi,8)`, i.e. at `%rax + %rdi * 8`.
Yes, in the general case the splitting is unpredictable and results in an unpredictable access pattern, in which 50% of the time the element at the middle of `head` is accessed, while the other 50% of the time the element at the middle of `tail` is accessed. It is not clear why prefetching the element at the beginning of the appropriate sub-slice would be beneficial (although one could say that for the last iterations it does not really matter, as they will be on the same cache line).
I will try benchmarking on randomized data as soon as the prefetch intrinsic hits nightly. I hope that will point out in which cases the current prefetch strategy is superior to the one in the stackoverflow post.
See also my comment here
This paper about the branch predictor and cache effects on binary searches is worth reading. https://arxiv.org/abs/1509.05053
Triage: not sure if this optimization ever landed or not, though it seems the pre-requisites did.
I think this can be closed after #45333 landed.
Closing as fixed.