This SO answer shows a 20% speed-up on binary_search with prefetching enabled at the cost of:
We should consider adding this optimization if it turns out to be worth it.
It seems that there's no support for the `llvm.prefetch` intrinsic in `librustc_trans/intrinsic.rs`. Is this intentional?
One can probably work around that using the `link_llvm_intrinsics` feature.
Do you know how I can enable the `link_llvm_intrinsics` feature when building the compiler, or do I have to enable it in every crate explicitly?
I declared an external function like this:

```rust
extern {
    #[link_name = "llvm.prefetch"]
    fn prefetch(address: *const i8, rw: i32, locality: i32, cache_type: i32);
}
```
and used it in the binary_search. But when building other crates the compiler bails out with:

```
Cannot invoke an intrinsic other than donothing, patchpoint or statepoint
  invoke void @llvm.prefetch(i8* %37, i32 0, i32 1, i32 1)
          to label %bb13 unwind label %cleanup
```
Do you know how I can enable the `link_llvm_intrinsics` feature when building the compiler?
I've never enabled/disabled features in the compiler crates (in my own crates I just add it to the root of each crate that uses the feature). Maybe @alexcrichton knows, or knows somebody who does.
The error there looks unrelated to the `link_llvm_intrinsics` feature being enabled or disabled; that specifically looks like a codegen error. I'm not particularly sure why that'd show up, though. It may mean that support needs to be added in `intrinsic.rs` to translate the intrinsic.
Ok, I'll try to add the `prefetch` intrinsic. Was it intentionally left out?
Nah, it probably just hasn't been bound yet; very little is intentionally left out!
Hmm, I'm stuck adding the `prefetch` intrinsic. I added it in:

- `libcore/intrinsics.rs`
- `librustc_trans/context.rs`
- `librustc_trans/intrinsic.rs`
- `librustc_typeck/check/intrinsic.rs`

But I still get the error:
```
error[E0093]: unrecognized intrinsic function: `prefetch`
   --> src/libcore/intrinsics.rs:569:5
    |
569 |     pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unrecognized intrinsic
```
Is there some bootstrapping necessary?
The bootstrap compiler won't recognize the intrinsic, so you need to guard the definition with `#[cfg(not(stage1))]`.
A similar problem will show up when you try to use the intrinsic. I guess the easiest way to solve that without code duplication is to define a `#[cfg(stage1)]` stub that does nothing.
@rkruppe You mean stage0, right?
Yes, I confused "which stage is the compiler" with "which stage are we compiling the source code for".
Ok, I added the prefetch intrinsic and benchmarked a small stand-alone binsearch.
https://gist.github.com/hirschenberger/dcf3fc6f2b7ddad08c889adfa8f5f1cf
With this setup, I got the following results:

```
test bench_binsearch ... bench: 105 ns/iter (+/- 0)
test bench_binsearch ... bench:  95 ns/iter (+/- 0)
```

That's approx. 10%; I don't know if it's worth it or whether this microbenchmark is significant. I think the compiler is already doing good work. Here's the output of perf for both versions:
```
    320.577 L1-dcache-load-misses  # 0,46% of all L1-dcache hits  [66,26%]
 69.887.806 L1-dcache-loads                                       [67,23%]
     35.613 L1-dcache-prefetches                                  [66,60%]
```

```
    190.783 L1-dcache-load-misses  # 0,14% of all L1-dcache hits  [67,13%]
133.433.668 L1-dcache-loads                                       [67,00%]
     51.848 L1-dcache-prefetches                                  [65,94%]
```
Update:
I did some more tests and got the runtime down to:

```
test bench_binsearch ... bench: 81 ns/iter (+/- 0)
```
The change prefetches the destination slice after every iteration.
@hirschenberger that's a 20% perf improvement!
Some numbers using `slice::binary_search`:

```
test bench_binsearch ... bench: 149 ns/iter (+/- 1)
test bench_binsearch ... bench:  97 ns/iter (+/- 8)
```

@hirschenberger Is the benchmark using prefetching correctly?
It looks like it always prefetches `s` (which would be the first element of the range being considered) instead of something like `s + (s.len() >> 1)`.
Also, I find it somewhat surprising that there is only one prefetch invocation: for binary search you usually prefetch both possible choices (see the SO answer from the OP).
The `s` slice is prefetched for the `split_at` call in the next iteration. Prefetching `s.as_ptr().offset(s.len() >> 1)` gives slightly worse performance.
I also tested prefetching the head and tail splits as in the SO answer, but got no performance benefit. I think that by using `split_at`, the two slices are already in the cache.
I tested several prefetching patterns and this one got the best results.
> The `s` slice is prefetched for the `split_at` call in the next iteration.
`split_at` does not access the contents of the `s` slice, so it should not be affected by the prefetching. The only place that should perform data accesses within the slice is `tail[0]`.
Could you post the LLVM IR (and/or assembly output) in a gist?
> Prefetching `s.as_ptr().offset(s.len() >> 1)` gives slightly worse performance.
Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?
Also, notice that since the benchmark is performing a binary search on the same value at each iteration, the sequence of data accesses will never change, therefore they are likely to be in cache after the first iteration: the vector is 1e7 usize elements, but normal binary search will only touch ~24 of them, which should easily fit even in a small L1 cache (prefetching would double this number).
Here is a Gist with the LLVM-IR and ASM of the prefetching binsearch.
> Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?
It performs worse: 120us vs. 82us.
> Also, notice that since the benchmark is performing a binary search on the same value at each iteration, the sequence of data accesses will never change, therefore they are likely to be in cache after the first iteration: the vector is 1e7 usize elements, but normal binary search will only touch ~24 of them, which should easily fit even in a small L1 cache (prefetching would double this number).
I don't understand what you mean. Isn't the slice split in half at every iteration, sometimes using the head, sometimes the tail for further iteration, which makes it unpredictable for the HW-prefetcher and is the reason why manual prefetching is beneficial?
The main loop in the gist is:
```asm
.LBB0_3:
        subq    %rdi, %rsi
        je      .LBB0_4              ; tail.is_empty()
        leaq    (%rax,%rdi,8), %rbx  ; the address being accessed
        ; rax points to s[0], rdi is head.len() -> rbx points to tail[0]
        cmpq    $2222222, (%rbx)     ; *** the memory access ***
        movl    $1, %ecx
        cmovaq  %r10, %rcx
        movl    $0, %edx
        cmovneq %rcx, %rdx
        cmpq    $-1, %rdx
        je      .LBB0_8              ; Less, continue on tail
        testq   %rdx, %rdx
        je      .LBB0_11             ; Equal, exit loop
        movq    %rdi, %rsi           ; s = head (implemented as s.len = head.len)
        jmp     .LBB0_9              ; Greater, continue on head
        .p2align 4, 0x90
.LBB0_8:
        leaq    1(%rdi,%r11), %r11   # tail[1..]
        addq    $8, %rbx             ; add (the size of) one element to the start pointer
        decq    %rsi                 ; subtract 1 from the length
        movq    %rbx, %rax           ; and move it to s instead of tail
.LBB0_9:
        ; rax is the start pointer of s, rsi its length
        prefetcht0 (%rax)            ; *** the prefetch ***
        movq    %rsi, %rdi
        shrq    %rdi
        cmpq    %rdi, %rsi
        jae     .LBB0_3              ; LLVM was unable to remove the bounds checks
        jmp     .LBB0_10             ; panic on illegal access
        .p2align 4, 0x90
```
(the # comments are mine).
As expected, the only memory accesses it performs are those from `cmpq $2222222, (%rbx)`, but they are not at the address `%rax`. Instead, they are at `(%rax,%rdi,8)`, i.e. at `%rax + %rdi * 8`.
Yes, in the general case the splitting is unpredictable and results in an unpredictable access pattern, in which 50% of the time the element at the middle of `head` is accessed, while the other 50% of the time the element at the middle of `tail` is accessed. It is not clear why prefetching the element at the beginning of the appropriate sub-slice would be beneficial (although one could say that for the last iterations it does not really matter, as they will be on the same cache line).
I will try benchmarking on randomized data as soon as the prefetch intrinsic hits nightly. I hope that will point out in which cases the current prefetch strategy is superior to the one in the stackoverflow post.
See also my comment here
This paper about the branch predictor and cache effects on binary searches is worth reading. https://arxiv.org/abs/1509.05053
Triage: not sure if this optimization ever landed or not, though it seems the pre-requisites did.
I think this can be closed after #45333 landed.
Closing as fixed.