Currently Vec's swap_remove has the following implementation:
pub fn swap_remove(&mut self, index: usize) -> T {
    let length = self.len();
    self.swap(index, length - 1);
    self.pop().unwrap()
}
This is needlessly slow. The swap does a bounds check, and then pop does
another bounds check that never fails. Furthermore, there is an actual swap
right before the pop - this results in a lot of useless moves that don't (seem to)
get optimized out.
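To make that concrete, here is roughly what the swap + pop amount to when written out by hand (a sketch of the operations involved, not the actual library internals):

use std::ptr;

// Hypothetical by-hand expansion of swap + pop, to show the redundant work.
pub fn swap_remove_expanded<T>(v: &mut Vec<T>, index: usize) -> T {
    let len = v.len();
    // swap bounds-checks its indices (this also rules out an empty vector)...
    assert!(index < len);
    let last = len - 1;
    unsafe {
        let base = v.as_mut_ptr();
        // ...and then moves three values through a temporary:
        let tmp = ptr::read(base.add(index));
        ptr::copy(base.add(last), base.add(index), 1);
        // This write-back is the wasted work: the slot is popped right after.
        ptr::write(base.add(last), tmp);
    }
    // pop re-checks the length even though it cannot be zero here.
    v.pop().unwrap()
}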
With a little unsafe code it's possible to write an implementation that does only
one bounds check and only two moves, while remaining safe to call:
pub fn swap_remove(&mut self, index: usize) -> T {
    unsafe {
        // Singular bounds check on index ensures safety.
        let hole: *mut T = &mut self[index];
        let value = ptr::read(hole);
        // Since the bounds check on index succeeded we know back >= 0.
        let back = self.len() - 1;
        ptr::copy(self.get_unchecked(back), hole, 1);
        self.set_len(back);
        value
    }
}
I found it quite hard to benchmark this, but on my machine the following
benchmark showed the new implementation to be faster:
#![feature(test)]
extern crate test;

pub fn swap_remove_opt<T>(v: &mut Vec<T>, idx: usize) -> T {
    unsafe {
        // Singular bounds check on idx ensures safety.
        let hole: *mut T = &mut v[idx];
        let value = std::ptr::read(hole);
        let back = v.len() - 1;
        std::ptr::copy(v.get_unchecked(back), hole, 1);
        v.set_len(back);
        value
    }
}
// Tiny LCG RNG (wrapping arithmetic so debug builds don't panic on overflow).
fn rng_next(s: u64) -> u64 {
    s.wrapping_mul(6364136223846793005).wrapping_add(1)
}
#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_1000_overhead(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(idx as usize);
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(v.swap_remove(idx as usize));
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove_opt(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(swap_remove_opt(&mut v, idx as usize));
            }
            v
        })
    }
}
test tests::bench_1000_overhead ... bench: 1,049 ns/iter (+/- 16)
test tests::bench_vec_1000_swap_remove ... bench: 1,306 ns/iter (+/- 122)
test tests::bench_vec_1000_swap_remove_opt ... bench: 1,193 ns/iter (+/- 104)
As I said, it's quite hard to benchmark - there is a lot of overhead and only a small difference to be observed. Regardless, running it multiple times consistently shows swap_remove_opt to be faster.
Although I can't run the above benchmark on stable due to test being unstable,
I conjecture that on stable Rust the difference is much bigger. A quick look at
the disassembly should show why: https://godbolt.org/g/fU4Edu
Nightly fares a lot better in the disassembly, but as seen above the current implementation still loses out in the benchmark.
Thanks a lot for looking into this!
Comparing the assembly for the two implementations in nightly on godbolt reveals that the only substantial difference is that Vec::swap_remove writes the value back (as expected -- we don't have a way to tell LLVM that nobody will access the elements beyond len and that they will be overwritten before len is increased again, and frankly that isn't even true); the bounds checks are already combined into one. All remaining differences are scheduling/isel/regalloc differences, and those are generally not correlated with source-level changes in any obvious way.

So I'm wondering whether we can get all of the benefit with less, and less tricky, unsafe code. For example, I dropped the get_unchecked_mut and still got identical assembly: https://godbolt.org/g/wdG8VN
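Roughly what I have in mind, as a sketch (swap_remove_less_unsafe is just a placeholder name, and this is not literally the code behind the godbolt link):

pub fn swap_remove_less_unsafe<T>(v: &mut Vec<T>, idx: usize) -> T {
    unsafe {
        // Single bounds check on idx; it also proves the vector is non-empty.
        let hole: *mut T = &mut v[idx];
        let value = std::ptr::read(hole);
        let back = v.len() - 1;
        // Plain indexing instead of get_unchecked; the check on `back` is
        // redundant after the one above and, judging by the identical
        // assembly, gets optimized away.
        std::ptr::copy(&v[back], hole, 1);
        v.set_len(back);
        value
    }
}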
@rkruppe Do you know what change causes nightly to be significantly better at reasoning about the bounds check than stable, and whether that functionality is coming to stable soon?
Regardless, the benchmarks I posted above were run on nightly, so there still seems to be a performance benefit.
> Do you know what change causes nightly to be significantly better at reasoning about the bounds check than stable
Note that my implementation, which doesn't use get_unchecked, still has the same assembly as yours on 1.26.
> and whether that functionality is coming to stable soon?
We backport serious bug fixes to stable, but not performance fixes. So any performance improvement that lands in nightly reaches beta on the next six-week cycle, and that beta becomes the next stable another six weeks later.
> Regardless, the benchmarks I posted above were run on nightly, so there still seems to be a performance benefit.
Yes, there is a performance benefit to be had by replacing the current implementation in std. I'm only talking about how much unsafe we need to use to get that benefit.
@rkruppe I think I managed to reduce the unsafe code to a minimum with the following implementation:
pub fn swap_remove(&mut self, index: usize) -> T {
    unsafe {
        // Indexing bounds-checks `index` and proves the vector is non-empty,
        // so the pop() below cannot fail.
        let hole: *mut T = &mut self[index];
        std::ptr::replace(hole, self.pop().unwrap())
    }
}
This still compiles to essentially the same assembly on stable.
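For what it's worth, a quick standalone sanity check of the behaviour (my own sketch; swap_remove_replace is just a hypothetical free-function version of the above so it compiles outside of std):

fn swap_remove_replace<T>(v: &mut Vec<T>, index: usize) -> T {
    unsafe {
        let hole: *mut T = &mut v[index];
        std::ptr::replace(hole, v.pop().unwrap())
    }
}

fn main() {
    let mut a: Vec<u32> = (0..8).collect();
    let mut b = a.clone();
    // Middle element: should match Vec::swap_remove exactly.
    assert_eq!(swap_remove_replace(&mut a, 3), b.swap_remove(3));
    assert_eq!(a, b);
    // Last element: the edge case where the hole and the popped slot coincide.
    let last = a.len() - 1;
    assert_eq!(swap_remove_replace(&mut a, last), b.swap_remove(last));
    assert_eq!(a, b);
}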
That's great! Do you want to add comments describing the motivation and why this unsafe code is sound and submit this optimization as a pull request?