Rfcs: Add `lower_bound`, `upper_bound` and `equal_range` for slices where T: Ord to complement `binary_search`

Created on 20 Oct 2017 · 43Comments · Source: rust-lang/rfcs

Motivation

Fom discussion on https://github.com/rust-lang/rust/pull/45333.

There is demand for lower_bound, upper_bound and equal_range especially for sorted sequences of non-unique elements. In such sequences the aforementioned operations are more interesting and useful than binary_search. The addition of these three ops will complement the support of sorted slices in the standard library.

Proposed APIs

`lower_bound`

/// Finds the first element in the sorted slice that is _not less_ than `x`.
///
/// Returns the index `a` such that all elements in `self[..a]` are less than `x` and all elements in `self[a..]` are greater or equal to `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
/// assert_eq!(a.lower_bound(9), 0);
/// assert_eq!(a.lower_bound(10), 0);
/// assert_eq!(a.lower_bound(11), 1);
/// assert_eq!(a.lower_bound(12), 2);
/// assert_eq!(a.lower_bound(13), 2);
/// assert_eq!(a.lower_bound(14), 4);
/// assert_eq!(a.lower_bound(15), 4);
/// assert_eq!(a.lower_bound(16), 5);
/// ```
fn lower_bound(&self, x: &T) -> usize 
where
    T: Ord;

/// `lower_bound` with a comparator.
fn lower_bound_by<'a, F>(&'a self, f: F) -> usize
where
    F: FnMut(&'a T) -> Ordering;

/// `lower_bound` with key extraction.
fn lower_bound_by_key<'a, B, F>(&'a self, b: &B, f: F)  -> usize
where
    B: Ord,
    F: FnMut(&'a T) -> B

`upper_bound`

/// Finds the first element in the sorted slice that is _greater_ than `x`.
///
/// Returns the index `a` such that all elements in `self[..a]` are less or equal to `x` and all elements in `self[a..]` are greater than `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
/// assert_eq!(a.upper_bound(9), 0);
/// assert_eq!(a.upper_bound(10), 1);
/// assert_eq!(a.upper_bound(11), 2);
/// assert_eq!(a.upper_bound(12), 2);
/// assert_eq!(a.upper_bound(13), 4);
/// assert_eq!(a.upper_bound(14), 4);
/// assert_eq!(a.upper_bound(15), 5);
/// assert_eq!(a.upper_bound(16), 5);
/// ```
fn upper_bound(&self, x: &T) -> usize
where
    T: Ord;

/// `upper_bound` with a comparator.
fn upper_bound_by<'a, F>(&'a self, f: F) -> usize
where
    F: FnMut(&'a T) -> Ordering;

/// `upper_bound` with key extraction.
fn upper_bound_by_key<'a, B, F>(&'a self, b: &B, f: F)  -> usize
where
    B: Ord,
    F: FnMut(&'a T) -> B

`equal_range`

/// Finds all elements _equal_ to `x`.
///
/// Returns the `Range` `a..b` such that all elements in `self[..a]` are less than `x`, all elements in `self[a..b]` are equal to `x`, and all elements in `self[b..]` are greater than `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
///  for i in (9..17) {
///    assert_eq!(a.equal_range(i), (a.lower_bound(i), a.upper_bound(i)));
///  }
/// ```

fn equal_range(&self, x: &T) -> std::ops::Range<usize> 
where
    T: Ord;

/// `equal_range` with a comparator.
fn equal_range_by<'a, F>(&'a self, f: F) -> std::ops::Range<usize> 
where
    F: FnMut(&'a T) -> Ordering;

/// `equal_range` with key extraction.
fn equal_range_by_key<'a, B, F>(&'a self, b: &B, f: F)  -> std::ops::Range<usize> 
where
    B: Ord,
    F: FnMut(&'a T) -> B

T-libs

Source

alkis

👍36 ❤4 👀1

Most helpful comment

C++ has partition_point. It has clear definition like

let i = partition_point(slice, pred);
for x in &slice[..i] { assert!(  pred(x) ); }
for x in &slice[i..] { assert!( !pred(x) ); }

I can use this when I cannot remember the definition of lower/upper_bound :)
It is basic and primitive edition of lower/upper_bound and I want it in Rust rather than lower/upper_bound. (both is great)

VillSnow on 18 Jun 2020

👍3 ❤2

All 43 comments

Is is more convenient to have an Option return value or to just return the appropriate slice boundary?
There are use cases for both, it is not clear to me which one would be more ergonomic.
Also, how about returning a Range in equal_range*?

ranma42 on 20 Oct 2017

Returning a Range in equal_range sounds great. I updated the original post.

I don't understand the question about Option. What should we be the return types for upper_bound and lower_bound?

alkis on 20 Oct 2017

Since they are looking for "the first element in the sorted slice that is ..." I would return the length of the slice. This would make it easy to use these function to find the position that would be the insertion point of an element so that the order is preserved (and the element is added as the first/last of a streak of elements that compare as equal).
I am unsure if my explanation is comprehensible... I will check it again tomorrow morning XD

ranma42 on 20 Oct 2017

We return the length of the slice or the position of the lower bound? What happens if the element does not exist (past the end of the slice)? Do you think returning 0 for lower bound and self.len() for upper bound makes sense in that case?

alkis on 20 Oct 2017

I think 0 would not be a suitable substitute for None if the slice is not empty, because it would identify an element that does not satisfy the property.

For example, [21, 23, 23].lower_bound(26) should not return 0, because 21 is not "the first element in the sorted slice that is _not less_ than" 26. Instead I think returning 3 would be ok, as it would correctly indicate that we could find no element that is _not less_ then 26.

A similar reasoning also applies to upper_bound. This approach could be summarized in the following table:

x | lower_bound | upper_bound | equal_range
--- | --- | --- | ---
20- | 0 | 0 | any empty range (ex: Range(0, 0))
21 | 0 | 1 | Range(0, 1)
22 | 1 | 1 | any empty range (ex: Range(1, 1))
23 | 1 | 3 (or None) | Range(1, 3)
24+ | 3 (or None) | 3 (or None) | any empty range (ex: Range(3, 3))

Note that while the current definition of equal_range allows for any empty range to be returned, we might constrain it more and require that it always returns the range that corresponds to the would-be position of the x elements, even when there are none.

ranma42 on 21 Oct 2017

Thanks Andrea. I updated the code above with examples. Let me know if you agree.

equal_range(x) is equivalent to (lower_bound(x), upper_bound(x)) so everything just follows if we nail down what lower_bound and upper_bound return.

alkis on 21 Oct 2017

Yep, the examples match my expectations :)

Some nits (I think most of these are basically just typos):

upper_bound is still marked as returning Some<usize>
the descriptions for {lower,upper}_bound still state "If such an element exists it returns Some with the index of such element. Otherwise it returns None.". I would expect something like "If such an element exists it returns the index of such element. Otherwise it returns the length of the slice."
the description for equal_range does not match returning (lower_bound(x), upper_bound(x)) (12 does not exist in the example slice, yet the equal_range would not be (0, 0) nor (5, 5)).

Some proposals for an alternative wording of the description (very slice-oriented):

lower_bound returns the index a such that all elements in self[..a] are less than x and all elements in self[a..] are greater or equal to x.
upper_bound returns the index b such that all elements in self[..b] are less or equal to x and all elements in self[b..] are greater than x.
equal_range returns the Range a..b such that all elements in self[..a] are less than x, all elements in self[a..b] are equal to x, and all elements in self[b..] are greater than x.

This wording is more verbose, but it shows why the "element not found" is not actually a special case for these functions and highlights in a very direct way the relationship between equal_range and {lower,upper}_bound.

ranma42 on 22 Oct 2017

I like the new wording. Incorporated.

alkis on 22 Oct 2017

I think "any empty range" is unfortunate, because there's very useful information that throws away.

The other use of equal_range is, with an obviously-placeholder name,

/// Returns the range of positions at which `x` could be inserted such
/// that the sequence would remain sorted.
fn where_to_insert_in_sorted(&self, x: &T) -> RangeInclusive<usize>;

That return value would have the _same_ start and end as equal_range. (C++ avoids this distinction by returning a pair that can be considered as [a,b) for "the items" and [a,b] for "the insertion positions".)

Phrased differently, I think it's important that upper_bound and lower_bound are the .start and .end of the equal_range return value (just calculated a bit more efficiently).

scottmcm on 25 Oct 2017

Isn't there any meaningful data to be returned from a "not found" lower_bound? Can't it behave similar to binary_search?

Same for upper_bound.

arthurprs on 25 Oct 2017

@scottmcm equal_range does not return "any empty range". What is returned is defined as your first suggestion.

@arthurprs can you give an example?

alkis on 25 Oct 2017

@alkis nevermind, that was a confusion moment.

I think lower_bound should return 0 and upper_bound should return slice.len() if no range can be found. Same as you proposed on https://github.com/rust-lang/rfcs/issues/2184#issuecomment-338328470

arthurprs on 25 Oct 2017

@arthurprs this is what it will return given the description. The examples also show these cases.

alkis on 25 Oct 2017

I see, it's just that fn upper_bound(&self, x: &T) -> Option<usize> still has the Option<> in the result type.

arthurprs on 25 Oct 2017

Fixed.

alkis on 25 Oct 2017

What's the next step? Does this need to go through a PR submission to the RFC repository?

alkis on 30 Oct 2017

Is there a reason these functions don't use Result in their return type like binary_search does. Isn't it sometimes still interesting to know if there is an existing equal element or not?

bluss on 4 Nov 2017

It is a tradeoff. If we do the check then all users need to pay the price
of the extra branch. If we don't then the users that don't need it don't
pay for it. Since rust is going for zero cost/systems programming, I find
that this API is a better fit.

alkis on 4 Nov 2017

👍1

I published the implementation in a crate here: https://crates.io/crates/ordslice. When/if this lands the crate can be deprecated.

alkis on 7 Nov 2017

Could we add the option for the bound to be inclusive or exclusive using Bound? The result would look like this:

fn lower_bound(&self, x: Bound<&T>) -> usize 
where
    T: Ord;

This is similar to the API I used for RBTree in intrusive-collections.

Amanieu on 15 Nov 2017

Do you have an idea of the performance cost of such a change? Also do you have examples of using such API in practice that either allow use-cases that are not possible otherwise and/or more readable code?

alkis on 15 Nov 2017

Can't exclusive bounds be emulated by altering the comparison fn?

arthurprs on 15 Nov 2017

the need for something like lower_bound/upper_bound in a number of situations. What I've ended up doing is to use binary_seach_by and using a closure which only returns Greater or Less. For example in this function in rand.

The main thing that was awkward with that solution, and which still feels awkward in the lower_bound_by/upper_bound_by API is the use of Ordering as the closure return type. In part because the enum names take up about half of the closure function.

But for lower_bound_by/upper_bound_by also because returning Equal effectively treated the same as either Less or Greater. I.e. lower_bound_by/upper_bound_by effectively want a binary answer, but uses a three-value enum to express that.

I think the lower_bound_by/upper_bound_by API would be a lot more pleasant to use if the closure returned a bool instead. That way you could write myslice.lower_bound_by(|item| item.foo <= 20).

The only problem is that this doesn't work well with equal_range_by, which would need to keep using Ordering.

sicking on 12 Jul 2018

👍2

You can find lower_bound, upper_bound, and equal_range in https://docs.rs/superslice/0.1.0/superslice/.

alkis on 12 Jul 2018

I was hoping to move forward the discussion so that maybe these functions can be added to the standard library. But happy to do that elsewhere if that's preferred?

sicking on 12 Jul 2018

For equal_range, I think it'd be more useful if you could provide a range of numbers to be searched for in the slice. Similar to BTreeMap::range. So something like:

let vec = vec![3, 5, 6, 6, 8, 10, 12];
assert_eq!(vec.search_something_range(5..9), (1, 5));
assert_eq!(vec.search_something_range(5..10), (1, 5));
assert_eq!(vec.search_something_range(5..=10), (1, 6));
assert_eq!(vec.search_something_range((Exclusive(6), Inclusive(20)), (4, 7));

It'd also be cool if you could do

let vec = vec![3, 5, 6, 6, 8, 10, 12];
assert_eq!(vec.search_something_range(6), (2, 4));

However that'd mean we couldn't use the RangeBounds trait.

sicking on 20 Jul 2018

FWIW, if there's interest in having this use cases discussed in this RFC directly addressed in stdlib, independent of what API is used to do it, then I suggest speaking up. The PR in rust-lang/rust#52530 got rejected because it wasn't clear that the usecase needed addressing.

sicking on 27 Jul 2018

I would be really happy to have that in stdlib

ghost on 16 Aug 2018

I am hitting this issue in sled, and finding myself writing a few wrappers around standard library functionality that do not feel obviously correct (similar to https://github.com/rust-lang/rust/issues/52529). It's important for me to quickly locate these bounds, both exclusively and inclusively, to enable various tree search and iteration features.

spacejam on 21 Nov 2018

Could we add the option for the bound to be inclusive or exclusive using Bound

Why this would be useful? What does xs.lower_bound(Bound::Unbounded) mean, for example? In general, I find it hard to understand at a glance what is the difference between xs.lower_bound(Bound::Exclusive(&92)) and xs.lower_bound(Bound::Inclusive(&92)). std::collections::BTreeMap only uses Bound with range.

I rather strongly feel that the proposed signature of

fn (lower|upper)_bound(&self, x: &T) -> usize 
where
    T: Ord;

is "canonical". For a given x, there are number of possible insertion points for x in the underlying slice, and lower / upper return smallest/greatest one respectively. Adding Result or Bound into the mix seem to only complicate the API? It seems like we should just add these methods to std already :)

I am not so sure about equal_range:

while I've used lower/upper bound in a number of projects, I personally haven't found a use for equal_range, which could be an indication that this is not a super-useful method after all.
I imagine the Bounds API might be better for this case?

pub fn binary_search_range<K, R>(&self, range: R) -> std::ops::Range<usize> where
    K: Ord + ?Sized,
    R: RangeBounds<K>,
    T: Borrow<K>,

matklad on 22 Feb 2019

Why this would be useful? What does xs.lower_bound(Bound::Unbounded) mean, for example?

It would always return 0 (the first element).

In general, I find it hard to understand at a glance what is the difference between xs.lower_bound(Bound::Exclusive(&92)) and xs.lower_bound(Bound::Inclusive(&92)).

If the slice does not contain 92, then they are equivalent. Otherwise, Inclusive will point to the first instance of 92 in the slice (there can be more than one), and Exclusive will point to the element after the last 92.

Amanieu on 22 Feb 2019

👍1

So, lower_bound(Exclusive(&x)) is the same as upper_bound(Inclusive(&x))? And vice verse?

10   92  92  92  100
     ^           ^
     LI          LE
     UE          UI

If this is the case, looks like Bound does not by us any expressivity? EDIT: for "point" queries. I 100% see usefulness of Bound for "range" queries, where the return type is not usize, but a range/slice.

matklad on 22 Feb 2019

I haven't been following this thread, but having the "lower bound" of 92 ever be _above_ all the 92s seems like we've either lost track of what "lower" means, or these are really bad names for whatever operations are being proposed.

Assuming these are the operations the names imply they are, and we do want Bounds that also mean what their names imply, wouldn't the expected behavior be more like this?

10  92  92  92  100
^   ^       ^   ^
LE  LI      UI  UE

Ixrec on 22 Feb 2019

I haven't been following this thread, but having the "lower bound" of 92 ever be _above_ all the 92s seems like we've either lost track of what "lower" means, or these are really bad names for whatever operations are being proposed.

You need to think of the bounds in terms of a range of values that you are interested in. So lower_bound means "the range of values that I am interested in starts at 92 (inclusive/exclusive depending on the bound), give me the index of the first value in that range".

The definition for upper_bound is similar: "the range of values that I am interested in ends at 173 (inclusive/exclusive), give me the index of the value in that range".

Amanieu on 22 Feb 2019

Interesting. I think that means the names are off. lower_bound(x) seems like it'd mean the lower bound of values equal to x. If what we want is the first x value, then we should just call the method something obvious like first(x).

But "upper bound"/"upper bound (inclusive)"/"lower bound (exclusive)" apparently means "the first non-x value after an x value" which is a super weird concept I don't know of a concise name for. Hm...

Ixrec on 22 Feb 2019

For a couple concrete "point" query examples I'm working with, consider a B+ tree where pointers to children are stored in a sorted slice as (separator_key, child_node) where the separator_key is the lowest inclusive bound of values that may be present in that node. While traversing the tree, we need to find the child with the highest separator_key that is less than or equal to the key being searched for in the tree.

In general, I find all of the below use cases useful for working with versioned trees:

find the highest value that is less than or equal to a key (B+ tree child separator keys while trying to find a leaf node that may contain a key)
find the highest value that is less than a key (get key+value before a given key, useful for GC in MVCC and reverse iterators in unidirectionally-linked trees)
find the lowest value that is greater than or equal to a key (build simple iterators)
find the lowest value that is greater than a key (get next value, useful for MVCC)

All of the above can be expressed with iterators, but that prevents their compatibility with binary_search*.

spacejam on 22 Feb 2019

❤1

I believe that this is expressible with bound-less methods like this:

xs[..xs.upper_bound(key)].last() highest less or equal
xs[..xs.lower_bound(key)].last() highest less
xs.get(xs.lower_bound(key))      lowest greater or equal
xs.get(xs.upper_bound(key))      lowest greater

The first two seem trickier than ideal, but I don't think they can be better expressed with Bound semantics of https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466369691 (which I believe is what @Amanieu proposes and which coinsides with my intuition). The reason is that adding Bound does not change the set of indices you can get, you still need to -1 somehow.

The version of @Ixrec (https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466371296) does make this easier though.

On the meta level, if we can come up with two semantics for Bound which are differ by 1, that probably means that Bound would be a problematic API.

matklad on 22 Feb 2019

It's not just a simple -1. If you have multiple identical values then the difference between an inclusive and exclusive lower bound can be more than 1 element:

10  92  92  92  100
    ^           ^
    LI          LE

Amanieu on 22 Feb 2019

@Amanieu sorry, I wasn't clear, the -1 referred to the difference between https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466369691 and https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466371296. That is, it's the difference between semantics one might thing Bound has, not the difference between LI and LE (which can be arbitrary in one semantics and zero or one in another semantics) .

matklad on 22 Feb 2019

It would be nice if this could move forward somehow. I was surprised that there was a binary_search in std, but that it apparently selects an element at random if there are multiple elements with the same value.

avl on 18 Sep 2019

👍5

C++ has partition_point. It has clear definition like

let i = partition_point(slice, pred);
for x in &slice[..i] { assert!(  pred(x) ); }
for x in &slice[i..] { assert!( !pred(x) ); }

VillSnow on 18 Jun 2020

👍3 ❤2

@VillSnow this sounds like a great idea, as this is indeed more general than binary search, and has trivial design space. Would you like to send a PR to libcore to add this? I am not on the libs team, but seems like this API can be accepted without too much deliberation:

impl<T> [T] {
  fn partition_point<T, P>(&self, mut pred: P) -> usize
  where
      P: FnMut(&T) -> bool,
}

As basic check, rust also use partition and partition_in_place, so cargo-culting the name from C++ seems fine.

matklad on 19 Jun 2020

@matklad
It would be my first PR in my life. How can I send a PR?
I cloned rust-lang/rust, passed x.py check, implement my partition_point and passed x.py check again. ( ~~and accidentally my Linux machine downs so it tooks a bit time to recover :_(~~ done)

my understanding is; first make a issue "Traking issue for partition_point", set issue number of unstable attribute and send PR?
How to set feature of unstable attribute? Can I make as I like?

https://github.com/VillSnow/rust/commit/4c8ce48a15c88715955e56c9c35959d9dffad5ec

VillSnow on 21 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings