Fom discussion on https://github.com/rust-lang/rust/pull/45333.
There is demand for lower_bound, upper_bound and equal_range especially for sorted sequences of non-unique elements. In such sequences the aforementioned operations are more interesting and useful than binary_search. The addition of these three ops will complement the support of sorted slices in the standard library.
lower_bound/// Finds the first element in the sorted slice that is _not less_ than `x`.
///
/// Returns the index `a` such that all elements in `self[..a]` are less than `x` and all elements in `self[a..]` are greater or equal to `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
/// assert_eq!(a.lower_bound(9), 0);
/// assert_eq!(a.lower_bound(10), 0);
/// assert_eq!(a.lower_bound(11), 1);
/// assert_eq!(a.lower_bound(12), 2);
/// assert_eq!(a.lower_bound(13), 2);
/// assert_eq!(a.lower_bound(14), 4);
/// assert_eq!(a.lower_bound(15), 4);
/// assert_eq!(a.lower_bound(16), 5);
/// ```
fn lower_bound(&self, x: &T) -> usize
where
T: Ord;
/// `lower_bound` with a comparator.
fn lower_bound_by<'a, F>(&'a self, f: F) -> usize
where
F: FnMut(&'a T) -> Ordering;
/// `lower_bound` with key extraction.
fn lower_bound_by_key<'a, B, F>(&'a self, b: &B, f: F) -> usize
where
B: Ord,
F: FnMut(&'a T) -> B
upper_bound/// Finds the first element in the sorted slice that is _greater_ than `x`.
///
/// Returns the index `a` such that all elements in `self[..a]` are less or equal to `x` and all elements in `self[a..]` are greater than `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
/// assert_eq!(a.upper_bound(9), 0);
/// assert_eq!(a.upper_bound(10), 1);
/// assert_eq!(a.upper_bound(11), 2);
/// assert_eq!(a.upper_bound(12), 2);
/// assert_eq!(a.upper_bound(13), 4);
/// assert_eq!(a.upper_bound(14), 4);
/// assert_eq!(a.upper_bound(15), 5);
/// assert_eq!(a.upper_bound(16), 5);
/// ```
fn upper_bound(&self, x: &T) -> usize
where
T: Ord;
/// `upper_bound` with a comparator.
fn upper_bound_by<'a, F>(&'a self, f: F) -> usize
where
F: FnMut(&'a T) -> Ordering;
/// `upper_bound` with key extraction.
fn upper_bound_by_key<'a, B, F>(&'a self, b: &B, f: F) -> usize
where
B: Ord,
F: FnMut(&'a T) -> B
equal_range/// Finds all elements _equal_ to `x`.
///
/// Returns the `Range` `a..b` such that all elements in `self[..a]` are less than `x`, all elements in `self[a..b]` are equal to `x`, and all elements in `self[b..]` are greater than `x`.
///
/// # Examples
/// ```
/// let a = [10, 11, 13, 13, 15];
/// for i in (9..17) {
/// assert_eq!(a.equal_range(i), (a.lower_bound(i), a.upper_bound(i)));
/// }
/// ```
fn equal_range(&self, x: &T) -> std::ops::Range<usize>
where
T: Ord;
/// `equal_range` with a comparator.
fn equal_range_by<'a, F>(&'a self, f: F) -> std::ops::Range<usize>
where
F: FnMut(&'a T) -> Ordering;
/// `equal_range` with key extraction.
fn equal_range_by_key<'a, B, F>(&'a self, b: &B, f: F) -> std::ops::Range<usize>
where
B: Ord,
F: FnMut(&'a T) -> B
Is is more convenient to have an Option return value or to just return the appropriate slice boundary?
There are use cases for both, it is not clear to me which one would be more ergonomic.
Also, how about returning a Range in equal_range*?
Returning a Range in equal_range sounds great. I updated the original post.
I don't understand the question about Option. What should we be the return types for upper_bound and lower_bound?
Since they are looking for "the first element in the sorted slice that is ..." I would return the length of the slice. This would make it easy to use these function to find the position that would be the insertion point of an element so that the order is preserved (and the element is added as the first/last of a streak of elements that compare as equal).
I am unsure if my explanation is comprehensible... I will check it again tomorrow morning XD
We return the length of the slice or the position of the lower bound? What happens if the element does not exist (past the end of the slice)? Do you think returning 0 for lower bound and self.len() for upper bound makes sense in that case?
I think 0 would not be a suitable substitute for None if the slice is not empty, because it would identify an element that does not satisfy the property.
For example, [21, 23, 23].lower_bound(26) should not return 0, because 21 is not "the first element in the sorted slice that is _not less_ than" 26. Instead I think returning 3 would be ok, as it would correctly indicate that we could find no element that is _not less_ then 26.
A similar reasoning also applies to upper_bound. This approach could be summarized in the following table:
x | lower_bound | upper_bound | equal_range
--- | --- | --- | ---
20- | 0 | 0 | any empty range (ex: Range(0, 0))
21 | 0 | 1 | Range(0, 1)
22 | 1 | 1 | any empty range (ex: Range(1, 1))
23 | 1 | 3 (or None) | Range(1, 3)
24+ | 3 (or None) | 3 (or None) | any empty range (ex: Range(3, 3))
Note that while the current definition of equal_range allows for any empty range to be returned, we might constrain it more and require that it always returns the range that corresponds to the would-be position of the x elements, even when there are none.
Thanks Andrea. I updated the code above with examples. Let me know if you agree.
equal_range(x) is equivalent to (lower_bound(x), upper_bound(x)) so everything just follows if we nail down what lower_bound and upper_bound return.
Yep, the examples match my expectations :)
Some nits (I think most of these are basically just typos):
upper_bound is still marked as returning Some<usize>{lower,upper}_bound still state "If such an element exists it returns Some with the index of such element. Otherwise it returns None.". I would expect something like "If such an element exists it returns the index of such element. Otherwise it returns the length of the slice."equal_range does not match returning (lower_bound(x), upper_bound(x)) (12 does not exist in the example slice, yet the equal_range would not be (0, 0) nor (5, 5)).Some proposals for an alternative wording of the description (very slice-oriented):
lower_bound returns the index a such that all elements in self[..a] are less than x and all elements in self[a..] are greater or equal to x.upper_bound returns the index b such that all elements in self[..b] are less or equal to x and all elements in self[b..] are greater than x.equal_range returns the Range a..b such that all elements in self[..a] are less than x, all elements in self[a..b] are equal to x, and all elements in self[b..] are greater than x.This wording is more verbose, but it shows why the "element not found" is not actually a special case for these functions and highlights in a very direct way the relationship between equal_range and {lower,upper}_bound.
I like the new wording. Incorporated.
I think "any empty range" is unfortunate, because there's very useful information that throws away.
The other use of equal_range is, with an obviously-placeholder name,
/// Returns the range of positions at which `x` could be inserted such
/// that the sequence would remain sorted.
fn where_to_insert_in_sorted(&self, x: &T) -> RangeInclusive<usize>;
That return value would have the _same_ start and end as equal_range. (C++ avoids this distinction by returning a pair that can be considered as [a,b) for "the items" and [a,b] for "the insertion positions".)
Phrased differently, I think it's important that upper_bound and lower_bound are the .start and .end of the equal_range return value (just calculated a bit more efficiently).
Isn't there any meaningful data to be returned from a "not found" lower_bound? Can't it behave similar to binary_search?
Same for upper_bound.
@scottmcm equal_range does not return "any empty range". What is returned is defined as your first suggestion.
@arthurprs can you give an example?
@alkis nevermind, that was a confusion moment.
I think lower_bound should return 0 and upper_bound should return slice.len() if no range can be found. Same as you proposed on https://github.com/rust-lang/rfcs/issues/2184#issuecomment-338328470
@arthurprs this is what it will return given the description. The examples also show these cases.
I see, it's just that fn upper_bound(&self, x: &T) -> Option<usize> still has the Option<> in the result type.
Fixed.
What's the next step? Does this need to go through a PR submission to the RFC repository?
Is there a reason these functions don't use Result in their return type like binary_search does. Isn't it sometimes still interesting to know if there is an existing equal element or not?
It is a tradeoff. If we do the check then all users need to pay the price
of the extra branch. If we don't then the users that don't need it don't
pay for it. Since rust is going for zero cost/systems programming, I find
that this API is a better fit.
I published the implementation in a crate here: https://crates.io/crates/ordslice. When/if this lands the crate can be deprecated.
Could we add the option for the bound to be inclusive or exclusive using Bound? The result would look like this:
fn lower_bound(&self, x: Bound<&T>) -> usize
where
T: Ord;
This is similar to the API I used for RBTree in intrusive-collections.
Do you have an idea of the performance cost of such a change? Also do you have examples of using such API in practice that either allow use-cases that are not possible otherwise and/or more readable code?
Can't exclusive bounds be emulated by altering the comparison fn?
the need for something like lower_bound/upper_bound in a number of situations. What I've ended up doing is to use binary_seach_by and using a closure which only returns Greater or Less. For example in this function in rand.
The main thing that was awkward with that solution, and which still feels awkward in the lower_bound_by/upper_bound_by API is the use of Ordering as the closure return type. In part because the enum names take up about half of the closure function.
But for lower_bound_by/upper_bound_by also because returning Equal effectively treated the same as either Less or Greater. I.e. lower_bound_by/upper_bound_by effectively want a binary answer, but uses a three-value enum to express that.
I think the lower_bound_by/upper_bound_by API would be a lot more pleasant to use if the closure returned a bool instead. That way you could write myslice.lower_bound_by(|item| item.foo <= 20).
The only problem is that this doesn't work well with equal_range_by, which would need to keep using Ordering.
You can find lower_bound, upper_bound, and equal_range in https://docs.rs/superslice/0.1.0/superslice/.
I was hoping to move forward the discussion so that maybe these functions can be added to the standard library. But happy to do that elsewhere if that's preferred?
For equal_range, I think it'd be more useful if you could provide a range of numbers to be searched for in the slice. Similar to BTreeMap::range. So something like:
let vec = vec![3, 5, 6, 6, 8, 10, 12];
assert_eq!(vec.search_something_range(5..9), (1, 5));
assert_eq!(vec.search_something_range(5..10), (1, 5));
assert_eq!(vec.search_something_range(5..=10), (1, 6));
assert_eq!(vec.search_something_range((Exclusive(6), Inclusive(20)), (4, 7));
It'd also be cool if you could do
let vec = vec![3, 5, 6, 6, 8, 10, 12];
assert_eq!(vec.search_something_range(6), (2, 4));
However that'd mean we couldn't use the RangeBounds trait.
FWIW, if there's interest in having this use cases discussed in this RFC directly addressed in stdlib, independent of what API is used to do it, then I suggest speaking up. The PR in rust-lang/rust#52530 got rejected because it wasn't clear that the usecase needed addressing.
I would be really happy to have that in stdlib
I am hitting this issue in sled, and finding myself writing a few wrappers around standard library functionality that do not feel obviously correct (similar to https://github.com/rust-lang/rust/issues/52529). It's important for me to quickly locate these bounds, both exclusively and inclusively, to enable various tree search and iteration features.
Could we add the option for the bound to be inclusive or exclusive using
Bound
Why this would be useful? What does xs.lower_bound(Bound::Unbounded) mean, for example? In general, I find it hard to understand at a glance what is the difference between xs.lower_bound(Bound::Exclusive(&92)) and xs.lower_bound(Bound::Inclusive(&92)). std::collections::BTreeMap only uses Bound with range.
I rather strongly feel that the proposed signature of
fn (lower|upper)_bound(&self, x: &T) -> usize
where
T: Ord;
is "canonical". For a given x, there are number of possible insertion points for x in the underlying slice, and lower / upper return smallest/greatest one respectively. Adding Result or Bound into the mix seem to only complicate the API? It seems like we should just add these methods to std already :)
I am not so sure about equal_range:
pub fn binary_search_range<K, R>(&self, range: R) -> std::ops::Range<usize> where
K: Ord + ?Sized,
R: RangeBounds<K>,
T: Borrow<K>,
Why this would be useful? What does
xs.lower_bound(Bound::Unbounded)mean, for example?
It would always return 0 (the first element).
In general, I find it hard to understand at a glance what is the difference between
xs.lower_bound(Bound::Exclusive(&92))andxs.lower_bound(Bound::Inclusive(&92)).
If the slice does not contain 92, then they are equivalent. Otherwise, Inclusive will point to the first instance of 92 in the slice (there can be more than one), and Exclusive will point to the element after the last 92.
So, lower_bound(Exclusive(&x)) is the same as upper_bound(Inclusive(&x))? And vice verse?
10 92 92 92 100
^ ^
LI LE
UE UI
If this is the case, looks like Bound does not by us any expressivity? EDIT: for "point" queries. I 100% see usefulness of Bound for "range" queries, where the return type is not usize, but a range/slice.
I haven't been following this thread, but having the "lower bound" of 92 ever be _above_ all the 92s seems like we've either lost track of what "lower" means, or these are really bad names for whatever operations are being proposed.
Assuming these are the operations the names imply they are, and we do want Bounds that also mean what their names imply, wouldn't the expected behavior be more like this?
10 92 92 92 100
^ ^ ^ ^
LE LI UI UE
I haven't been following this thread, but having the "lower bound" of 92 ever be _above_ all the 92s seems like we've either lost track of what "lower" means, or these are really bad names for whatever operations are being proposed.
You need to think of the bounds in terms of a range of values that you are interested in. So lower_bound means "the range of values that I am interested in starts at 92 (inclusive/exclusive depending on the bound), give me the index of the first value in that range".
The definition for upper_bound is similar: "the range of values that I am interested in ends at 173 (inclusive/exclusive), give me the index of the value in that range".
Interesting. I think that means the names are off. lower_bound(x) seems like it'd mean the lower bound of values equal to x. If what we want is the first x value, then we should just call the method something obvious like first(x).
But "upper bound"/"upper bound (inclusive)"/"lower bound (exclusive)" apparently means "the first non-x value after an x value" which is a super weird concept I don't know of a concise name for. Hm...
For a couple concrete "point" query examples I'm working with, consider a B+ tree where pointers to children are stored in a sorted slice as (separator_key, child_node) where the separator_key is the lowest inclusive bound of values that may be present in that node. While traversing the tree, we need to find the child with the highest separator_key that is less than or equal to the key being searched for in the tree.
In general, I find all of the below use cases useful for working with versioned trees:
All of the above can be expressed with iterators, but that prevents their compatibility with binary_search*.
I believe that this is expressible with bound-less methods like this:
xs[..xs.upper_bound(key)].last() highest less or equal
xs[..xs.lower_bound(key)].last() highest less
xs.get(xs.lower_bound(key)) lowest greater or equal
xs.get(xs.upper_bound(key)) lowest greater
The first two seem trickier than ideal, but I don't think they can be better expressed with Bound semantics of https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466369691 (which I believe is what @Amanieu proposes and which coinsides with my intuition). The reason is that adding Bound does not change the set of indices you can get, you still need to -1 somehow.
The version of @Ixrec (https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466371296) does make this easier though.
On the meta level, if we can come up with two semantics for Bound which are differ by 1, that probably means that Bound would be a problematic API.
It's not just a simple -1. If you have multiple identical values then the difference between an inclusive and exclusive lower bound can be more than 1 element:
10 92 92 92 100
^ ^
LI LE
@Amanieu sorry, I wasn't clear, the -1 referred to the difference between https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466369691 and https://github.com/rust-lang/rfcs/issues/2184#issuecomment-466371296. That is, it's the difference between semantics one might thing Bound has, not the difference between LI and LE (which can be arbitrary in one semantics and zero or one in another semantics) .
It would be nice if this could move forward somehow. I was surprised that there was a binary_search in std, but that it apparently selects an element at random if there are multiple elements with the same value.
C++ has partition_point. It has clear definition like
let i = partition_point(slice, pred);
for x in &slice[..i] { assert!( pred(x) ); }
for x in &slice[i..] { assert!( !pred(x) ); }
I can use this when I cannot remember the definition of lower/upper_bound :)
It is basic and primitive edition of lower/upper_bound and I want it in Rust rather than lower/upper_bound. (both is great)
@VillSnow this sounds like a great idea, as this is indeed more general than binary search, and has trivial design space. Would you like to send a PR to libcore to add this? I am not on the libs team, but seems like this API can be accepted without too much deliberation:
impl<T> [T] {
fn partition_point<T, P>(&self, mut pred: P) -> usize
where
P: FnMut(&T) -> bool,
}
As basic check, rust also use partition and partition_in_place, so cargo-culting the name from C++ seems fine.
@matklad
It would be my first PR in my life. How can I send a PR?
I cloned rust-lang/rust, passed x.py check, implement my partition_point and passed x.py check again. ( and accidentally my Linux machine downs so it tooks a bit time to recover :_( done)
my understanding is; first make a issue "Traking issue for partition_point", set issue number of unstable attribute and send PR?
How to set feature of unstable attribute? Can I make as I like?
https://github.com/VillSnow/rust/commit/4c8ce48a15c88715955e56c9c35959d9dffad5ec
Most helpful comment
C++ has partition_point. It has clear definition like
I can use this when I cannot remember the definition of lower/upper_bound :)
It is basic and primitive edition of lower/upper_bound and I want it in Rust rather than lower/upper_bound. (both is great)