The current API for manipulating String objects is largely relegated to str, and many operations treat the source string as immutable and return a new &str reference with the result of the operation (unless non-contiguous results are not guaranteed, e.g. .replace(), etc.)
Oftentimes destructive modifications to a String are needed "while it is being created" and developers are forced to either write their own logic to duplicate str functionality to update String values or else use .to_owned() on the result of str functions.
e.g. a function runs a command and returns its output as a string, with trailing whitespace removed:
....
let mut path = String::from_utf8(output.stdout)?;
let new_len = path.trim_end().len();
path.truncate(new_len);
Ok(path)
vs
....
let path = String::from_utf8(output.stdout)?;
let path_str = path.trim_end();
Ok(path_str.to_owned())
I believe we're close to being able to support a safe mutating api that provides a &str and uses an &str result to redefine the string (mainly to pop or dequeue elements from the underlying storage), something along the lines of
fn String::mutate(&mut self, F) -> ()
where F: Fn(&'restricted str) -> &'restricted str`
which passes in an (exclusive) reference to the current value of the string and expects to receive a subset: &'restricted str as a result, where self.as_bytes() == &[.., subset.as_bytes(), ..] holds true (i.e. subset.as_bytes().as_ptr() >= self.as_bytes().as_ptr() && subset.as_bytes().as_ptr() + subset.as_bytes().len() < self.as_bytes().as_ptr() + self.as_bytes().len())
This makes the most sense it if it is possible to make compile-time guarantees ensuring the above predicate. I don't think #2000 via https://github.com/rust-lang/rust/issues/44580 currently makes this possible, but I feel as if there is overlap between const generics and the ability to do so.
Using it would look something like this:
let mut start = " hello, world ".to_owned();
// Now trim both ends of the string without reallocating
start.mutate(|s| s.trim());
assert_eq!(&start, "hello, world");
while forbidding the following:
let other_string = "not my string";
let mut start = " hello, world ".to_owned();
// Try to provide a reference to an str that is not a subset of the string itself
start.mutate(|_| &other_string); // should not compile
We have String::truncate already. We cannot drop a prefix from the allocation though, so first I'd suggest adding String::shift_left methods to Vec and String:
impl<T> Vec<T> {
pub fn shift_left(&mut self, mid: usize) {
if mid >= self.len() { return; }
self.rotate_left(mid); // Reimplement with faster unsafe code
self.truncate(self.len()-shift);
}
}
impl String {
pub fn shift_left(&mut self, shift: usize) {
if shift >= self.len() { return; }
assert!(self.is_char_boundary(shift));
self.vec.shift_left(shift);
}
}
At that point, your String::mutate could be implemented with some unsafe safety checks and shift_left plus truncate, but your signature looks like
fn String::mutate(&mut self, f: F)
where for<'restricted> F: FnOnce(&'restricted mut str) -> &'restricted mut str
{
// unsafely copy self to old_self
let self = f(self);
// unsafely compute shift and new_len from old_self and self and check they lie within the allocation
self.truncate(shift+new_len);
self.shift_left(shift);
}
I'm unsure if miri could actually be made happy with mutate thought and maybe this needs more is_char_boundary calls. Ask @RalfJung
See also: https://crates.io/search?q=rope
note just using:
fn mutate(&mut self, F) -> ()
where F: for<'a> Fn(&'a str) -> &'a str
doesn't work without checks because of,
string.mutate(|_| "static string always works");
To fix this we can add checks that check if the string is in fact a sub-string of self. If it isn't, then we can just overwrite the current string with the new string entirely. for example
Indeed, that is why the musings about const generics came into play.
@KrishnaSannasi it did not occur to me to have the failure mode for mutate (when result is not a substring) be the allocation of a new string rather than an error, which does indeed make it less pressing for the substr check to be a compile-time error.
I don't think const generics can help here, we need a way to say exactly this lifetime, not at least this lifetime. But I don't think that would be reasonable to implement. (sounds tricky)
Ah, but it would if you disregard lifetimes altogether and instead think of the generics as the start address of the slice and its length.
we need a way to say exactly this lifetime, not at least this lifetime.
That was my initial line of thought before thinking of a possible out via const generics. It's certainly possible if we pass in a wrapper that includes a phantom type such that it can't be constructed anywhere else, but call any of the str methods would return a bare &str dropping the wrapper so it would be ugly.
A wrapper type may work (we could forward all of str's methods to it, and it would rewrap the str), but that's tricky and I think sticking to run time checks will be much easier to justify.
I suppose impl Trait tricks could hide opaque wrapper types perhaps, except Deref tricks cannot handle methods like split_at:
pub trait Substring : Deref<Target=str>+DerefMut { }
impl String {
fn ::mutate<S,F>(&mut self, f: F)
where
S: Substring,
F: FnOnce(S) -> S
{
struct MySubstring<'a>(&mut String,Range<usize>);
... impls ...
let s = f(MySubstring(self,0..self.len()));
...
}
}
We should add the shift_left methods regardless because doing this optimally requires new unsafe code.
What about the less ambitious alternative i.e. introduce the specific in-place modification functions?
impl String {
pub fn trim_start_in_place(&mut self);
pub fn trim_end_in_place(&mut self);
pub fn trim_in_place(&mut self);
pub fn trim_start_matches_in_place<'a>(&'a mut self, pat: impl Pattern<'a>);
pub fn trim_end_matches_in_place<'a>(&'a mut self, pat: impl Pattern<'a>);
pub fn trim_matches_in_place<'a>(&'a mut self, pat: impl Pattern<'a>);
}
@burdges
impl<T> Vec<T> {
pub fn shift_left(&mut self, mid: usize);
}
This already exists, x.shift_left(mid) is the same as x.drain(..mid). (The same exists on String.)
Trim was just an example. What if you wanted to replace a string with the first occurrence of foo within it, etc?
I don't understand why the magic 'restricted semantics matter at all. In what way would the implementation be unable to deal with a string from outside its own buffer?
The original idea was that the method would be compile-time guaranteed to be (re-)allocation-free and the magic lifetimes were to be able to avoid a Result<..> return value. Obviously the only way a String could take over a slice would be if it owned it in the first place, but if the string reallocates as the "operation not possible" failure mode then I suppose that magic isn't needed at all.
The only time reallocation would be required is if the new slice is larger than the String's current capacity. Beyond that, the entire implementation is just to copy the slice into the bottom of the String's buffer and set its length to the size of the slice.
Ideally, there would be no copying needed if the slice is a subset of the string.
@mqudsi That can only work in cases like truncate. If the substring does not start at the beginning of the string, then there must be a copy.
I think @sfackler is right: We should implement mutate as a "copy into anywhere inside myself from anywhere including possibly elsewhere inside myself" method. It almost always copies anyways, so zero reason to enforce anything about subslices. Internal copy methods like this make sense even for &mut str and &mut [T] but any truncate calls require String or Vec.
This is sounding somewhat similar to Vec::splice, for what it's worth: https://doc.rust-lang.org/nightly/alloc/vec/struct.Vec.html#method.splice
Most helpful comment
I don't understand why the magic
'restrictedsemantics matter at all. In what way would the implementation be unable to deal with a string from outside its own buffer?