Rfcs: Taking the first N bytes of a `str` that still make up valid UTF-8

Created on 16 Oct 2018 · 9 comments · Source: rust-lang/rfcs

In the past I've done things like `&s[..max_len.min(s.len())]` to truncate strings, but it turns out this is subtly broken (and will panic) when `max_len` happens to fall in the middle of a multi-byte UTF-8 sequence (i.e. when `!s.is_char_boundary(max_len)`).

I've made a utility function for this (below), but it would be nice if a method on str existed for this case. In particular, I think the fact that the naive solution is broken on non-ASCII text makes it worthwhile, since developers are less likely to test on such text.

I have no opinions on its name (I'm genuinely terrible at names), nor on further extensions / variations or anything like that.

Anyway, below is the source for my version of it, provided mostly to be completely clear about what I'm talking about. In practice this would be a method on `str` and so would have a somewhat different implementation.

pub fn slice_up_to(s: &str, max_len: usize) -> &str {
    if max_len >= s.len() {
        return s;
    }
    let mut idx = max_len;
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    &s[..idx]
}
T-libs
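To make the failure mode concrete, here is a self-contained sketch (the function is copied from the snippet above so the example runs standalone) contrasting the panicking naive slice with the truncating version:

```rust
pub fn slice_up_to(s: &str, max_len: usize) -> &str {
    if max_len >= s.len() {
        return s;
    }
    let mut idx = max_len;
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    &s[..idx]
}

fn main() {
    let s = "héllo"; // 'é' takes 2 bytes in UTF-8, so s.len() == 6
    // `&s[..2]` would panic here: byte index 2 is inside 'é'.
    assert!(!s.is_char_boundary(2));
    assert_eq!(slice_up_to(s, 2), "h");      // backs off to boundary at 1
    assert_eq!(slice_up_to(s, 3), "hé");     // 3 is already a boundary
    assert_eq!(slice_up_to(s, 100), "héllo"); // past the end: whole string
}
```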


All 9 comments

This kinda feels a bit too niche to be in std rather than crates.io. In particular, why do you need this operation in the first place? Where does max_len come from? Do you really want up to some number of bytes, rather than characters? If the latter, given things like Unicode combining code points, what does "character" really mean in this context?

Also, such a function may break the string on boundaries that are considered nonsensical, such as between diacritical marks. Would be really weird for the substring to be missing a diacritic on the last character, and the substring right after it to have an orphaned diacritic at the beginning.

If the concern is transmission/storage, UTF-8 is already well-equipped for processing as the bytes come, as the encoding has, in each byte, metadata saying either that the byte continues an already-started codepoint, or that the byte starts a new codepoint, along with that codepoint's exact byte length. Once a codepoint arrives in full and is consumed, further bytes will never invalidate it.
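For illustration, the per-byte metadata described above can be read off the high bits of each byte. This is a std-only sketch; the helper names are mine, not std APIs:

```rust
// Continuation bytes have the bit pattern 0b10xx_xxxx;
// anything else starts a new code point.
fn is_continuation(b: u8) -> bool {
    b & 0b1100_0000 == 0b1000_0000
}

// For a start byte, the number of leading one bits gives the total
// byte length of the sequence (0 leading ones => ASCII, length 1).
fn sequence_len(start: u8) -> usize {
    match start.leading_ones() {
        0 => 1,
        n @ 2..=4 => n as usize,
        _ => 0, // invalid start byte (a continuation byte, or > 4 ones)
    }
}

fn main() {
    let bytes = "é".as_bytes(); // encodes as [0xC3, 0xA9]
    assert_eq!(sequence_len(bytes[0]), 2);
    assert!(is_continuation(bytes[1]));
    assert_eq!(sequence_len(b'a'), 1); // ASCII stands alone
}
```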

@cramertj Can you say more about how and why it's used? And my other questions above.

At the very least, it should split at grapheme cluster boundaries. Unicode code point is not the same as visible character.
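A small std-only demonstration of that distinction: slicing `"e"` plus U+0301 at a char boundary is memory-safe but splits a grapheme cluster (full grapheme segmentation would need something like the unicode-segmentation crate, which is not used here):

```rust
fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT renders as "é".
    let s = "e\u{301}";
    assert_eq!(s.len(), 3); // 1 byte for 'e' + 2 bytes for U+0301

    // Byte 1 is a valid char boundary, so slicing there does not panic...
    assert!(s.is_char_boundary(1));
    // ...but it strips the accent, leaving a bare "e" in front and an
    // orphaned combining mark in the tail — the problem described above.
    assert_eq!(&s[..1], "e");
    assert_eq!(&s[1..], "\u{301}");
}
```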

Presumably any time you need to fill a fixed size buffer with text, and either have no way to return the overflow, or simply don't want to have to preserve the encoder/decoder state when encoding/decoding across multiple buffers.

Sorry about vanishing there, had a very busy week at work and haven't had the time to elaborate until now.

Do you really want up to some number of bytes, rather than characters?

For my case, yes. I don't think this should consider characters, since as you mention, there are multiple things one may want in terms of characters. Anyway, most (all?) functions that take indices on str take byte indices, which IMO is the correct call.

Why do you need this operation in the first place? Where does max_len come from?

I've hit it a couple times (although until the most recent time I didn't notice the unicode bug), usually when dealing with filling text into a buffer, or when truncating a string before performing a somewhat expensive operation on it (in this case, it's a match operation performed on very many strings, most of which are short, but the long ones might be very long and full of nonsense).

More generally, I feel that the rationale behind having an is_char_boundary function is similar, and also:

This kinda feels a bit too niche to be in std rather than crates.io

The benefit of being in std is that the issue is not obvious at first glance. If there were an idiomatic method on `str` for this, subtle Unicode bugs in code could be prevented (its existence would encourage more correct code).

A possibly less niche function that might help here instead would be something like `str::prev_char_boundary(&self, index: usize) -> usize`, which takes a byte index that may fall in the middle of a char and returns the previous valid byte index. E.g.

// Note: I haven't tested this, and typed it directly into github
impl str {
    // ...
    pub fn prev_char_boundary(&self, mut index: usize) -> usize {
        if index >= self.len() { return self.len(); } // Or maybe it should assert. Dunno.
        while !self.is_char_boundary(index) {
            index -= 1;
        }
        index
    }
    // ...
}

This wouldn't really help with encouraging correct code, but it would make fixing the issue easier when you do find it.
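Since `impl str` blocks can't be written outside the standard library, a crates.io version of the sketch above would presumably be an extension trait. A minimal runnable version (the trait name is hypothetical):

```rust
// Hypothetical extension trait mirroring the proposed str method.
trait PrevCharBoundary {
    fn prev_char_boundary(&self, index: usize) -> usize;
}

impl PrevCharBoundary for str {
    fn prev_char_boundary(&self, mut index: usize) -> usize {
        // Out-of-range indices clamp to len() (alternatively, assert).
        if index >= self.len() {
            return self.len();
        }
        // Index 0 is always a char boundary, so this loop terminates.
        while !self.is_char_boundary(index) {
            index -= 1;
        }
        index
    }
}

fn main() {
    let s = "héllo"; // 'é' spans byte indices 1..3
    assert_eq!(s.prev_char_boundary(2), 1);  // mid-'é' rounds down
    assert_eq!(s.prev_char_boundary(3), 3);  // already a boundary
    assert_eq!(s.prev_char_boundary(99), 6); // past the end: s.len()
}
```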

FWIW, `str` itself uses this approach to put a simple limit on its panic message for slicing errors, in `truncate_to_char_boundary`.

@bluss That's exactly where I copied the code from for use in my project.

@SimonSapin I mostly use this when I want to construct a fixed-capacity string on the stack (e.g. `ArrayString` from bluss' arrayvec) from a string of potentially unbounded length, and I am OK with truncating if the capacity is not enough. I won't say this usage is very frequent, but without it, the `&str` API feels somewhat incomplete.

Edit: On second thought, my usage seems more closely tied to `ArrayString`, since it arises only where we have both a fixed capacity and a string (rather than raw bytes), so it seems fine for this to be added to `ArrayString` as a method (or as a constructor plus a fill-in method).

