This is a tracking issue for the deprecated std::io::Read::chars API.
It would be nice if std::io::Chars was an Iterator<Item=char>, just like std::str::Chars. I don't have a proposal for how decoding errors should be handled, though.
If we want to make std::io::Chars a simple iterator on char, a solution is to have it return None on a UTF-8 error, and set an error flag in the Chars itself (with an associated was_valid_utf8() method or something). An empty sequence is considered valid UTF-8.
It does make error detection slightly less convenient, as writing for c in reader.chars() does not keep the Chars for later verification.
Here is an example use-case where we try to recover after an invalid sub-sequence:
```rust
let mut chars = reader.chars();
loop {
    for c in chars.by_ref() {
        // ...
    }
    if chars.was_valid_utf8() { break; }
    println!("Encountered invalid byte sequence.");
}
```
Or it could provide a more informative message, similar to the current CharsError.
Maybe this could apply to the other adapters as well? Or is this a bad pattern?
On the other hand, it is not so difficult to treat the Result item explicitly, or to wrap the current Chars to get the behavior I described (an unwrapper wrapping, interesting notion), so maybe the current situation is acceptable as is.
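For illustration, such an "unwrapper" wrapper could be sketched like this (all names hypothetical; it works for any iterator of `Result<char, E>`, not just the current `Chars`):

```rust
// Hypothetical sketch of the "unwrapper wrapping" idea: adapt any
// Iterator<Item = Result<char, E>> into a plain char iterator that
// stops at the first error and records it for later inspection.
struct UnwrapChars<I, E> {
    inner: I,
    error: Option<E>,
}

impl<I, E> UnwrapChars<I, E> {
    fn new(inner: I) -> Self {
        UnwrapChars { inner, error: None }
    }

    // Plays the role of the was_valid_utf8() method discussed above.
    fn was_valid(&self) -> bool {
        self.error.is_none()
    }
}

impl<I, E> Iterator for UnwrapChars<I, E>
where
    I: Iterator<Item = Result<char, E>>,
{
    type Item = char;

    fn next(&mut self) -> Option<char> {
        match self.inner.next() {
            Some(Ok(c)) => Some(c),
            Some(Err(e)) => {
                // Remember the error and end the iteration.
                self.error = Some(e);
                None
            }
            None => None,
        }
    }
}

fn main() {
    let items: Vec<Result<char, ()>> = vec![Ok('a'), Ok('b'), Err(()), Ok('c')];
    let mut chars = UnwrapChars::new(items.into_iter());
    let decoded: String = chars.by_ref().collect();
    assert_eq!(decoded, "ab");
    assert!(!chars.was_valid());
}
```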
Nominating for 1.6 discussion
🔔 This issue is now entering its cycle-long final comment period for deprecation 🔔
These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.
.chars() is a nice operation to have, it'd be a shame to have to pull in a separate crate just to get such a useful iterator. The error type doesn't seem problematic; you'd have to handle an io::Error anyways.
Perhaps a .chars_lossy() iterator that yields the UTF-8 replacement character on a UTF-8 error and stops on the first io::Error.
I’m in favor of stabilizing Read::chars eventually, but it’s not ready yet:
It’s unstable because:

> the semantics of a partial read/write of where errors happen is currently unclear and may change
(The same would apply to chars_lossy.)
This behavior should be per Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630
Roughly, that means stopping at the first unexpected byte. This is not the behavior currently implemented, which reads as many bytes as indicated by the first byte and then checks them. This is a problem as, with only Read (as opposed to, say, BufRead), you can’t put a byte "back" in the stream after reading it.
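For reference, the maximal-subpart boundaries on the first example below can be observed with the modern `Utf8Error` API (`error_len` did not exist at the time of this comment):

```rust
// error_len() reports how many bytes form the invalid sequence that a
// best-practice ("maximal subpart") decoder would replace with U+FFFD.
fn main() {
    let e = std::str::from_utf8(b"\xf0\x9fabc").unwrap_err();
    assert_eq!(e.valid_up_to(), 0);
    // Only the two bytes \xf0\x9f are rejected; decoding resumes at 'a',
    // yielding "\u{FFFD}abc" under lossy decoding.
    assert_eq!(e.error_len(), Some(2));
    assert_eq!(String::from_utf8_lossy(b"\xf0\x9fabc"), "\u{FFFD}abc");
}
```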
Here are some failing tests.
```rust
let mut buf = Cursor::new(&b"\xf0\x9fabc"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert_eq!(chars.next().unwrap().unwrap(), 'b');
assert_eq!(chars.next().unwrap().unwrap(), 'c');
assert!(chars.next().is_none());

let mut buf = Cursor::new(&b"\xed\xa0\x80a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none());

let mut buf = Cursor::new(&b"\xed\xa0a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none());
```
I’ve looked at fixing this, but it basically involves duplicating all of the UTF-8 decoding logic from str::from_utf8, which I’m not really happy with. (That many more tests would need to be added.) I’ll try to think of some way to have a more basic decoder that can be used by both.
Moving chars to BufRead does not seem unreasonable.
@sfackler I concur. I was thinking the exact same thing. In fact, it is unfortunate that Read::bytes() is already stabilized because, like chars(), it is almost always preferable to have it on a buffered source. A lot of the Read types really do not tolerate small, frequent reads well (#28073).
While I agree that chars is useful, questions like those brought up by @SimonSapin make me think that it's too weighty/controversial to be in the standard library. I'm not sure that this is really ever going to be seriously used either due to the performance concerns, and I agree that bytes is a bit of a misnomer in that it's super non-performant much of the time as well.
How does bytes perform when used with BufReader<_>?
@SimonSapin there were some perf numbers reading from a file with bytes in #28073. The difference is pretty significant.
I think chars would work pretty well if implemented on top of a buffer. It doesn't have to consume any bytes it doesn't need to, so no unnecessary loss of data.
I may be doing something hugely inefficient here, but I really needed Read::chars for a crate I was working on that attempted to produce an iterator of rustc-serialize JSON items from a Read stream. A while back I had to fall back on vendoring most of the source around Read::chars as a workaround to get it working on stable. It was almost a requirement, as the rustc-serialize builder for JSON requires an iterable of chars. It would be really nice to have this stabilized, even if under some method name that implied lossiness.
@SimonSapin
I would expect it to perform much better than on a raw File because the number of syscalls issued is far less, but on the other hand it is likely much slower than another equivalent operation such as mmap + iterate the bytes via a slice.
@cybergeek94
Yeah I suspect the performance isn't _abysmal_, but when compared to str::chars it's likely still much slower (just more stuff that has to be checked during I/O)
@softprops
Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string". For example a json deserializer may want to take an iterator of characters, but the actual iterator does something like BufReader where it reads a huge string, then literally uses str::chars, only reading a new string once it's exhausted (handling the boundary).
@alexcrichton
Wouldn't a chars method on a BufReader-backed Read serve this exact purpose? It reads a big slice of bytes, hopefully in UTF-8 format, which is exactly what a string would be; shouldn't it be the same then to iterate chars on these bytes compared to using str::chars?
> Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string"
In this case it's literally a stream of JSON I'm working with. My use case was interfacing with Docker's streaming JSON endpoints, in which case JSON objects are pushed through a streamed response. I'm not sure how I'd accomplish that with a string.
@alexcrichton Handling the boundary is tricky. It’s handled in https://github.com/SimonSapin/rust-utf8/blob/master/lib.rs
@Gyscos hm yes, I guess it would! That would definitely mean that chars could only be on BufReader.
@softprops in theory you can transform an iterator of &str slices into an iterator of char values, which is kinda what BufReader would be doing
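A minimal sketch of that transformation (assuming the chunks are already split on code point boundaries, which is the hard part a real reader must handle):

```rust
// Flatten an iterator of &str chunks into an iterator of chars.
fn main() {
    let chunks = vec!["hel", "lo ", "wörld"];
    let decoded: String = chunks.iter().flat_map(|s| s.chars()).collect();
    assert_eq!(decoded, "hello wörld");
}
```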
So a solution is to move chars to BufRead, as @sfackler mentioned. I was worried this would force needlessly wrapping some Readers that didn't _need_ it, but I just noticed Cursor<Vec<u8>> and Stdin already provide BufRead access.
But all of that is about the performance aspect.
Concerning the error handling, BufRead::lines returns an iterator over io::Result<String> to properly propagate errors, including invalid UTF-8 (by returning an ErrorKind::InvalidData). Couldn't chars do the same, instead of using a custom error?
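For reference, this is how lines surfaces invalid UTF-8 today (sketched with an in-memory cursor):

```rust
use std::io::{BufRead, Cursor, ErrorKind};

// BufRead::lines reports invalid UTF-8 as an io::Error with
// ErrorKind::InvalidData instead of a custom error type.
fn main() {
    let cur = Cursor::new(&b"ok\n\xFF\xFF\n"[..]);
    let mut lines = cur.lines();
    assert_eq!(lines.next().unwrap().unwrap(), "ok");
    // The second line is not valid UTF-8.
    let err = lines.next().unwrap().unwrap_err();
    assert_eq!(err.kind(), ErrorKind::InvalidData);
}
```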
Yeah I have a feeling that would be sufficient, there's more fine-grained error information we can give (such as the bytes that were read, if any), but iterators like lines are already lossy where you lose access to the underlying data if there's an error, so it's probably not too bad.
It does raise a question though that if chars exists only on BufRead for performance reasons, why does bytes not exist only on BufRead? Just a minor inconsistency.
I think the reasoning to move chars to BufRead is for correctness wrt @SimonSapin's comment - to be able to peek at the next byte without consuming it.
The performance issue being alleviated is only a beneficial side-effect. bytes doesn't require a buffer for correctness.
Hm I don't think that this is a correctness problem that can be solved by "just moving to BufRead", it's always a possibility that the buffer is 1 byte in size so it behaves the same as reading one byte at a time. Basically BufRead just reduces the number of calls to read, but it essentially always has the same problems wrt partial reads, short reads, and consuming data.
@SimonSapin was saying that BufReader could potentially put things back in the buffer, but it's not actually exposed by the BufRead trait.
You don't put things back in the buffer, you just don't call consume on things you don't want to, right?
> it's always a possibility that the buffer is 1 byte in size so it behaves the same as reading one byte at a time.
If only BufRead had an API to force an early read into the buffer... but I digress. I closed that RFC because I didn't have the time or the energy to argue for it.
The iterator can always consume the last bytes out of the buffer to force it to read more. If the next read is too short (a partial read returning < 4 bytes? honestly, does that even happen?), it can be considered an error just like if we hit EOF while expecting more bytes.
Compared to a Read, the BufRead makes it at least _possible_ to be correct: since it should only consume the longest correct subsequence, it could read the bytes one by one and consume them only when they keep the subsequence valid. Only, as @SimonSapin said, this would mean duplicating some code for the checks.
So it _is_ possible to implement the correct behaviour for BufRead (while it is not for Read), but it is inconvenient if the reader has too short a buffer, or only returns too short reads.
The too-short read problem is independent of using a BufRead or a Read. The underlying Read could return data 1 byte at a time itself if it was feeling particularly pathological. A BufReader that only has 1 byte remaining in its buffer is only going to return that one byte in the next call.
I'm not sure I understand the too-short read problem.
If the reader stops returning anything in the middle of a codepoint, it's obviously an error.
If it doesn't, we can read the bytes from the BufRead one by one, and consume it until we find an invalid byte (that we don't consume). This works even with a BufReader using a 1-byte buffer.
This is completely impossible with a simple Read, since we must consume each byte as we read it, preventing us from leaving the first invalid byte untouched.
In this regard, BufRead vs Read does change what is possible to achieve.
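A small sketch of the peek-then-consume pattern that BufRead makes possible (the helper name is hypothetical): we look at the next byte via fill_buf() and only consume() it if we accept it, so a rejected byte stays in the stream.

```rust
use std::io::{BufRead, Cursor};

// Return the next byte only if it satisfies `pred`; otherwise leave
// it unconsumed in the stream.
fn next_byte_if<R: BufRead>(
    r: &mut R,
    pred: impl Fn(u8) -> bool,
) -> std::io::Result<Option<u8>> {
    let accepted = {
        let buf = r.fill_buf()?; // peek without consuming
        match buf.first().copied() {
            Some(b) if pred(b) => Some(b),
            _ => None,
        }
    };
    if let Some(b) = accepted {
        r.consume(1); // only now is the byte removed from the stream
        return Ok(Some(b));
    }
    Ok(None)
}

fn main() -> std::io::Result<()> {
    let mut cur = Cursor::new(&b"ab"[..]);
    assert_eq!(next_byte_if(&mut cur, |b| b == b'a')?, Some(b'a'));
    // 'b' fails the predicate, so it is left in the stream.
    assert_eq!(next_byte_if(&mut cur, |b| b == b'a')?, None);
    assert_eq!(next_byte_if(&mut cur, |b| b == b'b')?, Some(b'b'));
    Ok(())
}
```

This is exactly what plain Read cannot offer: its read() hands bytes over and they are gone.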
Yeah, I was commenting on "but it is inconvenient if the reader has too short a buffer, or only returns too short reads." That's just a thing you have to deal with.
I see.
To recap, concerning Chars, the options are (apart from taking it out of tree):
1. Move it to BufRead in order to force a new read.
2. Get the validation logic, either by:
   i. duplicating the code from str::from_utf8 (actually str::run_utf8_validation_iterator) and using it from here, or
   ii. sharing the code with str, by:
      a. exposing it from the str module (making it public?) - will need some modification in the str module
      b. moving the Chars functionality to the str module (it is basically a streamed utf-8 decoder, could be more general than just Read?)

Note that not following the UTF-8 best practices means BufRead does not add any correctness, "just" some performance improvements.
I like option 2.ii.a, but maybe it's too much of an impact for this Chars problem?
Given the number of details here, I think stabilizing a chars method on BufRead/Read should probably require an RFC, especially if we start messing with the str module API. It seems like that makes it (and the other methods in this tracking issue) an ideal candidate to evolve out of tree.
The libs team discussed this during triage today and the conclusion was to deprecate tee and broadcast while leaving chars unstable. Further action there can hold off on an RFC.
> These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.
So is there a crate? Sorry if I missed one.
Ah not yet as there didn't seem to be that much desire for them, but I could certainly push up a crate if one is desired!
I just published tee and broadcast as separate crates. I have used both std lib APIs in different projects, but never together, so I thought it would be nice to have the same functionality in smaller, more focused crates: https://github.com/softprops/tee https://github.com/softprops/broadcast
Thanks @softprops!
@softprops Great, thanks!
Moving out of final-comment-period and the title is now updated to indicate that this issue is solely related to Read::chars now.
I think moving chars to BufRead makes a lot of sense. It does require reimplementing str::from_utf8 logic to implement it properly, which is unfortunate, but there's not really a way around it.
Without chars, you can get the bytes from a BufRead and call str::from_utf8, but it will check the data all at once instead of reading out chars until it hits an error. You'd have to call valid_up_to and retry the conversion in order to handle the bad bytes.
I'm not clear on the intended semantics of chars here. Is it:

1. get stuck at the first invalid byte without consuming anything, or
2. consume the valid part and get stuck only at the invalid bytes?
@taralx If "get stuck" means return an error, then the desired behaviour is the second option: read (consume) the valid part, but leave the invalid part. That's why a BufRead is required, to peek at the (potentially invalid) byte before consuming it.
I don't believe the implementation reads the "valid part" of the char; if there isn't a complete UTF-8 code point, nothing is read.
Sorry, I was talking about the intended semantics, not the current behaviour.
Currently, Chars always reads as many bytes as the first byte indicates, and returns an error if they aren't a valid UTF-8 sequence.
Yes, "get stuck" means returning an error. The difficulty is that BufRead has limited lookahead -- you're pretty much always getting 1 character of lookahead, but no more without the ability to force buffer fills.
Yes, this was talked about a few comments above: 1 byte of read ahead is enough to stop reading in time (only consume the byte if the sub-sequence is still valid). It does, however, require additional code to validate partial UTF-8 sequences.
This code exists more or less in the str module (here), but is:
So the solution would be to either:

- duplicate it in io::Chars, or
- expose it from str.

Both also require Chars to use a BufRead instead of a Read.
> 1 byte of read ahead is enough to stop reading in time
I'm not sure I agree, based on the code you linked to. I can see codepaths that will result in the iterator being rewound by more than 1 byte in the event of an error.
Indeed, I was mistaken, this code also tries to read all the bytes before attempting decoding.
And the methods taking iterators (like next_code_point) simply ignore errors in the continuation bytes.
So I guess the code will have to be written from scratch.
I'm not sure it's possible to do it with only 1 byte of lookahead. You need to be able to detect sequences like E2 82 00 (illegal, but only after you see the 00) without consuming any of those bytes.
We're allowed to consume any byte if it forms a potentially valid subsequence. So if there is a byte XY such that E2 82 XY is valid, then consuming E2 82 is allowed. By definition, this means a 1-byte lookahead is enough.
Okay, so it's the second one above: We error out on a bad character, reading the valid part but leaving the invalid part.
When emitting '\u{FFFD}' replacement characters for decoding errors, reading 1 byte at a time (or behaving as if) is the behavior recommended by Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630
This is in case the first byte unexpected in a sequence is the start of another sequence. In the example above, b"\xE2\x82\x00" should decode to 2 code points "\u{FFFD}\0".
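For what it's worth, std's lossy decoder today agrees with that expectation:

```rust
// E2 82 is replaced by a single U+FFFD, and the 0x00 byte that
// interrupted the sequence is decoded normally: two code points total.
fn main() {
    let decoded = String::from_utf8_lossy(b"\xE2\x82\x00");
    assert_eq!(decoded, "\u{FFFD}\u{0}");
    assert_eq!(decoded.chars().count(), 2);
}
```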
IMO 1 byte lookahead is not enough to guarantee that no data is lost in case of error.
Consuming valid subsequences should _not_ be allowed because if it continues with an invalid byte, the formerly valid subsequence is invalidated and there's no possibility to "unconsume" it.
If anyone is interested, I've implemented this and other changes to the std::io-traits in my crate netio.
The library provides stronger guarantees than std::io, i.e. it never allows to lose any data in case of error.
Feel free to copy anything you like.
However, there's still one unresolved issue: it's not possible to tell from std::str::Utf8Error whether the error is due to an invalid sequence or just incomplete data. My goal is to return std::io::ErrorKind::InvalidData in the former case and std::io::ErrorKind::UnexpectedEof in the latter, which is currently not possible without reimplementing UTF-8 decoding, which I don't want to do.
I've filed an issue for this (#32584). It's similar to #12113 but not exactly the same. #12113 is not an issue for me, because in netio invalid or incomplete UTF-8 is never consumed and always left in the stream.
I give this opinion as a casual user of chars() rather than someone that understands unicode in detail. That being said....
It would be good to have an iterator that behaved like chars() in the str module. I think making io::Read::chars() behave like that would mean it doesn't fit in with the rest of the io module (errors and Option/Result). If this method were called, for example, chars_lossy, then it would be very helpful to have a note at the start of the chars() documentation explaining that those who do not care about errors and just want to crash on them should use chars_lossy (otherwise newbies like me get confused and have to read a really long thread to understand the issues :P)
> It would be good to have an iterator that behaved like chars() in the str module.
That is not possible. The contents of a &str are guaranteed to be well-formed UTF-8, so there never is a decoding error. The bytes from Read are arbitrary.
I think Read::chars should be renamed for that reason, by the way: #33761
> If this method was called for example chars_lossy
String::from_utf8_lossy is an existing function that replaces ill-formed byte sequences with a \u{FFFD} replacement character, which looks like this: �. It would be possible to add a separate Read::utf8_chars_lossy method, or a lossy method on the return value of Read::utf8_chars that does the same.
> those who do not care about errors and just want to crash on them should use chars_lossy
No! Panicking on ill-formed byte sequences is very different from "lossy" decoding, which silently replaces them with �.
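A quick illustration of the difference:

```rust
// Strict decoding errors out (and unwrap() would panic), while
// lossy decoding silently substitutes U+FFFD and keeps going.
fn main() {
    let bytes = b"\xFFabc";
    assert!(std::str::from_utf8(bytes).is_err());
    assert_eq!(String::from_utf8_lossy(bytes), "\u{FFFD}abc");
}
```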
> I think Read::chars should be renamed for that reason, by the way:
(I meant this to link to #33761. Edited.)
@SimonSapin apologies I probably wrote that comment late at night.
I agree with what you said, that was what I was trying to say. To summarise:
- a Read::utf8_chars_lossy that won't error but will insert \u{FFFD}
- a BufRead::chars whose Item is Result<char, _>, and which doesn't consume any malformed bytes
- a Read::chars whose Item is Result<char, _>, and whose Error type contains any consumed bytes from the stream not used in a char
- a note on Read explaining the choice

Do I understand you correctly?
Your BufRead::chars would still need to return a partial character in the error part. See above about lookahead.
@taralx What I mean is something like the following:
```rust
enum Error {
    // ...
    MalformedCharError([u8; 4]),
}
```
or something similar handling the fact that the size is unknown.
I don’t think the API needs to change at all: one fn chars(self) -> Chars<Self> default method on the std::io::Read trait, where Chars<R: Read>: Iterator<Item = Result<char, CharsError>>. There is no need for CharsError to include the ill-formed byte sequence, what matters is not consuming too much of the input to avoid dropping valid code points. I believe that means reading one byte at a time.
Since we’re reading one byte at a time, there is no need to involve the BufRead trait. The docs can recommend using a BufReader wrapper type if small reads in the underlying stream are expensive (e.g. in File and TcpStream, each read is a syscall and so a context switch). But &[u8] is fine, for example, even though it doesn’t implement the BufRead trait.
chars_lossy is relatively easy to build on top of chars:
```rust
stream.chars().map(|result| match result {
    Ok(c) => Ok(c),
    Err(CharsError::NotUtf8) => Ok(std::char::REPLACEMENT_CHARACTER),
    Err(CharsError::Other(e)) => Err(e),
})
```
Note that for std::char::decode_utf16 we chose not to include lossy decoding, since it’s even easier for users: .map(|result| result.unwrap_or(std::char::REPLACEMENT_CHARACTER))
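For illustration, that pattern with std::char::decode_utf16:

```rust
// 0xD800 is an unpaired surrogate, so it decodes to an error that
// we replace lossily with U+FFFD.
fn main() {
    let units = [0x0048u16, 0xD800, 0x0069];
    let s: String = std::char::decode_utf16(units.iter().cloned())
        .map(|result| result.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect();
    assert_eq!(s, "H\u{FFFD}i");
}
```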
I’ll prepare a PR now.
> Since we’re reading one byte at a time, there is no need to involve the BufRead trait.
@SimonSapin I don't agree with that.
By that argument, all current methods of BufRead could live in Read. None of them needs BufRead for correctness.
At least the way they are implemented now. I'd prefer better guarantees, but that's probably out of scope.
The docs can recommend using a BufReader wrapper type if small reads in the underlying stream are expensive (e.g. in File and TcpStream, each read is a syscall and so a context switch). But &[u8] is fine, for example, even though it doesn’t implement the BufRead trait.
Huh? impl<'a> BufRead for &'a [u8] exists.
@SimonSapin You need lookahead. How will you handle the following byte sequence without it? E2 82 E2 82 82
> Huh? impl<'a> BufRead for &'a [u8] exists.
Uh, never mind that part, I misread rustdoc.
@taralx Yes, I realized while implementing #33801 that the Utf8Chars iterator needs to hold a Option<u8> buffer: https://github.com/rust-lang/rust/pull/33801/commits/7ab345098625cf4e3b7f930e0ae003fda00da166#diff-668f8f358d4a93474b396dcb3727399eR1558
It might be better to require BufRead after all. I’ll try that in an additional commit in #33801 to see what it looks like.
Note that the algorithm still has to read one byte at a time: BufRead::fill_buf is not guaranteed to give "new" bytes until the current buffer is entirely consumed (or at least in BufReader it doesn’t), and the byte sequence for a single code point might span across a buffer boundary.
I’ve added a commit to #33801.
🔔 This issue is now entering its final comment period 🔔
The libs team was unfortunately a little late on stabilizations this cycle, but we're thinking of probably doing a backport of stabilizations partway through the beta cycle. The API we're considering for stabilization is the one proposed by @SimonSapin in https://github.com/rust-lang/rust/pull/33801. If we do decide to accept that PR we'll probably have a new FCP next cycle for actual stabilization but if we don't accept it we'll probably just deprecate the feature.
Regardless, discussion on the PR is more than welcome!
I still have one major objection to #33801 that I've noted in that PR.
My concern with #33801 relates to partial reads. As it stands, there's no way to know how much data was consumed in the case of a Read source like &[u8]. Some ways to fix that are: returning the number of consumed bytes in Utf8Error::IncompleteUtf8 (not great), returning the number of bytes left over in that variant, or returning the specific bytes left over in that variant.
After thinking on InvalidUtf8 and IncompleteUtf8 errors some more, I've come to the conclusion they need to include the problematic bytes. This nicely handles the case of a partial read when using a &[u8] (a consumer can just stick the bytes in the front of the buffer before reading again), and an InvalidUtf8 could be handled however the consumer sees fit.
🔔 This issue is now entering its final comment period 🔔
@alexcrichton Looks like this was forgotten? Anything that needs to be done here to stabilize this?
I've published a crate called utf8parse which has a dramatically different API for parsing a &[u8] as UTF-8. I don't believe it's a complete solution to the problems discussed here, but perhaps showing it here may be helpful. Let me give you a quick summary of the utf8parse API, and then I'll compare it with the issues being discussed with the Utf8Chars API.
There are two types in utf8parse, a Parser and a Receiver. First, the Receiver type looks like this:
```rust
/// Handles codepoint and invalid sequence events from the parser.
pub trait Receiver {
    /// Called whenever a codepoint is parsed successfully
    fn codepoint(&mut self, c: char);

    /// Called when an invalid sequence is detected
    fn invalid_sequence(&mut self);
}
```
After creating a Parser, it's only got one useful method:
```rust
fn advance<R>(&mut self, receiver: &mut R, byte: u8)
    where R: Receiver
```
One byte is pushed at a time into the parser, and occasionally codepoints/errors are provided to the Receiver. Driving the parser looks like this:
```rust
let bytes = bytes_from_somewhere();
let mut receiver = make_my_receiver();
let mut parser = Parser::new();

for &byte in &bytes {
    parser.advance(&mut receiver, byte);
}
```
Not as nice as the Utf8Chars iterator, but it has some advantages. First, the code driving the parser can know exactly where an error is encountered. For example, it would be possible to keep track of indices when a valid codepoint is encountered and when an invalid codepoint is identified. In this way, it's possible to know the bad bytes. Additionally, if the end of the buffer is reached without receiving a codepoint at the same time, the incomplete bytes would be known based on the byte index of the last valid codepoint.
In summary, this solves including bytes in the Incomplete/InvalidUtf8 error types at the cost of a non-iterator API. It separates out fetching bytes from some Read type and actually parsing those bytes as UTF-8.
Maybe it makes sense for the standard library to include a low level UTF-8 parser like this that can be tailored for different situations - instead of this use-case specific iterator based solution.
@Manishearth that was basically 6 months ago at this point, and unfortunately I've forgotten the context of this in the interim.
cc @rust-lang/libs reviving this
Does this sound correct?:
The issue could be summarised as: how to handle reading chars over a data stream, specifically how to handle both incomplete UTF-8 byte sequences and invalid bytes (w.r.t. UTF-8).
It's hard because when you encounter an incomplete sequence of bytes that may become valid with more data, it could either be an error, or you may just need to read more bytes (it's ambiguous).
If you know it's incomplete, you may want to call Read and try again with the incomplete part prepended, but if it's incomplete because something has errored you want to return the error with the offending bytes.
It seems the consensus is that for error and incomplete bytes, you return a struct that contains an enum variant saying whether it is an error or a possibly incomplete set of bytes, along with the bytes. It's then the responsibility of a higher level iterator how to handle these cases (as not all use cases will want to handle it the same).
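One hypothetical shape for such an error value, sketched with the modern Utf8Error::error_len API to do the classification (names are illustrative, not a concrete proposal from this thread):

```rust
// Which kind of failure occurred, plus the bytes involved, so no
// data is lost and the caller can decide what to do.
#[derive(Debug, PartialEq)]
enum Utf8DecodeError {
    /// Bytes that can never begin or continue a valid sequence.
    Invalid(Vec<u8>),
    /// A prefix that could become valid with more input.
    Incomplete(Vec<u8>),
}

fn classify(bytes: &[u8]) -> Result<(), Utf8DecodeError> {
    match std::str::from_utf8(bytes) {
        Ok(_) => Ok(()),
        Err(e) => {
            let bad = bytes[e.valid_up_to()..].to_vec();
            match e.error_len() {
                Some(_) => Err(Utf8DecodeError::Invalid(bad)),
                // error_len() == None means the input ended mid-char.
                None => Err(Utf8DecodeError::Incomplete(bad)),
            }
        }
    }
}

fn main() {
    assert_eq!(classify(b"ab"), Ok(()));
    assert_eq!(classify(b"\xE2\x82"), Err(Utf8DecodeError::Incomplete(vec![0xE2, 0x82])));
    assert_eq!(classify(b"\xFF"), Err(Utf8DecodeError::Invalid(vec![0xFF])));
}
```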
TL;DR: I think it is very hard to come up with an abstraction that: is zero-cost, covers all use cases, and is not terrible to use.
I’m in favor of deprecating and eventually removing this with no in-std replacement.
I think that anything that looks at one u8 or one char at a time is gonna have abysmal performance. Instead we probably want &str slices that reference fragments of some [u8] buffer.
I spent some time thinking of a low-level API that would make no assumptions about how one would want to use it ("pushing" vs "pulling" bytes and string slices, buffer allocation strategy, error handling, etc.) I came up with this:
```rust
pub fn decode_utf8<'a>(input: &'a [u8]) -> DecodeResult<'a> { /* ... */ }

pub enum DecodeResult<'a> {
    Ok(&'a str),
    /// These three slices cover all of the original input.
    /// `decode` should be called again with the third one as the new input.
    Error(&'a str, InvalidSequence<'a>, &'a [u8]),
    Incomplete(&'a str, IncompleteChar),
}

pub struct InvalidSequence<'a>(pub &'a [u8]);

pub struct IncompleteChar {
    // Fields are private. They include a [u8; 4] buffer.
}

impl IncompleteChar {
    pub fn try_complete<'char, 'input>(&'char mut self, input: &'input [u8])
        -> TryCompleteResult<'char, 'input> { /* ... */ }
}

pub enum TryCompleteResult<'char, 'input> {
    Ok(&'char str, &'input [u8]), // str.chars().count() == 1
    Error(InvalidSequence<'char>, &'input [u8]),
    StillIncomplete,
}
```
It’s complicated. It requires the user to think about a lot of corner cases, especially around IncompleteChar. Explaining how to properly use it takes several paragraphs of docs.
We can hide some of the details with a stateful decoder:
```rust
pub struct Decoder { /* Private. Also with a [u8; 4] buffer. */ }

impl Decoder {
    pub fn new() -> Self;

    pub fn decode<'decoder, 'input>(&'decoder mut self, input: &'input [u8])
        -> DecoderResult<'decoder, 'input>;

    /// Signal that there is no more input.
    /// The decoder might contain a partial `char` which becomes an error.
    pub fn end<'decoder>(&'decoder self) -> Result<(), InvalidSequence<'decoder>>;
}

/// Order of fields indicates order in the input
pub struct DecoderResult<'decoder, 'input> {
    /// str in the `Ok` case is either empty or one `char` (up to 4 bytes)
    pub partially_from_previous_input_chunk: Result<&'decoder str, InvalidSequence<'decoder>>,
    /// Up to the first error, if any
    pub decoded: &'input str,
    /// Whether we did find an error
    pub error: Result<(), InvalidSequence<'input>>,
    /// Call `decoder.decode()` again with this, if non-empty
    pub remaining_input_after_error: &'input [u8],
}

/// Never more than 3 bytes.
pub struct InvalidSequence<'a>(pub &'a [u8]);
```
Even so, it’s very easy to misuse, for example by ignoring part of DecoderResult. Using a tuple instead of a struct makes it more visible when a field is ignored, but then we can’t name them to help explain which is what.
Either of these is complicated enough that I don’t think it belongs in libcore or libstd.
By the way, I’ve submitted #40212 to make something like the above (possibly specialized to one use case) easier to implement outside of std based on std::str::from_utf8. Doing so avoids re-implementing most of the decoding logic, and allows taking advantage of optimizations in std like #30740.
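As a sketch, here is how a lossy decode of a single chunk can be built on std::str::from_utf8 using valid_up_to and error_len, without reimplementing validation (the function name is hypothetical):

```rust
// Lossily decode one chunk of bytes into `out`, substituting U+FFFD
// for each invalid sequence, driven entirely by std's validator.
fn decode_lossy_chunk(mut input: &[u8], out: &mut String) {
    loop {
        match std::str::from_utf8(input) {
            Ok(s) => {
                out.push_str(s);
                return;
            }
            Err(e) => {
                let (valid, rest) = input.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                match e.error_len() {
                    // Invalid sequence of n bytes: replace it and continue.
                    Some(n) => {
                        out.push('\u{FFFD}');
                        input = &rest[n..];
                    }
                    // Incomplete char at end of chunk: a real streaming
                    // decoder would buffer `rest` and await more input.
                    None => {
                        out.push('\u{FFFD}');
                        return;
                    }
                }
            }
        }
    }
}

fn main() {
    let mut s = String::new();
    decode_lossy_chunk(b"a\xE2\x82\x00b", &mut s);
    assert_eq!(s, "a\u{FFFD}\u{0}b");
}
```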
I would support deprecating Read::chars in favor of seeing what the community can produce.
Another attempt turned out almost nice:
```rust
pub struct Decoder {
    buffer: [u8; 4],
    /* ... */
}

impl Decoder {
    pub fn new() -> Self { /* ... */ }
    pub fn next_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
    pub fn last_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
}

pub struct DecoderIter<'a> {
    decoder: &'a mut Decoder,
    /* ... */
}

impl<'a> Iterator for DecoderIter<'a> {
    type Item = Result<&'a str, &'a [u8]>;
}
```
Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator. (It is sometimes backed by buffer in the decoder.) The result needs to borrow the iterator, which means the Iterator trait can’t be implemented, and for loops can’t be used:
```rust
impl<'a> DecoderIter<'a> {
    pub fn next(&mut self) -> Option<Result<&str, &[u8]>> { /* ... */ }
}

let mut iter = decoder.next_chunk(input);
while let Some(result) = iter.next() {
    // ...
}
```
This compiles, but something like String::from_utf8_lossy(&[u8]) -> Cow<str> can’t be implemented on top of it because str fragments always borrow the short-lived decoder (and iterator), not only the original input.
We can work around that by adding enough lifetime parameters and one weird enum… but yeah, no.
```rust
pub struct Decoder { /* ... */ }

impl Decoder {
    pub fn new() -> Self { /* ... */ }
    pub fn next_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
        -> DecoderIter<'decoder, 'input> { /* ... */ }
    pub fn last_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
        -> DecoderIter<'decoder, 'input> { /* ... */ }
}

pub struct DecoderIter<'decoder, 'input> { /* ... */ }

impl<'decoder, 'input> DecoderIter<'decoder, 'input> {
    pub fn next<'buffer>(&'buffer mut self)
        -> Option<Result<EitherLifetime<'buffer, 'input, str>,
                         EitherLifetime<'buffer, 'input, [u8]>>> { /* ... */ }
}

pub enum EitherLifetime<'buffer, 'input, T: ?Sized + 'static> {
    Buffer(&'buffer T),
    Input(&'input T),
}

impl<'buffer, 'input, T: ?Sized> EitherLifetime<'buffer, 'input, T> {
    pub fn get<'a>(&self) -> &'a T where 'buffer: 'a, 'input: 'a {
        match *self {
            EitherLifetime::Input(x) => x,
            EitherLifetime::Buffer(x) => x,
        }
    }
}
```
> Except it doesn’t work. `&'a str` in the result conflicts with `&'a mut Decoder` in the iterator.

Can you elaborate? I don't follow here.
@taralx
- `DecoderIter<'a>` contains `&'a mut Decoder`.
- `Decoder` contains a `[u8; 4]` buffer.
- The `DecoderIter::next` method takes `&'b mut self`. This implies `'a: 'b` (`'a` outlives `'b`).
- Sometimes the returned `&str` is backed by the decoder’s buffer: `std::str::from_utf8_unchecked(&self.decoder.buffer)`.
- Since that borrow goes through `&'b mut self`, its lifetime cannot be longer than `'b`. This conflicts with the `'a` lifetime required by the `Item` type. Even if we could somehow make that borrow last for `'a` (such as with `unsafe`), the `&mut Decoder` in `DecoderIter` would no longer be exclusive, since there would be another borrow of part of it (the buffer).
- We could return `&'b str` instead of `&'a str`, but the `Iterator` trait cannot express that: the return type of `next` is `Option<Self::Item>`, and there is no way to include the lifetime of the iterator itself in the associated type `Item`.

Perhaps it’s clearer with code. This does not compile: https://gist.github.com/anonymous/0587b4484ec9a15f5c5ce6908b3807c1, unless you change `Result<&'a str, &'a [u8]>` to `Result<&str, &[u8]>` (borrowing `self`) and make `next` an inherent method instead of an `Iterator` impl.
I tend to agree that this should be removed from std, though for a slightly different reason: I'd like the interface for "returning a stream of chars from an io::Reader" to work with any encoding, not just UTF-8. If the general purpose of io::Read is to represent a stream of arbitrary bytes, then some other trait should represent a stream of decoded characters (though for completeness, I would probably also have such a trait implement io::Read). This would be similar to how you can chain streams in other languages: Decoder(BufferedReader(FileReader("somefile.txt"))). Unless/until general encoding functionality lives in std, I don't see much benefit in such a half-hearted implementation.
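To make the shape of that idea concrete, here is a minimal sketch of such a trait (the name `ReadChars` and its signature are my assumptions, not a proposed std API); a real version would pair it with a decoder adapter generic over any `R: io::Read` and any encoding:

```rust
use std::io;

// Hypothetical trait for streams of decoded characters, by analogy with
// io::Read representing streams of bytes. Name and shape are assumed.
trait ReadChars {
    // Ok(None) signals end of stream; decoding failures would surface
    // as io::Error.
    fn read_char(&mut self) -> io::Result<Option<char>>;
}

// Toy implementation over an in-memory string, standing in for a chain
// like Decoder(BufferedReader(FileReader("somefile.txt"))).
struct StrSource<'a>(std::str::Chars<'a>);

impl<'a> ReadChars for StrSource<'a> {
    fn read_char(&mut self) -> io::Result<Option<char>> {
        Ok(self.0.next())
    }
}
```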
https://github.com/hsivonen/encoding_rs/issues/8 has some discussion of Unicode stream and decoders for not-only-UTF-8 encodings.
The libs team discussed this and consensus was to deprecate the Read::chars method and its Chars iterator.
@rfcbot fcp close
Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.
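As a rough sketch of that approach (the `push_chunk` helper and its carry-over strategy are mine, and it is far less general than what the utf-8 crate provides), a chunked decoder can push complete characters through and keep an incomplete trailing sequence for the next chunk:

```rust
// Sketch of chunked UTF-8 decoding built on str::from_utf8, valid_up_to
// and error_len. `push_chunk` is a hypothetical helper, not a std API.
// On success, decoded text is appended to `out` and any incomplete
// trailing sequence stays in `carry` for the next chunk. On invalid
// input, the number of definitely-invalid bytes is returned.
fn push_chunk(carry: &mut Vec<u8>, chunk: &[u8], out: &mut String) -> Result<(), usize> {
    carry.extend_from_slice(chunk);
    match std::str::from_utf8(&carry[..]) {
        Ok(s) => {
            out.push_str(s);
            carry.clear();
            Ok(())
        }
        Err(e) => {
            let valid = e.valid_up_to();
            // The prefix up to `valid` is known-good UTF-8.
            out.push_str(std::str::from_utf8(&carry[..valid]).unwrap());
            match e.error_len() {
                // None: the chunk ends mid-character; keep the tail and
                // try to complete it with the next chunk.
                None => {
                    carry.drain(..valid);
                    Ok(())
                }
                // Some(n): n bytes are invalid no matter what follows.
                Some(n) => Err(n),
            }
        }
    }
}
```

Splitting "é" (0xC3 0xA9) across two chunks then decodes cleanly, while a stray 0xFF is reported as invalid immediately.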
Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.
Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams:
No concerns currently listed.
Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
:bell: This is now entering its final comment period, as per the review above. :bell:
The final comment period is now complete.
Deprecated in https://github.com/rust-lang/rust/pull/49970
Most helpful comment
After thinking on the InvalidUtf8 and IncompleteUtf8 errors some more, I've come to the conclusion they need to include the problematic bytes. This nicely handles the case of a partial read when using a &[u8] (a consumer can just stick the bytes in the front of the buffer before reading again), and an InvalidUtf8 could be handled however the consumer sees fit.
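A sketch of what that could look like (the enum and `decode` helper below are hypothetical; the comment above only names the InvalidUtf8 and IncompleteUtf8 errors): each error variant carries the bytes it refers to, so an incomplete tail can be pushed back in front of the next read.

```rust
// Hypothetical error type for illustration: each variant carries the
// problematic bytes, as the comment above suggests.
#[derive(Debug, PartialEq)]
enum Utf8DecodeError {
    IncompleteUtf8(Vec<u8>), // input ended mid-character; retry with more data
    InvalidUtf8(Vec<u8>),    // bytes that can never start a valid character
}

// Decode a whole slice, reporting the offending bytes on failure.
fn decode(input: &[u8]) -> Result<&str, Utf8DecodeError> {
    match std::str::from_utf8(input) {
        Ok(s) => Ok(s),
        Err(e) => {
            let rest = &input[e.valid_up_to()..];
            match e.error_len() {
                None => Err(Utf8DecodeError::IncompleteUtf8(rest.to_vec())),
                Some(n) => Err(Utf8DecodeError::InvalidUtf8(rest[..n].to_vec())),
            }
        }
    }
}
```

A consumer that gets `IncompleteUtf8(tail)` can prepend `tail` to its buffer before the next read, which is exactly the recovery described above.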