This is a tracking issue for the deprecated std::io::Read::chars API.
It would be nice if std::io::Chars was an Iterator<Item=char>, just like std::str::Chars. I don't have a proposal for how decoding errors should be handled, though.
If we want to make std::io::Chars a simple iterator on char, a solution is to have it return None on a UTF-8 error, and set an error flag in the Chars itself (with an associated was_valid_utf8() method or something). An empty sequence is considered valid UTF-8.
It does make error detection slightly less convenient, as writing for c in reader.chars() does not keep the Chars for later verification.
Here is an example use-case where we try to recover after an invalid sub-sequence:
```rust
let mut chars = reader.chars();
loop {
    for c in chars.by_ref() {
        // ...
    }
    if chars.was_valid_utf8() { break; }
    println!("Encountered invalid byte sequence.");
}
```
Or it could provide a more informative message, similar to the current CharsError.
Maybe this could apply to the other adapters as well? Or is this a bad pattern?
On the other hand, it is not so difficult to treat the Result item explicitly, or to wrap the current Chars to get the behavior I described (an unwrapper wrapping, interesting notion), so maybe the current situation is acceptable as is.
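For illustration, such an "unwrapper" wrapper could be sketched like this (all names hypothetical; it works for any iterator of `Result<char, E>`, not just the current `Chars`):

```rust
// Hypothetical sketch of the "unwrapper wrapping" idea: adapt any
// Iterator<Item = Result<char, E>> into a plain char iterator that
// stops at the first error and records it for later inspection.
struct UnwrapChars<I, E> {
    inner: I,
    error: Option<E>,
}

impl<I, E> UnwrapChars<I, E> {
    fn new(inner: I) -> Self {
        UnwrapChars { inner, error: None }
    }

    // Plays the role of the was_valid_utf8() method discussed above.
    fn was_valid(&self) -> bool {
        self.error.is_none()
    }
}

impl<I, E> Iterator for UnwrapChars<I, E>
where
    I: Iterator<Item = Result<char, E>>,
{
    type Item = char;

    fn next(&mut self) -> Option<char> {
        match self.inner.next() {
            Some(Ok(c)) => Some(c),
            Some(Err(e)) => {
                // Remember the error and end the iteration.
                self.error = Some(e);
                None
            }
            None => None,
        }
    }
}

fn main() {
    let items: Vec<Result<char, ()>> = vec![Ok('a'), Ok('b'), Err(()), Ok('c')];
    let mut chars = UnwrapChars::new(items.into_iter());
    let decoded: String = chars.by_ref().collect();
    assert_eq!(decoded, "ab");
    assert!(!chars.was_valid());
}
```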
Nominating for 1.6 discussion
🔔 This issue is now entering its cycle-long final comment period for deprecation 🔔
These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.
.chars() is a nice operation to have, it'd be a shame to have to pull in a separate crate just to get such a useful iterator. The error type doesn't seem problematic; you'd have to handle an io::Error anyways.
Perhaps a .chars_lossy() iterator that yields the UTF-8 replacement character on a UTF-8 error and stops on the first io::Error.
I’m in favor of stabilizing Read::chars eventually, but it’s not ready yet:
It’s unstable because:

> the semantics of a partial read/write of where errors happen is currently unclear and may change
(The same would apply to chars_lossy.)
This behavior should be per Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630
Roughly, that means stopping at the first unexpected byte. This is not the behavior currently implemented, which reads as many bytes as indicated by the first byte and then checks them. This is a problem as, with only Read (as opposed to, say, BufRead), you can’t put a byte "back" in the stream after reading it.
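For reference, the maximal-subpart boundaries on the first example below can be observed with the modern `Utf8Error` API (`error_len` did not exist at the time of this comment):

```rust
// error_len() reports how many bytes form the invalid sequence that a
// best-practice ("maximal subpart") decoder would replace with U+FFFD.
fn main() {
    let e = std::str::from_utf8(b"\xf0\x9fabc").unwrap_err();
    assert_eq!(e.valid_up_to(), 0);
    // Only the two bytes \xf0\x9f are rejected; decoding resumes at 'a',
    // yielding "\u{FFFD}abc" under lossy decoding.
    assert_eq!(e.error_len(), Some(2));
    assert_eq!(String::from_utf8_lossy(b"\xf0\x9fabc"), "\u{FFFD}abc");
}
```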
Here are some failing tests.
```rust
let mut buf = Cursor::new(&b"\xf0\x9fabc"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert_eq!(chars.next().unwrap().unwrap(), 'b');
assert_eq!(chars.next().unwrap().unwrap(), 'c');
assert!(chars.next().is_none());

let mut buf = Cursor::new(&b"\xed\xa0\x80a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none());

let mut buf = Cursor::new(&b"\xed\xa0a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none());
```
I’ve looked at fixing this, but it basically involves duplicating all of the UTF-8 decoding logic from str::from_utf8, which I’m not really happy with. (That many more tests would need to be added.) I’ll try to think of some way to have a more basic decoder that can be used by both.
Moving chars to BufRead does not seem unreasonable.
@sfackler I concur. I was thinking the exact same thing. In fact, it is unfortunate that Read::bytes() is already stabilized because, like chars(), it is almost always preferable to have it on a buffered source. A lot of the Read types really do not tolerate small, frequent reads well (#28073).
While I agree that chars is useful, questions like those brought up by @SimonSapin make me think that it's too weighty/controversial to be in the standard library. I'm not sure that this is really ever going to be seriously used either due to the performance concerns, and I agree that bytes is a bit of a misnomer in that it's super non-performant much of the time as well.
How does bytes perform when used with BufReader<_>?
@SimonSapin there were some perf numbers reading from a file with bytes in #28073. The difference is pretty significant.
I think chars would work pretty well if implemented on top of a buffer. It doesn't have to consume any bytes it doesn't need to, so no unnecessary loss of data.
I may be doing something hugely inefficient here, but I really needed Read::chars for a crate I was working on that attempted to produce an iterator of rustc-serialize JSON items from a Read stream. A while back I had to fall back on vendoring most of the source around Read::chars as a workaround to get it working on stable. It was almost a requirement, as the rustc-serialize builder for JSON requires an iterable of chars. It would be really nice to have this stabilized, even if under some method name that implied lossiness.
@SimonSapin
I would expect it to perform much better than on a raw File because the number of syscalls issued is far less, but on the other hand it is likely much slower than another equivalent operation such as mmap + iterate the bytes via a slice.
@cybergeek94
Yeah I suspect the performance isn't _abysmal_, but when compared to str::chars it's likely still much slower (just more stuff that has to be checked during I/O)
@softprops
Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string". For example a json deserializer may want to take an iterator of characters, but the actual iterator does something like BufReader where it reads a huge string, then literally uses str::chars, only reading a new string once it's exhausted (handling the boundary).
@alexcrichton
Wouldn't a chars method on a BufReader-backed Read serve this exact purpose? It reads a big slice of bytes, hopefully in UTF-8 format, which is exactly what a string would be; shouldn't it be the same then to iterate chars on these bytes compared to using str::chars?
> Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string"
In this case it's literally a stream of JSON I'm working with. My use case was interfacing with Docker's streaming JSON endpoints, in which case JSON objects are pushed through a streamed response. I'm not sure how I'd accomplish that with a string.
@alexcrichton Handling the boundary is tricky. It’s handled in https://github.com/SimonSapin/rust-utf8/blob/master/lib.rs
@Gyscos hm yes, I guess it would! That would definitely mean that chars could only be on BufReader.
@softprops in theory you can transform an iterator of &str slices into an iterator of char values, which is kinda what BufReader would be doing
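A minimal sketch of that transformation (assuming the chunks are already split on code point boundaries, which is the hard part a real reader must handle):

```rust
// Flatten an iterator of &str chunks into an iterator of chars.
fn main() {
    let chunks = vec!["hel", "lo ", "wörld"];
    let decoded: String = chunks.iter().flat_map(|s| s.chars()).collect();
    assert_eq!(decoded, "hello wörld");
}
```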
So a solution is to move chars to BufRead, as @sfackler mentioned. I was worried this would force needlessly wrapping some Readers that didn't _need_ it, but I just noticed Cursor<Vec<u8>> and Stdin already provide BufRead access.
But all of that is about the performance aspect.
Concerning the error handling, BufRead::lines returns an iterator over io::Result<String> to properly propagate errors, including invalid UTF-8 (by returning an ErrorKind::InvalidData). Couldn't chars do the same, instead of using a custom error?
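For reference, this is how lines surfaces invalid UTF-8 today (sketched with an in-memory cursor):

```rust
use std::io::{BufRead, Cursor, ErrorKind};

// BufRead::lines reports invalid UTF-8 as an io::Error with
// ErrorKind::InvalidData instead of a custom error type.
fn main() {
    let cur = Cursor::new(&b"ok\n\xFF\xFF\n"[..]);
    let mut lines = cur.lines();
    assert_eq!(lines.next().unwrap().unwrap(), "ok");
    // The second line is not valid UTF-8.
    let err = lines.next().unwrap().unwrap_err();
    assert_eq!(err.kind(), ErrorKind::InvalidData);
}
```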
Yeah I have a feeling that would be sufficient, there's more fine-grained error information we can give (such as the bytes that were read, if any), but iterators like lines are already lossy where you lose access to the underlying data if there's an error, so it's probably not too bad.
It does raise a question though that if chars exists only on BufRead for performance reasons, why does bytes not exist only on BufRead? Just a minor inconsistency.
I think the reasoning to move chars to BufRead is for correctness wrt @SimonSapin's comment - to be able to peek at the next byte without consuming it.
The performance issue being alleviated is only a beneficial side-effect. bytes doesn't require a buffer for correctness.
Hm I don't think that this is a correctness problem that can be solved by "just moving to BufRead", it's always a possibility that the buffer is 1 byte in size so it behaves the same as reading one byte at a time. Basically BufRead just reduces the number of calls to read, but it essentially always has the same problems wrt partial reads, short reads, and consuming data.
@SimonSapin was saying that BufReader could potentially put things back in the buffer, but it's not actually exposed by the BufRead trait.
You don't put things back in the buffer, you just don't call consume on things you don't want to, right?
> it's always a possibility that the buffer is 1 byte in size so it behaves the same as reading one byte at a time.
If only BufRead had an API to force an early read into the buffer... but I digress. I closed that RFC because I didn't have the time or the energy to argue for it.
The iterator can always consume the last bytes out of the buffer to force it to read more. If the next read is too short (a partial read returning < 4 bytes? honestly, does that even happen?), it can be considered an error just like if we hit EOF while expecting more bytes.
Compared to a Read, the BufRead makes it at least _possible_ to be correct: since it should only consume the longest correct subsequence, it could read the bytes one by one and consume them only when they keep the subsequence valid. Only, as @SimonSapin said, this would mean duplicating some code for the checks.
So it _is_ possible to implement the correct behaviour for BufRead (while it is not for Read), but it is inconvenient if the reader has too short a buffer, or only returns too short reads.
The too-short read problem is independent of using a BufRead or a Read. The underlying Read could return data 1 byte at a time itself if it was feeling particularly pathological. A BufReader that only has 1 byte remaining in its buffer is only going to return that one byte in the next call.
I'm not sure I understand the too-short read problem.
If the reader stops returning anything in the middle of a codepoint, it's obviously an error.
If it doesn't, we can read the bytes from the BufRead one by one, and consume it until we find an invalid byte (that we don't consume). This works even with a BufReader using a 1-byte buffer.
This is completely impossible with a simple Read, since we must consume each byte as we read it, preventing us from leaving the first invalid byte untouched.
In this regard, BufRead vs Read does change what is possible to achieve.
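A small sketch of the peek-then-consume pattern that BufRead makes possible (the helper name is hypothetical): we look at the next byte via fill_buf() and only consume() it if we accept it, so a rejected byte stays in the stream.

```rust
use std::io::{BufRead, Cursor};

// Return the next byte only if it satisfies `pred`; otherwise leave
// it unconsumed in the stream.
fn next_byte_if<R: BufRead>(
    r: &mut R,
    pred: impl Fn(u8) -> bool,
) -> std::io::Result<Option<u8>> {
    let accepted = {
        let buf = r.fill_buf()?; // peek without consuming
        match buf.first().copied() {
            Some(b) if pred(b) => Some(b),
            _ => None,
        }
    };
    if let Some(b) = accepted {
        r.consume(1); // only now is the byte removed from the stream
        return Ok(Some(b));
    }
    Ok(None)
}

fn main() -> std::io::Result<()> {
    let mut cur = Cursor::new(&b"ab"[..]);
    assert_eq!(next_byte_if(&mut cur, |b| b == b'a')?, Some(b'a'));
    // 'b' fails the predicate, so it is left in the stream.
    assert_eq!(next_byte_if(&mut cur, |b| b == b'a')?, None);
    assert_eq!(next_byte_if(&mut cur, |b| b == b'b')?, Some(b'b'));
    Ok(())
}
```

This is exactly what plain Read cannot offer: its read() hands bytes over and they are gone.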
Yeah, I was commenting on "but it is inconvenient if the reader has too short a buffer, or only returns too short reads." That's just a thing you have to deal with.
I see.
To recap, concerning Chars, the options are (apart from taking it out of tree):
1. Move it to BufRead in order to force a new read.
2. Get the validation logic, either by:
   i. duplicating the code from str::from_utf8 (actually str::run_utf8_validation_iterator) and using it from here, or
   ii. sharing the code with str, by:
      a. exposing it from the str module (making it public?) - will need some modification in the str module
      b. moving the Chars functionality to the str module (it is basically a streamed utf-8 decoder, could be more general than just Read?)

Note that not following the UTF-8 best practices means BufRead does not add any correctness, "just" some performance improvements.
I like option 2.ii.a, but maybe it's too much of an impact for this Chars problem?
Given the number of details here, I think stabilizing a chars method on BufRead/Read should probably require an RFC, especially if we start messing with the str module API. It seems like that makes it (and the other methods in this tracking issue) an ideal candidate to evolve out of tree.
The libs team discussed this during triage today and the conclusion was to deprecate tee and broadcast while leaving chars unstable. Further action there can hold off on an RFC.
> These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.
So is there a crate? Sorry if I missed one.
Ah not yet as there didn't seem to be that much desire for them, but I could certainly push up a crate if one is desired!
I just published tee and broadcast as separate crates. I have used both std lib APIs in different projects, but never together, so I thought it would be nice to have the same functionality in smaller, more focused crates: https://github.com/softprops/tee https://github.com/softprops/broadcast
Thanks @softprops!
@softprops Great, thanks!
Moving out of final-comment-period and the title is now updated to indicate that this issue is solely related to Read::chars now.
I think moving chars to BufRead makes a lot of sense. It does require reimplementing str::from_utf8 logic to implement it properly, which is unfortunate, but there's not really a way around it.
Without chars, you can get the bytes from a BufRead and call str::from_utf8, but it will check the data all at once instead of reading out chars until it hits an error. You'd have to call valid_up_to and retry the conversion in order to handle the bad bytes.
I'm not clear on the intended semantics of chars here. Is it:

1. get stuck at the first invalid byte without consuming anything, or
2. consume the valid part and get stuck only at the invalid bytes?
@taralx If "get stuck" means return an error, then the desired behaviour is the second option: read (consume) the valid part, but leave the invalid part. That's why a BufRead is required, to peek at the (potentially invalid) byte before consuming it.
I don't believe the implementation reads the "valid part" of the char; if there isn't a complete UTF-8 code point, nothing is read.
Sorry, I was talking about the intended semantics, not the current behaviour.
Currently, Chars always reads as many bytes as the first byte indicates, and returns an error if they aren't a valid UTF-8 sequence.
Yes, "get stuck" means returning an error. The difficulty is that BufRead has limited lookahead -- you're pretty much always getting 1 character of lookahead, but no more without the ability to force buffer fills.
Yes, this was talked about a few comments above: 1 byte of read ahead is enough to stop reading in time (only consume the byte if the sub-sequence is still valid). It does, however, require additional code to validate partial UTF-8 sequences.
This code exists more or less in the str module (here), but is:
So the solution would be to either:

- duplicate it in io::Chars, or
- expose it from str.

Both also require Chars to use a BufRead instead of a Read.
> 1 byte of read ahead is enough to stop reading in time
I'm not sure I agree, based on the code you linked to. I can see codepaths that will result in the iterator being rewound by more than 1 byte in the event of an error.
Indeed, I was mistaken, this code also tries to read all the bytes before attempting decoding.
And the methods taking iterators (like next_code_point) simply ignore errors in the continuation bytes.
So I guess the code will have to be written from scratch.
I'm not sure it's possible to do it with only 1 byte of lookahead. You need to be able to detect sequences like E2 82 00 (illegal, but only after you see the 00) without consuming any of those bytes.
We're allowed to consume any byte if it forms a potentially valid subsequence. So if there is a byte XY such that E2 82 XY is valid, then consuming E2 82 is allowed. By definition, this means a 1-byte lookahead is enough.
Okay, so it's the second one above: We error out on a bad character, reading the valid part but leaving the invalid part.
When emitting '\u{FFFD}' replacement characters for decoding errors, reading 1 byte at a time (or behaving as if) is the behavior recommended by Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630
This is in case the first byte unexpected in a sequence is the start of another sequence. In the example above, b"\xE2\x82\x00" should decode to 2 code points "\u{FFFD}\0".
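For what it's worth, std's lossy decoder today agrees with that expectation:

```rust
// E2 82 is replaced by a single U+FFFD, and the 0x00 byte that
// interrupted the sequence is decoded normally: two code points total.
fn main() {
    let decoded = String::from_utf8_lossy(b"\xE2\x82\x00");
    assert_eq!(decoded, "\u{FFFD}\u{0}");
    assert_eq!(decoded.chars().count(), 2);
}
```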
IMO 1 byte lookahead is not enough to guarantee that no data is lost in case of error.
Consuming valid subsequences should _not_ be allowed because if it continues with an invalid byte, the formerly valid subsequence is invalidated and there's no possibility to "unconsume" it.
If anyone is interested, I've implemented this and other changes to the std::io-traits in my crate netio.
The library provides stronger guarantees than std::io, i.e. it never allows to lose any data in case of error.
Feel free to copy anything you like.
However, there's still one unresolved issue: it's not possible to tell from std::str::Utf8Error whether the error is due to an invalid sequence or just incomplete data. My goal is to return std::io::ErrorKind::InvalidData in the former case and std::io::ErrorKind::UnexpectedEof in the latter, which is currently not possible without reimplementing UTF-8 decoding, which I don't want to do.
I've filed an issue for this (#32584). It's similar to #12113 but not exactly the same. #12113 is not an issue for me, because in netio invalid or incomplete UTF-8 is never consumed and always left in the stream.
I give this opinion as a casual user of chars() rather than someone that understands unicode in detail. That being said....
It would be good to have an iterator that behaved like chars() in the str module. I think making io::Read::chars() behave like that would mean it doesn't fit in with the rest of the io module (errors and Option/Result). If this method were called, for example, chars_lossy, then it would be very helpful to have a note at the start of the chars() documentation explaining that those who do not care about errors and just want to crash on them should use chars_lossy (otherwise newbies like me get confused and have to read a really long thread to understand the issues :P)
> It would be good to have an iterator that behaved like chars() in the str module.
That is not possible. The contents of a &str are guaranteed to be well-formed UTF-8, so there never is a decoding error. The bytes from Read are arbitrary.
I think Read::chars should be renamed for that reason, by the way: #33761
> If this method was called for example chars_lossy
String::from_utf8_lossy is an existing function that replaces ill-formed byte sequences with a \u{FFFD} replacement character, which looks like this: �. It would be possible to add a separate Read::utf8_chars_lossy method, or a lossy method on the return value of Read::utf8_chars that does the same.
> those who do not care about errors and just want to crash on them should use chars_lossy
No! Panicking on ill-formed byte sequences is very different from "lossy" decoding, which silently replaces them with �.
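A quick illustration of the difference:

```rust
// Strict decoding errors out (and unwrap() would panic), while
// lossy decoding silently substitutes U+FFFD and keeps going.
fn main() {
    let bytes = b"\xFFabc";
    assert!(std::str::from_utf8(bytes).is_err());
    assert_eq!(String::from_utf8_lossy(bytes), "\u{FFFD}abc");
}
```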
> I think Read::chars should be renamed for that reason, by the way:
(I meant this to link to #33761. Edited.)
@SimonSapin apologies I probably wrote that comment late at night.
I agree with what you said, that was what I was trying to say. To summarise:
- a Read::utf8_chars_lossy that won't error but will insert \u{FFFD}
- a BufRead::chars whose Item is Result<char, _>, and which doesn't consume any malformed bytes
- a Read::chars whose Item is Result<char, _>, and whose Error type contains any consumed bytes from the stream not used in a char
- a note on Read explaining the choice

Do I understand you correctly?
Your BufRead::chars would still need to return a partial character in the error part. See above about lookahead.
@taralx What I mean is something like the following:
```rust
enum Error {
    // ...
    MalformedCharError([u8; 4]),
}
```
or something similar handling the fact that the size is unknown.
I don’t think the API needs to change at all: one fn chars(self) -> Chars<Self> default method on the std::io::Read trait, where Chars<R: Read>: Iterator<Item = Result<char, CharsError>>. There is no need for CharsError to include the ill-formed byte sequence, what matters is not consuming too much of the input to avoid dropping valid code points. I believe that means reading one byte at a time.
Since we’re reading one byte at a time, there is no need to involve the BufRead trait. The docs can recommend using a BufReader wrapper type if small reads in the underlying stream are expensive (e.g. in File and TcpStream, each read is a syscall and so a context switch). But &[u8] is fine, for example, even though it doesn’t implement the BufRead trait.
chars_lossy is relatively easy to build on top of chars:
```rust
stream.chars().map(|result| match result {
    Ok(c) => Ok(c),
    Err(CharsError::NotUtf8) => Ok(std::char::REPLACEMENT_CHARACTER),
    Err(CharsError::Other(e)) => Err(e),
})
```
Note that for std::char::decode_utf16 we chose not to include lossy decoding, since it’s even easier for users: .map(|result| result.unwrap_or(std::char::REPLACEMENT_CHARACTER))
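For illustration, that pattern with std::char::decode_utf16:

```rust
// 0xD800 is an unpaired surrogate, so it decodes to an error that
// we replace lossily with U+FFFD.
fn main() {
    let units = [0x0048u16, 0xD800, 0x0069];
    let s: String = std::char::decode_utf16(units.iter().cloned())
        .map(|result| result.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect();
    assert_eq!(s, "H\u{FFFD}i");
}
```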
I’ll prepare a PR now.
> Since we’re reading one byte at a time, there is no need to involve the BufRead trait.
@SimonSapin I don't agree with that.
By that argument, all current methods of BufRead could live in Read. None of them needs BufRead for correctness.
At least the way they are implemented now. I'd prefer better guarantees, but that's probably out of scope.
The docs can recommend using a BufReader wrapper type if small reads in the underlying stream are expensive (e.g. in File and TcpStream, each read is a syscall and so a context switch). But &[u8] is fine, for example, even though it doesn’t implement the BufRead trait.
Huh? impl<'a> BufRead for &'a [u8] exists.
@SimonSapin You need lookahead. How will you handle the following byte sequence without it? E2 82 E2 82 82
> Huh? impl<'a> BufRead for &'a [u8] exists.
Uh, never mind that part, I misread rustdoc.
@taralx Yes, I realized while implementing #33801 that the Utf8Chars iterator needs to hold a Option<u8> buffer: https://github.com/rust-lang/rust/pull/33801/commits/7ab345098625cf4e3b7f930e0ae003fda00da166#diff-668f8f358d4a93474b396dcb3727399eR1558
It might be better to require BufRead after all. I’ll try that in an additional commit in #33801 to see what it looks like.
Note that the algorithm still has to read one byte at a time: BufRead::fill_buf is not guaranteed to give "new" bytes until the current buffer is entirely consumed (or at least in BufReader it doesn’t), and the byte sequence for a single code point might span across a buffer boundary.
I’ve added a commit to #33801.
🔔 This issue is now entering its final comment period 🔔
The libs team was unfortunately a little late on stabilizations this cycle, but we're thinking of probably doing a backport of stabilizations partway through the beta cycle. The API we're considering for stabilization is the one proposed by @SimonSapin in https://github.com/rust-lang/rust/pull/33801. If we do decide to accept that PR we'll probably have a new FCP next cycle for actual stabilization but if we don't accept it we'll probably just deprecate the feature.
Regardless, discussion on the PR is more than welcome!
I still have one major objection to #33801 that I've noted in that PR.
My concern with #33801 relates to partial reads. As it stands, there's no way to know how much data was consumed in the case of a Read source like &[u8]. Some ways to fix that are: returning the number of consumed bytes in Utf8Error::IncompleteUtf8 (not great), returning the number of bytes left over in that variant, or returning the specific bytes left over in that variant.
After thinking on InvalidUtf8 and IncompleteUtf8 errors some more, I've come to the conclusion they need to include the problematic bytes. This nicely handles the case of a partial read when using a &[u8] (a consumer can just stick the bytes in the front of the buffer before reading again), and an InvalidUtf8 could be handled however the consumer sees fit.
🔔 This issue is now entering its final comment period 🔔
@alexcrichton Looks like this was forgotten? Anything that needs to be done here to stabilize this?
I've published a crate called utf8parse which has a dramatically different API for parsing a &[u8] as UTF-8. I don't believe it's a complete solution to the problems discussed here, but perhaps showing it here may be helpful. Let me give you a quick summary of the utf8parse API, and then I'll compare it with the issues being discussed with the Utf8Chars API.
There are two types in utf8parse, a Parser and a Receiver. First, the Receiver type looks like this:
```rust
/// Handles codepoint and invalid sequence events from the parser.
pub trait Receiver {
    /// Called whenever a codepoint is parsed successfully
    fn codepoint(&mut self, c: char);

    /// Called when an invalid sequence is detected
    fn invalid_sequence(&mut self);
}
```
After creating a Parser, it's only got one useful method:
```rust
fn advance<R>(&mut self, receiver: &mut R, byte: u8)
    where R: Receiver
```
One byte is pushed at a time into the parser, and occasionally codepoints/errors are provided to the Receiver. Driving the parser looks like this:
```rust
let bytes = bytes_from_somewhere();
let mut receiver = make_my_receiver();
let mut parser = Parser::new();

for &byte in &bytes {
    parser.advance(&mut receiver, byte);
}
```
Not as nice as the Utf8Chars iterator, but it has some advantages. First, the code driving the parser can know exactly where an error is encountered. For example, it would be possible to keep track of indices when a valid codepoint is encountered and when an invalid codepoint is identified. In this way, it's possible to know the bad bytes. Additionally, if the end of the buffer is reached without receiving a codepoint at the same time, the incomplete bytes would be known based on the byte index of the last valid codepoint.
In summary, this solves including bytes in the Incomplete/InvalidUtf8 error types at the cost of a non-iterator API. It separates out fetching bytes from some Read type and actually parsing those bytes as UTF-8.
Maybe it makes sense for the standard library to include a low level UTF-8 parser like this that can be tailored for different situations - instead of this use-case specific iterator based solution.
@Manishearth that was basically 6 months ago at this point, and unfortunately I've forgotten the context of this in the interim.
cc @rust-lang/libs reviving this
Does this sound correct?:
The issue could be summarised as: how to handle reading chars over a data stream, specifically how to handle both incomplete UTF-8 byte sequences and invalid bytes (w.r.t. UTF-8).
It's hard because when you encounter an incomplete sequence of bytes that may become valid with more data, it could either be an error, or you may just need to read more bytes (it's ambiguous).
If you know it's incomplete, you may want to call Read and try again with the incomplete part prepended, but if it's incomplete because something has errored you want to return the error with the offending bytes.
It seems the consensus is that for error and incomplete bytes, you return a struct that contains an enum variant saying whether it is an error or a possibly incomplete set of bytes, along with the bytes. It's then the responsibility of a higher level iterator how to handle these cases (as not all use cases will want to handle it the same).
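One hypothetical shape for such an error value, sketched with the modern Utf8Error::error_len API to do the classification (names are illustrative, not a concrete proposal from this thread):

```rust
// Which kind of failure occurred, plus the bytes involved, so no
// data is lost and the caller can decide what to do.
#[derive(Debug, PartialEq)]
enum Utf8DecodeError {
    /// Bytes that can never begin or continue a valid sequence.
    Invalid(Vec<u8>),
    /// A prefix that could become valid with more input.
    Incomplete(Vec<u8>),
}

fn classify(bytes: &[u8]) -> Result<(), Utf8DecodeError> {
    match std::str::from_utf8(bytes) {
        Ok(_) => Ok(()),
        Err(e) => {
            let bad = bytes[e.valid_up_to()..].to_vec();
            match e.error_len() {
                Some(_) => Err(Utf8DecodeError::Invalid(bad)),
                // error_len() == None means the input ended mid-char.
                None => Err(Utf8DecodeError::Incomplete(bad)),
            }
        }
    }
}

fn main() {
    assert_eq!(classify(b"ab"), Ok(()));
    assert_eq!(classify(b"\xE2\x82"), Err(Utf8DecodeError::Incomplete(vec![0xE2, 0x82])));
    assert_eq!(classify(b"\xFF"), Err(Utf8DecodeError::Invalid(vec![0xFF])));
}
```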
TL;DR: I think it is very hard to come up with an abstraction that: is zero-cost, covers all use cases, and is not terrible to use.
I’m in favor of deprecating and eventually removing this with no in-std replacement.
I think that anything that looks at one u8 or one char at a time is gonna have abysmal performance. Instead we probably want &str slices that reference fragments of some [u8] buffer.
I spent some time thinking of a low-level API that would make no assumptions about how one would want to use it ("pushing" vs "pulling" bytes and string slices, buffer allocation strategy, error handling, etc.) I came up with this:
```rust
pub fn decode_utf8<'a>(input: &'a [u8]) -> DecodeResult<'a> { /* ... */ }

pub enum DecodeResult<'a> {
    Ok(&'a str),
    /// These three slices cover all of the original input.
    /// `decode` should be called again with the third one as the new input.
    Error(&'a str, InvalidSequence<'a>, &'a [u8]),
    Incomplete(&'a str, IncompleteChar),
}

pub struct InvalidSequence<'a>(pub &'a [u8]);

pub struct IncompleteChar {
    // Fields are private. They include a [u8; 4] buffer.
}

impl IncompleteChar {
    pub fn try_complete<'char, 'input>(&'char mut self, input: &'input [u8])
        -> TryCompleteResult<'char, 'input> { /* ... */ }
}

pub enum TryCompleteResult<'char, 'input> {
    Ok(&'char str, &'input [u8]), // str.chars().count() == 1
    Error(InvalidSequence<'char>, &'input [u8]),
    StillIncomplete,
}
```
It’s complicated. It requires the user to think about a lot of corner cases, especially around IncompleteChar. Explaining how to properly use it takes several paragraphs of docs.
We can hide some of the details with a stateful decoder:
```rust
pub struct Decoder { /* Private. Also with a [u8; 4] buffer. */ }

impl Decoder {
    pub fn new() -> Self;

    pub fn decode<'decoder, 'input>(&'decoder mut self, input: &'input [u8])
        -> DecoderResult<'decoder, 'input>;

    /// Signal that there is no more input.
    /// The decoder might contain a partial `char` which becomes an error.
    pub fn end<'decoder>(&'decoder self) -> Result<(), InvalidSequence<'decoder>>;
}

/// Order of fields indicates order in the input
pub struct DecoderResult<'decoder, 'input> {
    /// str in the `Ok` case is either empty or one `char` (up to 4 bytes)
    pub partially_from_previous_input_chunk: Result<&'decoder str, InvalidSequence<'decoder>>,
    /// Up to the first error, if any
    pub decoded: &'input str,
    /// Whether we did find an error
    pub error: Result<(), InvalidSequence<'input>>,
    /// Call `decoder.decode()` again with this, if non-empty
    pub remaining_input_after_error: &'input [u8],
}

/// Never more than 3 bytes.
pub struct InvalidSequence<'a>(pub &'a [u8]);
```
Even so, it’s very easy to misuse, for example by ignoring part of DecoderResult. Using a tuple instead of a struct makes it more visible when a field is ignored, but then we can’t name them to help explain which is what.
Either of these is complicated enough that I don’t think it belongs in libcore or libstd.
By the way, I’ve submitted #40212 to make something like the above (possibly specialized to one use case) easier to implement outside of std based on std::str::from_utf8. Doing so avoids re-implementing most of the decoding logic, and allows taking advantage of optimizations in std like #30740.
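As a sketch, here is how a lossy decode of a single chunk can be built on std::str::from_utf8 using valid_up_to and error_len, without reimplementing validation (the function name is hypothetical):

```rust
// Lossily decode one chunk of bytes into `out`, substituting U+FFFD
// for each invalid sequence, driven entirely by std's validator.
fn decode_lossy_chunk(mut input: &[u8], out: &mut String) {
    loop {
        match std::str::from_utf8(input) {
            Ok(s) => {
                out.push_str(s);
                return;
            }
            Err(e) => {
                let (valid, rest) = input.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                match e.error_len() {
                    // Invalid sequence of n bytes: replace it and continue.
                    Some(n) => {
                        out.push('\u{FFFD}');
                        input = &rest[n..];
                    }
                    // Incomplete char at end of chunk: a real streaming
                    // decoder would buffer `rest` and await more input.
                    None => {
                        out.push('\u{FFFD}');
                        return;
                    }
                }
            }
        }
    }
}

fn main() {
    let mut s = String::new();
    decode_lossy_chunk(b"a\xE2\x82\x00b", &mut s);
    assert_eq!(s, "a\u{FFFD}\u{0}b");
}
```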
I would support deprecating Read::chars in favor of seeing what the community can produce.
Another attempt turned out almost nice:
```rust
pub struct Decoder {
    buffer: [u8; 4],
    /* ... */
}

impl Decoder {
    pub fn new() -> Self { /* ... */ }
    pub fn next_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
    pub fn last_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
}

pub struct DecoderIter<'a> {
    decoder: &'a mut Decoder,
    /* ... */
}

impl<'a> Iterator for DecoderIter<'a> {
    type Item = Result<&'a str, &'a [u8]>;
}
```
Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator. (It is sometimes backed by buffer in the decoder.) The result needs to borrow the iterator, which means the Iterator trait can’t be implemented, and for loops can’t be used:
```rust
impl<'a> DecoderIter<'a> {
    pub fn next(&mut self) -> Option<Result<&str, &[u8]>> { /* ... */ }
}

let mut iter = decoder.next_chunk(input);
while let Some(result) = iter.next() {
    // ...
}
```
This compiles, but something like String::from_utf8_lossy(&[u8]) -> Cow<str> can’t be implemented on top of it because str fragments always borrow the short-lived decoder (and iterator), not only the original input.
We can work around that by adding enough lifetime parameters and one weird enum… but yeah, no.
```rust
pub struct Decoder { /* ... */ }

impl Decoder {
    pub fn new() -> Self { /* ... */ }
    pub fn next_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
        -> DecoderIter<'decoder, 'input> { /* ... */ }
    pub fn last_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
        -> DecoderIter<'decoder, 'input> { /* ... */ }
}

pub struct DecoderIter<'decoder, 'input> { /* ... */ }

impl<'decoder, 'input> DecoderIter<'decoder, 'input> {
    pub fn next<'buffer>(&'buffer mut self)
        -> Option<Result<EitherLifetime<'buffer, 'input, str>,
                         EitherLifetime<'buffer, 'input, [u8]>>> { /* ... */ }
}

pub enum EitherLifetime<'buffer, 'input, T: ?Sized + 'static> {
    Buffer(&'buffer T),
    Input(&'input T),
}

impl<'buffer, 'input, T: ?Sized> EitherLifetime<'buffer, 'input, T> {
    pub fn get<'a>(&self) -> &'a T where 'buffer: 'a, 'input: 'a {
        match *self {
            EitherLifetime::Input(x) => x,
            EitherLifetime::Buffer(x) => x,
        }
    }
}
```
> Except it doesn’t work. `&'a str` in the result conflicts with `&'a mut Decoder` in the iterator.

Can you elaborate? I don't follow here.
@taralx
- `DecoderIter<'a>` contains `&'a mut Decoder`.
- `Decoder` contains a `[u8; 4]` buffer.
- The `DecoderIter::next` method takes `&'b mut self`. This implies `'a: 'b` (`'a` outlives `'b`).
- Sometimes the returned `&str` is backed by the decoder’s buffer: `std::str::from_utf8_unchecked(&self.decoder.buffer)`.
- Since that borrow goes through `&'b mut self`, its lifetime cannot be longer than `'b`. This conflicts with the `'a` lifetime required by the `Item` type. Even if we could somehow make that borrow last for `'a` (such as with `unsafe`), the `&mut Decoder` in `DecoderIter` would no longer be exclusive, since there would be another borrow of part of it (the buffer).
- We could return `&'b str` instead of `&'a str`, but the `Iterator` trait cannot express that: the return type of `next` is `Option<Self::Item>`, and there is no way to include the lifetime of the iterator itself in the associated type `Item`.

Perhaps it’s clearer with code. This does not compile: https://gist.github.com/anonymous/0587b4484ec9a15f5c5ce6908b3807c1, unless you change `Result<&'a str, &'a [u8]>` to `Result<&str, &[u8]>` (borrowing `self`) and make `next` an inherent method instead of an `Iterator` impl.
I tend to agree that this should be removed from std, though for a slightly different reason: I'd like the interface for "returning a stream of chars from an io::Reader" to work with any encoding, not just UTF-8. If the general purpose of io::Read is to represent a stream of arbitrary bytes, then some other trait should represent a stream of decoded characters (though for completeness, I would probably also have such a trait implement io::Read). This would be similar to how you can chain streams in other languages: Decoder(BufferedReader(FileReader("somefile.txt"))). Unless/until general encoding functionality lives in std, I don't see much benefit in such a half-hearted implementation.
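To make the shape of that idea concrete, here is a minimal sketch of such a trait (the name `ReadChars` and its signature are my assumptions, not a proposed std API); a real version would pair it with a decoder adapter generic over any `R: io::Read` and any encoding:

```rust
use std::io;

// Hypothetical trait for streams of decoded characters, by analogy with
// io::Read representing streams of bytes. Name and shape are assumed.
trait ReadChars {
    // Ok(None) signals end of stream; decoding failures would surface
    // as io::Error.
    fn read_char(&mut self) -> io::Result<Option<char>>;
}

// Toy implementation over an in-memory string, standing in for a chain
// like Decoder(BufferedReader(FileReader("somefile.txt"))).
struct StrSource<'a>(std::str::Chars<'a>);

impl<'a> ReadChars for StrSource<'a> {
    fn read_char(&mut self) -> io::Result<Option<char>> {
        Ok(self.0.next())
    }
}
```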
https://github.com/hsivonen/encoding_rs/issues/8 has some discussion of Unicode stream and decoders for not-only-UTF-8 encodings.
The libs team discussed this and consensus was to deprecate the Read::chars method and its Chars iterator.
@rfcbot fcp close
Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.
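As a rough sketch of that approach (the `push_chunk` helper and its carry-over strategy are mine, and it is far less general than what the utf-8 crate provides), a chunked decoder can push complete characters through and keep an incomplete trailing sequence for the next chunk:

```rust
// Sketch of chunked UTF-8 decoding built on str::from_utf8, valid_up_to
// and error_len. `push_chunk` is a hypothetical helper, not a std API.
// On success, decoded text is appended to `out` and any incomplete
// trailing sequence stays in `carry` for the next chunk. On invalid
// input, the number of definitely-invalid bytes is returned.
fn push_chunk(carry: &mut Vec<u8>, chunk: &[u8], out: &mut String) -> Result<(), usize> {
    carry.extend_from_slice(chunk);
    match std::str::from_utf8(&carry[..]) {
        Ok(s) => {
            out.push_str(s);
            carry.clear();
            Ok(())
        }
        Err(e) => {
            let valid = e.valid_up_to();
            // The prefix up to `valid` is known-good UTF-8.
            out.push_str(std::str::from_utf8(&carry[..valid]).unwrap());
            match e.error_len() {
                // None: the chunk ends mid-character; keep the tail and
                // try to complete it with the next chunk.
                None => {
                    carry.drain(..valid);
                    Ok(())
                }
                // Some(n): n bytes are invalid no matter what follows.
                Some(n) => Err(n),
            }
        }
    }
}
```

Splitting "é" (0xC3 0xA9) across two chunks then decodes cleanly, while a stray 0xFF is reported as invalid immediately.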
Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.
Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams:
No concerns currently listed.
Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
:bell: This is now entering its final comment period, as per the review above. :bell:
The final comment period is now complete.
Deprecated in https://github.com/rust-lang/rust/pull/49970
Most helpful comment
After thinking on the InvalidUtf8 and IncompleteUtf8 errors some more, I've come to the conclusion they need to include the problematic bytes. This nicely handles the case of a partial read when using a &[u8] (a consumer can just stick the bytes in the front of the buffer before reading again), and an InvalidUtf8 could be handled however the consumer sees fit.
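A sketch of what that could look like (the enum and `decode` helper below are hypothetical; the comment above only names the InvalidUtf8 and IncompleteUtf8 errors): each error variant carries the bytes it refers to, so an incomplete tail can be pushed back in front of the next read.

```rust
// Hypothetical error type for illustration: each variant carries the
// problematic bytes, as the comment above suggests.
#[derive(Debug, PartialEq)]
enum Utf8DecodeError {
    IncompleteUtf8(Vec<u8>), // input ended mid-character; retry with more data
    InvalidUtf8(Vec<u8>),    // bytes that can never start a valid character
}

// Decode a whole slice, reporting the offending bytes on failure.
fn decode(input: &[u8]) -> Result<&str, Utf8DecodeError> {
    match std::str::from_utf8(input) {
        Ok(s) => Ok(s),
        Err(e) => {
            let rest = &input[e.valid_up_to()..];
            match e.error_len() {
                None => Err(Utf8DecodeError::IncompleteUtf8(rest.to_vec())),
                Some(n) => Err(Utf8DecodeError::InvalidUtf8(rest[..n].to_vec())),
            }
        }
    }
}
```

A consumer that gets `IncompleteUtf8(tail)` can prepend `tail` to its buffer before the next read, which is exactly the recovery described above.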