I ran into a case using the csv crate in combination with structopt where I needed to parse a single char from an argument to use as a CSV delimiter. However, it appears that csv::StringBuilder only accepts u8 delimiters. I was surprised to see that there was no type-safe way to convert between char to u8 in the standard library. It would be nice if u8 implemented TryFrom<char>, where the conversion fails if the char value is outside of the U+0000 to U+00FF codepoint range (0-255).
An example use would look like this:
fn main() {
let success = u8::try_from('A').expect("Conversion should have succeeded");
let failure = u8::try_from('我').expect_err("Conversion should have failed");
}
ASCII range
This differs from impl From<u8> for char which supports U+0000 to U+00FF.
@kennytm Good point! Thanks for pointing out the semantic difference in From<u8>. In that case, I think it's necessary to revise my proposal to specifically U+0000 to U+00FF (value range of 0-255).
if all of 0..=255 is allowed I believe what you want can be done as follows:
if c as u32 <= u8::max_value() as u32 {
Some(c as u8)
} else {
None
}
@Lokathor Thanks for the code sample! That's actually very similar to what I've incorporated into my own project, so it's nice to see external confirmation that this general pattern exists. I was hoping to at least raise some discussion on whether a similar conversion could or should be included as part of the standard library.
This also works: u8::try_from(u32::from(some_char))
If this is added, I'd prefer if it's a method on char called something like try_into_latin1, to clarify that using it constitutes a semantic conversion (and not just type-checking bureaucracy), and to discourage using it as the least-effort/least-thought-out way to convert a stream of chars into a stream of u8s (which should explicitly refer to some character encoding).
“Latin-1” may refer to slightly different things though.
Originally, it is Part 1: Latin alphabet No. 1 of ISO/IEC 8859. It is a set of 191 characters, each assigned an 8-bit value. The ranges 0x00 to 0x1F and 0x7F to 0x9F are not assigned.
IANA calls ISO-8859-1 an extension of the above that assigns the C0 and C1 control codes to these two ranges. This makes it compatible with ASCII, and all 256 bytes are assigned a character or control code. Unicode chose to make its first 256 code points U+0000 to U+00FF compatible with this, so u8::try_from(u32::from(some_char)) effectively encodes to IANA’s ISO-8859-1. To avoid confusion, https://infra.spec.whatwg.org/#isomorphic-encode calls this operation isomorphic encode instead.
Windows code page 1252 is a different extension of ISO/IEC 8859-1 that assigns various punctuation or letters to some of the 0x80 to 0x9F range. (In 0x00 to 0x7F it is compatible with ASCII and IANA’s ISO-8859-1 though) Windows-1252 is sometimes incorrectly called ISO-8859-1 or Latin-1.
In web browsers, at this point for compatibility reasons, many labels (used for example in charset in Content-Type) including ascii, latin1 and iso-8859-1 map to an extension of Windows-1252 where the 5 unassigned byte values are mapped to corresponding C1 control codes like in IANA’s ISO-8859-1.
Back to Rust API naming, we already have the precedent of impl From<u8> for char that does the opposite conversion. And my pre-1.0 proposal to rename str::as_bytes to str::as_utf8 was rejected.
FWIW, I think this makes sense. In general, any time there's A: From<B>, I think it makes sense to have B: TryFrom<A> for going the other way. (Assuming, of course, that it's possible to implement, but if it isn't then I'm also skeptical of the From...)
EDIT: Said otherwise, we have u8: TryFrom<u32>, and char: From<u8> is precedent for "u8 is a USV in 0..256" (as opposed to a code unit, or something), so we might as well have u8: TryFrom<char>.
Oh? Is there a convention to not provide A: From<B> if B: TryFrom<A> makes no sense, like say if A is a hash of B or something? Or did you mean only in the context of std?
I think wrapper types improve code quality of course, but sometimes people then use From/Into more once they have specific types, which harms readability over named methods.
I suppose one guideline would be impl From<B> for A if code would be improved by using the bound B: Into<A> but avoid doing so otherwise.
Is there a convention to [...]
Nothing official, as far as I know. It's just a heuristic I use, since if you can't recover the original, then it generally seems to me like it's doing something non-obvious enough that it would be better as a method with a more illustrative name. (Of course, this is also only for Froms with the same ownership type -- I'm not expecting to be able to recover the original from something like String: From<&str>.)
Scrolling through std's From impls they do mostly match this. I probably wouldn't bother ever writing out the the TryFrom opposite of RecvTimeoutError: From<RecvError>, since it's just matches!(foo, RecvTimeoutError::Disconnected), but it would be completely plausible if someone wanted to have it. Similarly for String: From<char> -- I'm not sure that char: TryFrom<String> would be a good API (would one really ever want to .try_into() such that it fails if there's not exactly one char?) but it's pretty clear what the behaviour would be.
The ones that are a bit weirder are things like BinaryHeap<T>: From<Vec<T>> -- there's a From back the other way, but the first way was actually a heapify operation, so the original _value_ can't be recovered exactly. I think that's a place where I'd have actually preferred it just be BinaryHeap::new _without_ the from existing -- it gives a more accessible place to document complexity and such.
All that said, https://github.com/rust-lang/rfcs/pull/2484 includes some text proposing rules more along these lines.
Most helpful comment
If this is added, I'd prefer if it's a method on
charcalled something liketry_into_latin1, to clarify that using it constitutes a semantic conversion (and not just type-checking bureaucracy), and to discourage using it as the least-effort/least-thought-out way to convert a stream ofchars into a stream ofu8s (which should explicitly refer to some character encoding).