Rfcs: Add math to the char primitive

Created on 17 Oct 2017 · 11 comments · Source: rust-lang/rfcs

At the moment, working with chars requires a lot of conversions because of missing implementations:

fn char_delta(a: char, b: char) -> u8 {
  (a as i8 - b as i8).abs() as u8
}

fn main() {
  println!("{}", char_delta('a', 'h'));
}

I want to propose adding to char the same math that is already available for numeric types (for example i8), so that the sample could be written as

fn char_delta(a: char, b: char) -> u8 {
  (a - b).abs() as u8
}

fn main() {
  println!("{}", char_delta('a', 'h'));
}
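For reference, a minimal sketch of how this can be written with today's std, assuming the goal is the absolute difference between the two code point values (the helper name char_delta_today is made up here; the i8 casts in the first snippet truncate and can overflow for non-ASCII text, so the sketch goes through i64):

fn char_delta_today(a: char, b: char) -> u32 {
  // char-to-integer casts are lossless for i64, so the subtraction cannot overflow.
  (a as i64 - b as i64).abs() as u32
}

fn main() {
  println!("{}", char_delta_today('a', 'h')); // prints 7
}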
T-libs

All 11 comments

This needs much stronger motivation. What are the use cases for doing arithmetic on codepoints? I can think of some, but they are not common.

Also, what is the return type of these arithmetic operations? What happens when the result is not a scalar value?

What are the use cases for doing arithmetic on codepoints?

Maybe it's just me, but I would expect this behavior, especially coming from a language like C. The use cases vary, but I would not say they are that uncommon. I can think of hashing, sorting, compression, and similar functions that might require math on chars.

Also, what is the return type of these arithmetic operations?

In the above example, the return type would be i8, but I wanted to leave that open for discussion. I am a Rust beginner and am pretty sure I am getting a few things wrong ;)

What happens when the result is not a scalar value?

When would that happen? How does i8 handle that kind of situation? Why should char be special in that regard?

char is a Unicode scalar value, which is represented by a 32-bit unsigned integer. Not all 32-bit unsigned integers are Unicode scalar values.
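A minimal sketch of that distinction, using the standard library's checked conversion (std::char::from_u32 returns None for integers that are not scalar values):

fn main() {
  // 0x61 is 'a', a valid Unicode scalar value.
  assert_eq!(std::char::from_u32(0x61), Some('a'));
  // 0xD800 is a surrogate code point, not a scalar value.
  assert_eq!(std::char::from_u32(0xD800), None);
  // Values above 0x10FFFF are not code points at all.
  assert_eq!(std::char::from_u32(0x110000), None);
}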

There are so many ways this is wrong. A char has no meaningful arithmetic associated with it outside of the whole shebang of Unicode. Any kind of association stems from there.

Just consider that "⚠" is code point U+26A0 and "⚠".chars().map(|c| c as i8).collect::<Vec<i8>>() gives you [-96], which is just random garbage.
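What happens in that snippet is plain truncation of the scalar value; a small sketch:

fn main() {
  let c = '⚠';                  // U+26A0
  assert_eq!(c as u32, 0x26A0); // the full scalar value
  assert_eq!(c as u8, 0xA0);    // `as` keeps only the low byte
  assert_eq!(c as i8, -96);     // the same byte reinterpreted as signed
}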

I'm actually wondering why rustc has no warning any time a char is converted. char as i8/u8 or something like this is almost certainly just wrong. Maybe we can have a lint for that?

@BurntSushi if we only convert char to i32, there should be no problem at all. The other way around can indeed become troublesome in certain cases, and I would like to explore in what ways it can fail and what countermeasures might be available. With the current situation, nothing stops me from creating a char from an i32 (correct me if I'm wrong), so a better solution should be welcome anyway. At the very least, we can't make it worse than it is at the moment, so I don't think this should be a blocker - rather one more reason to finally develop a solution.

@lukaslueg Imho, for any single feature of Rust, one can come up with an example in which the feature makes no sense - something which is neither impressive nor surprising. Further, I think your behavior is ignorant and hostile. You are not alone in the world, and there are many people with wildly different requirements than you (or I, for that matter) could ever come up with.

Well then, as for meaningful arithmetic for char, I already demonstrated my needs in the OP. I want the diff between an a and a b. It does not matter to me whether the char is Unicode in that case, because I have a text, I want to do mathematical operations on that text, and I don't care in the least about the representation afterwards ([-96] is alright and I can work with that!). For example, if I want to build a simple char-wise addition of a key to a text, then I need to calculate the sum of chars, which is legitimate and quite common in IT security. Check out block ciphers, to name one example.

As for block ciphers, (tl;dr) their use case is, for example, TLS, which you are using right this instant to browse GitHub. My browser tells me that it is using GCM to talk to the server. Char arithmetic (or at least conversion of the chars) is used in my browser. With the spread of HTTPS and transport encryption (git-ssl, ssh, sftp, etc., used for the web, I4.0, IoT, banking, and so on), more and more applications require software to do char arithmetic. How come it is not simplified and safe in Rust?

[Moderator note: Everyone, please remember to be respectful both in criticism and responses. Thanks!]

All the use cases you mentioned are better served by using the .bytes() or .as_bytes() API.

You mention C having arithmetic on chars, but actually, it doesn't -- the char you talk of in C is Rust's u8, not Rust's char. When you do arithmetic on characters in C, you are doing arithmetic on code units, not codepoints. C doesn't expose codepoints in its standard library. You can feed it UTF-8 strings and operate on those as 8-bit code units or as variable-width encoded codepoints, but this is different from operating on the codepoint values! When you pull a char out of a UTF-8 string, that char is not the same bunch of bytes as the bytes you find at that position in the string; that is only true for UTF-32, not UTF-8 or UTF-16.

None of the operations you list require software to do char arithmetic. They require software to do byte arithmetic, and you can do that in Rust by consuming a string as a byte stream (.bytes()/.as_bytes()) instead of as a code point stream (.chars()).
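A small sketch of the two views of the same string (the concrete string "aä" is just an example):

fn main() {
  let s = "aä";
  // Byte view: the UTF-8 code units, one u8 per byte.
  let bytes: Vec<u8> = s.bytes().collect();
  assert_eq!(bytes, vec![0x61, 0xC3, 0xA4]);
  // Code point view: one char per Unicode scalar value.
  let chars: Vec<char> = s.chars().collect();
  assert_eq!(chars, vec!['a', 'ä']);
  // Byte arithmetic operates on the former, e.g. a simple sum of code units.
  let sum: u32 = s.bytes().map(u32::from).sum();
  assert_eq!(sum, 0x61 + 0xC3 + 0xA4);
}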

With the current situation, nothing stops me from creating a char from an i32

Casting to char from i32 causes a compiler error:

error[E0604]: only `u8` can be cast as `char`, not `i32`
 --> src/main.rs:2:5
  |
2 |     0i32 as char
  |     ^^^^^^^^^^^^
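The snippet behind that quoted error is not shown, so the following is a reconstruction that intentionally does not compile. u8 is the one exception because every u8 value is a valid Unicode scalar value; for other integers the checked route is std::char::from_u32.

fn main() {
  let _ = 0i32 as char; // error[E0604]: only `u8` can be cast as `char`, not `i32`
}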

For example, if I want to build a simple char-wise addition of a key to a text, then I need to calculate the sum of chars, which is legitimate and quite common in IT Security.

Encryption is generally done on the binary representation of a string, rather than on decoded Unicode scalar values. In Rust, the str::as_bytes method returns this binary representation as a &[u8] slice, and encryption libraries operate on the u8 bytes from that slice.
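A toy sketch of that workflow (a repeating-key XOR, not a real cipher; real ones such as AES-GCM come from crates but likewise operate on &[u8]):

fn xor_with_key(data: &[u8], key: &[u8]) -> Vec<u8> {
  // All arithmetic happens on u8 values; no chars are involved.
  data.iter()
    .zip(key.iter().cycle())
    .map(|(d, k)| d ^ k)
    .collect()
}

fn main() {
  let ciphertext = xor_with_key("attack at dawn".as_bytes(), b"key");
  // XOR-ing again with the same key restores the original bytes.
  assert_eq!(xor_with_key(&ciphertext, b"key"), b"attack at dawn".to_vec());
}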

It's also worth mentioning that if you want ASCII strings, like the default in C, you should probably be using Vec<u8> and &[u8]. The standard library also has an Ascii type that helps here.

Edit: nope, it just has AsciiExt which provides some helpers. I was thinking of https://docs.rs/ascii .
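A small sketch of that approach with byte string literals; the is_ascii/to_ascii_uppercase helpers used here are the AsciiExt-style ones (they are inherent methods on u8 in current Rust):

fn main() {
  // Byte string literals give &[u8] directly, much like a C string of ASCII.
  let text: &[u8] = b"Hello, ASCII!";
  assert!(text.iter().all(|b| b.is_ascii()));
  // Case conversion and similar helpers exist per byte.
  let shouted: Vec<u8> = text.iter().map(|b| b.to_ascii_uppercase()).collect();
  assert_eq!(shouted, b"HELLO, ASCII!".to_vec());
}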

I'm sorry if I have offended you. Let me try again to speak to the point:

The char datatype has no meaningful arithmetic defined on it because it is not a "numeric thing". Just like "one orange plus another orange" never actually gives you "two oranges" (though maybe orange squish if you try hard enough). The numeric representation, however, does, and that is of course what we mean when we say something like that. Likewise, you can't do arithmetic on a char because there is no such thing as the difference between "濫" and "💗". You can do arithmetic on the _numeric representation_, which is what 'ⅷ' as u32 or as_bytes() gives you.
If you do that, you make your intention clear that you are dropping the semantics of a Unicode scalar and working on opaque numbers - something one could have cooked up in any other way, like counting oranges.

If you want to do arithmetic on a numeric representation of Unicode text, you probably don't want to deal with scalars in the first place but just work on the UTF-8-encoded bytes directly. As previously pointed out, as_bytes() will give you that.
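A small sketch of that last point, reusing the warning-sign character from earlier in the thread: the scalar value and the UTF-8 bytes only coincide for ASCII.

fn main() {
  // The numeric scalar value of the code point ...
  assert_eq!('⚠' as u32, 0x26A0);
  // ... is not the same as the bytes of its UTF-8 encoding.
  assert_eq!("⚠".as_bytes(), &[0xE2u8, 0x9A, 0xA0][..]);
  // For ASCII the two views happen to agree.
  assert_eq!('a' as u32, 0x61);
  assert_eq!("a".as_bytes(), &[0x61u8][..]);
}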

OK, I see, thank you all for clearing up my misunderstanding! After doing some tests, using as_bytes() is the right way to go. Seems like I still have much to learn.
