Rust: RFC: Rename `char` to make it clearer that it is a unicode codepoint/scalar value

Created on 6 Mar 2014  ·  21 Comments  ·  Source: rust-lang/rust

Our char type is a Unicode scalar value (a code point excluding the surrogate range), which can lead to confusion because (a) it differs from what char means in other languages and (b) it doesn't directly encourage good Unicode hygiene ("Oh, a character? That's what the user sees").
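
A minimal sketch of the gap between a char and "what the user sees", using only the standard library. The sample strings and the helper name scalar_count are assumptions chosen for illustration: "é" is spelled once as the letter e followed by U+0301 (combining acute accent) and once as the single precomposed scalar value U+00E9.

```rust
/// Returns the number of `char` values (Unicode scalar values) in `s`.
/// `scalar_count` is a hypothetical helper name, not a standard API.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // One user-perceived character, written as 'e' + U+0301 (combining acute accent).
    let decomposed = "e\u{301}";
    // The same user-perceived character as the single scalar value U+00E9.
    let precomposed = "\u{e9}";

    println!("decomposed:  {} char(s)", scalar_count(decomposed));
    println!("precomposed: {} char(s)", scalar_count(precomposed));

    // A Rust `char` always occupies four bytes, unlike a C `char`.
    println!("size_of::<char>() = {}", std::mem::size_of::<char>());
}
```

Both strings render as one character to a reader, yet they contain different numbers of char values, which is the confusion the name can invite.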

Possible names include codepoint, ucs4, or rune like Go.

Other languages' names for a Unicode scalar value, and what char means in them:

  • Haskell: Char is a codepoint (although surrogates are allowed)
  • D: dchar (char is a "UTF-8 code unit" and wchar is a "UTF-16 code unit", i.e. aliases for u8 and u16?); see http://dlang.org/type.html
  • Go: rune
  • C#/Java/Scala etc.: char is a 16-bit integer (i.e. UTF-16 code unit)
  • C/C++: char is (normally) a byte, i.e. a UTF-8 code unit.
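
The three units that the list above distinguishes (UTF-8 code unit, UTF-16 code unit, scalar value) can be compared for one string in Rust. This is a sketch only; the struct UnitCounts and the function unit_counts are hypothetical names introduced for the example.

```rust
/// Length of one string measured in three different units.
/// `UnitCounts` and `unit_counts` are illustrative names, not standard APIs.
#[derive(Debug, PartialEq)]
struct UnitCounts {
    utf8_code_units: usize,  // bytes: what a C `char` or D `char` holds
    utf16_code_units: usize, // what a Java/C# `char` or D `wchar` holds
    scalar_values: usize,    // what a Rust `char` holds
}

fn unit_counts(s: &str) -> UnitCounts {
    UnitCounts {
        utf8_code_units: s.len(),
        utf16_code_units: s.encode_utf16().count(),
        scalar_values: s.chars().count(),
    }
}

fn main() {
    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
    // a surrogate pair (two code units) to represent it.
    for s in ["a", "\u{e9}", "\u{1F600}"] {
        println!("{:?} -> {:?}", s, unit_counts(s));
    }
}
```

Only for ASCII text do all three counts agree, which is why a type named char can mean quite different things from one language to the next.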

(Other languages like Python don't have a type for a single character and don't have a type called char, and so aren't meaningful for this comparison.)

(This issue brought to you by reddit.)

A-unicode

All 21 comments

(I'm personally against calling it "rune" since that word feels like a glyph/grapheme rather than a codepoint to me... but there's precedent in Go and in BSD for that name.)

Also note that UCS-4 was historically not the same as UTF-32. Wikipedia says they are now identical, but the Unicode FAQ seems to suggest otherwise. ucs (possibly re-acronymed as "Unicode Code, Scalar") might work if correctness is important.

One argument against renaming char would be the precedent that other languages already assign different meanings to it. So the name char would be consistent in the sense that it has no real meaning already. ;)

@Kimundi OTOH it looks like nobody (from the above list) is using char to describe a Unicode scalar value, so we would be adding one more meaning to the list.

I really don't like rune because it strongly implies a 1:1 relation with a grapheme/glyph.

char is a good name already, and you can't provide a better one. I'd like to keep it as is.

Don't like rune, and I hate usv for its meaninglessness. +1 for char.
On 6 Mar 2014 at 11:21 PM, "Luca Bruno" wrote:

@Kimundi OTOH it looks like nobody (from the above list) is using char to describe a Unicode scalar value, so we would be adding one more meaning to the list.

I'm in favor of re-using rune as:

  1. developers already know it
  2. describes exactly our case
  3. avoids NIH.

Otherwise some acronym like usv or usv32, which are short and hygienic to the standard but pose barriers to newcomers.



This is a minor wart that I'm not inclined to change.

Why does char not allow surrogate code points?

@brson: Surrogate code points aren't Unicode scalar values. They're just an implementation detail of UTF-16. The UTF-8 standard explicitly forbids encoding them too.
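
A short sketch of how this shows up in Rust's standard library, assuming the helper name is_scalar_value (hypothetical, introduced only for this example). The bytes ED A0 80 are what U+D800 would look like if UTF-8 permitted encoding it.

```rust
/// True if `code_point` is a Unicode scalar value, i.e. representable as a `char`.
/// `is_scalar_value` is a hypothetical helper wrapping `char::from_u32`.
fn is_scalar_value(code_point: u32) -> bool {
    char::from_u32(code_point).is_some()
}

fn main() {
    // The surrogate range U+D800..=U+DFFF is excluded; its neighbours are fine.
    for cp in [0xD7FF_u32, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        println!("U+{:04X}: scalar value = {}", cp, is_scalar_value(cp));
    }

    // UTF-8 validation rejects an encoded surrogate.
    let encoded_surrogate = vec![0xED, 0xA0, 0x80];
    println!("from_utf8 ok: {}", String::from_utf8(encoded_surrogate).is_ok());

    // A lone surrogate code unit is not valid UTF-16 either.
    println!("from_utf16 ok: {}", String::from_utf16(&[0xD800]).is_ok());
}
```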

I think that various names on the table here are inadequate for different reasons:

  • char means different things in different contexts, and what users think of as "characters" is closer to grapheme clusters, which can be made of multiple code points
  • codepoint is not quite adequate as we exclude surrogate code points
  • ucs4 I think should refer to [char] strings/vectors rather than a single char unit

That leaves rune (from Go), which I think is the best by elimination.

In Go it is exactly an alias for int32, and only represents a code point or Unicode scalar value by convention. This differs from Rust where we restrict char values to the range of Unicode scalar values, but that difference is consistent with the difference between Rust’s str that is strictly UTF-8 (unless you mess it up with unsafe code, which we would consider a bug) and Go’s string type that’s a sequence of bytes, and only by convention often contains UTF-8.

So, proposal:

  • Rename char to rune (rune being a shorter name for Unicode scalar value)
  • Rename functions and methods that have "char" in their name accordingly.
  • Possibly have type ucs4 = [rune] (assuming DST)

Several people, including core team members, don't like rune. I don't like it either. char is a good name.

@liigo Could you explain why you don’t like "rune" and why "character" being ambiguous is not a problem, as you see it?

Someone has already answered your questions; see the comments above.

"I think that various names on the table here are inadequate for different reasons:"

To be pedantic, rune was a name on the table here. :P


Also, this should be an RFC in rust-lang/rfcs, now that we have that process. Closing. (If someone else doesn't step up to write it up, I'm happy to do it... eventually.)

The name char cost me many days because I assumed it was a plain C++ char. That is too bad for anyone who, like me, goes by the name alone. Now I read a type's documentation carefully before using it.

My point is that char is a confusing name.

If the core team doesn't like the names above, invent a new one, but at least not char. It will save someone's time.

It's an old topic, but it hurt me, which is why I'm adding my comment.

c8 is a good name; at least it would force us to learn what c8 is, like i8, u8, i32, etc.

I agree with this. Giving a different meaning to the commonly used name char causes misunderstanding and makes it much harder to get into Rust. A language that aims to be usable should perhaps respect the general knowledge that most other languages have already established for such a name.

It was the least bad option among everything considered, and it's highly unlikely that it would change at this point with the language stable. Since it's a Unicode scalar value (not just any code point), there's always a 1:1 mapping between strings and [char].
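
The 1:1 mapping mentioned above can be sketched as a round trip between a string and a sequence of char values. The function names to_scalars and from_scalars and the sample string are assumptions made for this example.

```rust
/// Decodes a string into its sequence of Unicode scalar values.
/// (`to_scalars` is a hypothetical helper name.)
fn to_scalars(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Re-encodes scalar values as UTF-8. No error case is needed here,
/// because every `char` can be encoded.
/// (`from_scalars` is a hypothetical helper name.)
fn from_scalars(scalars: &[char]) -> String {
    scalars.iter().collect()
}

fn main() {
    let original = "na\u{ef}ve \u{1F600}";
    let scalars = to_scalars(original);
    let round_tripped = from_scalars(&scalars);

    // Decoding and re-encoding yields the original string.
    assert_eq!(original, round_tripped);
    println!("{} scalar values, {} bytes", scalars.len(), original.len());
}
```

If char admitted surrogate code points, the second function would need an error case, since UTF-8 cannot encode them.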

C has signed char, unsigned char, and char, which may or may not be signed but is always a distinct type from both signed char and unsigned char, with special rules; wchar_t (which varies in size based on platform choices: it's a code point on Linux and a UTF-16 code unit on Windows); char16_t; and char32_t. C++ is also adding char8_t and it may come to C too. Even though char16_t implies 16-bit, it's the same type as uint_least16_t and can be larger. It's only guaranteed to be Unicode if the platform defines __STDC_UTF_16__. Similarly, char32_t is only guaranteed to be Unicode if the platform defines __STDC_UTF_32__. Since a lot of this has not existed historically, there are many alternatives broadly used in language / library ecosystems.

Coming from this in C and C++, I struggle to see how the naming of char makes it much harder to get into the language. It takes someone a moment of thought to read and absorb the chosen definition.

On that note, how do I unsubscribe from all threads in a repository? ...

Shouldn't that be the 'Watch' button at the top of the page?

I don't want to ignore it but rather remove all the explicit subscriptions.

@thestinger I suspect https://github.com/notifications/subscriptions is the closest thing GitHub has to what you're asking for.
