Our `char` type is a Unicode scalar value (a code point excluding the surrogate range), which can lead to confusion because (a) it differs from other languages and (b) it doesn't directly encourage good Unicode hygiene ("Oh, a character? That's what the user sees").
Possible names include `codepoint`, `ucs4`, or `rune` like Go.
Other languages' names for a Unicode scalar value, or what `char` means there:
- Haskell: `Char` is a codepoint (although surrogates are allowed)
- D: `dchar` (`char` is a "UTF-8 code unit" and `wchar` is a "UTF-16 code unit" (i.e. aliases for `u8` and `u16`?): http://dlang.org/type.html)
- Go: `rune`
- C#/Java/Scala etc.: `char` is a 16-bit integer (i.e. UTF-16 code unit)
- C/C++: `char` is (normally) a byte, i.e. a UTF-8 code unit.

(Other languages like Python don't have a type for a single character and don't have a type called `char`, and so aren't meaningful for this comparison.)
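To make the code-unit comparison above concrete, here is a small sketch of how a single Rust `char` relates to UTF-8, UTF-16, and UTF-32 code units. The helper `code_unit_counts` is a hypothetical name introduced only for this illustration; `len_utf8` and `len_utf16` are standard-library methods on `char`.

```rust
use std::mem::size_of;

/// Hypothetical helper: how many code units `c` needs in
/// UTF-8 (u8), UTF-16 (u16) and UTF-32 (u32) respectively.
fn code_unit_counts(c: char) -> (usize, usize, usize) {
    // A scalar value always fits in exactly one 32-bit unit.
    (c.len_utf8(), c.len_utf16(), 1)
}

fn main() {
    // A Rust `char` is always 32 bits wide, unlike a C `char` or a Java `char`.
    assert_eq!(size_of::<char>(), 4);

    for c in ['a', '\u{E9}', '\u{20AC}', '\u{1F600}'] {
        let (n8, n16, n32) = code_unit_counts(c);
        println!("U+{:04X}: {} x u8, {} x u16, {} x u32", c as u32, n8, n16, n32);
    }
}
```

The same value that takes one Rust `char` can take up to four C-style bytes or two Java-style 16-bit units, which is the mismatch the list above describes.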
(This issue brought to you by reddit.)
(I'm personally against calling it "rune" since that word feels like a glyph/grapheme rather than a codepoint to me... but there's precedent in Go and in BSD for that name.)
Also note that UCS-4 was historically not the same as UTF-32. Wikipedia says they are now identical, but the Unicode FAQ seems to suggest otherwise. `ucs` (possibly re-acronymed as "Unicode Code, Scalar") might work if correctness is important.
One argument against renaming `char` would be the precedent that other languages already assign different meanings to it. So the name `char` would be consistent in the sense that it has no real meaning already. ;)
@Kimundi OTOH it looks like nobody (from the above list) is using `char` to describe a Unicode scalar value, so we would be adding one more meaning to the list.
I really don't like `rune` because it strongly implies a 1:1 relation with a grapheme/glyph.
`char` is a good name already, and you can't provide a better one. I'd like to keep it as is.
I don't like `rune`, and I hate `usv` for its meaninglessness. +1 for `char`.
On March 6, 2014, at 11:21 PM, "Luca Bruno" wrote:
@Kimundi OTOH it looks like nobody (from the above list) is using `char` to describe a Unicode scalar value, so we would be adding one more meaning to the list.
I'm in favor of re-using `rune` as:
- developers already know it
- it describes exactly our case
- it avoids NIH.

Otherwise some acronym like `usv` or `usv32`, which are short and hygienic to the standard but pose barriers to newcomers.
This is a minor wart that I'm not inclined to change.
Why does char not allow surrogate code points?
@brson: Surrogate code points aren't Unicode scalar values. They're just an implementation detail of UTF-16. The UTF-8 standard explicitly forbids encoding them too.
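A rough sketch of what this means in practice, using only standard-library calls. The helper names `is_scalar_value` and `utf16_units` are made up for this example.

```rust
/// Hypothetical helper: true if `n` is a Unicode scalar value,
/// i.e. if it can be converted to a Rust `char`.
fn is_scalar_value(n: u32) -> bool {
    char::from_u32(n).is_some()
}

/// Hypothetical helper: the UTF-16 code units that encode `c`.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    // The surrogate range U+D800..=U+DFFF is excluded from `char`.
    assert!(is_scalar_value(0xD7FF));
    assert!(!is_scalar_value(0xD800));
    assert!(!is_scalar_value(0xDFFF));
    assert!(is_scalar_value(0xE000));
    // So is anything past U+10FFFF.
    assert!(!is_scalar_value(0x11_0000));

    // The three bytes that would spell U+D800 in UTF-8 are not valid UTF-8.
    assert!(std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_err());

    // Surrogates only show up as UTF-16 code units, in pairs,
    // when encoding a scalar value above U+FFFF.
    assert_eq!(utf16_units('\u{1F600}'), vec![0xD83D, 0xDE00]);
}
```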
I think that various names on the table here are inadequate for different reasons:
- `char` means different things in different contexts, and what users think of as "characters" is closer to grapheme clusters, which can be made of multiple code points
- `codepoint` is not quite adequate as we exclude surrogate code points
- `ucs4` I think should refer to `[char]` strings/vectors rather than a single `char` unit

That leaves `rune` (from Go), which I think is the best by elimination.
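The grapheme-cluster point in the first bullet can be illustrated with a short sketch. It only counts scalar values, since the standard library has no grapheme segmentation (that would need a third-party crate such as `unicode-segmentation`). `scalar_count` is a hypothetical helper name.

```rust
/// Hypothetical helper: number of `char`s (Unicode scalar values) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character, but two `char`s.
    let decomposed = "e\u{301}";
    // The same user-perceived character as the single scalar value U+00E9.
    let precomposed = "\u{E9}";

    assert_eq!(scalar_count(decomposed), 2);
    assert_eq!(scalar_count(precomposed), 1);

    // The two strings render alike but are different sequences.
    assert_ne!(decomposed, precomposed);
    assert_eq!(decomposed.len(), 3); // length in UTF-8 bytes
    assert_eq!(precomposed.len(), 2);
}
```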
In Go it is exactly an alias for `int32`, and only represents a code point or Unicode scalar value by convention. This differs from Rust where we restrict `char` values to the range of Unicode scalar values, but that difference is consistent with the difference between Rust’s `str` that is strictly UTF-8 (unless you mess it up with unsafe code, which we would consider a bug) and Go’s `string` type that’s a sequence of bytes, and only by convention often contains UTF-8.
So, proposal:
- Rename `char` to `rune` (rune being a shorter name for Unicode scalar value)
- Possibly have `type ucs4 = [rune]` (assuming DST)

Several people, including core team members, don't like `rune`. I don't like it either.
`char` is a good name.
@liigo Could you explain why you don’t like "rune" and why "character" being ambiguous is not a problem, as you see it?
Someone has already answered your questions; see the comments above.
I think that various names on the table here are inadequate for different reasons:
To be pedantic, `rune` was a name on the table here. :P
Also, this should be an RFC in rust-lang/rfcs, now that we have that process. Closing. (If someone else doesn't step up to write it up, I'm happy to do it... eventually.)
The name `char` cost me many days because I assumed it was a plain C++ `char`. Too bad for anyone who, like me, assumes the meaning from the name. Now I read a type's definition carefully before using it.
My point: `char` is a confusing name.
If the core team doesn't like the names above, invent a new one, but at least not `char`. It will save someone's time.
It's an old topic, but it hurt me; that's why I'm adding my comment.
`c8` would be a good name; at least it would force us to look up what `c8` is, like `i8`, `u8`, `i32`, etc.
Agreed. Giving the commonly used name `char` a different meaning causes misunderstanding and makes it much harder to get started with Rust. A language that aims to be usable should probably respect the general knowledge that most other languages have already established for such a term.
It was the least bad option among everything considered, and it's highly unlikely that it would change at this point with the language stable. Since it's a Unicode scalar value (not just any code point), there's always a 1:1 mapping between strings and `[char]`.
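As a minimal sketch of that 1:1 mapping, a `str` can be turned into a sequence of `char`s and back without loss. The helper names `to_chars` and `from_chars` are hypothetical and exist only for this illustration.

```rust
/// Hypothetical helper: decode a string into its scalar values.
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Hypothetical helper: encode scalar values back into a UTF-8 string.
fn from_chars(cs: &[char]) -> String {
    cs.iter().collect()
}

fn main() {
    let original = "na\u{EF}ve \u{1F600} e\u{301}";

    // str -> [char] -> String gives back exactly the same string,
    // because every `char` is encodable and every valid `str` decodes
    // into `char`s with nothing left over.
    let chars = to_chars(original);
    let round_tripped = from_chars(&chars);
    assert_eq!(round_tripped, original);

    // The other direction holds as well: [char] -> String -> [char].
    let cs = ['a', '\u{20AC}', '\u{10FFFF}'];
    assert_eq!(to_chars(&from_chars(&cs)), cs);
}
```

This would not hold if `char` admitted surrogate code points, since those have no valid UTF-8 encoding.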
C has `signed char`, `unsigned char`, and `char`, which may or may not be signed but is always a distinct type from both `signed char` and `unsigned char`, with special rules. It also has `wchar_t` (which varies in size based on platform choices: it's a code point on Linux and a UTF-16 code unit on Windows), `char16_t`, and `char32_t`. C++ is also adding `char8_t` and it may come to C too. Even though `char16_t` implies 16-bit, it's the same type as `uint_least16_t` and can be larger. It's only guaranteed to be Unicode if the platform defines `__STDC_UTF_16__`. Similarly, `char32_t` is only guaranteed to be Unicode if the platform defines `__STDC_UTF_32__`. Since a lot of this has not existed historically, there are many alternatives broadly used in language / library ecosystems.
Coming from this in C and C++, I struggle to see how the naming of `char` makes it _much harder_ to get into the language. It takes someone a moment of thought to read and absorb the chosen definition.
On that note, how do I unsubscribe from all threads in a repository? ...
It should be under the 'Watch' button at the top of the page?
I don't want to ignore it but rather remove all the explicit subscriptions.
@thestinger I suspect https://github.com/notifications/subscriptions is the closest to what you ask that Github has.