Treating []u8 as strings is incorrect. []u8 is an array of octets, not an array of characters. Zig should support Unicode more explicitly and enforce the distinction between []u8 and str in the language and standard library.
I propose adding a rune type, which holds one unicode codepoint. The underlying storage mechanism isn't relevant to the programmer, who can only assume it's an int capable of holding a unicode codepoint. On platforms whose pointers are sufficiently sized, it should probably be a usize under the covers. I also propose adding a str type, which is opaque but offers length and indexing of runes. The underlying string encoding is also not important to the programmer, but some possible strategies include always using UTF-8 or UTF-32, or upgrading the encoding as necessary to fit the runes the user attempts to place in it.
Also provided should be standard library functions for manipulating strings separately from []u8, and helpful functions to convert str to []u8 and back again in arbitrary encodings.
It seems to me that what this issue is calling for is standard library code to handle strings.
Can you explain what the use case is for the rune and str types you proposed? Under what circumstances would you use them?
For example, it seems to me that the hello world application should use []u8 and not str for the command line arguments to main, as well as the bytes being printed to stdout, because that's what is happening - command line args are arrays of octets, and what goes to stdout is arrays of octets. Text editors typically encode characters as UTF-8, and zig allows UTF-8 (any array of octets, really) in string literals. So if, for example, the hello world application used these types at all, it would introduce an unnecessary runtime conversion between str encoded data and []u8.
Point being, when actually do we want to use these types being proposed? I can certainly think of some use cases, such as implementing a text box in a GUI application. But at that point, why does it need to be part of the language? Would not a standard library module suffice?
Can you explain what the use case is for the rune and str types you proposed? Under what circumstances would you use them?
During any string manipulation. Iterating over a str with for would give you a bunch of runes. This is more useful than iterating over a bunch of u8s, which are not characters but could be partial runes depending on the encoding of the []u8.
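For concreteness, here is a minimal sketch of that difference using today's std.unicode rather than a built-in str (assuming a recent Zig standard library; exact APIs have shifted between versions):

const std = @import("std");

test "bytes vs codepoints" {
    const s = "héllo"; // 6 bytes, but only 5 codepoints
    try std.testing.expectEqual(@as(usize, 6), s.len);

    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();
    var count: usize = 0;
    while (it.nextCodepoint()) |_| count += 1; // each iteration yields one decoded codepoint
    try std.testing.expectEqual(@as(usize, 5), count);
}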
zig allows UTF-8 (any array of octets, really) in string literals [...] it would introduce an unnecessary runtime conversion between str encoded data and []u8.
It is necessary. Not decoding the strings will produce buggy code. The compiler can optimize the conversion away by compiling string literals into str structs instead of []u8, which it should do.
hello world application should use []u8 and not str for the command line arguments to main, as well as the bytes being printed to stdout
Args could be []u8, sure, because that's what they are. You would have to decode them before using them as strings. But there's no reason you couldn't just write []u8 to stdout if you prefer. Could also have a special syntax, à la Python, for string literals that are encoded as UTF-8 and become []u8 rather than str. Would also be nice to detect the signature of main and do sane argument decoding in the crt0 if the user requests args as []str.
Point being, when actually do we want to use these types being proposed? I can certainly think of some use cases, such as implementing a text box in a GUI application. But at that point, why does it need to be part of the language? Would not a standard library module suffice?
String handling is intrinsic to any programming language. Well over half of programs will do string manipulation, I expect. []u8 are not strings, and if you try to do string manipulation with them, your code will be broken. I strongly encourage you to add a language-level distinction between []u8 and str to prevent users from running into bugs. This discussion happened for Python 3 and they made the correct choice, by the way.
It is necessary. Not decoding the strings will produce buggy code.
const io = @import("std").io;
pub fn main(args: [][]u8) -> %void {
%%io.stdout.printf("Hello, 世界\n");
}
$ ./test
Hello, 世界
Where's the bug?
What does an additional type in this example accomplish?
It sounds like you're saying, this is not an example where it is necessary, but users will have other use cases where they want to be doing string manipulation rather than array of octet manipulation, such as, say, taking stdin, uppercasing it, and printing it to stdout. Would that be a fair use case?
The bug doesn't present itself in the simple case. Here are a number of examples that would break or be unsupported: getting the length of a string in characters rather than bytes, indexing or splitting a string at character boundaries, and uppercasing or lowercasing a string.
Your example only works because you're really just copying []u8 around. You're not actually doing string operations.
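A concrete illustration of that distinction (a hedged sketch, using byte reversal as a stand-in for "real" string manipulation): copying the bytes through untouched is fine, but operating on them byte-by-byte silently corrupts multi-byte characters.

const std = @import("std");

test "byte-wise reversal corrupts UTF-8" {
    // 'é' is two bytes in UTF-8 (0xC3 0xA9); reversing the bytes splits that sequence.
    var buf = "héllo".*;
    std.mem.reverse(u8, &buf);
    try std.testing.expect(!std.unicode.utf8ValidateSlice(&buf));
}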
Why do you think any major language designed in the past 10 years, no matter how low level, has had proper unicode string support?
So proper unicode string support is doable in userland without any modification to the Zig language or runtime. String literals can be converted to anything at compile time with userland functions, and all those string manipulation operations mentioned above can be done in userland functions as well, even also at compile time.
Am I right in saying that this is a proposal for a standard library module, and not a proposal for a language change?
Not as far as I can tell. I presume that making some_str[1] do the right thing involves language changes, and making string literals emit the string type instead of []u8 is also a language change. I'm not sure if zig already supports compile-time reflection to determine the parameters of main - to support the crt0 change I proposed that may require language changes.
I presume that making some_str[1] do the right thing involves language changes
With a userland string solution, you would probably not be able to do some_str[1], unless some_str was a slice. If your string solution is a struct that contains a slice, then perhaps some_str.chars[1] would work without language changes.
But if your string solution is utf8-encoded or a rope data structure or something else that's not simply a slice of characters, then character access wouldn't be as simple as the [] makes it look. Zig does not have operator overloading, and that's to avoid hidden runtime costs. If character access requires an O(log n) tree traversal, then make that operation a function and call it like a function.
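As a sketch of the "make it a function" point (codepointAt is a hypothetical helper built on std.unicode's Utf8View, not a proposed std API): character access over UTF-8 is an O(n) scan, and a named function keeps that cost visible at the call site.

const std = @import("std");

// Hypothetical helper: returns the index-th codepoint, or null if the string is shorter.
fn codepointAt(utf8: []const u8, index: usize) !?u21 {
    var it = (try std.unicode.Utf8View.init(utf8)).iterator();
    var i: usize = 0;
    while (it.nextCodepoint()) |cp| : (i += 1) {
        if (i == index) return cp;
    }
    return null;
}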
String literals can be converted to any encoding you want at compile time like this:
const motd = decodeUtf8("こんにちは");
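decodeUtf8 itself is hypothetical; a rough comptime implementation could look like the following (a sketch assuming today's std.unicode, using roughly the same trick as std.unicode.utf8ToUtf16LeStringLiteral): validate and decode the literal into an array of codepoints entirely at compile time.

const std = @import("std");

// Hypothetical decodeUtf8: not a std function, just an illustration.
fn decodeUtf8(comptime s: []const u8) [std.unicode.utf8CountCodepoints(s) catch unreachable]u21 {
    var out: [std.unicode.utf8CountCodepoints(s) catch unreachable]u21 = undefined;
    var it = std.unicode.Utf8View.initComptime(s).iterator();
    var i: usize = 0;
    while (it.nextCodepoint()) |cp| : (i += 1) out[i] = cp;
    return out;
}

const motd2 = decodeUtf8("こんにちは"); // [5]u21, no runtime conversion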
I'm not sure if zig already supports compile-time reflection to determine the parameters of main - to support the crt0 change I proposed that may require language changes.
Is there a reason other than convenience why you wouldn't want to do the conversion at the top of your main implementation by calling a userland function? Is there a reason to do it in the crt0 instead?
But if your string solution is utf8-encoded or a rope data structure or something else that's not simply a slice of characters, then character access wouldn't be as simple as the [] makes it look. Zig does not have operator overloading, and that's to avoid hidden runtime costs. If character access requires an O(log n) tree traversal, then make that operation a function and call it like a function.
I'm not suggesting operator overloading - I'm just suggesting strings behave this way, which is why a language change is required.
String literals can be converted to any encoding you want at compile time like this:
What does that even do? The behavior of your example is not predictable by people who understand string encodings without reading the docs and probably the code because when you write the docs you will likely fail to understand what's confusing about it.
Is there a reason other than convenience why you wouldn't want to do the conversion at the top of your main implementation by calling a userland function? Is there a reason to do it in the crt0 instead?
No, just convenience.
There's an elephant in the room in this discussion, which is that Zig wants to take memory allocation seriously. How exactly to handle memory allocation is a big discussion, and one that should probably happen in a different issue. Some high level points that are relevant here are:
- Functions that allocate can take the allocator explicitly, e.g. fn splitString(s: &const String, sep: &const String, allocator: &Allocator) -> []String (a concrete sketch follows below).
- Alternatively, a global allocator could be configured, e.g. fn setGlobalAllocator(allocator: &Allocator).

There's a lot there to discuss, and again, that should probably be in another issue. The point I'm trying to make in this issue is that we can't supply functions like string splitting without thinking about memory allocation. Languages that don't ask you to think about memory allocation are, from Zig's perspective, sub-optimal languages. Python 3, JavaScript, and Java all have garbage collection, which makes string manipulation look very nice at a high level, but fails Zig's goal of optimality.
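To make the explicit-allocator flavor concrete, a hypothetical byte-slice version might look like this (splitBytes is an invented name; it leans on std.mem.splitSequence, which exists under that name in recent Zig versions):

const std = @import("std");

// Hypothetical: split s on sep, returning sub-slices of s; the caller owns (and must
// free) the returned outer slice, which is exactly the allocation being discussed.
fn splitBytes(allocator: std.mem.Allocator, s: []const u8, sep: []const u8) ![][]const u8 {
    var count: usize = 0;
    var it = std.mem.splitSequence(u8, s, sep);
    while (it.next()) |_| count += 1;

    const parts = try allocator.alloc([]const u8, count);
    it = std.mem.splitSequence(u8, s, sep);
    var i: usize = 0;
    while (it.next()) |part| : (i += 1) parts[i] = part;
    return parts;
}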
So far, Zig provides a List(T) class that does memory management by asking explicitly for an allocator when you construct the list. If we run with this idea for now, then you could make a string builder class that can decode from utf8, encode to a utf8 output buffer, store unicode data opaquely, and even offer functions like splitting and random character access. Does it make sense for one of these string builders to exist at compile time? Maybe, but it would be complicated, since allocators probably need to work differently at runtime vs compile time. Does Zig want to supply such a string builder class in its standard library? Maybe.
@SirCmpwn Can you give examples of low level languages with proper unicode string support? I'd like to check out how they do memory management.
Rust and Go come to mind as being fairly low level and having sane Unicode support. Also, you can do a lot of things with this design without bringing allocation into the discussion.
After a few minutes of research, it looks like Rust supports 1 global allocator per build artifact, determined at compile time. This lacks the features of the Jai solution, which allows multiple distinct concurrent allocators running in the same application at the same time.
I believe Go has a memory management strategy with hidden allocations and a garbage collector. I've heard some people call Go a low level language, and I'm aware of an intense debate on the internet about Go's memory management strategy, even getting so intense as to call Go's marketers liars. But Go drama aside, Go's memory management strategy is not acceptable for Zig, so Go's unicode strategy is not very helpful in designing a unicode strategy for Zig.
I wouldn't mind a global allocator. Perhaps you could use an allocator keyword to set a new allocator for a given scope? Again, though, many many sane Unicode string handling functions don't need allocators. And for that matter non-sane string splitting probably needs allocation as well. I don't really think it's relevant to this issue.
I wouldn't mind a global allocator. Perhaps you could use an allocator keyword to set a new allocator for a given scope?
That sounds like Rust and Jai respectively.
The allocation discussion is a bit off topic, but it is relevant to keep in mind. Let's get back to string support and discuss some string functions/methods we might want to have.
I've gone through the list of Java 7's String methods and pasted in some highlights to discuss. The code examples are Java's API, not a proposed signature for Zig, although I'd like to discuss what Zig's version of each feature would be.
- String(byte[] bytes, Charset charset) and byte[] getBytes(Charset charset): What charsets should be supported? Just UTF-8? Maybe also ISO-8859-1? Maybe also Windows-1252? Maybe "all of them"? Maybe that's configurable at compile time? Or maybe there could be a dynamic library that provides these? Memory allocation is relevant here.
- int compareTo(String anotherString) and boolean equals(Object anObject): Easy to implement. And I believe these will also work with naive UTF-8 []u8 lexicographical comparison.
- int compareToIgnoreCase(String str), boolean equalsIgnoreCase(String anotherString), String toLowerCase(), and String toUpperCase(): This requires a table of unicode points with data about each character. This would be a significant feature to provide, and we may want to provide a standard solution to this. Memory allocation is relevant to the last two methods here.
- String toLowerCase(Locale locale) and String toUpperCase(Locale locale): I didn't know that uppercasing and lowercasing were sensitive to locale. Should Zig worry about this?
- int indexOf(int ch) and int indexOf(String str): Easy.
- boolean startsWith(String prefix) and boolean endsWith(String suffix): Easy.
- int hashCode(): Implementations would be easy, but deciding on an implementation might be hard.
- String replace(char oldChar, char newChar) and String replace(CharSequence target, CharSequence replacement): Algorithmically easy. Memory allocation is relevant here.
- String[] split(String regex): Probably wouldn't drag regex into this, but otherwise easy to implement. Memory allocation is relevant here.
- String substring(int beginIndex, int endIndex) and String trim(): Easy to implement. Memory allocation might be relevant here depending on the underlying string implementation.

Additionally, both Rust and Java 7 seem to have methods related to interpreting UTF-8 or UTF-16 bytes as sequences of variable-length codepoints, but that might not be necessary depending on the string implementation. It does raise a question though, which is how should unicode strings really be implemented?
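For reference, several of the "easy" entries above already have byte-slice counterparts in std.mem today, and naive UTF-8 byte comparison does match codepoint-order lexicographic comparison. A rough sketch (names may vary slightly across Zig versions):

const std = @import("std");

test "byte-slice analogues of the easy Java methods" {
    const s = "héllo wörld";
    try std.testing.expect(std.mem.eql(u8, s, "héllo wörld")); // equals
    try std.testing.expect(std.mem.order(u8, "abc", "abd") == .lt); // compareTo
    try std.testing.expect(std.mem.startsWith(u8, s, "héllo")); // startsWith
    try std.testing.expect(std.mem.endsWith(u8, s, "wörld")); // endsWith
    try std.testing.expect(std.mem.indexOf(u8, s, "wörld") != null); // indexOf
    try std.testing.expectEqualStrings("héllo", std.mem.trim(u8, "  héllo  ", " ")); // trim
}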
I propose adding a rune type, which holds one unicode codepoint. The underlying storage mechanism isn't relevant to the programmer, who can only assume it's an int capable of holding a unicode codepoint.
To me, this just means it's a u32. The range of possible unicode codepoints isn't a mystery. It's 0-0x10FFFF, which is bigger than a u16 and small enough for a u32.
On platforms whose pointers are sufficiently sized, it should probably be a usize under the covers.
Wouldn't this be way too big on 64-bit platforms?
The underlying string encoding is also not important to the programmer, but some possible strategies include always using UTF-8 or UTF-32, or upgrading the encoding as necessary to fit the runes the user attempts to place in it.
I've seen these strategies done before, and they've all got their strengths and weaknesses. I've come to the conclusion that there is no such thing as a single best implementation of a unicode string, but rather countless subtle optimizations you can make to suit your different usecases. (This makes string implementations very similar to memory allocators in that regard.)
Zig can provide a general-purpose string implementation, but I don't like the idea of the standard implementation getting any special treatment that a homemade implementation can't get. Having an optimal string implementation is part of Zig's quest for optimality, and no single standard implementation can be optimal for every use case. This means that userland string solutions need to be first-class citizens.
String(byte[] bytes, Charset charset) and byte[] getBytes(Charset charset): What charsets should be supported? Just UTF-8? Maybe also ISO-8859-1? Maybe also Windows-1252? Maybe "all of them"? Maybe that's configurable at compile time? Or maybe there could be a dynamic library that provides these? Memory allocation is relevant here.
I would make encoding a separate concern from the rest of the string impl and put it in its own module. Not that it answers any of your questions, just a comment I have.
int compareToIgnoreCase(String str), boolean equalsIgnoreCase(String anotherString), String toLowerCase(), and String toUpperCase(): This requires a table of unicode points with data about each character. This would be a significant feature to provide, and we may want to provide a standard solution to this. Memory allocation is relevant to the last two methods here.
String toLowerCase(Locale locale) and String toUpperCase(Locale locale): I didn't know that uppercasing and lowercasing were sensitive to locale. Should Zig worry about this?
Most languages choose to only handle upper and lowercase for latin characters, which is the only commonly used set of characters for which it really makes much linguistic sense. In a Unicode implementation you'll find that human languages are really resistant to being implemented in software, and the standard library will probably have to concede to only handling the common cases, leaving exhaustive implementations of this and that to third parties.
int hashCode(): Implementations would be easy, but deciding on an implementation might be hard.
Zig should probably standardize a hashing strategy for all things, not just strings.
Wouldn't this be way too big on 64-bit platforms?
You're right, it should just be a u32.
Most languages choose to only handle upper and lowercase for latin characters, which is the only commonly used set of characters for which it really makes much linguistic sense. In a Unicode implementation you'll find that human languages are really resistant to being implemented in software, and the standard library will probably have to concede to only handling the common cases, leaving exhaustive implementations of this and that to third parties.
Here's an example corroborating your point. In JavaScript "ΣΣ".toLowerCase() == "σς". The same uppercase sigma lowers into two different lowercase sigmas, because there's a special character for a lowercase sigma at the end of a word.
This kinda makes me want to not even bother with uppercase/lowercase at all, not even for the ascii characters, just so no one is expecting things to work when they don't. Either that, or offer toUpperCase and toLowerCase just for u8s, and possibly even explicitly say it's just for ascii, like asciiToUpperCase(). This could be useful for hexadecimal representations, for example.
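That is roughly the route today's std.ascii takes: case mapping is offered for ASCII only, with the namespace making the limitation explicit. A small sketch (assuming current std.ascii names):

const std = @import("std");

test "ascii-only case mapping" {
    try std.testing.expectEqual(@as(u8, 'A'), std.ascii.toUpper('a'));
    // handy for hexadecimal output, as mentioned above
    var buf: [8]u8 = undefined;
    try std.testing.expectEqualStrings("DEADBEEF", std.ascii.upperString(&buf, "deadbeef"));
}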
I propose adding a rune type
I also propose adding a str type
I'm not convinced that this is a language change rather than a standard library feature.
If this isn't implemented right, there can be lots of future pain.
https://docs.python.org/release/3.2/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
There are lots of misunderstandings about unicode codepoints. They are not "characters" in general; actual on-screen symbols/glyphs (grapheme clusters) are variable codepoint-length.
The following operations were mentioned:
When I refer to u8 sequences I mean utf8 encoded strings. I do see the value of a type that witnesses a valid utf8 encoded string, and which supports various codepoint decode utilities, and maybe grapheme cluster splitting etc.
This thread just gets better with time.
Can we get some clarification on how runes and strings could be implemented in the standard library, and how that will relate to language-level features like string literals?
The distinguishing property of a string, relative to a byte array, is that it always represents a valid sequence of Unicode code points, as interpreted using some (typically unspecified) text encoding. An actual string type allows you to express that with the static type system. If you use a byte array instead, then you need to validate the byte array at runtime before you can do anything with it. And then you need to validate it again, in the next function that does something with that byte array.
Obviously you wouldn't use an actual string type at an external boundary, e.g. the command line, where it would be invalid to assume properly encoded UTF-8 text. But a proper string type allows you to validate an untrusted byte array, and then (conditional on validation) use the new value (of type string) at any internal boundary where a trusted string is required.
So how would an actual string type be implemented? My best guess is that it would be (a pointer to?) an array of Runes, and that a Rune would be an opaque type that can only have values in the range [U+000000, U+10FFFF]. I think this could be enforced by e.g. exposing a function that accepts a 32 bit integer and returns either a Rune or an error, depending on the value of the integer.
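A rough sketch of that enforcement (Codepoint and fromInt are invented names; today's std simply uses u21 for codepoints, so the only addition here is the explicit range check):

const std = @import("std");

// Hypothetical wrapper type; callers are expected to go through the checked constructor.
const Codepoint = struct {
    value: u21,

    fn fromInt(x: u32) error{InvalidCodepoint}!Codepoint {
        // Reject values above U+10FFFF; surrogates could be rejected here too.
        if (x > 0x10FFFF) return error.InvalidCodepoint;
        return .{ .value = @intCast(x) };
    }
};

test "range check" {
    _ = try Codepoint.fromInt(0x10FFFF);
    try std.testing.expectError(error.InvalidCodepoint, Codepoint.fromInt(0x110000));
}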
Is that the intended path forward? If so, will string literals represent bytes or Runes? Will hex escapes that don't map to a valid Rune (e.g. \xFF) be removed from the language?
@CantrellD It's already possible to implement comptime initialization/validation of UTF-8 text, like:
fn u(comptime s: []const u8) Utf8String {
    return Utf8.stringFromUtf8(s) catch unreachable;
}

test "format and print" {
    // æ (U+00E6)
    // Utf8.print is an enhanced print
    try Utf8.print(stdout, "{}", .{"æ"}); // ok
    try Utf8.print(stdout, "{}", .{"\xC3\xA6"}); // ok
    try Utf8.print(stdout, "{}", .{"\xE6"}); // runtime error
    // new "z" specifier for arbitrary bytes (not NUL-terminated: "s")
    try Utf8.print(stdout, "{z}", .{"\xE6"}); // ok
    try Utf8.print(stdout, "{s}", .{"\xE6"}); // ok
    var s = "\xE6";
    try Utf8.print(stdout, "{}", .{s}); // runtime error
    try Utf8.print(stdout, "{z}", .{s}); // ok
    try Utf8.print(stdout, "{s}", .{s}); // ok
    var s1 = u("\xE6"); // comptime error
    var s2 = try Utf8.stringFromUtf8(s); // runtime error
    var s3 = Utf8.stringFromUtf8Unchecked(s); // risky
}
@iology Please excuse my ignorance, but are Utf8String and Utf8 already available in the standard library, or is that just an example? I tried to find them, but failed.
I ask in part because it isn't clear to me that you can instantiate Utf8String with a validated runtime value, which is an important use case.
@CantrellD Yes, just example code. A lot can be learned from the Rust stdlib.
btw, correction to my example:
// these are all runtime behaviors, unless you `comptime print(...)` or forbid "{}" for []u8 at comptime
Utf8.print("{}", .{"æ"}); // ok
Utf8.print("{}", .{"\xC3\xA6"}); // ok
Utf8.print("{}", .{"\xE6"}); // error
I ask in part because it isn't clear to me that you can instantiate Utf8String with a validated runtime value, which is an important use case.
While how we get a validated runtime value is still unknown (of what type the value is?), maybe you need a new builtin function like @toUtf8StringUnchecked or if possible simply @bitCast(string, validated_but_structure_unknown).
Is that the intended path forward? If so, will string literals represent bytes or Runes? Will hex escapes that don't map to a valid Rune (e.g. \xFF) be removed from the language?
I think no, otherwise UTF-8-encoded raw identifiers would naturally be allowed and breaking changes between versions of the Unicode standard would not be a concern for Zig. (#3947) edit: I guess there is unlikely to be a full-featured UTF-8 module in the standard library.
Though not a big problem, restriction on \xHH will make it inconvenient for initializing byte strings or [ASCII-compatible encoding inserted here] strings.
While how we get a validated runtime value is still unknown (of what type the value is?), maybe you need a new builtin function
I'm not sure you do need a new builtin function, actually; I think it may be sufficient to define a library which exports (a) an opaque type called string, (b) a function that transforms untrusted byte arrays into strings (or else fails, if the byte array is invalid), and (c) a set of fundamental functions for string processing. You'd need to avoid instantiating invalid strings within that library, but outside the library I believe it would be impossible to do so.
That assumes that you want a string type, to enforce the weird rules that Unicode tries to create for what a valid sequence of codepoints should look like. Regardless, I believe you'd need an opaque Rune type, probably defined in roughly the same way I just described, to restrict the range of values that can exist for individual codepoints.
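A minimal sketch of (a) plus (b), assuming the current std.unicode helpers (Utf8String and fromBytes are invented names; true opacity would also require not exposing the field outside the defining module):

const std = @import("std");

pub const Utf8String = struct {
    bytes: []const u8, // invariant: always valid UTF-8

    // (b): the only constructor validates the untrusted bytes.
    pub fn fromBytes(untrusted: []const u8) error{InvalidUtf8}!Utf8String {
        if (!std.unicode.utf8ValidateSlice(untrusted)) return error.InvalidUtf8;
        return .{ .bytes = untrusted };
    }

    // (c): fundamental operations can then skip re-validation.
    pub fn codepointIterator(self: Utf8String) std.unicode.Utf8Iterator {
        return std.unicode.Utf8View.initUnchecked(self.bytes).iterator();
    }
};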
Though not a big problem, restriction on \xHH will make it inconvenient for initializing byte strings or [ASCII-compatible encoding inserted here] strings.
Given that strings and bytestrings are different things, I think it's more reasonable to have distinct syntax for representing bytestrings. The hex escapes aren't safe in normal string literals, but they'd be fine in bytestring literals.
As it is, you can create "string" literals that aren't actually valid strings. I'm not aware of any other language (aside from Python, and probably C) that allows that.
Edit: I've been using "rune" as a synonym for "codepoint" to align with the terminology in the original post, but I finally looked it up, and I think it might be an alias for u32 that golang invented. So, it's probably better if I stop using it. Apparently static type safety for unicode strings is less ubiquitous than I thought.
to enforce the weird rules that Unicode tries to create for what a valid sequence of codepoints should look like
Any sequence of codepoints never generates invalid code units (u8/u16/u32), and the rest all depends on locales and fonts in use, thus it is not a real issue for Zig as long as these functions are not required in the standard library.
I'm not sure you do need a new builtin function, actually; I think it may be sufficient to define a library which exports (a) an opaque type called string, (b) a function that transforms untrusted byte arrays into strings (or else fails, if the byte array is invalid), and (c) a set of fundamental functions for string processing.
This is how I view it: (a) plus (b) gives a "builtin" function, although not in the form @builtin. (a) plus (c) gives more builtins. This is because I assume that you also need indexing support, s[index]. So it is an opaque type with indexing syntax support. (Operator overloading is not supported at the moment, and maybe never will be.) edit: excuse me, having special syntax is already builtin support; a library is unable to invent new syntax. My brain needs a cool drink.
Given that strings and bytestrings are different things, I think it's more reasonable to have distinct syntax for representing bytestrings.
I'm completely fine with myComptimeCStrGeneratorThroughDoubleEscape("\\xHH\\xHH\\xHH"). No special syntax.
All the discussion from people above boils down to these questions:
1. How much unicode support do you need on the syntax level? (source file already requires UTF-8 encoding)
2. Is efficient handling of unicode text impossible without compiler support? (simple jobs can be covered by std.mem through manipulating bytes, so only special treatment needs to go into the Unicode module)
3. Where and how often do you really need it? Do you just need a standard implementation in the stdlib? (other than initializing string literals in a special syntax plus comptime validation, why should it work differently?)
There are already some good arguments above. (2) is the most interesting, but I'm not a compiler expert.
Any sequence of codepoints never generates invalid code units (u8/u16/u32), and the rest all depends on locales and fonts in use, thus it is not a real issue for Zig as long as these functions are not required in the standard library.
By "weird rules" I meant e.g. codepoints in the range [U+D800, U+DFFF] being reserved exclusively for UTF-16 encoded text. I don't know if Zig will ever care about those rules; I was just noting that they exist.
This is because I assume that you also need indexing support s[index]. So it is an opaque type with indexing syntax support.
If indexing support is needed, then AFAICT it's not possible to implement a string type with static type safety (for UTF-8 validity) in the standard library. If that's the case, then why was this issue closed? Am I misunderstanding the original proposal?
I'm completely fine with myComptimeCStrGeneratorThroughDoubleEscape("\\xHH\\xHH\\xHH"). No special syntax.
I didn't mean to suggest that bytestring literals are necessary, only that they're an option, if restricting string literals to valid UTF-8 is otherwise too inconvenient.
All the discussion from people above boils down to these questions
I'm not sure those cover the most central question of this issue, as I understand it:
Will Zig (or the standard library) expose a string type that allows you to prove (with the static type system) that a runtime string value has already been validated?
By "weird rules" I meant e.g. codepoints in the range [U+D800, U+DFFF] being reserved exclusively for UTF-16 encoded text. I don't know if Zig will ever care about those rules; I was just noting that they exist.
Their literal form is already forbidden in source code, but the escaping (\u{D800}) is currently allowed for byte strings. They don't do any more harm than other code points except that they may be rejected by other libraries for the same reason.
Am I misunderstanding the original proposal?
I'm not sure if the OP had thought about your ideas. Isn't s[i] just syntactic sugar for s.byteAt(i) or s.codePointAt(i) (which may return an error, and may return a reference or a copy of the value)? My concern about builtins is only about the explicitness denoted by that @ symbol, so please never mind it.
Not my expertise. Emphasis added for others. For some real code you can look at std.unicode.Utf8View.initComptime
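For the curious, a two-line usage sketch of that API (behavior as of recent std versions): validation happens at compile time, so an invalid literal is a compile error rather than a runtime check.

const std = @import("std");

test "comptime-validated literal" {
    const view = std.unicode.Utf8View.initComptime("héllo"); // validated at compile time
    _ = view;
    // const bad = std.unicode.Utf8View.initComptime("\xE6"); // would fail to compile
}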
Their literal form is already forbidden in source code, but the escaping (\u{D800}) is currently allowed for byte strings.
...Huh. The status quo actually makes way more sense if I interpret everything as a bytestring, as opposed to a unicode string that Zig represents using bytes for some reason. It is of course reasonable to allow arbitrary bytes in bytestrings; I'm just slow.
I'm not sure if the OP had thought about your ideas.
Yeah, that's entirely possible. I don't think of them as my ideas, but I'm also not sure how prevalent they are outside of OOP. I shouldn't make assumptions.
Not my expertise. Emphasis added for others. For some real code you can look at std.unicode.Utf8View.initComptime
Totally fair, and appreciated. Thank you.