Zig: Add a "string" type alias for "[]const u8"

Created on 5 Jul 2020 · 8 Comments · Source: ziglang/zig

From Zig's documentation:

    // Zig has no concept of strings. String literals are const pointers to
    // arrays of u8, and by convention parameters that are "strings" are
    // expected to be UTF-8 encoded slices of u8.
    // Here we coerce [5]u8 to []const u8
    const hello: []const u8 = "hello";
    const world: []const u8 = "世界";

The type []const u8 is 10 characters long, and there are roughly 1000 occurrences of it in std alone. In most cases it means a UTF-8 encoded string; in the rest it means a buffer of bytes (u8) to read from. It's probably not very hard to work out the meaning from context (the surrounding function or struct name), but it takes a bit of effort. A type alias for []const u8 called string (or text, or something else) could be added to convey the type of the entity unambiguously. It would also be shorter and easier to type. It seems appropriate to use shorter names for types that are used extensively (u32 instead of uint32, f32 instead of float, etc.).

proposal

All 8 comments

For consistency, wouldn't string really just mean []u8, with const string meaning []const u8? That reduces the typing benefit and makes it less clear what exactly a string is.

Windows loves u16.

    std::string     std::basic_string<char>
    std::wstring    std::basic_string<wchar_t>
    std::u8string   std::basic_string<char8_t>  // (since C++20)
    std::u16string  std::basic_string<char16_t> // (since C++11)
    std::u32string  std::basic_string<char32_t> // (since C++11)

I don't think it's in Zig's best interest to have a single type called string. There's ASCII, extended ASCII and UTF-8, for example, which all look like []u8 and must be treated differently. What I think we do need is a way to nicely annotate which string encoding is being used.

An actual string type can bring along a few extra features not present with []const u8:

  • methods that only make sense on strings
  • a field .is_valid_utf8: ?bool
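A rough sketch of what such a type might look like, assuming a recent Zig compiler. The names String, bytes and isValidUtf8 are hypothetical and not part of Zig or its standard library; only std.unicode.utf8ValidateSlice is an existing std function. The optional field starts as null ("not checked yet") and caches the result of the first validation.

```zig
const std = @import("std");

// Hypothetical wrapper type; every name here is illustrative, not a Zig builtin.
pub const String = struct {
    bytes: []const u8,
    // null means "not checked yet"; set on the first call to isValidUtf8.
    is_valid_utf8: ?bool = null,

    // A method that only makes sense on text: validate lazily and cache the answer.
    pub fn isValidUtf8(self: *String) bool {
        if (self.is_valid_utf8) |known| return known;
        const ok = std.unicode.utf8ValidateSlice(self.bytes);
        self.is_valid_utf8 = ok;
        return ok;
    }
};

test "validity is computed once and cached" {
    var s = String{ .bytes = "hello" };
    try std.testing.expect(s.is_valid_utf8 == null);
    try std.testing.expect(s.isValidUtf8());
    try std.testing.expect(s.is_valid_utf8.? == true);
}
```

Note that this is a struct, not an alias, so it would no longer coerce from a string literal the way []const u8 does.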

I was going to write the same as @Sobeston.

If a type alias will be introduced, then it makes sense to tie the alias to the encoding of the string:

    const hello: []const u8 = "hello";
    const hello: utf8 = "hello";

"utf8" conveys more information than "string" and is shorter to type, down to 4 characters now.

You can make such a type in user space. No need for it to be a language primitive.
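For illustration, a user-space alias could look like the sketch below, assuming a recent Zig compiler. The name Utf8 and the helper byteLen are hypothetical. The alias only documents intent: the compiler treats it as exactly []const u8 and enforces no encoding.

```zig
const std = @import("std");

// Hypothetical user-space alias; `Utf8` is an assumed name, not part of Zig.
// It is the same type as []const u8, so nothing about the encoding is checked.
pub const Utf8 = []const u8;

// The alias can be used anywhere []const u8 is accepted.
fn byteLen(s: Utf8) usize {
    return s.len;
}

test "alias is interchangeable with []const u8" {
    const hello: Utf8 = "hello";
    const raw: []const u8 = hello; // no conversion needed: same type
    try std.testing.expectEqual(@as(usize, 5), byteLen(raw));
    try std.testing.expect(Utf8 == []const u8);
}
```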

Sure. But it would be nice if everyone picks the same name for the same concept, for example:

str8 for a UTF-8 encoded string (str16 for UTF-16LE?, and str32 for UTF-32)

This is very unlikely to happen if Zig doesn't pick a name and simply lets people define their own type aliases for []const u8/u16/u32.

Some people would stick with []const u8, others might use C++'s names, and some (like me, I guess) would just call it string and pretend that there's only UTF-8 or something...

Regardless, []const u8 can mean either a UTF-8 string or a source of bytes, and not every source of bytes is a UTF-8 string.
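A small sketch of that distinction, assuming a recent Zig compiler: both values below have the type []const u8, but only one of them passes std.unicode.utf8ValidateSlice. The variable names are illustrative.

```zig
const std = @import("std");

test "not every []const u8 is valid UTF-8" {
    // Same type, different meaning: text versus arbitrary bytes.
    const text: []const u8 = "hello";
    const raw: []const u8 = &[_]u8{ 0xff, 0xfe, 0x00 };

    try std.testing.expect(std.unicode.utf8ValidateSlice(text));
    // 0xff can never appear in well-formed UTF-8, so this buffer is rejected.
    try std.testing.expect(!std.unicode.utf8ValidateSlice(raw));
}
```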
