Zig: consider allowing non-ascii identifiers

Created on 19 Dec 2019  ·  69 Comments  ·  Source: ziglang/zig

@Serentty writes in https://github.com/ziglang/zig/issues/663#issuecomment-565856023 :

Non-ASCII identifiers are a very important feature to me. For code which isn't meant to be published for an English-speaking audience, I regularly use identifiers which can't be represented in ASCII. The current “solution” of enforcing only ASCII in identifiers is very anglocentric.

I'm opening this issue to be a discussion area about possibly allowing non-ascii identifiers in Zig.

In Go, identifiers can be made of Unicode code points classified as "Letter" or "Number, decimal digit". I don't know how difficult that would be to program into Zig and specify in Zig's grammar spec. Java and JavaScript have similar identifier specifications. Would Zig having rules like that be valuable?

In Zig, you can make any sequence of bytes an identifier if you use an extra 3 characters for each identifier, e.g. var @"你好" = @"世界"();. This is pretty painful if you're doing this for literally every identifier, but it's at least something.

Backticks are not used in Zig's grammar today, so perhaps we could shorten the 3 characters to 2 like so:

var `你好` = `世界`();

This looks a bit nicer, mimics SQL identifier escaping, and is much simpler to implement than anything to do with unicode character properties. Would this be a meaningful improvement over @"你好"? (This proposal has some details to iron out, but I'd like to get a sense for if this would even be helpful.)

proposal

Most helpful comment

One benefit of status quo that I am reluctant to give up is that the definition of zig tokenization is finished. Aside from language changes before 1.0.0, tokenization is immortal and unchanging; already in its final form. 100% stable.

A dependency on Unicode tables means more than just the chore of implementing and maintaining support. It means the zig language itself depends on a third party standard that changes independently of Zig, without warning, and without any decision-making power from the Zig community.

I would be interested to explore what it might look like to support non-ascii identifiers without any knowledge of Unicode. For example, the naive approach of allowing any sequence of (non ASCII) bytes as an identifier. Some downsides I can think of:

  • no normalization support (utf 8 encoded bytes must match)
  • weird identifiers are possible using non ascii spaces or other surprising characters

That is all I can think of. Upsides would be that the tokenization rules would remain trivially simple, and zig compilers and tools could remain correct, regardless of Unicode standards changing over time, and regardless of whether they have access to libraries such as ICU.

All 69 comments

It would need to follow unicode TR31 or possibly even TR46. Note that the unicode tables required are non-trivial in size.

I think that Rust's approach is a very good one. It determines what is and isn't allowed in identifiers based on whether or not characters have the XID_Start and XID_Continue Unicode properties. It also normalizes all identifiers using NFC before comparing them, so differences in normalization between source files can't lead to identifiers failing to match. Finally, it forbids any unassigned code points (at the time of the release of the current version of the compiler) from being used in identifiers, since their properties are unknown.

@daurnimator It's true that any reasonable solution (other than just a free-for-all) would require including some Unicode property tables with the compiler. We're probably looking at maybe 50 KiB of data for this. If this is truly a size concern, making it an optional component could be possible. However, I suspect that these tables will quickly get dwarfed in size by other components of the toolchain.

Finally, I want to address the backtick proposal. Personally, this doesn't feel like it would be that pleasant to use. It seems more like a way to interface with existing identifiers that you absolutely _must_ use than a way to deal with identifiers day-to-day.

In comparison, simply checking Unicode properties isn't really that hard to implement. The largest concern I think would be the size of the tables, not the code that needs to look characters up in those tables. Rust is taking so long on this issue because before they stabilize it, they want to implement dozens of lints to warn users about similar-looking characters, mixed script identifiers, and so on. Depending on whether or not those things are seen as a priority, this could be anywhere from a simple fix to a huge project. Personally, I lean more towards not caring about confusable identifiers, unlike the Rust team. If your team members are screwing with you by replacing random letter As in your identifiers with Cyrillic, you need to find a better team.
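For a sense of what that lookup code amounts to: a property check is typically just a binary search over a generated table of codepoint ranges. Below is a minimal sketch in Zig, assuming a hypothetical, hand-abbreviated XID_Start table; the ranges shown are illustrative fragments only, and a real table generated from the Unicode Character Database would contain several hundred entries:

const Range = struct {
    lo: u21,
    hi: u21,
};

// Illustrative fragments only; a real XID_Start table would be generated
// from the Unicode Character Database and would be much larger.
const xid_start_ranges = [_]Range{
    Range{ .lo = 0x0041, .hi = 0x005A }, // A-Z
    Range{ .lo = 0x0061, .hi = 0x007A }, // a-z
    Range{ .lo = 0x0391, .hi = 0x03A1 }, // part of the Greek alphabet
    Range{ .lo = 0x4E00, .hi = 0x9FFF }, // CJK Unified Ideographs (partial)
};

fn isXidStart(cp: u21) bool {
    // Binary search over sorted, non-overlapping ranges.
    var lo: usize = 0;
    var hi: usize = xid_start_ranges.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const r = xid_start_ranges[mid];
        if (cp < r.lo) {
            hi = mid;
        } else if (cp > r.hi) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}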

I think using characters other than basic Latin will make it harder to reuse code.

@Rocknest This is something that I've seen come up again and again when programming languages start to discuss how to handle identifiers. However, it has never been a convincing argument to me. Now, that's not to say that having identifiers in say, the Greek alphabet, won't make it harder for people to use some library you've written. That's essentially a given. Rather, what doesn't convince me is that this is a sufficient reason to disallow such identifiers. A programming language, to put it simply, is not your mom. It's up to you who the target audience for your code is, and how you want to present it to them.

But let's say that you don't agree with that. Let's say you want to encourage everyone to use English identifiers in their code to improve global code reuse, so you design your language to enforce only ASCII identifiers, since that will encourage people to use English. Well, in my experience, this simply doesn't work. When someone wants to use identifiers in a certain (human) language, they do, no matter what characters the (programming) language lets them use. I've heard from a Japanese developer friend of mine (and I have seen for myself in codebases that I have inspected), that when people are forced to use ASCII identifiers, what they end up doing is writing identifiers in their preferred language, but filtered through the most inconsistent, ad-hoc, ugliest romanization schemes that you have ever seen. Ultimately, this does more to hurt the readability of code than it does to help it.

var сумма: i32 = 0;
for (массив) |икс| {
    сумма += икс;
    if (икс < 3) {
        continue;
    }
    break;
}

Is this going to affect the speed of zig fmt?

Currently zig fmt feels laggy compared to go fmt, but I don't know how many inefficiencies there are at the moment, and this seems to me a more universally relevant point than whether a relatively small group of people should be forced to use English identifiers or not.

What engineering implications come with this proposal?

Is this going to affect the speed of zig fmt?

Go supports Unicode identifiers and go fmt is super fast.

@kristoff-it

This shouldn't affect the speed of formatting at all in any noticeable way. It still just has to search for brace and whitespace characters, which will be encoded exactly the same way as before. The parser already handles multibyte UTF-8 sequences for the sake of comments and string literals, so this wouldn't slow that down either.

Also, I strongly disagree with your conclusion about only a small number of people being “forced to use English identifiers”. As I mentioned earlier, there are vast swaths of the world where English is not widely spoken by programmers, which includes much of East Asia. This isn't just a matter of personal preference in those places, unlike it would be in most of Europe, where English is required for most programming jobs.

@Serentty I've seen and worked with Java code with cyrillic identifiers. And it's AWFUL. You have to switch keyboard layouts every few seconds or do copy-paste madness, and that reduces coding speed to the point of being impractical outside educational purposes. And ad-hoc romanization arises naturally, making unicode identifiers obsolete.

The whole problem with ad-hoc romanization is that it is by its nature inconsistent. But I think ultimately, arguments about whether each of us would rather work with such identifiers or not are a distraction from the actual problem at hand. I've never agreed with the thinking that if you don't like something, you should stop other people from doing it.

@Serentty I said that this feature is devoid of practical use. And I think that it would not solve anything, so why add something useless to the language?

By the way, in Zig hard tabs are not allowed, so your last argument probably does not apply here either.

binary size

Binary size of the unicode data is not a factor here. 50K installation size is nothing compared to the rest of the payload.

@Rocknest

I said that this feature is devoid of practical use.

You yourself said that you have seen people using Cyrillic identifiers. So clearly there is a use for this. You might not like it, sure, but it is indeed something that many people use.

By the way, in Zig hard tabs are not allowed, so your last argument probably does not apply here either.

I'm not a fan of that decision either, but regardless, enforcing a certain indentation style is nowhere near as serious of an issue as the cultural implications of enforcing the English subset of the Latin script.

I support non-ascii identifiers without backticks.
Different countries have different cultures. Many words in one language cannot be translated into another language accurately. For example, in Chinese RPG games, we often use pinyin (almost all ASCII characters; we can guess what they mean from pronunciations and contexts, though not easily) as identifiers for skill books or character actions when there is no Chinese identifier support. We do not use English, because almost nobody can understand the meaning of Nine Yin Bone Claw translated from 九阴白骨爪 in Chinese.
If we are lucky enough to get a chance to build a game using C#, most identifiers in the battle logic will be written directly in Chinese.

How about a keyword, e.g. use_identifier greek, latin (just as an example), at the top of the file? This way you could make sure that it would be quite consistent and only use the required tables. Is a 'state per file' thing so bad in these cases?

It also normalizes all identifiers using NFC before comparing them, so differences in normalization between source files can't lead to identifiers failing to match.

Normalization sounds like a job for zig fmt if possible.

I think it's possible to allow non-ascii identifiers while still achieving my goals in #663 of making valid zig source code easy to process by naive tools. If identifiers are guaranteed to be normalized and validated in order for your code to compile, then a naive parser could consider any non-ascii bytes outside comments and string literals to be identifier characters.

Maybe normalization isn't necessary, but it seems like a nice feature to include in the proposal. Does normalization require the full libicu dependency, or is that possible to do with just some data tables?

Also, I strongly disagree with your conclusion about only a small number of people being “forced to use English identifiers”. As I mentioned earlier, there are vast swaths of the world where English is not widely spoken by programmers, which includes much of East Asia. This isn't just a matter of personal preference in those places, unlike it would be in most of Europe, where English is required for most programming jobs.

@Serentty

I'm from Italy, I've worked for a while in SEA, and I've found that people there, at least in tech circles, can deal with English much better than the average Italian developer (there = Singapore, Malaysia, Thailand).

My point is that I think it's an exaggeration to say that allowing non-ASCII identifiers is going to have that big of an impact. They will still have to deal with ASCII identifiers, like today, from all the libraries that they use, so it's not like they're going to have the option of not learning to read/write English symbol names. This is only going to add freedom to the identifiers they create themselves, which is nice, but not that huge of a change in the daily life of the average SEA developer.

On the other hand, the arguments about lowering code reusability are totally moot in my mind. People that don't want to worry about others reusing their code will find a way to make it hard to understand anyway, like you pointed out, and conversely I think a library that uses Japanese identifiers internally, but provides an English interface and documentation, should still count as reasonably universal code.

So in conclusion I think the whole "programmer's freedom" vs "code universality" dichotomy doesn't really expose all the important, practical questions that should be explored first.

A few examples:

  • Is this going to complicate the compiler and impact performance vs not doing it?
  • Is this going to make it harder for people to contribute to the compiler & related tooling?
  • Is this going to make it harder for people to develop external tools that deal with Zig source code?
  • Is this going to make the docs slower to load and navigate on non-beefy machines?

And for each question: If so, by how much?

Additionally:

  • Is this going to cause some users to encounter weird/unexpected behavior?

For example, referring back to Rust's strategy, people might want to map HashMaps from databases to structs, and they might feel confused and disappointed when they discover that their seemingly normal identifiers sometimes don't map correctly because of different normalization choices between the various tools involved (zig, database, other clients).

To me this example doesn't even seem that hypothetical, because I wanted to cover the mapping use case in my Redis client. With the current setup, when somebody wants to map any non-trivial identifier, they can do so with @"", like with @"stream-remote" in this example:
https://github.com/kristoff-it/zig-okredis/blob/master/REPLIES.md#adding-types-for-custom-commands-lua-scripts-or-redis-modules

A good property of the latin alphabet is that it's simple, much simpler than all the alternatives, and that results in some nice properties, more often than not.

If the downsides are minor, I see no reason to prevent people from using the symbols they like most, but we should really start with a thorough and practical investigation of that first, and leave other concerns for second. IMO.

Also, I almost forgot, people in Asia are going to use Zen anyway, no?

:laughing:

@thejoshwolfe

Normalization sounds like a job for zig fmt if possible.

Normalisation is the compiler's job. The symbol table needs normalised identifiers to properly match them. zig fmt may pre-normalise code, so that the compiler can run the fast-path normalisation check instead of doing actual text transformations. And even where normalisation of symbols is needed, they are 99% of the time so small that this task is done rather quickly.

Personally, I think that switching to backticks is the best solution here. Such a syntax (presently @"identifier") is needed anyways to avoid namespace clashes, `identifier` is easier to read / harder to mistake for a string if you don't notice the @ (e.g. if you're already dealing with strings because the type being assigned to is a string), and from the OP it seems people find that easier to type.

I don't mind introducing this shorter syntax for escaped identifiers, but I don't see it as being a solution to the problem here. Such identifiers are essentially second-class.

A good property of the latin alphabet is that it's simple, much simpler than all the alternatives, and that results in some nice properties, more often than not.

This is just patently untrue. While the Latin alphabet is nowhere near the most complicated, it's not even close to being the simplest. It has many properties, such as having two versions of each letter for a case distinction, that most writing systems eschew. I'd rather not deal in rationalizations here. The (English subset of) the Latin alphabet is privileged in computing because of the historical influence and dominance of American and British computer companies, not because of any inherent properties that make it better for computers. This is, in my opinion, entirely a legacy issue.

Personally, I think that switching to backticks is the best solution here. Such a syntax (presently @"identifier") is needed anyways to avoid namespace clashes, `identifier` is easier to read / harder to mistake for a string if you don't notice the @ (e.g. if you're already dealing with strings because the type being assigned to is a string), and from the OP it seems people find that easier to type.

The @"......" syntax (or something like it) will need to remain to be able to refer to non-normalised symbols (assuming that unicode identifiers would go through e.g. NFKC) as well as symbols with e.g. spaces in them.

@daurnimator I was specifically stating that only the syntax should change, not the semantics.

Another option would be to have all identifiers in backticks, but that just trades one problem (treating non-ASCII identifiers as second class) for another (all identifiers are a PITA to type).

Upon further reflection, I think this is worth it, with the caveat that if there's a performance impact in code that would work without Unicode support, I'd prefer an option to assert that all identifiers are ASCII and avoid the checks (maybe a compile-time option - CMake setting for stage1 - instead of runtime).

@pixelherodev I've never seen any correlation between compile times and support for non-ASCII identifiers. I would be greatly surprised if there's a noticeable impact at all on codebases of non-trivial size. Either way, this is something that can't really be known until after support for such identifiers is implemented. If an all-ASCII pragma speeds things up, it can always be added.

@pixelherodev I think you should have a look at the tool ripgrep, which is damn fast no matter whether or not you look for full Unicode text. There is no performance impact worth noting.

Edit: Btw., even the dinosaur C++98 has full Unicode identifiers. (Though compiler devs mostly started implementing that beginning with C++11.)

I've been reading up on the ID_Start/ID_Continue character classes and on normalization, and it seems non-trivial. This isn't as simple as a table lookup for each codepoint. I'm starting to think we would either need to link against ICU just to tokenize identifiers, or write up an implementation of the subset of the unicode standard we need in Zig, which I'm estimating so far is at around 10,000 lines of code plus the data tables. That's just to support normalization and the character classes we want. :grimacing:

It's against the spirit of Zig to have a solution that's "good enough", like supporting normalization for any characters except Hangul syllables. If we do anything with Unicode, we've got to go all the way, whether that means writing and maintaining our own complete implementation of the feature or just linking against ICU. Are we willing to link against ICU in the compiler and in zig fmt?

Maybe I'm being naive, having basically never worked with unicode identifiers, but why decode the unicode at all? Why not just compare if they're the same byte values? Or maybe some minimal checks that the data isn't entirely malformed.

I know I should just let the adults talk here, but I think a simple-to-grasp explanation of why this part of the discussion is even happening would be useful for us unicode noobs :)

Is normalization that common an issue that we'd need to solve for it to make things usable? I have to normalize case, and the funky quote characters I copy paste out of websites/word documents, and the compiler doesn't have to do much in order to force me to do that besides be case sensitive.

I forget where I read it or if it was in a video, but I thought I remembered akelley talking about zig never having to decode unicode in order to support it, and that being a good thing. Still seems like a good goal to me, unless we're forced to due to unicode having absolutely horrid usability in some languages for some reason?

@thejoshwolfe On Linux I wouldn't say linking against ICU is a big deal because pretty much any system will have it installed already, but in the Windows and Mac worlds where you need to bundle all of your dependencies, I think it would be good to look into ICU's “Data Customizer”. You can drastically reduce the size of the library by stripping out unneeded tables, which in Zig's case would probably include stuff like case mapping/folding, character directionality, and so on. Really all that is needed is the tables of canonical normalization forms, and the (X)ID_Start/Continue properties.

In an ideal world the OS would do this for you, but it looks like the Windows API function to normalize a string is still based on Unicode 4.0.

One benefit of status quo that I am reluctant to give up is that the definition of zig tokenization is finished. Aside from language changes before 1.0.0, tokenization is immortal and unchanging; already in its final form. 100% stable.

A dependency on Unicode tables means more than just the chore of implementing and maintaining support. It means the zig language itself depends on a third party standard that changes independently of Zig, without warning, and without any decision-making power from the Zig community.

I would be interested to explore what it might look like to support non-ascii identifiers without any knowledge of Unicode. For example, the naive approach of allowing any sequence of (non ASCII) bytes as an identifier. Some downsides I can think of:

  • no normalization support (utf 8 encoded bytes must match)
  • weird identifiers are possible using non ascii spaces or other surprising characters

That is all I can think of. Upsides would be that the tokenization rules would remain trivially simple, and zig compilers and tools could remain correct, regardless of Unicode standards changing over time, and regardless of whether they have access to libraries such as ICU.
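As a rough sketch of what that naive approach could look like in the tokenizer (this is not the actual Zig tokenizer, just an illustration of the rule described above), every byte with the high bit set is simply treated as an identifier character, and no Unicode tables are consulted:

fn isIdentifierByte(c: u8) bool {
    return switch (c) {
        'a'...'z', 'A'...'Z', '0'...'9', '_' => true,
        // Raw-bytes approach: any non-ASCII byte continues an identifier.
        0x80...0xFF => true,
        else => false,
    };
}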

I could accept the naïve solution if avoiding a dependency on changing standards is deemed a big priority. The same approach is the de facto solution for Linux and Windows filesystems: two files can have names that only differ in their normalization form. Only Apple normalizes filenames as far as I'm aware. I would rather see this than no support for non-ASCII identifiers at all. The _only_ things I would consider incredibly important are that the compiler makes sure that identifiers are valid UTF-8, regardless of which characters they might contain, and that these identifiers can be used without backticks or the like. The first is trivial and fast to do, and it fits with the goal of ensuring UTF-8 for source files. The latter is because backticks don't really make sense to me as a concept. Chances are a project will either want to allow such identifiers or not, and backticks simply make it harder for the people who want them to use them, while not preventing them in projects that don't want them.

There are a lot of "modern", fast and compact C/C++ libraries with 1-3 files.
E.g. utf8proc (C, 3 files) or ext-unicode-db (C++, single-include, no normalization), etc.

In an ideal world the OS would do this for you, but it looks like the Windows API function to normalize a string is still based on Unicode 4.0.

The more OS dependencies, the harder zig becomes to bootstrap into a new arch/OS. If it's going to realize the dream of replacing C, I imagine it is going to have to be as easy or easier to get onto bare metal as C is. Maybe what I'm talking about could be mitigated by allowing a bootstrap subset of the language, but it seems that zig should be careful to avoid all dependencies it can. Relying on any OS facilities beyond posix sounds like a bad idea to me.

Sure, I just think that operating systems should provide a lot more in terms of Unicode-related functionality, since it's something that nearly all programs should be using and shouldn't have to haul along with them. Either way, the OSes don't provide that, so it's not worth fretting over for Zig anyway.

@kavika13 Take, for example, the letter Å. It can be encoded in at least three ways. As a single letter U+00C5, as the physical unit symbol U+212B, or as merely A with a combining ° U+0041 U+030A. And unlike cyrillic vs. latin a, all three are meant to be the very same character. Normalisation transforms all three of the above into one form, be it U+00C5 for recomposing or U+0041 U+030A for decomposing normalisation. Different keyboard layouts have different strategies for how they encode characters. And different operating systems have different ideas of how to fill gaps in underspecified layouts.

@ others: I like the raw bytes of UTF-8 approach. It's practical, it is fragile when it comes to cases like the above, but it Works™. Unicode is quite a complicated and complex moving target to keep up with.
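To make the Å example above concrete, here are the three encodings written out as UTF-8 byte sequences. This is a hedged sketch using a Zig test only to show the comparisons; under the raw-bytes approach, a compiler comparing identifiers with a plain memory comparison would treat these as three distinct identifiers:

const std = @import("std");

test "canonically equivalent, but different bytes" {
    const precomposed = "\xC3\x85"; // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    const angstrom = "\xE2\x84\xAB"; // U+212B ANGSTROM SIGN
    const decomposed = "\x41\xCC\x8A"; // U+0041 'A' + U+030A COMBINING RING ABOVE

    // Without normalization, none of these byte sequences match each other.
    std.debug.assert(!std.mem.eql(u8, precomposed, angstrom));
    std.debug.assert(!std.mem.eql(u8, precomposed, decomposed));
    std.debug.assert(!std.mem.eql(u8, angstrom, decomposed));
}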

I think Zig can take the raw bytes approach. Every input method I've ever seen that wasn't some hackjob that somebody rolled themself has produced text in NFC. Other normalization forms may exist, but they're a rarity, and most filesystems do just fine for people all over the world without worrying about normalization.

However, assuming that the validity of the UTF-8 is still being ensured, it should be possible to forbid certain things which shouldn't be in identifiers, and where complete coverage is possible without having to update Zig according to new Unicode standards (a sketch of such a check follows the list).

  1. The C1 control characters. These are in the range directly above ASCII. When taken together with the ASCII control characters, they account for pretty much all of the control characters in Unicode, and it's unlikely that more will ever be added.

  2. The 66 noncharacters in Unicode. These are code points reserved for internal string processing in memory. The last two code points of each plane are noncharacters, as is a range in the BMP. This set is stable, and no new noncharacters will ever be added as a matter of policy.
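Both ranges can be hard-coded without depending on future Unicode tables: the noncharacter set is frozen by Unicode policy, and the C1 block is a fixed range. A minimal sketch of such a check (the function name is just for illustration):

fn isForbiddenInIdentifier(cp: u21) bool {
    // 1. C1 control characters, U+0080 through U+009F.
    if (cp >= 0x0080 and cp <= 0x009F) return true;
    // 2. The 66 noncharacters: U+FDD0 through U+FDEF, plus the last two
    //    code points of every plane (U+xxFFFE and U+xxFFFF).
    if (cp >= 0xFDD0 and cp <= 0xFDEF) return true;
    if ((cp & 0xFFFF) == 0xFFFE or (cp & 0xFFFF) == 0xFFFF) return true;
    return false;
}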

What are the consequences of allowing non-utf8 byte sequences?

@kristoff-it

It means that the source file itself is no longer valid UTF-8, which goes against the guarantee that a compiler for Zig will always be working with that encoding. If this is something that is already being enforced for string literals and comments, it would be very strange not to enforce it for identifiers as well, considering that any non-UTF-8 byte sequences in identifiers are more likely to be a mistake than intentional. We're talking about a language that disallows CR characters anywhere in your source code. I think making sure the text encoding is valid is in the same spirit as that.

Rather, I think the question we should be asking here is what the cost is of validating the UTF-8. The answer here is that it's close to zero. Unlike normalization, simply validating the encoding requires absolutely no tables. It simply involves shifting some bits around and keeping track of what the last few bytes were, so it has great cache locality.
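As a sketch of how small that check is: Zig's standard library already ships a table-free validator, so (assuming std.unicode.utf8ValidateSlice, which the standard library provides) enforcing this could look like:

const std = @import("std");

// Hedged sketch: reject any source text that is not well-formed UTF-8.
// utf8ValidateSlice walks the bytes with the kind of bit-level checks
// described above and needs no lookup tables.
fn validateSourceEncoding(source: []const u8) !void {
    if (!std.unicode.utf8ValidateSlice(source)) return error.InvalidUtf8;
}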

I think the raw utf8 approach is the best. Like the unix philosophy "do one thing well", zig's a compiler so the zig devs should focus on making a great compiler. If your team needs normalization, use a tool made for the job: cat test.zig | uconv -x '::nfkc;' | zig build-exe /dev/stdin.

To make this easier, maybe build.zig should have an option to attach a user supplied filter command that pre-processes source files before being handed to the zig compiler?

zig's a compiler so the zig devs should focus on making a great compiler. If your team needs normalization, use a tool made for the job

Maybe there is some value to splitting this conversation between unicode support in zig and in zig fmt?

I entirely buy this argument for zig, but zig fmt is a code normalization tool. It is not the compiler.

Maybe all of the criticism in the thread above still apply, though?

  • Do changes that improve zig fmt necessarily need to affect the core language definition?
  • Would linking the ICU into zig fmt cause too much instability, either dependency-wise or logically?
  • Would linking the ICU into zig fmt cause too much bloat (esp if we were somehow able to strip down our needed tables)?

I know they're the same executable right now for convenience. Maybe that convenience is too nice to give up. But maybe it's worth considering them to be logically separate?

If your team needs ... use a tool made for the job

One of the nice things about what Zig is intending, right now, is that the entire build toolchain is self-contained. Install zig, install your repo, you get everything you need and can build it for every target OS. No need to install cmake or meson or make (which doesn't come with windows) or whatever.

That might be an argument for making sure that mainstream cases (like unicode identifiers might be ... and unicode normalization might not be?) work without having to rely on OS facilities or third-party tools.

I entirely buy this argument for zig, but zig fmt is a code normalization tool. It is not the compiler.

One thing to note with the raw bytes approach though, is that zig fmt would sometimes not be able to normalize identifiers, because that might actually change the semantics of the program. If two identifiers are different in bytes but the same after normalization, this represents source code that zig fmt would actually not be capable of canonicalizing (without mangling one or both of the identifiers).

But I'm guessing people would probably want to enforce a rule where identifiers are normalized, and any resulting name collisions are errors to fix, or are mangled to make the distinction clear. (Mangled meaning, append _1 or _2 etc)

At first one would think this defeats the raw bytes approach, because if zig fmt has to do this then it has to ship with unicode tables and respond to Unicode updates. However, the important thing to note is that the Language Specification (#75) would not depend on Unicode. This would be the decision of zig fmt or other third-party code canonicalization tools. This is an important distinction because one of the goals of the Zig language is to allow third party implementations of the language. The raw bytes approach allows third party implementations of the language to not depend on Unicode, while still allowing non-ascii identifiers, and such third-party implementations would be compatible with canonicalized, normalized source code, even if they are not capable of doing such normalizations.

The raw bytes approach would mean that zig fmt has the option of providing an advanced feature such as unicode normalization of identifiers, but it would not be required to do such a thing in order to be a valid, useful tool.
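To make the potential collision concrete, here is a hypothetical example (identifier names and values are illustrative, written with today's @"..." escape syntax so the differing bytes are visible). Under the raw-bytes rule, both declarations are legal and distinct; a tool that normalized them to NFC would turn them into a duplicate declaration and change the meaning of the program:

// Both identifiers render as "café", but the bytes differ.
const @"caf\xC3\xA9" = 1; // 'é' precomposed: U+00E9 (bytes C3 A9)
const @"cafe\xCC\x81" = 2; // 'e' plus U+0301 COMBINING ACUTE ACCENT (bytes CC 81)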

Thinking about it some more, zig fmt would not be able to do Unicode normalization if the language accepted raw byte identifiers, because such normalization would possibly change the semantics of the program, and detecting this requires compilation and semantic analysis, whereas zig fmt only operates at the AST level.

So Unicode normalization would actually not be available, unless such a tool was willing to possibly break working valid zig code according to the language specification.

What to do about this, I am not sure. Perhaps this is something that can be left up to the job of a code editor or project-specific tooling.

The Unicode discussion of Immutable Identifiers is interesting. The idea is to define a relatively small and permanently stable set of codepoints which are not identifier characters, and then allow everything else forever. This is the simplest possible implementation of allowing non-ascii identifiers.

One example of this strategy is XML's identifier specification. You can see here that XML's specification for names does not depend on the Unicode spec.

However, to quote the Unicode discussion:

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated.

This strategy also cannot include any kind of normalization; it would mean the compiler compares raw bytes. It would be the responsibility of Zig developers to normalize their source code in order to prevent bugs. (And as @andrewrk says, it wouldn't make sense for zig fmt to do the normalization in this case, because zig fmt is supposed to preserve the semantics of the code.)

Would this kind of policy be a good idea? Would it be acceptable for the compiler to stay out of the normalization business, and trust developers to use Unicode responsibly?

Would this kind of policy be a good idea? Would it be acceptable for the compiler to stay out of the normalization business, and trust developers to use Unicode responsibly?

I think it's okay to trust developers to do that. It's a lot easier to input non-normalized text on purpose than accidentally. If people are running into weirdness where identifiers aren't lining up, chances are someone on the team is pulling some sort of prank, or is using some wacky input method which they probably shouldn't be using for programming anyway. As I've said, it's what most filesystems do, and people don't have too many issues dealing with typing filenames from the command line. It would be _nice_ to deal with normalization like Rust does, but I think the naïve approach is still way better than nothing.

I think that I'm in favor of that as well.

IMO, it has all the major advantages that would be present with normalization, without the drawbacks and without bloating the language or the tooling.

Thanks for the discussion all! I wrote up a concrete proposal at #4151 .

To me, switching input methods (English-Chinese) is already inconvenient, so I think this is more an aesthetic issue than an RSI or readability/internationalization problem. The @"<identifier>" syntax is explicit, so everyone can see that some special chars are being used. To make editing easier, people may add a shortcut for inserting @" or @"..." (e.g. Ctrl-2 for the second-to-last used special identifier). This can be done by configuring the IDE or some powerful input methods (?), so this is not a big problem unless you cannot bear with the 3 chars adding to the length of the identifier.

Added: I think providing a flag for doing normalization in zig fmt is not bad.

Thank you for the discussion all. I am closing this in favor of status quo. The @"" syntax allows any string literal to be used as an identifier, and Zig remains blissfully unaware of Unicode.

I'm disappointed by this decision, but the fact that there is at least an escape syntax in case FFI requires this is nice.

@Serentty Here are some things to consider:

  • zig-fmt: auto wrap raw unicode identifier into @"..." (--flag? --undo-flag?)

  • zig-fmt: auto unwrap @"..." which does not contain any non-ASCII character

  • support adding plugins for transpiling in build.zig

  • provide a good input method (plus a specialized keyboard?) that can help type the code below quickly. I guess some languages which use more English punctuation symbols are easier to type; Japanese may be easier than Chinese.

Example 1:

// the strings in the array don't count (extra difficulty if you like it)
const @"数字列" = [_][]const u8{"〇", "一", "二", "三", "四", "五", "六", "七", "八", "九"};

var @"排列组" = @"类_全排列".@"生成"([]const u8, @"数字列");

while (@"排列组".@"取出"()) |@"排列"| {
    @"显示"(@"排列");
}

Example 2:

const 数字列 = [_][]const u8{"〇", "一", "二", "三", "四", "五", "六", "七", "八", "九"};

var 排列组 = 类_全排列.生成([]const u8, 数字列);

while (排列组.取出()) |排列| {
    显示(排列);
}

translation:

const numerals = [_][]const u8{"〇", "一", "二", "三", "四", "五", "六", "七", "八", "九"};

var permutations = Permutation.generate([]const u8, numerals);

while (permutations.iter()) |permutation| {
    display(permutation);
}

The temporary RSI I developed after typing the two examples above (with the help of English punctuation mode + a special key to output raw keyboard input) has made me understand again why being forced to type @"..." leads to being forced to use ASCII. It still hurts even without @"...". To avoid RSI I may need to type more slowly. :/ More importantly, I don't like mixing English keywords with Chinese, and it is hard to name the variables because there is no plural form in Chinese. Whether @"..." looks second-class or not is never a big deal to me.

But I believe with enough effort one day the whole zig stdlib will be translated, or simply be documented in Chinese.

I know this one is closed but... given there are many languages with unicode identifiers like Swift, Python and JS (I think?), how much is this in practical use? I've seen unsubstantiated opinions but no actual data. If unicode is valuable for identifiers I would expect there to be a fairly large corpus of Swift, Python and JS code that use them. I have tried to search Github, but the only examples I found were example source files to demonstrate the languages' capabilities.

They are unlikely to see use in English projects, since we're used to ASCII being good enough. For people whose language is unrepresentable in ASCII, Unicode identifiers are much more likely to be used.

Of course this is just another unsubstantiated opinion. My point is merely that your data set may be biased.

@pixelherodev I have searched for certain Chinese identifiers in github projects. I'm making the assumption that this is a fairly unbiased way of looking for code. Do you have a better way to search? We can assume that code is written something like: <multiple non ascii bytes> = ... but I was unable to do such a general search on github.

I would be surprised if there's anyone programming in a language other than English to be honest. I know in France (and presumably Germany?) they used to program in their own language, but these days... I'd be very surprised.

Well, most github projects, to my understanding, are in English. Github searches will fundamentally give inaccurate information as a result.

I'm not sure there is a good way to do this. Maybe look for a code forge used predominantly for non-English projects?

@pixelherodev I only looked at source code that already contained certain Chinese words, and checked if they were used as identifiers or as comments for identifiers. It was the latter. Typical code:

//长度
func (t *taskList) Len() int {
    return len(t.data)
}

Github seems popular with Chinese projects as well.

There are other languages which can be used, not just Chinese, and it's entirely possible there are Chinese projects using them that are being missed because a) they're not on github, b) github's search is confused, etc. It's also possible that people simply don't know (or care) that it's possible, and that people would use it if they knew - or that they've run into issues using it for real projects that the fun demos overlook. There's a myriad of possibilities, and it's worth considering all of them instead of jumping to conclusions.

I am only saying that I am unable to find any good examples of unicode identifier use in my search. I was also looking at Chinese sites and checking random projects, and they follow the kind of code I typed in before (this was not done with a Github code search, but by looking at popular Go sites in Mainland China).

I picked Chinese because it has a huge internal market for development (as opposed to many other countries that mainly deliver internationally), plus I know a bit of Chinese. If someone would be interested in looking at code written in Japan, India or Russia I'd be most interested in knowing what they would find out.

@lerno “易语言” ("Easy Language") was once famous in China, and even now there are still people using it. Maybe you can talk to its users. I for one don't want Zig to be tied to any complex encoding scheme (including UTF-8, though ...) as long as there is no problem doing name mangling or anything for compiling/transpiling/debugging, even if people only write tools that support UTF-8. I say this not because code in different languages is difficult to read, but because a DSL may do better than Zig for people's use cases while keeping the Zig language simple.

If unicode is valuable for identifiers I would expect there to be a fairly large corpus of Swift, Python and JS code that use them. I have tried to search Github, but the only examples I found were example source files to demonstrate the languages' capabilities.

I think that's a flawed assumption. A lot of the time, because different tools have differing levels of support for text, many programmers will simply use romanized identifiers to avoid the hassle. So I think this is an issue of a vicious cycle of bad support, not of value. A lot of the code from Japanese programmers that I have read _does_ use Japanese in identifiers, but romanizes it, often quite inconsistently. A friend of mine who is a programmer in Japan says that he would rather deal with Japanese characters in identifiers than the ad-hoc romanization that people use.

Regardless, I think the concept of “other languages” often dominates these kinds of conversations, when in truth I often find use for identifiers outside of ASCII even for English. There are English words such as “résumé” which aren't representable in ASCII, and having to substitute characters because of technical limitations leaves a bad taste in my mouth. I think it's really lamentable that something as limited and old as ASCII continues to be seen as any less of a legacy technology than Windows NT 4.0 would be. And I don't see “it's reliable because it's limited” as a good reason most of the time, since languages which allow a much wider breadth of identifiers don't have problems. I don't know what kind of weird homoglyph attacks people expect, but they're just not an issue in code. URLs? For sure, but code isn't URLs.

@Serentty My statement is merely an invitation for further consideration. "Unicode allows people to write code in their own language" is a fact of course, but the implicit assumption that "given unicode identifiers they will use them" is not, as far as I can tell. There is no actual research whatsoever backing that up, so we need to consider this an opinion and not a fact.

"易语言" on the other hand is interesting as this is an actual data point possibly in favour of that argument. It should be noted that it seems to also write keywords in Chinese. What I would like to know is whether it was successful because the whole language was expressed in Chinese, or if it would have been similarly popular if keywords would have been in, say english. I am not familiar with it so I do not know if it was used professionally or merely mostly as an introduction to programming. Knowing that would bring more facts to the discussion.

Most language design decisions have plenty of research that you can review to get a fair view of pros and cons. This feature is a bit odd in that there seems to be no research and yet it is often assumed to be an important addition.

The reason I mention Go, Python and Swift is that the public source code on Github could be used to provide interesting data.

It is not only a matter of unicode. Python, Swift and Go all offer slightly different uses of unicode. Is one more attractive than the others? Which language has the most uses of unicode, or are they the same? The data could help guide how unicode should be implemented in other languages, like Zig.

Another relevant perspective here is that unlike the older character encodings - everything from extended latin to Big5, unicode is more of a way to encode a visual representation than to encode words and numbers. This is why unicode allows you to encode letters in various ways. But this also is the reason why unicode is a pretty bad way to encode identifiers. This has more to do with unicode itself than allowing identifiers outside of ASCII. Some people have suggested to solve the problem by supporting more encodings and let the allowed identifiers follow the encoding. So for example if Big5 is used then you get Chinese identifiers, if any of the Japanese encodings are used, Japanese etc. That would solve the identifier issue mentioned, but unfortunately ends up creating problems that UTF8 does not have. Still, interesting.

Another relevant perspective here is that unlike the older character encodings - everything from extended latin to Big5, unicode is more of a way to encode a visual representation than to encode words and numbers. This is why unicode allows you to encode letters in various ways.

This is just entirely untrue. Unicode does not encode visual representations. It encodes abstract characters. Some of these abstract characters, such as certain mathematical symbols, might seem at first to be a way to encode visual variation, but these code points exist for use as those mathematical symbols, which are semantically different characters.

This is why unicode allows you to encode letters in various ways.

I presume you're talking about composed and decomposed representations here? This is an issue in legacy encodings too, which is why Unicode has both. Legacy encodings for Vietnamese in particular. But unlike legacy encodings, it has a very well-defined concept of canonical equivalence.

Some people have suggested to solve the problem by supporting more encodings and let the allowed identifiers follow the encoding.

You mean like actually encoding the source files in that encoding? I don't see how that solves the “identifier issue”. If your issue is having multiple ways to encode the same character, then you still have that with legacy encodings. If your issue is limiting programmers to “reasonable” identifiers, then it doesn't fix that either because the legacy East Asian encodings include all sorts of dingbats and doohickeys, but without any sort of concept of character classes or properties to use to forbid them. But perhaps most importantly, most of these encodings haven't been updated in decades, and are falling very far behind Unicode in terms of representing CJK text. I don't think you can even represent the Taiwanese phrase for “good morning” (gâu-tsá, 𠢕早) in Big5.

This is just entirely untrue. Unicode does not encode visual representations.

Do you claim that an encoding such as iso latin 1 is equivalent to the unicode representation of the same, and that they are equally suited to the unique encoding of identifiers?

I presume you're talking about composed and decomposed representations here? This is an issue in legacy encodings too, which is why Unicode has both. Legacy encodings for Vietnamese in particular.

There are certainly problems with other encodings; I am merely supplying the observation that at least quite a few of the older encodings have a tighter correspondence to the encoded character and thus make it easier to write rules for them.

For example, if we have a cyrillic encoding, the range of identifiers could be limited to the cyrillic alphabet, which can be done in a compact manner in that encoding.

Note that I'm not suggesting that Zig should do this, I am merely trying to point out the advantages of doing so.

I do know it has significant disadvantages as well and I am not proposing this as a solution.

However, it could be that there could be something interesting in this observation. For example, rather than adopting certain unicode categories as viable in identifiers, one could perhaps suggest that certain well defined groups of subsets should be used, creating an informed subset for each language.

There might be separators that work well for certain characters to complement _; should non-arabic numerals be treated as letters or numbers (e.g. 三 has a numeric value in unicode but does not belong to the unicode number category)? And so on. Deliberate choices can be made rather than implicit ones.

But perhaps most importantly, most of these encodings haven't been updated in decades, and are falling very far behind Unicode in terms of representing CJK text.

No, I'm not suggesting it as a replacement.

Do you claim that an encoding such as iso latin 1 is equivalent to the unicode representation of the same

Both encode characters used for writing. One is bigger. Conceptually, yes, they do indeed do the same thing.

and that they are equally suited to the unique encoding of identifiers?

Actually, I would claim that it's Latin-1 that is unsuitable for identifiers at this point, given that it's a standard that the ISO has abandoned for a long time. As for the characters themselves, I don't see why É or Æ should be more suitable for identifiers than 愛 or λ. If your objection is that Unicode contains lots of characters such as punctuation and dingbats which aren't logical to have in identifiers, then the solution is to do the same thing that every programming language already does even within the confines of ASCII: just don't allow those. Unicode itself has already done all of the hard work here, and provides Unicode Standard Annex 31 for exactly this purpose.

For example, if we have a cyrillic encoding, the range of identifiers could be limited to the cyrillic alphabet

Using encoding as a proxy for language or allowed characters is a very bad idea. For one thing, many encodings include characters which aren't used to write that language, such as how the JIS character set (used in Shift-JIS and EUC-JP) includes the Russian Cyrillic alphabet. So it can handle Japanese and Russian! But not Ukrainian or Serbian. And most legacy encodings are also missing one or two really common characters for the languages they've traditionally been used for, too. It's just a mess of intersecting Venn diagrams that don't really mean anything.

But perhaps more importantly, the user already _has_ an idea of which characters they want to use in identifiers. They aren't going to start using Chinese characters in an otherwise Cyrillic codebase just because they can, and lament the fact that the programming language didn't prevent them from doing so. If they do that, it's probably because they wanted to, or had a good reason to. Having to specify what characters an individual project should allow in identifiers (among those that the programming language itself allows), sounds like the job of a linter, not the compiler.

For example, rather than adopting certain unicode categories as viable in identifiers, one could perhaps suggest that certain well defined groups of subsets should be used, creating an informed subset for each language.

The thing is, and I'm sort of repeating myself here, programmers already _do_ use a subset: precisely the subset that they find acceptable. This isn't something that will just explode if you don't contain it.

There might be separators that work well for certain characters to complement _; should non-arabic numerals be treated as letters or numbers (e.g. 三 has a numeric value in unicode but does not belong to the unicode number category)? And so on. Deliberate choices can be made rather than implicit ones.

Now this one is actually interesting, although I don't think I would be a fan of treating certain characters differently in different compiler language settings. At this point the compiler has taken on a tremendous amount of complexity just for incredibly minor tweaks, and has passed that on to the end user, requiring a _human_ language to be specified when compiling.

For the case of 三 itself (I know this was just an example and not your actual point), I think the way it's classified makes sense. In modern usage, they're used more often for words which contain them than for writing out numeric literals. Not being able to name an identifier “三角” (triangle) because it starts with a Unicode number would be a bigger problem than not being able to use 三 instead of 3 in a numeric literal, I think.

@Serentty Leaving aside that the iso encodings are legacy at this point (I don't dispute that), don't you see anything valid in the fact that, in particular for languages with an alphabet, each character in that alphabet has a single-byte representation?

Standard Annex 31 is a good argument though, and I honestly forgot about that one.

In regards to 三, then yes, I agree on the classification, although people wanting to go all unicode might argue that

int 年 = 一九九二;

Should be allowed 😄

don't you see anything valid in the fact that, in particular for languages with an alphabet, each character in that alphabet has a single-byte representation?

To be honest... not really. I could see myself agreeing with that a few decades ago, when plaintext could very realistically fill up all of a system's RAM, but nowadays, the space premium of UTF-8 is worth it many times over in my opinion. It's absolute peanuts compared to the ways that software wastes memory these days.

If you mean the advantage of being fixed-width, then that's definitely more compelling, I admit. In a low-level language where you expect indexing to reflect memory offsets, being able to index into the actual characters of a string as memory offsets is very useful, and in the early days of computing it was perhaps even the only practical approach. I mostly just see this as a necessary sacrifice at this point, though.
