Rust: Tracking issue for RFC 2457, "Allow non-ASCII identifiers"

Created on 29 Oct 2018  ·  102Comments  ·  Source: rust-lang/rust

This is a tracking issue for the RFC "Allow non-ASCII identifiers" (rust-lang/rfcs#2457).

Steps:

  • [x] Implement the RFC (cc @rust-lang/compiler @Manishearth)

    • [x] Normalize identifiers to NFC whilst parsing (#66670, #67702)

    • [x] Ensure that #![forbid(non_ascii_idents)] works. (#61883)

    • [x] Lint: confusable_idents(#71542, #72770)

    • [x] Lint: ~less_used_codepoints~ uncommon_codepoints (#67810)

    • [x] Adjustments to "bad stylenon_standard_style" lints. (See #73839)

    • [x] Lint: mixed_script_confusables(#72770)

    • [x] Provide reusable crates for above lints and checks on crates.io. (unicode-security)

  • [ ] Adjust documentation (see instructions on forge)
  • [ ] Stabilization PR (see instructions on forge)

Unresolved questions:

  • [ ] Which context is adequate for confusable detection: file, current scope, crate?
  • [ ] Should ZWNJ and ZWJ be allowed in identifiers?
  • [ ] How are non-ASCII idents best supported in debuggers?
  • [x] Which name mangling scheme is used by the compiler? (Punycode, see RFC2603)
  • [x] Is there a better name for the less_used_codepoints lint?
    Resolved in favour of uncommon_codepoints
  • [ ] Which lint should the global mixed scripts confusables detection trigger?
  • [ ] How badly do non-ASCII idents exacerbate const pattern confusion
    (rust-lang/rust#7526, rust-lang/rust#49680)?
    Can we improve precision of linting here?
  • [ ] In mixed_script_confusables, do we actually need to make an exception for Latin identifiers?
  • [ ] Terminal width is a tricky with unicode. Some characters are long, some have lengths dependent on the fonts installed (e.g. emoji sequences), and modifiers are a thing. The concept of monospace font doesn't generalize to other scripts as well. How does rustfmt deal with this when determining line width?
  • [x] right-to-left scripts can lead to weird rendering in mixed contexts (depending on the software used), especially when mixed with operators. This is not something that should block stabilization, however we feel it is important to explicitly call out. Future RFCs (preferably put forth by RTL-using communities) may attempt to improve this situation (e.g. by allowing bidi control characters in specific contexts).
  • [ ] Tweak XID_Start / XID_Continue? https://github.com/rust-lang/rust/issues/4928
    > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
    >
    > The ISO JTC1/SC22/WG14 (C language) think that possibly UTR#31 didn't quite hit the nail on the head in terms of defining identifier syntax. They have a couple tweaks in mind. Consider following their lead.

    * [x] Similarly to out-of-line modules (mod фоо;), extern crates and paths with a first segment naming a crate should not be able to do filesystem search using those non-ASCII identifiers (i.e. no , extern crate ьаг; or му_сгате::baz). (#73305)

zulip channel topic for real-time discussion:
https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/nonascii.20identifiers(rfc.202457)

B-RFC-approved B-RFC-implemented B-unstable C-tracking-issue F-non_ascii_idents T-lang

Most helpful comment

If we need something even stronger we might try "mythic". :stuck_out_tongue:

All 102 comments

last unresolved question isn't a real unresolved question, it was included in the RFC for completeness but does not block this issue.

@joshtriplett Please check that the list of checkboxes above are satisfactory. :)

@Manishearth alright; leave a note under it to that effect?

The note saying so is already in the unresolved q

Is there a better name for the less_used_codepoints lint?

Substituting "rare" or "unusual" for "less used" seems to me a simple, if not necessarily final, improvement, replacing the somewhat awkward "less used" with a single, shorter, more usual synonym.

(Edit: I note that I personally oppose allowing non-ASCII identifiers, but I recognize that the Rust Team favors it, and I have no problem bowing to their decision and chipping in my cents to help.)

I like "unusual"
-Manish Goregaokar

On Mon, Oct 29, 2018 at 6:56 PM 8573 notifications@github.com wrote:

Is there a better name for the less_used_codepoints lint?

Substituting "rare" or "unusual" for "less used" seems to me a simple, if
not necessarily final, improvement, replacing the somewhat awkward "less
used" with a single, shorter, more usual synonym.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/rust-lang/rust/issues/55467#issuecomment-434035545,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABivSBq4pwBDJPCioj_7Jlu_fx5eoCRNks5up09wgaJpZM4X-3kG
.

I would prefer “rare” as it sounds more objective to me than “unusual”, and perhaps less judgemental as well.

My first thought was "uncommon", but that's not strong enough of an adjective to get the intended meaning across.

I'm partial towards "rare" as well; rare_codepoints is pretty short and sweet.

If we need something even stronger we might try "mythic". :stuck_out_tongue:

I prefer uncommon_codepoints as being more serious/boring than rare/legendary/mythic/etc.

@eddyb—

I prefer uncommon_codepoints as being more serious/boring than rare/legendary/mythic/etc.

As a native and competent English-user who generally is seen as serious/boring [1], while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does. The distinction I would draw between uncommon_codepoints and rare_codepoints is that "rare" may be seen as stronger than "uncommon", at least in US English (Merriam-Webster and Wiktionary say so, whereas Oxford and Cambridge don't say so explicitly).

[1] (I recognize this may not have been a trait I displayed when I was a member of your channel.)
[2] (similarly to the Unicode term "astral plane", which, possibly relevantly, if I understand correctly, was changed to "supplementary plane")

I'm much against non-ascii identifiers, but I'm not going to repeat what other folks said. I just noticed that no example of malicious code was provided so far, so I'm providing one:

fn list_items_in_category(category: &str) -> io::Result<String> {
    let cate𝚐ory = sanitize_untrusted_input(category);
    debug!("Listing category {}", cate𝚐ory);
    system(format!("grep '^{},' /my/simple/database | awk -F , "{{ print $2 }}", category))
}

Who can spot the problem without looking at character codes?

As a side note, I'd like to provide my experience with attempts to localize everything (feel free to skip the rest of my comment if you want to remain technical). I went to a school where they translate literally every technical term. In an attempt to make everything understandable to everyone, they translated even things that are very difficult to translate reasonably.

You'd expect that it was much easier to learn at that school compared to others schools that don't do that, right? Well, life is weird. It was hard to understand, I felt like Alice in wonderland and it took me a week to realize that "that weird term I didn't hear before" was the actual thing I wanted to study and the very reason I signed up for that specific school!

Of course, this is not directly an argument against non-ascii identifiers. I just wanted to express my concerns to all those wonderful loving people (seriously) who want for everyone to feel great in Rust community, so that they remain vigilant and avoid accidentally going against their own beliefs.

while I agree that "legendary" and "mythic" sound rather fantastical [2], I don't think "rare" does.

(Agreed. My comment was entirely in jest (as I hope should've been evident?), and I had no intention of tarnishing "rare" by association, a word which is itself ordinary and common.)

Who can spot the problem without looking at character codes?

It's trivial on my font. Also, we want to start with lints against this sort of thing.

Who can spot the problem without looking at character codes?

Github's monospace font configuration on my machine ended up using a binocular glyph for U+0067 LATIN SMALL LETTER G but a monocular one for U+1D690 MATHEMATICAL MONOSPACE SMALL G, so I spotted it instantly.

That said, the confusable_idents lint will once implemented almost certainly flag this code, since MATHEMATICAL MONOSPACE SMALL G → LATIN SMALL LETTER G is listed in confusables.txt.

The Oxford English Dictionary has as one definition of "uncommon":

Of an unusual type or character; exceptional in kind or quality

which seems especially appropriate 😉(Emphasis mine.)

I feel it's a more suitable choice of words than "rare", which has significantly more meanings than "uncommon", some associated specifically with being good, e.g. (also OED):

Unusually good, fine, or worthy; of uncommon excellence or merit.

I also think more reserved wording is generally appropriate for compiler naming conventions.

@Kixunil

but I'm not going to repeat what other folks said

That's precisely what you're doing -- your "counterexample" is caught by _both_ the less_used_codepoints lint and the confusable_idents lint.

At this point these counterexample discussions have been done to death (and we're leveraging a unicode spec designed to deal with this!) -- please actually check if your "counterexample" isn't something we or the Unicode Consortium have thought of already.

Exactly. People have thought a lot about this, and it's certainly possible to implement a feature that many people will find useful while dealing with complications that might arise. At this point, adding this feature is a given. If you have ideas about how to improve the lints for finding confusable identifiers, by all means share them, but there's no need to simply point out an issue that everyone is already aware of.

Re. less_used_codepoints: I propose that we bring the bikeshed to a halt in favor of uncommon_codepoints because "meh".

I was personally in favor of rare_codepoints but uncommon_codepoints is basically the same and also works for me so... 🤷‍♀️

Everyone OK with this?

works for me. Slight preference for rare but very slight. Both work for me, and I don't think it's really worth bikeshedding this too much :)

@Manishearth Sorry, I didn't mean to argue that the example is unsolved, I just wanted to provide actual code for those who might have difficulties imagining how to turn this feature into something malicious.

I wonder though, why confusables lint is not mandatory (according to RFC). It sounds to me like making borrow checker non-mandatory. My understanding is that Rust should be safe by default, where you can opt-into unsafety. For me, this means denying confusables in every crate I make, if I want to ensure high quality of my crates.

This isn't a matter of safety in the way that Rust describes it.

Making the warnings on by default has been discussed. Please let's not use this tracking issue to relitigate things which have already been decided through a rather long RFC.

I was personally in favor of rare_codepoints but uncommon_codepoints is basically the same and also works for me so... 🤷‍♀️

Everyone OK with this?

Where will it be used? What is its purpose? In what and for who are this signs rare or uncommon? How does it matter that these signs are rare or not in the current context?

Reading this page from top to bottom didn't give me enough context to understand what is it about. Is it an argument option that will be exposed in Rust linter? Or some internal identifier that won't appear anywhere in the face of a lint user?

@psychoslave

This is the tracking issue for https://github.com/rust-lang/rfcs/pull/2457, which contains all the information you're looking for.

It's the name of an on-by-default lint exposed to the user that bans a bunch of codepoints from identifiers which we don't expect people to want to use.

I've ticked the box for less_used_codepoints, resolving it in favor of uncommon_codepoints;
One step closer :)

Hi @Manishearth, thank you.

Can you confirm me that in this context, a lint means "a rule that can be checked by a linter"?

If that so, what I interpret from the RFC is that it's about (possibly) concerning codepoints, so I would advice concerning_codepoints. I didn't read the Unicode Security Mechanisms Section 3.1 General Security Profile for Identifiers yet, so I might overguessing here, but it looks like the issue is not about codepoint use frequency. Please let me know your opinion, especially if you did already read the previously mentionned Unicode document.

Here is an other feeback, related to this section:

In Indic scripts these characters force different visual forms, which is not very necessary for programming. These have further semantic meaning in Arabic where they can be used to mark prefixes or mixed-script words, which will not crop up so often in programming (we're not able to use - in identifiers for marking pre/suffixes in Latin-script identifiers and it's fine).

What data back that this is fine? Not being able to use a dash within identifiers is a deffect included in most languages, sure. But it leads to decreased readibility (there is a reason why typographers came with a dash and not an underscore or camelcases) and probably doesn't bring any benefit. Indeed, for substraction (and inverse unary function) to be a distinguishable operator, it's enough to enforce that it should be at least separated by a space before it, and forbid dash to be a starting character of an identifier.

The RFC should avoid to come to conclusions of what is unecessary for programming (remember that two distinct symbols is enough to programm anything), and rather focus on what is required to match whatever the user would like to use without introducing design or security flow. Otherwise this would lead to integrate personal cultural biases as rigid rules, where more flexible solution are equally possible.

Can you confirm me that in this context, a lint means "a rule that can be checked by a linter"?

rustc has built-in lints, there is a separate linter (clippy) but we're talking about the builtin lints here. This is the name of a warning that rust will produce.

so I might overguessing here

you are, please actually read that section: It denotes sets of characters that the unicode consortium believes will be normally useful in identifiers. The characters not belonging to this set get linted. There _are_ concerning codepoints there, but there also are codepoints that Unicode doesn't believe will be common in identifiers (e.g. stuff from dead scripts, legacy codepoints, etc)

We're trying to defer to the Unicode Consortium's judgement as much as possible here.

it's enough to enforce that it should be at least separated by a space before it, and forbid dash to be a starting character of an identifier.

This kind of thing isn't a slam dunk, now you have increased the whitespace sensitivity of the language, which may or may not be a good thing.

What data back that this is fine?

The data to back it is this: Rust and many other languages don't support dashes in identifiers, and yet there is not much (if any) complaining about that. Same goes for apostrophes in identifiers.

ZWJ/ZWNJ are tricky invisible code points that have security risks. The Unicode Consortium recommends that they not be allowed in identifiers. I felt it was worth calling out that they _are_ useful for writing identifier-like text, however _for now_ it's fine if we follow Unicode's recommendation and leave them out.

These are not rigid rules: these can be changed in a backwards-compatible way in the future: If enough people writing Rust in Perso-Arabic script complain that they need ZWJ, we can tweak this lint in such a way as to make this work whilst preserving security properties. For now, it's better to just follow Unicode's recommendations and implement the simpler thing.


This has been a really long RFC process, and a lot of this stuff has been discussed already. I understand that you may have missed that discussion, but please don't relitigate things that we've already gone over multiple times. The point of the tracking issue is to finalize some of the unresolved questions from the RFC and track implementation/stabilization, _not_ to rehash the RFC (unless the implementation and experimentation indicate that we need to go back to the drawing board)

Regarding the drawbacks, most of it seems seriously flawed with English-centric concerns, and seems completely ignoring that if someone intent to change a code written in an other language than English, there should be serious consideration about either knowing this language or using dedicated tools able to make bidirectional translation of identifiers (for example backed on a context-sensitive database of lexemes). GopherCon 2017: Aditya Mukerjee - Translating Go to Other (Human) Languages, and Back Again - YouTube
gives an example of such a tool for Go.

With that in mind, the only credible mentionned drawback still a bit relevant seems an increased complexity for the compiler, but this is really not an issue that Rust users will have to worry about.

I would recommand to drop this section altogether as I fail to see how it brings anything constructive, and I feel unease with controversial statements it contains like "Foreign characters are hard to read" (Foreign for who? What about people for who latin-script is the set of foreign characters?). But that is just my opinion.

Kind regards

Those are basically drawbacks that were put forth by people in the thread paraphrased, their existence does not mean we _agree_ with them.

The RFC landed: which means there was sufficient evidence/discussion showing these drawbacks were flawed, or the RFC was amended to address them. The drawbacks are still included for reference, so as to document the discussion.

I'm well aware of Aditya's talk.

Again, you're commenting about an accepted RFC. This is not the place or time to do that.

Thank you @Manishhearth for you kind answer. I'm sorry I see your first reply only now, and not before I made a second comment (oddly).

Regarding the use of lint, I wasn't yet acustomed at the use of this word to describe the rules that the lint sofware is checking against, let alone the corresponding warning message. Thank you for your explanation.

For information, I arrived on this page as I was trying to use a function whose identifier used ĉ. So I'm very interested to see this process going forward, and would like to help as far as I can in this.

We're trying to defer to the Unicode Consortium's judgement as much as possible here.

That's seem a good way to go, as it enables to leverage on their work. All the more when staying flexible with user feedback is also kept is in mind.

The data to back it is this: Rust and many other languages don't support dashes in identifiers, and yet there is not much (if any) complaining about that. Same goes for apostrophes in identifiers.

Well, for the sake of the exercise, consider this feedback as a complain. :D Admitedly that still doesn't make much of it, but at least no one should ever again say that there is not any. ;)

To my mind actually having this facility would be a killer feature and I would be extremely sad and disapointed if it would remain missing. All the more when I could have provided the feedback that made it included. So here is my trial, please consider adding dash as a legal identifier continuation code-point, if that can still fall in this debate, or please provide me the guidance to make the demand properly in a separated query.

Regarding the proposal of alternatives to uncommon_codepoints, is concerning_codepoints an appropriate and therefor rejected proposal?

So here is my trial, please consider adding dash as a legal identifier continuation code-point, if that can still fall in this debate, or please provide me the guidance to make the demand properly in a separated query.

This would belong in a separate RFC. But it can't be done in a backwards compatible way because we already allow operators that are not whitespace-separated (it can, using editions, but you need _really_ good justification to go that route)

@psychoslave In the future it would be completely possible to add support for dashes and hyphens in identifiers. The only character that could not be used in names without breaking things is the ASCII hyphen-minus, which is the character that appears in most keyboard layouts, and is often used as a substitute for these more-specific characters. Since Rust only uses the hyphen-minus as a minus sign and not the others, I see no reason not to allow them (although lints should definitely be added to point out when multiple dash-like characters are used in a confusing way).

@Serentty great news. Do you have an approximate idea when this might come into mainlands, that is future is a few weeks from now, some months away or maybe even later? Cheers

@psychoslave As far as I'm aware, there are currently no plans to allow characters that are not recommended by Unicode for use in identifiers. If there are any dash-like characters that Unicode recommends, then they should already be supported by current plans. Otherwise, allowing them would have to be proposed in a new RFC, which you could write.

Will this kind of code be allowed? I would be deeply unsatisfied, otherwise:

fn main() {
    let 🐧 = "kwak kwak";
}

@Boiethios the answer to this question will be unsatisfactory to you.

@Manishearth We haven't seen much (any) development on this for some time... Seeing as you are the Unicode expert... perhaps you could lead a T-compiler WG to get the changes implemented?

I have no time to do this, I can advise on unicode issues but can neither help with organization nor coding here.

Will this ever be finished?

There's nobody actively working on it at the moment, but it should be at some point.

This is something that I care a lot about, so maybe I'll ask around and see if I can learn enough about the compiler internals to work on this.

What is the status of this feature?

@GrayJack Please see the description in the issue. Not much has happened since.

Reading the RFC, I couldn't help to notice that

Composed characters like those used in the word ḱṷṓn can be represented in different ways with Unicode. These different representations are all the same identifier in Rust.

Which scares me a bit as it looks like Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ could be used to craft some pretty identifiers. But it could also have, ah, interesting uses for obfuscation.

While I understand the drive for allowing non-latin identifiers, I feel like this might also be opening a door to some new forms of attacks, making it possible for code to be deceiving in intent to a casual PR review by forging different identifiers that appear to be the same on screen. Maybe this is far fetched...

The security implications have already been discussed, and mitigations (mostly lints) are mostly decided upon, but actually implementing them is what is holding this up.

Shouldn't we, for the sake of third party tools that can potentially be unicode naïve, attempt to "soft-enforce" byte comparable tokens by having a warning when encountering non-NFC chars in idents?

This was brought up in the thread at some point, and unfortunately not all entered text is NFC, and it would be annoying if we emitted warnings that basically amount to "we don't like your keyboard".

That said, rustfmt could normalize idents if it wishes. That would be okay, things getting magically normalized isn't uncommon. I think I've seen at least one OS clipboard do this (?).

I haven't seen an input method that doesn't enter NFC, but if such an input method exists I would love to see it. Ultimately though, even if such an input method exists, I would personally argue that the inconvenience of the “I don't like your keyboard warning” would be less than the inconvenience of seemingly identical (not just visually identical, which is to be expected in Unicode, but semantically identical) identifiers not resolving to the same referent. If my keyboard were spitting out decomposed characters and I were using it for programming, I would want to know as soon as possible, because this is very much not the behaviour I expect from my input method.

Oh, I just saw the commit above where the compiler normalizes any identifiers before using them anyway. I'm fine with this solution.

@Serentty I've definitely seen some input methods like this. There is no inconvenience of seemingly identical identifiers; the whole _point_ is that we're normalizing them anyway (as most systems do). It is really bad form to annoy the user about a thing that most systems paper over.

@Manishearth Yeah, my first comment was from before I saw the thing about normalizing the identifiers. That takes care of all of my concern. This does get me thinking that perhaps there should be a warning if you use code points that were unassigned as of the release date of the compiler that you're using, since the normalization of unassigned code points is not stable. Something like:

“This character was not in Unicode at the time of the compiler's release. You may run into issues with name resolution for this identifier unless you update your compiler.”

@Serentty unassigned identifiers will already not be allowed because they're not XID.

please, read the RFC, these things are already dealt with, and I don't want to explain them for the umpteenth time.

@Manishearth I read the RFC months and months ago. It has been so long that I'm starting to forget everything. Sorry about that.

Copypasting the comment https://github.com/rust-lang/rust/pull/66670#issuecomment-569484077:

By the way, the RFC says the normalization is performed during parsing rather than lexing.
That means proc macros receive the original unnormalized tokens and can perform their own normalization (same as with escaping in literals).
But that's probably not the intent of the RFC.

On the other hand, if the normalization is performed during lexing, that would be the first case in which we do not preserve original tokens by design (except for a big hack with doc comments in macros).

That is definitely not the explicit intent of the RFC, but you're right that there's a tradeoff here. I'd leave it up to the compiler folks to figure out what best to do here.

First bulllet point, "Normalize identifiers to NFC whilst parsing" is done in #66670 and #67702 . Thanks to @petrochenkov for instructions~

So i created a new repo https://github.com/crlf0710/unicode-security for uts #39 checks and data. Currently it contains only one piece of data (identifier status) which will be used for implementing the uncommon_codepoints lint.

Now I'm seeking for getting the code reviewed, transfered to unicode-rs, published to crates.io and start working on the uncommon_codepoints lint. I wonder what's the best way to move forward?

Let me ping a few people to hear their opinions on the lib part. @alexcrichton @Manishearth @SimonSapin @kwantam

Should we move this into unicode-rs?

Oh, you already suggested that. Go ahead and add me as an owner and I can transfer it.

Done, and updated your access to admin

The uncommon_codepoints lint has been implemented in #67810 .

The confusable_idents lint has been implemented in #71542, but we're currently allow-by-default now for a few reasons.

By reasons you mean false positives? By any chance do you have a link to a crater run or something which shows some of them? Asking out of mere curiosity.

A few reasons: 1. There's a compilation time penalty, especially for tiny crates but with a lot of identifiers, we've done some optimizations but there's still some, for up to 5%. 2. Some identifier pairs are considered confusable, and has shown up in pure ASCII code base including libcore, libstd, and rustc_*. The pairs are indeed confusable to some extent, and for the rustc codebase personally I'm inclined to just go up and fix most of them. The most common ones are "I" to "l", "rl" to "r1", "el" to "e1", which i do think using more descriptive variable names are better. But for the scope of whole ecosystem, maybe it's worth adding a few special cases or move them to a separate lint too.

Hmm. So, I've not been participating too much in this discussion, but I have to say that at first blush linting about rl vs r1 feels quite surprising and intrusive to me. I would have assumed that the intent of the lint was to capture identifiers that looked the same because of distinct unicode characters that appear the same, not because of two distinct ASCII characters. Most fixed-width, progarmming fonts try to make I, l, 1, and so forth fairly distinct, no? (Admittedly, not in other type-faces.)

So a tweak we can make is only focusing on mixed script confusables. There is already a lint for mixed-script confusables _within a single identifier_, we can expand that logic to work at the crate level between identifiers too.

Also, yes, if you're using a decent programming font, l, I, and 1 should be distinct.

The RFC also does not _require_ us to implement the confusables lint, it points out that it might be expensive.

Even with good fonts, I actually would appreciate a linter telling me about rl vs r1. They're still similar enough that I could mix them up, and it's a very strong sign that I'm being overly terse with my variable names.

But I definitely agree that mixed-script confusables is the more important thing to lint on and more important to bump up to warn-by-default. Maybe we should ask URLO how they feel about "same script confusables", whether it should be a separate lint, or even a clippy lint.

On an unrelated note:

The uncommon_codepoints lint has been implemented in #67810 .

If I'm reading the test in this PR correctly, Greek and Japanese letters are being flagged as "uncommon". For a warn-by-default lint, that feels like it goes against the basic premise that we're trying to allow non-ASCII identifiers.

Is it supposed to become allow-by-default later, and is only intended for English-language codebases to be extra paranoid? Or are the characters in that test some weird alternative non-canonical way of writing those letters which we really do want to lint against, and there were just no comments explaining that?

If I'm reading the test in this PR correctly, Greek and Japanese letters are being flagged as "uncommon".

They're not, that's U+00B5 MICRO SIGN and the small katakana letters respectively. Neither should be in use for normal text.

We should have comments explaining that.

That test should also have some checks for various non-ascii strings that _should_ pass.

Wait, is that _all_ small katakana, or do you just mean the small katakana extensions used for writing Ainu? Because many small katakana characters are indeed used in normal text to spell such common words as “party”.

These are the Ainu ones. I might ask Unicode why they're classed as Technical, though, that seems incorrect.

I agree that the classification seems wrong. As far as I can tell, those code points are the recommended characters to use for writing the Ainu language, which is critically endangered but not dead. So it's not completely unthinkable that someone might want to write a Rust program (or any other text covered by the Unicode security spec) in Ainu. The characters seem better classified as confusables.

Correct, I've already emailed the internal unicode list about it.

In the meantime, I think we should follow the spec.

I've also added the mixed_scripted_confusables data and created a PR: https://github.com/unicode-rs/unicode-security/pull/13 . Comments welcome!

EDIT: posted in wrong thread, let's discuss the mixed_script_confusable in the corresponding PR.

Even with good fonts, I actually would appreciate a linter telling me about rl vs r1. They're still similar enough that I could mix them up, and it's a very strong sign that I'm being overly terse with my variable names.

While I agree with that, I think it's more appropriate for an "allow by default" lint, or perhaps clippy, than a "warn by default" lint (which I believe this to be, based on a quick skim of the RFC). I'm struggling a bit to capture my intuition here -- roughly speaking, it's that it seems like there's just little precedent from other languages or environments, and what folks consider confusable is likely to be highly dependent on the individual.

Skimming the RFC, I see it includes this text, which seems to back up my intuition (and in fact specifically cites l vs i as an example of what not to warn about):

There are many confusables within scripts -- Arabic has a bunch of these as does Han (both with other Han characters and with kana), but since these are within the same language group this is outside the scope of this RFC. Such confusables are equivalent to l vs I being confusable in some fonts.

(...sorry if I'm confusing things up here, i.e., I'm talking about the wrong lint.)

How about this:
Split the confusable_lints lint into two:

  • confusable_idents: which is based on unicode confusables.txt but excluding the pairs between ASCII codepoints. This is warn-by-default.
  • confusable_idents_ascii: which is based on unicode confusables.txt but only including the pairs between ASCII codepoints. This is permanently allow-by-default.

I think this way i can make it less surprising and intrusive to people and does not have any effect on existing code base, while remaining a useful tool that when people writing non-ascii code, they'll be prompted to keep identifiers visually distinct enough for good readability.

Since most programming fonts clearly distinguish potentially confusing ASCII codepoints we don't need to warn about them IMHO. The confusable_idents can be amended to discard any warning between two ASCII-only identifiers. This avoids warnings in existing code and should be easier to implement than excluding ASCII confusables on a character basis.

I don't think a confusable_idents_ascii is necessary as programming languages do just fine without.

@pyfisch Mind PR to the original RFC and clarify? After that i will amend the implementation accordingly.

Edit: I just realized that the text I quote below is from a section in the RFC named "Alternative mixed script lints" and is thus not what was decided upon. This makes my comment irrelevant, though I still personally think Rust should only warn about mixed-script confusables, not same-script confusables. I've left my original comment below.


Why not just have the implementation match the RFC, and only lint on mixed script confusables? Warning on same-script confusables was explicitly "outside the scope of this RFC" and is something that in my opinion should go through its own RFC or be left to clippy.

@mjbshaw the RFC explicitly calls for a confusable in-scope identifiers lint.

However, the RFC does not make it mandatory. Honestly I think we can do wthout this lint for a first pass, the others are good enough mostly. We can introduce this if people ask for it.

On Wed, May 06, 2020 at 06:38:16AM -0700, Niko Matsakis wrote:

Even with good fonts, I actually would appreciate a linter telling me about rl vs r1. They're still similar enough that I could mix them up, and it's a very strong sign that I'm being overly terse with my variable names.

While I agree with that, I think it's more appropriate for an "allow by default" lint, or perhaps clippy, than a "warn by default" lint (which I believe this to be, based on a quick skim of the RFC). I'm struggling a bit to capture my intuition here -- roughly speaking, it's that it seems like there's just little precedent from other languages or environments, and what folks consider confusable is likely to be highly dependent on the individual.

I agree as well. The intention of the "confusables" lint was to handle
cases of Unicode characters with identical or near-identical rendering,
such as the Cyrillic capital 'A' that renders identically to the
Latin/English capital 'A'. I think warning on l vs 1 vs I by
default would generate far too much noise and lead people to disable the
warning; those characters are intended to have distinct renderings.

On Thu, May 07, 2020 at 06:31:24AM -0700, Michael Bradshaw wrote:

Why not just have the implementation match the RFC, and only lint on mixed script confusables? Warning on same-script confusables was explicitly "outside the scope of this RFC" and is something that in my opinion should go through its own RFC or be left to clippy.

Agreed. Let's not special-case ASCII here; we should be warning about
confusing characters between two different scripts.

I think this means we should just not implement the confusables lint, and stick to the mixed script confusables one

On Sun, May 10, 2020 at 06:43:57PM -0700, Manish Goregaokar wrote:

I think this means we should just not implement the confusables lint, and stick to the mixed script confusables one

I agree. I'm tempted to say we should just call the resulting lint
"confusables" rather than mixed_script_confusables, though, for
the simplicity of adjusting that lint. If we decide we want a
same-script confusables lint in the future, we could call that
same_script_confusables, since it'll be less common.

(Please don't take that as name bikeshedding. Only suggesting it to
avoid requiring the more verbose name if we're unlikely to need another
similar lint.)

@Manishearth Personally i think the RFC accepted mixed script confusable lint is quite weak in my opinion. The RFC is relying confusable_idents to provide confusable detection between similar identifers of both same script and mixed script, and mixed_script_confusables to detect suspicious or accidental usages of ad-hoc unicode scripts. The later doesn't really cover much.

Created a topic on zulip:
https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/nonascii.20identifiers(rfc.202457)

Implemented ascii-only identifier checking (#[no_mangle] case and non-inline modules without #[path]) in #72259

It is worth noting that every single modern filesystem I know of supports Unicode in file names. Many tools have problems with Unicode file names on Windows, but that is a problem with the tools, not with Unicode filenames.

Handling file names in a cross-platform way that works nicely is tricky, however.

From the RFC:

This RFC keeps out-of-line modules without a #[path] attribute ASCII-only. The allowed character set for names on crates.io is not changed.

Note: This is to avoid dealing with file systems on different systems right now. A future RFC may allow non-ASCII characters after the file system issues are resolved.

You are free to write an additional RFC that attempts to address this. This was explicitly excluded from the original RFC to reduce scope.

@DemiMarie The issue is that filesystems in whether they perform normalization on filenames (and if so which version of Unicode they use to perform normalization), and whether they allow unpaired surrogates or invalid byte sequences. And from what I understand Apple's new APFS even forbids characters newer than Unicode 9.0 from being used in filenames. So while this is certainly something that can be done, it's a project of its own in terms of scope.

73305 has landed which disables the loading of external crates with non-ascii-identifier as name.

That was actually my main use case for this feature. The project I was using a non-ascii crate name in never went anywhere (also I couldn't have a non-ascii repo name on both GitHub and GitLab 😒), but still: Why disallow it?

Both several comments above and the RFC itself has already been talking about this, if you want to discuss more about this after reading those, i think you can continue within the Zulip channel (the link is also a few comments above).

Oh, I see. So an application crate can still have a non-ASCII name, but libraries can't because that would open up the possibility for lots of weird filesystem issues.

73300 has landed. Since the lints in this RFC is implemented at crate scope, an warning will now be emitted(through another lint) if you try to adjust the level of these lints on items or expressions.

72770 has landed, which implements the mixed script confusable lint. Thank you for your instructions @Manishearth !

73839 has landed, which exhibits the current behavior of non_standard_style lints on unicode identifiers.

And this concludes the initial implementation phase of this RFC.

I'll switch the tag of this tracking issue from B-RFC-approved to B-RFC-implemented now :)

Was this page helpful?
0 / 5 - 0 ratings