The bug report is from @ppedrot:
Coq is Unicode-aware if I might say, but it is not at all Unicode-compatible. In particular, it does not implement normalization of Unicode strings, so that identifiers that should be equal per the norm are not in Coq.
Since that PR was the wrong place to continue the discussion, I created the new issue.
@ppedrot also claims that this feature would cause an unspecified soundness problem:
I know this kind of features used to be the source of security issues for some file systems, but for Coq it'd be immediately a source of soundness issues, since IIRC normalization is locale-specific.
However, normalization is not locale-specific. Parsing of strings might depend on the input encoding, and mixing encodings might cause trouble, but that seems a separate problem, and one only exploitable in adversarial settings. If such issues constitute valid bugs, maybe Coq should enforce that encodings are not mixed, e.g. by storing the input encoding together with the Coq version number and refusing to load output files using different encodings from the current one.
Naive question: why not to already replace Id.equal by equality modulo canonical equality (combining sequences, singleton equivalence, ...) or to normalize to, say, NFC at the time of building an Id.t? There seems to be an OCaml package for that: uunf.
I'd be interested in seeing an actual proof of False out of this.
I like @herbelinās idea, since it prevents the clear danger of āinconsistentā normalization, and it follows recommendations by https://unicode.org/faq/normalization.html; Iāve also read the intro of UAX#15. Singleton equivalence might break some code ā today, you can give distinct meanings to the āOmegaā and āOhmā characters. Thatās probably a good thing, but worth mentioning (and maybe asking users?).
In fact, you could also normalize input before lexing, tho that has a few extra corner cases: a complication I just read about is that nfc-normalize (concat a b) need not equal concat (nfc-normalize a) (nfc-normalize b); but this is only possible when combining characters appear at the end or beginning of a or b, which seems fragile and doesnāt work well today, so Iād propose to outlaw that (consider a notation token beginning/ending in a combining character, and an editor that normalizes uses of that notation); using NFC (instead of NFD) would complicate/prevent parsing such notations, and thatās probably good ā where good means simplest to support robustly in Coq, editors and tooling.
I also agree about NFC ā NFD lets you split accents from their ābaseā characters (which sounds potential trouble), and NFKC/D normalization destroys sub/superscripts and other special characters, and Iām sure Iām not the only one using them.
See also this discussion. Unicode normalization provides an equivalence relation on identifiers, and we can break this equivalence in Coq using notations:
Notation "[ x Ģ ]" := x (at level 0).
Notation "[ x ]" := x (at level 0).
Definition e := 10.
Definition Ć© := 5.
Eval cbv in [ eĢ ]. (* 10 *)
Eval cbv in [ Ć© ] . (* 5 *)
The solution to this would indeed be to normalize to NFC during lexing.
Oh, I had completely forgot about this issue when I commented in the forum.