Coq: Implement Unicode normalization, and potential soundness issues

Created on 28 Mar 2020 · 5Comments · Source: coq/coq

The bug report is from @ppedrot:

Coq is Unicode-aware if I might say, but it is not at all Unicode-compatible. In particular, it does not implement normalization of Unicode strings, so that identifiers that should be equal per the norm are not in Coq.

Since that PR was the wrong place to continue the discussion, I created the new issue.

@ppedrot also claims that this feature would cause an unspecified soundness problem:

I know this kind of features used to be the source of security issues for some file systems, but for Coq it'd be immediately a source of soundness issues, since IIRC normalization is locale-specific.

However, normalization is not locale-specific. Parsing of strings might depend on the input encoding, and mixing encodings might cause trouble, but that seems a separate problem, and one only exploitable in adversarial settings. If such issues constitute valid bugs, maybe Coq should enforce that encodings are not mixed, e.g. by storing the input encoding together with the Coq version number and refusing to load output files using different encodings from the current one.

enhancement discussion

Source

Blaisorblade

👍1

All 5 comments

Naive question: why not to already replace Id.equal by equality modulo canonical equality (combining sequences, singleton equivalence, ...) or to normalize to, say, NFC at the time of building an Id.t? There seems to be an OCaml package for that: uunf.

herbelin on 28 Mar 2020

👍1

I'd be interested in seeing an actual proof of False out of this.

Zimmi48 on 28 Mar 2020

👍1

I like @herbelin’s idea, since it prevents the clear danger of “inconsistent” normalization, and it follows recommendations by https://unicode.org/faq/normalization.html; I’ve also read the intro of UAX#15. Singleton equivalence might break some code — today, you can give distinct meanings to the “Omega” and “Ohm” characters. That’s probably a good thing, but worth mentioning (and maybe asking users?).

In fact, you could also normalize input before lexing, tho that has a few extra corner cases: a complication I just read about is that nfc-normalize (concat a b) need not equal concat (nfc-normalize a) (nfc-normalize b); but this is only possible when combining characters appear at the end or beginning of a or b, which seems fragile and doesn’t work well today, so I’d propose to outlaw that (consider a notation token beginning/ending in a combining character, and an editor that normalizes uses of that notation); using NFC (instead of NFD) would complicate/prevent parsing such notations, and that’s probably good — where good means simplest to support robustly in Coq, editors and tooling.

I also agree about NFC — NFD lets you split accents from their “base” characters (which sounds potential trouble), and NFKC/D normalization destroys sub/superscripts and other special characters, and I’m sure I’m not the only one using them.

Blaisorblade on 28 Mar 2020

See also this discussion. Unicode normalization provides an equivalence relation on identifiers, and we can break this equivalence in Coq using notations:

Notation "[ x ́  ]" := x (at level 0).
Notation "[ x   ]" := x (at level 0).

Definition e := 10.
Definition é := 5.

Eval cbv in [ é ]. (*  10 *)
Eval cbv in [ é ] . (*  5 *)

The solution to this would indeed be to normalize to NFC during lexing.