Katex: Ambiguous Unicode characters

Created on 14 Nov 2017  ·  7 Comments  ·  Source: KaTeX/KaTeX

My recent series of Unicode pull requests omitted some characters because their proper treatment wasn’t obvious. This issue addresses some of those characters, presents a proposed treatment, and solicits opinions. Anyone can opine, but responses will be especially valued from @kevinbarabash, @edemaine, @xymostech, @sophiebits, @gagern, @kasperpeulen, and @flying-sheep. If and when we reach agreement, I’ll submit PRs for the characters that we’ve agreed upon.

For each of the following items, I ask that you consider the resolution and either vote thumbs up, or vote thumbs down and state a reason why.

Unicode

All 7 comments

These two characters, ∴ (U+2234, THEREFORE) and ∵ (U+2235, BECAUSE), obviously map to \therefore and \because. I’ve withheld them until now because the sources conflict on the proper atom type. KaTeX symbols.js says they are rel atoms. unicode-math says they are \mathord. I think KaTeX got this one right. They come from the amssymb package, which contains: \DeclareMathSymbol{\therefore} {\mathrel}{AMSa}{"29}.

Resolved: Map to KaTeX symbols \therefore and \because respectively.

These two characters, ⋗ (U+22D7, GREATER-THAN WITH DOT) and ⋖ (U+22D6, LESS-THAN WITH DOT), obviously map to \gtrdot and \lessdot, but there is again a type conflict. KaTeX symbols.js says they are bin atoms. unicode-math says they are \mathrel. I’ve never used these symbols, so I don’t have a personal opinion on this one. The symbols come from the amssymb package, which contains: \DeclareMathSymbol{\gtrdot} {\mathbin}{AMSb}{"6D}.

Resolved: Map to KaTeX symbols \gtrdot and \lessdot respectively.

This character, ∣, is U+2223, DIVIDES. I’ve withheld it until now only because it looks just like |, U+007C. So the question here is whether we include any confusables at all. If we do, then the path forward is reasonably clear. unicode-math maps this character to \mid, the vertical line with rel spacing, useful for set builder notation. John Cook maps it the same way, as does Microsoft Word.

Resolved: Map to KaTeX symbol \mid.
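
For reference, the resolutions proposed above can be collected into one lookup table. This is only a rough sketch: `proposedMappings` and `lookupUnicodeSymbol` are hypothetical names, and the structure is not the actual layout of KaTeX's symbols.js. The atom types are the ones this thread reports for KaTeX and amssymb.

```javascript
// Hypothetical summary of the resolutions proposed in this thread.
// NOT KaTeX's real symbols.js; the names and shape are assumptions.
const proposedMappings = new Map([
  ["\u2234", { macro: "\\therefore", atom: "rel" }], // THEREFORE
  ["\u2235", { macro: "\\because", atom: "rel" }],   // BECAUSE
  ["\u22D7", { macro: "\\gtrdot", atom: "bin" }],    // GREATER-THAN WITH DOT
  ["\u22D6", { macro: "\\lessdot", atom: "bin" }],   // LESS-THAN WITH DOT
  ["\u2223", { macro: "\\mid", atom: "rel" }],       // DIVIDES
]);

// Return the proposed mapping for a single character, or null if the
// character is not covered by this thread's resolutions.
function lookupUnicodeSymbol(ch) {
  return proposedMappings.has(ch) ? proposedMappings.get(ch) : null;
}
```

Note that the ASCII vertical bar (U+007C) is deliberately absent from the table: only the confusable U+2223 gets the rel-spaced \mid.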

That’s all I have prepared for now. More to come later.

I would have expected U+2223 (DIVIDES) to resolve to \divides, though I guess that's only defined in a somewhat obscure package (mathabx). I think \mvert which is equivalent to \mid is the typical way to denote the divides operator in AMSMath. (FWIW, mathabx's \divides is a thinner vertical line with somewhat larger space than \mid.)

Also, I observe that unicode-math maps U+2223 to \mid. Overall I'm in favor of this.

The type conflicts are weird -- why does unicode-math redefine the type of existing characters? I think it makes sense for us to follow the aliasing, but not redefine the types. I find it weird that \gtrdot and \lessdot are \mathbin not \mathrel, but if AMS defines them that way, I'm fine with it.
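
The "follow the aliasing, but don't redefine the types" policy could be sketched roughly as below. Everything here is illustrative: `existingTypes`, `unicodeMathEntries`, and `adoptAliases` are hypothetical names, and the type values are simply the ones reported earlier in this thread, not data read from either project.

```javascript
// Atom types as this thread reports them for KaTeX's symbols.js (assumed data).
const existingTypes = {
  "\\therefore": "rel",
  "\\because": "rel",
  "\\gtrdot": "bin",
  "\\lessdot": "bin",
};

// Character-to-macro aliases with the types this thread attributes to
// unicode-math (assumed data, "ord" standing in for \mathord).
const unicodeMathEntries = [
  { char: "\u2234", macro: "\\therefore", type: "ord" },
  { char: "\u2235", macro: "\\because", type: "ord" },
  { char: "\u22D7", macro: "\\gtrdot", type: "rel" },
  { char: "\u22D6", macro: "\\lessdot", type: "rel" },
];

// Keep the char -> macro alias, but let an already-known type win over the
// type suggested by the alias source.
function adoptAliases(entries, types) {
  return entries.map(({ char, macro, type }) => ({
    char,
    macro,
    type: Object.prototype.hasOwnProperty.call(types, macro) ? types[macro] : type,
  }));
}

const merged = adoptAliases(unicodeMathEntries, existingTypes);
```

With this approach a macro that has no existing type would fall back to whatever the alias source says, which may or may not be what we want.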

i’m for using confusables.

  • ⁃ hyphen-bullet
  • ‐ hyphen
  • - hyphen-minus (ASCII)
  • − minus
  • – en-dash
  • — em-dash

are all confusable, and those are just the most common ones. There are more!

LaTeX translates hyphen-minus into minus, idk if we do it, too.
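
To make the point concrete, here is a small sketch that prints the code point behind each of those look-alikes, plus one possible way to turn hyphen-minus into a minus sign. The names `dashLike`, `codePointLabel`, and `hyphenMinusToMinus` are hypothetical, and the replacement step is only an assumption about what such a normalization might look like, not a description of what KaTeX does.

```javascript
// The dash-like characters listed above, by name.
const dashLike = [
  ["hyphen-bullet", "\u2043"],
  ["hyphen", "\u2010"],
  ["hyphen-minus (ASCII)", "\u002D"],
  ["minus", "\u2212"],
  ["en-dash", "\u2013"],
  ["em-dash", "\u2014"],
];

// Format a character's code point as U+XXXX.
function codePointLabel(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical normalization: replace every ASCII hyphen-minus in a math
// string with U+2212 MINUS SIGN.
function hyphenMinusToMinus(mathSource) {
  return mathSource.replace(/-/g, "\u2212");
}

for (const [name, ch] of dashLike) {
  console.log(`${codePointLabel(ch)}  ${ch}  ${name}`);
}
```

All six entries have distinct code points even though several render almost identically in many fonts.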
