Katex: Add support for Persian & Arabic alphabet in text and eastern-arabic numerals + Proposal Commit

Created on 11 Jun 2017 · 16 Comments · Source: KaTeX/KaTeX

Countries with a combined population of more than 700 million people use the Persian, Arabic and Urdu alphabets.
These countries also have a huge number of students. Iran alone has 18 million students, for example.
Because of this, we see more and more educational startups targeting these countries.
I tried to add support for the symbols of these languages to KaTeX with as small a footprint as possible.
I think the results are pretty good. I haven't added any screenshotter tests yet, but from testing with make serve and checking by eye, I think I got a fairly precise result.
We face 2 challenges when we try to add these symbols:

  1. They use Eastern Arabic numerals, as opposed to (Western) Arabic numerals, in their schools.
  2. In the English alphabet, for example, each character code has a single glyph with a specified size, but the characters of these alphabets are cursive, so they change shape and size.

  • Adding the symbols was easy because, like CJK characters, all these alphabets are in a single Unicode range [0x0600 - 0x06FF].
  • I couldn't use the KaTeX default fonts because they don't have this character set and the glyphs have serious problems. So I added a new (open) font called Vazir-Code. The biggest advantage of this font is that it is monospace, so cursive characters have roughly the same height and width everywhere.
  • I calculated the font metrics using the awesome fontkit package.
  • Then I added the Eastern Arabic numerals as textords in math mode and the alphabet as textords in text mode.
  • Finally, I added a new font alongside the main and math fonts.
  • In buildCommon I add a new CSS class that only sets the font-family to Vazir-Code.
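The split described in the steps above (numerals accepted in math mode, letters accepted in text mode) can be sketched roughly as follows. This is a minimal illustration, not the code from the proposal commit: the function names are hypothetical, and treating every non-digit code point in the block as a text-mode letter is a simplifying assumption, since the block also contains punctuation and combining marks.

```javascript
// Hypothetical sketch; none of these names are KaTeX APIs.
const ARABIC_BLOCK_START = 0x0600;
const ARABIC_BLOCK_END = 0x06FF;

// Arabic-Indic digits (U+0660-U+0669) and Extended Arabic-Indic digits
// (U+06F0-U+06F9, the forms used in Persian and Urdu).
function isEasternArabicDigit(codePoint) {
    return (codePoint >= 0x0660 && codePoint <= 0x0669) ||
           (codePoint >= 0x06F0 && codePoint <= 0x06F9);
}

function isInArabicBlock(codePoint) {
    return codePoint >= ARABIC_BLOCK_START && codePoint <= ARABIC_BLOCK_END;
}

// Returns the modes in which a character would be accepted as a textord
// under the scheme described in the list: digits in math mode, every other
// code point of the block in text mode (a simplification, see lead-in).
function allowedModes(char) {
    const codePoint = char.codePointAt(0);
    if (!isInArabicBlock(codePoint)) {
        return [];
    }
    return isEasternArabicDigit(codePoint) ? ["math"] : ["text"];
}
```

For example, allowedModes("\u06F5") (the Persian digit five) yields the math-mode entry, while allowedModes("\u0633") (the letter seen) yields the text-mode entry.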

IMHO we can have 2 different approaches from here:

  1. If the changes generally look good to you, I can add the font to the main fonts folder of KaTeX, calculate the metrics using the Python script and finally add some tests and screenshotter tests.
  2. We could also add a small plugin system to KaTeX so that everyone can add their own fonts and CSS, and the library would consume the plugins' symbols.

IMHO the first approach is the better solution here, because Latin (I mean every Latin-based alphabet), CJK and Persian-Arabic alphabets constitute the majority of the alphabets in use.
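To make the second option more concrete, a plugin system for fonts could be as small as a registry that maps a Unicode range to a font family and a CSS class. The sketch below is purely hypothetical: registerFontPlugin, findPluginFor and the "persian-font" class name are invented for illustration and do not exist in KaTeX.

```javascript
// Hypothetical font-plugin registry; not a KaTeX API.
const fontPlugins = new Map();

// A plugin declares the code point range it covers, plus the font family
// and CSS class the renderer should apply to characters in that range.
function registerFontPlugin({name, fontFamily, range, cssClass}) {
    if (fontPlugins.has(name)) {
        throw new Error(`Font plugin "${name}" is already registered`);
    }
    fontPlugins.set(name, {fontFamily, range, cssClass});
}

// Returns the name of the first plugin whose range contains the character,
// or null if no plugin claims it.
function findPluginFor(char) {
    const codePoint = char.codePointAt(0);
    for (const [name, plugin] of fontPlugins) {
        const [start, end] = plugin.range;
        if (codePoint >= start && codePoint <= end) {
            return name;
        }
    }
    return null;
}

// Example registration for the font and range mentioned in this issue.
registerFontPlugin({
    name: "persian",
    fontFamily: "Vazir-Code",
    range: [0x0600, 0x06FF],
    cssClass: "persian-font", // assumed class name
});
```

With such a registry, users who do not need a given script would simply not load its plugin, which keeps the core bundle small.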

You can find my changes in this commit: https://github.com/HosseinAgha/KaTeX/commit/8696318b7509681a75dfa213009111cc818a66d0

It is just a proposal (proof of concept) and I have not created a pull request yet.
I really appreciate your opinion on this.

Unicode enhancement

All 16 comments

Overall, sounds cool -- thanks for your efforts in putting this together! In general, this seems like a good case study for adding languages/fonts in general, and is possibly relevant to the discussion on roman lower-case Greek letters on #564. I wonder if it could be helpful to write a document about adding languages/fonts in this way, once we figure out all the details.

I do feel like a plugin infrastructure may be warranted here: even if we put all of this in the main repository, users will probably want to deploy with the most appropriate subset of features for their application (in particular, not including fonts for languages they don't plan to use). For comparison, MathJax has an extension system, both built-in extensions (e.g. color) and official third party extensions. I'm not convinced we need extensions for "main features" (except maybe as a tool for users to add custom features that don't make sense for broader public, but macros will hopefully get us there). But I think it'd be good to have a plugin infrastructure for fonts specifically, because fonts are naturally on the larger side.

This also highlights an issue discussed on #632 of whether we should put fonts into a separate repository. If we made such a move, it would probably make it more palatable to start adding additional optional fonts, whereas currently repository size is a concern.

Specific to this Farsi patch:

  • Font license looks great. We can even modify the font if we need to. Probably need to include the font license/link somewhere in the repository though.
  • I don't have a sense of whether monospace fonts are considered "nice" for rendering Farsi. (In English, they're generally considered ugly, but e.g. in CJK all fonts are monospace.) It'd be helpful to have some aesthetic commentary here, especially in the context of mathematics.
  • I assume the idea is that authors would write Farsi characters in Unicode.
  • I assume a similar thing is possible in LaTeX, with a suitable package inclusion? What font(s) do they use?
  • Out of curiosity: I could see two contexts where you'd want Farsi within a mathematical formula: variable names and \text{...} descriptions (e.g. in underbraces, cases environment, etc.). Are both uses common, or more one than the other?

I'm really a fan of adding a lightweight plugin system for adding fonts to KaTeX, and I also don't like the MathJax way of handling extensions. I'll be happy to contribute, and I think we could discuss this in detail in a separate issue.

@edemaine Regarding the Farsi patch:

  • Yes, monospace fonts are not very beautiful in Farsi either. I cannot explain the problem with Farsi and Arabic characters better than this W3C picture. As you can see, Farsi and Arabic letters are cursive and can join and change size. In other words, Farsi letters are encoded semantically in Unicode, so a single Unicode code point may be rendered by multiple glyphs. Using a monospace font does not eliminate this issue, but it minimizes it by preserving a fixed character width.
  • Yes, IMHO Unicode is the best way to represent Farsi and Arabic on the web.
  • Good idea, thanks! For Persian in TeX we use the XePersian package. In the package I found an IranianSans font and I will investigate more, but as I said, due to the way KaTeX renders math and the cursive nature of the Farsi alphabet, I'm not optimistic that using these fonts will give the best results.
  • We need Farsi and Arabic only inside \text. As far as I know, schools don't use Farsi variable names at all, so Farsi mathords in math mode are not required.
    We also need Persian numerals in math mode (textord). They map exactly onto the normal English (Arabic, to be precise!) numerals, but adding support for them is absolutely necessary because most K12 mathematics (in all Persian schools) uses these numerals. You can find a table showing this mapping here.
    I've added support for both Farsi in \text and Farsi (Eastern Arabic) numerals in my proposal.
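The one-to-one digit mapping mentioned in the last point can be written down directly, since both Eastern digit sets occupy ten consecutive code points. This is only an illustrative sketch; the function names are made up here and are not part of the proposal or of KaTeX.

```javascript
// Convert Arabic-Indic (U+0660-U+0669) and Extended Arabic-Indic / Persian
// (U+06F0-U+06F9) digits to Western digits; other characters are unchanged.
function toWesternDigits(input) {
    return input.replace(/[\u0660-\u0669\u06F0-\u06F9]/g, (digit) => {
        const codePoint = digit.charCodeAt(0);
        const base = codePoint >= 0x06F0 ? 0x06F0 : 0x0660;
        return String(codePoint - base);
    });
}

// Convert Western digits to their Persian (Extended Arabic-Indic) forms.
function toPersianDigits(input) {
    return input.replace(/[0-9]/g, (digit) =>
        String.fromCharCode(0x06F0 + Number(digit)));
}
```

Because the mapping is purely positional, the numerals need no contextual shaping, which is why they are the easy part of this feature.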

I had a quick look at HosseinAgha/KaTeX@8696318, cool stuff. Having a plugin architecture for other languages would be really helpful.

@edemaine @kevinbarabash thank you for your comments.
As I understand it, you both think we need a plugin architecture in order to add Persian-Arabic (or any new language + font) support.
So I'll work on that, and then I'll create a new pull request with my proposed plugin architecture.

@HosseinAgha That sounds great! Thanks for taking on that additional task. Feel free to discuss/ask questions here too.

Great, great stuff.

On the question of monospace vs proportional fonts, I think you should not prematurely rule out use of a proportional font, if any exist. Given PR #670 , KaTeX has no current need for horizontal metrics other than italic skew. Of course, KaTeX is not yet mature and there may be unforeseen issues, but I am hopeful that there will never be any need for horizontal metrics.

A plug-in architecture would be wonderful. Hopefully, it will be useful for more than fonts. For instance, an mhchem plug-in could become feasible after a plug-in architecture becomes available. Or perhaps such an effort would need a separate plug-in architecture.

:+1: !

This will be important for KA Lite and Kolibri, now that we're moving to support RTL and Arabic.

@ronkok @kevinbarabash Because of the huge changes in #670, I need to change some of my plans for adding a plugin system to KaTeX. This is only to say that I have not given up on this, but I have been too busy lately and I need to take a deep look at the #670 changes and your discussions under the PR before finalizing any idea.

As I described above, the biggest challenge in adding these alphabets is the map from Unicode code point to glyph font metrics in the KaTeX font metrics file. That map holds a single set of metrics for each Unicode character. But in Persian script, for example, each Unicode character maps to up to 4 glyphs! So we would need a mapping function that finds a glyph's metrics based on both its code point and its context. The only solution I found was to use a monospace font, so that we have roughly the same metrics for each Unicode character.
@edemaine, @kevinbarabash, @ronkok your possible solutions and ideas on this issue are highly appreciated.
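The "up to 4 glyphs per character" point can be seen in Unicode itself: the Arabic Presentation Forms-B block encodes the isolated, final, initial and medial shapes of each letter separately, and NFKC normalization folds all four back to the single base code point. The sketch below uses the letter beh (U+0628) as an example. The contextualForm and metricsKey helpers are hypothetical; they only illustrate what a context-aware metrics lookup would have to key on, and are not how KaTeX stores metrics.

```javascript
// The four contextual shapes of ARABIC LETTER BEH (U+0628), as encoded in
// the Arabic Presentation Forms-B block.
const BEH = "\u0628";
const behForms = {
    isolated: "\uFE8F",
    final: "\uFE90",
    initial: "\uFE91",
    medial: "\uFE92",
};

// All four shapes normalize (NFKC) to the same base code point, which is
// why a metrics table keyed only by code point cannot tell them apart.
const allFormsMapToBeh = Object.values(behForms)
    .every((form) => form.normalize("NFKC") === BEH);

// Hypothetical helper: pick the shape of a dual-joining letter from whether
// it connects to the preceding and to the following letter.
function contextualForm(joinsPrevious, joinsNext) {
    if (joinsPrevious && joinsNext) return "medial";
    if (joinsPrevious) return "final";
    if (joinsNext) return "initial";
    return "isolated";
}

// Hypothetical key for a context-aware metrics table, e.g. "0628:initial".
function metricsKey(codePoint, form) {
    return `${codePoint.toString(16).padStart(4, "0")}:${form}`;
}
```

A monospace font sidesteps the need for such a lookup only to the extent that its four shapes really do share nearly identical metrics.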

@HosseinAgha, I speak as an expert only in PR #670; others here have deeper insights into the larger KaTeX issues. The whole point of PR #670 was to make horizontal font metrics unnecessary. PR #670 uses CSS methods, not calculations on font metrics. If any subject matter is passed to one of the PR #670 functions, KaTeX will put that subject matter into a span, and the browser will make the accent width match the span width. No character width metrics are needed for anything in PR #670, or for anything else in KaTeX.

Of course, KaTeX does need vertical font metrics. It always has. And the task you have taken on looks to me like a very challenging one, but I don't see how PR #670 adds to that challenge. Am I missing something?

@ronkok Cool! I did not know that you handled horizontal width using CSS tricks in #670! So I think most of my concerns about the horizontal alignment of the math symbols were unnecessary after all. Thanks for #670.
I think I should only worry about vertical changes in glyphs now.

@HosseinAgha, Now that I think back, \widehat and \widetilde both make a character count of the subject matter. Those two functions are tolerant of minor changes in width, so a character count was sufficient and no font width metrics were needed.

But that tolerance has its limits. If your work causes a single Persian screen character to signal that it has six characters, then the \widetilde rendering will look a little odd. The relevant line of code is in stretchy.svgSpan:
const numChars = group.value.value.length;

As you continue with your work, we should test that line to make sure it is not drawing the wrong inference about character count. Everything else in PR #670 is completely independent of character width. I'm glad that has helped you.

Good luck!
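As a rough illustration of the character-count concern above: a JavaScript string's length counts UTF-16 code units, so a letter followed by a combining vowel mark is one visible character but has a length of 2. The helper below is a hypothetical approximation that skips combining marks; it is not code from KaTeX, and whether the value inspected in stretchy.svgSpan ever contains such marks is not established in this thread.

```javascript
// BEH (U+0628) followed by the combining mark KASRA (U+0650):
// one visible character on screen, two UTF-16 code units.
const behWithKasra = "\u0628\u0650";

// Approximate the number of visible characters by iterating over code
// points and skipping combining marks (Unicode general category M).
function countWithoutMarks(str) {
    return Array.from(str).filter((ch) => !/\p{M}/u.test(ch)).length;
}

const rawLength = behWithKasra.length;                  // code units
const visibleCount = countWithoutMarks(behWithKasra);   // marks skipped
```

Here rawLength is 2 while visibleCount is 1, which is the kind of mismatch that could make \widehat or \widetilde pick a wider accent than the subject matter needs.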

@HosseinAgha one thing to note is that letters inside \text{} are collated but text outside is not. Were you looking to support these writing systems inside and/or outside of \text{}?

@kevinbarabash Good point. In our schools, the Persian and Arabic characters that need to be collated are always used inside \text{}, so fortunately this shouldn't cause any problems. In other words, we don't have math mode mathord variables in Persian and Arabic.
But we do need Persian numbers in math mode, and fortunately these numbers don't need to be collated.

For supporting new scripts inside \text{}, see PR #1060. I don't know what RTL issues will arise, but this patch, which just landed, makes it easy to add new scripts that are supported in \text{} environments.

This is complete now that KaTeX supports https://github.com/HosseinAgha/persian-katex-plugin.
