Countries with a combined population of more than 700 million people use the Persian, Arabic and Urdu alphabets.
These countries also have a huge number of students. Iran alone, for example, has 18 million students.
Because of this, we see more and more educational startups targeting these countries.
I tried to add support for the symbols of these languages to KaTeX with as small a footprint as possible.
I think the results are pretty good. I haven't added any screenshotter tests yet, but from testing with make serve and checking by eye, I think the result is fairly precise.
We face 2 challenges when we try to add these symbols:
[...] [0x0600 - 0x06FF] Unicode range.
[...] textords in math mode and alphabet as textords in text mode.
[...] main and math font.
[...] buildCommon I add a new CSS class that only sets the font-family to Vazir-Code.
IMHO we can take 2 different approaches from here:
[...] fonts folder of KaTeX, calculate metrics using the Python script, and finally add some tests and screenshotter tests.
IMHO the first approach is the better solution here, because the Latin (I mean every Latin-based alphabet), CJK and Persian-Arabic alphabets constitute the majority of the alphabets in use.
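As a rough illustration of the [0x0600 - 0x06FF] range mentioned above, here is a minimal sketch of testing whether a character falls inside that block. The name isArabicBlock is hypothetical; it is not taken from the patch or from KaTeX itself.

```javascript
// Hypothetical helper, not part of KaTeX: reports whether the first code
// point of `ch` lies in the Arabic Unicode block [0x0600 - 0x06FF].
function isArabicBlock(ch) {
    const codePoint = ch.codePointAt(0); // undefined for an empty string
    return codePoint !== undefined && codePoint >= 0x0600 && codePoint <= 0x06FF;
}
```

A check like this could be one way to decide when to apply a dedicated CSS class, though the actual patch may do it differently.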
It is just a proposal (proof of concept) and I have not created a pull request yet.
I really appreciate your opinion on this.
Overall, sounds cool -- thanks for your efforts in putting this together! In general, this seems like a good case study for adding languages/fonts in general, and is possibly relevant to the discussion on roman lower-case Greek letters on #564. I wonder if it could be helpful to write a document about adding languages/fonts in this way, once we figure out all the details.
I do feel like a plugin infrastructure may be warranted here: even if we put all of this in the main repository, users will probably want to deploy with the most appropriate subset of features for their application (in particular, not including fonts for languages they don't plan to use). For comparison, MathJax has an extension system, with both built-in extensions (e.g. color) and official third-party extensions. I'm not convinced we need extensions for "main features" (except maybe as a tool for users to add custom features that don't make sense for the broader public, but macros will hopefully get us there). But I think it'd be good to have a plugin infrastructure for fonts specifically, because fonts are naturally on the larger side.
This also highlights an issue discussed on #632 of whether we should put fonts into a separate repository. If we made such a move, it would probably make it more palatable to start adding additional optional fonts, whereas currently repository size is a concern.
Specific to this Farsi patch:
[...] \text{...} descriptions (e.g. in underbraces, the cases environment, etc.). Are both uses common, or one more than the other?
I'm really a fan of adding a lightweight plugin system for adding fonts to KaTeX, and I also don't like MathJax's way of handling extensions. I'll be happy to contribute, and I think we could discuss this in detail in a separate issue.
@edemaine Regarding the Farsi patch:
[...] \text as far as I know schools don't use Farsi variable names at all. So Farsi mathords in math mode are not required.
[...] math mode (textord) they are exactly a map of normal English (Arabic, to be precise!) numerals, but adding support for them is absolutely necessary because most K12 mathematics (in all Persian schools) uses these numerals. You can find a table showing this mapping here.
[...] \text and Farsi (Eastern Arabic) numerals support in my proposal.
I had a quick look at HosseinAgha/KaTeX@8696318, cool stuff. Having a plugin architecture for other languages would be really helpful.
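To illustrate the point made in this thread that the Farsi numerals are an exact map of the Western digits, here is a minimal sketch of that mapping. The function toWesternDigits is hypothetical and is not part of KaTeX or the proposed patch; it relies only on the fact that the Persian digits (U+06F0 to U+06F9) and the Arabic-Indic digits (U+0660 to U+0669) are each encoded in order from 0 to 9.

```javascript
// Hypothetical helper, not part of KaTeX or the proposed patch.
// Each digit block lists 0-9 in order, so the digit's value is simply its
// offset from the first code point of its block.
function toWesternDigits(input) {
    return input.replace(/[\u06F0-\u06F9\u0660-\u0669]/g, (digit) => {
        const code = digit.charCodeAt(0);
        const base = code >= 0x06F0 ? 0x06F0 : 0x0660;
        return String(code - base);
    });
}
```

Because the mapping is one-to-one and position-independent, these numerals need no contextual shaping, unlike the letters.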
@edemaine @kevinbarabash thank you for your comments.
As I understand it, you both think we need a plugin architecture in order to add Persian-Arabic (or any new language + font) support.
So I'll work on that, then create a new pull request with my proposed plugin architecture.
@HosseinAgha That sounds great! Thanks for taking on that additional task. Feel free to discuss/ask questions here too.
Great, great stuff.
On the question of monospace vs proportional fonts, I think you should not prematurely rule out use of a proportional font, if any exist. Given PR #670, KaTeX has no current need for horizontal metrics other than italic skew. Of course, KaTeX is not yet mature and there may be unforeseen issues, but I am hopeful that there will never be any need for horizontal metrics.
A plug-in architecture would be wonderful. Hopefully, it will be useful for more than fonts. For instance, an mhchem plug-in could become feasible after a plug-in architecture becomes available. Or perhaps such an effort would need a separate plug-in architecture.
:+1: !
This will be important for KA Lite and Kolibri, now that we're moving to support RTL and Arabic.
@ronkok @kevinbarabash Because of the huge changes in #670, I need to change some of my plans for adding a plugin system to KaTeX. This is only to say that I have not given up on this, but I have been too busy lately and I need to take a deep look at the #670 changes and your discussions under the PR before finalizing any idea.
As I described above, the biggest challenge in adding these alphabets is that the KaTeX font metrics file maps a Unicode number --> glyph font metrics. It is a __many to one__ map, so each Unicode character can have only a single metric. But in Persian script, for example, each Unicode character maps to up to 4 glyphs! So we need a __many to many__ mapping function that finds a glyph's metrics based on its Unicode value and its context. The only solution I found was to use a monospace font, so that we have roughly the same metrics for each Unicode character.
@edemaine, @kevinbarabash, @ronkok any solutions or ideas on this issue would be highly appreciated.
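To make the "one character, up to 4 glyphs" problem concrete: the letter beh (U+0628) has isolated, final, initial and medial shapes, which Unicode also encodes separately as the compatibility characters U+FE8F to U+FE92. The following sketch uses only the standard String.prototype.normalize to show all four folding back to the same code point, which is why a table keyed by code point alone cannot tell them apart. The variable names are illustrative and nothing here is KaTeX code.

```javascript
// Illustration only, not KaTeX code.
const BEH = "\u0628"; // ARABIC LETTER BEH, as normally stored in text

// Arabic Presentation Forms-B code points for the four positional shapes.
const behForms = {
    isolated: "\uFE8F",
    final: "\uFE90",
    initial: "\uFE91",
    medial: "\uFE92",
};

// NFKC normalization replaces each presentation form with its base letter.
const foldedForms = Object.fromEntries(
    Object.entries(behForms).map(([position, glyph]) => [position, glyph.normalize("NFKC")])
);

// True when every positional shape collapses to the single base code point.
const allFoldToBeh = Object.values(foldedForms).every((ch) => ch === BEH);
```

In practice the shape is chosen by the font and the browser's text shaper from the neighbouring letters, so the metrics lookup never sees which of the four was drawn.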
@HosseinAgha, I speak as an expert only in PR #670; others here have deeper insights into the larger KaTeX issues. The whole point of PR #670 was to make horizontal font metrics unnecessary. PR #670 uses CSS methods, not calculations on font metrics. If any subject matter is passed to one of the PR #670 functions, KaTeX will put that subject matter into a span, and the browser will make the accent width match the span width. No character width metrics are needed for anything in PR #670, or for anything else in KaTeX.
Of course, KaTeX does need vertical font metrics. It always has. And the task you have taken on looks to me like a very challenging one, but I don't see how PR #670 adds to that challenge. Am I missing something?
@ronkok Cool! I did not know that you handled horizontal width using CSS tricks in #670! So I think most of my concerns about the horizontal alignment of the math symbols were unnecessary after all. Thanks for #670.
I think I should only worry about vertical changes in glyphs now.
@HosseinAgha, Now that I think back, \widehat and \widetilde both make a character count of the subject matter. Those two functions are tolerant of minor changes in width, so a character count was sufficient and no font width metrics were needed.
But that tolerance has its limits. If your work causes a single Persian screen character to signal that it has six characters, then the \widetilde rendering will look a little odd. The relevant line of code is in stretchy.svgSpan:
const numChars = group.value.value.length;
As you continue with your work, we should test that line to make sure it is not drawing the wrong inference about character count. Everything else in PR #670 is completely independent of character width. I'm glad that has helped you.
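One way the character count could drift from what is visible on screen is combining marks: a Persian letter carrying a vowel mark is one visible character but two UTF-16 code units, so .length reports 2. The sketch below is a hypothetical guard, not what KaTeX does; countBaseChars is an invented name, and it only skips Unicode combining marks (general category M), so it would not account for ligatures or other shaping effects.

```javascript
// Hypothetical helper, not part of KaTeX: counts code points while skipping
// Unicode combining marks (general category M), which attach to the
// preceding letter instead of occupying their own position.
function countBaseChars(text) {
    return Array.from(text).filter((ch) => !/\p{M}/u.test(ch)).length;
}

// Beh (U+0628) followed by the combining mark kasra (U+0650).
const behWithKasra = "\u0628\u0650";
```

Here behWithKasra.length is 2, while countBaseChars(behWithKasra) is 1, which is closer to what a reader sees.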
Good luck!
@HosseinAgha one thing to note is that letters inside \text{} are collated but text outside is not. Were you looking to support these writing systems inside and/or outside of \text{}?
@kevinbarabash good point. In our schools, the Persian and Arabic characters that need to collate are always used inside \text{}. So fortunately it shouldn't cause any problems. In other words, we don't have math mode mathord variables in Persian and Arabic.
But we do need Persian numbers in math mode, and fortunately these numbers don't need to be collated.
For supporting new scripts inside \text{}, see PR #1060. I don't know what RTL issues will arise, but this patch, which just landed, makes it easy to add new scripts that are supported in \text{} environments.
This is complete now that KaTeX supports https://github.com/HosseinAgha/persian-katex-plugin.