Katex: add support for macros via \def and \newcommand

Created on 18 Jun 2015  ·  23 Comments  ·  Source: KaTeX/KaTeX

In the process of helping @gagern with https://github.com/Khan/KaTeX/pull/246 I came to realize that environments are essentially functions that do some extra parsing, and do that parsing in a slightly different way. In the case of the array/matrix environments they recognize & as a special token to separate cells.

Macros are similar to environments except instead of simply parsing the incoming tokens differently, they also generate a new input stream from their definition by inserting the arguments they parsed into the macro definition. Then they parse this new input stream in place of the command and its arguments.
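To make the substitution step concrete, here is a minimal sketch of how a macro body and its already-parsed arguments could be turned into a new input string. The helper name expandMacroBody is hypothetical and not part of KaTeX, and the sketch ignores TeX details such as ## escapes.

```javascript
// Hypothetical helper, not part of KaTeX: substitute already-parsed
// argument strings into a macro definition to build the new input.
function expandMacroBody(body, args) {
    // Replace each #n (n = 1..9) with the text of the n-th argument.
    return body.replace(/#([1-9])/g, function(match, digit) {
        var index = Number(digit) - 1;
        if (index >= args.length) {
            throw new Error("Missing argument " + digit + " for macro body");
        }
        return args[index];
    });
}

var expanded = expandMacroBody("\\frac{#1}{#2}", ["a", "b"]);
console.log(expanded); // \frac{a}{b}
```

The resulting string is what would then be lexed and parsed in place of the command and its arguments.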

Why not create a new Parser/Lexer instance whenever we run into a macro, expand the macro, and parse the new input stream on its own? The parse tree from that expansion can then be returned to the parent Parser instance. Neither the parser nor the lexer has much state, so it shouldn't consume too much memory to new up additional Parsers/Lexers when needed.

cc: @xymostech @spicyj

enhancement

All 23 comments

I think https://github.com/Khan/KaTeX/issues/37 could be handled in a similar fashion.

I just noticed my note about \def\oneover{\frac{1}}. To handle this situation, if the temporary Parser is expecting more tokens when it comes to the end of the expanded input stream, it switches over to its parent's lexer.

I think it might be a good idea to first figure out a solid API for defining new commands in JS. Something like

defineCommand("\\limsup", function() {
  return hlist(text("lim"), spaces.thinspace, text("sup"));
});

and then we can implement \newcommand on top of that. I think most people using KaTeX (such as KA) won't want to support def/newcommand for user-defined expressions but people might want to define some site-specific commands.

(Incidentally, we don't have limsup or liminf because I want something like this first.)

Site-specific commands are on my agenda as well. I'd like to see single-string shorthands for simple cases:

katex.render(formula, element, {
    macros: {
        limsup: "\\operatorname{lim}\\,\\operatorname{sup}",
        RR: "\\mathbb{R}",
        set: "\\left\\{#1\\,\\middle\\vert\\,#2\\right\\}"
    }});

You could construct functions in the way KaTeX implements them from these string representations. For more complicated things, we could still let the user provide a function. But then compatibility is a major issue, so I'd make it something like this:

katex.render(formula, element, {
    macros: {
        limsup: defineMacro(1, function(api) {
            return api.hlist(api.text("lim"), api.spaces.thinspace, api.text("sup"));
        })
    }});

That way, the implementation would have access to a well-defined set of functions, encapsulated by that api object. We could make an effort to maintain backwards-compatibility there. The number 1 in the first argument to defineMacro would be a version number. Version numbers smaller than the current one could result in the handler getting wrapped inside some kind of compatibility layer.

I would like to keep macros per-parser, so that different KaTeX formulas can coexist on a page without interfering with one another. But you could also store any macro defined by \newcommand back to the macros property of the options object. That way, different formulas sharing the same macros object could share the same macros, even if those are defined from within the formulas.
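As a rough illustration of how the single-string shorthands could be turned into functions, the number of arguments can be read off the definition itself. This is only a sketch: countMacroArgs is a made-up name, and it assumes parameters are always written as #1 through #9.

```javascript
// Hypothetical helper: infer how many arguments a string-defined macro
// takes by finding the highest #n placeholder in its definition.
function countMacroArgs(definition) {
    var max = 0;
    var re = /#([1-9])/g;
    var m;
    while ((m = re.exec(definition)) !== null) {
        max = Math.max(max, Number(m[1]));
    }
    return max;
}

var macros = {
    limsup: "\\operatorname{lim}\\,\\operatorname{sup}",
    set: "\\left\\{#1\\,\\middle\\vert\\,#2\\right\\}"
};
console.log(countMacroArgs(macros.limsup)); // 0
console.log(countMacroArgs(macros.set));    // 2
```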

I think something simple like defineCommand("\\operatorname", "\\text{lim}\\,\\operatorname{sup}"); would work.

I think we should probably have ways to define both global macros and per-parser macros: defineCommand for the former and @gagern's suggestion of passing a map to render for the latter.

I'm not sure what the benefit is of having a JS API for constructing the definition of the macro. Maybe someone can provide an example where providing the TeX definition as a string would be more complicated than a JS API. Also, TeX syntax is well defined, so using that would hopefully mitigate versioning issues.

If we're missing basic commands that people need to implement commands on top of we should implement those using the current system and then they'd automatically be exposed to users both for direct use and within macros.

In terms of the performance aspect of defining macros with a TeX source string, we could store the parsed string as a tree of ParseNodes and clone that whenever the command is used.

Maybe someone can provide an example where providing the TeX definition as a string would be more complicated than a JS API.

For example a function which takes a bunch of digits as an argument and returns a barcode representation of that sequence, composed of rules. You could do this in LaTeX, but it would require quite advanced macro programming, so for KaTeX it's certainly easier to do such things in JavaScript, perhaps building on existing libraries. Most mathematicians don't need barcodes, but I hope you accept this example nevertheless, since you never know what people might or might not need.

In terms of performance aspect of defining macros with a TeX source string, we could store the parsed string as a tree of ParseNodes and clone that whenever the command is used.

Can we safely do so? For functions which can be used in text and in math mode, we should probably cache two distinct versions, since the parser includes that mode in many of its nodes. And I think we should offer some opt-out mechanism for that caching, although we could of course do that once the need arises.
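A minimal sketch of what caching one parse per mode might look like, assuming parse nodes are plain JSON-compatible data. The parser here is a stand-in for illustration only, and all names are hypothetical.

```javascript
// Hypothetical cache: one parsed tree per (mode, macro name) pair,
// cloned on every use so callers cannot mutate the cached copy.
function makeMacroCache(parse) {
    var cache = Object.create(null);
    return function lookup(name, source, mode) {
        var key = mode + ":" + name;
        if (!(key in cache)) {
            cache[key] = parse(source, mode);
        }
        // Deep clone; this assumes nodes are plain JSON-compatible data.
        return JSON.parse(JSON.stringify(cache[key]));
    };
}

// Stand-in parser for illustration only; it just records the mode.
var parseCalls = 0;
function fakeParse(source, mode) {
    parseCalls++;
    return { type: "ordgroup", mode: mode, source: source };
}

var lookupMacro = makeMacroCache(fakeParse);
var first = lookupMacro("\\RR", "\\mathbb{R}", "math");
var second = lookupMacro("\\RR", "\\mathbb{R}", "math"); // served from cache
var third = lookupMacro("\\RR", "\\mathbb{R}", "text");  // parsed again
console.log(parseCalls); // 2
```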

Can we safely do so?

It might be tricky to get this right. It's probably best to not do this optimization until we know whether any optimization is necessary.

I wrote some code trying to implement this, and found out that I hadn't considered all aspects. The main problem is that in our current setup, a function _must_ return a _single_ value. It is not allowed to return zero or more than one value. A macro which expands to a new set of tokens violates this. Ideally, the macro would return _nothing_, but just cause the parser state to change to something which has the macro body at the beginning of its current input token stream. Currently we have no mechanism to return nothing.

To show that this is relevant, consider the following scenario:

\newcommand{\foo}{123}
$e^\foo$

LaTeX renders this as $e^1 2 3$, i.e. although the whole macro _appears_ to be in the superscript position, actually it's only the first token of its expansion. So it is not enough to parse the expansion of the macro in a separate Parser and Lexer.
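The behaviour can be mimicked with a toy token-level expansion. This is a simplified sketch rather than how KaTeX or TeX actually tokenizes: the tokenizer treats every non-space character as its own token, macros take no arguments, and a self-referencing macro would loop forever.

```javascript
// Simplified tokenizer: a control sequence (\name), an escaped character,
// or any single non-space character.
function tokenize(input) {
    return input.match(/\\[a-zA-Z]+|\\.|\S/g) || [];
}

// Expand argument-less macros at the token level. Expansions are pushed
// back onto the input so nested macros get expanded as well.
function expandTokens(tokens, macros) {
    var out = [];
    var stack = tokens.slice().reverse(); // next token is at the end
    while (stack.length > 0) {
        var tok = stack.pop();
        if (Object.prototype.hasOwnProperty.call(macros, tok)) {
            var body = tokenize(macros[tok]);
            for (var i = body.length - 1; i >= 0; i--) {
                stack.push(body[i]);
            }
        } else {
            out.push(tok);
        }
    }
    return out;
}

var tokens = expandTokens(tokenize("e^\\foo"), { "\\foo": "123" });
console.log(tokens.join(" ")); // e ^ 1 2 3
// The superscript only grabs the single token following "^".
var superscript = tokens[tokens.indexOf("^") + 1];
console.log(superscript); // 1
```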

The best solution I can think of right now is having parseFunction return some special value when it encountered a macro, and catching that special value some levels up the call graph. But since parseFunction is pretty deep in the call graph, that's difficult. For example, if the macro expansion contains a &, then we have to deal with that in parseExpression, four layers above. handleSupSubscript and parseArguments will have to deal with the special case of a macro expansion separately, I guess.

Call graph

But perhaps I'm going about this the wrong way. Perhaps macro expansion should be a separate layer, between lexer and parser. So any time the Parser needs a new token, it can ask the MacroExpander for the next token. And the MacroExpander will know whether the token it got from the lexer is a known macro name, and if so, will expand that. If we ever do things like conditionals and recursion, we'd have to deal with these in the MacroExpander. That way, the parser will never see a token which still needs to be expanded. At least in theory.
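A rough sketch of such a layer, with the parser pulling tokens from the expander instead of from the lexer. SimpleLexer, MacroExpander and nextToken are illustrative names, not KaTeX's API, and the sketch handles only argument-less macros.

```javascript
// Hypothetical lexer: splits the input up front and hands out one
// token at a time, returning "EOF" once the input is exhausted.
function SimpleLexer(input) {
    this.tokens = input.match(/\\[a-zA-Z]+|\\.|\S/g) || [];
    this.pos = 0;
}
SimpleLexer.prototype.lex = function() {
    return this.pos < this.tokens.length ? this.tokens[this.pos++] : "EOF";
};

// Hypothetical expander sitting between lexer and parser.
function MacroExpander(input, macros) {
    this.lexer = new SimpleLexer(input);
    this.macros = macros;
    this.pending = []; // expansion tokens not yet consumed, next one last
}
// Returns the next token that needs no further expansion.
MacroExpander.prototype.nextToken = function() {
    for (;;) {
        var tok = this.pending.length > 0 ?
            this.pending.pop() : this.lexer.lex();
        if (!Object.prototype.hasOwnProperty.call(this.macros, tok)) {
            return tok;
        }
        // Known macro: queue its body and keep going.
        var body = new SimpleLexer(this.macros[tok]).tokens;
        for (var i = body.length - 1; i >= 0; i--) {
            this.pending.push(body[i]);
        }
    }
};

var expander = new MacroExpander("\\RR^n", { "\\RR": "\\mathbb{R}" });
var seen = [];
for (var t = expander.nextToken(); t !== "EOF"; t = expander.nextToken()) {
    seen.push(t);
}
console.log(seen.join(" ")); // \mathbb { R } ^ n
```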

In practice, it's not that simple. For example, we currently inform the lexer of the kind of thing it should lex; we can't do that if the tokens in question come from a macro expansion, so we have to add functions to turn tokens back into lengths, colors and so on. We will also need some cooperation between MacroExpander and Parser when defining a macro. Chapter 20 of The TeXbook has a list of all situations that could suppress a macro expansion; getting them all right will require considerable work.

Speaking of The TeXbook, perhaps the MacroExpander mentioned above could be shortened to Gullet, which is the term used throughout that book. Of course, in that case the Lexer should become Mouth for consistency, so I'm not sure we'd want to do that. Just thinking.

Maybe a stack of input streams. When we hit a macro we push its expansion onto the stack. When the parser grabs the next token it's always the next token in the input stream at the top of the stack. If it hits the end of an input stream, it pops that input stream off the stack. If there are no input streams left to pop then that's EOL. The lexer would have to be updated to deal with the stack of input streams, but this should work with the changing modes in the parser.
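Sketched in code, the stack of input streams might look like the following. InputStack is a hypothetical name, and the streams here are pre-tokenized arrays rather than real lexers.

```javascript
// Hypothetical stack of token streams. The stream on top is read first;
// exhausted streams are popped so reading continues in the one below.
function InputStack(tokens) {
    this.streams = [{ tokens: tokens, pos: 0 }];
}
InputStack.prototype.push = function(tokens) {
    this.streams.push({ tokens: tokens, pos: 0 });
};
InputStack.prototype.next = function() {
    while (this.streams.length > 0) {
        var top = this.streams[this.streams.length - 1];
        if (top.pos < top.tokens.length) {
            return top.tokens[top.pos++];
        }
        this.streams.pop();
    }
    return null; // no streams left: end of input
};

// \def\oneover{\frac{1}} followed by \oneover{x}
var input = new InputStack(["\\oneover", "{", "x", "}"]);
var out = [];
var tok;
while ((tok = input.next()) !== null) {
    if (tok === "\\oneover") {
        input.push(["\\frac", "{", "1", "}"]); // push the expansion
    } else {
        out.push(tok);
    }
}
console.log(out.join(" ")); // \frac { 1 } { x }
```

Because reading falls through to the stream below once the expansion runs out, \frac finds its second argument {x} in the original input, which is the \oneover case mentioned earlier.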

My macros branch has support for site-specific macros, i.e. macros defined by the invoking JavaScript code, not (yet) using \def or \newcommand. They don't accept parameters either so far. Nevertheless I consider this a huge step forward.

Unfortunately, that branch builds on some other things, so please put this on the roadmap after #266. See https://github.com/gagern/KaTeX/compare/8496a1e40f8dfbfc8dd7d2e7b5096bd5c2b01c02...gagern:macros for a preview relative to that issue.

The way I implemented this is by introducing a MacroExpander (or “gullet”) layer between Lexer and Parser, as outlined above, since I assume this to be closest to what TeX does. It also allows using macros in places where sizes or colors are required, which would not work otherwise.

@gagern awesome progress! Can you test a macro like this?

\def\oneover{\frac{1}}
\oneover{x}

I'll have a closer look at the diff this evening.

Can you test a macro like this?

http://localhost:7936/?text=\oneover{x}&\oneover=\frac{1} looks as it should.

But your question reminds me that there was something about infix operators like \over and the breakOnInfix parameter that still required work. Indeed, \scriptstyle 1\over x is currently broken. Working on that.

Do you mind me asking what happened here? I looked into how difficult it would be to allow for functions to return multiple values, and it wasn't that difficult. It would require rewriting the parseExpression function a bit, but nothing severe.

As for an input stack, I took a look at the source for LuaTeX and that's essentially how they handle macro expansion. However, we would have to remove the ability for the Lexer to simply take a position value for where to grab the next token, and would be reduced to a simple API, getNextToken.

Items that make use of Lexer positions:

Infixes -- I feel there is an easy fix for this, and it may have been what gagern had in his working example: we can put something like this in the parseExpression body after adding an inInfix flag to the parser, or passing it as a variable (similar to breakOnInfix):

    // Assumes infixOperators is an array of operator names such as "\\over".
    if (infixOperators.indexOf(token.text) !== -1) {
        // A second infix operator in the same math list is an error.
        if (breakOnInfix) { throw new Error("Unexpected infix operator: " + token.text); }
        var numer = body;
        this.consume();
        var denom = this.parseExpression(true); // breakOnInfix = true
        // Infix operators eat an entire math list
        body = [ handleInfix(token.text, numer, denom) ];
    }

As for the issue @gagern mentioned concerning scriptstyle and infixes, I think this is a nontrivial issue, but it is also one that already exists, so I don't think it would be a regression.

The parseArguments function also deals with lexer positions. It passes the positions of all the arguments to the functions, but it appears that no one uses them, so it seems that they can be safely removed.

[…] what happened here?

Since my own branch of KaTeX worked well enough for our project, I somehow got distracted and didn't land all of my modifications, with macros being the main missing component, unless you consider render-to-canvas a candidate for merging. I've been using macros from an external dictionary this past year, using this code (currently same as the diff link I provided above). It builds on a draft version of #364, so there are some chances that it would apply to current source without too much manual merging. Will probably try a rebase tomorrow.

I seem to recall that I did want to get some improvements to token position reporting in first, as a pull request in its own right, and somehow that got me stalled. By the way, @cbreeden, error reporting is the prime use of positions, and having that available at many levels can help produce quality error messages.

@gagern Hi, thanks! I had forgotten that the infix handling bugged me. Now that you reminded me, and I have thought about it again, I think I might know of a way to get rid of the requirement of scanning through the list after each parse looking for infixes. I'm going to see if I can get it to work.

But that's not very relevant to this topic I suppose. I took a look at your code a few weeks ago and it looked promising. Do you mind if I hack around with it to see how it feels?

@cbreeden This is open source, hack all you like. Of course, if you come up with something good, and that gets merged, I'd be glad if it had some API similar to what I had, since as I said I'm using this in a project and would like to continue doing so with a more recent version.

Rebasing the code is no fun at all, mostly because it's been so long since I worked with this.

@cbreeden and anyone else interested: I just rebased the code and created PR #493 for it, but I gave up on disentangling the error-reporting improvements from the parser-lexer restructuring from the macro expander. I hope I can get someone to review the combined effort.

@edemaine @gagern is anything blocking us from being able to implement \def and \newcommand? It seems like \def would probably be easier b/c it doesn't support optional parameters.

Now that https://github.com/Khan/KaTeX/pull/1309 has landed we're getting closer to \def and \newcommand, thanks @edemaine. I assume the difference between \gdef and \def is that if \def is used inside a group, whatever it defines is limited to that group, and afterwards it either becomes undefined or the previous definition becomes active again. This suggests that entries in the macros dictionary should instead be stacks. The interesting part will be knowing when to pop things off those stacks b/c we don't want to push a copy of macros every time we enter a group.
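One way to avoid copying the whole dictionary on every group is to record only what a group overwrites and restore it when the group ends. The following is a sketch of that idea under simplified assumptions (no \gdef, string-valued macros); MacroTable and its method names are hypothetical.

```javascript
// Hypothetical macro table with an undo log per open group.
function MacroTable() {
    this.current = Object.create(null);
    this.undoStack = []; // one record of overwritten values per open group
}
MacroTable.prototype.beginGroup = function() {
    this.undoStack.push(Object.create(null));
};
MacroTable.prototype.endGroup = function() {
    if (this.undoStack.length === 0) {
        throw new Error("Unbalanced group end");
    }
    var undo = this.undoStack.pop();
    for (var name in undo) {
        if (undo[name] === undefined) {
            delete this.current[name]; // was undefined before the group
        } else {
            this.current[name] = undo[name];
        }
    }
};
// Local definition, like \def inside a group.
MacroTable.prototype.def = function(name, value) {
    if (this.undoStack.length > 0) {
        var top = this.undoStack[this.undoStack.length - 1];
        // Remember only the value from before this group started.
        if (!(name in top)) {
            top[name] = this.current[name];
        }
    }
    this.current[name] = value;
};
MacroTable.prototype.get = function(name) {
    return this.current[name];
};

var table = new MacroTable();
table.def("\\x", "outer");
table.beginGroup();
table.def("\\x", "inner");
table.def("\\y", "temp");
console.log(table.get("\\x")); // inner
table.endGroup();
console.log(table.get("\\x")); // outer
console.log(table.get("\\y")); // undefined
```

Entering a group costs one empty record, and leaving it costs work proportional to the number of names redefined inside it.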

Once \def is implemented, \newcommand seems relatively trivial. We just need to check that the command doesn't already exist. Not sure how hard the optional parameter aspect of this would be.
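The existence check could look roughly like this, leaving optional parameters aside. The function name, the plain-object macros dictionary, and the builtins array are assumptions made for the sketch.

```javascript
// Hypothetical \newcommand on top of a plain macros dictionary:
// refuse to define a name that a macro or built-in already uses.
function defineNewCommand(macros, builtins, name, body) {
    var taken = Object.prototype.hasOwnProperty.call(macros, name) ||
        builtins.indexOf(name) !== -1;
    if (taken) {
        throw new Error("Command " + name + " already defined");
    }
    macros[name] = body;
}

var userMacros = {};
var builtinNames = ["\\frac", "\\sqrt"]; // illustrative subset only
defineNewCommand(userMacros, builtinNames, "\\RR", "\\mathbb{R}");
console.log(userMacros["\\RR"]); // \mathbb{R}
```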

One thing that's interesting about the current implementation is that we have multiple dictionaries for different things: macros, functions, symbols, environments, but it seems like these should actually all be in the same lookup table. In TeX, \def allows users to redefine commands, so KaTeX should do the same.

I've been thinking about exactly this, and plan to work on it. I'm not too worried about functions, because macros can override them, but we need similar functionality for environments (this is #977) and we will need it for lengths too (#687). The approach to avoid duplication is for each environment to have a parent pointer, which at the root level is the built-ins. This slows down lookup slightly if you're several groups in, but I imagine it's how TeX (and Python, JavaScript, etc.) does it as well.
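A minimal sketch of the parent-pointer idea, with lookups walking outward until a definition is found. Scope is a hypothetical name, and global definitions (\gdef) are left out.

```javascript
// Hypothetical scope with a parent pointer. Each group gets its own
// small table; lookups walk up the chain towards the root.
function Scope(parent) {
    this.parent = parent || null;
    this.table = Object.create(null);
}
Scope.prototype.get = function(name) {
    for (var s = this; s !== null; s = s.parent) {
        if (name in s.table) {
            return s.table[name];
        }
    }
    return undefined;
};
// Local definition: only visible in this scope and its children.
Scope.prototype.set = function(name, value) {
    this.table[name] = value;
};

var root = new Scope(null);    // stands in for the built-ins
root.set("\\x", "outer");
root.set("\\y", "builtin");
var group = new Scope(root);   // entering a group
group.set("\\x", "inner");
console.log(group.get("\\x")); // inner
console.log(group.get("\\y")); // builtin (found via the parent pointer)
console.log(root.get("\\x"));  // outer (unchanged once the group is dropped)
```

Leaving a group is just a matter of discarding its Scope object and continuing with the parent.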

For equation numbering (and defs) across multiple calls to KaTeX, I imagine that we can pass in an initial global environment to use, and it gets updated for the next call. All conceptually coming together! But it will take some time to implement.
