Mdx: MDX performance

Created on 16 Jul 2020 · 7 Comments · Source: mdx-js/mdx

☂️ This umbrella issue is for tracking work related to improving the performance of MDX.

I've been working with @pvdz on MDX performance. We've noted a few aspects that add unnecessary work which we should be able to reduce, especially in v2.

Numerous babel parse and transformation steps

Firstly, we have multiple babel parse steps throughout the MDX transpilation pipeline.

Imports and exports

  • Partitioning imports and exports
  • Finding the default export

Peter has done some work here in gatsby-plugin-mdx (gatsbyjs/gatsby#25437) that we can potentially adapt for usage in core.

Shortcode generation

We use babel to figure out what imports and exports exist, and then use that to instantiate variables coming from MDXProvider with makeShortcode. Also related to gatsbyjs/gatsby#25437
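As a rough illustration of the idea (not the actual MDX implementation): once the imported and exported names are known, the shortcodes are simply the component names used in JSX that nothing in the file declares. The helper name `shortcodeDeclarations` and both of its inputs are hypothetical; only `makeShortcode` comes from the description above.

```javascript
// Hypothetical helper: emit a makeShortcode declaration for every component
// name used in JSX that the document neither imports nor exports.
function shortcodeDeclarations(usedComponents, declaredNames) {
  const declared = new Set(declaredNames);
  return [...new Set(usedComponents)] // drop repeated usages
    .filter((name) => !declared.has(name))
    .map((name) => `const ${name} = makeShortcode(${JSON.stringify(name)});`);
}

// `Chart` is imported by the document, `Note` is not.
const shortcodeLines = shortcodeDeclarations(['Chart', 'Note', 'Chart'], ['Chart']);
console.log(shortcodeLines.join('\n'));
// const Note = makeShortcode("Note");
```

The point is that this step only needs the list of declared names, so it does not care whether those names came from Babel or from something cheaper.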

mdxType

This is used by the runtime (react/preact/vue) to determine which component to render. This is something we can do from the MDXAST in v2 since the JSX structure is represented.

Returning a compiled string that inevitably needs to be transpiled

Secondly, on top of these parse steps, we return a JSX string. In nearly all cases this JSX string is then transpiled to JS and mdx pragma function calls. This was originally an intentional choice because we wanted the MDX output to be palatable and familiar. However, it might make sense to serialize directly to function calls and JS.

This would remove a Babel step that users currently need (unless they use optional syntax or need browser polyfills, which would still be achievable in userland).
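To make the idea concrete, here is a minimal sketch of serializing a hast-like tree straight to pragma calls instead of a JSX string. It assumes the pragma is called `mdx` and that nodes are plain `text` or `element` nodes; the function name `toPragmaCall` is made up for this example.

```javascript
// Serialize a hast-like node directly to a pragma call string, skipping JSX.
// Assumes nodes are either {type: 'text', value} or
// {type: 'element', tagName, properties, children}.
function toPragmaCall(node, pragma = 'mdx') {
  if (node.type === 'text') return JSON.stringify(node.value);
  const hasProps = node.properties && Object.keys(node.properties).length > 0;
  const props = hasProps ? JSON.stringify(node.properties) : 'null';
  const children = (node.children || []).map((child) => toPragmaCall(child, pragma));
  return `${pragma}(${[JSON.stringify(node.tagName), props, ...children].join(', ')})`;
}

const heading = {
  type: 'element',
  tagName: 'h1',
  properties: {id: 'intro'},
  children: [{type: 'text', value: 'Hello, world'}],
};

const headingCall = toPragmaCall(heading);
console.log(headingCall);
// mdx("h1", {"id":"intro"}, "Hello, world")
```

A real serializer would also have to emit component references as identifiers rather than strings and pass embedded expressions through, which this sketch leaves out.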


You all are welcome to bring up other areas of the codebase we can make more performant or other ideas as well! In fact, we'd love your thoughts.

Labels: area/perf, v2, type/discussion, type/enhancement

All 7 comments

However, it might make sense to serialize directly to function calls and JS.

Even if we compile # This and <That /> to function calls, I can still see folks using JSX inside JSX, or inside expressions, and expecting MDX to handle them:

<Heading icon={<Icon />}>

## Heading {something ? <Y /> : <Z />}

I can still see folks using JSX inside JSX, or inside expressions, and expecting MDX to handle them

Yeah, I think what we can potentially do is transpile JSX to function calls only inside expressions. It _should_ still be a lot faster than handling the whole document (of course, this is something we will benchmark to be sure).

It is probably also faster if we only process JSX there and nothing else, leaving the rest up to folks. But indeed, I wonder how 100 expressions would benchmark against 1 file.

Yeah, so if we keep certain artificial limitations (which already apply today) in place, then we can distill the imports/exports from the mdast without the need for Babel. That's been the source of some significant perf improvements at startup time (like https://github.com/gatsbyjs/gatsby/pull/25757).

The reasoning here is that the import and export syntax is very strict and if we disallow comments in between then a regular expression or simple string manipulation can quickly get us the answers we need (-> the symbols being imported and exported).

For imports, the only limitation might be to disallow comments inside an import declaration and allow them only at the end of a line. These are the forms of import:

  • import ID from 'y'
  • import * as ID from 'y'
  • import {ID} from 'y'
  • import {ID as ID2} from 'y'
  • import ID, {ID2} from 'y'
  • import ID, * as ID2 from 'y'

Inside {} the pattern can repeat, and in each case the as part is optional. For the fix in Gatsby, to get the imported idents, I took these imports and used a regex to remove all the parts we were not interested in, leaving us with comma-separated sets of ID or ID as ID2. For each of those you can simply take the last ID, and that'll be the one you want.

Because imports are constants (and assuming valid input), there is no need to dedupe them.

So to make life easy, the only syntactical restriction, beyond non-standard syntax of course, is to disallow comments inside the import declaration. And maybe disallow the variant where from is omitted (where you import a module for side effects).
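The import handling described above could look roughly like the following. This is a sketch under the stated restrictions (valid input, no comments inside the declaration, side-effect imports excluded), not the code from the Gatsby PR; the function name `importedNames` is made up.

```javascript
// Extract the local names bound by a single import declaration without a JS
// parser. Assumes valid input, no comments inside the declaration, and
// whitespace around the `from` keyword.
function importedNames(declaration) {
  const match = /^\s*import\s+([\s\S]*?)\s+from\s*['"]/.exec(declaration);
  if (!match) return []; // e.g. a side-effect import: import 'y'
  return match[1]
    .replace(/[{}]/g, ',') // braces only group names; treat them as separators
    .split(',')
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) => part.split(/\s+/).pop()); // `ID as ID2`, `* as ID` -> last word
}

console.log(importedNames("import React, {useState as useS, useEffect} from 'react'"));
// [ 'React', 'useS', 'useEffect' ]
console.log(importedNames("import * as path from 'path'"));
// [ 'path' ]
```

Everything after `from` is ignored, so the module specifier never needs to be inspected.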

For exports it's a little trickier, mainly because you can export arbitrary expressions and because of defaults in destructuring. However, it turns out that exports are currently limited to a single line. That's great because that makes them easy to slice out.

Furthermore, if you apply the same comment restriction to exports and disallow destructuring defaults, you can "cheat" your way out of needing any JS parser and still distill all the exported symbols, as well as find the default export. You can even support the newer export <pattern> from 'file', which I believe is currently not supported.

  • export default function abc(){}
  • export const foo = bar
  • export class Boo {}
  • export { ding, dong as dang }
  • export let [a, b] = obj
  • export let [a = 1, b = 2] = obj <-- this is the one to disallow

In all the above cases, except the last, you can parse up to the first = character (for var, let, and const exports) to get all the exported symbol names safely. The syntax for function and class is restricted enough by itself. The re-export syntax can be handled similarly to the imports above. All in all, it'll be much faster than the overhead of a full JS parse.
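A sketch of the export side, again assuming single-line exports, no comments, and no destructuring defaults. The function name `exportedNames` is made up, and reporting the default export as the name `default` is a choice made for this example.

```javascript
// Collect the exported names from one single-line export statement without a
// JS parser. Assumes no comments and no destructuring defaults.
function exportedNames(line) {
  const src = line.trim();
  if (/^export\s+default\b/.test(src)) return ['default'];

  // export function abc(){}, export async function* abc(){}, export class Boo {}
  const decl = /^export\s+(?:async\s+)?(?:function\s*\*?|class)\s*([A-Za-z_$][\w$]*)/.exec(src);
  if (decl) return [decl[1]];

  // export const foo = bar, export let [a, b] = obj: read up to the first `=`
  const binding = /^export\s+(?:const|let|var)\s+([^=]+)/.exec(src);
  if (binding) {
    return binding[1]
      .replace(/[\[\]{}]/g, ',') // brackets and braces only group names
      .split(',')
      .map((part) => part.split(':').pop().replace('...', '').trim()) // `y: z` -> z
      .filter(Boolean);
  }

  // export { ding, dong as dang } and export { a as b } from 'file'
  const list = /^export\s*\{([^}]*)\}/.exec(src);
  if (list) {
    return list[1]
      .split(',')
      .map((part) => part.trim().split(/\s+/).pop()) // `dong as dang` -> dang
      .filter(Boolean);
  }
  return [];
}

console.log(exportedNames('export { ding, dong as dang }')); // [ 'ding', 'dang' ]
console.log(exportedNames('export let [a, b] = obj'));       // [ 'a', 'b' ]
```

Feeding it the disallowed form, export let [a = 1, b = 2] = obj, returns only a, because the scan stops at the first =. The same is true for several declarators in one statement (export const a = 1, b = 2), which this sketch does not handle.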

For JSX serialization you can use a faster parser/printer than Babel. I know Acorn can do it. There's also Sucrase, and a few others.

My suggestion to John was to default to anything fast and to expose an option for the user to do it for you instead, since mdx doesn't reaaally care how the jsx gets compiled to JS. Or wouldn't need to, as far as I understand. So a user could give mdx a callback like function callback(jsxString) { return parser(jsxString).serialize(pragma); } and mdx would just run it instead.

If I'm not mistaken, this way MDX wouldn't need to run a JS parser at all.

One other potential trick is to concatenate the JSX expressions together with a searchable separator (an identifier of sorts, or the debugger statement). Feed the result to a parser, print it again, and split on the debugger statement (or whatever you pick). That may already be what's happening now, I'm not sure.
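The separator trick could be sketched like this. `transform` stands in for whatever parser/printer is used; the toy transform below is not a real JSX compiler and only exists so the example runs on its own. The name `transformBatch` is made up.

```javascript
// Run many JSX expressions through ONE transform call by joining them with a
// `debugger;` separator, then splitting the printed output on that separator.
// Assumes the printer keeps `debugger;` statements and that no expression
// contains the text `debugger;` itself.
function transformBatch(expressions, transform) {
  const joined = expressions.map((expr) => `(${expr});`).join('\ndebugger;\n');
  return transform(joined)
    .split(/\s*debugger;\s*/)
    .map((chunk) => chunk.trim());
}

// Toy stand-in for a real compiler: rewrites only prop-less self-closing tags.
let transformCalls = 0;
const toyTransform = (source) => {
  transformCalls += 1;
  return source.replace(/<(\w+)\s*\/>/g, 'mdx("$1")');
};

const compiled = transformBatch(['<Y />', 'cond ? <A /> : <B />'], toyTransform);
console.log(compiled[0]);    // (mdx("Y"));
console.log(compiled[1]);    // (cond ? mdx("A") : mdx("B"));
console.log(transformCalls); // 1
```

Whether one large parse actually beats many small ones is exactly the kind of thing the benchmark would have to show.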

Oh, and a third option is to allow the user to pass through a Babel config / options for the whole build step. That way, if Babel is run inside MDX anyway, it can just as well do all the other transformations too, like polyfill transforms, so that the main pipeline doesn't need to process the output again. Potentially. But that might be a pretty big Pandora's box of complexity to open up.

FYI: https://github.com/gatsbyjs/gatsby/pull/26265 adds a baseline mdx benchmark to benchmarks/mdx-without-images, where you can run N=10000 M=4 yarn bench to benchmark 10000 basic mdx files with 4 GB of memory. It has no images and should test most of the main mdx pipeline (improvements to cover more of mdx are welcome).

Can't the parsed and transformed AST be passed directly to babel, which then can skip the parse step? That way mdx doesn't need to generate the js and the next step doesn't need to parse, just traverse and generate the final output?

Can't the parsed and transformed AST be passed directly to babel, which then can skip the parse step? That way mdx doesn't need to generate the js and the next step doesn't need to parse, just traverse and generate the final output?

There are methods to parse to an AST up front and to transform the AST itself. But it looks like mdx-hast-to-jsx is parsing different parts of the MDX HAST, which means different ASTs for all 3 transformSync calls.

PS. I'm happy to help any way I can. At Expo we would love improved performance ❤️
