Gutenberg: Overview of Short-term Parsing Enhancements

Created on 27 Jul 2018 · 7 comments · Source: WordPress/gutenberg

One of the foundational pieces of this project, the block-based language of Gutenberg has so far been implemented by a set of parsers generated from a PEG grammar describing the language.

Two motivating factors to back this emerged early on in the development cycle: i) a PEG grammar is easy to understand, review and maintain; ii) the parsers generated from it, one in JS and one in PHP, worked out of the box, and — with caveats — this proved adequate even as the project grew: for instance, the parsing stack easily accommodated block nesting.

This intelligibility and quasi-self-documenting character are what led to establishing the PEG grammar as the official specification of the language of Gutenberg, in the form of its own package. This is an important step for the long-term success of Gutenberg as not just an editor, but a multi-environment ecosystem bound by the same language.
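For context, the language being specified is the HTML-comment block delimiter syntax (e.g. `<!-- wp:core/paragraph {"align":"left"} -->`). A minimal sketch of recognizing such an opener, assuming nothing beyond that shape (this is illustrative code, not the official grammar or any shipped parser; the real grammar also handles nesting, void blocks, and freeform HTML):

```rust
// Minimal sketch only: recognize a Gutenberg block-opening comment of the
// form `<!-- wp:namespace/name {...json} -->`. Nesting, void blocks, and
// freeform HTML are deliberately out of scope here.
fn parse_block_opener(input: &str) -> Option<(&str, &str)> {
    let inner = input.strip_prefix("<!-- wp:")?;
    let end = inner.find(" -->")?;
    let body = &inner[..end];
    // Split the block name from the optional JSON attributes.
    match body.find(' ') {
        Some(space) => Some((&body[..space], body[space + 1..].trim())),
        None => Some((body, "")),
    }
}

fn main() {
    let opener = r#"<!-- wp:core/paragraph {"align":"left"} -->"#;
    let (name, attrs) = parse_block_opener(opener).unwrap();
    assert_eq!(name, "core/paragraph");
    assert_eq!(attrs, r#"{"align":"left"}"#);
    println!("block: {}, attrs: {}", name, attrs);
}
```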

Opportunities

As Gutenberg braces for merging into core, all the tooling surrounding blocks needs to be as robust and versatile as any other part of WordPress. Among other things, this means that webmasters, plugin authors and platform engineers should have the power to hot-swap parsers to optimize for given requirements — whether requirements concern block analysis, hardware constraints, runtime, etc. Past experiments have confirmed that the PEG grammar is easily tweaked to output specific block information and can be optimized if told to ignore other aspects.

Furthermore, we have a duty to perform as well as possible. Our auto-generated parsers are known not to perform as well as other systems, a trade-off we made for clarity. Client-side parser performance has so far been satisfactory, but the server-side parser is a known bottleneck and may not be suitable for demanding posts. The server-side issue was mitigated by eschewing the PHP parser altogether when rendering front-end pages in favor of a simpler parse, but the fact is that obtaining a full parse tree of a post on the server should be performant and robust.

Finally, Gutenberg blocks are to reach well beyond WP-Admin and base WordPress installations. Namely, any third-party clients should be able to work with Gutenberg blocks, from the mobile apps to any community tool. This requirement implies that the language of Gutenberg will need to be parsed in diverse environments and runtimes, well beyond the scope of the base Gutenberg project.

Thus, there is the opportunity to establish the language of Gutenberg as a _de facto_ standard for semantic content formatting. Perhaps choosing parsers will be similar to choosing database systems: most site owners will go with the default, such as MySQL, but competing software abounds for those with different needs.

Parser tooling

Closes #6994.

Before we look at parsing itself, we need to be able to develop, validate and compare implementations.

Develop. Any party should be able to understand the specified language. For this, it should be clear that the PEG grammar is intended to be used as a reference. Tasks:

  • [x] Offer PEG grammar and its parser as its own spec-parser package. #7664
  • [x] Auto-generate human-readable version of block grammar. #6116
  • [ ] Increase visibility of this reference, perhaps via a standalone documentation page.

Validate. Developers should be provided official tools that can, as best as possible, test for correctness of parsers. Tasks:

  • [ ] Develop a large and diverse corpus, comprising posts long and short, with many and few blocks, different kinds of content, written in different locales, and with documents full of various kinds of invalid blocks.
  • [ ] Provide easy-to-use validators that test against the corpus.

    • [ ] Add a JEST test to build and compare language-agnostic parsers. #6030
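A corpus-driven validator boils down to running every candidate parser over the same documents and flagging disagreements. A toy sketch under that assumption, with stand-in "parsers" that merely count block openers (hypothetical helpers, not project code; real implementations would compare full block trees):

```rust
// Illustrative validator sketch: run two stand-in parser implementations
// over a corpus and report the documents where they disagree.
fn count_blocks_a(doc: &str) -> usize {
    doc.matches("<!-- wp:").count()
}

fn count_blocks_b(doc: &str) -> usize {
    // A second, deliberately different implementation of the same measure.
    doc.split("<!-- wp:").count().saturating_sub(1)
}

fn validate(corpus: &[&str]) -> Vec<usize> {
    // Indices of documents where the implementations disagree.
    let mut disagreements = Vec::new();
    for (i, doc) in corpus.iter().enumerate() {
        if count_blocks_a(doc) != count_blocks_b(doc) {
            disagreements.push(i);
        }
    }
    disagreements
}

fn main() {
    let corpus = [
        "<!-- wp:core/paragraph --><p>Hi</p><!-- /wp:core/paragraph -->",
        "no blocks here",
    ];
    assert!(validate(&corpus).is_empty());
    println!("parsers agree on all {} documents", corpus.len());
}
```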

Compare. A parser doesn't just need to work; it likely needs to perform well. To help inform development (and instigate a healthy sense of competition?), it should be possible to compare parser performance side-by-side or multilaterally. Here, performance encompasses both space and time. Tasks:

  • [ ] Parser: Build system to compare alternative parser implementations. #6831
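A minimal timing harness in the spirit of such a comparison system might be sketched as follows; the `parse` workload here is a stand-in, not a real parser, and a real harness would run each candidate over the whole corpus:

```rust
use std::time::Instant;

// Stand-in workload: count block openers, as a placeholder for a parse.
fn parse(doc: &str) -> usize {
    doc.matches("<!-- wp:").count()
}

// Time `runs` invocations of a candidate parser on one document.
fn time_parser<F: Fn(&str) -> usize>(name: &str, f: F, doc: &str, runs: u32) {
    let start = Instant::now();
    let mut total = 0;
    for _ in 0..runs {
        total += f(doc);
    }
    let elapsed = start.elapsed();
    println!("{}: {} runs in {:?} ({} blocks found)", name, runs, elapsed, total);
}

fn main() {
    let doc = "<!-- wp:core/paragraph --><p>x</p><!-- /wp:core/paragraph -->".repeat(100);
    time_parser("stand-in", parse, &doc, 1_000);
}
```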

Ultimately, all tasks for the above paragraphs should converge towards providing:

Related open issues:

  • [x] Check support for /u flag in installed PCRE library #4852
  • [x] Verify parser doesn't cause PCRE fault with Subresource Integrity (SRI) Manager plugin and PHP 5.6.36 #8671

Exploring new parsers

The past months have seen fascinating and encouraging progress in the development of competing parsers.

  • @Hywan developed a Rust parser that compiles to a number of different targets, and notably to those that core Gutenberg cares about: the server, in the form of a PHP extension; the client, via WebAssembly binary and an ASM.js fallback. Through use of the Hoa compiler, native PHP bindings are also available, including an experimental PCRE parser.
  • @dmsnell wrote a hand-coded parser #8083, working with the PHP interpreter to seek optimal performance.
  • @pento asked the question of "if we start with the spec grammar, how much can we gain from fine-tuning it?" in Improve Parser Performance #8044.

These experiments have been benchmarked using gutenberg-document-library, and results have been extremely encouraging.

_to do: insert benchmark tables for CPU and memory_

Labels: Framework, [Feature] Parsing, [Type] Overview, [Type] Task

Most helpful comment

Thank you for the very detailed post.

The project I develop is the Rust Gutenberg parser. Before posting the benchmark results, let me quickly talk about the project.

Presentation

The goal of this parser is to provide a single implementation with multiple bindings to reach different environments and languages. Thus, we have only one codebase to maintain and keep updated. The implementation is written in Rust with the help of the nom library, one of the fastest parser-combinator libraries around, with a strong community and wide adoption (823k downloads at the time of writing).

Bindings are very light. For instance, the WebAssembly binding is only 300 LOC (with comments). The ASM.js binding is only 30 LOC (with comments). The C binding is free (no code to write). The PHP binding is 300 LOC (with comments). The NodeJS binding is 150 LOC (with comments). They are tiny surfaces of code to review.

As you can see, Rust can be embedded almost everywhere: on the Web through WASM and ASM.js, in PHP as an extension, in NodeJS as a native module or through WASM/ASM.js, in Java through FFI (as a POC for now), etc. This is the motivation behind Rust: a single implementation usable in multiple environments and languages. Rust is also very safe and fast, arguably safer than any alternative we have today. Finally, Rust's memory usage is predictable and efficient, which further motivates the choice of this language.

Rust provides solid types. The parser produces an AST that is a single type named Node. It is memory-compact and strict.
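Assuming a design along those lines, a single-type AST could be sketched as follows (the real parser's field names, variants, and lifetimes may differ; this is an illustration of the idea, not its actual definition):

```rust
// Illustrative single-type AST in the spirit of the parser's `Node`.
#[derive(Debug, PartialEq)]
enum Node<'a> {
    // A Gutenberg block: its name, raw JSON attributes, and children.
    Block {
        name: &'a str,
        attributes: Option<&'a str>,
        children: Vec<Node<'a>>,
    },
    // Free-form HTML between blocks.
    Phrase(&'a str),
}

fn main() {
    let ast = Node::Block {
        name: "core/paragraph",
        attributes: Some(r#"{"align":"left"}"#),
        children: vec![Node::Phrase("<p>Hello</p>")],
    };
    // Borrowing slices of the input keeps the tree compact: no owned strings.
    if let Node::Block { name, children, .. } = &ast {
        assert_eq!(*name, "core/paragraph");
        assert_eq!(children.len(), 1);
    }
    println!("{:?}", ast);
}
```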

Documentation and specification

Rust has rustdoc, a tool to test the documentation and all of its examples, and also to publish the documentation as HTML files. Here is an (outdated) version for the Rust Gutenberg parser.

Here is the documentation for the root function, which is the entry point of the parser. We can see a description, an example, and the code.

I would like to point out that the grammar is specified with the EBNF format. So we have the specification here.

But more importantly, just like the root function, each rule of the parser is documented and contains an example, e.g. the block rule. Thanks to rustdoc, we are sure that each example always runs and is always valid. This eases onboarding for new users and contributors.
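The rustdoc workflow described here can be illustrated with a hypothetical helper (not the parser's real API): examples embedded in `///` doc comments are compiled and executed by `cargo test`, so they cannot silently rot. In real code the example sits in a fenced block inside the doc comment:

```rust
// Sketch of rustdoc's documentation tests: `cargo test` compiles and runs
// examples written in doc comments, keeping them honest. The helper below
// is hypothetical, purely for illustration.

/// Returns `true` if `input` starts a Gutenberg block comment.
///
/// Example (doc-tested by rustdoc when placed in a fenced block):
/// `assert!(is_block_opener("<!-- wp:core/quote -->"));`
fn is_block_opener(input: &str) -> bool {
    input.starts_with("<!-- wp:")
}

fn main() {
    // The same check the doc example above performs.
    assert!(is_block_opener("<!-- wp:core/quote -->"));
    assert!(!is_block_opener("<p>plain html</p>"));
    println!("example holds");
}
```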

Tests

Since the parser is written with a parser-combinator framework, each sub-parser is unit-testable. There are also integration tests. Finally, each binding has its own unit and integration tests.

Bindings

As I said, there is a single implementation written in Rust, and then we embed this implementation into multiple environments and languages. So far, the following bindings are implemented, tested, documented and benched:

  • WebAssembly (as a binary): To use the parser mostly in a Web environment,
  • ASM.js (as a JavaScript code): As a fallback of the WebAssembly binary in a Web environment (for instance for Internet Explorer),
  • C (as a static library + a header file): To include the Rust parser in every C compatible project,
  • PHP (as a native extension): To use the parser in PHP natively,
  • NodeJS (as a native module): To use the parser in NodeJS natively (note that NodeJS supports WebAssembly and ASM.js too).

They have been extensively tested. The PHP extension has been checked with many tools (like valgrind) to verify there are no memory leaks (and there aren't).

Benchmarks

We use benchmarks as a factor of the success of this project. The problems that need to be addressed are:

  • The PEG.js parser is slow when dealing with large inputs (aka blog posts), like 2.5s to parse [moby-dick-parsed.html],
  • The PEG.php parser is slow with medium inputs, and quickly hits the default memory limit, like 5.4s to parse [moby-dick-parsed.html].

Benchmarking with gutenberg-parser-comparator

PEG.js parser vs. Rust parser as a WebAssembly binary

I used gutenberg-parser-comparator to compare the PEG.js parser against the WebAssembly binary in Firefox nightly:

| file | Javascript parser (ms) | Rust parser as a WASM binary (ms) | speedup |
|-|-|-|-|
| [demo-post.html] | 13.167 | 0.137 | × 96 |
| [shortcode-shortcomings.html] | 26.784 | 0.225 | × 119 |
| [redesigning-chrome-desktop.html] | 75.500 | 0.905 | × 83 |
| [web-at-maximum-fps.html] | 88.118 | 0.698 | × 126 |
| [early-adopting-the-future.html] | 201.011 | 0.927 | × 217 |
| [pygmalian-raw-html.html] | 311.416 | 1.016 | × 307 |
| [moby-dick-parsed.html] | 2,466.533 | 14.673 | × 168 |

The WASM binary of the Rust parser is on average 159 times faster than the current JavaScript implementation. The median speedup is 126.
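These summary statistics can be recomputed from the per-document speedups in the table above; a quick sketch:

```rust
// Recompute the summary statistics from the WASM speedup column above:
// the mean rounds to 159 and the median is 126.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn median(xs: &[f64]) -> f64 {
    let mut sorted = xs.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    sorted[sorted.len() / 2] // odd-length slice: middle element
}

fn main() {
    let speedups = [96.0, 119.0, 83.0, 126.0, 217.0, 307.0, 168.0];
    assert_eq!(mean(&speedups).round(), 159.0);
    assert_eq!(median(&speedups), 126.0);
    println!("mean ≈ {:.0}, median = {}", mean(&speedups), median(&speedups));
}
```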

PEG.js parser vs. Rust parser as an ASM.js module

I used gutenberg-parser-comparator to compare the PEG.js parser against the ASM.js fallback module (fallback of the WASM binary) in Firefox nightly:

| file | Javascript parser (ms) | Rust parser as an ASM.js module (ms) | speedup |
|-|-|-|-|
| [demo-post.html] | 15.368 | 2.718 | × 6 |
| [shortcode-shortcomings.html] | 31.022 | 8.004 | × 4 |
| [redesigning-chrome-desktop.html] | 106.416 | 19.223 | × 6 |
| [web-at-maximum-fps.html] | 82.920 | 27.197 | × 3 |
| [early-adopting-the-future.html] | 119.880 | 38.321 | × 3 |
| [pygmalian-raw-html.html] | 349.075 | 23.656 | × 15 |
| [moby-dick-parsed.html] | 2,543.750 | 361.423 | × 7 |

The ASM.js module version of the Rust parser is on average 6 times faster than the current JavaScript implementation. The median speedup is 6.

Benchmarking with benchmark.js

I used benchmark.js to get high-resolution timers and statistically significant results.

WebAssembly

I used benchmark.js on the WebAssembly binary to check its own performance (with no comparison) in NodeJS latest stable:

| file | throughput/time | number of runs sampled |
|-|-|-|
| autoclosing_block | 118,970 ops/sec (0.008ms) ±1.58% | 78 |
| early_adopting_the_future | 380 ops/sec (2.633ms) ±14.04% | 54 |
| gutenberg_demo | 2,777 ops/sec (0.360ms) ±2.83% | 74 |
| moby_dick_parsed | 37 ops/sec (22.292ms) ±5.32% | 57 |
| pygmalian_raw_html | 700 ops/sec (1.492ms) ±1.33% | 79 |
| redesigning_chrome_desktop | 370 ops/sec (2.705ms) ±18.71% | 64 |
| shortcode_shortcomings | 1,680 ops/sec (0.595ms) ±2.57% | 73 |
| web_at_maximum_fps | 681 ops/sec (1.469ms) ±5.50% | 72 |

Benchmarking with PHPBench

I used PHPBench to get statistically significant results.

PHP native extension

I used PHPBench on the PHP native extension to check its own performance (with no comparison) in PHP 7.2 with other default extensions enabled:

| params | revs | its | mem_peak | mean | mode | best | worst | rstdev |
|------------------------------------------|------|-----|------------|-------------|-------------|-------------|-------------|--------|
| {"subject":"autoclosing-block"} | 1000 | 10 | 472,240b | 1.164μs | 1.174μs | 0.974μs | 1.259μs | 6.26% |
| {"subject":"early-adopting-the-future"} | 1000 | 10 | 760,888b | 305.396μs | 303.148μs | 296.700μs | 328.691μs | 2.81% |
| {"subject":"gutenberg-demo"} | 1000 | 10 | 487,928b | 58.186μs | 57.660μs | 55.948μs | 63.120μs | 3.20% |
| {"subject":"moby-dick-parsed"} | 1000 | 10 | 5,826,088b | 5,058.416μs | 4,962.539μs | 4,726.858μs | 5,474.889μs | 4.46% |
| {"subject":"pygmalian-raw-html"} | 1000 | 10 | 925,256b | 51.074μs | 50.928μs | 48.920μs | 54.716μs | 2.96% |
| {"subject":"redesigning-chrome-desktop"} | 1000 | 10 | 790,712b | 387.499μs | 383.050μs | 379.184μs | 409.578μs | 2.34% |
| {"subject":"shortcode-shortcomings"} | 1000 | 10 | 532,488b | 92.731μs | 93.002μs | 89.908μs | 95.436μs | 1.68% |
| {"subject":"web-at-maximum-fps"} | 1000 | 10 | 686,784b | 264.698μs | 266.231μs | 256.919μs | 269.058μs | 1.44% |

Note that times are reported in µs (10⁻⁶ s), not ms (10⁻³ s): a difference of three orders of magnitude.

Benchmarking with Criterion

I used Criterion to get statistically significant results.

Rust parser

Because the Rust parser needs to be benchmarked too!

| subject | time (mean) |
|-|-|
| autoclosing_block | 0.182µs |
| early_adopting_the_future | 92.498µs |
| gutenberg_demo | 15.513µs |
| moby_dick_parsed | 1348.9µs |
| pygmalian_raw_html | 34.027µs |
| redesigning_chrome_desktop | 117.630µs |
| shortcode_shortcomings | 27.605µs |
| web_at_maximum_fps | 80.677µs |

Note that times are reported in µs (10⁻⁶ s). Also note that autoclosing_block is parsed in 0.182µs, i.e. 182ns (yep, nanoseconds 😛), and that moby_dick_parsed is parsed in 1.3ms.

This is roughly C-level performance, I would guess, since the C binding uses the same code.

Conclusion

The Rust parser provides a single, safe, and very fast implementation.

We have successfully embedded the Rust parser in multiple environments and languages through bindings: WebAssembly, ASM.js, C, PHP, and NodeJS.

The WebAssembly binding is safer and faster than the PEG.js implementation, with an average speedup of 159. The binding is written in 300 LOC.

  • A large post like [moby-dick-parsed.html] takes 2466ms to parse with PEG.js, while it takes 14ms to parse with WebAssembly.
  • A medium post like [early-adopting-the-future.html] takes 201ms to parse with PEG.js, while it takes 0.9ms to parse with WebAssembly.
  • A post with no Gutenberg blocks like [pygmalian-raw-html.html] takes 311ms to parse with PEG.js, while it takes 1ms to parse with WebAssembly.

The ASM.js binding is a fallback of the WebAssembly binary for environments that do not support WebAssembly. It is faster on all documents, with an average speedup of 6. The binding is written in 30 LOC.

The C binding is almost free. It is as fast as the Rust parser.

The PHP binding is safer, faster and far less memory-hungry than the PEG.php parser, with a speedup of 5230. The binding is written in 300 LOC.

  • A large post like [moby-dick-parsed.html] takes 5437ms to parse with PEG.php, while it takes 5ms to parse with the PHP extension,
  • A medium post like [early-adopting-the-future.html] takes 280ms to parse with PEG.php, while it takes 0.2ms to parse with the PHP extension.
  • A post with no Gutenberg blocks like [pygmalian-raw-html.html] takes 377ms to parse with PEG.php, while it takes 0.05ms to parse with the PHP extension.

The NodeJS binding works, but I haven't had time to benchmark it yet. The WebAssembly benchmark was run in NodeJS as well as in Firefox. We can hope that the NodeJS native module is faster than WebAssembly for now (because WebAssembly does not yet support SIMD, tail calls, etc.). In any case, it is faster and safer than the PEG.js parser, since the binding is written in Rust and involves no C code. The binding is written in 150 LOC.

We can conclude that this project addresses the limitations of memory usage and speed, and provides a neat way to maintain a single implementation that can be used in many environments almost for free, compared to a situation where we would have to re-implement the parser for every environment. Rust provides the tooling to write, test and publish the documentation, and thus the specification of the Gutenberg block language, which eases the onboarding of new users and contributors.

Finally, the WebAssembly and ASM.js bindings can be distributed in an NPM package. This does not break the way we want to distribute Gutenberg packages: Rust is not required to use the NPM package; only the WASM binary (which is not platform-dependent) and the ASM.js file are distributed as-is. The PHP extension has already been installed successfully in 2 environments (macOS and Debian). It follows the classical extension compilation and installation process. It is possible to automate everything with Pickle. Rust is required to install the PHP extension.

Thank you for reading 🙂.

Edit: I've found another easy, low-hanging optimisation for the WASM binding, and I've updated the WASM and ASM.js benchmark results accordingly (resp. from 73 to 86 times faster, and from 4 to 6 times faster 🎉).

Edit 2: I've found another optimisation for the WASM binding and updated the WASM benchmark results: it is now 159 times faster (instead of 86 times) :tada: :tada:!

All 7 comments

Updated issue description to include:

Verify parser doesn't cause PCRE fault with Subresource Integrity (SRI) Manager plugin and PHP 5.6.36 #8671


@Hywan would be great to get this into a state where we can easily build and test the Rust-based parser and also to where we have a PR for Gutenberg that uses it.

@mcsf, @aduth, and @mtias I have updated https://github.com/WordPress/gutenberg/pull/8083 and all tests are passing there. Since I'm not familiar with the packaging issues I'd appreciate some help from someone who is - maybe @gziolo or @youknowriad.

Performance-wise the hand-written parser seems to be performing well enough that we may be able to lift some demands that we have so far placed on the project, such as not performing the parse on the server. For example, just as @Hywan used the moby-dick-parsed.html file as a test, on my laptop this parser compares at 6.03s for the auto-generated parser and 16ms for the hand-written one. 16ms may have some room for improvement, but clearly it's a different ballgame for server-side parsing than five or six seconds is.

There's parallel work going on to make the parse asynchronous, but personally I'd love to see us get a "default" parser in sooner than that work, so that we can get past the performance issues that have been creeping up. I wanted to set up some mechanism to load custom parsers in #8083, but I realized it would be too much for a single PR and wanted to wait until we had two actual choices in the project before adding the code to pick one.

That is, I'd love to see us push out a parser that's at least ready for production performance issues, continue making the parse asynchronous in the meantime, add the code to allow overwriting the parse implementation via a plugin, and fill up the tooling to compare parsers.

Previously I pushed harder for the tooling but a couple things have changed since then:

  • the merge is coming soon
  • we have primitive "stupid" tooling available from the work I did in #6831
  • performance appears to be becoming a bigger issue, especially on larger more complicated documents

@Hywan would be great to get this into a state where we can easily build and test the Rust-based parser and also to where we have a PR for Gutenberg that uses it.

Sure. I'll wait for #7970 to be merged before starting a PR. If someone wants to try the WASM binary, artefacts are attached to the releases (https://github.com/Hywan/gutenberg-parser-rs/releases). I'm working on a Docker image, even though the justfile guides the user through installing all the required tools.

the moby-dick-parsed.html file as a test [for] this parser on my laptop […] 16ms for the hand-written one.

I think you're using Chrome to run the tests. I tried with Safari (because I don't have Chrome nor Chromium), and the moby-dick-parsed.html test runs in 19ms 😄.

I think you're using Chrome to run the tests. I tried with Safari (because I don't have Chrome nor Chromium), and the moby-dick-parsed.html test runs in 19ms 😄.

actually I reran the tests while I had the heat sink off of my CPU and it took 418ms but then when I chilled it with liquid nitrogen it only took 7ms 🙃

I'm working on a Docker image

awesome!

Side note: I've found another optimisation for the WASM binding. The average speedup is now 159 instead of 86 :tada:. And I achieved my goal with moby-dick-parsed.html which is parsed in 14ms, and pygmalian-raw-html in 1ms.

My detailed comment has been updated accordingly.

This issue was about _short-term_ enhancements, and it's been a year and a half. :)

Since then, we've made great strides both in this issue and in other spaces, even though some of the plans — e.g. a system to compare competing parsers, integrating asynchronous parsers — didn't come to fruition. And I think that's fine. It's a reflection of evolving priorities.

Ultimately, parsing has greatly improved thanks to all those involved. And for this I'll close this issue in favour of the broader #8352, which also encompasses parsing-related ideas such as the exciting #18414. Any new specific parser-related tasks can be tracked in new issues. Thank you, everyone.
