Julia: Replace Markdown stdlib with CommonMark.jl

Created on 2 Sep 2020  Â·  5Comments  Â·  Source: JuliaLang/julia

The Markdown stdlib has a lot of bugs (see e.g. https://github.com/JuliaLang/julia/issues?q=is%3Aopen+is%3Aissue+label%3Amarkdown where I labelled the ones I found from searching "markdown").
I wonder if we should just replace it with CommonMark.jl which seems to follow the spec better(?).

cc @MichaelHatherly @mortenpi

markdown stdlib

Most helpful comment

which seems to follow the spec better(?).

It currently passes all of the cm spec (https://github.com/MichaelHatherly/CommonMark.jl/blob/8706bda516a053fc55643478b1208680847c6afb/test/runtests.jl#L4-L15) so it's a fair bit more compliant, but as with all specs there's probably some untested corner cases that haven't been exercised properly.

A proper public api for ast manipulations is needed prior to making this a stdlib as well, currently requires digging into internal fields to do anything useful, possibly making use of AbstractTrees.jl if it provides enough of an interface (haven't looked into that).

With regards to actually replacing it, Morten is definitely right:

  • The ASTs are very different, possible to make a translation layer though. There's a lot of heavily used code that relies on Markdown's ast internals.
  • There's probably some differences in parsing, I've not investigated those differences though. (I'd hope that an differences found are that Markdown parsers contrary to the spec and CommonMark parses correctly.)

I am wondering if we'd need to create a system to support the two parsers and ASTs simultaneously?

Definitely doable.

With all that said, here's a possible multi-stage replacement plan:

  • [ ] start trying out CommonMark more within the ecosystem, this is being done in places:
  • [ ] get a few more eyes and contributors on the code, I'm not planning on disappearing, but a bus factor of 1 ain't good. It is, of course, not an easy thing to solve; I'm happy to help onboard anyone interested.
  • [ ] remove the current deps within the package, URIParser, JSON, and Crayons, I assume stdlib packages shouldn't be depending on external deps themselves if at all possible?
  • [ ] settle on a public api for AST manipulations.
  • [ ] make sure CommonMark works consistently on Julia 1.0 through to nightly. No reason for it not too, I just don't test that many versions currently.
  • [ ] add CommonMark as a stdlib alongside Markdown. (Luckily they have different names!)
  • [ ] add conversion functions between the two representations.
  • [ ] transition critical packages over to using CommonMark instead.
  • [ ] deprecate Markdown for the remainder of 1.x series of releases. CommonMark (the package) can be used by packages in Julia 1.x and lower.
  • [ ] remove it for 2.0 (the distant future :rofl:).

All 5 comments

I think replacing the parser would, in principle, be great (both to just get a better parser, but also to follow the CommonMark spec). However, just swapping out the parser would be breaking:

  1. CommonMark.jl would bring it's own AST, so everything that relies on that will break.
  2. As there are differences in parsing rules, existing Markdown wouldn't necessarily parse the same way. So you'd have docstring that work only on new Julia versions or just on old Julia versions.

I am wondering if we'd need to create a system to support the two parsers and ASTs simultaneously? Maybe a macro + Compat.jl solution to tag a module to indicate that it should use the new parsers? And when fetching docstrings, it would fetch the existing AST by default, but could be converted to the new one?

which seems to follow the spec better(?).

It currently passes all of the cm spec (https://github.com/MichaelHatherly/CommonMark.jl/blob/8706bda516a053fc55643478b1208680847c6afb/test/runtests.jl#L4-L15) so it's a fair bit more compliant, but as with all specs there's probably some untested corner cases that haven't been exercised properly.

A proper public api for ast manipulations is needed prior to making this a stdlib as well, currently requires digging into internal fields to do anything useful, possibly making use of AbstractTrees.jl if it provides enough of an interface (haven't looked into that).

With regards to actually replacing it, Morten is definitely right:

  • The ASTs are very different, possible to make a translation layer though. There's a lot of heavily used code that relies on Markdown's ast internals.
  • There's probably some differences in parsing, I've not investigated those differences though. (I'd hope that an differences found are that Markdown parsers contrary to the spec and CommonMark parses correctly.)

I am wondering if we'd need to create a system to support the two parsers and ASTs simultaneously?

Definitely doable.

With all that said, here's a possible multi-stage replacement plan:

  • [ ] start trying out CommonMark more within the ecosystem, this is being done in places:
  • [ ] get a few more eyes and contributors on the code, I'm not planning on disappearing, but a bus factor of 1 ain't good. It is, of course, not an easy thing to solve; I'm happy to help onboard anyone interested.
  • [ ] remove the current deps within the package, URIParser, JSON, and Crayons, I assume stdlib packages shouldn't be depending on external deps themselves if at all possible?
  • [ ] settle on a public api for AST manipulations.
  • [ ] make sure CommonMark works consistently on Julia 1.0 through to nightly. No reason for it not too, I just don't test that many versions currently.
  • [ ] add CommonMark as a stdlib alongside Markdown. (Luckily they have different names!)
  • [ ] add conversion functions between the two representations.
  • [ ] transition critical packages over to using CommonMark instead.
  • [ ] deprecate Markdown for the remainder of 1.x series of releases. CommonMark (the package) can be used by packages in Julia 1.x and lower.
  • [ ] remove it for 2.0 (the distant future :rofl:).

@domluna is using it in https://github.com/domluna/JuliaFormatter.jl, I believe it's disabled by default there? So long as that package is being used by a good cross-section of users we may pick up some good edge-cases that parse wrong.

It's used for formatting docstrings which is not on by default. It's also used to format markdown files and julia code inside those files which is also off by default. Having it on by default would probably cause a bit of confusion.

Once you all come up with a plan, I can help with stdlib surgery which is super annoying but also very uninteresting.

I'm not planning on disappearing

Fool me once... ;)

Was this page helpful?
0 / 5 - 0 ratings