Julia: Markdown.parse doesn't parse raw HTML properly.

Created on 5 Aug 2016 · 9 Comments · Source: JuliaLang/julia

I noticed Markdown.parse doesn't parse HTML tags properly while writing docs with Documenter.jl (xref: https://github.com/JuliaDocs/Documenter.jl/issues/176).

Most Markdown parsers pass raw HTML through unchanged, so I think Base.Markdown should as well.

For example, the two consecutive hyphens inside an HTML comment are converted to an en dash:

julia> Markdown.parse("<!-- comment -->")
  <!– comment –>

CC: @MichaelHatherly

julia> versioninfo()
Julia Version 0.5.0-rc1+0
Commit cede539* (2016-08-04 08:48 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Labels: docsystem, markdown, stdlib

bicycle1885

👍1

Most helpful comment

I fear that we're going to end up having a full HTML parser in Base :|

In CommonMark, the intention is that it should be possible to recognise HTML using simple rules, i.e. without a full parser. You don't have to do any processing on it, so it's not too tricky to match up < and > characters and avoid escaping that section.

MikeInnes on 7 Oct 2016

👍3

All 9 comments

I fear that we're going to end up having a full HTML parser in Base :|

StefanKarpinski on 8 Aug 2016

Nice! Which JavaScript library are you going to use?

eschnett on 8 Aug 2016

I fear that we're going to end up having a full HTML parser in Base :|

Yeah, that's not something I'd like to see happen. Most Markdown parsers seem to just use some regex monstrosities to catch raw HTML, which appears to work alright.
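
To make the regex idea concrete, here is a minimal sketch, assuming Julia 1.x syntax. The pattern `RAW_HTML` and the helpers `map_outside_html` and `en_dash` are hypothetical names made up for illustration; they are not part of Base.Markdown, and the pattern covers only HTML comments and simple tags, which is far less than what CommonMark specifies.

```julia
# Hypothetical pattern: an HTML comment, or a simple opening/closing/self-closing tag.
const RAW_HTML = r"<!--.*?-->|</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>"s

# Apply `f` only to the text between raw-HTML matches; copy the HTML through untouched.
function map_outside_html(f, text::AbstractString)
    out = IOBuffer()
    pos = firstindex(text)
    for m in eachmatch(RAW_HTML, text)
        # Text before this HTML span gets the usual inline processing.
        print(out, f(SubString(text, pos, prevind(text, m.offset))))
        # The HTML span itself is emitted verbatim.
        print(out, m.match)
        pos = m.offset + ncodeunits(m.match)
    end
    # Process whatever text remains after the last HTML span.
    print(out, f(SubString(text, pos)))
    return String(take!(out))
end

# Stand-in for the inline rule that turns "--" into an en dash.
en_dash(s) = replace(s, "--" => "–")

println(map_outside_html(en_dash, "a -- b <!-- comment --> c"))
```

With this approach the hyphens in running text are still converted, while the ones inside the comment are left alone. A real implementation would need to handle more cases (attributes containing `>`, CDATA, block-level HTML).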

MichaelHatherly on 8 Aug 2016

Reminds me of this: http://stackoverflow.com/a/1732454

KristofferC on 8 Aug 2016

😄2

How about using CommonMark? It already has the libcmark library that supports HTML tags.

bicycle1885 on 8 Aug 2016

Does CommonMark support some form of table syntax yet @bicycle1885? From the last time I looked through the spec I didn't come across anything.

I think it would probably be a good idea to wrap libcmark anyway (at some point) even if it's just to make it easier to check how much of CommonMark we actually adhere to, which is most likely not much at the moment.

MichaelHatherly on 9 Aug 2016

No, CommonMark seems to be very conservative about adding extensions like table syntax. I'm not sure, but I think we could do some preprocessing to convert the table syntax extension to HTML tables before passing the string to libcmark.
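
A rough sketch of what such preprocessing might look like, assuming Julia 1.x and a GitHub-style pipe table (a header row followed by a `|---|---|` separator row). All function names here are hypothetical. The sketch ignores column alignment, escaped pipes, and inline markup inside cells, and it does not call libcmark itself.

```julia
# Split one pipe-delimited row into trimmed cells.
split_row(line) = [strip(c) for c in split(strip(strip(line), '|'), '|')]

# A separator row looks like |---|:--:|---| (outer pipes optional).
is_separator(line) = occursin(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$", line)

# Escape the characters that are special in HTML text.
escape_html(s) = replace(replace(replace(s, "&" => "&amp;"), "<" => "&lt;"), ">" => "&gt;")

# Render one table row using the given cell tag ("th" or "td").
function html_row(cells, tag)
    inner = join("<$tag>" * escape_html(c) * "</$tag>" for c in cells)
    return "<tr>" * inner * "</tr>"
end

# Replace every pipe table in `text` with an HTML table; leave other lines as they are.
function tables_to_html(text::AbstractString)
    lines = split(text, '\n')
    out = String[]
    i = 1
    while i <= length(lines)
        # A table starts with a header row followed by a separator row.
        if i < length(lines) && occursin('|', lines[i]) && is_separator(lines[i + 1])
            push!(out, "<table>")
            push!(out, html_row(split_row(lines[i]), "th"))
            i += 2
            # Consume body rows until a line without a pipe.
            while i <= length(lines) && occursin('|', lines[i])
                push!(out, html_row(split_row(lines[i]), "td"))
                i += 1
            end
            push!(out, "</table>")
        else
            push!(out, lines[i])
            i += 1
        end
    end
    return join(out, '\n')
end

println(tables_to_html("| a | b |\n|---|---|\n| 1 | 2 |"))
```

The resulting string could then be handed to a CommonMark parser, which is expected to pass the raw HTML block through unchanged.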

bicycle1885 on 9 Aug 2016