Julia: Markdown.parse doesn't parse raw HTML properly.

Created on 5 Aug 2016 · 9 Comments · Source: JuliaLang/julia

I noticed Markdown.parse doesn't parse HTML tags properly while writing docs with Documenter.jl (xref: https://github.com/JuliaDocs/Documenter.jl/issues/176).

Most Markdown parsers support raw HTML, so I think Base.Markdown should as well.

For example, the two consecutive hyphens inside an HTML comment are converted to an en dash:

julia> Markdown.parse("<!-- comment -->")
  <!– comment –>

CC: @MichaelHatherly


julia> versioninfo()
Julia Version 0.5.0-rc1+0
Commit cede539* (2016-08-04 08:48 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Labels: docsystem, markdown, stdlib

All 9 comments

I fear that we're going to end up having a full HTML parser in Base :|

Nice! Which JavaScript library are you going to use?

> I fear that we're going to end up having a full HTML parser in Base :|

Yeah, that's not something I'd like to end up happening. Most markdown parsers seem to just use some regex monstrosities to catch raw HTML, which appears to work alright.
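To make the regex approach concrete, here is a minimal sketch in Julia. The pattern `RAW_HTML` and the helper `find_raw_html` are hypothetical names made up for illustration; they are not what Base.Markdown or any other parser actually uses, and the pattern covers only comments and simple tags.

```julia
# Hypothetical pattern: an HTML comment, or an opening/closing/self-closing tag.
# The `s` flag lets `.` match newlines so multi-line comments are caught.
const RAW_HTML = r"<!--.*?-->|</?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?>"s

# Return every substring of `text` that looks like raw HTML.
function find_raw_html(text::AbstractString)
    return [m.match for m in eachmatch(RAW_HTML, text)]
end

# A bare `<` that is not followed by a tag name (as in `1 < 2`) is left alone.
println(find_raw_html("a <!-- comment --> b <br/> 1 < 2"))
```

A parser could leave the matched spans untouched and apply its usual inline rules only to the text between them.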

How about using CommonMark? It already has the libcmark library that supports HTML tags.

Does CommonMark support some form of table syntax yet @bicycle1885? From the last time I looked through the spec I didn't come across anything.

I think it would probably be a good idea to wrap libcmark anyway (at some point) even if it's just to make it easier to check how much of CommonMark we actually adhere to, which is most likely not much at the moment.

No, CommonMark seems to be very conservative about adding extensions like table syntax. I'm not sure, but I think we could do some preprocessing to convert the table syntax extension to HTML tables before passing the string to libcmark.

> I fear that we're going to end up having a full HTML parser in Base :|

In CommonMark, the intention is that it should be possible to recognise HTML using simple rules, i.e. without a full parser. You don't have to do any processing on it, so it's not too tricky to match up < and > characters and avoid escaping that section.
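As a rough illustration of that idea, the sketch below pairs each `<` with the next `>` and skips text processing for those spans. The function names `split_raw_html` and `convert_dashes_outside_html` are hypothetical, and the matching is deliberately naive: the real CommonMark rules are stricter, and this version would mistake `1 < 2 and 3 > 2` for HTML or cut a comment short at an inner `>`.

```julia
# Split `text` into (is_html, chunk) pairs by pairing each `<` with the next `>`.
function split_raw_html(text::AbstractString)
    chunks = Tuple{Bool,String}[]
    pos = firstindex(text)
    while pos <= lastindex(text)
        lt = findnext(isequal('<'), text, pos)
        gt = lt === nothing ? nothing : findnext(isequal('>'), text, lt)
        if lt === nothing || gt === nothing
            # No complete `<...>` pair left: the rest is ordinary text.
            push!(chunks, (false, text[pos:end]))
            break
        end
        lt > pos && push!(chunks, (false, text[pos:prevind(text, lt)]))
        push!(chunks, (true, text[lt:gt]))
        pos = nextind(text, gt)
    end
    return chunks
end

# Apply a text-only rule (here the `--` to en dash substitution from the
# report) to every chunk except the raw HTML ones.
function convert_dashes_outside_html(text::AbstractString)
    return join(is_html ? chunk : replace(chunk, "--" => "–")
                for (is_html, chunk) in split_raw_html(text))
end

println(convert_dashes_outside_html("<!-- comment --> a -- b"))
```

With this split, the comment from the original report passes through unchanged while the hyphens in the surrounding prose are still converted.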

I currently use this NodeJS hack as a workaround for generating AWS documentation: https://github.com/samoconnor/AWSCore.jl/blob/master/src/HTML2MD.jl

It would be nice if HTML in markdown just worked.
