Pandoc: Carriage return in HTML should create a space

Created on 11 Apr 2018 · 9 Comments · Source: jgm/pandoc

Bug demo:

$ printf 'foo\rbar' | pandoc -f html -t native
[Plain [Str "foobar"]]

According to the HTML spec:

Newlines in HTML may be represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.

Browsers display a single CR as whitespace, while pandoc silently drops it and joins the surrounding words.

Simply replacing (crFilter inp) with inp in Text.Pandoc.Readers.HTML.readHtml doesn't help.

Labels: bug, HTML reader

link2xt

All 9 comments

We support CRLF and LF line breaks. This covers all the commonly used
platforms. What system are you using where CR is used for line breaks?

jgm on 11 Apr 2018

I am not creating these documents. I have existing HTML documents that I need to convert to LaTeX. It is likely that some old Windows or even DOS system was used to create them, but that doesn't matter IMO. Such HTML documents already exist; I don't generate them myself. When reading them, pandoc joins the words on either side of each line ending.

I don't see where '\r' is not processed. Looks like it is processed everywhere '\n' is processed. I expected it to produce Space instead of SoftBreak, but it is just consumed without a trace.

link2xt on 11 Apr 2018

The problem is in Text/Pandoc/UTF8.hs also. Going to submit PR.

link2xt on 11 Apr 2018

Indeed, we intentionally strip out CR characters here.

This is an optimization that has been in pandoc since
the beginning, pretty much; it allows parsers to assume that
we don't have CRs, and that line endings are NLs.

Changing the behavior of this function would likely
cause bad results for documents containing CRs in various
formats, since the parsers won't be expecting them.
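
For illustration only (this is not pandoc's source): the effect of stripping CRs before parsing can be modelled with a one-line filter. The name stripCR is hypothetical and stands in for the CR-stripping step described above.

```haskell
-- Hypothetical stand-in for the CR-stripping step discussed in this thread;
-- an assumption for illustration, not a copy of pandoc's code.
stripCR :: String -> String
stripCR = filter (/= '\r')  -- drop every carriage return, keep everything else
```

Under this model, stripCR "foo\r\nbar" gives "foo\nbar", so CRLF input is handled as intended, but stripCR "foo\rbar" gives "foobar", which is the word-joining shown in the bug demo.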

jgm on 11 Apr 2018

Admittedly this approach is a bit ugly. But the reasoning was
that modern systems don't use CR line endings. For the very
rare cases where you have to handle something that does (e.g.
your case), it's easy enough to pipe the input through
tr '\r' ' ', isn't it?

jgm on 11 Apr 2018

Agreed that pandoc should parse CR in HTML as a SoftBreak, rather than passing it through literally. That quote from the HTML spec suggests that CR should be treated like LF and CRLF. I don't understand the testcase added in #4548, though, which looks like it is passing through the CR.

Other input formats are possibly specified differently, so I'm -1 on making a change to common code, rather than just the HTML reader.

quasicomputational on 11 Apr 2018

@quasicomputational The text function turns \r (and \n) into a softbreak, so the testcase is OK. It does not pass \r through. It is just a different way of writing str "foo" <> softbreak <> str "bar".

Addition: I also thought that this might not be obvious to someone reading the test, but then decided it serves as a regression test for text. Nobody reads the tests until they break anyway, and there are two cases in which it will break: either the HTML reader is broken or text is broken. Both will not happen at the same time, as these functions are from different packages (pandoc vs pandoc-types).
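
A toy model of the behaviour described above may help readers of the test. It is a sketch under assumptions: the Inline type and toInlines below are hypothetical stand-ins written for this example, not the real text function from pandoc-types, and they only mimic the rule that a whitespace run containing \r or \n becomes a soft break.

```haskell
import Data.List (groupBy)

-- Hypothetical, simplified stand-in for pandoc's inline elements.
data Inline = Str String | Space | SoftBreak
  deriving (Eq, Show)

-- Split the input into runs of whitespace and runs of non-whitespace.
-- A whitespace run containing CR or LF becomes SoftBreak, any other
-- whitespace run becomes Space, and everything else becomes Str.
toInlines :: String -> [Inline]
toInlines = map conv . groupBy sameCategory
  where
    sameCategory x y = isWs x == isWs y
    conv xs
      | all isWs xs = if any isNl xs then SoftBreak else Space
      | otherwise   = Str xs
    isWs c = c `elem` " \t\r\n"
    isNl c = c == '\r' || c == '\n'
```

In this model, toInlines "foo\rbar" and toInlines "foo\nbar" both yield [Str "foo",SoftBreak,Str "bar"], which is why a test written with a literal \r does not imply that the CR is passed through.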

link2xt on 11 Apr 2018

@jgm

"This is an optimization that has been in pandoc since the beginning, pretty much; it allows parsers to assume that we don't have CRs, and that line endings are NLs."

See https://github.com/jgm/pandoc/pull/4548#issuecomment-380589168 . Parsers don't assume it, they call crFilter on their input.

"But the reasoning was that modern systems don't use CR line endings. For the very rare cases where you have to handle something that does (e.g. your case), it's easy enough to pipe the input through tr '\r' ' ', isn't it?"

In my case I found out that some words were glued together irreversibly only after I had converted to LaTeX, done some manual work to fix the formatting, and started proofreading the result. I had to run tr and then redo the same work a second time. It is a bug that can be easily fixed, so why not fix it? In the unlikely scenario that some reader breaks because of the lack of filtering, it is only a matter of adding crFilter to it.

link2xt on 11 Apr 2018

Maybe a better fix would be to replace the use of crFilter in the HTML reader with something that replaces \r with \n.

Maybe we need to make sure \r\n is still converted to \n, not \n\n (or otherwise change pCodeBlock). Not sure whether that code already exists somewhere? The closest I could find was the now unexported lines' from Data.Text.
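
A minimal sketch of such a replacement, assuming plain String input: the function name normalizeNewlines is hypothetical and not an existing pandoc or Data.Text function. It treats CRLF as one line ending so that \r\n never turns into \n\n.

```haskell
-- Hypothetical helper (not part of pandoc): map every newline form to LF.
normalizeNewlines :: String -> String
normalizeNewlines ('\r' : '\n' : rest) = '\n' : normalizeNewlines rest  -- CRLF -> one LF
normalizeNewlines ('\r' : rest)        = '\n' : normalizeNewlines rest  -- lone CR -> LF
normalizeNewlines (c : rest)           = c : normalizeNewlines rest     -- anything else unchanged
normalizeNewlines []                   = []
```

Matching the two-character CRLF case before the lone-CR case is the design point here: with the clauses in the other order, "a\r\nb" would become "a\n\nb" and could introduce spurious blank lines in code blocks.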

I had a quick look... but weirdly, the \r doesn't even seem to reach the input of the HTML reader. At least not on macOS. I put a trace $ show inp after readHtml opt inp = ..., then:

printf 'foo\rbar' | stack exec pandoc -- -f html -t native --trace

[trace] "foobar\n"

(same when reading from a file).

mb21 on 24 Aug 2019
