Bug demo:
$ printf 'foo\rbar' | pandoc -f html -t native
[Plain [Str "foobar"]]
According to the HTML spec:
Newlines in HTML may be represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.
Browsers display a single CR as whitespace, while pandoc silently drops it and joins the words.
Simply replacing (crFilter inp) with inp in Text.Pandoc.Readers.HTML.readHTML doesn't help.
We support CRLF and LF line breaks. This covers all the commonly used
platforms. What system are you using where CR is used for line breaks?
I am not creating these documents. I have existing HTML documents that I need to convert to LaTeX. It is likely that some old Windows or even DOS system was used to create them, but that doesn't matter IMO. Such HTML documents already exist; I don't generate them myself. pandoc joins words together at line ends while reading them.
I don't see where '\r' is not processed. Looks like it is processed everywhere '\n' is processed. I expected it to produce Space instead of SoftBreak, but it is just consumed without a trace.
The problem is also in Text/Pandoc/UTF8.hs. Going to submit a PR.
Indeed, we intentionally strip out CR characters here.
This is an optimization that has been in pandoc since
the beginning, pretty much; it allows parsers to assume that
we don't have CRs, and that line endings are NLs.
Changing the behavior of this function would likely
cause bad results for documents containing CRs in various
formats, since the parsers won't be expecting them.
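The stripping described here can be sketched as follows. This is a minimal illustration, assuming the filter simply drops every CR; stripCR is a hypothetical stand-in name, not code copied from pandoc. It shows why CRLF input comes out fine while a lone CR glues words together.

```haskell
-- Hypothetical stand-in for a CR-stripping filter (assumption: it
-- simply drops every carriage return; not copied from pandoc).
stripCR :: String -> String
stripCR = filter (/= '\r')

main :: IO ()
main = do
  -- CRLF collapses to LF, which is what the parsers expect.
  print (stripCR "foo\r\nbar")  -- prints "foo\nbar"
  -- A lone CR vanishes, so the two words are joined.
  print (stripCR "foo\rbar")    -- prints "foobar"
```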
Admittedly this approach is a bit ugly. But the reasoning was
that modern systems don't use CR line endings. For the very
rare cases where you have to handle something that does (e.g.
your case), it's easy enough to pipe the input through
tr '\r' ' ', isn't it?
Agreed that pandoc should parse CR in HTML as a SoftBreak, rather than passing it through literally. That quote from the HTML spec suggests that CR should be treated like LF and CRLF. I don't understand the testcase added in #4548, though, which looks like it is passing through the CR.
Other input formats are possibly specified differently, so I'm -1 on making a change to common code, rather than just the HTML reader.
@quasicomputational The text function turns \r (and \n) into softbreak, so the testcase is ok. It does not pass \r through. It is just a different way of writing str "foo" <> softbreak <> str "bar".
Addition: I also thought that it might not be obvious to someone reading the test, but then decided it also serves as a regression test for text. Nobody reads the tests until they break anyway, and there are two cases in which it will break: either the HTML reader is broken or text is broken. Both are unlikely to break at the same time, as these functions come from different packages (pandoc vs pandoc-types).
@jgm
This is an optimization that has been in pandoc since the beginning, pretty much; it allows parsers to assume that we don't have CRs, and that line endings are NLs.
See https://github.com/jgm/pandoc/pull/4548#issuecomment-380589168 . Parsers don't assume it, they call crFilter on their input.
But the reasoning was that modern systems don't use CR line endings. For the very rare cases where you have to handle something that does (e.g. your case), it's easy enough to pipe the input through
tr '\r' ' ', isn't it?
In my case I found out that some words were glued together irreversibly only after I had converted to LaTeX, done some manual work to fix the formatting, and started proofreading the result. I had to run tr and redo the same work a second time. It is a bug that can be easily fixed, so why not fix it? In the unlikely scenario that some reader breaks because of the lack of filtering, it is only a matter of adding crFilter to it.
Maybe a better fix would be to replace the use of crFilter in the HTML reader with something that replaces \r with \n.
Maybe we need to make sure \r\n is still converted to \n, not \n\n (or otherwise change pCodeBlock). Not sure whether that code already exists somewhere? The closest I could find was the now unexported lines' from Data.Text.
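One way to do the replacement discussed above, as a rough sketch: walk the input once, turning CRLF pairs and lone CRs into a single LF. The name normalizeNewlines is hypothetical, and the sketch works on String for simplicity; it is not code from pandoc.

```haskell
-- Hypothetical helper (not part of pandoc): map every HTML newline
-- form (CRLF, lone CR, LF) to a single LF.
normalizeNewlines :: String -> String
normalizeNewlines ('\r':'\n':rest) = '\n' : normalizeNewlines rest  -- CRLF -> one LF, not two
normalizeNewlines ('\r':rest)      = '\n' : normalizeNewlines rest  -- lone CR -> LF
normalizeNewlines (c:rest)         = c    : normalizeNewlines rest  -- everything else unchanged
normalizeNewlines []               = []

main :: IO ()
main = do
  print (normalizeNewlines "foo\rbar")    -- prints "foo\nbar"
  print (normalizeNewlines "foo\r\nbar")  -- prints "foo\nbar"
  print (normalizeNewlines "a\r\rb\n")    -- prints "a\n\nb\n"
```

Matching the CRLF pair before the lone CR is what keeps \r\n from turning into \n\n, which would otherwise add blank lines inside code blocks.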
I had a quick look... but weirdly, the \r doesn't even seem to reach the input of the HTML reader. At least not on macOS. I put a trace $ show inp after readHtml opt inp = ..., then:
printf 'foo\rbar' | stack exec pandoc -- -f html -t native --trace
[trace] "foobar\n"
(same when reading from a file).