Pandoc: Support for plain text reader

Created on 23 May 2020 · 16 Comments · Source: jgm/pandoc

A readPlain that more or less mirrors the existing writePlain would be very nice for dealing with plain text.

Currently we can either wrap the text in Pandoc mempty [Plain [Str text]] or treat it as Markdown, but we would likely get better results with a proper readPlain function.

tysonzero on 23 May 2020


All 16 comments

I think the reason that no reader was created for plain text is that plain text provides no information about formatting, for example, which text comprises section headers versus section bodies. Various strategies are sometimes used for simulating formatting in plain text, but are generally not clear enough for a machine to interpret. The varieties of Markdown provide clear rules that both a human and machine can follow, for creating and interpreting formatting information in documents that look like plain-text documents, and I suspect that the Markdown readers are as close as ever would be possible to a plain text reader.

brainchild0 on 23 May 2020

I realize very little formatting will be brought in.

All I really want is something that can read plain text without ruining it. Currently, things like underscore and bracket usage can give results that don't match the writer's intention.

Honestly I don't even hate:

readPlain :: PandocMonad m => ReaderOptions -> Text -> m Pandoc
readPlain _ = pure . Pandoc mempty . pure . Plain . pure . Str

I just figured that since writePlain already exists and throws in some formatting, it might be worth mirroring some of its behavior.

tysonzero on 23 May 2020

Probably the most natural version of readPlain would be something like

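-- toList and text come from Text.Pandoc.Builder; splitWhen from Data.List.Split
readPlain _ = return . Pandoc nullMeta . map (Para . toList . text . T.unwords) . filter (not . null) . splitWhen T.null . T.lines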

You'd get paragraphs with Str and Space elements, that's it. I don't know how useful this would be, though. What would it be useful for, exactly?

jgm on 23 May 2020

One use for it is importing ASCII text files that _look_ sort of like Markdown but are not, so the resulting conversion to other formats is garbage. I've run into a use case for this before.

alerque on 23 May 2020

👍1

Does looking "sort of like Markdown" mean using some kind of delimiters in the text, like the way / is sometimes used to represent italics?

Maybe pushing some example sources to a jist would help.

brainchild0 on 23 May 2020

My use case is basically what @alerque said.

I'm parsing emails that can be either HTML or plain text, and I want to normalize them both into Markdown. If I don't do something to at least escape the plain text, then it will sometimes be garbled.
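For illustration only, here is a rough Python sketch of the "at least escape the plain text" workaround. It relies on the fact that CommonMark allows a backslash before any ASCII punctuation character; the function name `escape_markdown` is made up for this example and is not part of pandoc. It will not stop every misreading (lines indented by four spaces can still be read as code blocks, for instance), which is part of why a real reader would be nicer.

```python
import re

# Every ASCII punctuation character (0x21-0x2F, 0x3A-0x40, 0x5B-0x60, 0x7B-0x7E).
# CommonMark permits a backslash escape in front of any of these.
_ASCII_PUNCT = re.compile(r'([!-/:-@\[-`{-~])')


def escape_markdown(text: str) -> str:
    """Hypothetical helper: backslash-escape all ASCII punctuation in text."""
    return _ASCII_PUNCT.sub(r'\\\1', text)
```

The escaped text could then be handed to the Markdown reader, at the cost of a noisy intermediate file.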

tysonzero on 23 May 2020

> I suspect that the Markdown readers are as close as ever would be possible to a plain text reader.

This is quite a bit off the mark. The Markdown reader looks at all sorts of things in the file looking for syntax clues and interprets what it finds into a full fledged document structure. A plain text reader that did NOT do any of that would be a lot simpler, and certainly possible.

> Maybe pushing some example sources to a jist[sic] would help.

Stop trying to make this complicated. Plain text is plain text. You don't need an example; any valid UTF-8 should round-trip from the input side to the output side unchanged via pandoc -f plain -t plain. Right now the closest thing to that is maybe pandoc -f markdown -t plain, but tons of stuff on the input side will get obliterated because the reader thought it was Markdown.

Consider somebody who used # hash marks as a divider (instead of --- as in Markdown):

$ pandoc -f markdown -t plain <<< $'foo\n\n##\n\nbar'

A hundred other examples could be shown. The Markdown reader garbles anything that isn't actual Markdown. Plain text is sometimes just that, plain text. There should be a reader. At most it may add paragraph breaks similar to the way the current writer outputs them.

alerque on 23 May 2020

👍1

Right now, this can also be done by wrapping the text inside an HTML comment and using a filter to capture the RawBlock and split on newlines... as long as the text doesn't already contain HTML comments.

Or by sending the file name through an "infile" Meta value, as in 2311#issuecomment-335171218.

A PlainText reader could be an open door for those who would dare write custom _readers_ in Lua... if that's something one would want to do.

kysko on 24 May 2020

🎉1

> A PlainText reader could be an open door for those who would dare write custom _readers_ in Lua... if that's something one would want to do.

Can't you just use the existing filter library to initialize an empty tree and then populate it with parser results?

brainchild0 on 24 May 2020

> Can't you just use the existing filter library to initialize an empty tree and then populate it with parser results?

What will the parser parse? How would it get the ungarbled data?
And what parser? (that would be the biggest hurdle of course)
Then, yes, fill in the tree.

kysko on 24 May 2020

@kysko: The idea is to write a simple standalone script that creates an empty tree, finds the word breaks and paragraph breaks in text in standard input ("ungarbled"), populates the tree, and prints it. That is a custom reader. It's also a parser.

__Response to below__: My question was simply directed at the observation that if a library creates an empty tree, then you don't need another reader to create it.

It seems that the Python package pandocfilters does not create an empty tree. The basic form, however, is simply {"meta": {}, "blocks": [], "pandoc-api-version": [1, 2]}. This structure can be hard-coded.

This example is a custom reader (not necessarily the cleanest implementation), which is a plain text reader, and doesn't depend on any other reader.

Using @alerque's example:

$ echo -e "foo\n\n##\n\nbar" | python ./plainreader.py  | pandoc -f json -t plain
foo

##

bar

$ echo -e "foo\n\n##\n\nbar" | python ./plainreader.py  | pandoc -f json -t markdown
foo

\#\#

bar
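The script itself is not reproduced in this thread. As a rough idea of what such a plainreader.py could look like, here is a minimal sketch, not the actual script. It builds the JSON tree described above, with one Para per blank-line-separated chunk and only Str and Space inlines. The function name `plain_to_ast` and the default API version are assumptions for illustration; as far as I know pandoc checks the "pandoc-api-version" field, so the value should be set to match the installed pandoc.

```python
import re


def plain_to_ast(text, api_version=(1, 20)):
    """Hypothetical helper: turn plain text into a pandoc-style JSON document.

    api_version is a placeholder; adjust it to the installed pandoc.
    """
    blocks = []
    # A blank (or whitespace-only) line separates paragraphs.
    for chunk in re.split(r'\n\s*\n', text.strip()):
        words = chunk.split()  # collapses all whitespace inside the paragraph
        if not words:
            continue
        inlines = []
        for i, word in enumerate(words):
            if i > 0:
                inlines.append({"t": "Space"})
            inlines.append({"t": "Str", "c": word})
        blocks.append({"t": "Para", "c": inlines})
    return {
        "pandoc-api-version": list(api_version),
        "meta": {},
        "blocks": blocks,
    }
```

Reading standard input and printing `json.dumps(plain_to_ast(sys.stdin.read()))` would turn this into a script that can be piped into pandoc -f json.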

brainchild0 on 24 May 2020

> @kysko: The idea is to write a simple standalone script that creates an empty tree, finds the word breaks and paragraph breaks in text in standard input ("ungarbled"), populates the tree, and prints it. That is a custom reader. It's also a parser.

The OP suggested the creation of a simple plain text reader, possibly for multiple plain inputs. It seemed some think this might be useless, and I just offered another _possible_ usefulness. Might not be the OP's original reason, but there it is nonetheless.

kysko on 24 May 2020

This is a response to @brainchild0's multiple edits to his previous message, which I post here _normally_ as an element of the discussion, rather than an edit of my previous response.

My initial contribution to this Issue was three shortish paragraphs, the first two being hacks to simulate a plain reader, and the third being about another possible usefulness of a native plain reader.

@brainchild0, you seem to fixate on that 3rd paragraph. Your example is commendable on its own, but perhaps one would want to use pandoc's own internal Lua to process the input raw text, just like pandoc's Lua custom writer handles the output... a little semblance of symmetry.

The point of the OP, if I'm not mistaken, is to have a native and simple plain reader, a "Null" reader if you like, that doesn't change anything in the initial text. If for you the standard input is sufficient, fine... but that could have been your initial response to the OP, which it wasn't.

This issue is about a native plain text reader, preserving its UTF-8 content across the platforms of *nix, macOS and _Windows_, and whether it would be useful for whatever reason, but especially for the OP's own reasons.

Without this native plain reader, there are still the hacks mentioned throughout this discussion, nothing is lost, and I don't consider this an absolute necessity.

That being said, I apologize to the OP if I inadvertently introduced a foreign element in his Issue. I'll wait for his own input before adding anything else.

kysko on 24 May 2020

I think there are really two main issues here:

Would having a 'plain' reader that isn't even close to being the inverse of the 'plain' writer create confusion?
What structural elements are recognized? Presumably spaces are collapsed (since Str elements in pandoc don't contain spaces) -- this would all be done by Builder.text -- and presumably we want blank lines to create new Paras. Would those things be acceptable?

jgm on 24 May 2020

👍3

> * Would having a 'plain' reader that isn't even close to being the inverse of the 'plain' writer create confusion?

Yes it might. Another name like unformatted, null, or passthrough instead of plain might be in order. You know the ecosystem for that better.

> * What structural elements are recognized? Presumably spaces are collapsed (since Str elements in pandoc don't contain spaces) -- this would all be done by Builder.text -- and presumably we want blank lines to create new Paras. Would those things be acceptable?

This is trickier. I guess the options are one big blob string with no structure at all, or split on \n{2,} and have Para elements at the cost of collapsing some whitespace? I would say the latter is preferable for my use case, but that might indicate we should review possible other use cases.

alerque on 24 May 2020

I can see that a plain text reader might be useful... just a simple way to get text into pandoc without _foo_ being interpreted as italics etc. Probably we should have an extension on it toggling line-break behaviour.
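To make the line-break toggle concrete, here is a small Python sketch over pandoc's JSON inline elements. It is only a guess at how such an option might behave, loosely modeled on the Markdown reader's hard_line_breaks extension; the function name `paragraph_inlines` and the `hard_breaks` flag are hypothetical and not an existing pandoc option for a plain reader.

```python
def paragraph_inlines(chunk, hard_breaks=False):
    """Hypothetical helper: convert one plain-text paragraph to inline elements.

    Newlines inside the paragraph become LineBreak when hard_breaks is set,
    otherwise SoftBreak. Runs of whitespace within a line collapse to one Space.
    """
    break_type = "LineBreak" if hard_breaks else "SoftBreak"
    inlines = []
    # Skip whitespace-only lines so no break is emitted for them.
    lines = [line for line in chunk.splitlines() if line.strip()]
    for n, line in enumerate(lines):
        if n > 0:
            inlines.append({"t": break_type})
        for i, word in enumerate(line.split()):
            if i > 0:
                inlines.append({"t": "Space"})
            inlines.append({"t": "Str", "c": word})
    return inlines
```

With the flag off, writers are free to reflow the paragraph; with it on, the original line structure survives conversion, which may matter for things like addresses or verse in emails.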