Pandoc: Make sentence part of AST

Created on 22 Feb 2017 · 19 Comments · Source: jgm/pandoc

The LaTeX writer already tries to guess that, for example, it has to put ~ after "e.g.", but it fails to recognize whether an abbreviation ends a sentence or not, which it needs to know to produce correct spacing:
https://stackoverflow.com/questions/2024338/latex-sometimes-puts-too-much-or-too-little-space-after-periods

To recognize this properly, the AST should contain this information somehow. For example, multiple sentences could be represented as multiple Str elements, or the AST structure could be extended.

Textual formats already have a tradition of putting a double space between sentences; it could be adopted for them. See https://en.wikipedia.org/wiki/Sentence_spacing#Digital_age on the use of double spaces in Emacs.

EDIT: direct link to Emacs manual section and vim manual section on sentences.

AST change more-discussion-needed

All 19 comments

In fact, in the case of "Mr.", "e.g.", etc., this is not done by the LaTeX writer, but by the Markdown reader, which inserts a nonbreaking space.

I wouldn't want to require two spaces between sentences. Most people these days don't do that, and it's not part of any Markdown standard. Plus, what about sentences that end at the end of the line? Nobody types two spaces after that kind of sentence.

> I wouldn't want to require two spaces between sentences.

Two spaces are required only when end of sentence follows capital letter.

> Plus, what about sentences that end at the end of the line?

The end of a line marks the end of a sentence. If the line would end with an abbreviation, the writer has to avoid breaking the line at that point.

> Two spaces are required only when end of sentence follows capital letter.

Is this what you meant to say? In English this happens very infrequently. I assume you meant: "only when the end of the sentence follows a word beginning with a capital letter." But that wouldn't help with things like "e.g.".

No, it is not about the word beginning with a capital letter, but about the word ending with a capital letter. I was referring to the example from Stack Overflow:

I watched Superman III. Then I went home.

Here the first sentence ends with a capital letter. It should translate to the following LaTeX:

I watched Superman III\@. Then I went home.

because "III." is not an abbreviation. To distinguish an abbreviation from a normal end of sentence, the writer has to put two spaces or a newline between "III." and "Then" when writing Markdown or a similar lightweight markup.

I think it will always be possible to come up with an ambiguous sentence. We should simply allow \@ to be translated to raw TeX so it doesn't get lost on conversion.

@ilabdsf I'm afraid I've lost the thread here. I questioned whether people would want to have to put two spaces between sentences. You replied that this would only be necessary in the rare case where the sentence ends with a capital letter. But now I don't understand how pandoc is supposed to distinguish between things like "Mr. Jones" where the period does not end a sentence, and things like "...Sam. Mary" where it does. I had thought originally that your answer was that the user had to put two spaces after the period in the second case, but then your reply suggested otherwise. So now I'm just confused.

@jgm

> I had thought originally that your answer was that the user had to put two spaces after the period in the second case

That is correct. If you put two spaces, it should translate to

Sam\@. Mary

and if you don't (because it is an abbreviation or you simply don't care about typography), it should translate to

Sam. Mary
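
A minimal sketch of the rule described above, written in Python rather than pandoc's Haskell. It assumes that a period, question mark, or exclamation mark followed by two or more spaces or a newline marks a sentence end and gets \@ inserted before it, while a single space leaves the text untouched. The function name mark_sentence_ends is hypothetical, and this is not how pandoc currently behaves.

```python
import re

# Sentence-ending punctuation followed by two or more spaces or one newline,
# with more text after it (assumed convention from this thread).
_SENTENCE_END = re.compile(r"([.?!])(?: {2,}|\n)(?=\S)")


def mark_sentence_ends(text: str) -> str:
    r"""Insert \@ before punctuation that the author marked as a sentence end."""
    # Collapse the double space or newline to a single space in the output.
    return _SENTENCE_END.sub(lambda m: "\\@" + m.group(1) + " ", text)


if __name__ == "__main__":
    print(mark_sentence_ends("I watched Superman III.  Then I went home."))
    print(mark_sentence_ends("Sam. Mary"))  # single space: left unchanged
```

The \@ is only strictly needed after a capital letter, but emitting it after every marked sentence end, as the comment above proposes, keeps the rule simple.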

FWIW, double spacing for sentences is not entirely an English specialty. Swedish typography doesn't use it, but I was taught to use it in typewriting back in the day, and I wouldn't be surprised if a similar distinction between typography and typewriting existed elsewhere. Oddly, I still observe this rule when using a fixed-width font, as in a text editor.

I'm not so sure that I like that pandoc inserts ~ automatically after some abbreviations. In my case mostly because it's English-biased and gives wrong results in documents in other languages, not so much because things which are abbreviations in English aren't in other languages but the other way around: things which are abbreviations in other languages aren't in English, so they are not handled automatically. We do have a way to visibly insert a non-breaking space (a backslash followed by a space, "\ ") and those who care can use that. At the very least it should be possible to turn off the automatic insertion of non-breaking spaces.

(As an editor I have far more trouble because people omit the periods in standard abbreviations. Apparently a Swedish bad habit, since I have a few native English speakers for whom I edit what they have written in Swedish, and they commit this error far less. I've even considered writing a filter to fix this automatically, since it's usually easier to "unfix" ambiguous cases.)

@bpj see #256

An alternative to having a Sentence element in the AST would be to have two space elements: Space (regular interword space) and LongSpace (intersentence space).

The readers could generate LongSpace when there are two or more spaces.
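
To make the Space/LongSpace idea concrete, here is a rough Python sketch of what a reader might do with a single line of text. The tuple representation and the tokenize function are illustrative assumptions, not pandoc's actual AST or reader code; only the names Str, Space, and LongSpace come from the discussion.

```python
import re
from typing import List, Tuple

# (kind, text) pairs; kind is "Str", "Space", or "LongSpace" (hypothetical model).
Inline = Tuple[str, str]


def tokenize(line: str) -> List[Inline]:
    """Split a line into words and spaces, marking runs of 2+ spaces as LongSpace."""
    inlines: List[Inline] = []
    # Each chunk is either a run of non-whitespace or a run of plain spaces.
    for chunk in re.findall(r"\S+| +", line.strip(" ")):
        if chunk.startswith(" "):
            kind = "LongSpace" if len(chunk) >= 2 else "Space"
            inlines.append((kind, ""))
        else:
            inlines.append(("Str", chunk))
    return inlines


if __name__ == "__main__":
    for inline in tokenize("Sam.  Mary left."):
        print(inline)
```

A writer could then decide per format how to render LongSpace, for example as a sentence break in LaTeX or as a newline in man output, and fall back to an ordinary space elsewhere.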

Also, groff man pages (as supported by Man.hs) could benefit from this change.
As the groff manual says:

End each sentence with two spaces – or better, start each sentence on a new line. gtroff recognizes characters that usually end a sentence, and inserts sentence space accordingly.

Also, from the Plan 9 troff manual:

An input text line ending with ., ?, or !, optionally followed by any number of ", ’, ), ], *, or †, is taken to be the end of a sentence, and an additional space character is automatically provided during filling. To prevent this, add \& to the end of the input line. Multiple inter-word space characters found in the input are retained, except for trailing spaces; initial spaces also cause a break.
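
The end-of-sentence rule quoted from the Plan 9 manual can be approximated with a regular expression. This is only a sketch of the rule as worded above, not troff's implementation; the function name ends_sentence is made up, and accepting the ASCII apostrophe alongside the typographic one is an assumption.

```python
import re

# A line ends a sentence if it ends with ., ?, or !, optionally followed by
# any number of closing characters: " ' ’ ) ] * †
_TROFF_SENTENCE_END = re.compile(r"""[.?!]["'’)\]*†]*$""")


def ends_sentence(line: str) -> bool:
    """Return True if the input line would be treated as ending a sentence."""
    return _TROFF_SENTENCE_END.search(line) is not None


if __name__ == "__main__":
    print(ends_sentence('He said "stop."'))  # closing quote after the period
    print(ends_sentence("see e.g.\\&"))       # \& at line end suppresses the rule
```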

@ilabdsf - the man writer already puts newlines between sentences.
(Of course, we just use heuristics to determine what is a sentence, which isn't 100% reliable.)

> @ilabdsf - the man writer already puts newlines between sentences.
> (Of course, we just use heuristics to determine what is a sentence, which isn't 100% reliable.)

@jgm - Any chance this behavior is available as a flag/option for markdown? I have a workflow in which I write one sentence to a line in markdown, process into a docx via pandoc, and then receive feedback in tracked changes.

It all works ok except I end up having to add a line break after each period and colon with a regular expression. I'm slowly learning not to end any sentences with anything else.

Sorry if this is the wrong place to ask, but I've been looking around the docs and filters and haven't found anything yet.

Cheers,
Matt

edit - removed extra line breaks, because GFM

@matthewbegun So what you want is to keep each sentence in your Markdown source on one physical line in the text file?
I use the --wrap=preserve option for that. Using that option, the only discipline you need to keep is to make sure that each sentence really sits on one physical line in your Markdown file. I use it to curb my tendency to write very long sentences.

Doing it by regex, unless you inspect every instance manually, will never work properly even if you also look for question marks, exclamation marks and ellipses, since it is normal for some sentences to be followed by parentheses. I guess a regex like (\w[.:?!\)]+)[^\S\n], replacing each match with the capture plus a newline, might come pretty close, but it will still break on abbreviations!
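
For reference, the regex suggested above can be tried out in a few lines of Python. The pattern is the one from the comment; the function name one_sentence_per_line is just for illustration.

```python
import re

# A word character, then one or more of . : ? ! ), then one whitespace
# character that is not a newline.
SENTENCE_BREAK = re.compile(r"(\w[.:?!\)]+)[^\S\n]")


def one_sentence_per_line(text: str) -> str:
    """Replace each match with the captured text plus a newline."""
    return SENTENCE_BREAK.sub(r"\1\n", text)


if __name__ == "__main__":
    print(one_sentence_per_line("I went home. Then I slept."))
    # The abbreviation case also gets split, as warned above:
    print(one_sentence_per_line("See e.g. the manual."))
```

The second call shows the failure mode: the text is split after "e.g." even though the sentence continues.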

> @matthewbegun So what you want is to keep each sentence in your Markdown source on one physical line in the text file?

Er no - that is what I start with. My "source of truth" is a collection of Markdown files, all with one sentence per line, on GitHub. Then I use pandoc to turn all those files into a manuscript in docx for several of my colleagues to review and edit (with track changes and comments) in Word.

My issue is that when I import the docx file (with the tracked changes and comments) the sentences are no longer on individual lines. Which makes sense because in the docx file they aren't. I then have to process the markdown produced by pandoc to get back to something useful for me - i.e. one sentence per line, and split into multiple source files.

> Doing it by regex, unless you inspect every instance manually, will never work properly even if you also look for question marks, exclamation marks and ellipses, since it is normal for some sentences to be followed by parentheses. I guess a regex like (\w[.:?!\)]+)[^\S\n], replacing each match with the capture plus a newline might come pretty close, but it still will break on abbreviations!

Yes, this is my point exactly! I am using a similar regex, and learning to write in a style which conforms to it. It is workable but irritating ;). But I see in the source that the sentence heuristic for the man-page writer is a little more sophisticated, I think. I don't grok Haskell at all though.

I have been looking around at sentence splitter and pragmatic segmenter as potential drop-in filters, but if there is already something built in, I'd prefer to use that. Laziest possible solution and all that.

@matthewbegun You could insert U+200B ZERO WIDTH SPACE at the end of each line before converting to DOCX and replace it with newline after conversion from DOCX. There is of course a risk that your coworkers remove some of the zero width spaces, but I guess you need to do that kind of cleaning up already.
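
A rough sketch of that round trip in Python. It handles only the text processing on either side of the conversion; whether the marker survives pandoc and Word is untested here, and the function names are made up. It also assumes the Markdown coming back from the docx is not re-wrapped (for example by converting with --wrap=none).

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE, used as an end-of-line marker


def mark_line_ends(markdown: str) -> str:
    """Append the marker to every non-blank line before converting to docx."""
    return "\n".join(
        line + ZWSP if line.strip() else line for line in markdown.split("\n")
    )


def restore_line_ends(markdown: str) -> str:
    """Turn each marker (plus one following space or newline) back into a newline."""
    return re.sub(ZWSP + r"[ \n]?", "\n", markdown)


if __name__ == "__main__":
    marked = mark_line_ends("One.\nTwo.\n\nThree.")
    # Simulated round trip: sentences in a paragraph are joined with spaces.
    round_tripped = marked.replace(ZWSP + "\nTwo", ZWSP + " Two")
    print(repr(restore_line_ends(round_tripped)))
```

Any marker a reviewer deletes simply leaves two sentences on one line, so the result degrades gracefully rather than corrupting the text.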

@bpj that is a great idea, I will test that out and see how it goes.

I think I can set it up as a git hook so I don't have to do it manually. Either that or add it to the GitHub action I use to generate the docx.

Hopefully Word doesn't automatically remove them at some stage in the process.

Thanks!
